# | include: false
import numpy as np
import pandas as pd
256852) np.random.seed(
Pandas (Need)
What to expect in this chapter
This section will introduce you to dataframes. Data frames are one of the most powerful ways to store and manipulate data. They ‘look’ similar to spreadsheets, but don’t be fooled; dataframes are far more versatile and powerful. The Python package dedicated to dataframes is Pandas, which is built on top of NumPy. If you plan to work with data Pandas is one of the must-have tools in your toolkit.
Import Pandas and NumPy using:
python import pandas as pd import numpy as np
You will also need the
xlrd
package to read and write.xls
files. Please install it usingbash conda install -c conda-forge -y xlrd
Download the following data sets and move them into your active folder.
Format | ||
---|---|---|
1. | Test data for a dummy class | dummy-class_1-of-2.xlsx dummy-class-2-of-2.xlsx |
2. | Data for another dummy class | another-dummy-class.xlsx |
1 What is a dataframe
An easy way to get started with dataframes is to read in some data and see what it looks like. I have prepared a dummy set of classrom data for this. You should have already downloaded it into the relevant folder in your Learning Portfolio.
1.1 Reading data from csv
or excel
files.
One of the most common file formats used to hold data that you will encounter are Excel files (.xls
or .xlsx
), csv
(comma-separated-values) files or text files (.txt
or .dat
). Pandas offer ways to read all these files and (many) more. Let’s read in the data in dummy-class.xlsx
.
= pd.read_excel("dummy-class-1-of-2.xlsx", skiprows = 1) df_class
I have used the option skiprows=1
because the first row of the Excel file contains a description that is not data. skiprows
is one of many options we can specify.
I have used the variable df_class
to hold the dataframe. With Jupyter you can see the dataframe in a nicely formatted output by simply running:
df_class
Unnamed: 0 | Name | Student No | Major | Gender | Test 2 (20%) | Test 1 (30%) | |
---|---|---|---|---|---|---|---|
0 | 0 | Braiden Henson | A3028967J | PHY | M | ‘18.96’ | 20.205 |
1 | 1 | Gustavo Vang | A1282849W | CHM | F | ‘17.44’ | 13.470 |
2 | 2 | Ronin Christian | A5408925A | PHY | M | ‘15.56’ | 18.366 |
3 | 3 | Owen Anderson | A6973859L | LS | F | ‘16.36’ | 18.366 |
4 | 4 | Kyla Young | A5410124H | PHY | M | NaN | 15.306 |
5 | 5 | Wyatt Oliver | A9568373Q | PHY | M | ‘14.088’ | 12.246 |
6 | 6 | Essence Bauer | A6824244G | LS | M | ‘16.72’ | 16.530 |
7 | 7 | Maryjane Sandoval | A9194090U | LS | F | ‘16.4’ | 18.981 |
8 | 8 | Carl Trujillo | A4828364M | LS | NB | ‘13.68’ | 15.306 |
9 | 9 | Halle Fritz | A4607700C | LS | F | ‘9.04’ | 17.754 |
10 | 10 | Marie Hoffman | A7067766E | LS | F | ‘16.88’ | 19.593 |
11 | 11 | Lilianna Kaufman | A5569996J | LS | M | ‘17.0’ | 26.328 |
12 | 12 | Jaxon Chung | A3202548I | PHY | M | ‘16.68’ | 14.082 |
13 | 13 | Zoey Oconnell | A6131593U | LS | F | ‘14.128’ | 22.041 |
14 | 14 | Quentin Kemp | A7653832E | CHM | F | ‘15.72’ | NaN |
15 | 15 | Leo Mayo | A9462811I | PHY | F | ‘14.68’ | 20.817 |
16 | 16 | Camden Williams | A1218599T | CHM | F | ‘17.648’ | 22.653 |
17 | 17 | Sidney Wiggins | A7210476B | PHY | NB | ‘14.68’ | 19.593 |
18 | 18 | Solomon Fletcher | A1512479K | CHM | F | ‘15.0’ | 18.981 |
19 | 19 | Riley Christensen | A7986368Y | CHM | F | ‘16.016’ | 15.306 |
20 | 20 | Malik Becker | A2727061A | PHY | M | NaN | 12.858 |
21 | 21 | Skylar Hensley | A2999472W | CHM | M | ‘17.648’ | 23.877 |
22 | 22 | Braydon Duran | A7116486E | LS | F | ‘14.36’ | 15.918 |
23 | 23 | Jalen Harmon | A6931452S | LS | F | ‘16.88’ | 22.041 |
24 | 24 | Ean Haas | A9649096H | LS | F | ‘14.36’ | 19.593 |
25 | 25 | Carolina Mcmahon | A1643380L | PHY | M | ‘18.08’ | 12.858 |
26 | 26 | Elian Potter | A6787293E | PHY | M | ‘14.36’ | 23.265 |
27 | 27 | Litzy White | A5975988J | PHY | NB | ‘15.92’ | 20.817 |
28 | 28 | Norah Miles | A3699958T | CHM | M | NaN | NaN |
29 | 29 | Mariela Sheppard | A1956366U | LS | NB | ‘11.0’ | 18.366 |
30 | 30 | Isabela Stokes | A1468689D | CHM | M | ‘16.8’ | 18.366 |
31 | 31 | Kathleen Rodriguez | A3217320C | PHY | M | ‘17.76’ | 22.653 |
32 | 32 | Katie Ayers | A6867791C | CBIO | F | ‘10.28’ | 16.530 |
33 | 33 | Tucker Sloan | A4080490P | LS | M | ‘15.04’ | 17.754 |
34 | 34 | Carter Crane | A7667457P | LS | M | ‘13.816’ | 20.817 |
1.2 Characteristics to notice
Okay, now you are bound to think that the above dataframe is no different from a typical spreadsheet. First, however, let me draw your attention to three different characteristics.
- The first is the use of an
index
(leftmost column in bold) for each row (0 1 2 3...
). This simple fact simplifies doing things with your data. I will show you some examples later. - The second is that a column (called a Pandas Series) can only have a single type of data. So, for example, you can have numbers or text; but not both.
- The third is the presence of
NaN
(‘Not-a-Number’) when data is missing. Missing data is a common bane of real data; Pandas offers powerful ways to deal with such missing numbers transparently. I will show more later in this chapter.
For the moment,
Remember that a dataframe is like a spreadsheet but has an additional unique index for each row. This index plays an important role.
2 Quick Peeks
2.1 Head and Tail
Real data sets are bound to have more rows than the dummy class toy data set. You can take a quick peek at the top or bottom of the dataframe by using the head()
and tail()
methods, respectively.
3) df_class.head(
Unnamed: 0 | Name | Student No | Major | Gender | Test 2 (20%) | Test 1 (30%) | |
---|---|---|---|---|---|---|---|
0 | 0 | Braiden Henson | A3028967J | PHY | M | ‘18.96’ | 20.205 |
1 | 1 | Gustavo Vang | A1282849W | CHM | F | ‘17.44’ | 13.470 |
2 | 2 | Ronin Christian | A5408925A | PHY | M | ‘15.56’ | 18.366 |
3) df_class.tail(
Unnamed: 0 | Name | Student No | Major | Gender | Test 2 (20%) | Test 1 (30%) | |
---|---|---|---|---|---|---|---|
32 | 32 | Katie Ayers | A6867791C | CBIO | F | ‘10.28’ | 16.530 |
33 | 33 | Tucker Sloan | A4080490P | LS | M | ‘15.04’ | 17.754 |
34 | 34 | Carter Crane | A7667457P | LS | M | ‘13.816’ | 20.817 |
If you do not specify the number of rows, it defaults to five.
2.2 Size and Shape
We can get the size and shape of the dataframe by using shape
, len()
and columns
.
df_class.shape
(35, 7)
len(df_class)
35
df_class.columns
Index(['Unnamed: 0', 'Name', 'Student No', 'Major', 'Gender', 'Test 2 (20%)',
'Test 1 (30%)'],
dtype='object')
2.3 Summaries (I)
Pandas offer more powerful ways to get more valuable summaries of your dataset with the methods info()
and describe()
.
df_class.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 35 non-null int64
1 Name 35 non-null object
2 Student No 35 non-null object
3 Major 35 non-null object
4 Gender 35 non-null object
5 Test 2 (20%) 32 non-null object
6 Test 1 (30%) 33 non-null float64
dtypes: float64(1), int64(1), object(5)
memory usage: 2.0+ KB
This information is very useful. You can quickly see the various columns, what type of data they contain and how many data points are present. This is a good place to spot missing data like in ‘Test 1’ and ‘Test 2’. The type object
means str
or English-like data. float64
is floating point data (i.e. decimal numbers). We also note that ‘Test 2’ is not a numerical type despite their data having numerical significance. So, one of the first things we should do is convert these to numerical types. I will focus on type conversion a bit later. However, let me ‘parachute’ some code here to continue our discussion.
A quick digression
Pandas do not see ‘Test 2’ data as numbers because its data has additional' '
.
'Test 2 (20%)'].head(3) df_class[
0 '18.96'
1 '17.44'
2 '15.56'
Name: Test 2 (20%), dtype: object
Here, I first replace the '
with nothing and then convert the data using pd.to_numeric()
'Test 2 (20%)'] = df_class['Test 2 (20%)'].str.replace("'", "")
df_class['Test 2 (20%)'] = pd.to_numeric(df_class['Test 2 (20%)']) df_class[
df_class.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 35 non-null int64
1 Name 35 non-null object
2 Student No 35 non-null object
3 Major 35 non-null object
4 Gender 35 non-null object
5 Test 2 (20%) 32 non-null float64
6 Test 1 (30%) 33 non-null float64
dtypes: float64(2), int64(1), object(4)
memory usage: 2.0+ KB
Back to the original discussion
Okay, let’s continue discussing ways to peek into the file.
df_class.describe()
Unnamed: 0 Test 2 (20%) Test 1 (30%)
count 35.000000 32.000000 33.000000
mean 17.000000 15.405750 18.534455
std 10.246951 2.229394 3.487934
min 0.000000 9.040000 12.246000
25% 8.500000 14.360000 15.918000
50% 17.000000 15.820000 18.366000
75% 25.500000 16.880000 20.817000
max 34.000000 18.960000 26.328000
describe()
gives us some summary statistics. By default, it only shows information relevant to the numerical columns. But let’s ask for more!
="all") df_class.describe(include
Unnamed: 0 Name ... Test 2 (20%) Test 1 (30%)
count 35.000000 35 ... 32.000000 33.000000
unique NaN 35 ... NaN NaN
top NaN Braiden Henson ... NaN NaN
freq NaN 1 ... NaN NaN
mean 17.000000 NaN ... 15.405750 18.534455
std 10.246951 NaN ... 2.229394 3.487934
min 0.000000 NaN ... 9.040000 12.246000
25% 8.500000 NaN ... 14.360000 15.918000
50% 17.000000 NaN ... 15.820000 18.366000
75% 25.500000 NaN ... 16.880000 20.817000
max 34.000000 NaN ... 18.960000 26.328000
[11 rows x 7 columns]
unique
tells us how many unique entries are present. It also shows information about the top
one with the highest freq
uency. For example, there are four ‘Majors’ and LS
appears 14 times. Gender has three unique values, with M
appearing 16 times.
2.4 Summaries (II)
I can also do the following to get quick summaries.
-
Data for a single column.
'Test 1 (30%)'].mean() df_class[
18.534454545454548
-
Data for multiple columns.
'Test 1 (30%)', 'Test 2 (20%)']].mean() df_class[[
Test 1 (30%) 18.534455 Test 2 (20%) 15.405750 dtype: float64
-
Show me all the values
'Major'].values df_class[
array(['PHY', 'CHM', 'PHY', 'LS', 'PHY', 'PHY', 'LS', 'LS', 'LS', 'LS', 'LS', 'LS', 'PHY', 'LS', 'CHM', 'PHY', 'CHM', 'PHY', 'CHM', 'CHM', 'PHY', 'CHM', 'LS', 'LS', 'LS', 'PHY', 'PHY', 'PHY', 'CHM', 'LS', 'CHM', 'PHY', 'CBIO', 'LS', 'LS'], dtype=object)
-
'Major'].unique() df_class[
array(['PHY', 'CHM', 'LS', 'CBIO'], dtype=object)
-
'Major'].value_counts() df_class[
Major LS 14 PHY 12 CHM 8 CBIO 1 Name: count, dtype: int64
We can typecast these outputs to more familiar lists or dictionaries by:
'Major'].value_counts().to_dict() df_class[
{'LS': 14, 'PHY': 12, 'CHM': 8, 'CBIO': 1}
'Major'].value_counts().to_list() df_class[
[14, 12, 8, 1]
The following is useful too:
'Major'].value_counts().index.to_list() df_class[
['LS', 'PHY', 'CHM', 'CBIO']
3 Housekeeping
It is always a good idea to housekeep (wrangle) your data so that you fix issues right at the start.
3.1 Dropping unnecessary columns
Let me start by dropping the unnecessary column Unnamed: 0
.
=['Unnamed: 0']) df_class.drop(columns
Unnamed: 0 | Name | Student No | Major | Gender | Test 2 (20%) | Test 1 (30%) | |
---|---|---|---|---|---|---|---|
0 | 0 | Braiden Henson | A3028967J | PHY | M | 18.96 | 20.205 |
1 | 1 | Gustavo Vang | A1282849W | CHM | F | 17.44 | 13.470 |
2 | 2 | Ronin Christian | A5408925A | PHY | M | 15.56 | 18.366 |
3 | 3 | Owen Anderson | A6973859L | LS | F | 16.36 | 18.366 |
4 | 4 | Kyla Young | A5410124H | PHY | M | NaN | 15.306 |
Okay, it seems to have worked but, if you look at df_class
, you will see nothing has changed!
df_class.head()
Unnamed: 0 | Name | Student No | Major | Gender | Test 2 (20%) | Test 1 (30%) | |
---|---|---|---|---|---|---|---|
0 | 0 | Braiden Henson | A3028967J | PHY | M | 18.96 | 20.205 |
1 | 1 | Gustavo Vang | A1282849W | CHM | F | 17.44 | 13.470 |
2 | 2 | Ronin Christian | A5408925A | PHY | M | 15.56 | 18.366 |
3 | 3 | Owen Anderson | A6973859L | LS | F | 16.36 | 18.366 |
4 | 4 | Kyla Young | A5410124H | PHY | M | NaN | 15.306 |
What is happening is that the drop()
command drops, but gives us a new dataframe without changing the original. There are two ways to change our original dataframe.
One is simply to update the variable.
= df_class.drop(columns=['Unnamed: 0']) df_class
Another is to set the option inplace=True
so that drop()
makes the modifications to the original dataframe, inplace.
=['Unnamed: 0'], inplace=True) df_class.drop(columns
If you do this, there will be no output displayed. Note also that not all methods offer the option inplace.
Remember to update the original dataframe if you make changes by either updating the variable or using inplace
I find this behaviour of not modifying the original dataframe until we explicitly say so very useful. I can mess with the dataframe as I like and only update when I know I am ready.
3.2 Reorganising columns
I like to reorganise my columns so that the student number is the first column. I can do it like this:
= ['Student No', 'Name', 'Major', 'Gender',
columns_to_keep 'Test 1 (30%)', 'Test 2 (20%)']
= df_class[columns_to_keep] df_class
Incidentally, I could have done the reorganisation and dropping with this single step. However, this can get tedious if you have a lot of columns.
3.3 Renaming columns
Let’s now rename some of the columns. This is very easy; we just have to supply a dictionary with the columns we want to update. The changes I want are:
= {'Student No': 'MATRIC_NO',
new_column_info 'Test 1 (30%)': 'Test 1',
'Test 2 (20%)': 'Test 2'}
=new_column_info, inplace=True) df_class.rename(columns
df_class.head()
MATRIC_NO | Name | Major | Gender | Test 1 | Test 2 | |
---|---|---|---|---|---|---|
0 | A3028967J | Braiden Henson | PHY | M | 20.205 | 18.96 |
1 | A1282849W | Gustavo Vang | CHM | F | 13.470 | 17.44 |
2 | A5408925A | Ronin Christian | PHY | M | 18.366 | 15.56 |
3 | A6973859L | Owen Anderson | LS | F | 18.366 | 16.36 |
4 | A5410124H | Kyla Young | PHY | M | 15.306 | NaN |
3.4 Setting the index
At the moment, the indices are just numbers from 0 to 34. You can check this with
df_class.index
RangeIndex(start=0, stop=35, step=1)
However, I want to change it so I can refer to my rows using the MATRIC_NO
(I will show you why this is useful later).
'MATRIC_NO', drop=False, inplace=True) df_class.set_index(
I am asking Pandas to use the values of the column ‘MATRIC_NO’ as indices for the rows. I put drop=False
to prevent Pandas from dropping the column because I still like to keep it.
df_class.head()
MATRIC_NO | Name | Major | Gender | Test 1 | Test 2 | |
---|---|---|---|---|---|---|
MATRIC_NO | ||||||
A3028967J | A3028967J | Braiden Henson | PHY | M | 20.205 | 18.96 |
A1282849W | A1282849W | Gustavo Vang | CHM | F | 13.470 | 17.44 |
A5408925A | A5408925A | Ronin Christian | PHY | M | 18.366 | 15.56 |
A6973859L | A6973859L | Owen Anderson | LS | F | 18.366 | 16.36 |
A5410124H | A5410124H | Kyla Young | PHY | M | 15.306 | NaN |
3.5 Applying changes
There are a few things I like to change with the data.
- Replacing the abbreviation for the majors with full names,
- Replacing the abbreviation for the ‘M’, ‘F’, ‘NB’ with their full forms.
- The
Test 2
column has all its numbers between inverted commas (e.g. ‘16.36’ ). I like to remove these inverted commas.
The first two are very simple. But before we do anything, let’s get a list of all the unique items in the columns Majors
and Gender
.
'Major'].unique() df_class[
array(['PHY', 'CHM', 'LS', 'CBIO'], dtype=object)
'Gender'].unique() df_class[
array(['M', 'F', 'NB'], dtype=object)
Now that we know the unique values, let’s make the changes.
= {
replace_info 'PHY': 'Physics',
'CHM': 'Chemistry',
'LS': 'Life Sciences',
'CBIO': 'Comp. Biology',
'F': 'Female',
'M': 'Male',
'NB': 'Non-binary'
}
=replace_info, inplace=True) df_class.replace(to_replace
There are a few ways to do the third. Let me show you the method that is the most versatile and uses the apply()
method.
def clean(text):
try:
return text.replace("'", "")
except AttributeError:
# This will handle the NaN of the missing data
return text
'Test 2'] = df_class['Test 2'].apply(clean) df_class[
Here we use the apply()
method to apply my function clean
to the column ‘Test 2’. The missing numbers in this column are a bit annoying because they are replaced with NaN
, which is a float, while everything else is a string (i.e. English). So, I have included a try-except
statement to get around this. Again, I know the exact error because Python screamed it at me when I did not account for the error. The apply()
method is super useful and can be used to do more complicated things. This is good enough for now.
Let’s see what our handiwork looks like:
df_class.head()
MATRIC_NO | Name | Major | Gender | Test 1 | Test 2 | |
---|---|---|---|---|---|---|
MATRIC_NO | ||||||
A3028967J | A3028967J | Braiden Henson | Physics | Male | 20.205 | 18.96 |
A1282849W | A1282849W | Gustavo Vang | Chemistry | Female | 13.470 | 17.44 |
A5408925A | A5408925A | Ronin Christian | Physics | Male | 18.366 | 15.56 |
A6973859L | A6973859L | Owen Anderson | Life Sciences | Female | 18.366 | 16.36 |
A5410124H | A5410124H | Kyla Young | Physics | Male | 15.306 | NaN |
3.6 Changing Column Type
Let’s now change the column type for some of our columns so that they are more representative of the data they carry. First, remember that we have a problem with column ‘Test 2’ in that it is in English and not in number format viz.
df_class.info()
<class 'pandas.core.frame.DataFrame'>
Index: 35 entries, A3028967J to A7667457P
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MATRIC_NO 35 non-null object
1 Name 35 non-null object
2 Major 35 non-null object
3 Gender 35 non-null object
4 Test 1 33 non-null float64
5 Test 2 32 non-null float64
dtypes: float64(2), object(4)
memory usage: 1.9+ KB
Changing type is done with a dictionary and is just as easy. I will also change the type of Major
and Gender
as they hold categorical data (i.e. data that are limited to a fixed number of options).
= {'Major': 'category',
new_type_info 'Gender': 'category',
'Test 2': 'float'}
# atype does NOT have an inplace option
= df_class.astype(new_type_info) df_class
Let’s check if this worked.
df_class.info()
<class 'pandas.core.frame.DataFrame'>
Index: 35 entries, A3028967J to A7667457P
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MATRIC_NO 35 non-null object
1 Name 35 non-null object
2 Major 35 non-null category
3 Gender 35 non-null category
4 Test 1 33 non-null float64
5 Test 2 32 non-null float64
dtypes: category(2), float64(2), object(2)
memory usage: 1.8+ KB
Yup, looks good.
3.7 Adding a new column
We should create a new column called Total
to hold the total score. This is laughably simple.
"Total"] = df_class["Test 1"] + df_class["Test 2"] df_class[
MATRIC_NO | Name | Major | Gender | Test 1 | Test 2 | Total | |
---|---|---|---|---|---|---|---|
MATRIC_NO | |||||||
A3028967J | A3028967J | Braiden Henson | Physics | Male | 20.205 | 18.96 | 39.165 |
A1282849W | A1282849W | Gustavo Vang | Chemistry | Female | 13.470 | 17.44 | 30.910 |
A5408925A | A5408925A | Ronin Christian | Physics | Male | 18.366 | 15.56 | 33.926 |
A6973859L | A6973859L | Owen Anderson | Life Sciences | Female | 18.366 | 16.36 | 34.726 |
A5410124H | A5410124H | Kyla Young | Physics | Male | 15.306 | NaN | NaN |
3.8 Saving to file
Let’s export our dataframe as an Excel file (many other options are possible!).
'finalised_scores.xlsx', index=False) df_class.to_excel(
I have used the index=False
to tell Pandas not to write the index column.
3.9 The recipe so far
Ideally, what should follow now is fixing any missing data. However, I will postpone that discussion to the next chapter and move on to another important thing to do with dataframes. Before that, however, let me show you the full recipe we have so far been building.
= pd.read_excel('dummy-class-1-of-2.xlsx', skiprows=1)
df_class
#------------------ Drop and reorganise columns -----------------#
= ['Student No', 'Name', 'Major', 'Gender',
columns_to_keep 'Test 1 (30%)', 'Test 2 (20%)']
= df_class[columns_to_keep]
df_class
#------------------------- Rename columns ------------------------#
= {'Student No': 'MATRIC_NO',
new_column_info 'Test 1 (30%)': 'Test 1',
'Test 2 (20%)': 'Test 2'}
=new_column_info, inplace=True)
df_class.rename(columns
#--------------------- Set index to MATRIC_NO --------------------#
'MATRIC_NO', drop=False, inplace=True)
df_class.set_index(
#-------------------------- Rename stuff -------------------------#
= {
replace_info 'PHY': 'Physics',
'CHM': 'Chemistry',
'LS': 'Life Sciences',
'CBIO': 'Comp. Biology',
'F': 'Female',
'M': 'Male',
'NB': 'Non-binary'
}
=replace_info, inplace=True)
df_class.replace(to_replace
#---------------- Remove the ' ' from column Test 2 --------------#
def clean(text):
'''
Function to remove ' ' from column 'Test 2'.
To be applied using apply()
'''
try:
return text.replace("'", "")
except AttributeError:
# This will handle the NaN of the missing data
return text
'Test 2'] = df_class['Test 2'].apply(clean)
df_class[
#--------------- Convert column Test 2 to type float -------------#
= {'Major': 'category',
new_type_info 'Gender': 'category',
'Test 2': 'float'}
= df_class.astype(new_type_info)
df_class
#------------------------ Add a new column -----------------------#
"Total"] = df_class["Test 1"] + df_class["Test 2"]
df_class[
#------------------------- Export the file -----------------------#
'finalised_scores.xlsx', index=False)
df_class.to_excel(
df_class.head()
4 Asking questions
Now we are ready to start probing our dataset for interesting tidbits. But first, we need to know ourselves around the dataframe.
4.1 Locating data
Pandas offer us two ways (loc
and iloc
) to access and loc
ate the data in our dataframe.
Using iloc
We use iloc
by specifying an index (like with a NumPy array). So, the following will show the contents from rows 11, 12, and 13 of columns 3 and 4 (Remember that Python is zero-indexed!).
10:13, 2:4] df_class.iloc[
Major | Gender | |
---|---|---|
MATRIC_NO | ||
A7067766E | Life Sciences | Female |
A5569996J | Life Sciences | Male |
A3202548I | Physics | Male |
Using loc
We use loc
by specifying a names of rows (i.e. the indexes) or a mask. I will explain the mask part in a bit; for the moment, here is how to get the same output as above. Please note that the ‘numbers’ corresponding to rows from 11 to 14, in this case, are ‘names’.
"A7067766E", "A5569996J", "A3202548I"], ["Major", "Gender"]] df_class.loc[[
Major | Gender | |
---|---|---|
MATRIC_NO | ||
A7067766E | Life Sciences | Female |
A5569996J | Life Sciences | Male |
A3202548I | Physics | Male |
4.2 Asking questions with masks
loc
is most powerful with a mask (like NumPy) array. You can create a mask of (True
and False
) by asking a simple question. Here are some examples.
-
"Major"] == "Chemistry" df_class[
MATRIC_NO A3028967J False A1282849W True A5408925A False A6973859L False A5410124H False A9568373Q False A6824244G False A9194090U False A4828364M False A4607700C False A7067766E False A5569996J False A3202548I False A6131593U False A7653832E True A9462811I False A1218599T True A7210476B False A1512479K True A7986368Y True A2727061A False A2999472W True A7116486E False A6931452S False A9649096H False A1643380L False A6787293E False A5975988J False A3699958T True A1956366U False A1468689D True A3217320C False A6867791C False A4080490P False A7667457P False Name: Major, dtype: bool
as you can see, the ‘answer’ is a bunch of
True
andFalse
values for the whole dataframe. You can then apply theseTrue
andFalse
values to mask the values that areFalse
like this:"Major"] == "Chemistry"] df_class.loc[df_class[
MATRIC_NO Name Major Gender Test 1 Test 2 Total MATRIC_NO A1282849W A1282849W Gustavo Vang Chemistry Female 13.470 17.440 30.910 A7653832E A7653832E Quentin Kemp Chemistry Female NaN 15.720 NaN A1218599T A1218599T Camden Williams Chemistry Female 22.653 17.648 40.301 A1512479K A1512479K Solomon Fletcher Chemistry Female 18.981 15.000 33.981 A7986368Y A7986368Y Riley Christensen Chemistry Female 15.306 16.016 31.322 A2999472W A2999472W Skylar Hensley Chemistry Male 23.877 17.648 41.525 A3699958T A3699958T Norah Miles Chemistry Male NaN NaN NaN A1468689D A1468689D Isabela Stokes Chemistry Male 18.366 16.800 35.166 -
You can invert a mask with the
~
like this.~(df_class["Major"] == "Life Sciences")] df_class.loc[
MATRIC_NO Name Major Gender Test 1 Test 2 Total MATRIC_NO A3028967J A3028967J Braiden Henson Physics Male 20.205 18.960 39.165 A1282849W A1282849W Gustavo Vang Chemistry Female 13.470 17.440 30.910 A5408925A A5408925A Ronin Christian Physics Male 18.366 15.560 33.926 A5410124H A5410124H Kyla Young Physics Male 15.306 NaN NaN A9568373Q A9568373Q Wyatt Oliver Physics Male 12.246 14.088 26.334 A3202548I A3202548I Jaxon Chung Physics Male 14.082 16.680 30.762 A7653832E A7653832E Quentin Kemp Chemistry Female NaN 15.720 NaN A9462811I A9462811I Leo Mayo Physics Female 20.817 14.680 35.497 A1218599T A1218599T Camden Williams Chemistry Female 22.653 17.648 40.301 A7210476B A7210476B Sidney Wiggins Physics Non-binary 19.593 14.680 34.273 A1512479K A1512479K Solomon Fletcher Chemistry Female 18.981 15.000 33.981 A7986368Y A7986368Y Riley Christensen Chemistry Female 15.306 16.016 31.322 A2727061A A2727061A Malik Becker Physics Male 12.858 NaN NaN A2999472W A2999472W Skylar Hensley Chemistry Male 23.877 17.648 41.525 A1643380L A1643380L Carolina Mcmahon Physics Male 12.858 18.080 30.938 A6787293E A6787293E Elian Potter Physics Male 23.265 14.360 37.625 A5975988J A5975988J Litzy White Physics Non-binary 20.817 15.920 36.737 A3699958T A3699958T Norah Miles Chemistry Male NaN NaN NaN A1468689D A1468689D Isabela Stokes Chemistry Male 18.366 16.800 35.166 A3217320C A3217320C Kathleen Rodriguez Physics Male 22.653 17.760 40.413 A6867791C A6867791C Katie Ayers Comp. Biology Female 16.530 10.280 26.810 -
Let’s create another mask for all the Chemistry students who identify as Female.
# Create mask = (df_class["Major"] == "Chemistry") & (df_class["Gender"] == 'Female') mask # Using mask "Name", "Major", "Gender"]] df_class.loc[mask, [
Name Major Gender MATRIC_NO A1282849W Gustavo Vang Chemistry Female A7653832E Quentin Kemp Chemistry Female A1218599T Camden Williams Chemistry Female A1512479K Solomon Fletcher Chemistry Female A7986368Y Riley Christensen Chemistry Female -
Let’s do another one. How about all students who scored less than 15% on Test 1?
= df_class["Test 1"] < 15 mask "Name", "Major", "Test 1"]] df_class.loc[mask, [
Name Major Test 1 MATRIC_NO A1282849W Gustavo Vang Chemistry 13.470 A9568373Q Wyatt Oliver Physics 12.246 A3202548I Jaxon Chung Physics 14.082 A2727061A Malik Becker Physics 12.858 A1643380L Carolina Mcmahon Physics 12.858
4.3 The magic of groupby()
Let me show you how to use the mesmerising, extremely powerful function groupby()
. It is absurd how simple some things become with the right tools.
Here are some examples.
-
What are the means of the scores for the various test, according to Major?
=["Major"]).mean(numeric_only=True) df_class.groupby(by
Test 1 Test 2 Total Major Chemistry 18.7755 16.610286 35.534167 Comp. Biology 16.5300 10.280000 26.810000 Life Sciences 19.2420 14.690286 33.932286 Physics 17.7555 16.076800 34.567000 :1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. We use numeric_only=True
to tell Pandas not to try evaluating a mean for non-numeric (‘Gender’ in this case) columns. However, I find this a tad annoying and non-intuitive for other (seeagg()
below) actions. So, I prefer to explicitly ask for the columns I want.=["Major"])[['Test 1', 'Test 2', 'Total']].mean() df_class.groupby(by
-
Let’s group-by
Major
andGender
.=["Major", "Gender"])[['Test 1', 'Test 2', 'Total']].mean() df_class.groupby(by
Test 1 Test 2 Total Major Gender Chemistry Female 17.602500 16.364800 34.128500 Male 21.121500 17.224000 38.345500 Non-binary NaN NaN NaN Comp. Biology Female 16.530000 10.280000 26.810000 Male NaN NaN NaN Non-binary NaN NaN NaN Life Sciences Female 19.285875 14.801000 34.086875 Male 20.357250 15.644000 36.001250 Non-binary 16.836000 12.340000 29.176000 Physics Female 20.817000 14.680000 35.497000 Male 16.871000 16.498286 34.166143 Non-binary 20.205000 15.300000 35.505000 :1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. -
If you don’t want the grouping parameters as indices, you can use
as_index=False
.=["Major", "Gender"], as_index=False)[['Test 1', 'Test 2', 'Total']].mean() df_class.groupby(by
Major Gender Test 1 Test 2 Total 0 Chemistry Female 17.602500 16.364800 34.128500 1 Chemistry Male 21.121500 17.224000 38.345500 2 Chemistry Non-binary NaN NaN NaN 3 Comp. Biology Female 16.530000 10.280000 26.810000 4 Comp. Biology Male NaN NaN NaN 5 Comp. Biology Non-binary NaN NaN NaN 6 Life Sciences Female 19.285875 14.801000 34.086875 7 Life Sciences Male 20.357250 15.644000 36.001250 8 Life Sciences Non-binary 16.836000 12.340000 29.176000 9 Physics Female 20.817000 14.680000 35.497000 10 Physics Male 16.871000 16.498286 34.166143 11 Physics Non-binary 20.205000 15.300000 35.505000 :1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. -
If you are only interested in a specific column, you can do this:
=["Major", "Gender"])["Test 1"].mean() df_class.groupby(by
Test 1 Major Gender Chemistry Female 17.602500 Male 21.121500 Non-binary NaN Comp. Biology Female 16.530000 Male NaN Non-binary NaN Life Sciences Female 19.285875 Male 20.357250 Non-binary 16.836000 Physics Female 20.817000 Male 16.871000 Non-binary 20.205000 :1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. -
Other ‘in-built’ function you can use are
count()
,sum()
,mean()
,median()
,min()
,max()
,mode()
,std()
,var()
. However, you can far more out ofgroupby()
by usingagg()
to apply any function. Let me show you how to apply some functions from NumPy.=["Major", "Gender"])[['Test 1', 'Test 2', 'Total']].agg([np.mean, np.std]).round(2) df_class.groupby(by
Test 1 Test 2 Total mean std mean std mean std Major Gender Chemistry Female 17.60 4.07 16.36 1.14 34.13 4.33 Male 21.12 3.90 17.22 0.60 38.35 4.50 Non-binary NaN NaN NaN NaN NaN NaN Comp. Biology Female 16.53 NaN 10.28 NaN 26.81 NaN Male NaN NaN NaN NaN NaN NaN Non-binary NaN NaN NaN NaN NaN NaN Life Sciences Female 19.29 2.07 14.80 2.61 34.09 3.84 Male 20.36 4.37 15.64 1.49 36.00 4.95 Non-binary 16.84 2.16 12.34 1.90 29.18 0.27 Physics Female 20.82 NaN 14.68 NaN 35.50 NaN Male 16.87 4.35 16.50 1.89 34.17 5.15 Non-binary 20.20 0.87 15.30 0.88 35.50 1.74 :1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. :1: FutureWarning: The provided callable <function mean at 0x7fc8efe08f40> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string “mean” instead. :1: FutureWarning: The provided callable <function std at 0x7fc8efe09080> is currently using SeriesGroupBy.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string “std” instead. -
You can even combine
groupby()
with a mask. For example, let’s see how all the Female students are doing.= df_class['Gender'] == 'Female' mask =['Major']).mean(numeric_only=True) df_class[mask].groupby(by
Test 1 Test 2 Total Major Chemistry 17.602500 16.3648 34.128500 Comp. Biology 16.530000 10.2800 26.810000 Life Sciences 19.285875 14.8010 34.086875 Physics 20.817000 14.6800 35.497000 :1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. Just for comparison,
=["Gender", "Major"]).mean(numeric_only=True) df_class.groupby(by
Test 1 Test 2 Total Gender Major Female Chemistry 17.602500 16.364800 34.128500 Comp. Biology 16.530000 10.280000 26.810000 Life Sciences 19.285875 14.801000 34.086875 Physics 20.817000 14.680000 35.497000 Male Chemistry 21.121500 17.224000 38.345500 Comp. Biology NaN NaN NaN Life Sciences 20.357250 15.644000 36.001250 Physics 16.871000 16.498286 34.166143 Non-binary Chemistry NaN NaN NaN Comp. Biology NaN NaN NaN Life Sciences 16.836000 12.340000 29.176000 Physics 20.205000 15.300000 35.505000 :1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.