Pandas (Need)

What to expect in this chapter

This section will introduce you to dataframes. Data frames are one of the most powerful ways to store and manipulate data. They ‘look’ similar to spreadsheets, but don’t be fooled; dataframes are far more versatile and powerful. The Python package dedicated to dataframes is Pandas, which is built on top of NumPy. If you plan to work with data Pandas is one of the must-have tools in your toolkit.

Before proceeding…

Import Pandas and NumPy using: python import pandas as pd import numpy as np
You will also need the xlrd package to read and write .xls files. Please install it using bash conda install -c conda-forge -y xlrd
Download the following data sets and move them into your active folder.

		Format
1.	Test data for a dummy class	`dummy-class_1-of-2.xlsx` `dummy-class-2-of-2.xlsx`
2.	Data for another dummy class	`another-dummy-class.xlsx`

1 What is a dataframe

An easy way to get started with dataframes is to read in some data and see what it looks like. I have prepared a dummy set of classrom data for this. You should have already downloaded it into the relevant folder in your Learning Portfolio.

1.1 Reading data from `csv` or `excel` files.

One of the most common file formats used to hold data that you will encounter are Excel files (.xls or .xlsx), csv(comma-separated-values) files or text files (.txt or .dat). Pandas offer ways to read all these files and (many) more. Let’s read in the data in dummy-class.xlsx.

df_class = pd.read_excel("dummy-class-1-of-2.xlsx", skiprows = 1)

I have used the option skiprows=1 because the first row of the Excel file contains a description that is not data. skiprows is one of many options we can specify.

I have used the variable df_class to hold the dataframe. With Jupyter you can see the dataframe in a nicely formatted output by simply running:

Code
Output

df_class

	Unnamed: 0	Name	Student No	Major	Gender	Test 2 (20%)	Test 1 (30%)
0	0	Braiden Henson	A3028967J	PHY	M	‘18.96’	20.205
1	1	Gustavo Vang	A1282849W	CHM	F	‘17.44’	13.470
2	2	Ronin Christian	A5408925A	PHY	M	‘15.56’	18.366
3	3	Owen Anderson	A6973859L	LS	F	‘16.36’	18.366
4	4	Kyla Young	A5410124H	PHY	M	NaN	15.306
5	5	Wyatt Oliver	A9568373Q	PHY	M	‘14.088’	12.246
6	6	Essence Bauer	A6824244G	LS	M	‘16.72’	16.530
7	7	Maryjane Sandoval	A9194090U	LS	F	‘16.4’	18.981
8	8	Carl Trujillo	A4828364M	LS	NB	‘13.68’	15.306
9	9	Halle Fritz	A4607700C	LS	F	‘9.04’	17.754
10	10	Marie Hoffman	A7067766E	LS	F	‘16.88’	19.593
11	11	Lilianna Kaufman	A5569996J	LS	M	‘17.0’	26.328
12	12	Jaxon Chung	A3202548I	PHY	M	‘16.68’	14.082
13	13	Zoey Oconnell	A6131593U	LS	F	‘14.128’	22.041
14	14	Quentin Kemp	A7653832E	CHM	F	‘15.72’	NaN
15	15	Leo Mayo	A9462811I	PHY	F	‘14.68’	20.817
16	16	Camden Williams	A1218599T	CHM	F	‘17.648’	22.653
17	17	Sidney Wiggins	A7210476B	PHY	NB	‘14.68’	19.593
18	18	Solomon Fletcher	A1512479K	CHM	F	‘15.0’	18.981
19	19	Riley Christensen	A7986368Y	CHM	F	‘16.016’	15.306
20	20	Malik Becker	A2727061A	PHY	M	NaN	12.858
21	21	Skylar Hensley	A2999472W	CHM	M	‘17.648’	23.877
22	22	Braydon Duran	A7116486E	LS	F	‘14.36’	15.918
23	23	Jalen Harmon	A6931452S	LS	F	‘16.88’	22.041
24	24	Ean Haas	A9649096H	LS	F	‘14.36’	19.593
25	25	Carolina Mcmahon	A1643380L	PHY	M	‘18.08’	12.858
26	26	Elian Potter	A6787293E	PHY	M	‘14.36’	23.265
27	27	Litzy White	A5975988J	PHY	NB	‘15.92’	20.817
28	28	Norah Miles	A3699958T	CHM	M	NaN	NaN
29	29	Mariela Sheppard	A1956366U	LS	NB	‘11.0’	18.366
30	30	Isabela Stokes	A1468689D	CHM	M	‘16.8’	18.366
31	31	Kathleen Rodriguez	A3217320C	PHY	M	‘17.76’	22.653
32	32	Katie Ayers	A6867791C	CBIO	F	‘10.28’	16.530
33	33	Tucker Sloan	A4080490P	LS	M	‘15.04’	17.754
34	34	Carter Crane	A7667457P	LS	M	‘13.816’	20.817

1.2 Characteristics to notice

Okay, now you are bound to think that the above dataframe is no different from a typical spreadsheet. First, however, let me draw your attention to three different characteristics.

The first is the use of an index (leftmost column in bold) for each row (0 1 2 3...). This simple fact simplifies doing things with your data. I will show you some examples later.
The second is that a column (called a Pandas Series) can only have a single type of data. So, for example, you can have numbers or text; but not both.
The third is the presence of NaN (‘Not-a-Number’) when data is missing. Missing data is a common bane of real data; Pandas offers powerful ways to deal with such missing numbers transparently. I will show more later in this chapter.

For the moment,

Remember

Remember that a dataframe is like a spreadsheet but has an additional unique index for each row. This index plays an important role.

2 Quick Peeks

2.1 Head and Tail

Real data sets are bound to have more rows than the dummy class toy data set. You can take a quick peek at the top or bottom of the dataframe by using the head() and tail() methods, respectively.

df_class.head(3)

	Unnamed: 0	Name	Student No	Major	Gender	Test 2 (20%)	Test 1 (30%)
0	0	Braiden Henson	A3028967J	PHY	M	‘18.96’	20.205
1	1	Gustavo Vang	A1282849W	CHM	F	‘17.44’	13.470
2	2	Ronin Christian	A5408925A	PHY	M	‘15.56’	18.366

df_class.tail(3)

	Unnamed: 0	Name	Student No	Major	Gender	Test 2 (20%)	Test 1 (30%)
32	32	Katie Ayers	A6867791C	CBIO	F	‘10.28’	16.530
33	33	Tucker Sloan	A4080490P	LS	M	‘15.04’	17.754
34	34	Carter Crane	A7667457P	LS	M	‘13.816’	20.817

If you do not specify the number of rows, it defaults to five.

2.2 Size and Shape

We can get the size and shape of the dataframe by using shape, len() and columns.

df_class.shape

(35, 7)

len(df_class)

df_class.columns

Index(['Unnamed: 0', 'Name', 'Student No', 'Major', 'Gender', 'Test 2 (20%)',
       'Test 1 (30%)'],
      dtype='object')

2.3 Summaries (I)

Pandas offer more powerful ways to get more valuable summaries of your dataset with the methods info() and describe().

df_class.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    35 non-null     int64  
 1   Name          35 non-null     object 
 2   Student No    35 non-null     object 
 3   Major         35 non-null     object 
 4   Gender        35 non-null     object 
 5   Test 2 (20%)  32 non-null     object 
 6   Test 1 (30%)  33 non-null     float64
dtypes: float64(1), int64(1), object(5)
memory usage: 2.0+ KB

This information is very useful. You can quickly see the various columns, what type of data they contain and how many data points are present. This is a good place to spot missing data like in ‘Test 1’ and ‘Test 2’. The type object means str or English-like data. float64 is floating point data (i.e. decimal numbers). We also note that ‘Test 2’ is not a numerical type despite their data having numerical significance. So, one of the first things we should do is convert these to numerical types. I will focus on type conversion a bit later. However, let me ‘parachute’ some code here to continue our discussion.

A quick digression

Pandas do not see ‘Test 2’ data as numbers because its data has additional' '.

df_class['Test 2 (20%)'].head(3)

0    '18.96'
1    '17.44'
2    '15.56'
Name: Test 2 (20%), dtype: object

Here, I first replace the ' with nothing and then convert the data using pd.to_numeric()

df_class['Test 2 (20%)'] = df_class['Test 2 (20%)'].str.replace("'", "")
df_class['Test 2 (20%)'] = pd.to_numeric(df_class['Test 2 (20%)'])

df_class.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    35 non-null     int64  
 1   Name          35 non-null     object 
 2   Student No    35 non-null     object 
 3   Major         35 non-null     object 
 4   Gender        35 non-null     object 
 5   Test 2 (20%)  32 non-null     float64
 6   Test 1 (30%)  33 non-null     float64
dtypes: float64(2), int64(1), object(4)
memory usage: 2.0+ KB

Back to the original discussion

Okay, let’s continue discussing ways to peek into the file.

df_class.describe()

       Unnamed: 0  Test 2 (20%)  Test 1 (30%)
count   35.000000     32.000000     33.000000
mean    17.000000     15.405750     18.534455
std     10.246951      2.229394      3.487934
min      0.000000      9.040000     12.246000
25%      8.500000     14.360000     15.918000
50%     17.000000     15.820000     18.366000
75%     25.500000     16.880000     20.817000
max     34.000000     18.960000     26.328000

describe() gives us some summary statistics. By default, it only shows information relevant to the numerical columns. But let’s ask for more!

df_class.describe(include="all")

        Unnamed: 0            Name  ... Test 2 (20%) Test 1 (30%)
count    35.000000              35  ...    32.000000    33.000000
unique         NaN              35  ...          NaN          NaN
top            NaN  Braiden Henson  ...          NaN          NaN
freq           NaN               1  ...          NaN          NaN
mean     17.000000             NaN  ...    15.405750    18.534455
std      10.246951             NaN  ...     2.229394     3.487934
min       0.000000             NaN  ...     9.040000    12.246000
25%       8.500000             NaN  ...    14.360000    15.918000
50%      17.000000             NaN  ...    15.820000    18.366000
75%      25.500000             NaN  ...    16.880000    20.817000
max      34.000000             NaN  ...    18.960000    26.328000

[11 rows x 7 columns]

unique tells us how many unique entries are present. It also shows information about the top one with the highest frequency. For example, there are four ‘Majors’ and LS appears 14 times. Gender has three unique values, with M appearing 16 times.

2.4 Summaries (II)

I can also do the following to get quick summaries.

Data for a single column.

df_class['Test 1 (30%)'].mean()

18.534454545454548

Data for multiple columns.

df_class[['Test 1 (30%)', 'Test 2 (20%)']].mean()

Test 1 (30%)    18.534455
Test 2 (20%)    15.405750
dtype: float64

Show me all the values

df_class['Major'].values

array(['PHY', 'CHM', 'PHY', 'LS', 'PHY', 'PHY', 'LS', 'LS', 'LS', 'LS',
       'LS', 'LS', 'PHY', 'LS', 'CHM', 'PHY', 'CHM', 'PHY', 'CHM', 'CHM',
       'PHY', 'CHM', 'LS', 'LS', 'LS', 'PHY', 'PHY', 'PHY', 'CHM', 'LS',
       'CHM', 'PHY', 'CBIO', 'LS', 'LS'], dtype=object)

df_class['Major'].unique()

array(['PHY', 'CHM', 'LS', 'CBIO'], dtype=object)

df_class['Major'].value_counts()

Major
LS      14
PHY     12
CHM      8
CBIO     1
Name: count, dtype: int64

We can typecast these outputs to more familiar lists or dictionaries by:

df_class['Major'].value_counts().to_dict()

{'LS': 14, 'PHY': 12, 'CHM': 8, 'CBIO': 1}

df_class['Major'].value_counts().to_list()

[14, 12, 8, 1]

The following is useful too:

df_class['Major'].value_counts().index.to_list()

['LS', 'PHY', 'CHM', 'CBIO']

3 Housekeeping

It is always a good idea to housekeep (wrangle) your data so that you fix issues right at the start.

3.1 Dropping unnecessary columns

Let me start by dropping the unnecessary column Unnamed: 0.

df_class.drop(columns=['Unnamed: 0'])

	Unnamed: 0	Name	Student No	Major	Gender	Test 2 (20%)	Test 1 (30%)
0	0	Braiden Henson	A3028967J	PHY	M	18.96	20.205
1	1	Gustavo Vang	A1282849W	CHM	F	17.44	13.470
2	2	Ronin Christian	A5408925A	PHY	M	15.56	18.366
3	3	Owen Anderson	A6973859L	LS	F	16.36	18.366
4	4	Kyla Young	A5410124H	PHY	M	NaN	15.306

Okay, it seems to have worked but, if you look at df_class, you will see nothing has changed!

df_class.head()

	Unnamed: 0	Name	Student No	Major	Gender	Test 2 (20%)	Test 1 (30%)
0	0	Braiden Henson	A3028967J	PHY	M	18.96	20.205
1	1	Gustavo Vang	A1282849W	CHM	F	17.44	13.470
2	2	Ronin Christian	A5408925A	PHY	M	15.56	18.366
3	3	Owen Anderson	A6973859L	LS	F	16.36	18.366
4	4	Kyla Young	A5410124H	PHY	M	NaN	15.306

What is happening is that the drop() command drops, but gives us a new dataframe without changing the original. There are two ways to change our original dataframe.

One is simply to update the variable.

df_class = df_class.drop(columns=['Unnamed: 0'])

Another is to set the option inplace=True so that drop() makes the modifications to the original dataframe, inplace.

df_class.drop(columns=['Unnamed: 0'], inplace=True)

If you do this, there will be no output displayed. Note also that not all methods offer the option inplace.

Remember

Remember to update the original dataframe if you make changes by either updating the variable or using inplace

I find this behaviour of not modifying the original dataframe until we explicitly say so very useful. I can mess with the dataframe as I like and only update when I know I am ready.

3.2 Reorganising columns

I like to reorganise my columns so that the student number is the first column. I can do it like this:

columns_to_keep = ['Student No', 'Name', 'Major', 'Gender',
                   'Test 1 (30%)', 'Test 2 (20%)']

df_class = df_class[columns_to_keep]

Incidentally, I could have done the reorganisation and dropping with this single step. However, this can get tedious if you have a lot of columns.

3.3 Renaming columns

Let’s now rename some of the columns. This is very easy; we just have to supply a dictionary with the columns we want to update. The changes I want are:

new_column_info = {'Student No': 'MATRIC_NO',
                   'Test 1 (30%)': 'Test 1',
                   'Test 2 (20%)': 'Test 2'}

df_class.rename(columns=new_column_info, inplace=True)

df_class.head()

	MATRIC_NO	Name	Major	Gender	Test 1	Test 2
0	A3028967J	Braiden Henson	PHY	M	20.205	18.96
1	A1282849W	Gustavo Vang	CHM	F	13.470	17.44
2	A5408925A	Ronin Christian	PHY	M	18.366	15.56
3	A6973859L	Owen Anderson	LS	F	18.366	16.36
4	A5410124H	Kyla Young	PHY	M	15.306	NaN

3.4 Setting the index

At the moment, the indices are just numbers from 0 to 34. You can check this with

df_class.index

RangeIndex(start=0, stop=35, step=1)

However, I want to change it so I can refer to my rows using the MATRIC_NO (I will show you why this is useful later).

df_class.set_index('MATRIC_NO', drop=False, inplace=True)

I am asking Pandas to use the values of the column ‘MATRIC_NO’ as indices for the rows. I put drop=False to prevent Pandas from dropping the column because I still like to keep it.

df_class.head()

	MATRIC_NO	Name	Major	Gender	Test 1	Test 2
MATRIC_NO
A3028967J	A3028967J	Braiden Henson	PHY	M	20.205	18.96
A1282849W	A1282849W	Gustavo Vang	CHM	F	13.470	17.44
A5408925A	A5408925A	Ronin Christian	PHY	M	18.366	15.56
A6973859L	A6973859L	Owen Anderson	LS	F	18.366	16.36
A5410124H	A5410124H	Kyla Young	PHY	M	15.306	NaN

3.5 Applying changes

There are a few things I like to change with the data.

Replacing the abbreviation for the majors with full names,
Replacing the abbreviation for the ‘M’, ‘F’, ‘NB’ with their full forms.
The Test 2 column has all its numbers between inverted commas (e.g. ‘16.36’ ). I like to remove these inverted commas.

The first two are very simple. But before we do anything, let’s get a list of all the unique items in the columns Majors and Gender.

df_class['Major'].unique()

array(['PHY', 'CHM', 'LS', 'CBIO'], dtype=object)

df_class['Gender'].unique()

array(['M', 'F', 'NB'], dtype=object)

Now that we know the unique values, let’s make the changes.

replace_info = {
    'PHY': 'Physics',
    'CHM': 'Chemistry',
    'LS': 'Life Sciences',
    'CBIO': 'Comp. Biology',
    'F': 'Female',
    'M': 'Male',
    'NB': 'Non-binary'
}

df_class.replace(to_replace=replace_info, inplace=True)

There are a few ways to do the third. Let me show you the method that is the most versatile and uses the apply() method.

def clean(text):
    try:
        return text.replace("'", "")
    except AttributeError:
        # This will handle the NaN of the missing data
        return text

df_class['Test 2'] = df_class['Test 2'].apply(clean)

Here we use the apply() method to apply my function clean to the column ‘Test 2’. The missing numbers in this column are a bit annoying because they are replaced with NaN, which is a float, while everything else is a string (i.e. English). So, I have included a try-except statement to get around this. Again, I know the exact error because Python screamed it at me when I did not account for the error. The apply() method is super useful and can be used to do more complicated things. This is good enough for now.

Let’s see what our handiwork looks like:

df_class.head()

	MATRIC_NO	Name	Major	Gender	Test 1	Test 2
MATRIC_NO
A3028967J	A3028967J	Braiden Henson	Physics	Male	20.205	18.96
A1282849W	A1282849W	Gustavo Vang	Chemistry	Female	13.470	17.44
A5408925A	A5408925A	Ronin Christian	Physics	Male	18.366	15.56
A6973859L	A6973859L	Owen Anderson	Life Sciences	Female	18.366	16.36
A5410124H	A5410124H	Kyla Young	Physics	Male	15.306	NaN

3.6 Changing Column Type

Let’s now change the column type for some of our columns so that they are more representative of the data they carry. First, remember that we have a problem with column ‘Test 2’ in that it is in English and not in number format viz.

df_class.info()

<class 'pandas.core.frame.DataFrame'>
Index: 35 entries, A3028967J to A7667457P
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   MATRIC_NO  35 non-null     object 
 1   Name       35 non-null     object 
 2   Major      35 non-null     object 
 3   Gender     35 non-null     object 
 4   Test 1     33 non-null     float64
 5   Test 2     32 non-null     float64
dtypes: float64(2), object(4)
memory usage: 1.9+ KB

Changing type is done with a dictionary and is just as easy. I will also change the type of Major and Gender as they hold categorical data (i.e. data that are limited to a fixed number of options).

new_type_info = {'Major': 'category',
                 'Gender': 'category',
                 'Test 2': 'float'}

# atype does NOT have an inplace option
df_class = df_class.astype(new_type_info)

Let’s check if this worked.

df_class.info()

<class 'pandas.core.frame.DataFrame'>
Index: 35 entries, A3028967J to A7667457P
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   MATRIC_NO  35 non-null     object  
 1   Name       35 non-null     object  
 2   Major      35 non-null     category
 3   Gender     35 non-null     category
 4   Test 1     33 non-null     float64 
 5   Test 2     32 non-null     float64 
dtypes: category(2), float64(2), object(2)
memory usage: 1.8+ KB

Yup, looks good.

3.7 Adding a new column

We should create a new column called Total to hold the total score. This is laughably simple.

df_class["Total"] = df_class["Test 1"] + df_class["Test 2"]

	MATRIC_NO	Name	Major	Gender	Test 1	Test 2	Total
MATRIC_NO
A3028967J	A3028967J	Braiden Henson	Physics	Male	20.205	18.96	39.165
A1282849W	A1282849W	Gustavo Vang	Chemistry	Female	13.470	17.44	30.910
A5408925A	A5408925A	Ronin Christian	Physics	Male	18.366	15.56	33.926
A6973859L	A6973859L	Owen Anderson	Life Sciences	Female	18.366	16.36	34.726
A5410124H	A5410124H	Kyla Young	Physics	Male	15.306	NaN	NaN

3.8 Saving to file

Let’s export our dataframe as an Excel file (many other options are possible!).

df_class.to_excel('finalised_scores.xlsx', index=False)

I have used the index=False to tell Pandas not to write the index column.

3.9 The recipe so far

Ideally, what should follow now is fixing any missing data. However, I will postpone that discussion to the next chapter and move on to another important thing to do with dataframes. Before that, however, let me show you the full recipe we have so far been building.

df_class = pd.read_excel('dummy-class-1-of-2.xlsx', skiprows=1)

#------------------ Drop and reorganise columns  -----------------#
columns_to_keep = ['Student No', 'Name', 'Major', 'Gender',
                   'Test 1 (30%)', 'Test 2 (20%)']

df_class = df_class[columns_to_keep]

#------------------------- Rename columns ------------------------#
new_column_info = {'Student No': 'MATRIC_NO',
                   'Test 1 (30%)': 'Test 1',
                   'Test 2 (20%)': 'Test 2'}

df_class.rename(columns=new_column_info, inplace=True)

#--------------------- Set index to MATRIC_NO --------------------#
df_class.set_index('MATRIC_NO', drop=False, inplace=True)

#-------------------------- Rename stuff -------------------------#
replace_info = {
    'PHY': 'Physics',
    'CHM': 'Chemistry',
    'LS': 'Life Sciences',
    'CBIO': 'Comp. Biology',
    'F': 'Female',
    'M': 'Male',
    'NB': 'Non-binary'
}

df_class.replace(to_replace=replace_info, inplace=True)

#---------------- Remove the ' ' from column Test 2 --------------#


def clean(text):
    '''
    Function to remove ' ' from column 'Test 2'.
    To be applied using apply()
    '''
    try:
        return text.replace("'", "")
    except AttributeError:
        # This will handle the NaN of the missing data
        return text


df_class['Test 2'] = df_class['Test 2'].apply(clean)

#--------------- Convert column Test 2 to type float -------------#
new_type_info = {'Major': 'category',
                 'Gender': 'category',
                 'Test 2': 'float'}

df_class = df_class.astype(new_type_info)

#------------------------ Add a new column -----------------------#
df_class["Total"] = df_class["Test 1"] + df_class["Test 2"]

#------------------------- Export the file -----------------------#
df_class.to_excel('finalised_scores.xlsx', index=False)

df_class.head()

4 Asking questions

Now we are ready to start probing our dataset for interesting tidbits. But first, we need to know ourselves around the dataframe.

4.1 Locating data

Pandas offer us two ways (loc and iloc) to access and locate the data in our dataframe.

Using `iloc`

We use iloc by specifying an index (like with a NumPy array). So, the following will show the contents from rows 11, 12, and 13 of columns 3 and 4 (Remember that Python is zero-indexed!).

df_class.iloc[10:13, 2:4]

	Major	Gender
MATRIC_NO
A7067766E	Life Sciences	Female
A5569996J	Life Sciences	Male
A3202548I	Physics	Male

Using `loc`

We use loc by specifying a names of rows (i.e. the indexes) or a mask. I will explain the mask part in a bit; for the moment, here is how to get the same output as above. Please note that the ‘numbers’ corresponding to rows from 11 to 14, in this case, are ‘names’.

df_class.loc[["A7067766E", "A5569996J", "A3202548I"], ["Major", "Gender"]]

	Major	Gender
MATRIC_NO
A7067766E	Life Sciences	Female
A5569996J	Life Sciences	Male
A3202548I	Physics	Male

4.2 Asking questions with masks

loc is most powerful with a mask (like NumPy) array. You can create a mask of (True and False) by asking a simple question. Here are some examples.

df_class["Major"] == "Chemistry"

MATRIC_NO
A3028967J    False
A1282849W     True
A5408925A    False
A6973859L    False
A5410124H    False
A9568373Q    False
A6824244G    False
A9194090U    False
A4828364M    False
A4607700C    False
A7067766E    False
A5569996J    False
A3202548I    False
A6131593U    False
A7653832E     True
A9462811I    False
A1218599T     True
A7210476B    False
A1512479K     True
A7986368Y     True
A2727061A    False
A2999472W     True
A7116486E    False
A6931452S    False
A9649096H    False
A1643380L    False
A6787293E    False
A5975988J    False
A3699958T     True
A1956366U    False
A1468689D     True
A3217320C    False
A6867791C    False
A4080490P    False
A7667457P    False
Name: Major, dtype: bool

as you can see, the ‘answer’ is a bunch of True and False values for the whole dataframe. You can then apply these True and False values to mask the values that are False like this:

df_class.loc[df_class["Major"] == "Chemistry"]

	MATRIC_NO	Name	Major	Gender	Test 1	Test 2	Total
MATRIC_NO
A1282849W	A1282849W	Gustavo Vang	Chemistry	Female	13.470	17.440	30.910
A7653832E	A7653832E	Quentin Kemp	Chemistry	Female	NaN	15.720	NaN
A1218599T	A1218599T	Camden Williams	Chemistry	Female	22.653	17.648	40.301
A1512479K	A1512479K	Solomon Fletcher	Chemistry	Female	18.981	15.000	33.981
A7986368Y	A7986368Y	Riley Christensen	Chemistry	Female	15.306	16.016	31.322
A2999472W	A2999472W	Skylar Hensley	Chemistry	Male	23.877	17.648	41.525
A3699958T	A3699958T	Norah Miles	Chemistry	Male	NaN	NaN	NaN
A1468689D	A1468689D	Isabela Stokes	Chemistry	Male	18.366	16.800	35.166

You can invert a mask with the ~ like this.

df_class.loc[~(df_class["Major"] == "Life Sciences")]

	MATRIC_NO	Name	Major	Gender	Test 1	Test 2	Total
MATRIC_NO
A3028967J	A3028967J	Braiden Henson	Physics	Male	20.205	18.960	39.165
A1282849W	A1282849W	Gustavo Vang	Chemistry	Female	13.470	17.440	30.910
A5408925A	A5408925A	Ronin Christian	Physics	Male	18.366	15.560	33.926
A5410124H	A5410124H	Kyla Young	Physics	Male	15.306	NaN	NaN
A9568373Q	A9568373Q	Wyatt Oliver	Physics	Male	12.246	14.088	26.334
A3202548I	A3202548I	Jaxon Chung	Physics	Male	14.082	16.680	30.762
A7653832E	A7653832E	Quentin Kemp	Chemistry	Female	NaN	15.720	NaN
A9462811I	A9462811I	Leo Mayo	Physics	Female	20.817	14.680	35.497
A1218599T	A1218599T	Camden Williams	Chemistry	Female	22.653	17.648	40.301
A7210476B	A7210476B	Sidney Wiggins	Physics	Non-binary	19.593	14.680	34.273
A1512479K	A1512479K	Solomon Fletcher	Chemistry	Female	18.981	15.000	33.981
A7986368Y	A7986368Y	Riley Christensen	Chemistry	Female	15.306	16.016	31.322
A2727061A	A2727061A	Malik Becker	Physics	Male	12.858	NaN	NaN
A2999472W	A2999472W	Skylar Hensley	Chemistry	Male	23.877	17.648	41.525
A1643380L	A1643380L	Carolina Mcmahon	Physics	Male	12.858	18.080	30.938
A6787293E	A6787293E	Elian Potter	Physics	Male	23.265	14.360	37.625
A5975988J	A5975988J	Litzy White	Physics	Non-binary	20.817	15.920	36.737
A3699958T	A3699958T	Norah Miles	Chemistry	Male	NaN	NaN	NaN
A1468689D	A1468689D	Isabela Stokes	Chemistry	Male	18.366	16.800	35.166
A3217320C	A3217320C	Kathleen Rodriguez	Physics	Male	22.653	17.760	40.413
A6867791C	A6867791C	Katie Ayers	Comp. Biology	Female	16.530	10.280	26.810

Let’s create another mask for all the Chemistry students who identify as Female.

# Create mask
mask = (df_class["Major"] == "Chemistry") & (df_class["Gender"] == 'Female')

# Using mask
df_class.loc[mask, ["Name", "Major", "Gender"]]

	Name	Major	Gender
MATRIC_NO
A1282849W	Gustavo Vang	Chemistry	Female
A7653832E	Quentin Kemp	Chemistry	Female
A1218599T	Camden Williams	Chemistry	Female
A1512479K	Solomon Fletcher	Chemistry	Female
A7986368Y	Riley Christensen	Chemistry	Female

Cool right!

Let’s do another one. How about all students who scored less than 15% on Test 1?

mask = df_class["Test 1"] < 15
df_class.loc[mask, ["Name", "Major", "Test 1"]]

	Name	Major	Test 1
MATRIC_NO
A1282849W	Gustavo Vang	Chemistry	13.470
A9568373Q	Wyatt Oliver	Physics	12.246
A3202548I	Jaxon Chung	Physics	14.082
A2727061A	Malik Becker	Physics	12.858
A1643380L	Carolina Mcmahon	Physics	12.858

4.3 The magic of `groupby()`

Let me show you how to use the mesmerising, extremely powerful function groupby(). It is absurd how simple some things become with the right tools.

Here are some examples.

What are the means of the scores for the various test, according to Major?

df_class.groupby(by=["Major"]).mean(numeric_only=True)

	Test 1	Test 2	Total
Major
Chemistry	18.7755	16.610286	35.534167
Comp. Biology	16.5300	10.280000	26.810000
Life Sciences	19.2420	14.690286	33.932286
Physics	17.7555	16.076800	34.567000

:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. We use numeric_only=True to tell Pandas not to try evaluating a mean for non-numeric (‘Gender’ in this case) columns. However, I find this a tad annoying and non-intuitive for other (see agg() below) actions. So, I prefer to explicitly ask for the columns I want.

df_class.groupby(by=["Major"])[['Test 1', 'Test 2', 'Total']].mean()

Let’s group-by Major and Gender.

df_class.groupby(by=["Major", "Gender"])[['Test 1', 'Test 2', 'Total']].mean()

		Test 1	Test 2	Total
Major	Gender
Chemistry	Female	17.602500	16.364800	34.128500
	Male	21.121500	17.224000	38.345500
	Non-binary	NaN	NaN	NaN
Comp. Biology	Female	16.530000	10.280000	26.810000
	Male	NaN	NaN	NaN
	Non-binary	NaN	NaN	NaN
Life Sciences	Female	19.285875	14.801000	34.086875
	Male	20.357250	15.644000	36.001250
	Non-binary	16.836000	12.340000	29.176000
Physics	Female	20.817000	14.680000	35.497000
	Male	16.871000	16.498286	34.166143
	Non-binary	20.205000	15.300000	35.505000

If you don’t want the grouping parameters as indices, you can use as_index=False.

df_class.groupby(by=["Major", "Gender"], as_index=False)[['Test 1', 'Test 2', 'Total']].mean()

	Major	Gender	Test 1	Test 2	Total
0	Chemistry	Female	17.602500	16.364800	34.128500
1	Chemistry	Male	21.121500	17.224000	38.345500
2	Chemistry	Non-binary	NaN	NaN	NaN
3	Comp. Biology	Female	16.530000	10.280000	26.810000
4	Comp. Biology	Male	NaN	NaN	NaN
5	Comp. Biology	Non-binary	NaN	NaN	NaN
6	Life Sciences	Female	19.285875	14.801000	34.086875
7	Life Sciences	Male	20.357250	15.644000	36.001250
8	Life Sciences	Non-binary	16.836000	12.340000	29.176000
9	Physics	Female	20.817000	14.680000	35.497000
10	Physics	Male	16.871000	16.498286	34.166143
11	Physics	Non-binary	20.205000	15.300000	35.505000

If you are only interested in a specific column, you can do this:

df_class.groupby(by=["Major", "Gender"])["Test 1"].mean()

		Test 1
Major	Gender
Chemistry	Female	17.602500
	Male	21.121500
	Non-binary	NaN
Comp. Biology	Female	16.530000
	Male	NaN
	Non-binary	NaN
Life Sciences	Female	19.285875
	Male	20.357250
	Non-binary	16.836000
Physics	Female	20.817000
	Male	16.871000
	Non-binary	20.205000

Other ‘in-built’ function you can use are count(), sum(), mean(), median(), min(), max(), mode(), std(), var(). However, you can far more out of groupby() by using agg() to apply any function. Let me show you how to apply some functions from NumPy.

df_class.groupby(by=["Major", "Gender"])[['Test 1', 'Test 2', 'Total']].agg([np.mean, np.std]).round(2)

		Test 1		Test 2		Total
		mean	std	mean	std	mean	std
Major	Gender
Chemistry	Female	17.60	4.07	16.36	1.14	34.13	4.33
	Male	21.12	3.90	17.22	0.60	38.35	4.50
	Non-binary	NaN	NaN	NaN	NaN	NaN	NaN
Comp. Biology	Female	16.53	NaN	10.28	NaN	26.81	NaN
	Male	NaN	NaN	NaN	NaN	NaN	NaN
	Non-binary	NaN	NaN	NaN	NaN	NaN	NaN
Life Sciences	Female	19.29	2.07	14.80	2.61	34.09	3.84
	Male	20.36	4.37	15.64	1.49	36.00	4.95
	Non-binary	16.84	2.16	12.34	1.90	29.18	0.27
Physics	Female	20.82	NaN	14.68	NaN	35.50	NaN
	Male	16.87	4.35	16.50	1.89	34.17	5.15
	Non-binary	20.20	0.87	15.30	0.88	35.50	1.74

:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. :1: FutureWarning: The provided callable <function mean at 0x7fc8efe08f40> is currently using SeriesGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string “mean” instead. :1: FutureWarning: The provided callable <function std at 0x7fc8efe09080> is currently using SeriesGroupBy.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string “std” instead.

You can even combine groupby() with a mask. For example, let’s see how all the Female students are doing.

mask = df_class['Gender'] == 'Female'
df_class[mask].groupby(by=['Major']).mean(numeric_only=True)

	Test 1	Test 2	Total
Major
Chemistry	17.602500	16.3648	34.128500
Comp. Biology	16.530000	10.2800	26.810000
Life Sciences	19.285875	14.8010	34.086875
Physics	20.817000	14.6800	35.497000

Just for comparison,

df_class.groupby(by=["Gender", "Major"]).mean(numeric_only=True)

		Test 1	Test 2	Total
Gender	Major
Female	Chemistry	17.602500	16.364800	34.128500
	Comp. Biology	16.530000	10.280000	26.810000
	Life Sciences	19.285875	14.801000	34.086875
	Physics	20.817000	14.680000	35.497000
Male	Chemistry	21.121500	17.224000	38.345500
	Comp. Biology	NaN	NaN	NaN
	Life Sciences	20.357250	15.644000	36.001250
	Physics	16.871000	16.498286	34.166143
Non-binary	Chemistry	NaN	NaN	NaN
	Comp. Biology	NaN	NaN	NaN
	Life Sciences	16.836000	12.340000	29.176000
	Physics	20.205000	15.300000	35.505000