Pandas (Nice)

What to expect in this chapter

In this chapter, I will quickly introduce you to a fantastic package called Seaborn that will make visualisation of data in the form of dataframes a breeze. Then, I will quickly show you a few other useful things, like how to merge two dataframes as this will be helpful if you have data spread over multiple files.

1 Seaborn

In this chapter, I will introduce Seaborn, a plotting package that extends Matplotlib for dataframes. Seaborn can achieve things in a few lines of code that might take you many with Matplotlib. However, I am just going to give you a taste of things. You can find more information on the seaborn website.

You will have to install the package and import it as:

import seaborn as sns

I will be lazy and reload the dummy classroom data from the file (finalised_scores.xlsx) we wrote.

df = pd.read_excel('finalised_scores.xlsx')

1.1 Setting the context

Seaborn can help tweak our plots based on where we want to use them. This is done with set_context(). The options we have are paper, talk and poster. You can also set a theme (i.e. plotting style).

sns.set_context("paper")
sns.set_style("darkgrid")

1.2 Some Examples

  1. We use catplot() is used for categorical data. Show me a bar chart for (the numbers in) column Total based on the categorical variables in column Major. Split the plots into columns according to the categories in Gender.

    sns.catplot(data=df, x="Major", y="Total", col="Gender", kind="bar")

    plt.show()

    Try the following:

    • Swap x and y
    • Use row instead of col
    • Use hue instead of col
  2. sns.catplot(data=df, x="Major", y="Total", col="Gender", kind="box")

    Try the following:

    • Swap x and y.
    • Use row instead of col.
    • Use hue instead of col.
  3. Use displot() for plotting distributions. Show me the histogram for the scores in Total separated out based on the Major. Dodge the bars so that they do not overlap.

    sns.displot(data=df, x='Total', hue='Major', multiple="dodge")

    Try the following:

    • Use row or col instead of hue.
    • Use stack instead of dodge.

2 A few more things

2.1 Merging dataframes

In the last chapter, I used pd.concat() to join two dataframes. pd.concat() is nice because you can give it a list of dataframes to join in one go. However, I now would like to take you through another way of combining the two dummy class dataframes with less hassle using pd.merge(). I have shown the full code below. However, here is the part that deals with pd.merge().

Understanding pd.merge()

df_combined = pd.merge(
    left=df_class_1,           # dataframe 1
    right=df_class_2,          # dataframe 2
    how='outer',               # Join without losing any data
    left_on='Student No',      # Use to align dataframe 1
    right_on='Student No'      # Use to align dataframe 2
)

In the last chapter, we set the dataframe index to the MATRIC_NO in both dataframes so that we could use it to align the various rows during pd.concat(). However, if you use pd.merge(), we can specify two dataframes (using left= and right=) and two columns in these dataframes that can be used for alignment. This is what left_on= and right_on= do.

pd.merge() also has a parameter how= that decides how these two data sets are combined. You must be careful with this one. Here are the

Option Effect
left Keep all rows of the left dataframe. If the right has missing on values, fill the right part with NaN
right Keep all rows of the right dataframe. If the left has missing on values, fill the left part with NaN
inner (default) Keep only those rows with common left and right data.
outer Keep all rows of both dataframes. Fill any missing values with NaN

Convince yourself of how how= works by trying the various options. Here are the results you should get for df_combined.shape.

Option Effect
left (35,8)
right (31,8)
inner (31,8)
outer (35,8)

The full recipe

from matplotlib import pyplot as plt

df_class_1 = pd.read_excel('dummy-class-1-of-2.xlsx', skiprows=1)
df_class_2 = pd.read_excel('dummy-class-2-of-2.xlsx')

# Combine the two datasets
df_combined = pd.merge(
    left=df_class_1,       # dataframe 1
    right=df_class_2,      # dataframe 2
    how='outer',           # Join without losing any data
    left_on='Student No',  # Use to align dataframe 1
    right_on='Student No'  # Use to align dataframe 2
)

# Rename columns
df_combined.rename(
    columns={'Student No': 'MATRIC_NO',
             'Test 1 (30%)': 'Test 1',
             'Test 2 (20%)': 'Test 2',
             'Test 3 (50%)': 'Test 3'},
    inplace=True
)

# Reorgnise/drop columns
df_combined = df_combined[['MATRIC_NO',
                           'Name', 'Major', 'Gender',
                           'Test 1', 'Test 2', 'Test 3']
                          ]

# Replace text with long forms
df_combined.replace(
    to_replace={
        'PHY': 'Physics',
        'CHM': 'Chemistry',
        'LS': 'Life Sciences',
        'CBIO': 'Comp. Biology',
        'F': 'Female',
        'M': 'Male',
        'NB': 'Non-binary'
    }, inplace=True
)

# Remove the ' from Test 2


def clean(text):
    '''
    Function to remove ' ' from column 'Test 2'.
    To be applied using apply()
    '''
    try:
        return text.replace("'", "")
    except AttributeError:
        # This will handle the NaN of the missing data
        return text


df_combined['Test 2'] = df_combined['Test 2'].apply(clean)

# Update column types
df_combined = df_combined.astype({
    'Gender': 'category',
    'Major': 'category',
    'Test 2': 'float'
}
)

# df_combined.fillna(0, inplace=True)                      # Fix missing scores
df_combined["Total"] = df_combined[['Test 1', 'Test 2', 'Test 3']].sum(axis=1)
df_combined = df_combined.round(2)
df_combined.to_excel('finalised_scores.xlsx', index=False)
df_combined.head()

df_combined.boxplot(by='Major',
                    column=['Test 1', 'Test 2', 'Test 3', 'Total'],
                    vert=False, figsize=(8, 6))
plt.show()

2.2 isin()



df_class = pd.read_excel('dummy-class-1-of-2.xlsx', skiprows=1)

#------------------ Drop and reorganise columns  -----------------#
columns_to_keep = ['Student No', 'Name', 'Major', 'Gender',
                   'Test 1 (30%)', 'Test 2 (20%)']

df_class = df_class[columns_to_keep]

#------------------------- Rename columns ------------------------#
new_column_info = {'Student No': 'MATRIC_NO',
                   'Test 1 (30%)': 'Test 1',
                   'Test 2 (20%)': 'Test 2'}

df_class.rename(columns=new_column_info, inplace=True)

#--------------------- Set index to MATRIC_NO --------------------#
df_class.set_index('MATRIC_NO', drop=False, inplace=True)

#-------------------------- Rename stuff -------------------------#
replace_info = {
    'PHY': 'Physics',
    'CHM': 'Chemistry',
    'LS': 'Life Sciences',
    'CBIO': 'Comp. Biology',
    'F': 'Female',
    'M': 'Male',
    'NB': 'Non-binary'
}

df_class.replace(to_replace=replace_info, inplace=True)

#---------------- Remove the ' ' from column Test 2 --------------#


def clean(text):
    '''
    Function to remove ' ' from column 'Test 2'.
    To be applied using apply()
    '''
    try:
        return text.replace("'", "")
    except AttributeError:
        # This will handle the NaN of the missing data
        return text


df_class['Test 2'] = df_class['Test 2'].apply(clean)

#--------------- Convert column Test 2 to type float -------------#
new_type_info = {'Major': 'category',
                 'Gender': 'category',
                 'Test 2': 'float'}

df_class = df_class.astype(new_type_info)

#------------------------ Add a new column -----------------------#
df_class["Total"] = df_class["Test 1"] + df_class["Test 2"]

#------------------------- Export the file -----------------------#
df_class.to_excel('finalised_scores.xlsx', index=False)

df_class.head()

Let’s say we want to find out if ‘Ronin Christian’ and ‘Maryjane Sandoval’ are in the class and how they are doing. Let me show you another way to generate a mask using the method isin(), which queries the whole dataframe.

mask = df_class.isin(['Maryjane Sandoval', 'Ronin Christian'])
df_class[mask]
MATRIC_NO Name Major Gender Test 1 Test 2 Total
MATRIC_NO
A3028967J NaN NaN NaN NaN NaN NaN NaN
A1282849W NaN NaN NaN NaN NaN NaN NaN
A5408925A NaN Ronin Christian NaN NaN NaN NaN NaN
A6973859L NaN NaN NaN NaN NaN NaN NaN
A5410124H NaN NaN NaN NaN NaN NaN NaN
A9568373Q NaN NaN NaN NaN NaN NaN NaN
A6824244G NaN NaN NaN NaN NaN NaN NaN
A9194090U NaN Maryjane Sandoval NaN NaN NaN NaN NaN
A4828364M NaN NaN NaN NaN NaN NaN NaN
A4607700C NaN NaN NaN NaN NaN NaN NaN
A7067766E NaN NaN NaN NaN NaN NaN NaN
A5569996J NaN NaN NaN NaN NaN NaN NaN
A3202548I NaN NaN NaN NaN NaN NaN NaN
A6131593U NaN NaN NaN NaN NaN NaN NaN
A7653832E NaN NaN NaN NaN NaN NaN NaN
A9462811I NaN NaN NaN NaN NaN NaN NaN
A1218599T NaN NaN NaN NaN NaN NaN NaN
A7210476B NaN NaN NaN NaN NaN NaN NaN
A1512479K NaN NaN NaN NaN NaN NaN NaN
A7986368Y NaN NaN NaN NaN NaN NaN NaN
A2727061A NaN NaN NaN NaN NaN NaN NaN
A2999472W NaN NaN NaN NaN NaN NaN NaN
A7116486E NaN NaN NaN NaN NaN NaN NaN
A6931452S NaN NaN NaN NaN NaN NaN NaN
A9649096H NaN NaN NaN NaN NaN NaN NaN
A1643380L NaN NaN NaN NaN NaN NaN NaN
A6787293E NaN NaN NaN NaN NaN NaN NaN
A5975988J NaN NaN NaN NaN NaN NaN NaN
A3699958T NaN NaN NaN NaN NaN NaN NaN
A1956366U NaN NaN NaN NaN NaN NaN NaN
A1468689D NaN NaN NaN NaN NaN NaN NaN
A3217320C NaN NaN NaN NaN NaN NaN NaN
A6867791C NaN NaN NaN NaN NaN NaN NaN
A4080490P NaN NaN NaN NaN NaN NaN NaN
A7667457P NaN NaN NaN NaN NaN NaN NaN

Since this output is overwhelming, we should use any() as any() will pick out the non-NaN locations. The axis option will allow us to specify if to apply it to rows or columns(I can never remember which is which, so I just try 0 or 1).

df_class[mask].any(axis=1)
MATRIC_NO
A3028967J    False
A1282849W    False
A5408925A     True
A6973859L    False
A5410124H    False
A9568373Q    False
A6824244G    False
A9194090U     True
A4828364M    False
A4607700C    False
A7067766E    False
A5569996J    False
A3202548I    False
A6131593U    False
A7653832E    False
A9462811I    False
A1218599T    False
A7210476B    False
A1512479K    False
A7986368Y    False
A2727061A    False
A2999472W    False
A7116486E    False
A6931452S    False
A9649096H    False
A1643380L    False
A6787293E    False
A5975988J    False
A3699958T    False
A1956366U    False
A1468689D    False
A3217320C    False
A6867791C    False
A4080490P    False
A7667457P    False
dtype: bool

We can reuse the above as a mask.

df_class[df_class[mask].any(axis=1)]
MATRIC_NO Name Major Gender Test 1 Test 2 Total
MATRIC_NO
A5408925A A5408925A Ronin Christian Physics Male 18.366 15.56 33.926
A9194090U A9194090U Maryjane Sandoval Life Sciences Female 18.981 16.40 35.381

A shorter way is to use any() to probe the mask directly.

df_class[mask.any(axis=1)]
Back to top