import seaborn as sns
Pandas (Nice)
What to expect in this chapter
In this chapter, I will quickly introduce you to a fantastic package called Seaborn that will make visualisation of data in the form of dataframes a breeze. Then, I will quickly show you a few other useful things, like how to merge two dataframes as this will be helpful if you have data spread over multiple files.
1 Seaborn
In this chapter, I will introduce Seaborn, a plotting package that extends Matplotlib for dataframes. Seaborn can achieve things in a few lines of code that might take you many with Matplotlib. However, I am just going to give you a taste of things. You can find more information on the seaborn website.
You will have to install the package and import it as:
I will be lazy and reload the dummy classroom data from the file (finalised_scores.xlsx
) we wrote.
= pd.read_excel('finalised_scores.xlsx') df
1.1 Setting the context
Seaborn can help tweak our plots based on where we want to use them. This is done with set_context()
. The options we have are paper
, talk
and poster
. You can also set a theme (i.e. plotting style).
"paper")
sns.set_context("darkgrid") sns.set_style(
1.2 Some Examples
-
We use
catplot()
is used forcat
egorical data. Show me a bar chart for (the numbers in) columnTotal
based on the categorical variables in columnMajor
. Split the plots into columns according to the categories inGender
.=df, x="Major", y="Total", col="Gender", kind="bar") sns.catplot(data
plt.show()
Try the following:
- Swap
x
andy
- Use
row
instead ofcol
- Use
hue
instead ofcol
- Swap
-
=df, x="Major", y="Total", col="Gender", kind="box") sns.catplot(data
Try the following:
- Swap
x
andy
. - Use
row
instead ofcol
. - Use
hue
instead ofcol
.
- Swap
-
Use
displot()
for plotting distributions. Show me the histogram for the scores inTotal
separated out based on theMajor
. Dodge the bars so that they do not overlap.=df, x='Total', hue='Major', multiple="dodge") sns.displot(data
Try the following:
- Use
row
orcol
instead ofhue
. - Use
stack
instead ofdodge
.
- Use
2 A few more things
2.1 Merging dataframes
In the last chapter, I used pd.concat()
to join two dataframes. pd.concat()
is nice because you can give it a list of dataframes to join in one go. However, I now would like to take you through another way of combining the two dummy class dataframes with less hassle using pd.merge()
. I have shown the full code below. However, here is the part that deals with pd.merge()
.
Understanding pd.merge()
= pd.merge(
df_combined =df_class_1, # dataframe 1
left=df_class_2, # dataframe 2
right='outer', # Join without losing any data
how='Student No', # Use to align dataframe 1
left_on='Student No' # Use to align dataframe 2
right_on )
In the last chapter, we set the dataframe index to the MATRIC_NO
in both dataframes so that we could use it to align the various rows during pd.concat()
. However, if you use pd.merge()
, we can specify two dataframes (using left=
and right=
) and two columns in these dataframes that can be used for alignment. This is what left_on=
and right_on=
do.
pd.merge()
also has a parameter how=
that decides how these two data sets are combined. You must be careful with this one. Here are the
Option | Effect |
---|---|
left |
Keep all rows of the left dataframe. If the right has missing on values, fill the right part with NaN |
right |
Keep all rows of the right dataframe. If the left has missing on values, fill the left part with NaN |
inner |
(default) Keep only those rows with common left and right data. |
outer |
Keep all rows of both dataframes. Fill any missing values with NaN |
Convince yourself of how how=
works by trying the various options. Here are the results you should get for df_combined.shape
.
Option | Effect |
---|---|
left |
(35,8) |
right |
(31,8) |
inner |
(31,8) |
outer |
(35,8) |
The full recipe
from matplotlib import pyplot as plt
= pd.read_excel('dummy-class-1-of-2.xlsx', skiprows=1)
df_class_1 = pd.read_excel('dummy-class-2-of-2.xlsx')
df_class_2
# Combine the two datasets
= pd.merge(
df_combined =df_class_1, # dataframe 1
left=df_class_2, # dataframe 2
right='outer', # Join without losing any data
how='Student No', # Use to align dataframe 1
left_on='Student No' # Use to align dataframe 2
right_on
)
# Rename columns
df_combined.rename(={'Student No': 'MATRIC_NO',
columns'Test 1 (30%)': 'Test 1',
'Test 2 (20%)': 'Test 2',
'Test 3 (50%)': 'Test 3'},
=True
inplace
)
# Reorgnise/drop columns
= df_combined[['MATRIC_NO',
df_combined 'Name', 'Major', 'Gender',
'Test 1', 'Test 2', 'Test 3']
]
# Replace text with long forms
df_combined.replace(={
to_replace'PHY': 'Physics',
'CHM': 'Chemistry',
'LS': 'Life Sciences',
'CBIO': 'Comp. Biology',
'F': 'Female',
'M': 'Male',
'NB': 'Non-binary'
=True
}, inplace
)
# Remove the ' from Test 2
def clean(text):
'''
Function to remove ' ' from column 'Test 2'.
To be applied using apply()
'''
try:
return text.replace("'", "")
except AttributeError:
# This will handle the NaN of the missing data
return text
'Test 2'] = df_combined['Test 2'].apply(clean)
df_combined[
# Update column types
= df_combined.astype({
df_combined 'Gender': 'category',
'Major': 'category',
'Test 2': 'float'
}
)
# df_combined.fillna(0, inplace=True) # Fix missing scores
"Total"] = df_combined[['Test 1', 'Test 2', 'Test 3']].sum(axis=1)
df_combined[= df_combined.round(2)
df_combined 'finalised_scores.xlsx', index=False)
df_combined.to_excel(
df_combined.head()
='Major',
df_combined.boxplot(by=['Test 1', 'Test 2', 'Test 3', 'Total'],
column=False, figsize=(8, 6))
vert plt.show()
2.2 isin()
= pd.read_excel('dummy-class-1-of-2.xlsx', skiprows=1)
df_class
#------------------ Drop and reorganise columns -----------------#
= ['Student No', 'Name', 'Major', 'Gender',
columns_to_keep 'Test 1 (30%)', 'Test 2 (20%)']
= df_class[columns_to_keep]
df_class
#------------------------- Rename columns ------------------------#
= {'Student No': 'MATRIC_NO',
new_column_info 'Test 1 (30%)': 'Test 1',
'Test 2 (20%)': 'Test 2'}
=new_column_info, inplace=True)
df_class.rename(columns
#--------------------- Set index to MATRIC_NO --------------------#
'MATRIC_NO', drop=False, inplace=True)
df_class.set_index(
#-------------------------- Rename stuff -------------------------#
= {
replace_info 'PHY': 'Physics',
'CHM': 'Chemistry',
'LS': 'Life Sciences',
'CBIO': 'Comp. Biology',
'F': 'Female',
'M': 'Male',
'NB': 'Non-binary'
}
=replace_info, inplace=True)
df_class.replace(to_replace
#---------------- Remove the ' ' from column Test 2 --------------#
def clean(text):
'''
Function to remove ' ' from column 'Test 2'.
To be applied using apply()
'''
try:
return text.replace("'", "")
except AttributeError:
# This will handle the NaN of the missing data
return text
'Test 2'] = df_class['Test 2'].apply(clean)
df_class[
#--------------- Convert column Test 2 to type float -------------#
= {'Major': 'category',
new_type_info 'Gender': 'category',
'Test 2': 'float'}
= df_class.astype(new_type_info)
df_class
#------------------------ Add a new column -----------------------#
"Total"] = df_class["Test 1"] + df_class["Test 2"]
df_class[
#------------------------- Export the file -----------------------#
'finalised_scores.xlsx', index=False)
df_class.to_excel(
df_class.head()
Let’s say we want to find out if ‘Ronin Christian’ and ‘Maryjane Sandoval’ are in the class and how they are doing. Let me show you another way to generate a mask using the method isin()
, which queries the whole dataframe.
= df_class.isin(['Maryjane Sandoval', 'Ronin Christian'])
mask df_class[mask]
MATRIC_NO | Name | Major | Gender | Test 1 | Test 2 | Total | |
---|---|---|---|---|---|---|---|
MATRIC_NO | |||||||
A3028967J | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A1282849W | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A5408925A | NaN | Ronin Christian | NaN | NaN | NaN | NaN | NaN |
A6973859L | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A5410124H | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A9568373Q | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A6824244G | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A9194090U | NaN | Maryjane Sandoval | NaN | NaN | NaN | NaN | NaN |
A4828364M | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A4607700C | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A7067766E | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A5569996J | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A3202548I | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A6131593U | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A7653832E | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A9462811I | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A1218599T | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A7210476B | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A1512479K | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A7986368Y | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A2727061A | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A2999472W | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A7116486E | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A6931452S | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A9649096H | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A1643380L | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A6787293E | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A5975988J | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A3699958T | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A1956366U | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A1468689D | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A3217320C | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A6867791C | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A4080490P | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
A7667457P | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Since this output is overwhelming, we should use any()
as any()
will pick out the non-NaN
locations. The axis
option will allow us to specify if to apply it to rows or columns(I can never remember which is which, so I just try 0 or 1).
any(axis=1) df_class[mask].
MATRIC_NO
A3028967J False
A1282849W False
A5408925A True
A6973859L False
A5410124H False
A9568373Q False
A6824244G False
A9194090U True
A4828364M False
A4607700C False
A7067766E False
A5569996J False
A3202548I False
A6131593U False
A7653832E False
A9462811I False
A1218599T False
A7210476B False
A1512479K False
A7986368Y False
A2727061A False
A2999472W False
A7116486E False
A6931452S False
A9649096H False
A1643380L False
A6787293E False
A5975988J False
A3699958T False
A1956366U False
A1468689D False
A3217320C False
A6867791C False
A4080490P False
A7667457P False
dtype: bool
We can reuse the above as a mask.
any(axis=1)] df_class[df_class[mask].
MATRIC_NO | Name | Major | Gender | Test 1 | Test 2 | Total | |
---|---|---|---|---|---|---|---|
MATRIC_NO | |||||||
A5408925A | A5408925A | Ronin Christian | Physics | Male | 18.366 | 15.56 | 33.926 |
A9194090U | A9194090U | Maryjane Sandoval | Life Sciences | Female | 18.981 | 16.40 | 35.381 |
A shorter way is to use any()
to probe the mask directly.
any(axis=1)] df_class[mask.