4.5 A step farther with Pandas: An example
Let’s explore some advanced features of Pandas. For this, download the data ‘Graduate Employment Survey - NTU, NUS, SIT, SMU & SUTD’ from the Data.gov.sg website. We look at the median salaries of graduates from the different universities in Singapore.
Remember to read the descriptions of the data set so that you can know setup the data to your needs.
First explore a bit
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
= pd.read_csv('graduate-employment-survey-ntu-nus-sit-smu-sutd.csv', encoding='latin-1')
df
df.columns# Index(['year', 'university', 'school', 'degree', 'employment_rate_overall',
# 'employment_rate_ft_perm', 'basic_monthly_mean', 'basic_monthly_median',
# 'gross_monthly_mean', 'gross_monthly_median',
# 'gross_mthly_25_percentile', 'gross_mthly_75_percentile'],
# dtype='object')
df.info()# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 570 entries, 0 to 569
# Data columns (total 12 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 year 570 non-null int64
# 1 university 570 non-null object
# 2 school 570 non-null object
# 3 degree 570 non-null object
# 4 employment_rate_overall 570 non-null object
# 5 employment_rate_ft_perm 570 non-null object
# 6 basic_monthly_mean 570 non-null object
# 7 basic_monthly_median 570 non-null object
# 8 gross_monthly_mean 570 non-null object
# 9 gross_monthly_median 570 non-null object
# 10 gross_mthly_25_percentile 570 non-null object
# 11 gross_mthly_75_percentile 570 non-null object
# dtypes: int64(1), object(11)
# memory usage: 53.6+ KB
Lets analyse
# Clean up the dataset
# 1. Shorter Names
# 2. 'na' repalced with np.nan
# 3. 'basic_monthly_median' changed to type float
= {'Nanyang Technological University': 'NTU',
new_names 'National University of Singapore': 'NUS',
'Singapore Management University': 'SMU',
'Singapore Institute of Technology': 'SIT',
'Singapore University of Technology and Design': 'SUTD'}
=True)
df.replace(new_names, inplace'na', np.nan, inplace=True)
df.replace(= df.astype({'basic_monthly_median': 'float64'})
df
= df.groupby(['year', 'university'], as_index=True).agg(
df_group 'basic_monthly_median': ['count', 'min', 'max', 'mean', 'std']}
{
)
df_group# basic_monthly_median
# count min max mean std
# year university
# 2013 NTU 32 2500.0 3500.0 3009.562500 214.472336
# NUS 33 2650.0 4800.0 3119.030303 473.869476
# SMU 12 2800.0 5000.0 3493.750000 710.963255
# 2014 NTU 33 2600.0 3700.0 3074.242424 212.820058
# NUS 34 2550.0 5000.0 3265.647059 505.622322
# SIT 24 2400.0 3355.0 2899.583333 270.446156
# SMU 12 2850.0 4900.0 3624.166667 653.615238
# 2015 NTU 33 2800.0 4184.0 3153.939394 275.998748
# NUS 36 2850.0 4600.0 3389.055556 427.336531
# SIT 25 2300.0 3400.0 2958.800000 264.434459
# SMU 12 2900.0 4766.0 3630.500000 575.909794
# SUTD 3 3450.0 3700.0 3589.333333 127.441490
# 2016 NTU 34 2400.0 4000.0 3231.323529 282.366377
# NUS 34 2800.0 4775.0 3391.911765 445.547016
# SIT 26 2250.0 4200.0 3122.692308 406.010420
# SMU 12 3000.0 4950.0 3713.916667 597.020246
# SUTD 3 3500.0 4000.0 3700.000000 264.575131
# 2017 NTU 35 2800.0 4750.0 3313.057143 338.092308
# NUS 37 2025.0 5000.0 3445.486486 524.884518
# SIT 28 2300.0 3800.0 3102.678571 342.873160
# SMU 12 3000.0 5450.0 3849.500000 675.129819
# SUTD 3 3600.0 4000.0 3800.000000 200.000000
# Use a pivot table
= pd.pivot_table(df,
df_pivot =['year'],
index='university',
columns='basic_monthly_median',
values='mean')
aggfunc
'fivethirtyeight')
plt.style.use(= df_pivot.plot(kind='bar', figsize=(10, 4), rot=0)
ax =ax.set_ylabel('Median Salary')
_=ax.legend(loc=(1.025,.35))
_
plt.tight_layout() plt.show()