4.5 A step farther with Pandas: An example

Let’s explore some advanced features of Pandas. For this, download the data ‘Graduate Employment Survey - NTU, NUS, SIT, SMU & SUTD’ from the Data.gov.sg website. We look at the median salaries of graduates from the different universities in Singapore.

Remember to read the descriptions of the data set so that you can know setup the data to your needs.

First explore a bit

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

df = pd.read_csv('graduate-employment-survey-ntu-nus-sit-smu-sutd.csv', encoding='latin-1')

df.columns
# Index(['year', 'university', 'school', 'degree', 'employment_rate_overall',
#        'employment_rate_ft_perm', 'basic_monthly_mean', 'basic_monthly_median',
#        'gross_monthly_mean', 'gross_monthly_median',
#        'gross_mthly_25_percentile', 'gross_mthly_75_percentile'],
#       dtype='object')
df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 570 entries, 0 to 569
# Data columns (total 12 columns):
#  #   Column                     Non-Null Count  Dtype 
# ---  ------                     --------------  ----- 
#  0   year                       570 non-null    int64 
#  1   university                 570 non-null    object
#  2   school                     570 non-null    object
#  3   degree                     570 non-null    object
#  4   employment_rate_overall    570 non-null    object
#  5   employment_rate_ft_perm    570 non-null    object
#  6   basic_monthly_mean         570 non-null    object
#  7   basic_monthly_median       570 non-null    object
#  8   gross_monthly_mean         570 non-null    object
#  9   gross_monthly_median       570 non-null    object
#  10  gross_mthly_25_percentile  570 non-null    object
#  11  gross_mthly_75_percentile  570 non-null    object
# dtypes: int64(1), object(11)
# memory usage: 53.6+ KB

Lets analyse

# Clean up the dataset
# 1. Shorter Names
# 2. 'na' repalced with np.nan
# 3. 'basic_monthly_median' changed to type float

new_names = {'Nanyang Technological University': 'NTU',
             'National University of Singapore': 'NUS',
             'Singapore Management University': 'SMU',
             'Singapore Institute of Technology': 'SIT',
             'Singapore University of Technology and Design': 'SUTD'}

df.replace(new_names, inplace=True)
df.replace('na', np.nan, inplace=True)
df = df.astype({'basic_monthly_median': 'float64'})

df_group = df.groupby(['year', 'university'], as_index=True).agg(
    {'basic_monthly_median': ['count', 'min', 'max', 'mean', 'std']}
)

df_group
#                 basic_monthly_median                                         
#                                count     min     max         mean         std
# year university                                                              
# 2013 NTU                          32  2500.0  3500.0  3009.562500  214.472336
#      NUS                          33  2650.0  4800.0  3119.030303  473.869476
#      SMU                          12  2800.0  5000.0  3493.750000  710.963255
# 2014 NTU                          33  2600.0  3700.0  3074.242424  212.820058
#      NUS                          34  2550.0  5000.0  3265.647059  505.622322
#      SIT                          24  2400.0  3355.0  2899.583333  270.446156
#      SMU                          12  2850.0  4900.0  3624.166667  653.615238
# 2015 NTU                          33  2800.0  4184.0  3153.939394  275.998748
#      NUS                          36  2850.0  4600.0  3389.055556  427.336531
#      SIT                          25  2300.0  3400.0  2958.800000  264.434459
#      SMU                          12  2900.0  4766.0  3630.500000  575.909794
#      SUTD                          3  3450.0  3700.0  3589.333333  127.441490
# 2016 NTU                          34  2400.0  4000.0  3231.323529  282.366377
#      NUS                          34  2800.0  4775.0  3391.911765  445.547016
#      SIT                          26  2250.0  4200.0  3122.692308  406.010420
#      SMU                          12  3000.0  4950.0  3713.916667  597.020246
#      SUTD                          3  3500.0  4000.0  3700.000000  264.575131
# 2017 NTU                          35  2800.0  4750.0  3313.057143  338.092308
#      NUS                          37  2025.0  5000.0  3445.486486  524.884518
#      SIT                          28  2300.0  3800.0  3102.678571  342.873160
#      SMU                          12  3000.0  5450.0  3849.500000  675.129819
#      SUTD                          3  3600.0  4000.0  3800.000000  200.000000
# Use a pivot table
df_pivot = pd.pivot_table(df,
                          index=['year'],
                          columns='university',
                          values='basic_monthly_median',
                          aggfunc='mean')

plt.style.use('fivethirtyeight')
ax = df_pivot.plot(kind='bar', figsize=(10, 4), rot=0)
_=ax.set_ylabel('Median Salary')
_=ax.legend(loc=(1.025,.35))
plt.tight_layout()
plt.show()