4.4 The Convenience of Dataframes

4.4.1 (Mini Exercise) Method 1: Using a Spreadsheet

Imagine that you have administered two tests to your class of ten (10) students. You now need to combine the results to calculate the total score. Given below are the details of the class and tests.

Description Max Score Weight No. absent Filename Download link
Test 01 25 35% 1 test01.xlsx Link
Test 02 35 65% 3 test02.xlsx Link
Students details - - - class.xlsx Link

If you like you can download a single zip file containing all three files.

What you have to do

  1. Download the three Excel files and place them in a folder on the Desktop.
    • Please name the folder ‘workshop’ so that everyone has the same name.
  2. Use your favourite tool (it can be any tool) to combine the data for Test 01 and Test 02 to get the final score for each student.

Please bear in mind:

  • There are absent students.
  • Max Score indicates the highest score possible for that test.
  • The tests have different weights.

\[\text{final score} = \text{score for test 1}\times\left(\dfrac{35}{25} \right) + \text{score for test 2}\times\left(\dfrac{65}{35}\right) \]

4.4.2 Method 2: Using a Dataframes

Lets solve the above problem using a dataframe! Because dataframes uses an index, we can easily add different dataframes without having to worry about alignment or missing rows!

import pandas as pd

df_class_data = pd.read_excel('class.xlsx')
df_test_1_data = pd.read_excel('test01.xlsx')
df_test_2_data = pd.read_excel('test02.xlsx')

for df in [df_class_data, df_test_1_data, df_test_2_data]:
    df.set_index(keys='Student No',inplace=True)

# Consolidate scores. Adjust for the maximum score and weights
# Absent students get zero
df_class_data['Test 01'] = df_test_1_data['Score']/25*35       # New column
df_class_data['Test 02'] = df_test_2_data['Score']/35*65       # New column
df_class_data.fillna(0, inplace=True)

df_class_data['Total'] = df_class_data['Test 01'] + df_class_data['Test 02']
df_class_data = df_class_data.round(decimals=2)
df_class_data.to_excel('class_w_scores.xlsx', index=False)

df_class_data.describe()
#          Test 01    Test 02     Total
# count  10.000000  10.000000  10.00000
# mean   20.311000  24.963000  45.27500
# std     7.272026  17.449038  22.23347
# min     0.000000   0.000000   0.00000
# 25%    21.295000   7.297500  30.04250
# 50%    22.170000  35.575000  56.50500
# 75%    23.587500  35.905000  59.87750
# max    24.820000  40.780000  63.83000