# | include: false
import numpy as np
import pandas as pd
np.random.seed(256852)
Storing Data (Need)
![](https://imgs.xkcd.com/comics/2018_cve_list.png)
You need to load the NumPy package to use NumPy arrays. Please import it using an alias as follows:
import numpy as np
What to expect in this chapter
We are learning Python as a tool to help us understand science and solve problems related to science. To do this, we must interact with data and transform it to yield a solution. For this, it is essential to have ways to store and manipulate data easily and efficiently, beyond the simple variables we have encountered so far. Python offers a variety of ways to store and manipulate data. You have already met the list and dictionary in a previous chapter. However, there are several more; here is a (non-comprehensive) list.
- Lists
- NumPy arrays
- Dictionaries
- Tuples
- Dataframes
- Classes
In these chapters on basics, I will only discuss Python lists, NumPy arrays, dictionaries and tuples. If you want to learn about dataframes, please look at the Data Processing basket in the Applications part. Classes are an advanced topic that I will touch on in the Nice chapter.
I cannot overemphasize how important it is for you to understand how to store, retrieve and modify data in programming. This is because these abstract structures will influence how you think about data¹. This will ultimately aid (or hinder) your ability to conjure up algorithms to solve problems.
1 Lists, Arrays & Dictionaries
1.1 Let’s compare
Let me show you how to store the same information (in this case, some superhero data) using lists, arrays and dictionaries.
Python Lists
py_super_names = ["Black Widow", "Iron Man", "Doctor Strange"]
py_real_names = ["Natasha Romanoff", "Tony Stark", "Stephen Strange"]
NumPy Arrays
np_super_names = np.array(["Black Widow", "Iron Man", "Doctor Strange"])
np_real_names = np.array(["Natasha Romanoff", "Tony Stark", "Stephen Strange"])
Dictionary
superhero_info = {
    "Natasha Romanoff": "Black Widow",
    "Tony Stark": "Iron Man",
    "Stephen Strange": "Doctor Strange"
}
Notice:
- Dictionaries use a key and an associated value separated by a colon (:).
- The dictionary very elegantly holds the real and superhero names in one structure, while we need two lists (or arrays) for the same data.
- For lists and arrays, the order matters, i.e., ‘Iron Man’ must be in the same position as ‘Tony Stark’ for things to work.
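To make the last two points concrete, here is a minimal sketch of looking up one superhero name with both strategies. It reuses the variables from above; using the list method index() to find the position is my choice for this illustration, not the only way to do it.

```python
py_super_names = ["Black Widow", "Iron Man", "Doctor Strange"]
py_real_names = ["Natasha Romanoff", "Tony Stark", "Stephen Strange"]

superhero_info = {
    "Natasha Romanoff": "Black Widow",
    "Tony Stark": "Iron Man",
    "Stephen Strange": "Doctor Strange"
}

# Two lists: find the position in one list, then use it in the other.
# This only works because both lists are in the same order.
position = py_real_names.index("Tony Stark")
from_lists = py_super_names[position]

# Dictionary: the key takes us straight to the value.
from_dict = superhero_info["Tony Stark"]

print(from_lists, from_dict)
```

Both routes give the same name, but the two-list route silently breaks if one of the lists is reordered.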
Lists (and arrays) offer many features that dictionaries don’t and vice versa. I will demonstrate these in a bit. Which data storage strategy to choose will depend on the problem you are trying to solve. More on this later; for the moment…
There are three basic ways of storing data:
- lists,
- NumPy arrays and
- dictionaries.
By the way,
- I added py and np in front of the variable names for clarity. You can choose any name for the variables (provided that they are not a Python keyword like for, if).
- I am being lazy; when I say ‘arrays’, I mean ‘NumPy arrays’, and when I say ‘lists’, I mean ‘Python lists’.
1.2 Accessing data from a list (or array)
To access data from lists (and arrays), we need to use an index corresponding to the data’s position. Python is a zero-indexed language, meaning it starts counting at 0. So if you want to access a particular element in the list (or array), you need to specify the relevant index starting from zero. The image below shows the relationship between the position and index.
py_super_names = ["Black Widow", "Iron Man", "Doctor Strange"]
py_real_names = ["Natasha Romanoff", "Tony Stark", "Stephen Strange"]
- py_real_names[0]
  'Natasha Romanoff'
- py_super_names[0]
  'Black Widow'
- Using a negative index allows us to count from the back of the list. For instance, using the index -1 will give the last element. This is super useful because we can easily access the last element without knowing the list size.
  py_super_names[2]    # Forward indexing
                       # We need to know the size
                       # beforehand for this to work.
  'Doctor Strange'
  py_super_names[-1]   # Reverse indexing
  'Doctor Strange'
Data in lists (and arrays) must be accessed using a zero-based index.
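As a small sketch of how forward and reverse indexing relate, the snippet below works out the last forward index from len() and compares it with the index -1. The variable name last_index is introduced here purely for illustration.

```python
py_super_names = ["Black Widow", "Iron Man", "Doctor Strange"]

# Forward indexing: the last index is always (size - 1)
# because counting starts at 0.
last_index = len(py_super_names) - 1
print(py_super_names[last_index])

# Reverse indexing: -1 is the last element, -2 the one before it.
print(py_super_names[-1])
print(py_super_names[-2])
```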
1.3 Accessing data from a dictionary
Dictionaries hold data (values) paired with a key, i.e., you can access the value (in this case, the superhero name) using the real name as a key. Here is how it works:
superhero_info = {
    "Natasha Romanoff": "Black Widow",
    "Tony Stark": "Iron Man",
    "Stephen Strange": "Doctor Strange"
}
superhero_info["Natasha Romanoff"]
'Black Widow'
Remember that dictionaries have a key-value structure.
If you want, you can access all the keys and all the values as follows:
superhero_info.keys()
dict_keys(['Natasha Romanoff', 'Tony Stark', 'Stephen Strange'])
superhero_info.values()
dict_values(['Black Widow', 'Iron Man', 'Doctor Strange'])
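If you want the keys and values together, one possible sketch uses the dictionary method items() in a loop. The example also checks membership with in, which (as an aside) looks at the keys and not the values. The variable names has_tony and has_iron_man are made up for this illustration.

```python
superhero_info = {
    "Natasha Romanoff": "Black Widow",
    "Tony Stark": "Iron Man",
    "Stephen Strange": "Doctor Strange"
}

# keys() and values() can be turned into ordinary lists if needed.
real_names = list(superhero_info.keys())
super_names = list(superhero_info.values())

# items() hands us each key together with its value.
for real_name, super_name in superhero_info.items():
    print(real_name, "is", super_name)

# 'in' checks the keys, not the values.
has_tony = "Tony Stark" in superhero_info
has_iron_man = "Iron Man" in superhero_info
print(has_tony, has_iron_man)
```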
1.4 Higher dimensional lists
Unlike with a dictionary, we needed two lists to store the corresponding real and superhero names. An obvious way around the need to have two lists is to have a 2D list (or array) as follows.
py_superhero_info = [['Natasha Romanoff', 'Black Widow'],
                     ['Tony Stark', 'Iron Man'],
                     ['Stephen Strange', 'Doctor Strange']]
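Here is a rough sketch of how you might pull data back out of such a 2D structure. With a list, you index twice (row first, then position within the row). The array version, np_superhero_info, is my addition for comparison; it accepts both indices in a single pair of square brackets.

```python
import numpy as np

py_superhero_info = [['Natasha Romanoff', 'Black Widow'],
                     ['Tony Stark', 'Iron Man'],
                     ['Stephen Strange', 'Doctor Strange']]

# 2D list: the first index picks the row, the second picks the column.
row = py_superhero_info[1]
real_name = py_superhero_info[1][0]
super_name = py_superhero_info[1][1]
print(row, real_name, super_name)

# 2D array (created here just for comparison).
np_superhero_info = np.array(py_superhero_info)
np_super_name = np_superhero_info[1, 1]     # Row 1, column 1
all_super_names = np_superhero_info[:, 1]   # Every row, column 1
print(np_super_name, all_super_names)
```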
2 Lists vs. Arrays
Lists and arrays have some similarities but more differences. It is important to know these differences so that you can make full use of them. So, let me now show you a few quick examples of using lists and arrays. These will allow you to appreciate the versatility that each offers.
2.1 Size
Often, you need to know how many elements there are in lists or arrays. We can use the len() function for this purpose for both lists and arrays. However, arrays also offer other options.
py_list_2d = [[1, "A"], [2, "B"], [3, "C"], [4, "D"],
              [5, "E"], [6, "F"], [7, "G"], [8, "H"],
              [9, "I"], [10, "J"]]

np_array_2d = np.array(py_list_2d)      # Reusing the Python list
                                        # to create a NEW
                                        # NumPy array
len(py_list_2d)
len(np_array_2d)
np_array_2d.shape
Lists
10
Arrays
10
(10, 2)
Notice the absence of brackets ( ) in shape above. This is because shape is not a function. Instead, it is a property or attribute of the NumPy array.
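As a brief sketch of what shape gives you beyond len(), the example below uses a shortened version of the array above. It also shows ndim and size, two other NumPy array attributes that are not discussed in this chapter; I include them only for comparison.

```python
import numpy as np

np_array_2d = np.array([[1, "A"], [2, "B"], [3, "C"]])

# len() only reports the number of rows.
print(len(np_array_2d))

# shape reports the size along every dimension (rows, columns).
n_rows, n_cols = np_array_2d.shape
print(n_rows, n_cols)

# Two more attributes (again, no brackets):
print(np_array_2d.ndim)    # Number of dimensions
print(np_array_2d.size)    # Total number of elements
```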
2.2 Arrays are fussy about type
Please recall the previous discussion about data types (e.g., int, float, str). One prominent difference between lists and arrays is that arrays insist on having only a single data type; lists are more accommodating. Consider the following example and notice how the numbers are converted to English, i.e., strings (note the quotes ' '), when we create the NumPy array.
py_list = [1, 1.5, 'A']
np_array = np.array(py_list)
py_list
np_array
Lists
[1, 1.5, 'A']
Arrays
array(['1', '1.5', 'A'], dtype='<U32')
When dealing with datasets with both numbers and text, you must be mindful of this restriction. However, this is just an annoyance and not a problem, as we can easily change the type (typecast) using the ‘hidden’ function astype(). More about this in a later chapter. For the moment,
Remember that NumPy arrays tolerate only a single type.
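As a quick preview, here is a minimal sketch of such a typecast. It assumes we only want the first two elements (the ones that look like numbers); the names np_text and np_numbers are made up for this example.

```python
import numpy as np

np_text = np.array([1, 1.5, 'A'])    # Everything becomes text
print(np_text.dtype)

# Take the two elements that look like numbers and
# convert them from text back to floats.
np_numbers = np_text[:2].astype(float)
print(np_numbers, np_numbers.dtype)
```

Trying to convert the whole array would fail, because 'A' cannot be interpreted as a number.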
2.3 Adding a number
py_list = [1, 2, 3, 4, 5]
np_array = np.array(py_list)         # Reusing the Python list
                                     # to create a NEW
                                     # NumPy array
np_array + 10
Lists
py_list + 10    # Won't work!
Arrays
array([11, 12, 13, 14, 15])
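To see what ‘won't work’ means in practice, here is a small sketch that catches the error Python raises. It also shows one possible way (a list comprehension, which is my suggestion and not something covered above) to get the same result from a list.

```python
import numpy as np

py_list = [1, 2, 3, 4, 5]
np_array = np.array(py_list)

# Adding a number to a list raises a TypeError.
try:
    py_list + 10
except TypeError as error:
    print("Lists:", error)

# With a list, we have to visit each element ourselves.
py_list_plus_10 = [number + 10 for number in py_list]

# With an array, the addition is applied to every element for us.
np_array_plus_10 = np_array + 10

print(py_list_plus_10, np_array_plus_10)
```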
2.4 Adding another list
py_list_1 = [1, 2, 3, 4, 5]
py_list_2 = [10, 20, 30, 40, 50]

np_array_1 = np.array(py_list_1)
np_array_2 = np.array(py_list_2)
py_list_1 + py_list_2
np_array_1 + np_array_2
Lists
[1, 2, 3, 4, 5, 10, 20, 30, 40, 50]
Arrays
array([11, 22, 33, 44, 55])
So, adding lists causes them to grow while adding arrays is an element-wise operation.
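Each structure can still do what the other does by default; it just takes a little more work. The sketch below is one possible way: zip() with a list comprehension for element-wise addition of lists, and np.concatenate() to make arrays grow. Neither is covered above, so treat this as an aside.

```python
import numpy as np

py_list_1 = [1, 2, 3, 4, 5]
py_list_2 = [10, 20, 30, 40, 50]

np_array_1 = np.array(py_list_1)
np_array_2 = np.array(py_list_2)

# Element-wise addition with lists: pair up the elements with zip().
py_elementwise = [a + b for a, b in zip(py_list_1, py_list_2)]

# Growing an array: join the two arrays end to end.
np_joined = np.concatenate([np_array_1, np_array_2])

print(py_elementwise)
print(np_joined)
```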
2.5 Multiplying by a number
py_list = [1, 2, 3, 4, 5]
np_array = np.array(py_list)
py_list*2
np_array*2
Lists
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
Arrays
array([ 2, 4, 6, 8, 10])
So multiplying by a number makes a list grow, whereas an array multiplies its elements by the number!
2.6 Squaring
py_list = [1, 2, 3, 4, 5]
np_array = np.array(py_list)
np_array**2
Lists
py_list**2    # Won't work!
Arrays
array([ 1, 4, 9, 16, 25])
2.7 Asking questions
py_list = [1, 2, 3, 4, 5]
np_array = np.array(py_list)
py_list == 3     # Works, but what IS the question?
np_array == 3
np_array > 3
Lists
- False
- py_list > 3    # Won't work!
Arrays
- array([False, False, True, False, False])
- array([False, False, False, True, True])
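The True/False answers from an array are useful in their own right. As a hedged sketch (this goes slightly beyond what is shown above), the array of answers can be counted with sum() or used inside square brackets to pick out the matching elements. The name mask is my own label for it.

```python
import numpy as np

py_list = [1, 2, 3, 4, 5]
np_array = np.array(py_list)

# With a list, '== 3' asks "is the WHOLE list equal to 3?"
print(py_list == 3)

# To ask the question of each element, we must loop ourselves.
py_answers = [number > 3 for number in py_list]

# With an array, each element is asked the question.
mask = np_array > 3
how_many = mask.sum()        # True counts as 1, False as 0
matching = np_array[mask]    # Keep only elements where mask is True

print(py_answers, mask, how_many, matching)
```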
2.8 Mathematics
py_list = [1, 2, 3, 4, 5]
np_array = np.array(py_list)
sum(py_list)     # sum() is a base Python function
max(py_list)     # max() is a base Python function
min(py_list)     # min() is a base Python function
np_array.sum()
np_array.max()
np_array.min()
np_array.mean()
np_array.std()
Lists
- 15
- 5
- 1
- py_list.sum()    # Won't work!
Arrays
- 15
- 5
- 1
- 3.0
- 1.4142135623730951
Roughly speaking, an operation on a list works on the whole list. In contrast, an operation on an array works on the individual elements of the array.
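Lists have no mean() or std() of their own, but Python's built-in statistics module can fill the gap. The sketch below is one possible approach; I use statistics.pstdev() (the population standard deviation) because that matches what std() on a NumPy array computes by default.

```python
import statistics
import numpy as np

py_list = [1, 2, 3, 4, 5]
np_array = np.array(py_list)

# Lists: use functions from the statistics module.
py_mean = statistics.mean(py_list)
py_std = statistics.pstdev(py_list)    # Population standard deviation

# Arrays: use the array's own functions.
np_mean = np_array.mean()
np_std = np_array.std()

print(py_mean, py_std)
print(np_mean, np_std)
```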
Footnotes
1. For example, think of how easy it is to do row or column manipulations of data when put into a spreadsheet format.