04| Pool Sampling
Introduction
Let’s investigate a medical testing strategy that proved invaluable in coping with the huge number of COVID-19 tests required during the recent pandemic. Specifically, we will study the strategy called group or pool sampling (Wikipedia). It was first proposed by the economist Robert Dorfman in 1943 to test soldiers for syphilis!
To get started, please download the following two files:
-
- This file contains a Python function test(genome) that accepts a genome as a string (e.g., ATGAGAAT…) and returns True if there is an infection or False if not.
- This function can only test one sample at a time.
- There are no false positives.
- The test is 100% accurate.
-
- This file contains (simulated) genomics data for 500 people.
- The files are named in the format person_###.txt where ### is an identifying number.
- Just for your information, each person’s file contains a million bases.
Tasks
Getting started
- To get started, import the function test() from test_kit.py. test() is a straightforward function. However, you are not meant to understand how this function works. You just need to be able to use it. So, you can treat the function test() as a black box.
- Confirm that person 27 is infected and that persons 17 and 37 are not infected. To use test() on a person, you need to:
- Read the person’s genomics data from their file and
- Pass this genomics data into test().
Create a function
Create a function get_genome(person_id) that will take an (integer) person_id as input and return the corresponding genome data as output.
Use get_genome() to apply test() to the first 100 people (i.e., from person_00000.txt to person_00099.txt).
You should see persons 1, 7, 8, 27, 47, 57, 62, 63, and 78 show infections.
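As a starting point, here is a minimal sketch of what get_genome() could look like. The folder name `genomes` and the extra `data_dir` parameter are assumptions made for illustration; the five-digit zero padding follows the file names quoted above (person_00000.txt). Stripping whitespace assumes each file holds one genome string, possibly followed by a newline.

```python
from pathlib import Path

# Assumption: the genome files live in a folder called "genomes".
# Point DATA_DIR at wherever you actually unpacked the data.
DATA_DIR = Path("genomes")


def get_genome(person_id, data_dir=DATA_DIR):
    """Return the genome of `person_id` as a single string."""
    # File names look like person_00000.txt, so pad the id to five digits.
    path = Path(data_dir) / f"person_{person_id:05d}.txt"
    # strip() drops a trailing newline (or other surrounding whitespace), if any.
    return path.read_text().strip()
```

With test() imported from test_kit.py, a call such as test(get_genome(27)) should then return True, since person 27 is stated to be infected.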
Random testing
- Use test() on 100 random people to estimate the infection rate in this population. Print your result as a percentage.
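One possible way to structure the random-testing step is sketched below. The helper name `estimate_infection_rate` and its parameters are hypothetical, and the demo uses a made-up stand-in for the real check so that the snippet runs on its own. In your solution you would pass something like `lambda i: test(get_genome(i))` instead.

```python
import random


def estimate_infection_rate(population_ids, is_infected, sample_size=100, seed=None):
    """Estimate the infection rate (as a fraction) from a random sample.

    `is_infected` is any function mapping a person id to True/False.
    People are sampled without replacement, so nobody is tested twice.
    """
    rng = random.Random(seed)  # a seed makes the sample reproducible
    sample = rng.sample(list(population_ids), sample_size)
    positives = sum(1 for person_id in sample if is_infected(person_id))
    return positives / sample_size


if __name__ == "__main__":
    # Stand-in for the real check (NOT the real data): every 10th person is "infected".
    def fake_is_infected(person_id):
        return person_id % 10 == 0

    rate = estimate_infection_rate(range(500), fake_is_infected, seed=1)
    print(f"Estimated infection rate: {rate:.1%}")
```

Because only 100 of the 500 people are sampled, the estimate will vary from run to run unless you fix the seed.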
Pool Sampling
Write some Python code to join/combine the genomics data of person 1 and person 2. I.e., given genome_1 and genome_2, you should end up with genome_1genome_2. What we are trying to simulate here is the mixing of blood samples.
Write a function called pool(list_of_id) that accepts a list of integers and returns (as a string) the joined genomes of the people identified by the integers.
Apply test() to the combined genomes of persons 20 and 21, and of persons 21 and 22. Are the results as expected?
For the first 100 persons, use pool and test in groups of 10 to determine those infected. Please keep track of the number of tests you have performed and print it at the end.
Repeat the previous part with groups of 5.
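The grouped procedure can be sketched as a two-stage scheme: test each pooled group once, then retest individually only the members of groups that come back positive. The function name `two_stage_pool_testing` is hypothetical, and `get_genome` and `test` are passed in as arguments purely so the sketch runs without the real files. The demo uses simulated stand-ins, not the real test kit or data.

```python
def two_stage_pool_testing(person_ids, group_size, get_genome, test):
    """Return (list of infected ids, number of test() calls made)."""
    ids = list(person_ids)
    infected = []
    n_tests = 0
    for start in range(0, len(ids), group_size):
        group = ids[start:start + group_size]
        # Stage 1: join the genomes (simulating mixed samples) and test once.
        pooled = "".join(get_genome(i) for i in group)
        n_tests += 1
        if not test(pooled):
            continue  # the whole group is cleared by a single test
        if len(group) == 1:
            infected.append(group[0])  # a positive pool of one needs no retest
            continue
        # Stage 2: the pool is positive, so test every member individually.
        for i in group:
            n_tests += 1
            if test(get_genome(i)):
                infected.append(i)
    return infected, n_tests


if __name__ == "__main__":
    # Simulated stand-ins: a genome counts as "infected" if it contains a "T".
    sick = {1, 7, 8, 27}

    def fake_get_genome(person_id):
        return "AATA" if person_id in sick else "AAAA"

    def fake_test(genome):
        return "T" in genome

    found, n_tests = two_stage_pool_testing(range(40), 10, fake_get_genome, fake_test)
    print(f"Infected: {found}, tests used: {n_tests}")
```

Returning the test count alongside the result makes it easy to compare group sizes, for example by calling the function once with 10 and once with 5.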
Optimisation
Let’s try to detect as many infections as possible using as few tests as possible. Pick any grouping you like, and apply pool sampling to determine the number of infected people in the population of 500. Keep track of the number of tests you have performed.
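To guide the choice of group size, one common back-of-envelope model (not part of the task itself) assumes that each person is infected independently with probability p and that two-stage pooling is used. A group of size k then costs one pooled test plus k individual tests whenever the pool is positive, which gives an expected 1/k + 1 - (1 - p)^k tests per person. The helper names below are hypothetical, and the real data need not satisfy the independence assumption.

```python
def expected_tests_per_person(p, k):
    """Expected tests per person for two-stage pooling with group size k.

    Assumes independent infections with probability p.
    """
    if k == 1:
        return 1.0  # no pooling: exactly one test per person
    # One pooled test shared by k people, plus k retests if the pool is positive.
    return 1 / k + 1 - (1 - p) ** k


def best_group_size(p, max_k=50):
    """Group size between 1 and max_k with the lowest expected cost."""
    return min(range(1, max_k + 1), key=lambda k: expected_tests_per_person(p, k))


if __name__ == "__main__":
    for p in (0.01, 0.05, 0.09):
        k = best_group_size(p)
        cost = expected_tests_per_person(p, k)
        print(f"p = {p:.0%}: group size {k}, about {cost:.2f} tests per person")
```

You can plug in the infection rate you estimated from random testing and compare the model's suggestion with the test counts you actually observe.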