04| Pool Sampling

Introduction

Let’s investigate a medical testing strategy that proved invaluable in coping with the huge number of COVID-19 tests required during the recent pandemic. Specifically, we will study the strategy called group or pool sampling (Wikipedia). It was first proposed by the economist Robert Dorfman in 1943 to test soldiers for syphilis!

To get started, please download the following two files:

  1. test_kit.py

    • This file contains a Python function test(genome) that accepts a genome as a string (e.g., ATGAGAAT…) and returns True if there is an infection or False if not.
    • This function can only test one sample at a time.
    • There are no false positives.
    • The test is 100% accurate.
  2. person-files.zip

    • This file contains (simulated) genomics data for 500 people.
    • The files are named in the format person_###.txt, where ### is an identifying number.
    • Just for your information, each person’s file contains a million bases.

Tasks

Getting started

  1. To get started, import the function test() from test_kit.py.
    You are not meant to understand how test() works; you only need to be able to use it. Treat it as a black box.
  2. Confirm that person 27 is infected and that persons 17 and 37 are not infected. To use test() on a person, you need to:
    1. Read the person’s genomics data from their file, and
    2. Pass this genomics data into test().

Create a function

  1. Create a function get_genome(person_id) that will take an (integer) person_id as input and return the corresponding genome data as output.

  2. Use get_genome() to apply test() to the first 100 people (i.e., from person_00000.txt to person_00099.txt).
    You should see that persons 1, 7, 8, 27, 47, 57, 62, 63, and 78 are infected.
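
The two steps above could be sketched as follows. This is only a minimal sketch under several assumptions: the archive is extracted to a folder named person-files (a hypothetical location), the file names zero-pad the ID to five digits as in person_00099.txt, and each file holds the genome as plain text. The helper find_infected and its test_fn parameter are illustrative names, not part of test_kit.py.

```python
from pathlib import Path

DATA_DIR = Path("person-files")  # assumed extraction folder; adjust to your setup


def get_genome(person_id, data_dir=DATA_DIR):
    """Return the genome string stored in the file for person_id."""
    # Assumes five-digit zero padding, e.g. 27 -> person_00027.txt
    path = Path(data_dir) / f"person_{person_id:05d}.txt"
    # strip() drops surrounding whitespace such as a trailing newline (an assumption)
    return path.read_text().strip()


def find_infected(person_ids, test_fn, data_dir=DATA_DIR):
    """Test each person individually and return the IDs that test positive."""
    return [pid for pid in person_ids if test_fn(get_genome(pid, data_dir))]
```

With test imported from test_kit.py, calling find_infected(range(100), test) would then cover the first 100 people, one test per person.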

Random testing

  1. Use test() on 100 random people to estimate the infection rate in this population.
    Print your result as a percentage.
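
One possible way to draw the sample is shown below. It is a sketch, not the required solution: the function name estimate_infection_rate and the is_infected callback are hypothetical, and the callback is expected to wrap test() and get_genome() for a given ID.

```python
import random


def estimate_infection_rate(person_ids, sample_size, is_infected, seed=None):
    """Estimate the infection rate (in percent) from a random sample."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    # sample() draws without replacement, so nobody is tested twice
    sample = rng.sample(list(person_ids), sample_size)
    positives = sum(1 for pid in sample if is_infected(pid))
    return 100 * positives / sample_size
```

Assuming the IDs run from 0 to 499, the call estimate_infection_rate(range(500), 100, lambda pid: test(get_genome(pid))) returns the estimate, which can be printed with print(f"{rate:.1f}%"). Different random samples will generally give slightly different estimates.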

Pool Sampling

  1. Write some Python code to join the genomics data of person 1 and person 2; i.e., given genome_1 and genome_2, you should end up with genome_1genome_2. This simulates the mixing of blood samples.

  2. Write a function called pool(list_of_id) that accepts a list of integers and returns a string containing the joined genomes of the people identified by those integers.

  3. Apply test() to the combined genomes of persons 20 and 21, and then of persons 21 and 22. Are the results as expected?

  4. For the first 100 persons, use pool() and test() in groups of 10 to determine who is infected. Keep track of the number of tests you perform and print it at the end.

  5. Repeat the previous part with groups of 5.
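
The grouped procedure in the last two parts might look like the sketch below. It assumes that a pooled sample tests positive exactly when at least one member is infected, which is what part 3 asks you to check. So that the example stays self-contained, pool() here takes the genome loader as a second argument, unlike the one-argument pool(list_of_id) the task asks for. The name pooled_screen is hypothetical.

```python
def pool(list_of_id, get_genome):
    """Join the genomes of the given people into one string."""
    return "".join(get_genome(pid) for pid in list_of_id)


def pooled_screen(person_ids, group_size, get_genome, test):
    """Two-stage pool testing; returns (infected_ids, number_of_tests)."""
    person_ids = list(person_ids)
    infected, n_tests = [], 0
    for start in range(0, len(person_ids), group_size):
        group = person_ids[start:start + group_size]
        n_tests += 1  # one test on the pooled sample
        if not test(pool(group, get_genome)):
            continue  # negative pool: everyone in it is cleared
        for pid in group:  # positive pool: retest each member individually
            n_tests += 1
            if test(get_genome(pid)):
                infected.append(pid)
    return infected, n_tests
```

Calling pooled_screen(range(100), 10, get_genome, test) and then again with a group size of 5 lets you compare the two test counts directly.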

Optimisation

Let’s try to detect as many infections as possible using the fewest tests. Pick any grouping you like, and apply pool sampling to determine the number of infected people in the population of 500. Keep track of the number of tests you have performed.
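
As a rough guide for choosing a group size, the sketch below computes the expected number of tests per person for simple two-stage pooling. It assumes each person is infected independently with probability p and that every member of a positive pool is retested individually. Under those assumptions, a pool of size k costs 1/k tests per person for the pooled test, plus one retest per person with probability 1 - (1 - p)^k. The function names are hypothetical.

```python
def expected_tests_per_person(p, k):
    """Expected tests per person for two-stage pooling with group size k."""
    if k == 1:
        return 1.0  # no pooling: exactly one test each
    # 1/k for the pooled test, plus a retest whenever the pool is positive
    return 1 / k + 1 - (1 - p) ** k


def best_group_size(p, max_k=50):
    """Group size in 1..max_k that minimises the expected tests per person."""
    return min(range(1, max_k + 1), key=lambda k: expected_tests_per_person(p, k))
```

For example, the 9 infections among the first 100 people suggest p ≈ 0.09, for which best_group_size(0.09) gives 4, at roughly 0.56 expected tests per person. This is an expectation under the stated assumptions, not a guarantee for this data set, and other schemes (such as splitting positive pools into smaller pools before retesting) may need fewer tests.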
