metadata

title: Tutorial - Fitness estimation from pooled growth and NGS
emoji: 🧮
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.31.0
app_file: app.py
pinned: false
license: mit
short_description: Inferring competitive fitness from NGS data
tags:
  - biology
  - sequencing

Tutorial – Fitness estimation from pooled growth

This is the repository for the interactive tutorial, accessed here. Non-interactive text from the tutorial is below.

Multiplex growth curves

How do strains grow when competing with each other?

That's given by the Lotka–Volterra competition model. For two strains:

$\frac{dn_{wt}}{dt} = w_{wt} n_{wt} \left( 1 - \frac{n_{wt} + n_{1}}{K} \right)$

$\frac{dn_1}{dt} = w_{1} n_1 \left( 1 - \frac{n_{wt} + n_{1}}{K} \right)$

$n_i(t)$: abundance of species (or strain) $i$ at time $t$.
$w_i$: intrinsic (exponential) growth rate of species $i$.
$K$: carrying capacity.

We can generalize to many strains. For each one:

$\frac{dn_i}{dt} = w_{i} n_i \left( 1 - \frac{\Sigma_j n_j}{K} \right)$

It's not possible to algebraically integrate these equations, since they are circularly dependent on each other. But we can numerically integrate, to simulate multiplexed growth curves.
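As a minimal sketch of that numerical integration, the block below steps the generalized equation forward with a fixed-step fourth-order Runge-Kutta scheme in plain Python. The function name `simulate_growth`, the step size, and all parameter values are illustrative assumptions, not taken from the tutorial's `app.py`, which may use a different integrator.

```python
def simulate_growth(n0, w, K, t_end, dt=0.01):
    """Integrate dn_i/dt = w_i * n_i * (1 - sum_j n_j / K) with fixed-step RK4.

    n0: initial abundances, w: growth rates (same order), K: carrying capacity.
    Returns a list of (time, abundances) tuples.
    """
    def deriv(n):
        crowding = 1.0 - sum(n) / K  # shared term that couples all strains
        return [wi * ni * crowding for wi, ni in zip(w, n)]

    n = list(n0)
    trajectory = [(0.0, list(n))]
    steps = int(round(t_end / dt))
    for step in range(1, steps + 1):
        k1 = deriv(n)
        k2 = deriv([ni + 0.5 * dt * k for ni, k in zip(n, k1)])
        k3 = deriv([ni + 0.5 * dt * k for ni, k in zip(n, k2)])
        k4 = deriv([ni + dt * k for ni, k in zip(n, k3)])
        n = [
            ni + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for ni, a, b, c, d in zip(n, k1, k2, k3, k4)
        ]
        trajectory.append((step * dt, list(n)))
    return trajectory


# Hypothetical pool: wild-type (w = 1.0), a slower and a faster mutant.
trajectory = simulate_growth(
    n0=[100.0, 100.0, 100.0], w=[1.0, 0.8, 1.2], K=1e6, t_end=20.0
)
final_time, final_n = trajectory[-1]
```

With these assumed values the pool saturates near $K$ well before the end of the run, and the fastest strain ends up with the largest share.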

Removing time dependence

It can be difficult to get absolute fitness out of these curves, because when the pool approaches the carrying capacity, all the strains' growth rates mutually affect each other.

However, if we're only interested in the fitness of multiplexed strains relative to a reference (e.g. wild-type) strain, then we can make this simplification:

$\frac{dn_{i}}{dt} / \frac{dn_{wt}}{dt} = \frac{dn_{i}}{dn_{wt}} = \frac{w_{i} n_i}{w_{wt} n_{wt}}$

The interdependency term cancels out, and time is removed, with the reference strain's growth acting as the clock. Unlike the time-dependent Lotka-Volterra equations, this has a closed-form integral:

$\log n_i(t) = \frac{w_i}{w_{wt}} \log \frac{n_{wt}(t)}{n_{wt}(0)} + \log{n_i(0)}$

So now the log of the number of cells of a mutant at any moment ($n_i(t)$) is dependent only on its inoculum ($n_i(0)$), how much the reference strain has grown (i.e. fold-expansion, $\frac{n_{wt}(t)}{n_{wt}(0)}$), and the ratio of fitness between the mutant and the reference ($\frac{w_i}{w_{wt}}$).
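To make the closed-form relation concrete, here is a small worked example in Python. The helper name `log_abundance` and the numbers are hypothetical, chosen only to illustrate the formula.

```python
import math


def log_abundance(n_i0, relative_fitness, wt_fold_expansion):
    """log n_i(t) = (w_i / w_wt) * log(n_wt(t) / n_wt(0)) + log n_i(0)."""
    return relative_fitness * math.log(wt_fold_expansion) + math.log(n_i0)


# Hypothetical case: 100 mutant cells inoculated, relative fitness 0.8,
# and the reference strain has expanded 1000-fold.
n_mutant = math.exp(
    log_abundance(n_i0=100.0, relative_fitness=0.8, wt_fold_expansion=1000.0)
)
print(round(n_mutant, 1))  # 100 * 1000**0.8, about 25118.9
```

Note that no time variable appears: the same mutant abundance is predicted whether the reference took two hours or two days to expand 1000-fold.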

Read counts from next-generation sequencing

But we don't measure the number of cells directly. Instead, we measure the number of reads (or UMIs). These represent a random sample of the population, followed by molecular biology handling and uneven sequencing depth per lane, which decouples the relative abundances between timepoints.

Below, you can simulate read counts for technical replicates of the growth curves above. The simulation:

Randomly samples a defined fraction of the cell population (without replacement, i.e. the Hypergeometric distribution). Smaller samples from smaller populations are noisier.
Calculates the resulting proportional representation of every strain in every sample.
Multiplies that proportion by read depth.
Randomly samples sequencing read counts resulting from variations in library construction and other stochasticity, according to the Negative Binomial distribution, an established noise model for sequencing counts.
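The four steps above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the tutorial's actual implementation: the function names are hypothetical, the Negative Binomial is parameterized by its mean and an integer dispersion $r$ (variance = mean + mean$^2$/$r$), and the `counts=` argument of `random.sample` requires Python 3.9 or later.

```python
import math
import random
from collections import Counter


def sample_cells(cell_counts, fraction, rng):
    """Step 1: draw a fraction of the pool without replacement
    (a multivariate hypergeometric sample). cell_counts maps strain -> int."""
    strains = list(cell_counts)
    k = int(round(fraction * sum(cell_counts.values())))
    drawn = Counter(
        rng.sample(strains, k, counts=[cell_counts[s] for s in strains])
    )
    return {s: drawn[s] for s in strains}


def negative_binomial(mean, dispersion, rng):
    """Step 4: NB draw as a sum of `dispersion` geometric variables
    (assumes an integer dispersion; exact for that case)."""
    if mean <= 0:
        return 0
    p = dispersion / (dispersion + mean)  # success probability
    log_q = math.log1p(-p)
    return sum(
        int(math.log(1.0 - rng.random()) / log_q) for _ in range(dispersion)
    )


def simulate_reads(cell_counts, fraction, read_depth, dispersion, seed=0):
    rng = random.Random(seed)
    sampled = sample_cells(cell_counts, fraction, rng)
    total = sum(sampled.values())
    if total == 0:
        return {s: 0 for s in sampled}
    # Steps 2 and 3: proportion of each strain, scaled by read depth.
    expected = {s: read_depth * c / total for s, c in sampled.items()}
    return {s: negative_binomial(m, dispersion, rng) for s, m in expected.items()}


# Hypothetical pool and settings.
cells = {"wt": 50_000, "mut1": 30_000, "spike": 20_000}
sampled = sample_cells(cells, fraction=0.01, rng=random.Random(1))
reads = simulate_reads(cells, fraction=0.01, read_depth=100_000, dispersion=10, seed=1)
```

Calling `simulate_reads` with different seeds gives independent technical replicates of the same underlying pool.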

Accounting for sequencing subsampling per sample

Each sequencing sample $s$ could be over- or under-sampling the population relative to the first timepoint by some factor $\phi_s$.

$\log \frac{c_i(t)}{c_i(0)} = \log \left( \phi_s \frac{n_i(t)}{n_i(0)} \right) = \log \phi_s + \frac{w_i}{w_{wt}} \log \frac{n_{wt}(t)}{n_{wt}(0)}$

Variables:

$c_i(t)$: Read (or UMI) count of strain $i$ at time $t$
$\phi_s$: The ratio of sampling depth at time $t$ to that at time $0$ for sample $s$

The factor $\phi_s$ is the fold-change in read counts between samples divided by the fold-change in cell counts between samples, for any strain (assuming each strain is sampled without bias):

$\log \phi_s = \log \frac{c_i(t)}{c_i(0)} - \log \frac{n_i(t)}{n_i(0)}$

We can get rid of the nuisance parameter $\phi_s$ (which is difficult to measure because we don't know the true number of cells for each strain and sample) using the following trick.

We have the equation for read counts for mutant $i$ (same as above):

$\log \frac{c_i(t)}{c_i(0)} = \log \phi_s + \frac{w_i}{w_{wt}} \log \frac{n_{wt}(t)}{n_{wt}(0)}$

And for the reference strain (relative fitness is 1):

$\log \frac{c_{wt}(t)}{c_{wt}(0)} = \log \phi_s + \log \frac{n_{wt}(t)}{n_{wt}(0)}$

We can make $\phi_s$ disappear by taking the difference:

$\log \frac{c_i(t)}{c_i(0)} - \log \frac{c_{wt}(t)}{c_{wt}(0)} = \frac{w_i}{w_{wt}} \log \frac{n_{wt}(t)}{n_{wt}(0)} - \log \frac{n_{wt}(t)}{n_{wt}(0)}$

This is equivalent to:

$\log \left( \frac{c_i(t)}{c_{wt}(t)}\frac{c_{wt}(0)}{c_i(0)} \right) = \left(\frac{w_i}{w_{wt}} - 1 \right) \log \frac{n_{wt}(t)}{n_{wt}(0)}$

So the strain-to-reference count ratio at time $t$, normalized to the same ratio at time 0, depends only on the relative fitness and the true fold-expansion of the reference strain.

Plotting the strain-to-reference count ratio at time $t$, normalized to the same ratio at time 0, against the fold-expansion of the reference strain on a log-log plot should give a straight line, with intercept 0 and gradient equal to the relative fitness minus 1.
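A hedged sketch of how that gradient could be estimated: a least-squares fit constrained through the origin, applied to noise-free counts built from the model with arbitrary per-sample $\phi_s$ values. The helper names and all numbers are hypothetical; with real, noisy counts a weighted or count-based fit may be more appropriate.

```python
import math


def log_double_ratio(c_t, c_0, c_ref_t, c_ref_0):
    """log( (c_i(t) / c_wt(t)) * (c_wt(0) / c_i(0)) )."""
    return math.log((c_t / c_ref_t) * (c_ref_0 / c_0))


def slope_through_origin(x, y):
    """Least-squares gradient of y = m * x with the intercept fixed at 0."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


# Hypothetical noise-free data: true relative fitness 0.8, three samples
# with different sampling factors phi_s.
true_rel_fitness = 0.8
wt_fold = [2.0, 8.0, 32.0]   # n_wt(t) / n_wt(0), assumed known here
phi = [0.5, 1.7, 3.0]        # arbitrary; should cancel out
c_wt_0, c_mut_0 = 1000.0, 500.0

c_wt_t = [p * c_wt_0 * f for p, f in zip(phi, wt_fold)]
c_mut_t = [p * c_mut_0 * f ** true_rel_fitness for p, f in zip(phi, wt_fold)]

x = [math.log(f) for f in wt_fold]
y = [log_double_ratio(m, c_mut_0, w, c_wt_0) for m, w in zip(c_mut_t, c_wt_t)]

estimated_rel_fitness = slope_through_origin(x, y) + 1.0
```

Because $\phi_s$ appears in both the mutant and reference counts of each sample, it drops out of `y`, and the estimate recovers the assumed relative fitness regardless of the `phi` values chosen.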

Using spike-in counts

But we don't actually know the true fold-expansion of the reference strain, since it's not directly observed. However, a non-growing fitness-zero control can help, such as a heat-killed strain or a spike-in plasmid.

We start with the previous equation:

$\log \left( \frac{c_i(t)}{c_{wt}(t)}\frac{c_{wt}(0)}{c_i(0)} \right) = \left(\frac{w_i}{w_{wt}} - 1 \right) \log \frac{n_{wt}(t)}{n_{wt}(0)}$

But for the fitness-zero control, $w_{spike} = 0$, so:

$\log \left( \frac{c_{spike}(t)}{c_{wt}(t)}\frac{c_{wt}(0)}{c_{spike}(0)} \right) = -\log \frac{n_{wt}(t)}{n_{wt}(0)}$

This means that, although we don't observe the reference strain's growth directly, its log fold-expansion is given by the negative log of the ratio of the spike counts to the reference counts, normalized to the same ratio at time 0.

This leaves us with the overall equation:

$\log \left( \frac{c_i(t)}{c_{wt}(t)}\frac{c_{wt}(0)}{c_i(0)} \right) = \left(1 - \frac{w_i}{w_{wt}} \right) \log \left( \frac{c_{spike}(t)}{c_{wt}(t)}\frac{c_{wt}(0)}{c_{spike}(0)} \right)$

If we plot the left-hand side against the log term on the right-hand side, we should get a straight line for each strain with intercept zero and gradient $1 - \frac{w_i}{w_{wt}}$.
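As an end-to-end illustration, the sketch below estimates relative fitness from read counts alone, using the spike-in as the clock. The function name, the dictionary layout of the counts, and the strain labels `"wt"` and `"spike"` are assumptions made for this example; the data are noise-free and hypothetical.

```python
import math


def relative_fitness_from_spike(counts_0, counts_t, strain, ref="wt", spike="spike"):
    """Estimate w_i / w_wt from counts at time 0 and at later samples.

    counts_0: dict strain -> count at time 0.
    counts_t: list of dicts, one per later sample.
    Fits y = m * x through the origin, where m = 1 - w_i / w_wt.
    """
    def log_double_ratio(sample, name):
        return math.log(
            (sample[name] / sample[ref]) * (counts_0[ref] / counts_0[name])
        )

    x = [log_double_ratio(s, spike) for s in counts_t]   # equals -log fold-expansion
    y = [log_double_ratio(s, strain) for s in counts_t]
    gradient = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    return 1.0 - gradient


# Hypothetical noise-free counts: mutant relative fitness 0.8, spike-in does not grow.
counts_0 = {"wt": 1000.0, "mut1": 500.0, "spike": 2000.0}
counts_t = []
for fold, phi in [(2.0, 0.5), (8.0, 1.7), (32.0, 3.0)]:
    counts_t.append({
        "wt": phi * counts_0["wt"] * fold,
        "mut1": phi * counts_0["mut1"] * fold ** 0.8,
        "spike": phi * counts_0["spike"],
    })

estimate = relative_fitness_from_spike(counts_0, counts_t, "mut1")
```

Under this model the same function returns 1 for the reference strain itself and 0 for the spike-in, which makes a convenient sanity check on real data.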