title: Tutorial - Fitness estimation from pooled growth and NGS
emoji: 🧮
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.31.0
app_file: app.py
pinned: false
license: mit
short_description: Inferring competitive fitness from NGS data
tags:
- biology
- sequencing
Tutorial - Fitness estimation from pooled growth
This is the repository for the interactive tutorial, accessed here. Non-interactive text from the tutorial is below.
Multiplex growth curves
How do strains grow when competing with each other?
That's given by the Lotka-Volterra competition model. For two strains:
$$\frac{dn_{wt}}{dt} = w_{wt} n_{wt} \left(1 - \frac{n_{wt} + n_1}{K}\right)$$
$$\frac{dn_1}{dt} = w_1 n_1 \left(1 - \frac{n_{wt} + n_1}{K}\right)$$
- $n_i(t)$: abundance of species (or strain) $i$ at time $t$.
- $w_i$: intrinsic (exponential) growth rate of species $i$.
- $K$: carrying capacity.
We can generalize to many strains. For each one:
$$\frac{dn_i}{dt} = w_i n_i \left(1 - \frac{\sum_j n_j}{K}\right)$$
It's not possible to algebraically integrate these equations, since they are circularly dependent on each other. But we can numerically integrate, to simulate multiplexed growth curves.
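As a minimal sketch of such a numerical integration, the snippet below applies a fixed-step classical Runge-Kutta (RK4) scheme to the many-strain equation. The function names, step size, growth rates, inocula and carrying capacity are illustrative assumptions, not values taken from the tutorial app, which may use a different integrator.

```python
def competition_derivatives(abundances, growth_rates, carrying_capacity):
    """Return dn_i/dt = w_i * n_i * (1 - sum_j n_j / K) for every strain."""
    crowding = 1.0 - sum(abundances) / carrying_capacity
    return [w * n * crowding for w, n in zip(growth_rates, abundances)]


def simulate_growth(initial, growth_rates, carrying_capacity, t_end, n_steps):
    """Integrate the competition model with fixed-step RK4.

    Returns a list of states (one list of abundances per time step),
    including the initial state.
    """
    dt = t_end / n_steps
    state = list(initial)
    trajectory = [state]

    def shifted(base, slope, scale):
        # base + scale * slope, element by element
        return [b + scale * s for b, s in zip(base, slope)]

    for _ in range(n_steps):
        k1 = competition_derivatives(state, growth_rates, carrying_capacity)
        k2 = competition_derivatives(shifted(state, k1, dt / 2), growth_rates, carrying_capacity)
        k3 = competition_derivatives(shifted(state, k2, dt / 2), growth_rates, carrying_capacity)
        k4 = competition_derivatives(shifted(state, k3, dt), growth_rates, carrying_capacity)
        state = [
            n + dt / 6 * (a + 2 * b + 2 * c + d)
            for n, a, b, c, d in zip(state, k1, k2, k3, k4)
        ]
        trajectory.append(state)
    return trajectory


# Hypothetical pool: wild-type (index 0), a slower and a faster mutant.
trajectory = simulate_growth(
    initial=[10.0, 10.0, 10.0],
    growth_rates=[1.0, 0.8, 1.2],
    carrying_capacity=1e6,
    t_end=30.0,
    n_steps=3000,
)
```

Each entry of `trajectory` holds the abundance of every strain at one time step, so plotting column `i` against time gives the growth curve of strain `i`.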
Removing time dependence
It can be difficult to get absolute fitness out of these curves, because when the pool approaches the carrying capacity, all the strains' growth rates mutually affect each other.
However, if we're only interested in the fitness of multiplexed strains relative to a reference (e.g. wild-type) strain, then we can make this simplification:
$$\frac{dn_i}{dt} \Big/ \frac{dn_{wt}}{dt} = \frac{dn_i}{dn_{wt}} = \frac{w_i n_i}{w_{wt} n_{wt}}$$
The interdependency term cancels out, and time is removed, with the reference strain's growth acting as the clock. Unlike the time-dependent Lotka-Volterra equations, this has a closed-form integral:
$$\log n_i(t) = \frac{w_i}{w_{wt}} \log \frac{n_{wt}(t)}{n_{wt}(0)} + \log n_i(0)$$
So now the log of the number of cells of a mutant at any moment ($\log n_i(t)$) depends only on its inoculum ($n_i(0)$), how much the reference strain has grown (i.e. its fold-expansion, $\frac{n_{wt}(t)}{n_{wt}(0)}$), and the ratio of fitness between the mutant and the reference ($\frac{w_i}{w_{wt}}$).
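To make the closed-form relationship concrete, here is a small sketch that evaluates it directly. The function name and the numbers (inoculum of 100 cells, relative fitness 0.9, 1000-fold expansion of the reference) are hypothetical and chosen only for illustration.

```python
import math


def log_abundance(inoculum, relative_fitness, reference_fold_expansion):
    """log n_i(t) = (w_i / w_wt) * log(n_wt(t) / n_wt(0)) + log n_i(0)."""
    return relative_fitness * math.log(reference_fold_expansion) + math.log(inoculum)


# A mutant with w_i / w_wt = 0.9, inoculated at 100 cells,
# after the reference strain has expanded 1000-fold.
n_mutant = math.exp(log_abundance(100.0, 0.9, 1000.0))

# A strain with relative fitness 1 expands exactly as much as the reference.
n_neutral = math.exp(log_abundance(100.0, 1.0, 1000.0))
```

Note that no time variable appears anywhere: the fold-expansion of the reference strain plays the role of the clock.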
Read counts from next-generation sequencing
But we don't actually measure the number of cells directly. Instead, we measure the number of reads (or UMIs). These come from a random sample of the population, followed by molecular biology handling and uneven sequencing depth per lane, which decouples the relative abundances at each timepoint from those at the others.
Below, you can simulate read counts for technical replicates of the growth curves above. The simulation:
- Randomly samples a defined fraction of the cell population (without replacement, i.e. the Hypergeometric distribution). Smaller samples from smaller populations are noisier.
- Calculates the resulting proportional representation of every strain in every sample.
- Multiplies that proportion by read depth.
- Randomly samples sequencing read counts resulting from variations in library construction and other stochasticity, according to the Negative Binomial distribution, an established noise model for sequencing counts.
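The four steps above can be sketched with the Python standard library alone. This is an illustrative sketch under stated assumptions, not the code used by the tutorial app: the function names are hypothetical, the Negative Binomial is parameterized by its mean and an integer dispersion (size) parameter, and it is drawn as a sum of geometric variables. `random.sample` with the `counts` argument requires Python 3.9 or later.

```python
import math
import random
from collections import Counter


def sample_cells(cell_counts, fraction, rng):
    """Draw a fraction of the pooled cells without replacement.

    Sampling labelled cells without replacement gives the multivariate
    Hypergeometric distribution over strains.
    """
    n_sampled = round(fraction * sum(cell_counts))
    strains = range(len(cell_counts))
    drawn = Counter(rng.sample(strains, k=n_sampled, counts=cell_counts))
    return [drawn[s] for s in strains]


def negative_binomial(mean, dispersion, rng):
    """Negative Binomial draw with variance mean + mean**2 / dispersion.

    Assumes an integer dispersion; the draw is a sum of `dispersion`
    geometric variables (failures before the first success).
    """
    if mean <= 0:
        return 0
    p_success = dispersion / (dispersion + mean)
    log_q = math.log1p(-p_success)
    # 1 - random() lies in (0, 1], so the logarithm is always defined.
    return sum(int(math.log(1.0 - rng.random()) / log_q) for _ in range(dispersion))


def simulate_reads(cell_counts, fraction, read_depth, dispersion, rng):
    """Sample cells, convert to proportions, scale by depth, add NB noise."""
    sampled = sample_cells(cell_counts, fraction, rng)
    total = sum(sampled)
    if total == 0:
        raise ValueError("sampling fraction too small: no cells were drawn")
    expected_reads = [read_depth * s / total for s in sampled]
    return [negative_binomial(mu, dispersion, rng) for mu in expected_reads]


# Hypothetical pool of three strains, 10% sampled, 10,000 reads expected.
rng = random.Random(0)
reads = simulate_reads([5000, 3000, 2000], 0.1, 10_000, 10, rng)
```

A smaller `dispersion` gives noisier counts at the same mean, and a smaller `fraction` makes the cell-sampling step noisier, especially for rare strains.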
Accounting for sequencing subsampling per sample
Each sequencing sample $s$ could be over- or under-sampling the population relative to the first timepoint by some factor $\phi_s$.
$$\log \frac{c_i(t)}{c_i(0)} = \log \left( \phi_s \frac{n_i(t)}{n_i(0)} \right) = \log \phi_s + \frac{w_i}{w_{wt}} \log \frac{n_{wt}(t)}{n_{wt}(0)}$$
Variables:
- $c_i(t)$: Read (or UMI) count of strain $i$ at time $t$
- $\phi_s$: The ratio of sampling depth at time $t$ to that at time $0$ for sample $s$
The factor $\phi_s$ is the fold-change in read counts between samples divided by the fold-change in cell counts between samples, for any strain (assuming each strain is sampled without bias):
$$\log \phi_s = \log \frac{c_i(t)}{c_i(0)} - \log \frac{n_i(t)}{n_i(0)}$$
We can get rid of the nuisance parameter $\phi_s$ (which is difficult to measure because we don't know the true number of cells for each strain and sample) using the following trick.
We have the equation for read counts for mutant $i$ (same as above):
$$\log \frac{c_i(t)}{c_i(0)} = \log \phi_s + \frac{w_i}{w_{wt}} \log \frac{n_{wt}(t)}{n_{wt}(0)}$$
And for the reference strain (relative fitness is 1):
$$\log \frac{c_{wt}(t)}{c_{wt}(0)} = \log \phi_s + \log \frac{n_{wt}(t)}{n_{wt}(0)}$$
We can make $\phi_s$ disappear by taking the difference:
$$\log \frac{c_i(t)}{c_i(0)} - \log \frac{c_{wt}(t)}{c_{wt}(0)} = \frac{w_i}{w_{wt}} \log \frac{n_{wt}(t)}{n_{wt}(0)} - \log \frac{n_{wt}(t)}{n_{wt}(0)}$$
This is equivalent to:
$$\log \left( \frac{c_i(t)}{c_{wt}(t)} \cdot \frac{c_{wt}(0)}{c_i(0)} \right) = \left( \frac{w_i}{w_{wt}} - 1 \right) \log \frac{n_{wt}(t)}{n_{wt}(0)}$$
So the strain-to-reference count ratio at time $t$, normalized to the same ratio at time 0, depends only on the relative fitness and the true fold-expansion of the reference strain.
Plotting this normalized count ratio against the fold-expansion of the reference strain should give a straight line (on a log-log plot), with intercept 0 and gradient equal to the relative fitness minus 1.
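A quick numerical check can show that the sampling factor $\phi_s$ really does drop out. The sketch below uses noiseless, hypothetical counts (all names and numbers are assumptions made for illustration) and evaluates the log of the normalized count ratio for several values of $\phi_s$.

```python
import math


def log_normalized_ratio(c_i_t, c_i_0, c_wt_t, c_wt_0):
    """log[(c_i(t) / c_wt(t)) * (c_wt(0) / c_i(0))]."""
    return math.log((c_i_t / c_wt_t) * (c_wt_0 / c_i_0))


# Hypothetical noiseless scenario.
relative_fitness = 0.8   # w_i / w_wt
fold_expansion = 50.0    # n_wt(t) / n_wt(0)
n_wt_0, n_i_0 = 1000.0, 400.0
n_wt_t = n_wt_0 * fold_expansion
n_i_t = n_i_0 * fold_expansion ** relative_fitness  # closed-form growth

expected = (relative_fitness - 1.0) * math.log(fold_expansion)

# The same value comes out whatever the relative sampling depth phi is.
observed = [
    log_normalized_ratio(phi * n_i_t, n_i_0, phi * n_wt_t, n_wt_0)
    for phi in (0.25, 1.0, 7.0)
]
```

Every entry of `observed` matches `expected` up to floating-point error, because $\phi_s$ multiplies both the mutant and the reference counts within a sample and cancels in their ratio.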
Using spike-in counts
But we don't actually know the true fold-expansion of the reference strain, since it's not directly observed. However, a non-growing fitness-zero control can help, such as a heat-killed strain or a spike-in plasmid.
We start with the equation before,
$$\log \left( \frac{c_i(t)}{c_{wt}(t)} \cdot \frac{c_{wt}(0)}{c_i(0)} \right) = \left( \frac{w_i}{w_{wt}} - 1 \right) \log \frac{n_{wt}(t)}{n_{wt}(0)}$$
But for the fitness-zero control, $w_{spike} = 0$, so:
$$\log \left( \frac{c_{spike}(t)}{c_{wt}(t)} \cdot \frac{c_{wt}(0)}{c_{spike}(0)} \right) = -\log \frac{n_{wt}(t)}{n_{wt}(0)}$$
This means that, although we can't observe the growth of the reference strain directly, it is given by the ratio of the spike counts to the reference counts, normalized to the same ratio at time 0.
This leaves us with the overall equation:
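$$\log \left( \frac{c_i(t)}{c_{wt}(t)} \cdot \frac{c_{wt}(0)}{c_i(0)} \right) = \left( 1 - \frac{w_i}{w_{wt}} \right) \log \left( \frac{c_{spike}(t)}{c_{wt}(t)} \cdot \frac{c_{wt}(0)}{c_{spike}(0)} \right)$$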
If we plot the left-hand side against the right-hand side, we should get a straight line for each strain with intercept zero and gradient $1 - \frac{w_i}{w_{wt}}$.
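One way to turn this line into a fitness estimate is a least-squares fit through the origin. That choice is an assumption of this sketch rather than something the tutorial prescribes, and the function name and synthetic counts are hypothetical. The sketch also assumes all counts are non-zero, since zero counts would need a pseudocount or filtering before taking logs.

```python
import math


def relative_fitness_from_spike(strain, reference, spike):
    """Estimate w_i / w_wt from count time series (index 0 is time 0).

    Fits y = slope * x through the origin, where
      x = log[(c_spike(t) / c_wt(t)) * (c_wt(0) / c_spike(0))]
      y = log[(c_i(t)     / c_wt(t)) * (c_wt(0) / c_i(0))]
    and returns 1 - slope.
    """
    xs, ys = [], []
    for t in range(1, len(strain)):
        xs.append(math.log((spike[t] / reference[t]) * (reference[0] / spike[0])))
        ys.append(math.log((strain[t] / reference[t]) * (reference[0] / strain[0])))
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return 1.0 - slope


# Synthetic noiseless counts: reference fold-expansions and arbitrary
# per-sample depth factors (phi), with a true relative fitness of 0.7.
fold_expansions = [1.0, 4.0, 16.0, 64.0]
depth_factors = [1.0, 0.5, 2.0, 1.3]
reference = [phi * 1000.0 * f for phi, f in zip(depth_factors, fold_expansions)]
strain = [phi * 500.0 * f ** 0.7 for phi, f in zip(depth_factors, fold_expansions)]
spike = [phi * 200.0 for phi in depth_factors]  # fitness-zero control

estimate = relative_fitness_from_spike(strain, reference, spike)
```

With these noiseless inputs `estimate` recovers the true relative fitness; with real read counts the points scatter around the line and the fit averages over timepoints.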