Transformers
PyTorch
DNA
Genomics
GTDB
ReadLengthSequences


This model is trained on 11M+ read-length sequences. By "read length," we mean DNA sequences drawn from a realistic length distribution of the sequencing reads produced by a sequencing-by-synthesis (e.g., Illumina) machine, with a median sequence length of 128 bp.

The dataset was created from read-length sequences subselected from genomes downloaded from GTDB taxonomy release214 (https://data.ace.uq.edu.au/public/gtdb/data/releases/release214/). The GTDB taxonomy provides a phylogenetic organization of a wide diversity of prokaryotic genomes of both cultured and environmental origin.

To prepare the dataset, we downloaded the entire GTDB taxonomy tree and randomly selected one genome from each Order in the tree. We split these genome IDs 80-10-10 into train, validation, and test sets, ensuring low sequence similarity and only distant homology between sets to maximize generalization and minimize leakage. We then downloaded each full genome sequence and chunked it into read-length sequences as in LookingGlass v1, described in this paper: https://www.nature.com/articles/s41467-022-30070-8.
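A minimal sketch of these steps, assuming FASTA (.fna) inputs, a purely random 80-10-10 split (the actual split additionally enforces low inter-set sequence similarity), and a fixed 128 bp fragment length rather than a realistic read-length distribution; paths and helper names are illustrative:

```python
import random
from pathlib import Path

random.seed(42)

def split_genome_ids(genome_ids, train_frac=0.8, val_frac=0.1):
    """Randomly split genome IDs 80-10-10 into train/validation/test sets."""
    ids = list(genome_ids)
    random.shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

def read_fasta(path):
    """Yield raw nucleotide sequences (one per contig) from a FASTA (.fna) file."""
    seq = []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if seq:
                    yield "".join(seq)
                    seq = []
            else:
                seq.append(line.strip())
    if seq:
        yield "".join(seq)

def chunk_genome(path, read_len=128):
    """Chunk each contig into non-overlapping read-length fragments."""
    for contig in read_fasta(path):
        for i in range(0, len(contig) - read_len + 1, read_len):
            yield contig[i:i + read_len]

genome_files = sorted(Path("genomes").glob("*.fna"))  # one genome per GTDB Order
train_ids, val_ids, test_ids = split_genome_ids(p.stem for p in genome_files)
```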

These read-length sequences were then filtered to remove any sequence containing more than 10% unknown nucleotides, which are represented by the character "N". This resulted in a dataset of 11M+ read-length sequences.
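The ambiguity filter can be sketched as a simple threshold check, reusing the helpers from the sketch above:

```python
def passes_n_filter(seq, max_n_frac=0.10):
    """Keep a fragment only if at most 10% of its bases are the unknown character "N"."""
    return seq.upper().count("N") / len(seq) <= max_n_frac

# Filtered read-length fragments, grouped by the genome they came from.
reads_per_genome = {
    path.stem: [s for s in chunk_genome(path) if passes_n_filter(s)]
    for path in genome_files
}
```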

We developed a custom dataloader in which each batch draws sequences from every genome in the training set, ensuring broad sequence diversity per batch. The training dataset was ordered as follows: each genome (.fna file) contributes an arbitrary number of read-length sequences, which are grouped into exclusive 300-sequence blocks; blocks are then taken from all genomes in turn and arranged one after another. This presents the data to the training loop in a structured way that exposes higher biodiversity in each batch, encouraging more generalized features. It also avoids overfitting to a single genome in early batches and then forgetting it as training proceeds.
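A minimal sketch of this ordering, building on `reads_per_genome` from above; the 300-sequence block size comes from the description, while the exact round-robin interleaving across genomes is an assumption:

```python
from itertools import zip_longest

def ordered_blocks(reads_per_genome, block_size=300):
    """Group each genome's reads into exclusive 300-sequence blocks, then
    interleave the blocks so consecutive blocks come from different genomes."""
    per_genome = []
    for reads in reads_per_genome.values():
        blocks = [reads[i:i + block_size] for i in range(0, len(reads), block_size)]
        per_genome.append(blocks)
    ordered = []
    # Round-robin: take one block from every genome before returning to the first.
    for round_of_blocks in zip_longest(*per_genome):
        ordered.extend(block for block in round_of_blocks if block is not None)
    return ordered

# Flatten the ordered blocks into the final training sequence order.
training_order = [seq for block in ordered_blocks(reads_per_genome) for seq in block]
```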

This dataset is named Ordered Split Ordered Batch (OSOB).

[Figure: OSOB dataset ordering]

The full dataset preparation method is described in this notebook: https://github.com/Hoarfrost-Lab/LookingGlassv2.0/blob/main/notebooks/DataCleaner.ipynb

Here is a notebook on how to load the model into memory: https://github.com/Hoarfrost-Lab/LookingGlassv2.0/blob/main/notebooks/How%20to%20Load%20LookingGlass(LOLBERT)%20model%20with%20custom%20Tokenizer.ipynb

Note: You will need the tokenizer files (merges and vocab) present locally to load the model properly.
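A minimal loading sketch, assuming a byte-level BPE tokenizer rebuilt from local `vocab.json` and `merges.txt` files and an encoder loadable through `AutoModel`; the tokenizer class and file names are assumptions, so consult the notebook above for the exact procedure:

```python
import torch
from transformers import AutoModel, RobertaTokenizerFast

# Rebuild the custom tokenizer from the locally downloaded vocab/merges files
# (file names are assumptions; use whatever the repository provides).
tokenizer = RobertaTokenizerFast(vocab_file="vocab.json", merges_file="merges.txt")

# The repository is gated: accept the access conditions on the Hub and log in
# (e.g. via `huggingface-cli login`) before downloading the weights.
model = AutoModel.from_pretrained("HoarfrostLab/LookingGlass-2")
model.eval()

# Encode a toy read-length sequence and inspect its contextual embeddings.
inputs = tokenizer("ACGTACGTTGCA", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```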

Here is the validation loss trend for training over 100 epochs:

[Figure: validation loss curve over 100 epochs]
