|
--- |
|
license: cc-by-sa-4.0 |
|
tags: |
|
- DNA |
|
- biology |
|
- genomics |
|
- protein |
|
- kmer |
|
- cancer |
|
- gleason-grade-group |
|
--- |
|
## Project Description |
|
This repository contains the trained model for our paper, **Fine-tuning a Sentence Transformer for DNA & Protein tasks**, currently under review at BMC Bioinformatics. The model, called **simcse-dna**, is based on the original implementation of **SimCSE [1]**. It was adapted for DNA downstream tasks by training on a small sample of k-mer tokens generated from the human reference genome, and it can be used to produce sentence embeddings for DNA tasks.
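As an illustrative sketch (not necessarily the paper's exact preprocessing), a DNA sequence can be split into overlapping 6-mer tokens like this; the function name `kmer_tokens` is our own for this example:

```python
def kmer_tokens(sequence: str, k: int = 6) -> str:
    """Split a DNA sequence into overlapping k-mers, joined by spaces."""
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# A 10-base sequence yields five overlapping 6-mers
print(kmer_tokens("ACGTACGTAC"))  # ACGTAC CGTACG GTACGT TACGTA ACGTAC
```

Each space-separated string of k-mers then plays the role of a "sentence" for the tokenizer below.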
|
|
|
### Prerequisites |
|
|
Please see the original [SimCSE](https://github.com/princeton-nlp/SimCSE) repository for installation details. The model will also be hosted on Zenodo (DOI: 10.5281/zenodo.11046580).
|
|
|
### Usage |
|
|
|
Run the following code to get the sentence embeddings: |
|
|
|
```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("dsfsi/simcse-dna")
model = AutoModel.from_pretrained("dsfsi/simcse-dna")

# `sentences` is your list of DNA sequences tokenized into 6-mers
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Get the embeddings
with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output
```
|
The retrieved embeddings can then be used as input features for a downstream machine learning classifier.
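For example, the embeddings can be fed to a scikit-learn classifier. The sketch below uses synthetic arrays as stand-ins for the model's `pooler_output` (assumed here to be 768-dimensional, as in BERT-base) and random labels, so it runs without downloading the model; in practice you would substitute your real embeddings and task labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-ins for 768-dimensional sentence embeddings and binary labels
X = rng.normal(size=(200, 768))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```

With real embeddings, `X` would be `embeddings.numpy()` from the snippet above and `y` the labels of your classification task.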
|
|
|
## Performance on evaluation tasks |
|
|
|
Details of the datasets, and how to access them, are provided in the paper **(TBA)**.
|
|
|
**Table:** Accuracy scores (with 95% confidence intervals) across datasets T1–T8 for each model and embedding method. |
|
|
|
| Model | Embed. | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | |
|
|-------|-----------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------| |
|
| LR | Proposed | _0.65 ± 0.01_ | _0.67 ± 0.0_ | _0.85 ± 0.01_ | _0.64 ± 0.01_ | _0.80 ± 0.0_ | _0.49 ± 0.0_ | _0.33 ± 0.0_ | _0.70 ± 0.01_ | |
|
| | DNABERT | 0.62 ± 0.01 | 0.65 ± 0.0 | 0.84 ± 0.04 | 0.69 ± 0.01 | 0.85 ± 0.01 | 0.49 ± 0.0 | 0.33 ± 0.0 | 0.60 ± 0.01 | |
|
| | NT | **0.66 ± 0.0** | **0.67 ± 0.0** | 0.84 ± 0.01 | **0.73 ± 0.0** | **0.85 ± 0.01**| **0.81 ± 0.0** | **0.62 ± 0.01**| **0.99 ± 0.0** | |
|
|
| LGBM | Proposed | _0.64 ± 0.01_ | _0.66 ± 0.0_ | _0.90 ± 0.02_ | _0.61 ± 0.01_ | _0.78 ± 0.0_ | _0.49 ± 0.0_ | _0.33 ± 0.0_ | _0.81 ± 0.01_ | |
|
| | DNABERT | 0.62 ± 0.01 | 0.65 ± 0.01 | 0.90 ± 0.02 | 0.65 ± 0.01 | 0.83 ± 0.0 | 0.49 ± 0.0 | 0.33 ± 0.0 | 0.75 ± 0.01 | |
|
| | NT | 0.63 ± 0.01 | 0.66 ± 0.0 | **0.91 ± 0.02**| 0.72 ± 0.0 | **0.85 ± 0.0** | **0.80 ± 0.0** | **0.59 ± 0.01**| 0.97 ± 0.0 | |
|
|
| XGB | Proposed | _0.60 ± 0.01_ | _0.62 ± 0.0_ | _0.90 ± 0.02_ | _0.60 ± 0.0_ | _0.77 ± 0.0_ | _0.49 ± 0.0_ | _0.33 ± 0.0_ | _0.85 ± 0.01_ | |
|
| | DNABERT | 0.59 ± 0.01 | 0.62 ± 0.01 | 0.90 ± 0.01 | 0.64 ± 0.01 | 0.82 ± 0.01 | 0.49 ± 0.0 | 0.33 ± 0.0 | 0.79 ± 0.01 | |
|
| | NT | 0.61 ± 0.01 | 0.64 ± 0.0 | 0.90 ± 0.02 | **0.89 ± 0.03**| **0.85 ± 0.01**| **0.81 ± 0.01**| **0.60 ± 0.01**| 0.98 ± 0.0 | |
|
|
| RF | Proposed | _0.61 ± 0.0_ | _0.66 ± 0.01_ | _0.90 ± 0.02_ | _0.61 ± 0.01_ | _0.77 ± 0.0_ | _0.49 ± 0.0_ | _0.33 ± 0.0_ | _0.86 ± 0.0_ | |
|
| | DNABERT | 0.60 ± 0.0 | 0.66 ± 0.01 | 0.90 ± 0.02 | 0.63 ± 0.01 | 0.82 ± 0.0 | 0.49 ± 0.0 | 0.33 ± 0.0 | 0.81 ± 0.01 | |
|
| | NT | 0.62 ± 0.01 | **0.67 ± 0.01**| 0.90 ± 0.01 | 0.71 ± 0.01 | **0.85 ± 0.0** | **0.79 ± 0.0** | **0.55 ± 0.01**| 0.97 ± 0.0 | |
|
|
|
|
|
**Table:** F1-scores (with 95% confidence intervals) across datasets T1–T8 for each model and embedding method. |
|
|
|
| Model | Embed. | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | |
|
|-------|-----------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------| |
|
| LR | Proposed | **_0.78 ± 0.0_** | **_0.80 ± 0.01_** | _0.20 ± 0.05_ | _0.64 ± 0.01_ | _0.79 ± 0.0_ | _0.13 ± 0.37_ | _0.16 ± 0.0_ | _0.70 ± 0.01_ | |
|
| | DNABERT | 0.75 ± 0.01 | 0.78 ± 0.0 | 0.47 ± 0.09 | 0.69 ± 0.01 | 0.84 ± 0.01 | 0.13 ± 0.37 | 0.16 ± 0.0 | 0.59 ± 0.01 | |
|
| | NT | 0.56 ± 0.01 | 0.54 ± 0.0 | **0.78 ± 0.01**| **0.73 ± 0.0** | **0.85 ± 0.01**| **0.81 ± 0.0** | **0.62 ± 0.01**| **0.99 ± 0.0** | |
|
|
| LGBM | Proposed | _0.76 ± 0.01_ | _0.79 ± 0.0_ | _0.60 ± 0.11_ | _0.63 ± 0.01_ | _0.77 ± 0.0_ | _0.47 ± 0.20_ | _0.26 ± 0.04_ | _0.82 ± 0.0_ | |
|
| | DNABERT | 0.74 ± 0.0 | 0.78 ± 0.0 | 0.60 ± 0.08 | 0.66 ± 0.01 | 0.82 ± 0.01 | 0.47 ± 0.20 | 0.26 ± 0.04 | 0.75 ± 0.01 | |
|
| | NT | 0.59 ± 0.01 | 0.56 ± 0.0 | **0.89 ± 0.02**| **0.72 ± 0.01**| **0.85 ± 0.0** | **0.80 ± 0.0** | **0.59 ± 0.01**| **0.97 ± 0.0** | |
|
|
| XGB | Proposed | _0.72 ± 0.01_ | _0.75 ± 0.0_ | _0.59 ± 0.08_ | _0.60 ± 0.0_ | _0.76 ± 0.0_ | _0.47 ± 0.20_ | _0.26 ± 0.04_ | _0.85 ± 0.01_ | |
|
| | DNABERT | 0.71 ± 0.01 | 0.75 ± 0.01 | 0.58 ± 0.05 | 0.64 ± 0.01 | 0.82 ± 0.01 | 0.47 ± 0.20 | 0.26 ± 0.04 | 0.79 ± 0.01 | |
|
| | NT | 0.59 ± 0.01 | 0.57 ± 0.01 | 0.72 ± 0.01 | **0.85 ± 0.01**| **0.85 ± 0.01**| **0.81 ± 0.01**| **0.60 ± 0.01**| **0.99 ± 0.0** |
|
|
| RF | Proposed | _0.73 ± 0.0_ | _0.79 ± 0.0_ | _0.58 ± 0.08_ | _0.61 ± 0.01_ | _0.75 ± 0.0_ | _0.53 ± 0.17_ | _0.24 ± 0.05_ | _0.86 ± 0.0_ | |
|
| | DNABERT | 0.72 ± 0.0 | 0.79 ± 0.0 | 0.59 ± 0.09 | 0.63 ± 0.01 | 0.80 ± 0.01 | 0.53 ± 0.17 | 0.24 ± 0.05 | 0.82 ± 0.01 | |
|
| | NT | 0.59 ± 0.01 | 0.56 ± 0.01 | **0.89 ± 0.02**| **0.71 ± 0.01**| **0.84 ± 0.0** | **0.79 ± 0.0** | **0.55 ± 0.01**| **0.97 ± 0.0** | |
|
|
|
## Authors |
|
|
|
|
* Mpho Mokoatle, Vukosi Marivate, Darlington Mapiye, Riana Bornman, Vanessa M. Hayes |
|
* Contact: [email protected]
|
|
|
## Citation |
|
|
BibTeX reference **TBA**
|
|
|
### References |
|
|
|
<a id="1">[1]</a> |
|
Gao, Tianyu, Xingcheng Yao, and Danqi Chen. "SimCSE: Simple Contrastive Learning of Sentence Embeddings." arXiv preprint arXiv:2104.08821 (2021).