|
--- |
|
license: cc-by-sa-4.0 |
|
tags: |
|
- DNA |
|
- biology |
|
- genomics |
|
- protein |
|
- kmer |
|
- cancer |
|
- gleason-grade-group |
|
--- |
|
## Project Description |
|
This repository contains the trained model for our paper, **Fine-tuning a Sentence Transformer for DNA & Protein tasks**, currently under review at BMC Bioinformatics. The model, called **simcse-dna**, is based on the original implementation of **SimCSE [1]**. It was adapted for DNA downstream tasks by training on a small sample of k-mer tokens generated from the human reference genome, and it can be used to produce sentence embeddings for DNA tasks.
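As an illustrative sketch (not necessarily the paper's exact preprocessing), a DNA sequence can be split into overlapping 6-mer tokens like this; the function name `kmer_tokens` is our own for this example:

```python
def kmer_tokens(sequence: str, k: int = 6) -> str:
    """Split a DNA sequence into overlapping k-mers, joined by spaces."""
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# A 10-base sequence yields five overlapping 6-mers
print(kmer_tokens("ACGTACGTAC"))  # ACGTAC CGTACG GTACGT TACGTA ACGTAC
```

Each space-separated string of k-mers then plays the role of a "sentence" for the tokenizer below.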
|
|
|
### Prerequisites |
|
|
Please see the original [SimCSE](https://github.com/princeton-nlp/SimCSE) repository for installation details. The model will also be hosted on Zenodo (DOI: 10.5281/zenodo.11046580).
|
|
|
### Usage |
|
|
|
Run the following code to get the sentence embeddings: |
|
|
|
```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("dsfsi/simcse-dna")
model = AutoModel.from_pretrained("dsfsi/simcse-dna")

# `sentences` is your list of DNA sequences tokenized into 6-mers
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Get the embeddings
with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output
```
|
The retrieved embeddings can then be used as input features for a downstream machine learning classifier.
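For example, the embeddings can be fed to a scikit-learn classifier. The sketch below uses synthetic arrays as stand-ins for the model's `pooler_output` (assumed here to be 768-dimensional, as in BERT-base) and random labels, so it runs without downloading the model; in practice you would substitute your real embeddings and task labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-ins for 768-dimensional sentence embeddings and binary labels
X = rng.normal(size=(200, 768))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```

With real embeddings, `X` would be `embeddings.numpy()` from the snippet above and `y` the labels of your classification task.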
|
|
|
## Performance on evaluation tasks |
|
|
|
Details of the datasets, and how to access them, are provided in the paper **(TBA)**.
|
|
|
**Table:** Accuracy scores (with 95% confidence intervals) across datasets T1–T8 for each model and embedding method. |
|
|
|
| Model | Embed. | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | |
|
|-------|-----------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------| |
|
| LR | Proposed | _0.65 ± 0.01_ | _0.67 ± 0.0_ | _0.85 ± 0.01_ | _0.64 ± 0.01_ | _0.80 ± 0.0_ | _0.49 ± 0.0_ | _0.33 ± 0.0_ | _0.70 ± 0.01_ | |
|
| | DNABERT | 0.62 ± 0.01 | 0.65 ± 0.0 | 0.84 ± 0.04 | 0.69 ± 0.01 | 0.85 ± 0.01 | 0.49 ± 0.0 | 0.33 ± 0.0 | 0.60 ± 0.01 | |
|
| | NT | **0.66 ± 0.0** | **0.67 ± 0.0** | 0.84 ± 0.01 | **0.73 ± 0.0** | **0.85 ± 0.01**| **0.81 ± 0.0** | **0.62 ± 0.01**| **0.99 ± 0.0** | |
|
|
| LGBM | Proposed | _0.64 ± 0.01_ | _0.66 ± 0.0_ | _0.90 ± 0.02_ | _0.61 ± 0.01_ | _0.78 ± 0.0_ | _0.49 ± 0.0_ | _0.33 ± 0.0_ | _0.81 ± 0.01_ | |
|
| | DNABERT | 0.62 ± 0.01 | 0.65 ± 0.01 | 0.90 ± 0.02 | 0.65 ± 0.01 | 0.83 ± 0.0 | 0.49 ± 0.0 | 0.33 ± 0.0 | 0.75 ± 0.01 | |
|
| | NT | 0.63 ± 0.01 | 0.66 ± 0.0 | **0.91 ± 0.02**| 0.72 ± 0.0 | **0.85 ± 0.0** | **0.80 ± 0.0** | **0.59 ± 0.01**| 0.97 ± 0.0 | |
|
|
| XGB | Proposed | _0.60 ± 0.01_ | _0.62 ± 0.0_ | _0.90 ± 0.02_ | _0.60 ± 0.0_ | _0.77 ± 0.0_ | _0.49 ± 0.0_ | _0.33 ± 0.0_ | _0.85 ± 0.01_ | |
|
| | DNABERT | 0.59 ± 0.01 | 0.62 ± 0.01 | 0.90 ± 0.01 | 0.64 ± 0.01 | 0.82 ± 0.01 | 0.49 ± 0.0 | 0.33 ± 0.0 | 0.79 ± 0.01 | |
|
| | NT | 0.61 ± 0.01 | 0.64 ± 0.0 | 0.90 ± 0.02 | **0.89 ± 0.03**| **0.85 ± 0.01**| **0.81 ± 0.01**| **0.60 ± 0.01**| 0.98 ± 0.0 | |
|
|
| RF | Proposed | _0.61 ± 0.0_ | _0.66 ± 0.01_ | _0.90 ± 0.02_ | _0.61 ± 0.01_ | _0.77 ± 0.0_ | _0.49 ± 0.0_ | _0.33 ± 0.0_ | _0.86 ± 0.0_ | |
|
| | DNABERT | 0.60 ± 0.0 | 0.66 ± 0.01 | 0.90 ± 0.02 | 0.63 ± 0.01 | 0.82 ± 0.0 | 0.49 ± 0.0 | 0.33 ± 0.0 | 0.81 ± 0.01 | |
|
| | NT | 0.62 ± 0.01 | **0.67 ± 0.01**| 0.90 ± 0.01 | 0.71 ± 0.01 | **0.85 ± 0.0** | **0.79 ± 0.0** | **0.55 ± 0.01**| 0.97 ± 0.0 | |
|
|
|
|
|
**Table:** F1-scores (with 95% confidence intervals) across datasets T1–T8 for each model and embedding method. |
|
|
|
| Model | Embed. | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | |
|
|-------|-----------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------| |
|
| LR | Proposed | **_0.78 ± 0.0_** | **_0.80 ± 0.01_** | _0.20 ± 0.05_ | _0.64 ± 0.01_ | _0.79 ± 0.0_ | _0.13 ± 0.37_ | _0.16 ± 0.0_ | _0.70 ± 0.01_ | |
|
| | DNABERT | 0.75 ± 0.01 | 0.78 ± 0.0 | 0.47 ± 0.09 | 0.69 ± 0.01 | 0.84 ± 0.01 | 0.13 ± 0.37 | 0.16 ± 0.0 | 0.59 ± 0.01 | |
|
| | NT | 0.56 ± 0.01 | 0.54 ± 0.0 | **0.78 ± 0.01**| **0.73 ± 0.0** | **0.85 ± 0.01**| **0.81 ± 0.0** | **0.62 ± 0.01**| **0.99 ± 0.0** | |
|
|
| LGBM | Proposed | _0.76 ± 0.01_ | _0.79 ± 0.0_ | _0.60 ± 0.11_ | _0.63 ± 0.01_ | _0.77 ± 0.0_ | _0.47 ± 0.20_ | _0.26 ± 0.04_ | _0.82 ± 0.0_ | |
|
| | DNABERT | 0.74 ± 0.0 | 0.78 ± 0.0 | 0.60 ± 0.08 | 0.66 ± 0.01 | 0.82 ± 0.01 | 0.47 ± 0.20 | 0.26 ± 0.04 | 0.75 ± 0.01 | |
|
| | NT | 0.59 ± 0.01 | 0.56 ± 0.0 | **0.89 ± 0.02**| **0.72 ± 0.01**| **0.85 ± 0.0** | **0.80 ± 0.0** | **0.59 ± 0.01**| **0.97 ± 0.0** | |
|
|
| XGB | Proposed | _0.72 ± 0.01_ | _0.75 ± 0.0_ | _0.59 ± 0.08_ | _0.60 ± 0.0_ | _0.76 ± 0.0_ | _0.47 ± 0.20_ | _0.26 ± 0.04_ | _0.85 ± 0.01_ | |
|
| | DNABERT | 0.71 ± 0.01 | 0.75 ± 0.01 | 0.58 ± 0.05 | 0.64 ± 0.01 | 0.82 ± 0.01 | 0.47 ± 0.20 | 0.26 ± 0.04 | 0.79 ± 0.01 | |
|
| | NT | 0.59 ± 0.01 | 0.57 ± 0.01 | 0.72 ± 0.01 | **0.85 ± 0.01**| **0.85 ± 0.01**| **0.81 ± 0.01**| **0.60 ± 0.01**| **0.99 ± 0.0** |
|
|
| RF | Proposed | _0.73 ± 0.0_ | _0.79 ± 0.0_ | _0.58 ± 0.08_ | _0.61 ± 0.01_ | _0.75 ± 0.0_ | _0.53 ± 0.17_ | _0.24 ± 0.05_ | _0.86 ± 0.0_ | |
|
| | DNABERT | 0.72 ± 0.0 | 0.79 ± 0.0 | 0.59 ± 0.09 | 0.63 ± 0.01 | 0.80 ± 0.01 | 0.53 ± 0.17 | 0.24 ± 0.05 | 0.82 ± 0.01 | |
|
| | NT | 0.59 ± 0.01 | 0.56 ± 0.01 | **0.89 ± 0.02**| **0.71 ± 0.01**| **0.84 ± 0.0** | **0.79 ± 0.0** | **0.55 ± 0.01**| **0.97 ± 0.0** | |
|
|
|
## Authors |
|
|
|
|
* Mpho Mokoatle, Vukosi Marivate, Darlington Mapiye, Riana Bornman, Vanessa M. Hayes |
|
* Contact: [email protected]
|
|
|
## Citation |
|
|
BibTeX reference **TBA**
|
|
|
### References |
|
|
|
<a id="1">[1]</a> |
|
Gao, Tianyu, Xingcheng Yao, and Danqi Chen. "SimCSE: Simple Contrastive Learning of Sentence Embeddings." arXiv preprint arXiv:2104.08821 (2021).