---
license: cc-by-sa-4.0
---
## Project Description
This repository contains the trained model for our paper, **Fine-tuning a Sentence Transformer for DNA & Protein tasks**, currently under review at BMC Bioinformatics. The model, called **simcse-dna**, is based on the original implementation of **SimCSE [1]**. It was adapted for DNA downstream tasks by training on a small sample of k-mer tokens generated from the human reference genome, and can be used to generate sentence embeddings for DNA tasks.

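For context, the k-mer tokens mentioned above are overlapping fixed-length windows over a DNA sequence. The sketch below is purely illustrative of 6-mer tokenization; the `kmerize` helper, the example sequence, and the space-separated output format are our assumptions, not the paper's exact pipeline:

```python
# Illustrative sketch: split a DNA sequence into overlapping 6-mers,
# the token size this model was trained on (not the paper's exact pipeline).
def kmerize(sequence: str, k: int = 6) -> str:
    """Return the sequence as space-separated overlapping k-mers."""
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))

print(kmerize("ATGCGTACGT"))
# ATGCGT TGCGTA GCGTAC CGTACG GTACGT
```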

### Prerequisites
-----------
Please see the original [SimCSE](https://github.com/princeton-nlp/SimCSE) repository for installation details. The model will be hosted on Zenodo (DOI: 10.5281/zenodo.11046580). It is also available on 🤗 [Hugging Face](https://huggingface.co/dsfsi/simcse-dna).

### Usage

Download the model into a local directory, then run the following code to generate sentence embeddings:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the trained model and tokenizer from the local directory
tokenizer = AutoTokenizer.from_pretrained("/path/to/model/directory/")
model = AutoModel.from_pretrained("/path/to/model/directory/")

# sentences is your list of n DNA tokens of size 6 (6-mers)
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Get the sentence embeddings from the pooler output
with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output
```
The retrieved embeddings can then be used as input features for a machine learning classifier.

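As an illustration, the sketch below trains a scikit-learn random forest (mirroring the best-performing classifier in the tables below) on the embeddings; the `labels` array, the train/test split, and all hyperparameters are our assumptions, not the paper's evaluation protocol:

```python
# Illustrative only: classify sequences using the embeddings from above.
# `embeddings` is the (n_sentences, hidden_size) tensor computed earlier;
# `labels` (one class label per sentence) is assumed to exist.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X = embeddings.numpy()
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

preds = clf.predict(X_test)
print(accuracy_score(y_test, preds), f1_score(y_test, preds, average="weighted"))
```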

## Performance on evaluation tasks

Find out more about the datasets and how to access them in the paper **(TBA)**.

### Task 1: Detection of colorectal cancer cases (after oversampling)

| | 5-fold cross-validation accuracy (%) | Test accuracy (%) |
| --- | --- | --- |
| LightGBM | 91 | 63 |
| Random Forest | **94** | **71** |
| XGBoost | 93 | 66 |
| CNN | 42 | 52 |

| | 5-fold cross-validation F1 (%) | Test F1 (%) |
| --- | --- | --- |
| LightGBM | 91 | 66 |
| Random Forest | **94** | **72** |
| XGBoost | 93 | 66 |
| CNN | 41 | 60 |

### Task 2: Prediction of the Gleason grade group (after oversampling)

| | 5-fold cross-validation accuracy (%) | Test accuracy (%) |
| --- | --- | --- |
| LightGBM | 97 | 68 |
| Random Forest | **98** | **78** |
| XGBoost | 97 | 70 |
| CNN | 35 | 50 |

| | 5-fold cross-validation F1 (%) | Test F1 (%) |
| --- | --- | --- |
| LightGBM | 97 | 70 |
| Random Forest | **98** | **80** |
| XGBoost | 97 | 70 |
| CNN | 33 | 59 |

### Task 3: Detection of human TATA sequences (after oversampling)

| | 5-fold cross-validation accuracy (%) | Test accuracy (%) |
| --- | --- | --- |
| LightGBM | 98 | 93 |
| Random Forest | **99** | **96** |
| XGBoost | **99** | 95 |
| CNN | 38 | 59 |

| | 5-fold cross-validation F1 (%) | Test F1 (%) |
| --- | --- | --- |
| LightGBM | 98 | 92 |
| Random Forest | **99** | **95** |
| XGBoost | **99** | 92 |
| CNN | 58 | 10 |

## Authors
-----------

* Written by: Mpho Mokoatle, Vukosi Marivate, Darlington Mapiye, Riana Bornman, Vanessa M. Hayes
* Contact details: [email protected]

## Citation
-----------
BibTeX reference **TBA**

### References

<a id="1">[1]</a>
Gao, Tianyu, Xingcheng Yao, and Danqi Chen. "SimCSE: Simple contrastive learning of sentence embeddings." arXiv preprint arXiv:2104.08821 (2021).