---
metrics:
- matthews_correlation
- f1
tags:
- biology
- medical
---
This version of DNABERT2 has been modified to also output the attention weights, so that they can be used for attention analysis.

Most of the modifications were made in Bert_Layer.py.
The changes were made with fine-tuning in mind and have not been tried for pretraining.
Each modification is marked with "JAANDOUI" before or next to it, so to see all modifications, search for "JAANDOUI".
"JAANDOUI TODO" means that something might still be missing in that part if it is going to be used.

Now, in `Trainer` (or `CustomTrainer` if overridden), activate the extraction of attention when calling the model inside `compute_loss(..)` by passing `output_attentions=True` (and, optionally, `return_dict=True`):
`outputs = model(**inputs, return_dict=True, output_attentions=True)`
You can then access the attention weights through `outputs.attentions`.
Read more about model outputs here: https://huggingface.co/docs/transformers/v4.40.2/en/main_classes/output#transformers.utils.ModelOutput
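
For illustration, here is a minimal sketch of what such an overridden `compute_loss(..)` could look like. The `CustomTrainer` name and the `self.saved_attentions` attribute are illustrative rather than part of this repository, and the sketch assumes `inputs` contains `labels` so that `outputs.loss` is populated.

```
# Illustrative sketch only: a Trainer subclass that keeps the attention
# weights returned by the modified DNABERT2 forward pass.
from transformers import Trainer

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # Ask the model to return attentions alongside the loss/logits.
        outputs = model(**inputs, return_dict=True, output_attentions=True)
        loss = outputs.loss

        # Stash the per-layer attention tensors for later analysis
        # (attribute name is hypothetical, not defined by this repo).
        self.saved_attentions = outputs.attentions

        return (loss, outputs) if return_outputs else loss
```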

To the authors of DNABERT2: feel free to use these modifications.

The official DNABERT2 paper: [DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome](https://arxiv.org/pdf/2306.15006.pdf).

README OF THE OFFICIAL DNABERT2:

We sincerely appreciate the MosaicML team for the [MosaicBERT](https://openreview.net/forum?id=5zipcfLC2Z) implementation, which serves as the base of DNABERT-2 development.

DNABERT-2 is a transformer-based genome foundation model trained on multi-species genomes.

To load the model from Hugging Face:
```
import torch
from transformers import AutoTokenizer, AutoModel

# trust_remote_code=True is needed because DNABERT-2 ships custom modeling code
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
```

To calculate the embedding of a DNA sequence:
```
dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors="pt")["input_ids"]
hidden_states = model(inputs)[0]  # shape: [1, sequence_length, 768]

# embedding with mean pooling over the sequence dimension
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(embedding_mean.shape)  # expect torch.Size([768])

# embedding with max pooling over the sequence dimension
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(embedding_max.shape)  # expect torch.Size([768])
```
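
Since this fork exposes the attention weights, they can also be inspected outside of training. The snippet below is a minimal sketch that continues from the code above (reusing `model` and `inputs`); it assumes that this modified model (rather than the upstream checkpoint) is loaded, that it accepts `output_attentions=True` and `return_dict=True` at inference time, and that `outputs.attentions` follows the usual Hugging Face convention of one tensor per layer with shape `[batch, num_heads, seq_len, seq_len]`. The exact shapes may differ in this fork's custom Bert_Layer.py.

```
# Minimal sketch (assumption): inspect attention weights at inference time,
# reusing `model` and `inputs` from the snippets above and assuming the
# modified DNABERT2 (not the upstream checkpoint) is loaded.
with torch.no_grad():
    outputs = model(inputs, output_attentions=True, return_dict=True)

attentions = outputs.attentions  # tuple with one attention tensor per layer
print(len(attentions))           # number of layers
print(attentions[0].shape)       # e.g. [1, num_heads, seq_len, seq_len]
```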