jaandoui committed · verified
Commit 49497cd · 1 Parent(s): f532fa8

Update README.md

Files changed (1)
  1. README.md +53 -37
README.md CHANGED
@@ -1,38 +1,54 @@
- ---
- metrics:
- - matthews_correlation
- - f1
- tags:
- - biology
- - medical
- ---
- This is the official pre-trained model introduced in [DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome](https://arxiv.org/pdf/2306.15006.pdf).
-
- We sincerely appreciate the MosaicML team for the [MosaicBERT](https://openreview.net/forum?id=5zipcfLC2Z) implementation, which serves as the base of DNABERT-2 development.
-
- DNABERT-2 is a transformer-based genome foundation model trained on multi-species genome.
-
- To load the model from huggingface:
- ```
- import torch
- from transformers import AutoTokenizer, AutoModel
-
- tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
- model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
- ```
-
- To calculate the embedding of a dna sequence
- ```
- dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
- inputs = tokenizer(dna, return_tensors = 'pt')["input_ids"]
- hidden_states = model(inputs)[0] # [1, sequence_length, 768]
-
- # embedding with mean pooling
- embedding_mean = torch.mean(hidden_states[0], dim=0)
- print(embedding_mean.shape) # expect to be 768
-
- # embedding with max pooling
- embedding_max = torch.max(hidden_states[0], dim=0)[0]
- print(embedding_max.shape) # expect to be 768
 
+ ---
+ metrics:
+ - matthews_correlation
+ - f1
+ tags:
+ - biology
+ - medical
+ ---
+ This version of DNABERT-2 has been modified to also output attention weights, for attention analysis.
+
+ Most of the modifications were made in Bert_Layer.py.
+ The model was modified primarily for fine-tuning and has not been tested for pretraining.
+ Each modification is marked with "JAANDOUI" before or next to it; to see all modifications, search for "JAANDOUI".
+ "JAANDOUI TODO" marks places where something may still be missing if that part is ever used.
+
+ Now in `Trainer` (or `CustomTrainer` if overridden), in `compute_loss(..)`, when calling the model:
+ `outputs = model(**inputs, return_dict=True, output_attentions=True)`
+ activate the extraction of attention with `output_attentions=True` (and optionally `return_dict=True`).
+ You can then extract the attention weights from `outputs.attentions`.
+ Read more about model outputs here: https://huggingface.co/docs/transformers/v4.40.2/en/main_classes/output#transformers.utils.ModelOutput
+
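+ For reference, here is a minimal sketch (not part of the original repository) of a `CustomTrainer` whose `compute_loss` requests and stores the attentions; the loss handling below is an assumption and may need adapting to your own fine-tuning setup:
+ ```
+ from transformers import Trainer
+
+ class CustomTrainer(Trainer):
+     # Sketch only: keep the attentions of the most recent forward pass for later analysis.
+     def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
+         outputs = model(**inputs, return_dict=True, output_attentions=True)
+         # Assumed layout: tuple with one tensor per layer, each [batch, num_heads, seq_len, seq_len].
+         self.last_attentions = outputs.attentions
+         loss = outputs.loss  # assumes the batch in `inputs` contains labels
+         return (loss, outputs) if return_outputs else loss
+ ```
+ A trainer defined this way can be passed to your fine-tuning script wherever `Trainer` would otherwise be used.
+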
+ To the authors of DNABERT-2: feel free to use these modifications.
+
+ The official DNABERT-2 paper: [DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome](https://arxiv.org/pdf/2306.15006.pdf).
+
+ README OF THE OFFICIAL DNABERT-2:
+ We sincerely appreciate the MosaicML team for the [MosaicBERT](https://openreview.net/forum?id=5zipcfLC2Z) implementation, which serves as the base of DNABERT-2 development.
+
+ DNABERT-2 is a transformer-based genome foundation model trained on multi-species genomes.
+
+ To load the model from Hugging Face:
+ ```
+ import torch
+ from transformers import AutoTokenizer, AutoModel
+
+ tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
+ model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
+ ```
+
+ To calculate the embedding of a DNA sequence:
+ ```
+ dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
+ inputs = tokenizer(dna, return_tensors='pt')["input_ids"]
+ hidden_states = model(inputs)[0]  # [1, sequence_length, 768]
+
+ # embedding with mean pooling
+ embedding_mean = torch.mean(hidden_states[0], dim=0)
+ print(embedding_mean.shape)  # expect to be 768
+
+ # embedding with max pooling
+ embedding_max = torch.max(hidden_states[0], dim=0)[0]
+ print(embedding_max.shape)  # expect to be 768
  ```
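+
+ Once `outputs.attentions` has been extracted (for example inside `compute_loss` as shown above), it can be aggregated for analysis. Below is a minimal sketch, assuming the standard BERT-style layout of one `[batch, num_heads, seq_len, seq_len]` tensor per layer; the helper name is ours, not part of the repository:
+ ```
+ import torch
+
+ def mean_attention_map(attentions, layer=-1):
+     # attentions: tuple of per-layer tensors, assumed [batch, num_heads, seq_len, seq_len]
+     # Returns the head-averaged attention map of the chosen layer for the first sequence.
+     return attentions[layer][0].mean(dim=0)  # [seq_len, seq_len]
+
+ # Dummy tensors standing in for real attentions (2 layers, batch of 1, 4 heads, 6 tokens):
+ dummy_attentions = tuple(torch.rand(1, 4, 6, 6) for _ in range(2))
+ print(mean_attention_map(dummy_attentions).shape)  # torch.Size([6, 6])
+ ```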