---
metrics:
- matthews_correlation
- f1
tags:
- biology
- medical
---

This version of DNABERT-2 has been modified to also output the attention weights, for attention analysis. Most of the modifications were made in `Bert_Layer.py`. The model was modified specifically for fine-tuning and has not been tried for pretraining. Each modification is marked with the comment "JAANDOUI" before or next to it, so to see all modifications, search for "JAANDOUI". "JAANDOUI TODO" marks places where something may still be missing if that part is going to be used.

In `Trainer` (or `CustomTrainer` if overridden), in `compute_loss(..)`, when calling the model:

```
outputs = model(**inputs, return_dict=True, output_attentions=True)
```

activate the extraction of attention with `output_attentions=True` (and, optionally, `return_dict=True`). You can then extract the attention from `outputs.attentions` (hedged sketches of this are given at the end of this README).

Read more about model outputs here: https://huggingface.co/docs/transformers/v4.40.2/en/main_classes/output#transformers.utils.ModelOutput

To the authors of DNABERT-2: feel free to use these modifications.

The official DNABERT-2 paper: [DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome](https://arxiv.org/pdf/2306.15006.pdf).

README OF THE OFFICIAL DNABERT-2:

We sincerely appreciate the MosaicML team for the [MosaicBERT](https://openreview.net/forum?id=5zipcfLC2Z) implementation, which serves as the base of DNABERT-2 development.

DNABERT-2 is a transformer-based genome foundation model trained on multi-species genomes.

To load the model from Hugging Face:

```
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
```

To calculate the embedding of a DNA sequence:

```
dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors='pt')["input_ids"]
hidden_states = model(inputs)[0]  # [1, sequence_length, 768]

# embedding with mean pooling
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(embedding_mean.shape)  # expect to be 768

# embedding with max pooling
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(embedding_max.shape)  # expect to be 768
```
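The snippet below is a minimal sketch of extracting attention weights at inference time with this modified model. The model identifier shown is the official DNABERT-2 checkpoint and is used only as a placeholder; substitute the path or name of this modified repository. It assumes `outputs.attentions` follows the usual Hugging Face convention: a tuple with one tensor per layer, each of shape `[batch, num_heads, seq_len, seq_len]`.

```
import torch
from transformers import AutoTokenizer, AutoModel

# Placeholder model id: replace with the path/name of this modified repository.
model_name = "zhihan1996/DNABERT-2-117M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors="pt")

with torch.no_grad():
    # Ask for attention weights alongside the hidden states.
    outputs = model(**inputs, return_dict=True, output_attentions=True)

# Assuming the usual Hugging Face convention: a tuple with one tensor per
# layer, each of shape [batch, num_heads, seq_len, seq_len].
attentions = outputs.attentions
print(len(attentions), attentions[0].shape)
```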
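For fine-tuning, the text above suggests calling the model with `output_attentions=True` inside `compute_loss(..)`. The sketch below shows one possible `CustomTrainer` doing this; the `collected_attentions` attribute is hypothetical (not part of `Trainer`), and the sketch assumes the wrapped model returns a loss when `labels` are present in `inputs`.

```
from transformers import Trainer

class CustomTrainer(Trainer):
    # Sketch of a Trainer subclass that also keeps the attention weights.

    def compute_loss(self, model, inputs, return_outputs=False):
        # Ask the model for attention weights alongside the usual outputs.
        outputs = model(**inputs, return_dict=True, output_attentions=True)

        # Hypothetical attribute for later attention analysis; detach and move
        # to CPU so the computation graph and GPU memory are not retained.
        self.collected_attentions = [a.detach().cpu() for a in outputs.attentions]

        # Assumes the model computes a loss when "labels" is in `inputs`.
        loss = outputs.loss
        return (loss, outputs) if return_outputs else loss
```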