---
license: mit
---

# Model Card for Omni-DNA

## Requirement

```bash
pip install datasets ai2-olmo
```

## Overview

Omni-DNA is a **cross-modal, multi-task genomic foundation model** designed to generalize across diverse genomic tasks. Unlike previous Genomic Foundation Models (GFMs), which require separate fine-tuning for each task, Omni-DNA combines **auto-regressive transformer pretraining** with **multi-task fine-tuning**, so a single model can perform a wide range of genomic tasks with **state-of-the-art** performance.

Omni-DNA models range from **20M to 1B** parameters and support tasks such as **sequence annotation, regulatory element classification, acetylation/methylation prediction, and DNA2Function/DNA2Image mapping**.

## Base Model Details

| Size  | Training Tokens | Layers | Hidden Size | Attention Heads | Context Length |
|-------|----------------|--------|-------------|-----------------|----------------|
| Omni-DNA 20M  | 300B | 8   | 256  | 8  | 250  |
| Omni-DNA 60M  | 300B | 8  | 512  | 8 | 250  |
| Omni-DNA 116M | 300B | 12  | 768 | 16 | 250  |
| Omni-DNA 300M | 300B | 16  | 1024 | 16 | 250  |
| Omni-DNA 700M | 300B | 16  | 1536 | 16 | 250  |
| Omni-DNA 1B   | 300B | 16  | 2048 | 16 | 250  |
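
As a quick sanity check against the table above, the sketch below loads one of the base checkpoints and counts its parameters. Note the assumptions: the repository name `zehui127/Omni-DNA-116M` mirrors the naming of the fine-tuned checkpoint used in the Usage section but is not confirmed here, and `trust_remote_code=True` may or may not be needed depending on your `transformers` version.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository name, mirroring the fine-tuned checkpoint below.
base_path = "zehui127/Omni-DNA-116M"

# trust_remote_code may be required for OLMo-based models on some
# transformers versions; drop it if your setup does not need it.
tokenizer = AutoTokenizer.from_pretrained(base_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(base_path, trust_remote_code=True)

# Compare the parameter count against the table above.
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.0f}M")

# The context window is 250 tokens, so longer inputs must be truncated.
enc = tokenizer("ATGC" * 100, return_tensors="pt", truncation=True, max_length=250)
print(enc["input_ids"].shape)
```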

## Model Description

<!-- - **Developed by:** Anonymous Authors -->
- **Supported by:** Microsoft Research Asia
- **Model type:** Auto-regressive transformer-based genomic model
- **License:** MIT
- **Date cutoff:** 2024
- **Contact:** Research inquiries at `[email protected]`

## Model Sources

- **Paper:** [Omni-DNA: Scaling Auto-Regressive Transformer to Multi-Tasking Genomic Foundation Model](https://arxiv.org/abs/2502.03499)
- **Codebase:** https://github.com/Zehui127/Omni-DNA
- **Dataset:** Pretrained on **300B nucleotides** from multi-species genome datasets

## Capabilities

Omni-DNA is trained to perform **multiple genomic tasks**, including:

- **Regulatory Element Classification:** Enhancer/promoter/splice site detection
- **Histone Modification Prediction:** Acetylation and methylation state identification
- **Genomic Function Annotation:** DNA-to-text mapping (DNA2Function)
- **Cross-modal Learning:** DNA-to-image mapping (DNA2Image)
- **Multi-task Learning:** A single model can solve multiple tasks simultaneously (see the sketch after this list)
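
The fine-tuned checkpoints expose these tasks through plain generation. As a minimal sketch, not the official API: assuming a multi-task checkpoint named `zehui127/Omni-DNA-Multitask` (a hypothetical name) and the same `[MASK]` prompting shown in the Usage section below, a classification query could look like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical multi-task checkpoint name; substitute the published repository.
path = "zehui127/Omni-DNA-Multitask"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path)

# Classification posed as generation: the label text is produced after [MASK].
sequence = "GATTACA" * 20  # toy input; real tasks use task-specific sequences
prompt = sequence + "[MASK]"
inputs = tokenizer([prompt], return_tensors="pt", return_token_type_ids=False)
out = model.generate(**inputs, max_new_tokens=10, do_sample=False)
decoded = tokenizer.batch_decode(out, skip_special_tokens=True)[0]
# Keep only the text generated after [MASK]; [-1] is a no-op if [MASK]
# was stripped during decoding.
print(decoded.split("[MASK]", 1)[-1].strip())
```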

## Usage

```python
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def preprocess_response(response, mask_token="[MASK]"):
    """
    Extract the generated annotation that follows the [MASK] token.

    Args:
        response (str): The raw decoded model output.
        mask_token (str): The token after which the response is extracted.

    Returns:
        str: Processed response text.
    """
    if mask_token in response:
        response = response.split(mask_token, 1)[1]
    # Drop leading whitespace and any echoed nucleotides (A/T/G/C).
    response = re.sub(r'^[\sATGC]+', '', response)
    return response

def generate(message, model, tokenizer, device):
    # Append [MASK] so the model generates the function description after it.
    message = message + "[MASK]"
    tokenized_message = tokenizer(
        [message], return_tensors='pt', return_token_type_ids=False, add_special_tokens=True
    ).to(device)
    response = model.generate(**tokenized_message, max_new_tokens=110, do_sample=False)
    reply = tokenizer.batch_decode(response, skip_special_tokens=True)[0]
    return preprocess_response(reply)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_tokenizer_path = "zehui127/Omni-DNA-DNA2Function"
tokenizer = AutoTokenizer.from_pretrained(model_tokenizer_path)
model = AutoModelForCausalLM.from_pretrained(model_tokenizer_path).to(device)

# Define the input DNA sequence
dna = "TGCTGGCTTCAGGGGCACAGATGCTAACATTGGAGCGATACAGAGAAGATTAACGTGGCCACTGCGCAAGCATGACATGCAAACTCGTAAAGCATTCTTTTAATTT"
print(generate(dna, model, tokenizer, device))
```
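
The call returns the model's free-text functional description of the input sequence; `preprocess_response` strips the echoed nucleotides so that only the annotation after `[MASK]` is kept.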