# NT DNA Model

This is the DNA component of a jointly trained NT-ESM2 model pair for DNA-protein analysis.

## Model Details

- **Model Type**: Nucleotide Transformer (NT) for DNA sequences
- **Training**: Jointly trained with an ESM2 protein model
- **Architecture**: Transformer-based language model for DNA

## Usage

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer; both calls require trust_remote_code=True
repo_id = "vsubasri/joint-nt-esm2-transcript-coding-dna"
model = AutoModelForMaskedLM.from_pretrained(repo_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)

# Example usage: tokenize a DNA sequence and run a forward pass
dna_sequence = "ATCGATCGATCG"
inputs = tokenizer(dna_sequence, return_tensors="pt")
with torch.no_grad():  # gradients are not needed for inference
    outputs = model(**inputs)
```
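
Input sequences often arrive in mixed case or with embedded whitespace (for example, when copied from a FASTA file). The following is a minimal sketch of a normalization step that could run before tokenization. It assumes the tokenizer expects uppercase nucleotide letters from the alphabet `ACGTN`; that alphabet is an assumption and is not stated in this repository, and `clean_dna` is a hypothetical helper name.

```python
def clean_dna(sequence: str, allowed: str = "ACGTN") -> str:
    """Strip whitespace, uppercase the sequence, and reject unexpected characters.

    The default alphabet (A, C, G, T, N) is an assumption; adjust it to match
    the vocabulary shipped with the tokenizer.
    """
    # str.split() with no arguments splits on any whitespace, including newlines
    cleaned = "".join(sequence.split()).upper()
    invalid = sorted(set(cleaned) - set(allowed))
    if invalid:
        raise ValueError(f"Unexpected characters in DNA sequence: {invalid}")
    return cleaned


print(clean_dna("atcg atcg\natcg"))  # prints ATCGATCGATCG
```

The cleaned string can then be passed to the tokenizer in place of `dna_sequence`.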

## Training Details

- Jointly trained with protein sequences for cross-modal understanding
- Batch size: 8 (inferred from the training directory name)
- Context length: 4096 tokens
- Training data: transcript-specific coding sequences
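
Coding sequences that exceed the 4096-token context length need to be split before tokenization. The sketch below shows one possible way to do that on codon boundaries, so that each chunk stays in frame. The names `split_coding_sequence`, `nucleotides_per_token`, and `reserved_tokens` are hypothetical, and the number of nucleotides covered by one token is not stated here; the default of 1 is a conservative assumption that should be checked by tokenizing a sample sequence.

```python
def split_coding_sequence(
    sequence: str,
    max_tokens: int = 4096,
    nucleotides_per_token: int = 1,  # assumption: each token covers at least this many bases
    reserved_tokens: int = 2,        # assumption: room left for special tokens
) -> list[str]:
    """Split a coding sequence into in-frame chunks that fit a token budget."""
    budget = (max_tokens - reserved_tokens) * nucleotides_per_token
    window = budget - budget % 3  # round down to a whole number of codons
    if window <= 0:
        raise ValueError("Token budget is too small to hold a single codon")
    return [sequence[i:i + window] for i in range(0, len(sequence), window)]


chunks = split_coding_sequence("ATG" * 3000)
print([len(chunk) for chunk in chunks])
```

Every chunk except possibly the last has the same length, and concatenating the chunks reproduces the original sequence. If the tokenizer merges several nucleotides into one token, raising `nucleotides_per_token` yields longer chunks.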

## Files

- `config.json`: Model configuration
- `model.safetensors`: Model weights
- `tokenizer_config.json`: Tokenizer configuration
- `vocab.txt`: Vocabulary file
- `special_tokens_map.json`: Special tokens mapping
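
When working from a local copy of the repository, a quick check that the files listed above are present can catch an incomplete download. This is a minimal sketch using only the standard library; `missing_model_files` is a hypothetical helper, the directory path in the example is a placeholder, and the check covers only the files named in this list (the repository may contain others).

```python
from pathlib import Path

# File names taken from the list above
EXPECTED_FILES = (
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "vocab.txt",
    "special_tokens_map.json",
)


def missing_model_files(model_dir: str) -> list[str]:
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Placeholder path; point this at the local copy of the repository
print(missing_model_files("./joint-nt-esm2-transcript-coding-dna"))
```

An empty list means every listed file was found.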

## Citation

If you use this model, please cite the original Nucleotide Transformer (NT) paper and the joint NT-ESM2 training work.