GENERanno-prokaryote-0.5b-base model
About
In this repository, we present GENERanno, a compact yet powerful genomic foundation model featuring a context length of 8k base pairs with single-nucleotide resolution and 500M parameters, trained on an expansive dataset comprising 715 billion base pairs of prokaryotic DNA. Our evaluations demonstrate that GENERanno consistently achieves state-of-the-art performance across a wide spectrum of biologically meaningful tasks, namely the Prokaryotic Gener Tasks (2025-5).
In addition, we present GENERanno-prokaryote-0.5b-cds-annotator-preview, a model meticulously fine-tuned for metagenomic annotation. In comprehensive evaluations, GENERanno-cds-annotator achieves higher accuracy than traditional HMM-based methods (e.g., GLIMMER3, GeneMarkS2, Prodigal) and recent LLM-based approaches (e.g., GeneLM), while demonstrating strong generalization to archaeal genomes. The detailed annotation results are provided here.
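As a rough illustration only, the preview annotator checkpoint can be loaded with the standard transformers auto classes. The snippet below assumes the model exposes a per-nucleotide token-classification head; the actual head type, label vocabulary, and decoding logic depend on the released configuration, so consult the GitHub repository for the supported inference pipeline.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical sketch: assumes a token-classification head; see the GENERanno
# GitHub repository for the actual annotation pipeline and label semantics.
annotator_name = "GenerTeam/GENERanno-prokaryote-0.5b-cds-annotator-preview"
tokenizer = AutoTokenizer.from_pretrained(annotator_name)
annotator = AutoModelForTokenClassification.from_pretrained(annotator_name, trust_remote_code=True)

sequence = "ATGAGGTGGCAAGAAATGGGCTAC"
inputs = tokenizer(sequence, return_tensors="pt")
with torch.inference_mode():
    logits = annotator(**inputs).logits  # (1, sequence_length, num_labels)

# Per-position label predictions (label meanings come from the model config)
predictions = logits.argmax(dim=-1)
print(predictions)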
The code and implementation details are available on GitHub: https://github.com/GenerTeam/GENERanno.
How to use
Simple example: embedding
import torch
from transformers import AutoTokenizer, AutoModel
# Load the tokenizer and model using the pretrained model name
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERanno-prokaryote-0.5b-base")
model = AutoModel.from_pretrained("GenerTeam/GENERanno-prokaryote-0.5b-base", trust_remote_code=True)
# Get model configuration and maximum sequence length
config = model.config
max_length = config.max_position_embeddings
# Define input sequences
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]
# Tokenize the sequences
# add_special_tokens=True prepends/appends the model's special tokens (e.g., BOS)
tokenizer.padding_side = "right"
inputs = tokenizer(
    sequences,
    add_special_tokens=True,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)
# Perform a forward pass through the model to obtain the outputs, including hidden states
with torch.inference_mode():
    outputs = model(**inputs, output_hidden_states=True)
# Retrieve the hidden states from the last layer
# hidden_states shape: (batch_size, sequence_length, hidden_size)
hidden_states = outputs.hidden_states[-1]
# Option 1: Use the first token (BOS) as the sentence embedding
cls_embeddings = hidden_states[:, 0, :]
# Option 2: Use mean pooling over the token embeddings
# Use the attention mask to take care of the padded tokens
attention_mask = inputs["attention_mask"] # Shape: (batch_size, sequence_length)
# Expand the attention mask dimensions so that it matches the hidden_states dimensions
expanded_mask = attention_mask.unsqueeze(-1).expand(hidden_states.size()).to(torch.float32)
# Sum the token embeddings, taking the mask into account
sum_embeddings = torch.sum(hidden_states * expanded_mask, dim=1)
# Compute the average by dividing by the number of non-padded tokens (clamped to avoid division by zero)
mean_embeddings = sum_embeddings / expanded_mask.sum(dim=1).clamp(min=1e-9)
print("BOS Embeddings:", cls_embeddings)
print("Mean Embeddings:", mean_embeddings)
Citation
@article{li2025generanno,
  author       = {Li, Qiuyi and Wu, Wei and Zhu, Yiheng and Feng, Fuli and Ye, Jieping and Wang, Zheng},
  title        = {GENERanno: A Genomic Foundation Model for Metagenomic Annotation},
  elocation-id = {2025.06.04.656517},
  year         = {2025},
  doi          = {10.1101/2025.06.04.656517},
  publisher    = {Cold Spring Harbor Laboratory},
  URL          = {https://www.biorxiv.org/content/early/2025/06/05/2025.06.04.656517},
  journal      = {bioRxiv}
}