GENERanno-prokaryote-0.5b-base model

About

In this repository, we present GENERanno, a compact yet powerful genomic foundation model featuring a context length of 8k base pairs with single-nucleotide resolution and 500M parameters, trained on an expansive dataset comprising 715 billion base pairs of prokaryotic DNA. Our evaluations demonstrate that GENERanno consistently achieves state-of-the-art performance across a wide spectrum of biologically meaningful tasks, namely the Prokaryotic Gener Tasks (2025-5).

In addition, we present GENERanno-prokaryote-0.5b-cds-annotator-preview, a model meticulously finetuned for metagenomic annotation. In comprehensive evaluations, GENERanno-cds-annotator achieves higher accuracy than traditional HMM-based methods (e.g., GLIMMER3, GeneMarkS2, Prodigal) and recent LLM-based approaches (e.g., GeneLM), while demonstrating exceptional generalization ability on archaeal genomes. The detailed annotation results are provided here.
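
The snippet below is a minimal, speculative sketch of how the annotator preview might be loaded; it is not taken from this card. It assumes the checkpoint is published as GenerTeam/GENERanno-prokaryote-0.5b-cds-annotator-preview, that it loads through AutoModelForTokenClassification with trust_remote_code=True, and that it emits one label per nucleotide. Consult the GitHub repository below for the supported annotation pipeline.

# Speculative sketch (not from this card): the repository ID, the token-classification
# head, and the per-nucleotide label scheme are assumptions, not documented behavior.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

annot_id = "GenerTeam/GENERanno-prokaryote-0.5b-cds-annotator-preview"  # assumed repo ID
annot_tokenizer = AutoTokenizer.from_pretrained(annot_id)
annot_model = AutoModelForTokenClassification.from_pretrained(annot_id, trust_remote_code=True)

# A short illustrative contig; real inputs would be metagenomic sequences
contig = "ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAA"
enc = annot_tokenizer(contig, return_tensors="pt")

with torch.inference_mode():
    logits = annot_model(**enc).logits      # (1, sequence_length, num_labels)

# Assumed: argmax over labels gives a per-position CDS / non-CDS call
predicted_labels = logits.argmax(dim=-1)
print(predicted_labels)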

The code and implementation details are available on GitHub: https://github.com/GenerTeam/GENERanno.

How to use

Simple example: embedding


import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model using the pretrained model name
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERanno-prokaryote-0.5b-base")
model = AutoModel.from_pretrained("GenerTeam/GENERanno-prokaryote-0.5b-base", trust_remote_code=True)

# Get model configuration and maximum sequence length
config = model.config
max_length = config.max_position_embeddings

# Define input sequences
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

# Tokenize the sequences
# add_special_tokens=True includes the tokenizer's special tokens (e.g., BOS/EOS)
# Right-side padding keeps the BOS token at position 0 for every sequence
tokenizer.padding_side = "right"
inputs = tokenizer(
    sequences,
    add_special_tokens=True,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

# Perform a forward pass through the model to obtain the outputs, including hidden states
with torch.inference_mode():
    outputs = model(**inputs, output_hidden_states=True)

# Retrieve the hidden states from the last layer
# hidden_states shape: (batch_size, sequence_length, hidden_size)
hidden_states = outputs.hidden_states[-1]

# Option 1: Use the first token (BOS) as the sentence embedding
cls_embeddings = hidden_states[:, 0, :]

# Option 2: Use mean pooling over the token embeddings
# Use the attention mask to take care of the padded tokens
attention_mask = inputs["attention_mask"]  # Shape: (batch_size, sequence_length)
# Expand the attention mask dimensions so that it matches the hidden_states dimensions
expanded_mask = attention_mask.unsqueeze(-1).expand(hidden_states.size()).to(torch.float32)
# Sum the token embeddings, taking the mask into account
sum_embeddings = torch.sum(hidden_states * expanded_mask, dim=1)
# Divide by the number of non-padding tokens to get the mean
mean_embeddings = sum_embeddings / expanded_mask.sum(dim=1)

print("BOS Embeddings:", cls_embeddings)
print("Mean Embeddings:", mean_embeddings)

Citation

@article{li2025generanno,
    author = {Li, Qiuyi and Wu, Wei and Zhu, Yiheng and Feng, Fuli and Ye, Jieping and Wang, Zheng},
    title = {GENERanno: A Genomic Foundation Model for Metagenomic Annotation},
    elocation-id = {2025.06.04.656517},
    year = {2025},
    doi = {10.1101/2025.06.04.656517},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2025/06/05/2025.06.04.656517},
    journal = {bioRxiv}
}