BiooBang is a biological language model designed to integrate protein amino acid sequences and mRNA coding sequences (CDS) within a unified framework. Built on a Transformer-based prefix-decoder architecture, BiooBang applies the principles of natural language processing, treating protein and CDS sequences as biological “languages” and training on both jointly through self-supervised learning.

Note

This model was fine-tuned specifically for codon optimization in the HEK293T cell line.

Use Case

The source code for this model is publicly available on GitHub: https://github.com/lonelycrab888/BiooBang

After installing BiooBang following the repository's instructions, you can generate an optimized CDS from a protein sequence as follows:

# ========== generate CDS
import torch
from model.tokenization_UniBioseq import UBSLMTokenizer
from model.modeling_UniBioseq import UniBioseqForCausalLM
from model.UBL_utils import CodonLogitsProcessor

from transformers.generation.logits_process import LogitsProcessorList

tokenizer = UBSLMTokenizer.from_pretrained("lonelycrab88/BiooBang-1.0-HEK293T")
model = UniBioseqForCausalLM.from_pretrained("lonelycrab88/BiooBang-1.0-HEK293T", device_map='auto')

# Protein sequence to reverse-translate into an optimized CDS.
input_protein = "MASSDKQTSPKPPPSPSPLRNSKFCQSNMRILIS"

# Encode the protein prompt and append token id 36, the special token used to prompt CDS generation.
input_ids = torch.tensor([tokenizer.encode(input_protein) + [36]]).to(model.device)

# Length budget: the prompt plus 3 nucleotide tokens per residue, a stop codon, and special tokens.
max_length = 4 * len(input_protein) + 6

# Constrain decoding so each generated codon encodes the corresponding input residue.
logits_processor = LogitsProcessorList()
logits_processor.append(CodonLogitsProcessor(input_protein, tokenizer, len(input_protein)))

result = model.generate(input_ids, max_length=max_length, num_beams=10, logits_processor=logits_processor, low_memory=True, num_return_sequences=1)

# Strip the protein prompt and special tokens, then join the codon tokens into an uppercase CDS string.
result_CDS_tok = tokenizer.decode(result[0][len(input_protein) + 3:].tolist()).replace(" ", "").upper()
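
As a quick sanity check, you can translate the generated CDS back to amino acids and confirm it matches the input protein. This is a minimal sketch using Biopython, which is an extra dependency not required by BiooBang itself:

# ========== verify the generated CDS (optional sanity check)
# Assumption: Biopython is installed separately (e.g. pip install biopython).
from Bio.Seq import Seq

# Translate with the standard codon table, stopping at the first stop codon.
back_translated = str(Seq(result_CDS_tok).translate(to_stop=True))
assert back_translated == input_protein, "generated CDS does not encode the input protein"
print(result_CDS_tok)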

Citing this Work

Please cite our paper:

@article{Zhao2024.10.24.620004,
    author = {Zhao, Heng-Rui and Cheng, Meng-Ting and Zhu, Jinhua and Wang, Hao and Yang, Xiang-Rui and Wang, Bo and Sun, Yuan-Xin and Fang, Ming-Hao and Chen, Enhong and Li, Houqiang and Han, Shu-Jing and Chen, Yuxing and Zhou, Cong-Zhao},
    title = {Integration of protein and coding sequences enables mutual augmentation of the language model},
    elocation-id = {2024.10.24.620004},
    year = {2024},
    doi = {10.1101/2024.10.24.620004},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2024/10/29/2024.10.24.620004},
    eprint = {https://www.biorxiv.org/content/early/2024/10/29/2024.10.24.620004.full.pdf},
    journal = {bioRxiv}
}

Contacts

If you’re interested in other cell lines and open to collaboration, please don’t hesitate to contact us!
