InstructBioMol: A Multimodal LLM for Biomolecule Understanding and Design

Model Description

InstructBioMol is a multimodal large language model that bridges natural language with biomolecules (proteins and small molecules). It achieves any-to-any alignment between natural language, molecules, and proteins through comprehensive instruction tuning.

For detailed information, please refer to our paper and code repository.

Released Variants

Model Name | Stage | Multimodal | Description
--- | --- | --- | ---
InstructBioMol-base (this model) | Pretraining | No | Continually pretrained on molecular sequences, protein sequences, and scientific literature.
InstructBioMol-instruct-stage1 | Instruction tuning (stage 1) | Yes | Stage 1 instruction-tuned model with biomolecular multimodal processing capabilities (e.g., 3D molecules/proteins).
InstructBioMol-instruct | Instruction tuning (stages 1 and 2) | Yes | Fully instruction-tuned model (stage 1 and stage 2) with biomolecular multimodal processing capabilities (e.g., 3D molecules/proteins).

Training Details

Base Architecture: LLaMA-2-7B

Training Data:

1. Molecular Sequences:

  • Format: SELFIES
  • Source: PubChem
  • Size: 100 million (100M) entries

2. Protein Sequences:

  • Format: FASTA-like, with each residue prefixed by <p> (e.g., <p>M<p>A<p>L<p>W...)
  • Source: UniRef50
  • Size: 59 million (59M) entries

3. Natural Language Texts:

  • Source: Abstracts from PubMed, bioRxiv, and ChemRxiv
  • Size: 6 million (6M) abstracts
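The two sequence formats above can be illustrated with a few string helpers. This is a minimal sketch, not the authors' preprocessing code: the function names are hypothetical, and it assumes only what the list states (one `<p>` marker before every residue) plus the fact that a SELFIES string is a run of bracketed symbols.

```python
import re
from typing import List


def to_tagged_protein(sequence: str) -> str:
    """Prefix every residue with <p>, e.g. 'MALW' -> '<p>M<p>A<p>L<p>W'."""
    sequence = sequence.strip().upper()
    return "".join(f"<p>{residue}" for residue in sequence)


def from_tagged_protein(tagged: str) -> str:
    """Drop the <p> markers and any whitespace to recover the plain sequence."""
    return re.sub(r"\s+", "", tagged.replace("<p>", ""))


def split_selfies(selfies: str) -> List[str]:
    """Split a SELFIES string into its bracketed symbols."""
    return re.findall(r"\[[^\[\]]+\]", selfies)


print(to_tagged_protein("MALW"))
print(from_tagged_protein("<p>M<p>A<p>L<p>W"))
print(split_selfies("[C][C][O]"))  # ethanol written as SELFIES symbols
```

Helpers like these are handy for building prompts from plain FASTA sequences and for turning generated text back into a sequence that downstream tools accept. Whether the tokenizer keeps the `<p>` markers in decoded output is not stated here, so check the decoded string before post-processing.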

Training Objective: Causal language modeling (self-supervised)

Quick Start

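from transformers import LlamaForCausalLM, LlamaTokenizer
import torch

model_name = "hicai-zju/InstructBioMol-base"
tokenizer = LlamaTokenizer.from_pretrained(model_name)
# device_map requires the `accelerate` package; change "cuda:0" to match your hardware
model = LlamaForCausalLM.from_pretrained(model_name, device_map="cuda:0")

# Choose one prompt; the model continues whichever sequence type it is given
prompt = "<p>M"  # protein sequence
# prompt = "[C]"  # molecule sequence (SELFIES)
# prompt = "Scientific"  # natural language
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a continuation without tracking gradients
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,  # upper bound on generated tokens
        temperature=0.7,     # values below 1 sharpen the distribution
        top_p=0.9,           # nucleus sampling threshold
        do_sample=True       # sample instead of greedy decoding
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)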
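The `temperature` and `top_p` arguments above control how the next token is sampled. The following is a simplified sketch of the idea in plain Python, assuming a toy vocabulary given as a dict; `nucleus_filter` is a hypothetical helper, and the actual implementation in `transformers` differs in detail.

```python
import math
from typing import Dict


def nucleus_filter(logits: Dict[str, float],
                   temperature: float = 0.7,
                   top_p: float = 0.9) -> Dict[str, float]:
    """Apply temperature scaling, then keep the top-p nucleus and renormalize."""
    # Temperature scaling: dividing by a value below 1 sharpens the distribution.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    max_s = max(scaled.values())
    exp = {tok: math.exp(s - max_s) for tok, s in scaled.items()}
    total = sum(exp.values())
    probs = {tok: e / total for tok, e in exp.items()}

    # Keep the most probable tokens until their cumulative mass reaches top_p.
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1 before sampling.
    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}


print(nucleus_filter({"<p>A": 2.0, "<p>L": 1.5, "<p>W": -3.0}))
```

Lower values of either parameter make generations more conservative, while higher values increase diversity at the cost of more implausible sequences.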

Citation

@article{DBLP:journals/corr/abs-2410-07919,
  author       = {Xiang Zhuang and
                  Keyan Ding and
                  Tianwen Lyu and
                  Yinuo Jiang and
                  Xiaotong Li and
                  Zhuoyi Xiang and
                  Zeyuan Wang and
                  Ming Qin and
                  Kehua Feng and
                  Jike Wang and
                  Qiang Zhang and
                  Huajun Chen},
  title        = {InstructBioMol: Advancing Biomolecule Understanding and Design Following
                  Human Instructions},
  journal      = {CoRR},
  volume       = {abs/2410.07919},
  year         = {2024}
}