InstructBioMol
Paper • Project • Quickstart • Citation
InstructBioMol is a multimodal large language model that bridges natural language with biomolecules (proteins and small molecules). It achieves any-to-any alignment between natural language, molecules, and proteins through comprehensive instruction tuning.
For detailed information, please refer to our paper and code repository.
Model Name | Stage | Multimodal | Description |
---|---|---|---|
InstructBioMol-base (This Model) | Pretraining | ❎ | Continually pretrained on molecular sequences, protein sequences, and scientific literature. |
InstructBioMol-instruct-stage1 | Instruction tuning (stage 1) | ✅ | Stage 1 instruction-tuned model with biomolecular multimodal processing capabilities (e.g., 3D molecules/proteins). |
InstructBioMol-instruct | Instruction tuning (stage 1 and 2) | ✅ | Fully instruction-tuned model (stages 1 and 2) with biomolecular multimodal processing capabilities (e.g., 3D molecules/proteins). |
Base Architecture: LLaMA-2-7B
Training Data:
1. Molecular Sequences
2. Protein Sequences: each residue is prefixed with `<p>` (e.g., `<p>M<p>A<p>L<p>W...`).
3. Natural Language Texts
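The `<p>`-prefixed protein format in the example above can be produced from a plain one-letter amino-acid string with a small helper. This is a minimal sketch inferred from that example, not part of the released code; the function names `to_prefixed_protein` and `from_prefixed_protein` are hypothetical.

```python
def to_prefixed_protein(sequence: str) -> str:
    """Prefix every residue with '<p>' (hypothetical helper, inferred from the example format)."""
    # Remove whitespace and normalise to upper-case one-letter codes.
    cleaned = "".join(sequence.split()).upper()
    return "".join(f"<p>{residue}" for residue in cleaned)


def from_prefixed_protein(text: str) -> str:
    """Drop the '<p>' markers to recover the plain residue string."""
    return text.replace("<p>", "")


print(to_prefixed_protein("MALW"))
print(from_prefixed_protein("<p>M<p>A<p>L<p>W"))
```

A helper like this may be convenient for building protein prompts and for reading generated sequences back into a conventional one-letter string.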
Training Objective: Causal language modeling (self-supervised)
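For readers unfamiliar with the objective, causal language modeling trains the model to predict each token from the tokens before it, and the loss is the average negative log-likelihood of the true next token. The sketch below illustrates that definition on toy numbers in plain Python; the function name `causal_lm_loss` and all values are illustrative assumptions, not the training code used for this model.

```python
import math


def causal_lm_loss(token_ids, next_token_probs):
    """Average negative log-likelihood of each token given its prefix.

    token_ids: list of int token ids for one sequence.
    next_token_probs: next_token_probs[t] is the predicted probability
        distribution (a list over the vocabulary) for the token at
        position t + 1, given tokens 0..t.
    """
    total = 0.0
    count = 0
    for t in range(len(token_ids) - 1):
        target = token_ids[t + 1]  # the label is simply the next token
        total += -math.log(next_token_probs[t][target])
        count += 1
    return total / count


# Toy vocabulary of 3 tokens and a 3-token sequence (illustrative values only).
tokens = [0, 2, 1]
probs = [
    [0.1, 0.1, 0.8],  # after token 0, probability 0.8 is placed on token 2
    [0.2, 0.5, 0.3],  # after tokens 0 and 2, probability 0.5 is placed on token 1
]
print(causal_lm_loss(tokens, probs))
```

Because the labels are the input sequence shifted by one position, no manual annotation is needed, which is why the objective is described as self-supervised.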
from transformers import LlamaForCausalLM, LlamaTokenizer
import torch
model_name = "hicai-zju/InstructBioMol-base"
tokenizer = LlamaTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(model_name, device_map="cuda:0")
prompt = "<p>M" # protein sequence
# prompt = "[C]" # molecule sequence
# prompt = 'Scientific' # natural language
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.7,
        top_p=0.9,
        do_sample=True
    )
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
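The `temperature` and `top_p` arguments in the snippet above control how the next token is sampled. The following plain-Python sketch shows the general idea of temperature scaling followed by nucleus (top-p) filtering on a toy logit vector. It is a simplified illustration, not the implementation used inside Transformers; the function name `sample_top_p` and the toy values are assumptions.

```python
import math
import random


def sample_top_p(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample one token index using temperature scaling and nucleus (top-p) filtering."""
    # Temperature: divide logits before the softmax; values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus: keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw one of the kept tokens in proportion to its probability.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


toy_logits = [4.0, 3.0, 0.0, -2.0]  # illustrative values only
print(sample_top_p(toy_logits, rng=random.Random(0)))
```

With these toy logits only the two most likely tokens survive the `top_p=0.9` cut, so the low-probability tail is never sampled. Lowering `temperature` or `top_p` makes generation more conservative; raising them makes it more diverse.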
@article{DBLP:journals/corr/abs-2410-07919,
author = {Xiang Zhuang and
Keyan Ding and
Tianwen Lyu and
Yinuo Jiang and
Xiaotong Li and
Zhuoyi Xiang and
Zeyuan Wang and
Ming Qin and
Kehua Feng and
Jike Wang and
Qiang Zhang and
Huajun Chen},
title = {InstructBioMol: Advancing Biomolecule Understanding and Design Following
Human Instructions},
journal = {CoRR},
volume = {abs/2410.07919},
year = {2024}
}