---
license: mit
datasets:
- ZINC-22
language:
- en
tags:
- molecular-generation
- drug-discovery
- llama
- flash-attention
pipeline_tag: text-generation
---

# NovoMolGen

NovoMolGen is a family of molecular foundation models trained on 1.5 billion ZINC-22 molecules using Llama architectures and FlashAttention. It achieves state-of-the-art performance on both unconstrained and goal-directed molecule generation tasks.

## How to load

```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> tokenizer = AutoTokenizer.from_pretrained("chandar-lab/NovoMolGen_32M_SMILES_BPE", trust_remote_code=True)
>>> model = AutoModelForCausalLM.from_pretrained("chandar-lab/NovoMolGen_32M_SMILES_BPE", trust_remote_code=True)
```

## Quick-start (FlashAttention + bf16)

Continuing from the loading snippet above:

```python
>>> from accelerate import Accelerator

>>> acc = Accelerator(mixed_precision='bf16')
>>> model = acc.prepare(model)

>>> outputs = model.sample(tokenizer=tokenizer, batch_size=4)
>>> print(outputs['SMILES'])
```

## Transformers-native HF checkpoint (`revision="hf-checkpoint"`)

We also publish a Transformers-native checkpoint on the `hf-checkpoint` revision. This version loads directly with `AutoModelForCausalLM` and works out of the box with `.generate(...)`.

```python
>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> model = AutoModelForCausalLM.from_pretrained("chandar-lab/NovoMolGen_32M_SMILES_BPE", revision='hf-checkpoint', device_map='auto')
>>> tokenizer = AutoTokenizer.from_pretrained("chandar-lab/NovoMolGen_32M_SMILES_BPE", revision='hf-checkpoint')

>>> # Start every sequence from the BOS token and sample four molecules.
>>> input_ids = torch.tensor([[tokenizer.bos_token_id]]).expand(4, -1).contiguous().to(model.device)
>>> outs = model.generate(input_ids=input_ids, temperature=1.0, max_length=64, do_sample=True, pad_token_id=tokenizer.eos_token_id)

>>> # The tokenizer emits space-separated tokens, so strip spaces to recover SMILES strings.
>>> molecules = [t.replace(" ", "") for t in tokenizer.batch_decode(outs, skip_special_tokens=True)]
>>> molecules
['CCO[C@H](CNC(=O)N(CC(=O)OC(C)(C)C)c1cccc(Br)n1)C(F)(F)F',
 'CCn1nnnc1CNc1ncnc(N[C@H]2CCO[C@@H](C)C2)c1C',
 'CC(C)(O)CNC(=O)CC[C@H]1C[C@@H](NC(=O)COCC(F)F)C1',
 'Cc1ncc(C(=O)N2C[C@H]3[C@H](CNC(=O)c4cnn[nH]4)CCC[C@H]3C2)n1C']
```

## Citation

```bibtex
@misc{chitsaz2025novomolgenrethinkingmolecularlanguage,
      title={NovoMolGen: Rethinking Molecular Language Model Pretraining},
      author={Kamran Chitsaz and Roshan Balaji and Quentin Fournier and Nirav Pravinbhai Bhatt and Sarath Chandar},
      year={2025},
      eprint={2508.13408},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.13408},
}
```
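
## Validating generated molecules (optional)

Sampled strings are not guaranteed to parse as valid molecules, so a common post-processing step is to filter and canonicalize them before downstream use. Below is a minimal sketch of such a filter; it is not part of the NovoMolGen API, and it assumes RDKit is installed (`pip install rdkit`) and reuses the `molecules` list from the generation example above.

```python
from rdkit import Chem

def filter_valid(smiles_list):
    """Keep only SMILES that RDKit can parse, returned in canonical form."""
    canonical = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)  # returns None for unparseable SMILES
        if mol is not None:
            canonical.append(Chem.MolToSmiles(mol))  # canonical SMILES
    return canonical

valid = filter_valid(molecules)
print(f"{len(valid)}/{len(molecules)} generated strings are valid molecules")
```

Canonicalizing through RDKit also makes deduplication straightforward, since distinct SMILES spellings of the same molecule map to one canonical string.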