ether0

ether0 is a 24B-parameter language model trained to reason in English and output molecular structures as SMILES. It was derived from Mistral-Small-24B-Instruct-2501 through fine-tuning and reinforcement learning. Ask questions in English; the questions may also include molecules specified as SMILES. The SMILES do not need to be canonical and may contain stereochemistry information. ether0 has limited support for IUPAC names.

Usage

This model is trained to reason in English and output a molecule. It is NOT a general-purpose chat model. It has been trained specifically for these tasks:

  • IUPAC name to SMILES
  • Molecular formula (Hill notation) to SMILES, optionally with constraints on functional groups
  • Modifying the solubility of a given molecule (SMILES) by a specified LogS, optionally with constraints on scaffolds/groups/similarity
  • Matching pKa to molecules, proposing molecules with a pKa, or modifying molecules to adjust pKa
  • Matching scent/smell to molecules and modifying molecules to adjust scent
  • Matching human cell receptor binding + mode (e.g., agonist) to a molecule, or modifying a molecule's binding effect (trained on data from EveBio)
  • ADME properties (e.g., MDCK efflux ratio, LD50)
  • GHS classifications (as words, not codes, like "carcinogen"). For example, "modify this molecule to remove acute toxicity."
  • Quantitative LD50 in mg/kg
  • Proposing 1-step retrosynthesis from likely commercially available reagents
  • Predicting a reaction outcome
  • General natural language description of a specific molecule to that molecule (inverse molecule captioning)
  • Natural product elucidation (formula + organism to SMILES) - e.g., "A molecule with formula C6H12O6 was isolated from Homo sapiens, what could it be?"
  • Matching blood-brain barrier permeability (as a class) or modifying

For example, you can ask "Propose a molecule with a pKa of 9.2" or "Modify CCCCC(O)=O to increase its pKa by about 1 unit." You cannot ask it "What is the pKa of CCCCC(O)=O?" Questions that lie significantly beyond the tasks above can fail. You can combine properties, although we have not benchmarked this extensively.
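
The questions above can be sent to the model with standard Hugging Face tooling. This is a minimal sketch: the repo id `futurehouse/ether0` comes from this card, but the generation settings (e.g. `max_new_tokens`) are assumptions, not documented defaults, and `device_map="auto"` requires the `accelerate` package.

```python
def build_messages(question: str) -> list[dict]:
    """Wrap one English question (which may contain SMILES) as a chat turn."""
    return [{"role": "user", "content": question}]

def ask_ether0(question: str, max_new_tokens: int = 2048) -> str:
    # Imported lazily so build_messages() works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("futurehouse/ether0")
    model = AutoModelForCausalLM.from_pretrained(
        "futurehouse/ether0", device_map="auto"
    )
    inputs = tokenizer.apply_chat_template(
        build_messages(question), add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, dropping the prompt.
    return tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)

# Example (requires downloading the 24B checkpoint and a suitable GPU):
#   print(ask_ether0("Propose a molecule with a pKa of 9.2"))
```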

Benchmarks

We tested ether0, along with human experts and frontier models, on a benchmark we developed. The benchmark is built from commonly used tasks, such as reaction prediction on USPTO, molecular captioning from PubChem, and GHS classification prediction. It differs from prior benchmarks in two ways: every answer is a molecule, and it is balanced so that each task contributes 25 questions (a reasonable amount for frontier-model evals). Scores generally track previously reported numbers; e.g., a reaction prediction accuracy of 80% here would be about the same on a withheld split of the USPTO-50k dataset.
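
Because every task contributes the same number of questions, the overall score is just the mean of per-task accuracies. A minimal sketch of that scoring, with illustrative task names and results (not the actual benchmark contents):

```python
def balanced_accuracy(results: dict[str, list[bool]]) -> float:
    """Mean of per-task accuracies; with equal task sizes this equals plain accuracy."""
    per_task = [sum(v) / len(v) for v in results.values()]
    return sum(per_task) / len(per_task)

# Hypothetical results: each task has 25 graded answers.
results = {
    "reaction-prediction": [True] * 20 + [False] * 5,    # 20/25 = 0.8
    "molecular-captioning": [True] * 15 + [False] * 10,  # 15/25 = 0.6
}
# balanced_accuracy(results) == (0.8 + 0.6) / 2 == 0.7
```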

The results below are for the model weights released in this repo. This differs from the preprint, which reports pre-safety-mitigation benchmarks.

(Figure: ether0 benchmark results)

Limitations

It does not know general synonyms, and it has poor textbook knowledge (e.g., it does not perform especially well on ChemBench). For best results, input molecules as SMILES: if you refer to molecules by their common names, the model may reason over an incorrect SMILES, giving poor results. For example, we have observed that the model often confuses lysine and glutamic acid when asked about them by name, but it reasons correctly about their chemistry when given their structures as SMILES.
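
One simple way to work around this limitation is to inline SMILES next to common names before prompting. The lookup table below is a hand-written illustration, not part of the model; for real use, resolve names with a cheminformatics toolkit or a database lookup.

```python
# Illustrative name-to-SMILES table (amino acid structures, non-canonical).
NAME_TO_SMILES = {
    "lysine": "NCCCCC(N)C(=O)O",
    "glutamic acid": "NC(CCC(=O)O)C(=O)O",
}

def inline_smiles(prompt: str) -> str:
    """Append the SMILES after each known name so the model reasons on structure."""
    for name, smiles in NAME_TO_SMILES.items():
        prompt = prompt.replace(name, f"{name} ({smiles})")
    return prompt

# inline_smiles("Compare the pKa of lysine and glutamic acid")
#   -> "Compare the pKa of lysine (NCCCCC(N)C(=O)O) and glutamic acid (NC(CCC(=O)O)C(=O)O)"
```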

Training details

We first fine-tuned Mistral-Small-24B-Instruct-2501 on (mostly incorrect) reasoning traces from DeepSeek R1 to elicit reasoning and adherence to the new tokens/templates. Next, we ran independent rounds of specialist training with GRPO and verifiable rewards, one specialist per task above. We then aggregated and filtered the specialists' reasoning traces (correct answers with reasoning) and used them to fine-tune Mistral-Small-24B-Instruct-2501 again. Finally, we ran GRPO over all tasks, and put the resulting model through safety post-training.
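
A highly simplified sketch of what "verifiable reward" means in the GRPO rounds: extract the proposed SMILES from a completion and score it programmatically. The `<answer>` tag format and exact-match check here are assumptions for illustration only; the actual reward functions are task-specific (see the preprint), and a real verifier would canonicalize both sides (e.g., with RDKit) before comparing.

```python
import re

def extract_answer(completion: str):
    """Pull the text between hypothetical <answer>...</answer> tags, if present."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return m.group(1).strip() if m else None

def reward(completion: str, target_smiles: str) -> float:
    """1.0 if the extracted SMILES exactly matches the target, else 0.0."""
    return 1.0 if extract_answer(completion) == target_smiles else 0.0

# reward("...reasoning...<answer>CCO</answer>", "CCO") -> 1.0
```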

(Figure: ether0 training overview)

See our preprint for details on data and training process.

Safety

We performed refusal post-training for compounds listed on OPCW Schedules 1 and 2. We also post-trained ether0 to refuse questions about standard malicious topics like making explosives or poisons. Because the model knows pharmacokinetics, it can modulate toxicity. However, the structures of toxic or narcotic compounds are generally known, and thus we do not consider this a safety risk. The model provides no uplift on "tacit knowledge" tasks like purification, scale-up, or processing beyond a web search or a similarly sized language model.

Citation

@article{narayanan2025training,
  title={Training a Scientific Reasoning Model for Chemistry},
  author={Narayanan, Siddharth M. and Braza, James D. and Griffiths, Ryan-Rhys and Bou, Albert and Wellawatte, Geemi P. and Ramos, Mayk Caldas and Mitchener, Ludovico and Rodriques, Samuel G. and White, Andrew D.},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2025}
}

Licensing

This model repository is considered open weights under an Apache 2.0 license, copyright 2025 FutureHouse.
