LChemME (base-sized) pre-trained on DOSEDO fragments

LChemME pre-trained with our LChemME Python package to canonicalize SMILES strings below 300 Da from the DOSEDO DNA-encoded diversity-oriented synthesis dataset.

Model description

LChemME is a Large Chemical Model for Embedding based on BART, a transformer encoder-decoder architecture.

LChemME uses a small vocabulary (512 tokens) compared with natural language models. LChemME models are pre-trained on the task of SMILES canonicalization (according to RDKit rules). This task requires the model to build an internal representation of the chemical graph directly from the SMILES string and decode that graph back to its canonical SMILES.
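
For illustration, the canonical target for a given input SMILES can be generated with RDKit. This is a minimal sketch of the task, not the LChemME package's own pre-processing code:

from rdkit import Chem

# Aspirin written in a non-canonical form; RDKit parses it into a molecular graph
# and writes it back as the canonical SMILES that the model learns to decode.
noncanonical = "CC(Oc1ccccc1C(O)=O)=O"
mol = Chem.MolFromSmiles(noncanonical)
canonical = Chem.MolToSmiles(mol)  # canonical by RDKit rules, e.g. "CC(=O)Oc1ccccc1C(=O)O"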

This checkpoint (101M parameters) results from pre-training on 465,135 SMILES strings with molecular weight below 300 Da, drawn from a DNA-encoded diversity-oriented synthesis library. The validation set comprised molecules with molecular weight above 350 Da. We aim for this LChemME model to help generalize chemical property prediction from measurements made on chemical fragments.

How to use

Here is how to use this model in PyTorch:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('scbirlab/lchemme-base-dosedo-lteq300')
model = AutoModelForSeq2SeqLM.from_pretrained('scbirlab/lchemme-base-dosedo-lteq300')

inputs = tokenizer("CC(Oc1ccccc1C(O)=O)=O", return_tensors="pt")
outputs = model(**inputs)

last_hidden_states = outputs.encoder_last_hidden_state  # per-token embeddings from the encoder
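
Continuing from the snippet above, one way to obtain a single embedding per molecule is mask-aware mean pooling over the encoder hidden states; calling model.generate recovers the canonical SMILES. The pooling choice here is an illustrative sketch, not a prescription from the LChemME package:

# One embedding vector per molecule: average encoder hidden states over non-padding tokens.
mask = inputs["attention_mask"].unsqueeze(-1)                        # (batch, seq_len, 1)
embeddings = (last_hidden_states * mask).sum(dim=1) / mask.sum(dim=1)

# The pre-training task itself: decode the input back to its canonical SMILES.
generated = model.generate(**inputs, max_length=128)
print(tokenizer.decode(generated[0], skip_special_tokens=True))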

Base model

facebook/bart-base
