Model Card for Model hogru/MolReactGen-USPTO50K-Reaction-Templates

MolReactGen is a model that generates reaction templates in SMARTS format (this model) and molecules in SMILES format.

Model Details

Model Description

MolReactGen is based on the GPT-2 transformer decoder architecture and has been trained on a pre-processed version of the USPTO-50K dataset. More information can be found in these introductory slides.

  • Developed by: Stephan Holzgruber
  • Model type: Transformer decoder
  • License: MIT

Model Sources

Uses

The main use of this model is to pass the master's examination of the author ;-)

Direct Use

The model can be used in a Hugging Face text generation pipeline. For the intended use case, a wrapper around the raw text generation pipeline is needed; this is generate.py in the repository. The model has a default GenerationConfig() (generation_config.json), which can be overridden. Depending on the number of reaction templates to be generated (num_return_sequences in the JSON file), generation might take a while. generate.py shows a progress bar during generation.
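As a rough illustration of the raw pipeline usage described above, the following sketch loads the model into a text generation pipeline and overrides a few generation settings. It is not the repository's generate.py: the use of the tokenizer's BOS token as the prompt, the parameter values, and the helper name generate_raw_templates are assumptions made for this example, and the raw output may still need the post-processing that the wrapper performs.

```python
MODEL_ID = "hogru/MolReactGen-USPTO50K-Reaction-Templates"


def generate_raw_templates(num_return_sequences: int = 5, max_new_tokens: int = 128) -> list:
    """Return raw generated strings from the pipeline (hypothetical helper)."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=MODEL_ID)

    # Assumption: generation starts from the tokenizer's BOS token.
    prompt = generator.tokenizer.bos_token
    if prompt is None:
        raise ValueError("Tokenizer defines no BOS token; choose a prompt explicitly.")

    # Keyword arguments passed here take precedence over generation_config.json.
    outputs = generator(
        prompt,
        do_sample=True,
        num_return_sequences=num_return_sequences,
        max_new_tokens=max_new_tokens,
    )
    return [item["generated_text"] for item in outputs]
```

Calling generate_raw_templates(num_return_sequences=3) downloads the model on first use and returns three raw strings. For anything beyond a quick experiment, the wrapper in the repository is the intended entry point.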

Bias, Risks, and Limitations

The model generates reaction templates that are similar to the USPTO-50K training data. Any checks of the reaction templates, e.g. for chemical feasibility, must be addressed by the user of the model.
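As one example of a check a user might add, the sketch below applies a cheap syntactic pre-filter to a generated string. It only tests that the string has the three-part reaction layout (reactants>agents>products, so ">>" when there are no agents) and that brackets are balanced on each side. It says nothing about chemical feasibility, for which a cheminformatics toolkit would be needed. The function name looks_like_reaction_template is hypothetical and not part of the repository.

```python
def looks_like_reaction_template(template: str) -> bool:
    """Cheap syntactic pre-filter for reaction SMARTS (not a chemical validity check)."""
    # A reaction has exactly two '>' separators: left side, optional agents, right side.
    parts = template.split(">")
    if len(parts) != 3 or not parts[0] or not parts[2]:
        return False

    closing_to_opening = {")": "(", "]": "["}
    for part in parts:
        stack = []
        for char in part:
            if char in "([":
                stack.append(char)
            elif char in closing_to_opening:
                # A closing bracket must match the most recent opening bracket.
                if not stack or stack.pop() != closing_to_opening[char]:
                    return False
        if stack:  # something was opened but never closed
            return False
    return True
```

Strings that fail this filter can be discarded early; strings that pass still need to be parsed and assessed with proper chemistry tooling.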

Training Details

Training Data

Pre-processed version of the USPTO-50K dataset, originally introduced by Schneider et al.

Training Procedure

The default Hugging Face Trainer() has been used, with an EarlyStoppingCallback().
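To make the early stopping behaviour concrete, here is a minimal pure-Python sketch of patience-based stopping on a validation loss. It mirrors the general idea behind EarlyStoppingCallback (which takes early_stopping_patience and early_stopping_threshold), but the patience and threshold values, the example losses, and the function name stopping_epoch are assumptions for illustration. The values actually used for this model are not stated in this card.

```python
from typing import Optional, Sequence


def stopping_epoch(eval_losses: Sequence[float], patience: int = 3,
                   threshold: float = 0.0) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if best is None or best - loss > threshold:
            # The loss improved by more than the threshold: remember it and reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None
```

With a patience of 3, a run whose validation loss last improves in epoch 3 would stop after epoch 6.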

Preprocessing

The training data was pre-processed with a PreTrainedTokenizerFast() trained on the training data with a bespoke RegEx pre-tokenizer which "understands" the SMARTS syntax.
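The idea of a regular-expression pre-tokenizer can be sketched with Python's re module. The pattern below is a simplified stand-in, not the author's actual expression: it keeps bracket atoms, the reaction arrow, and the two-letter halogens together and falls back to single characters for everything else. Nested brackets, as in recursive SMARTS, are split character by character by this simplified pattern.

```python
import re

# Hypothetical, simplified pattern; alternatives are tried left to right.
SMARTS_TOKEN_PATTERN = re.compile(
    r"\[[^\[\]]+\]"  # a bracket atom such as [C:1] or [NH2:3]
    r"|>>"           # the reaction arrow
    r"|Br|Cl"        # two-letter halogens outside brackets
    r"|."            # any other single character (bonds, ring digits, parentheses)
)


def pre_tokenize(smarts: str) -> list:
    """Split a SMARTS string into coarse tokens without losing any characters."""
    return SMARTS_TOKEN_PATTERN.findall(smarts)
```

For example, pre_tokenize("[C:1]-[Cl:2]>>[C:1]Br") yields the tokens [C:1], -, [Cl:2], >>, [C:1], and Br, and joining the tokens reproduces the input. A pattern of this kind is what a regex-based pre-tokenizer would apply before the tokenizer's vocabulary is learned.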

Training Hyperparameters

  • Batch size: 8
  • Gradient accumulation steps: 4
  • Mixed precision: fp16 (native AMP)
  • Learning rate: 0.0005
  • Learning rate scheduler: Cosine
  • Learning rate scheduler warmup: 0.1
  • Optimizer: AdamW with betas=(0.9,0.95) and epsilon=1e-08
  • Number of epochs: 43 (early stopping)

More configuration (options) can be found in the conf directory of the repository.
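Two consequences of the hyperparameters listed above can be worked out directly. With a batch size of 8 and 4 gradient accumulation steps, each optimizer update sees 8 × 4 = 32 samples on a single GPU. The learning rate follows a linear warmup to 0.0005 and then a cosine decay. The sketch below assumes that the warmup value of 0.1 is a fraction of the total number of training steps and that the decay runs to zero over a single half cosine; the function name lr_at_step and the total step count in the usage note are illustrative.

```python
import math

BATCH_SIZE = 8
GRADIENT_ACCUMULATION_STEPS = 4
EFFECTIVE_BATCH_SIZE = BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS  # samples per optimizer update


def lr_at_step(step: int, total_steps: int, peak_lr: float = 5e-4,
               warmup_ratio: float = 0.1) -> float:
    """Learning rate under linear warmup followed by cosine decay to zero."""
    # Assumption: the warmup setting is a ratio of the total number of steps.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For a hypothetical run of 1000 optimizer steps, the rate rises linearly over the first 100 steps, peaks at 0.0005 at step 100, falls to half of that at step 550, and reaches zero at step 1000.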

Evaluation

Please see the slides / the poster mentioned above.

Metrics

Please see the slides / the poster mentioned above.

Results

Please see the slides / the poster mentioned above.

Technical Specifications

Framework versions

  • Transformers 4.27.1
  • PyTorch 1.13.1
  • Datasets 2.10.1
  • Tokenizers 0.13.2

Hardware

  • Local PC running Ubuntu 22.04
  • NVIDIA GeForce RTX 3080 Ti (12 GB)