Questions about specific FART model versions on Hugging Face

#1
by Mafuton - opened

Hello,
I'm very interested in the FART models you have made available on Hugging Face, based on your excellent papers on molecular taste prediction. Thank you for sharing your work.
I have a couple of questions regarding the specific model names used on Hugging Face:
1. Could you please clarify the meaning or specific fine-tuning configuration behind the model name FART_ChemBERTa-77M-MLM_Augmented_No_Canonical?
From the papers, it seems that “FART” refers to the model itself, “ChemBERTa-77M-MLM” to the pre-trained base model (pre-trained on 77 million SMILES with masked language modeling), and “Augmented” to the use of SMILES augmentation during fine-tuning.
However, the suffix “_No_Canonical” is not explicitly mentioned in the papers. I would appreciate it if you could explain what this signifies in terms of the training or evaluation process.
2. Could you also confirm the differences between the models FART_ChemBERTa-77M-MLM_Augmented and FART_ChemBERTa-77M-MLM_W?
Based on your papers (e.g., Table 1 and Sections 2.2/3.2.2), it appears that the former corresponds to the “FART augmented” model (trained with SMILES augmentation), and the latter to the “FART augmented + weighted” model (trained with augmentation and a weighted loss function to address class imbalance).
Is this understanding correct?
Thank you very much for your time and for providing these valuable models and data to the community.
Any clarification would be greatly appreciated and will help users like me better understand and apply your models.

Fart Labs org

Hi @Mafuton ,
Thank you for your interest in our work. The models in the most recent version of our preprint are linked here: https://huggingface.co/collections/FartLabs/fart-673c62059b78cf0d9835ba52. The model checkpoints you are referring to are earlier training runs that can be ignored in the context of our paper. Note that the model with a weighted loss function was only explicitly included in a first version of the manuscript before we manually corrected some entries in the dataset. The links to these outdated checkpoints are provided in the supplemental information (Table 2) available at https://chemrxiv.org/engage/api-gateway/chemrxiv/assets/orp/resource/item/673a303ef9980725cf80bf6a/original/supplementary-information.pdf.
I do recommend using the models from the most recent version of our manuscript, available under the first link. Happy to help with any remaining questions.
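For anyone finding this thread later, here is a minimal sketch of loading one of the checkpoints with the transformers library. The repo id FartLabs/FART_Augmented and the example SMILES are assumptions for illustration; substitute whichever checkpoint from the collection linked above you actually want to use.

```python
# Minimal sketch: loading a FART checkpoint for taste classification.
# Assumptions: the repo id "FartLabs/FART_Augmented" (one of the checkpoints in
# the linked collection) and that it exposes a sequence-classification head.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo_id = "FartLabs/FART_Augmented"  # assumed repo id; adjust as needed
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)

smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # aspirin, used here only as an example input
inputs = tokenizer(smiles, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred_id = logits.argmax(dim=-1).item()
print(model.config.id2label.get(pred_id, pred_id))  # predicted taste label
```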

Hi @yzimmermann ,
Thank you for the information. Based on what you've provided, I will consider using the latest version. Thank you for your prompt response.

Hi @yzimmermann ,
"I've already downloaded 'FART_Augmented'.
Upon checking the model structure, I've noticed it differs from 'FART_ChemBERTa-77M-MLM_Augmented.' According to your papers, you utilized ChemBERTa, which has been pre-trained on 77 million SMILES strings using a masked language modeling approach. I assume you utilized 'ChemBERTa-77M-MLM'. However, the structure of 'FART_Augmented' differs from 'ChemBERTa-77M-MLM' in terms of the number of encoders and the input/output dimensions of the encoders.
Could you please clarify which pre-trained model was utilized for 'FART_Augmented'?"
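For context, this is roughly how I compared the two configurations. It is only a sketch; the repo ids FartLabs/FART_Augmented and DeepChem/ChemBERTa-77M-MLM are my assumptions about where the checkpoints live.

```python
# Minimal sketch of the configuration comparison described above.
# Assumed repo ids: "FartLabs/FART_Augmented" (fine-tuned model) and
# "DeepChem/ChemBERTa-77M-MLM" (pre-trained base model).
from transformers import AutoConfig

checkpoints = {
    "FART_Augmented": "FartLabs/FART_Augmented",
    "ChemBERTa-77M-MLM": "DeepChem/ChemBERTa-77M-MLM",
}

for name, repo_id in checkpoints.items():
    cfg = AutoConfig.from_pretrained(repo_id)
    print(name)
    print("  encoder layers:  ", cfg.num_hidden_layers)
    print("  hidden size:     ", cfg.hidden_size)
    print("  intermediate size:", cfg.intermediate_size)
    print("  attention heads: ", cfg.num_attention_heads)
```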
Thank you in advance.
