Problem With Downstream Fine-Tuning - NLI
Hi!
I pretrained a new ModernBERT with the transformers library (a modified run_mlm.py script, following the training recipe from the paper). Despite good masked-token prediction accuracy (0.83) and perplexity (2.22), and good results when testing token prediction with the transformers fill-mask pipeline, the model fails completely at basic NLI sequence classification (with ModernBertForSequenceClassification). F1 scores vary randomly, even though the training loss does decrease. Validation loss hovers around the same value, and accuracy hovers around 0.33, i.e. random guessing over the three NLI classes.
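For reference, here is a minimal sketch of the kind of fine-tuning setup I mean (the dataset, model path, and hyperparameters below are placeholders, not my exact script):

```python
# Simplified NLI fine-tuning sketch: a 3-way classification head on top of the
# pretrained encoder, with premise/hypothesis tokenized as a sentence pair.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (
    AutoTokenizer,
    ModernBertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model_path = "path/to/my-pretrained-modernbert"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = ModernBertForSequenceClassification.from_pretrained(model_path, num_labels=3)

raw = load_dataset("nyu-mll/multi_nli")  # placeholder NLI dataset

def tokenize(batch):
    # Premise and hypothesis are encoded together as one sequence pair.
    return tokenizer(batch["premise"], batch["hypothesis"], truncation=True)

encoded = raw.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": (preds == labels).mean(),
        "f1_macro": f1_score(labels, preds, average="macro"),
    }

args = TrainingArguments(
    output_dir="modernbert-nli",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
    eval_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation_matched"],
    processing_class=tokenizer,  # enables dynamic padding via the default collator
    compute_metrics=compute_metrics,
)
trainer.train()
```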
I made sure there are no NaN values in my tensors (as seen here). I'm also on the latest transformers version (pulled from their Docker repository, running in a container), and I have the Flash Attention version matching my torch, Python, and CUDA versions, running on an A100 GPU.
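In case it helps, this is roughly the kind of sanity check I ran on the weights (simplified; the model path is a placeholder):

```python
# Check that no parameter of the loaded model contains NaN/Inf values.
import torch
from transformers import ModernBertForSequenceClassification

model = ModernBertForSequenceClassification.from_pretrained(
    "path/to/my-pretrained-modernbert", num_labels=3  # placeholder path
)
bad = [name for name, p in model.named_parameters() if not torch.isfinite(p).all()]
print("parameters with NaN/Inf:", bad or "none")
```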
I can provide more training details if needed. Broadly, I followed the original paper, though I trained on a domain-specific dataset of ~8B tokens.
I also tested this with the official ModernBERT-base checkpoint - it fails at this task too.