---
license: mit
pipeline_tag: text-classification
library_name: transformers
base_model: answerdotai/ModernBERT-large
tags:
- math
- science
- academic
- reasoning
- verification
- weaver
- cross-encoder
- multi-domain
language:
- en
---

# Weaver Distilled for All Datasets (ModernBERT-large)

A general-purpose distilled cross-encoder model based on ModernBERT-large, trained to predict the correctness of reasoning responses across multiple domains: mathematics (MATH500), science (GPQA), and academic knowledge (MMLU-Pro). This specialized verifier was trained on Weaver scores aggregated over 35 different verifiers and reward models.

## Model Details

- **Base Model**: [answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large) (395M parameters)
- **Architecture**: Cross-encoder with MLP head (1024 → 512 → 256 → 1)
- **Max Sequence Length**: 4096 tokens
- **Training Data**: Combined MATH500, GPQA, and MMLU-Pro with Weaver scores from 35 LM judges and reward models
- **Task**: Binary classification for answer correctness prediction across domains

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "hazyresearch/Weaver_Distilled_All_Datasets_ModernBERT-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example usage - works across math, science, and academic domains
instruction = "Which of the following is a characteristic of prokaryotic cells? A) Nucleus B) Mitochondria C) Ribosomes D) Endoplasmic reticulum"
response = "The answer is C) Ribosomes. Prokaryotic cells lack membrane-bound organelles like nuclei, mitochondria, and endoplasmic reticulum, but they do contain ribosomes for protein synthesis."

# Tokenize input pair
inputs = tokenizer(
    instruction,
    response,
    truncation=True,
    max_length=4096,
    padding=True,
    return_tensors="pt"
)

# Get correctness score
with torch.no_grad():
    outputs = model(**inputs)
    score = torch.sigmoid(outputs.logits).item()

print(f"Correctness score: {score:.3f}")
print(f"Prediction: {'Correct' if score > 0.5 else 'Incorrect'}")
```

## Training Details

This model was trained using the [Weaver distillation pipeline](https://github.com/ScalingIntelligence/scaling-verification/tree/main/distillation) on a combined dataset spanning multiple reasoning domains. For training your own distilled models, see the [distillation README](https://github.com/ScalingIntelligence/scaling-verification/blob/main/distillation/README.md).

## Evaluation

Evaluate this model on different datasets:

```bash
# MATH500
python evaluate_crossencoder.py \
    --model_name "answerdotai/ModernBERT-large" \
    --checkpoint_path "hazyresearch/Weaver_Distilled_All_Datasets_ModernBERT-large" \
    --dataset_path "hazyresearch/MATH500_with_Llama_3.1_70B_Instruct_v1" \
    --dataset_split "data" \
    --max_length 4096 \
    --batch_size 64

# GPQA
python evaluate_crossencoder.py \
    --model_name "answerdotai/ModernBERT-large" \
    --checkpoint_path "hazyresearch/Weaver_Distilled_All_Datasets_ModernBERT-large" \
    --dataset_path "hazyresearch/GPQA_with_Llama_3.1_70B_Instruct_v1" \
    --dataset_split "data" \
    --max_length 4096 \
    --batch_size 64
```

## Citation

```bibtex
@misc{saadfalcon2025shrinkinggenerationverificationgapweak,
      title={Shrinking the Generation-Verification Gap with Weak Verifiers},
      author={Jon Saad-Falcon and E. Kelly Buchanan and Mayee F. Chen and Tzu-Heng Huang and Brendan McLaughlin and Tanvir Bhathal and Shang Zhu and Ben Athiwaratkun and Frederic Sala and Scott Linderman and Azalia Mirhoseini and Christopher Ré},
      year={2025},
      eprint={2506.18203},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2506.18203},
}
```
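
## Post-processing Scores (Sketch)

The Quick Start scores a single instruction/response pair by applying a sigmoid to the model logit and thresholding at 0.5. When several candidate responses to the same instruction have been scored, the same post-processing can be applied to each logit, and the candidates can then be sorted by score. The following is a minimal pure-Python sketch of that step. The helper names `sigmoid` and `rank_candidates`, the example logits, and the idea of ranking candidates by score are illustrative assumptions, not part of the released code.

```python
import math
from typing import List, Tuple


def sigmoid(logit: float) -> float:
    """Map a raw logit to a score in (0, 1), mirroring torch.sigmoid in the Quick Start."""
    # Numerically stable form: avoids overflow in exp() for large |logit|.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def rank_candidates(
    responses: List[str], logits: List[float], threshold: float = 0.5
) -> List[Tuple[str, float, bool]]:
    """Return (response, score, predicted_correct) tuples sorted by descending score.

    Hypothetical helper: `logits` holds one raw model logit per response.
    """
    if len(responses) != len(logits):
        raise ValueError("responses and logits must have the same length")
    scored = [(r, sigmoid(l)) for r, l in zip(responses, logits)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(r, s, s > threshold) for r, s in scored]


# Made-up logits for three candidate answers (not real model outputs).
candidates = ["The answer is A.", "The answer is C.", "The answer is D."]
example_logits = [-1.2, 2.3, 0.1]

ranked = rank_candidates(candidates, example_logits)
for response, score, is_correct in ranked:
    label = "Correct" if is_correct else "Incorrect"
    print(f"{score:.3f}  {label:9s}  {response}")
```

In practice the logits would come from `outputs.logits` for a batch of tokenized pairs; the 0.5 threshold simply matches the one used in the Quick Start and may need tuning per dataset.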