---
license: mit
pipeline_tag: text-classification
library_name: transformers
base_model: answerdotai/ModernBERT-large
tags:
- math
- science
- academic
- reasoning
- verification
- weaver
- cross-encoder
- multi-domain
language:
- en
---

# Weaver Distilled for All Datasets (ModernBERT-large)

A general-purpose distilled cross-encoder based on ModernBERT-large, trained to predict the correctness of reasoning responses across multiple domains: mathematics (MATH500), science (GPQA), and academic knowledge (MMLU-Pro). The verifier was trained on Weaver scores aggregated over 35 LM judges and reward models.

## Model Details

- **Base Model**: [answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large) (395M parameters)
- **Architecture**: Cross-encoder with MLP head (1024 → 512 → 256 → 1)
- **Max Sequence Length**: 4096 tokens
- **Training Data**: Combined MATH500, GPQA, and MMLU-Pro with Weaver scores from 35 LM judges and reward models
- **Task**: Binary classification for answer correctness prediction across domains
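
The head dimensions listed above can be sketched in PyTorch as follows. This is a minimal illustration, not the released implementation: the card gives only the layer widths, so the ReLU activations, the absence of dropout, and the names `HIDDEN_SIZE` and `mlp_head` are assumptions.

```python
import torch
from torch import nn

HIDDEN_SIZE = 1024  # matches the "1024" input width in the head spec

# Assumption: ReLU between layers and no dropout; the card only gives layer widths.
mlp_head = nn.Sequential(
    nn.Linear(HIDDEN_SIZE, 512),
    nn.ReLU(),
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 1),  # single logit; apply sigmoid to get a correctness score
)

# Stand-in for a batch of two pooled encoder outputs.
pooled = torch.randn(2, HIDDEN_SIZE)
logits = mlp_head(pooled)

num_params = sum(p.numel() for p in mlp_head.parameters())
print(tuple(logits.shape), num_params)
```

Under these assumptions the head adds roughly 0.66M parameters on top of the 395M-parameter encoder, so nearly all of the compute is in the encoder.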

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "hazyresearch/Weaver_Distilled_All_Datasets_ModernBERT-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example usage - works across math, science, and academic domains
instruction = "Which of the following is a characteristic of prokaryotic cells? A) Nucleus B) Mitochondria C) Ribosomes D) Endoplasmic reticulum"
response = "The answer is C) Ribosomes. Prokaryotic cells lack membrane-bound organelles like nuclei, mitochondria, and endoplasmic reticulum, but they do contain ribosomes for protein synthesis."

# Tokenize input pair
inputs = tokenizer(
    instruction, 
    response,
    truncation=True,
    max_length=4096,
    padding=True,
    return_tensors="pt"
)

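# Get correctness score: the model emits a single logit, and sigmoid maps it to (0, 1)
with torch.no_grad():
    outputs = model(**inputs)
    score = torch.sigmoid(outputs.logits).item()

print(f"Correctness score: {score:.3f}")
# 0.5 is a default cutoff; tune it on held-out data if precision or recall matters more
print(f"Prediction: {'Correct' if score > 0.5 else 'Incorrect'}")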
```
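
A verifier like this is often used to rank several candidate responses to the same question. The sketch below shows one possible way to do that, assuming the model returns one logit per pair as in the example above. The helper names `score_responses` and `select_best` are hypothetical and are not part of the Weaver codebase.

```python
import torch


def score_responses(model, tokenizer, instruction, responses, max_length=4096):
    """Return one sigmoid correctness score per candidate response."""
    inputs = tokenizer(
        [instruction] * len(responses),  # pair the same question with each candidate
        responses,
        truncation=True,
        max_length=max_length,
        padding=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    # Flatten (batch, 1) logits to a plain list of floats in (0, 1).
    return torch.sigmoid(logits).reshape(-1).tolist()


def select_best(responses, scores):
    """Return the highest-scoring response together with its score."""
    best = max(range(len(responses)), key=scores.__getitem__)
    return responses[best], scores[best]
```

With `model` and `tokenizer` loaded as in the Quick Start, `select_best(candidates, score_responses(model, tokenizer, instruction, candidates))` returns the candidate the verifier rates most likely to be correct.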

## Training Details

This model was trained using the [Weaver distillation pipeline](https://github.com/ScalingIntelligence/scaling-verification/tree/main/distillation) on a combined dataset spanning multiple reasoning domains. To train your own distilled models, see the [distillation README](https://github.com/ScalingIntelligence/scaling-verification/blob/main/distillation/README.md).

## Evaluation

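The commands below run `evaluate_crossencoder.py` on MATH500 and GPQA. In both, `--model_name` is the base model and `--checkpoint_path` is this distilled checkpoint: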

```bash
# MATH500
python evaluate_crossencoder.py \
  --model_name "answerdotai/ModernBERT-large" \
  --checkpoint_path "hazyresearch/Weaver_Distilled_All_Datasets_ModernBERT-large" \
  --dataset_path "hazyresearch/MATH500_with_Llama_3.1_70B_Instruct_v1" \
  --dataset_split "data" \
  --max_length 4096 \
  --batch_size 64

# GPQA
python evaluate_crossencoder.py \
  --model_name "answerdotai/ModernBERT-large" \
  --checkpoint_path "hazyresearch/Weaver_Distilled_All_Datasets_ModernBERT-large" \
  --dataset_path "hazyresearch/GPQA_with_Llama_3.1_70B_Instruct_v1" \
  --dataset_split "data" \
  --max_length 4096 \
  --batch_size 64
```

## Citation

```bibtex
@misc{saadfalcon2025shrinkinggenerationverificationgapweak,
      title={Shrinking the Generation-Verification Gap with Weak Verifiers}, 
      author={Jon Saad-Falcon and E. Kelly Buchanan and Mayee F. Chen and Tzu-Heng Huang and Brendan McLaughlin and Tanvir Bhathal and Shang Zhu and Ben Athiwaratkun and Frederic Sala and Scott Linderman and Azalia Mirhoseini and Christopher Ré},
      year={2025},
      eprint={2506.18203},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2506.18203}, 
}
```