Meta-rater Language Model (3.3B Parameters, 100B Tokens)

This repository contains the model described in the paper Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models.

Code: https://github.com/opendatalab/Meta-rater

Model Description

This is a 3.3B-parameter, transformer-based, decoder-only language model trained from scratch on 100B tokens selected from the SlimPajama dataset using the Meta-rater framework with all 25 quality scores. It demonstrates that the benefits of Meta-rater's data selection carry over to larger model sizes and training datasets.

Model Details

Architecture: Transformer decoder-only
Parameters: 3.3B (3,335,989,760 parameters)
Training Tokens: 100B tokens
Context Window: 1,024 tokens
Vocabulary Size: 32,000 (LLaMA tokenizer)
Data Selection Method: Meta-rater with all 25 quality scores
Optimization: Learned optimal weightings from 1.3B experiments

Architecture Specifications

Hidden Dimension: 2,560
Number of Layers: 40
Attention Heads: 20
Key-Value Heads: 20
MLP Ratio: 8/3
Position Encoding: RoPE (base=10,000)
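
The stated total of 3,335,989,760 parameters can be reproduced from these dimensions under a few assumptions that the card does not spell out: a LLaMA-style block (bias-free attention, a gated three-matrix MLP, RMSNorm), an MLP width of 8/3 x 2,560 rounded up to a multiple of 256, and separate input and output embedding matrices. A minimal sketch of that arithmetic:

```python
import math

# Values stated in the model card
HIDDEN = 2560
LAYERS = 40
VOCAB = 32_000

# Assumption (not stated in the card): the MLP width is 8/3 * hidden,
# rounded up to the nearest multiple of 256.
MLP_MULTIPLE = 256
mlp_width = math.ceil(HIDDEN * 8 / 3 / MLP_MULTIPLE) * MLP_MULTIPLE


def count_parameters(hidden: int, layers: int, vocab: int, mlp_width: int) -> int:
    """Count weights of an assumed LLaMA-style decoder without biases."""
    attention = 4 * hidden * hidden   # Q, K, V, O projections (KV heads == heads)
    mlp = 3 * hidden * mlp_width      # gate, up, and down projections
    norms = 2 * hidden                # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    embeddings = 2 * vocab * hidden   # untied input and output embeddings
    final_norm = hidden
    return layers * per_layer + embeddings + final_norm


total = count_parameters(HIDDEN, LAYERS, VOCAB, mlp_width)
print(f"MLP width: {mlp_width}, total parameters: {total:,}")
```

Under these assumptions the MLP width comes out to 6,912 and the total matches the parameter count listed under Model Details.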

Data Selection Framework

The training data was selected using the same Meta-rater framework as the 1.3B models, leveraging:

Quality Score Integration (25 total)

Natural Language Quality Signals (11): RedPajama rule-based measures
Data Importance Scores (3): DSIR similarity to Books, Wikipedia, AutoMathText
Model-based Ratings (11): PRRC + QuRating + FineWeb-Edu + WanjuanCC

Optimal Weighting Strategy

The same learned weights from 1.3B proxy model experiments were applied, ensuring consistent data selection criteria across scales.
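
As a rough illustration of how learned weights could turn many quality scores into a single selection decision, the sketch below combines per-document scores as a weighted sum and keeps the highest-ranked documents until a token budget is filled. The weighted-sum form, the function name `select_by_weighted_score`, the three-score toy data, and the weight values are all assumptions made for illustration; the actual procedure and weights are described in the paper and repository.

```python
from typing import List, Sequence


def select_by_weighted_score(
    scores: Sequence[Sequence[float]],
    weights: Sequence[float],
    token_counts: Sequence[int],
    token_budget: int,
) -> List[int]:
    """Return indices of kept documents, highest aggregate score first."""
    # One aggregate score per document: weighted sum of its quality scores.
    aggregate = [
        sum(w * s for w, s in zip(weights, doc_scores))
        for doc_scores in scores
    ]
    ranked = sorted(range(len(scores)), key=lambda i: aggregate[i], reverse=True)

    selected: List[int] = []
    used = 0
    for i in ranked:
        if used + token_counts[i] > token_budget:
            continue  # skip documents that would overflow the budget
        selected.append(i)
        used += token_counts[i]
    return selected


# Toy data: 4 documents, 3 hypothetical normalized quality scores each.
scores = [
    [0.9, 0.2, 0.7],
    [0.1, 0.8, 0.3],
    [0.6, 0.6, 0.6],
    [0.3, 0.1, 0.2],
]
weights = [0.5, 0.2, 0.3]            # hypothetical learned weights
token_counts = [400, 300, 500, 200]  # hypothetical document lengths

kept = select_by_weighted_score(scores, weights, token_counts, token_budget=1000)
print(kept)
```

In the real setting each document would carry 25 scores rather than three, and the budget would be the 100B training tokens.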

Training Details

Hardware: 32x NVIDIA A800 GPUs
Global Batch Size: 4,194,304 tokens
Learning Rate: 5e-5
Optimizer: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
Training Time: ~129 hours
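
A few derived quantities follow from these figures. The sketch below assumes that every sequence fills the full 1,024-token context and that training covered exactly 100B tokens in the reported ~129 hours; neither is stated explicitly, so the results are estimates.

```python
TOTAL_TOKENS = 100_000_000_000   # 100B training tokens
GLOBAL_BATCH_TOKENS = 4_194_304  # tokens per optimizer step
CONTEXT_WINDOW = 1_024           # tokens per sequence
NUM_GPUS = 32
TRAINING_HOURS = 129             # approximate wall-clock time

# Sequences per optimizer step, assuming full-length sequences.
sequences_per_batch = GLOBAL_BATCH_TOKENS // CONTEXT_WINDOW

# Optimizer steps needed to consume the token budget.
optimizer_steps = TOTAL_TOKENS / GLOBAL_BATCH_TOKENS

# Aggregate compute and throughput estimates.
gpu_hours = NUM_GPUS * TRAINING_HOURS
tokens_per_second = TOTAL_TOKENS / (TRAINING_HOURS * 3600)
tokens_per_second_per_gpu = tokens_per_second / NUM_GPUS

print(f"sequences per batch: {sequences_per_batch}")
print(f"optimizer steps:     {optimizer_steps:,.0f}")
print(f"GPU hours:           {gpu_hours}")
print(f"tokens/s (total):    {tokens_per_second:,.0f}")
print(f"tokens/s per GPU:    {tokens_per_second_per_gpu:,.0f}")
```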

Performance Results

Downstream Task Performance (average accuracy; deltas are percentage points over the random-selection 3.3B baseline)

General Knowledge: 67.51% (+3.29% vs Random 3.3B)
- ARC-Easy: 72.10%
- ARC-Challenge: 37.54%
- SciQ: 92.90%
Commonsense Reasoning: 54.35% (+0.80% vs Random 3.3B)
- HellaSwag: 58.99%
- SIQA: 43.91%
- WinoGrande: 60.14%
Reading Comprehension: 36.06% (+0.78% vs Random 3.3B)
- RACE: 35.12%
- OpenbookQA: 37.00%
Overall Average: 54.71% (+1.73% vs Random 3.3B)
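
The category and overall figures can be recomputed from the per-task accuracies. The short check below uses only the numbers listed above and shows that the overall average is the unweighted mean over the eight tasks, not the mean of the three category averages.

```python
from statistics import mean

# Per-task accuracies (%) as listed in the model card.
accuracy = {
    "General Knowledge": {"ARC-Easy": 72.10, "ARC-Challenge": 37.54, "SciQ": 92.90},
    "Commonsense Reasoning": {"HellaSwag": 58.99, "SIQA": 43.91, "WinoGrande": 60.14},
    "Reading Comprehension": {"RACE": 35.12, "OpenbookQA": 37.00},
}

# Unweighted mean within each category.
category_avg = {
    name: round(mean(tasks.values()), 2) for name, tasks in accuracy.items()
}

# Unweighted mean over all eight tasks.
all_tasks = [acc for tasks in accuracy.values() for acc in tasks.values()]
overall_avg = round(mean(all_tasks), 2)

for name, value in category_avg.items():
    print(f"{name}: {value:.2f}%")
print(f"Overall (8 tasks): {overall_avg:.2f}%")
```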

Knowledge-Intensive Tasks

MMLU: 26.21% (+0.73% vs Random 3.3B)
NaturalQuestions: 6.87% (+0.59% vs Random 3.3B)

Scaling Validation

Benefits Persist at Scale

Compared to the 1.3B Meta-rater model results:

Consistent Improvements: Similar relative gains maintained at larger scale
Absolute Performance: Substantial improvements in all categories
Efficiency: Data selection remains valuable even with more parameters

Cross-Scale Comparison

1.3B Meta-rater: 47.01% overall
3.3B Meta-rater: 54.71% overall (+7.70% from scaling)
Scale Efficiency: ~2.5x parameters yield significant performance gains

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "opendatalab/meta-rater-3b-25raters"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Generate a sampled continuation of the prompt
prompt = "The key principles of sustainable development include"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,  # explicit mask, since pad and EOS share an id
        max_length=150,                        # total length, prompt included
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
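
Because the context window is 1,024 tokens, the prompt and the generated continuation together should stay within that limit. The helper below is a hypothetical sketch (the name `fit_generation_budget` is not part of transformers) that computes how many new tokens can safely be requested for a given prompt length; the result could be passed as `max_new_tokens` to `generate`.

```python
CONTEXT_WINDOW = 1024  # context length stated in the model card


def fit_generation_budget(
    prompt_tokens: int,
    requested_new_tokens: int,
    context_window: int = CONTEXT_WINDOW,
) -> int:
    """Clamp the number of new tokens so prompt + output fits the context."""
    if prompt_tokens >= context_window:
        raise ValueError(
            f"Prompt has {prompt_tokens} tokens; it must be shorter than "
            f"the {context_window}-token context window."
        )
    return min(requested_new_tokens, context_window - prompt_tokens)


# A short prompt leaves the request unchanged; a long one is clamped.
print(fit_generation_budget(prompt_tokens=10, requested_new_tokens=140))
print(fit_generation_budget(prompt_tokens=1000, requested_new_tokens=140))
```

In practice `prompt_tokens` would be the length of the tokenized prompt, for example `inputs.input_ids.shape[1]` in the snippet above.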

Applications

This model is well-suited for:

Production applications requiring high-quality text generation
Research needing stronger baseline performance
Educational platforms with diverse content requirements
Content creation at scale with quality assurance
Multi-domain applications benefiting from balanced capabilities
Scaling studies for data selection methodologies

Key Achievements

Scalability Validation: Confirms Meta-rater benefits persist at larger scales
Improved Baselines: Establishes stronger performance benchmarks
Efficiency Demonstration: Better results with same computational budget
Quality Consistency: Maintains data selection advantages across scales

Research Significance

This model provides crucial evidence for:

Scaling Laws: Data quality benefits persist as model size grows from 1.3B to 3.3B parameters
Efficiency: Quality data selection remains valuable at both scales tested
Methodology Robustness: Meta-rater framework generalizes across sizes
Cost-Effectiveness: Better performance without additional training costs

Strengths

Enhanced performance across all evaluation categories
Scalable data selection methodology
Improved knowledge retention and reasoning
Consistent quality improvements over random selection
Validated framework transferability

Limitations

Higher computational requirements for training
Limited context window (1,024 tokens)
No instruction tuning or safety alignment
Requires quality score preprocessing
Same data selection overhead as smaller models

Comparison Summary

vs Random 3.3B Baseline

Overall: +1.73 percentage points (54.71% vs 52.98%)
General Knowledge: +3.29 percentage points (strongest category)
All Categories: Improvements in every task category

vs 1.3B Meta-rater

Scale Benefits: +7.70 percentage points overall from the larger model and training set
Framework Consistency: Same data selection principles apply effectively
Efficiency: Larger models can better utilize high-quality data

Citation

If you use this model in your research, please cite:

@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}

Related Resources

1.3B Meta-rater Models: Smaller-scale versions with detailed analysis
PRRC Rating Models: Quality assessment models used for data selection
Annotated SlimPajama: Complete dataset with quality scores
Random Baselines: Corresponding baseline models for comparison
Project Page: Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
GitHub: https://github.com/opendatalab/Meta-rater

License

Please refer to the license terms of the original SlimPajama dataset and follow applicable data licensing requirements.

Contact

For questions or issues, please contact the authors or open an issue in the repository.
