Meta-rater Language Model (3.3B Parameters, 100B Tokens)
This repository contains the model described in the paper Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models.
Code: https://github.com/opendatalab/Meta-rater
Model Description
This is a 3.3B-parameter, decoder-only transformer language model trained from scratch on 100B tokens selected from the SlimPajama dataset using the Meta-rater framework with all 25 quality scores. It demonstrates that the benefits of Meta-rater's data selection carry over to a larger model size and a larger training set.
Model Details
- Architecture: Transformer decoder-only
- Parameters: 3.3B (3,335,989,760 parameters)
- Training Tokens: 100B tokens
- Context Window: 1,024 tokens
- Vocabulary Size: 32,000 (LLaMA tokenizer)
- Data Selection Method: Meta-rater with all 25 quality scores
- Optimization: Learned optimal weightings from 1.3B experiments
Architecture Specifications
- Hidden Dimension: 2,560
- Number of Layers: 40
- Attention Heads: 20
- Key-Value Heads: 20
- MLP Ratio: 8/3
- Position Encoding: RoPE (base=10,000)
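As a sanity check, the specifications above can be tied back to the stated parameter count. The sketch below is a back-of-the-envelope estimate, not the authors' code, and it rests on assumptions this card does not state: a LLaMA-style block with bias-free projections, a three-matrix gated MLP whose width is 8/3 × hidden rounded up to a multiple of 256, two norm weight vectors per layer plus one final norm, and untied input and output embeddings. The function name `count_parameters` is hypothetical. Under these assumptions the total comes out to the reported 3,335,989,760.

```python
def count_parameters(hidden=2560, layers=40, vocab=32000, mlp_multiple=256):
    """Estimate parameters for a LLaMA-style decoder (assumed layout)."""
    # Assumption: MLP width is ceil(8/3 * hidden), rounded up to a multiple of 256.
    mlp_width = -(-8 * hidden // 3)
    mlp_width = -(-mlp_width // mlp_multiple) * mlp_multiple

    attention = 4 * hidden * hidden   # Q, K, V, O projections (KV heads == heads)
    mlp = 3 * hidden * mlp_width      # gate, up, down projections (assumed gated MLP)
    norms = 2 * hidden                # two norm weight vectors per layer (assumed)
    per_layer = attention + mlp + norms

    embeddings = 2 * vocab * hidden   # assumption: untied input and output embeddings
    final_norm = hidden
    return layers * per_layer + embeddings + final_norm, mlp_width


total, mlp_width = count_parameters()
print(f"MLP width: {mlp_width}, total parameters: {total:,}")
```

Because Key-Value Heads equals Attention Heads (20), the K and V projections are full-size, which is why all four attention matrices are counted as hidden × hidden.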
Data Selection Framework
The training data was selected using the same Meta-rater framework as the 1.3B models, leveraging:
Quality Score Integration (25 total)
- Natural Language Quality Signals (11): RedPajama rule-based measures
- Data Importance Scores (3): DSIR similarity to Books, Wikipedia, AutoMathText
- Model-based Ratings (11): PRRC + QuRating + FineWeb-Edu + WanjuanCC
Optimal Weighting Strategy
The same learned weights from 1.3B proxy model experiments were applied, ensuring consistent data selection criteria across scales.
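To make the idea of applying learned weights to multiple quality scores concrete, here is a minimal sketch. It assumes the scores are combined as a weighted sum and that the highest-ranked documents are kept until a token budget is reached; this card does not spell out the exact aggregation or selection rule, so treat both as assumptions. The function name `select_documents`, the toy weights, and the toy scores are all hypothetical, and only 3 scores are used instead of 25 to keep the example short.

```python
def select_documents(scores, weights, token_counts, token_budget):
    """Rank documents by a weighted sum of their quality scores and keep
    the top-ranked ones until the next document would exceed the budget."""
    aggregate = [sum(w * s for w, s in zip(weights, doc)) for doc in scores]
    ranked = sorted(range(len(scores)), key=lambda i: aggregate[i], reverse=True)

    selected, used = [], 0
    for i in ranked:
        if used + token_counts[i] > token_budget:
            break
        selected.append(i)
        used += token_counts[i]
    return selected, aggregate


# Toy data: 4 documents, 3 scores each (the real setup uses 25 scores).
weights = [0.5, 0.3, 0.2]  # hypothetical, not the learned Meta-rater weights
scores = [
    [0.9, 0.8, 0.7],
    [0.2, 0.1, 0.3],
    [0.6, 0.9, 0.5],
    [0.4, 0.4, 0.4],
]
token_counts = [100, 100, 100, 100]

selected, aggregate = select_documents(scores, weights, token_counts, token_budget=200)
print("selected document indices:", selected)
```

The sketch also assumes every score has already been normalized to a comparable range; without that, a score with a larger numeric scale would dominate the sum regardless of its weight.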
Training Details
- Hardware: 32x NVIDIA A800 GPUs
- Global Batch Size: 4,194,304 tokens
- Learning Rate: 5e-5
- Optimizer: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
- Training Time: ~129 hours
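The figures above imply a few derived quantities that are useful when planning a comparable run. The arithmetic below is a rough sketch: it assumes "100B" means exactly 1e11 tokens, that every sequence is packed to the full 1,024-token context window, and that the ~129 hours is wall-clock time across all 32 GPUs. None of these is stated explicitly in this card.

```python
import math

context_window = 1_024           # tokens per sequence (from Model Details)
global_batch_tokens = 4_194_304  # tokens per optimizer step
total_tokens = 100_000_000_000   # assumption: "100B" is exactly 1e11
num_gpus = 32
training_hours = 129             # approximate, per the card

sequences_per_batch = global_batch_tokens // context_window
optimizer_steps = math.ceil(total_tokens / global_batch_tokens)
tokens_per_second = total_tokens / (training_hours * 3600)
tokens_per_gpu_second = tokens_per_second / num_gpus

print(f"sequences per batch:  {sequences_per_batch}")
print(f"optimizer steps:      {optimizer_steps}")
print(f"throughput (total):   {tokens_per_second:,.0f} tokens/s")
print(f"throughput (per GPU): {tokens_per_gpu_second:,.0f} tokens/s")
```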
Performance Results
Downstream Task Performance (Average Accuracy)
General Knowledge: 67.51% (+3.29 points vs Random 3.3B)
- ARC-Easy: 72.10%
- ARC-Challenge: 37.54%
- SciQ: 92.90%
Commonsense Reasoning: 54.35% (+0.80 points vs Random 3.3B)
- HellaSwag: 58.99%
- SIQA: 43.91%
- WinoGrande: 60.14%
Reading Comprehension: 36.06% (+0.78 points vs Random 3.3B)
- RACE: 35.12%
- OpenbookQA: 37.00%
Overall Average: 54.71% (+1.73 points vs Random 3.3B)
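The category and overall figures above are consistent with unweighted means over the individual tasks; in particular, the overall average matches the mean over all eight tasks rather than the mean of the three category averages. The short check below reproduces them from the per-task accuracies listed in this section. It is an illustration of the arithmetic, not the authors' evaluation code.

```python
scores = {
    "General Knowledge": {"ARC-Easy": 72.10, "ARC-Challenge": 37.54, "SciQ": 92.90},
    "Commonsense Reasoning": {"HellaSwag": 58.99, "SIQA": 43.91, "WinoGrande": 60.14},
    "Reading Comprehension": {"RACE": 35.12, "OpenbookQA": 37.00},
}

# Unweighted mean of the tasks inside each category.
category_means = {
    name: round(sum(tasks.values()) / len(tasks), 2)
    for name, tasks in scores.items()
}

# Overall average: unweighted mean over all eight tasks.
all_tasks = [acc for tasks in scores.values() for acc in tasks.values()]
overall = round(sum(all_tasks) / len(all_tasks), 2)

for name, mean in category_means.items():
    print(f"{name}: {mean:.2f}%")
print(f"Overall Average: {overall:.2f}%")
```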
Knowledge-Intensive Tasks
- MMLU: 26.21% (+0.73 points vs Random 3.3B)
- NaturalQuestions: 6.87% (+0.59 points vs Random 3.3B)
Scaling Validation
Benefits Persist at Scale
Compared to the 1.3B Meta-rater model results:
- Consistent Improvements: Similar relative gains maintained at larger scale
- Absolute Performance: Substantial improvements in all categories
- Efficiency: Data selection remains valuable even with more parameters
Cross-Scale Comparison
- 1.3B Meta-rater: 47.01% overall
- 3.3B Meta-rater: 54.71% overall (+7.70 points from scaling)
- Scale Efficiency: ~2.5x parameters yield significant performance gains
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "opendatalab/meta-rater-3b-25raters"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Generate text with sampling
prompt = "The key principles of sustainable development include"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_length=150,  # total length: prompt plus generated tokens
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
Applications
This model is well-suited for:
- Production applications requiring high-quality text generation
- Research needing stronger baseline performance
- Educational platforms with diverse content requirements
- Content creation at scale with quality assurance
- Multi-domain applications benefiting from balanced capabilities
- Scaling studies for data selection methodologies
Key Achievements
- Scalability Validation: Confirms Meta-rater benefits persist at larger scales
- Improved Baselines: Establishes stronger performance benchmarks
- Efficiency Demonstration: Better results with same computational budget
- Quality Consistency: Maintains data selection advantages across scales
Research Significance
This model provides crucial evidence for:
- Scaling Laws: Data quality benefits did not diminish when moving from 1.3B to 3.3B parameters
- Efficiency: Quality data selection remained valuable at both scales tested
- Methodology Robustness: Meta-rater framework generalizes across sizes
- Cost-Effectiveness: Better performance without additional training costs
Strengths
- Enhanced performance across all evaluation categories
- Scalable data selection methodology
- Improved knowledge retention and reasoning
- Consistent quality improvements over random selection
- Validated framework transferability
Limitations
- Higher computational requirements for training
- Limited context window (1,024 tokens)
- No instruction tuning or safety alignment
- Requires quality score preprocessing
- Same data selection overhead as smaller models
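One practical consequence of the 1,024-token context window is that the prompt and the generated continuation must fit in it together. The helper below is a minimal sketch of one way to handle this, assuming it is acceptable to drop the oldest tokens of an over-long prompt. The function `fit_to_context` is hypothetical and works on a plain list of token IDs, so it does not depend on any particular tokenizer.

```python
def fit_to_context(prompt_ids, max_new_tokens, context_window=1024):
    """Trim a prompt from the left so that prompt + generated tokens
    fit inside the context window."""
    if not 0 < max_new_tokens < context_window:
        raise ValueError("max_new_tokens must be between 1 and context_window - 1")
    budget = context_window - max_new_tokens  # tokens left for the prompt
    return prompt_ids[-budget:]               # keep the most recent tokens


# Example: a 2,000-token prompt with room reserved for 150 new tokens.
long_prompt = list(range(2000))  # stand-in for real token IDs
trimmed = fit_to_context(long_prompt, max_new_tokens=150)
print(len(trimmed), "prompt tokens kept")
```

Keeping the tail of the prompt is a design choice that suits continuation-style generation; for tasks where the opening of the document matters more, truncating from the right or summarizing may be a better fit.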
Comparison Summary
vs Random 3.3B Baseline
- Overall: +1.73 points (54.71% vs 52.98%)
- General Knowledge: +3.29 points (strongest category)
- All Categories: Consistent improvements across all task types
vs 1.3B Meta-rater
- Scale Benefits: +7.70 points from increased parameters
- Framework Consistency: Same data selection principles apply effectively
- Efficiency: Larger models can better utilize high-quality data
Citation
If you use this model in your research, please cite:
@article{zhuang2025meta,
title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
journal={arXiv preprint arXiv:2504.14194},
year={2025}
}
Related Resources
- 1.3B Meta-rater Models: Smaller-scale versions with detailed analysis
- PRRC Rating Models: Quality assessment models used for data selection
- Annotated SlimPajama: Complete dataset with quality scores
- Random Baselines: Corresponding baseline models for comparison
- Project Page: Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
- GitHub: https://github.com/opendatalab/Meta-rater
License
Please refer to the license terms of the original SlimPajama dataset and follow applicable data licensing requirements.
Contact
For questions or issues, please contact the authors or open an issue in the repository.