---
datasets:
- opendatalab/SlimPajama-Meta-rater
language:
- en
license: mit
pipeline_tag: text-generation
library_name: transformers
---

# PRRC-Professionalism Language Model (1.3B Parameters, 30B Tokens)

This repository contains the model described in the paper [Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models](https://huggingface.co/papers/2504.14194).

Code: https://github.com/opendatalab/Meta-rater

## Model Description

This is a 1.3B parameter transformer-based decoder-only language model trained from scratch on 30B tokens selected from the SlimPajama dataset using the **Professionalism** dimension of the PRRC framework. The training data was curated by selecting texts with high Professionalism scores, focusing on content that requires significant expertise and technical knowledge.

## Model Details

- **Architecture**: Transformer decoder-only
- **Parameters**: 1.345B (1,345,423,360 parameters)
- **Training Tokens**: 30B tokens
- **Context Window**: 1,024 tokens
- **Vocabulary Size**: 32,000 (LLaMA tokenizer)
- **Data Selection Method**: Top-k selection based on Professionalism scores
- **Rating Model**: ModernBERT-base fine-tuned for Professionalism assessment

## Architecture Specifications

- **Hidden Dimension**: 2,048
- **Number of Layers**: 24
- **Attention Heads**: 16
- **Key-Value Heads**: 16
- **MLP Ratio**: 8/3
- **Position Encoding**: RoPE (base=10,000)
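For reference, these hyperparameters can be assembled into a LLaMA-style configuration with `transformers`. This is a minimal sketch under stated assumptions, not the released training setup: the architecture is assumed to follow the standard LLaMA implementation, and `intermediate_size=5504` is our rounding of the 8/3 MLP ratio (8/3 × 2,048 ≈ 5,461); the exact value used for this checkpoint may differ.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Hypothetical reconstruction of the architecture described above,
# assuming a standard LLaMA-style decoder. intermediate_size is an
# assumption: (8/3) * 2048 ≈ 5461, rounded here to 5504.
config = LlamaConfig(
    vocab_size=32_000,              # LLaMA tokenizer
    hidden_size=2_048,
    intermediate_size=5_504,
    num_hidden_layers=24,
    num_attention_heads=16,
    num_key_value_heads=16,         # full multi-head attention (no GQA)
    max_position_embeddings=1_024,  # 1,024-token context window
    rope_theta=10_000.0,            # RoPE base
)

model = LlamaForCausalLM(config)  # randomly initialized, for inspection only
print(f"{sum(p.numel() for p in model.parameters()):,}")  # ≈1.35B parameters
```

Instantiating this config yields approximately the 1.345B parameter count reported above, which is consistent with the LLaMA-style reading of the specification.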
## Data Selection Criteria

The training data was selected using the Professionalism rating model, which evaluates:

- **Expertise Level**: Degree of specialized knowledge required
- **Technical Depth**: Complexity of concepts and terminology
- **Academic Rigor**: Quality of analysis and argumentation
- **Professional Standards**: Adherence to domain-specific conventions

Selected texts typically include:

- Academic papers and research articles
- Technical documentation and manuals
- Professional reports and analyses
- Advanced educational materials

## Training Details

- **Hardware**: 32x NVIDIA A800 GPUs
- **Global Batch Size**: 4,194,304 tokens
- **Learning Rate**: 5e-5
- **Optimizer**: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
- **Training Time**: ~14 hours

## Performance Results

### Downstream Task Performance (Average Accuracy)

- **General Knowledge**: 56.11% (+3.32% vs Random)
  - ARC-Easy: 55.85%
  - ARC-Challenge: 27.56%
  - SciQ: 84.92%
- **Commonsense Reasoning**: 44.66% (+0.72% vs Random)
  - HellaSwag: 41.20%
  - SIQA: 39.99%
  - WinoGrande: 52.78%
- **Reading Comprehension**: 29.89% (-0.13% vs Random)
  - RACE: 29.98%
  - OpenbookQA: 29.80%
- **Overall Average**: 45.26% (+1.48% vs Random)

## Key Findings

- **Strong General Knowledge**: Significant improvement on knowledge-intensive tasks
- **Academic Performance**: Particularly effective for scientific and technical content
- **Professional Content**: Enhanced understanding of domain-specific terminology
- **Trade-offs**: Slight decrease in reading comprehension, possibly due to the increased complexity of the training texts

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "opendatalab/meta-rater-1b-professionalism"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text (particularly suited to technical content)
prompt = "In quantum mechanics, the wave function"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_length=100,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

## Applications

This model is particularly well-suited for:

- **Technical writing** and documentation generation
- **Academic content** creation and analysis
- **Professional domain** text generation
- **Knowledge-intensive** question answering
- **Scientific communication** tasks
- **Educational content** for advanced learners

## Strengths

- Enhanced performance on knowledge-intensive tasks
- Better understanding of technical and academic content
- Improved handling of domain-specific terminology
- Strong performance on scientific reasoning tasks

## Limitations

- May generate overly complex language for general audiences
- Potential bias toward academic and technical writing styles
- Limited context window (1,024 tokens)
- No instruction tuning or safety alignment
- May underperform on casual or conversational content

## Comparison with Baselines

- **vs Random Baseline**: +1.48% overall, +3.32% on General Knowledge
- **vs Other PRRC Dimensions**: Strongest on knowledge tasks, competitive on reasoning
- **vs Meta-rater All (25)**: The individual dimension shows focused improvements

## Citation

If you use this model in your research, please cite:

```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```

The code for this model is available at: https://github.com/opendatalab/Meta-rater.

## License

Please refer to the license terms of the original SlimPajama dataset and follow all applicable data licensing requirements.

## Contact

For questions or issues, please contact the authors or open an issue in the repository.