PRRC-Professionalism Language Model (1.3B Parameters, 30B Tokens)

This repository contains the model described in the paper "Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models".

Code: https://github.com/opendatalab/Meta-rater

Model Description

This is a 1.3B parameter transformer-based decoder-only language model trained from scratch on 30B tokens selected from the SlimPajama dataset using the Professionalism dimension of the PRRC (Professionalism, Readability, Reasoning, Cleanliness) framework. The training data was curated by selecting texts with high Professionalism scores, i.e., content that requires significant expertise and technical knowledge.

Model Details

  • Architecture: Transformer decoder-only
  • Parameters: 1.345B (1,345,423,360 parameters)
  • Training Tokens: 30B tokens
  • Context Window: 1,024 tokens
  • Vocabulary Size: 32,000 (LLaMA tokenizer)
  • Data Selection Method: Top-k selection based on Professionalism scores
  • Rating Model: ModernBERT-base fine-tuned for Professionalism assessment

Architecture Specifications

  • Hidden Dimension: 2,048
  • Number of Layers: 24
  • Attention Heads: 16
  • Key-Value Heads: 16
  • MLP Ratio: 8/3
  • Position Encoding: RoPE (base=10,000)
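
For orientation, these specifications map onto a LLaMA-style configuration roughly like the sketch below. This is a hypothetical reconstruction, not the shipped config file: the intermediate size is derived from the 8/3 MLP ratio (2,048 × 8/3 ≈ 5,461), and the exact rounding used in training is an assumption.

from transformers import LlamaConfig

# Hypothetical reconstruction of the architecture from the specs above;
# the released checkpoint may use its own config class and exact values.
config = LlamaConfig(
    vocab_size=32000,              # LLaMA tokenizer
    hidden_size=2048,              # hidden dimension
    num_hidden_layers=24,          # transformer layers
    num_attention_heads=16,        # attention heads
    num_key_value_heads=16,        # no grouped-query attention
    intermediate_size=5461,        # assumption: 8/3 * hidden_size, rounding unknown
    max_position_embeddings=1024,  # context window
    rope_theta=10000.0,            # RoPE base
)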

Data Selection Criteria

The training data was selected using the Professionalism rating model, which evaluates:

  • Expertise Level: Degree of specialized knowledge required
  • Technical Depth: Complexity of concepts and terminology
  • Academic Rigor: Quality of analysis and argumentation
  • Professional Standards: Adherence to domain-specific conventions

Selected texts typically include:

  • Academic papers and research articles
  • Technical documentation and manuals
  • Professional reports and analyses
  • Advanced educational materials
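
A minimal sketch of the selection step, assuming the rater is a ModernBERT-base sequence-classification model that emits a single Professionalism score per document; the rater path, score scale, and truncation length below are illustrative assumptions, not values from the paper.

import heapq
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

rater_name = "path/to/professionalism-rater"  # hypothetical model ID
tok = AutoTokenizer.from_pretrained(rater_name)
rater = AutoModelForSequenceClassification.from_pretrained(rater_name)
rater.eval()

def professionalism_score(text):
    """Score one document; higher means more specialized/technical."""
    enc = tok(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        return rater(**enc).logits.squeeze().item()  # single regression output

def select_top_k(documents, k):
    """Top-k selection on Professionalism scores, per the Model Details above."""
    scored = [(professionalism_score(d), d) for d in documents]
    return [d for _, d in heapq.nlargest(k, scored, key=lambda s: s[0])]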

Training Details

  • Hardware: 32x NVIDIA A800 GPUs
  • Global Batch Size: 4,194,304 tokens
  • Learning Rate: 5e-5
  • Optimizer: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
  • Training Time: ~14 hours
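
A back-of-the-envelope check of what this schedule implies, assuming the 1,024-token context window and no sequence-packing overhead:

# Rough training-schedule arithmetic from the figures above.
tokens_per_step = 4_194_304        # global batch size in tokens
context_len = 1_024                # context window
total_tokens = 30_000_000_000      # 30B training tokens

print(tokens_per_step // context_len)          # 4096 sequences per step
print(round(total_tokens / tokens_per_step))   # ~7153 optimizer steps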

Performance Results

Downstream Task Performance (Average Accuracy)

  • General Knowledge: 56.11% (+3.32% vs Random)
    • ARC-Easy: 55.85%
    • ARC-Challenge: 27.56%
    • SciQ: 84.92%
  • Commonsense Reasoning: 44.66% (+0.72% vs Random)
    • HellaSwag: 41.20%
    • SIQA: 39.99%
    • WinoGrande: 52.78%
  • Reading Comprehension: 29.89% (-0.13% vs Random)
    • RACE: 29.98%
    • OpenbookQA: 29.80%
  • Overall Average: 45.26% (+1.48% vs Random)
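
The category figures are unweighted means of the listed per-task accuracies, and the overall average is the unweighted mean over all eight tasks (not of the three category means); a quick verification:

# Reproduce the reported averages from the per-task scores above.
general = [55.85, 27.56, 84.92]      # ARC-Easy, ARC-Challenge, SciQ
commonsense = [41.20, 39.99, 52.78]  # HellaSwag, SIQA, WinoGrande
reading = [29.98, 29.80]             # RACE, OpenbookQA

mean = lambda xs: sum(xs) / len(xs)
print(round(mean(general), 2))       # 56.11
print(round(mean(commonsense), 2))   # 44.66
print(round(mean(reading), 2))       # 29.89
print(round(mean(general + commonsense + reading), 2))  # 45.26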

Key Findings

  • Strong General Knowledge: Significant improvement in knowledge-intensive tasks
  • Academic Performance: Particularly effective for scientific and technical content
  • Professional Content: Enhanced understanding of domain-specific terminology
  • Trade-offs: Slight decrease in reading comprehension, possibly due to increased text complexity

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "opendatalab/meta-rater-1b-professionalism"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Generate text (particularly good for technical content)
prompt = "In quantum mechanics, the wave function"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,                # pass input_ids and attention_mask together
        max_new_tokens=100,      # tokens generated beyond the prompt
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
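
Since the released weights are stored in bfloat16 and the repository carries Hugging Face's custom_code tag, the variant below may be preferable for loading; whether trust_remote_code=True is actually required is an assumption based on that tag.

# Optional: load in bfloat16 to match the stored tensor type.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # assumption: needed only if custom modeling code is bundled
)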

Applications

This model is particularly well-suited for:

  • Technical writing and documentation generation
  • Academic content creation and analysis
  • Professional domain text generation
  • Knowledge-intensive question answering
  • Scientific communication tasks
  • Educational content for advanced learners

Strengths

  • Enhanced performance on knowledge-requiring tasks
  • Better understanding of technical and academic content
  • Improved handling of domain-specific terminology
  • Strong performance on scientific reasoning tasks

Limitations

  • May generate overly complex language for general audiences
  • Potential bias toward academic and technical writing styles
  • Limited context window (1,024 tokens)
  • No instruction tuning or safety alignment
  • May underperform on casual or conversational content

Comparison with Baselines

  • vs Random Baseline: +1.48% overall, +3.32% on General Knowledge
  • vs Other PRRC Dimensions: Strongest on knowledge tasks, competitive on reasoning
  • vs Meta-rater (all 25 ratings): this single-dimension selection shows focused improvements on knowledge-heavy tasks rather than broad gains

Citation

If you use this model in your research, please cite:

@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}

The code for this model is available at: https://github.com/opendatalab/Meta-rater.

License

Please refer to the license terms of the original SlimPajama dataset and follow applicable data licensing requirements.

Contact

For questions or issues, please contact the authors or open an issue in the repository.
