PRRC-Readability Language Model (1.3B Parameters, 30B Tokens)

This repository contains the model described in the paper "Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models".

Code: https://github.com/opendatalab/Meta-rater

Model Description

This is a 1.3B parameter transformer-based decoder-only language model trained from scratch on 30B tokens selected from the SlimPajama dataset using the Readability dimension of the PRRC framework. The training data was curated by selecting texts with high readability scores, favoring clear, coherent, and well-structured content.

Model Details

  • Architecture: Transformer decoder-only
  • Parameters: 1.345B (1,345,423,360 parameters)
  • Training Tokens: 30B tokens
  • Context Window: 1,024 tokens
  • Vocabulary Size: 32,000 (LLaMA tokenizer)
  • Data Selection Method: Top-k selection based on Readability scores
  • Rating Model: ModernBERT-base fine-tuned for Readability assessment

Architecture Specifications

  • Hidden Dimension: 2,048
  • Number of Layers: 24
  • Attention Heads: 16
  • Key-Value Heads: 16
  • MLP Ratio: 8/3
  • Position Encoding: RoPE (base=10,000)
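
These shapes can be reconstructed as a standard LLaMA-style configuration in transformers. The snippet below is a sketch, not the released config: the intermediate size of 5,504 (8/3 × 2,048, rounded up to a multiple of 256) and untied input/output embeddings are assumptions, chosen because they reproduce the stated 1,345,423,360-parameter count exactly.

from transformers import LlamaConfig, LlamaForCausalLM

# Sketch of a config matching the listed shapes (assumptions noted above)
config = LlamaConfig(
    vocab_size=32_000,
    hidden_size=2_048,
    num_hidden_layers=24,
    num_attention_heads=16,
    num_key_value_heads=16,
    intermediate_size=5_504,        # MLP ratio 8/3, rounded (assumption)
    max_position_embeddings=1_024,
    rope_theta=10_000.0,
    tie_word_embeddings=False,      # untied embeddings (assumption)
)
model = LlamaForCausalLM(config)
print(sum(p.numel() for p in model.parameters()))  # 1,345,423,360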

Data Selection Criteria

The training data was selected using the Readability rating model, which evaluates:

  • Clarity: Clear and comprehensible language
  • Coherence: Logical flow and organization
  • Grammar: Proper sentence structure and grammar
  • Accessibility: Appropriate vocabulary and sentence complexity
  • Structure: Well-organized content with proper formatting

Selected texts typically include:

  • Well-written articles and essays
  • Clear educational materials
  • Professional communications
  • Edited publications and books
  • Quality journalism and reporting
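
A minimal sketch of the scoring-and-selection step, assuming a ModernBERT-base rater with a single-logit regression head; the checkpoint name, truncation length, and selection fraction below are illustrative, not the released pipeline:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical rater: the paper fine-tunes ModernBERT-base for Readability,
# but this exact checkpoint name and regression head are assumptions.
rater_name = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(rater_name)
rater = AutoModelForSequenceClassification.from_pretrained(rater_name, num_labels=1)
rater.eval()

@torch.no_grad()
def readability_score(text: str) -> float:
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    return rater(**inputs).logits.squeeze().item()

# Rank a corpus by readability and keep the top-scoring fraction
documents = [
    "Solar panels convert sunlight into electricity using semiconductor cells.",
    "nrg slr pnls!!1 convert w/ cells &&& etc",
]
ranked = sorted(documents, key=readability_score, reverse=True)
selected = ranked[: max(1, len(ranked) // 2)]  # in practice: until the 30B-token budget is filled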

Training Details

  • Hardware: 32x NVIDIA A800 GPUs
  • Global Batch Size: 4,194,304 tokens
  • Learning Rate: 5e-5
  • Optimizer: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
  • Training Time: ~14 hours
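
For intuition, the stated figures imply a batch of 4,096 sequences per step and roughly 7,150 optimizer steps; this is back-of-the-envelope arithmetic, not a reported number:

# Training-schedule arithmetic implied by the figures above
tokens_per_step = 4_194_304                       # global batch size in tokens
seq_len = 1_024                                   # context window
sequences_per_step = tokens_per_step // seq_len   # 4,096 sequences per step
total_tokens = 30_000_000_000
steps = total_tokens // tokens_per_step           # ~7,152 optimizer steps
print(sequences_per_step, steps)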

Performance Results

Downstream Task Performance (Average Accuracy)

  • General Knowledge: 56.18% (+3.39% vs Random)
    • ARC-Easy: 55.64%
    • ARC-Challenge: 26.19%
    • SciQ: 86.70%
  • Commonsense Reasoning: 45.41% (+1.47% vs Random)
    • HellaSwag: 42.89%
    • SIQA: 40.17%
    • WinoGrande: 53.16%
  • Reading Comprehension: 31.20% (+1.18% vs Random)
    • RACE: 32.00%
    • OpenbookQA: 30.40%
  • Overall Average: 45.89% (+2.11% vs Random)
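
Numbers like these can be reproduced with an evaluation harness; the sketch below uses EleutherAI's lm-evaluation-harness as an example, and the task names, versions, and few-shot settings are assumptions rather than the paper's exact setup:

# Illustrative evaluation via lm-evaluation-harness (pip install lm-eval)
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=opendatalab/meta-rater-1b-readability,"
        "trust_remote_code=True"
    ),
    tasks=["arc_easy", "arc_challenge", "sciq", "hellaswag"],
)
print(results["results"])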

Key Findings

  • Balanced Performance: Consistent improvements across all task categories
  • Reading Comprehension: Strong improvement in text understanding tasks
  • Clear Communication: Enhanced ability to generate coherent and readable text
  • General Applicability: Well-rounded performance suitable for diverse applications

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer (the custom architecture requires trust_remote_code)
model_name = "opendatalab/meta-rater-1b-readability"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Generate text (particularly good for clear, readable content)
prompt = "The benefits of renewable energy include"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,  # pass input_ids and attention_mask together
        max_new_tokens=100,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Applications

This model is particularly well-suited for:

  • Content writing and editing assistance
  • Educational materials creation
  • Clear communication tasks
  • General text generation with high readability
  • Documentation and technical writing
  • Public communication and outreach
  • Accessibility-focused content creation

Strengths

  • Generates clear and coherent text
  • Balanced performance across different task types
  • Improved reading comprehension capabilities
  • Well-structured and organized output
  • Suitable for general-purpose applications
  • Enhanced text clarity and flow

Limitations

  • May prioritize clarity over technical depth
  • Might avoid complex but necessary terminology
  • Limited context window (1,024 tokens)
  • No instruction tuning or safety alignment
  • Could oversimplify complex topics

Comparison with Baselines

  • vs Random Baseline: +2.11% overall improvement across all categories
  • vs Other PRRC Dimensions: Most balanced performance; strongest in reading comprehension
  • vs Meta-rater All (25 quality scores): Shows focused improvement in text clarity and comprehension

Model Characteristics

This model excels at:

  • Clarity: Producing easy-to-understand text
  • Coherence: Maintaining logical flow in generation
  • Accessibility: Using appropriate vocabulary for broad audiences
  • Structure: Organizing information effectively
  • Readability: Optimizing text for human comprehension

Citation

If you use this model in your research, please cite:

@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}

License

Please refer to the license terms of the original SlimPajama dataset and follow applicable data licensing requirements.

Contact

For questions or issues, please contact the authors or open an issue in the repository.
