PRRC-Readability Language Model (1.3B Parameters, 30B Tokens)

This repository contains the model described in the paper "Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models".

Code: https://github.com/opendatalab/Meta-rater

Model Description

This is a 1.3B parameter transformer-based decoder-only language model trained from scratch on 30B tokens selected from the SlimPajama dataset using the Readability dimension of the PRRC framework. The training data was curated by selecting texts with high readability scores, favoring clear, coherent, and well-structured content.

Model Details

  • Architecture: Transformer decoder-only
  • Parameters: 1.345B (1,345,423,360 parameters)
  • Training Tokens: 30B tokens
  • Context Window: 1,024 tokens
  • Vocabulary Size: 32,000 (LLaMA tokenizer)
  • Data Selection Method: Top-k selection based on Readability scores
  • Rating Model: ModernBERT-base fine-tuned for Readability assessment

Architecture Specifications

  • Hidden Dimension: 2,048
  • Number of Layers: 24
  • Attention Heads: 16
  • Key-Value Heads: 16
  • MLP Ratio: 8/3
  • Position Encoding: RoPE (base=10,000)
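
These shapes can be reconstructed as a standard LLaMA-style configuration in transformers. The snippet below is a sketch, not the released config: the intermediate size of 5,504 (8/3 × 2,048, rounded up to a multiple of 256) and untied input/output embeddings are assumptions, chosen because they reproduce the stated 1,345,423,360-parameter count exactly.

from transformers import LlamaConfig, LlamaForCausalLM

# Sketch of a config matching the listed shapes (assumptions noted above)
config = LlamaConfig(
    vocab_size=32_000,
    hidden_size=2_048,
    num_hidden_layers=24,
    num_attention_heads=16,
    num_key_value_heads=16,
    intermediate_size=5_504,        # MLP ratio 8/3, rounded (assumption)
    max_position_embeddings=1_024,
    rope_theta=10_000.0,
    tie_word_embeddings=False,      # untied embeddings (assumption)
)
model = LlamaForCausalLM(config)
print(sum(p.numel() for p in model.parameters()))  # 1,345,423,360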

Data Selection Criteria

The training data was selected using the Readability rating model, which evaluates:

  • Clarity: Clear and comprehensible language
  • Coherence: Logical flow and organization
  • Grammar: Proper sentence structure and grammar
  • Accessibility: Appropriate vocabulary and sentence complexity
  • Structure: Well-organized content with proper formatting

Selected texts typically include:

  • Well-written articles and essays
  • Clear educational materials
  • Professional communications
  • Edited publications and books
  • Quality journalism and reporting
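
A minimal sketch of the scoring-and-selection step, assuming a ModernBERT-base rater with a single-logit regression head; the checkpoint name, truncation length, and selection fraction below are illustrative, not the released pipeline:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical rater: the paper fine-tunes ModernBERT-base for Readability,
# but this exact checkpoint name and regression head are assumptions.
rater_name = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(rater_name)
rater = AutoModelForSequenceClassification.from_pretrained(rater_name, num_labels=1)
rater.eval()

@torch.no_grad()
def readability_score(text: str) -> float:
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    return rater(**inputs).logits.squeeze().item()

# Rank a corpus by readability and keep the top-scoring fraction
documents = [
    "Solar panels convert sunlight into electricity using semiconductor cells.",
    "nrg slr pnls!!1 convert w/ cells &&& etc",
]
ranked = sorted(documents, key=readability_score, reverse=True)
selected = ranked[: max(1, len(ranked) // 2)]  # in practice: until the 30B-token budget is filled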

Training Details

  • Hardware: 32x NVIDIA A800 GPUs
  • Global Batch Size: 4,194,304 tokens
  • Learning Rate: 5e-5
  • Optimizer: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
  • Training Time: ~14 hours
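
For intuition, the stated figures imply a batch of 4,096 sequences per step and roughly 7,150 optimizer steps; this is back-of-the-envelope arithmetic, not a reported number:

# Training-schedule arithmetic implied by the figures above
tokens_per_step = 4_194_304                       # global batch size in tokens
seq_len = 1_024                                   # context window
sequences_per_step = tokens_per_step // seq_len   # 4,096 sequences per step
total_tokens = 30_000_000_000
steps = total_tokens // tokens_per_step           # ~7,152 optimizer steps
print(sequences_per_step, steps)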

Performance Results

Downstream Task Performance (Average Accuracy)

  • General Knowledge: 56.18% (+3.39% vs Random)
    • ARC-Easy: 55.64%
    • ARC-Challenge: 26.19%
    • SciQ: 86.70%
  • Commonsense Reasoning: 45.41% (+1.47% vs Random)
    • HellaSwag: 42.89%
    • SIQA: 40.17%
    • WinoGrande: 53.16%
  • Reading Comprehension: 31.20% (+1.18% vs Random)
    • RACE: 32.00%
    • OpenbookQA: 30.40%
  • Overall Average: 45.89% (+2.11% vs Random)
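
Numbers like these can be reproduced with an evaluation harness; the sketch below uses EleutherAI's lm-evaluation-harness as an example, and the task names, versions, and few-shot settings are assumptions rather than the paper's exact setup:

# Illustrative evaluation via lm-evaluation-harness (pip install lm-eval)
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=opendatalab/meta-rater-1b-readability,"
        "trust_remote_code=True"
    ),
    tasks=["arc_easy", "arc_challenge", "sciq", "hellaswag"],
)
print(results["results"])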

Key Findings

  • Balanced Performance: Consistent improvements across all task categories
  • Reading Comprehension: Strong improvement in text understanding tasks
  • Clear Communication: Enhanced ability to generate coherent and readable text
  • General Applicability: Well-rounded performance suitable for diverse applications

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer (the custom architecture requires trust_remote_code)
model_name = "opendatalab/meta-rater-1b-readability"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Generate text (particularly good for clear, readable content)
prompt = "The benefits of renewable energy include"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,  # pass input_ids and attention_mask together
        max_new_tokens=100,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Applications

This model is particularly well-suited for:

  • Content writing and editing assistance
  • Educational materials creation
  • Clear communication tasks
  • General text generation with high readability
  • Documentation and technical writing
  • Public communication and outreach
  • Accessibility-focused content creation

Strengths

  • Generates clear and coherent text
  • Balanced performance across different task types
  • Improved reading comprehension capabilities
  • Well-structured and organized output
  • Suitable for general-purpose applications
  • Enhanced text clarity and flow

Limitations

  • May prioritize clarity over technical depth
  • Might avoid complex but necessary terminology
  • Limited context window (1,024 tokens)
  • No instruction tuning or safety alignment
  • Could oversimplify complex topics

Comparison with Baselines

  • vs Random Baseline: +2.11% overall improvement across all categories
  • vs Other PRRC Dimensions: Most balanced performance; strongest in reading comprehension
  • vs Meta-rater All (25 quality scores): Shows focused improvement in text clarity and comprehension

Model Characteristics

This model excels at:

  • Clarity: Producing easy-to-understand text
  • Coherence: Maintaining logical flow in generation
  • Accessibility: Using appropriate vocabulary for broad audiences
  • Structure: Organizing information effectively
  • Readability: Optimizing text for human comprehension

Citation

If you use this model in your research, please cite:

@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}

License

Please refer to the license terms of the original SlimPajama dataset and follow applicable data licensing requirements.

Contact

For questions or issues, please contact the authors or open an issue in the repository.
