---
datasets:
- opendatalab/SlimPajama-Meta-rater
language:
- en
license: mit
pipeline_tag: text-generation
library_name: transformers
---

# PRRC-Professionalism Language Model (1.3B Parameters, 30B Tokens)

This repository contains the model described in the paper [Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models](https://huggingface.co/papers/2504.14194).

Code: https://github.com/opendatalab/Meta-rater

## Model Description

This is a 1.3B parameter transformer-based decoder-only language model trained from scratch on 30B tokens selected from the SlimPajama dataset using the **Professionalism** dimension of the PRRC framework. The training data was curated by selecting texts with high Professionalism scores, focusing on content that requires significant expertise and technical knowledge.

## Model Details

- **Architecture**: Transformer decoder-only
- **Parameters**: 1.345B (1,345,423,360 parameters)
- **Training Tokens**: 30B tokens
- **Context Window**: 1,024 tokens
- **Vocabulary Size**: 32,000 (LLaMA tokenizer)
- **Data Selection Method**: Top-k selection based on Professionalism scores
- **Rating Model**: ModernBERT-base fine-tuned for Professionalism assessment

## Architecture Specifications

- **Hidden Dimension**: 2,048
- **Number of Layers**: 24
- **Attention Heads**: 16
- **Key-Value Heads**: 16
- **MLP Ratio**: 8/3
- **Position Encoding**: RoPE (base=10,000)
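For reference, these hyperparameters can be assembled into a LLaMA-style configuration with `transformers`. This is a minimal sketch under stated assumptions, not the released training setup: the architecture is assumed to follow the standard LLaMA implementation, and `intermediate_size=5504` is our rounding of the 8/3 MLP ratio (8/3 × 2,048 ≈ 5,461); the exact value used for this checkpoint may differ.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Hypothetical reconstruction of the architecture described above,
# assuming a standard LLaMA-style decoder. intermediate_size is an
# assumption: (8/3) * 2048 ≈ 5461, rounded here to 5504.
config = LlamaConfig(
    vocab_size=32_000,              # LLaMA tokenizer
    hidden_size=2_048,
    intermediate_size=5_504,
    num_hidden_layers=24,
    num_attention_heads=16,
    num_key_value_heads=16,         # full multi-head attention (no GQA)
    max_position_embeddings=1_024,  # 1,024-token context window
    rope_theta=10_000.0,            # RoPE base
)

model = LlamaForCausalLM(config)  # randomly initialized, for inspection only
print(f"{sum(p.numel() for p in model.parameters()):,}")  # ≈1.35B parameters
```

Instantiating this config yields approximately the 1.345B parameter count reported above, which is consistent with the LLaMA-style reading of the specification.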
## Data Selection Criteria

The training data was selected using the Professionalism rating model, which evaluates:

- **Expertise Level**: Degree of specialized knowledge required
- **Technical Depth**: Complexity of concepts and terminology
- **Academic Rigor**: Quality of analysis and argumentation
- **Professional Standards**: Adherence to domain-specific conventions

Selected texts typically include:

- Academic papers and research articles
- Technical documentation and manuals
- Professional reports and analyses
- Advanced educational materials

## Training Details

- **Hardware**: 32x NVIDIA A800 GPUs
- **Global Batch Size**: 4,194,304 tokens
- **Learning Rate**: 5e-5
- **Optimizer**: Adam (β₁=0.9, β₂=0.95, ε=1e-8)
- **Training Time**: ~14 hours

## Performance Results

### Downstream Task Performance (Average Accuracy)

- **General Knowledge**: 56.11% (+3.32% vs Random)
  - ARC-Easy: 55.85%
  - ARC-Challenge: 27.56%
  - SciQ: 84.92%
- **Commonsense Reasoning**: 44.66% (+0.72% vs Random)
  - HellaSwag: 41.20%
  - SIQA: 39.99%
  - WinoGrande: 52.78%
- **Reading Comprehension**: 29.89% (-0.13% vs Random)
  - RACE: 29.98%
  - OpenbookQA: 29.80%
- **Overall Average**: 45.26% (+1.48% vs Random)

## Key Findings

- **Strong General Knowledge**: Significant improvement on knowledge-intensive tasks
- **Academic Performance**: Particularly effective for scientific and technical content
- **Professional Content**: Enhanced understanding of domain-specific terminology
- **Trade-offs**: Slight decrease in reading comprehension, possibly due to the increased complexity of the training texts

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "opendatalab/meta-rater-1b-professionalism"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text (particularly suited to technical content)
prompt = "In quantum mechanics, the wave function"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_length=100,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

## Applications

This model is particularly well-suited for:

- **Technical writing** and documentation generation
- **Academic content** creation and analysis
- **Professional domain** text generation
- **Knowledge-intensive** question answering
- **Scientific communication** tasks
- **Educational content** for advanced learners

## Strengths

- Enhanced performance on knowledge-intensive tasks
- Better understanding of technical and academic content
- Improved handling of domain-specific terminology
- Strong performance on scientific reasoning tasks

## Limitations

- May generate overly complex language for general audiences
- Potential bias toward academic and technical writing styles
- Limited context window (1,024 tokens)
- No instruction tuning or safety alignment
- May underperform on casual or conversational content

## Comparison with Baselines

- **vs Random Baseline**: +1.48% overall, +3.32% on General Knowledge
- **vs Other PRRC Dimensions**: Strongest on knowledge tasks, competitive on reasoning
- **vs Meta-rater All (25)**: The individual dimension shows focused improvements

## Citation

If you use this model in your research, please cite:

```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```

The code for this model is available at: https://github.com/opendatalab/Meta-rater.

## License

Please refer to the license terms of the original SlimPajama dataset and follow all applicable data licensing requirements.

## Contact

For questions or issues, please contact the authors or open an issue in the repository.