Professionalism Rating Model
This repository contains the model described in the paper Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models.
Code: https://github.com/opendatalab/Meta-rater
Model Description
This model is a fine-tuned version of ModernBERT-base designed to evaluate the Professionalism dimension of text quality on a continuous 0-5 scale. Professionalism measures the degree of expertise and prerequisite knowledge required to comprehend a text, focusing on the depth, accuracy, and accessibility of its content.
Model Details
- Base Model: ModernBERT-base
- Parameters: 149M
- Context Window: 4,096 tokens
- Task: Text quality rating (regression)
- Score Range: 0-5 (continuous)
- Performance: 91.57% F1 score, 93.78% accuracy
Rating Scale
The model uses an additive 5-point rating system (a sketch for mapping continuous predictions back onto these levels follows the list):
- 0: Content requiring no technical knowledge
- 1: Simple content requiring minimal technical knowledge (nursery rhymes, children's books)
- 2: Somewhat complex content requiring basic specialized knowledge (popular books, popular science)
- 3: Moderate complexity requiring some expertise (advanced books, detailed articles)
- 4: Complex content requiring significant expertise (academic papers, technical reports)
- 5: Extremely high professionalism requiring advanced subject matter expertise (advanced academic papers, patents)
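Because the model's regression head outputs a continuous value, downstream code often needs to map a prediction back onto these discrete levels. Below is a minimal sketch; the rounding rule and label strings are illustrative paraphrases of this card, not outputs of the model.

```python
# Illustrative helper: map a continuous professionalism score onto the
# discrete 0-5 levels above. Labels are paraphrased from this card.
PROFESSIONALISM_LEVELS = {
    0: "no technical knowledge required",
    1: "minimal technical knowledge (e.g., children's books)",
    2: "basic specialized knowledge (e.g., popular science)",
    3: "some expertise (e.g., advanced books, detailed articles)",
    4: "significant expertise (e.g., academic papers)",
    5: "advanced subject matter expertise (e.g., patents)",
}

def describe_score(score: float) -> str:
    level = min(5, max(0, round(score)))  # clamp and round to the 0-5 scale
    return f"{score:.2f} -> level {level}: {PROFESSIONALISM_LEVELS[level]}"

print(describe_score(3.7))  # 3.70 -> level 4: significant expertise ...
```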
Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "opendatalab/meta-rater-professionalism-rating"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Example text
text = "The quantum entanglement phenomenon demonstrates non-local correlations between particles..."

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    outputs = model(**inputs)

# The regression head emits a single logit, so read it out directly
# rather than taking an argmax over class logits.
score = outputs.logits.squeeze().item()
print(f"Professionalism Score: {score:.2f}")
```
Training Details
- Training Data: 747,422 training examples from the SlimPajama dataset
- Annotation Model: Llama-3.3-70B-Instruct
- Training Epochs: 10
- Evaluation Split: 93,428 test examples
- Data Split: 8:1:1 (train:dev:test); see the sketch below
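An 8:1:1 split along these lines can be reproduced with a few lines of plain Python. The sketch below is not the authors' pipeline: the total example count is inferred from the train and test sizes above, and the seed is arbitrary.

```python
import random

# Illustrative 8:1:1 train/dev/test split (not the authors' exact pipeline).
# Total count inferred from this card: 747,422 train + 2 x 93,428 dev/test.
examples = list(range(934_278))  # placeholder; real records would be annotated texts
random.Random(42).shuffle(examples)  # fixed seed for reproducibility

n = len(examples)
n_train, n_dev = int(0.8 * n), int(0.1 * n)
train = examples[:n_train]
dev = examples[n_train:n_train + n_dev]
test = examples[n_train + n_dev:]
print(len(train), len(dev), len(test))  # roughly 8:1:1
```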
Applications
This model is particularly useful for:
- Data curation for language model pre-training
- Content filtering based on technical complexity (see the sketch after this list)
- Educational content difficulty assessment
- Research paper and technical document evaluation
- Curriculum development and content sequencing
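As a concrete example of the data-curation and filtering use cases above, the sketch below keeps only documents whose predicted score falls inside a target band. The `score_text` helper wraps the single-text inference from Usage; the thresholds and placeholder corpus are illustrative, not recommendations from the paper.

```python
# Illustrative corpus filter: keep documents in a target professionalism band.
# Thresholds (2.0-4.0) are placeholders, not values recommended by the paper.
def score_text(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()

corpus = ["Document one ...", "Document two ..."]  # placeholder documents
selected = [doc for doc in corpus if 2.0 <= score_text(doc) <= 4.0]
print(f"kept {len(selected)} of {len(corpus)} documents")
```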
Limitations
- The model focuses on content professionalism, not writing style or grammar
- Performance may vary across different domains not well-represented in training data
- Designed primarily for English text
- Should not be used as the sole criterion for content quality assessment
Citation
If you use this model in your research, please cite:
```bibtex
@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}
```
License
This model is released under the same license as the base ModernBERT model.
Contact
For questions or issues, please contact the authors or open an issue in the repository.