Professionalism Rating Model

This repository contains the model described in the paper Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models.

Code: https://github.com/opendatalab/Meta-rater

Model Description

This model is a fine-tuned version of ModernBERT-base designed to evaluate the Professionalism dimension of text quality on a 0-5 scale. Professionalism measures the degree of expertise and prerequisite knowledge required to comprehend a text, focusing on the depth, accuracy, and accessibility of content.

Model Details

  • Base Model: ModernBERT-base
  • Parameters: 149M
  • Context Window: 4,096 tokens
  • Task: Text quality rating (regression)
  • Score Range: 0-5 (continuous)
  • Performance: 91.57% F1 score, 93.78% accuracy

Rating Scale

The model uses an additive 5-point rating system:

  • 0: Content requiring no technical knowledge
  • 1: Simple content requiring minimal technical knowledge (nursery rhymes, children's books)
  • 2: Somewhat complex content requiring basic specialized knowledge (popular books, popular science)
  • 3: Moderate complexity requiring some expertise (advanced books, detailed articles)
  • 4: Complex content requiring significant expertise (academic papers, technical reports)
  • 5: Extremely high professionalism requiring advanced subject matter expertise (advanced academic papers, patents)
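Because the model outputs a continuous value, a prediction can be mapped back to the nearest discrete band when a categorical label is needed. A minimal sketch (clamping and rounding is one obvious choice, not prescribed by the paper; the label strings are paraphrased from the scale above):

```python
LABELS = {
    0: "no technical knowledge",
    1: "minimal technical knowledge",
    2: "basic specialized knowledge",
    3: "some expertise",
    4: "significant expertise",
    5: "advanced subject-matter expertise",
}

def to_band(score: float) -> int:
    """Clamp a continuous 0-5 score and round to the nearest integer band."""
    return round(max(0.0, min(5.0, score)))

print(to_band(3.7), "->", LABELS[to_band(3.7)])
```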

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "opendatalab/meta-rater-professionalism-rating"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example text
text = "The quantum entanglement phenomenon demonstrates non-local correlations between particles..."

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    outputs = model(**inputs)
    # single-logit regression head: the score is the logit itself
    score = outputs.logits.squeeze().item()

print(f"Professionalism Score: {score:.2f}")
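For documents longer than the 4,096-token context window, one simple option is to score fixed-size chunks and average the results. A minimal sketch, assuming a `score_fn` callable that wraps the model call above (the chunking logic itself is model-independent, so a stub scorer is used here for illustration):

```python
def score_long_text(tokens, score_fn, chunk_size=4096):
    """Score a long token sequence by averaging per-chunk scores.

    tokens: list of token ids (already tokenized)
    score_fn: callable mapping one token chunk to a float score
    """
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    if not chunks:
        return 0.0
    return sum(score_fn(c) for c in chunks) / len(chunks)

# Stub scorer for demonstration only (a real pipeline would call the model)
stub = lambda chunk: min(5.0, len(chunk) / 2048)
print(score_long_text(list(range(10000)), stub))
```

Averaging weights every chunk equally; weighting by chunk length or taking the maximum are equally reasonable choices depending on the curation goal.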

Training Details

  • Training Data: 747,422 examples from SlimPajama dataset
  • Annotation Model: Llama-3.3-70B-Instruct
  • Training Epochs: 10
  • Evaluation Split: 93,428 test examples
  • Data Split: 8:1:1 (train:dev:test)

Applications

This model is particularly useful for:

  • Data curation for language model pre-training
  • Content filtering based on technical complexity
  • Educational content difficulty assessment
  • Research paper and technical document evaluation
  • Curriculum development and content sequencing
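For data-curation use cases like those above, the continuous score makes thresholding straightforward. A minimal sketch (the example scores and the 3.0 threshold are illustrative, not values from the paper):

```python
def filter_by_professionalism(docs_with_scores, min_score=3.0):
    """Keep only documents at or above a professionalism threshold.

    docs_with_scores: iterable of (text, score) pairs, where each score
    comes from the rating model.
    """
    return [text for text, score in docs_with_scores if score >= min_score]

corpus = [
    ("Twinkle, twinkle, little star...", 0.8),
    ("A popular-science overview of black holes.", 2.4),
    ("Proof of the spectral theorem for compact operators.", 4.6),
]
print(filter_by_professionalism(corpus))
```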

Limitations

  • The model focuses on content professionalism, not writing style or grammar
  • Performance may vary across different domains not well-represented in training data
  • Designed primarily for English text
  • Should not be used as the sole criterion for content quality assessment

Citation

If you use this model in your research, please cite:

@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}

License

This model is released under the same license as the base ModernBERT model.

Code

The code is available at: https://github.com/opendatalab/Meta-rater

Contact

For questions or issues, please contact the authors or open an issue in the repository.
