Cleanliness Rating Model
Model Description
This model is a fine-tuned version of ModernBERT-base that rates the Cleanliness dimension of text quality on a scale from 0 to 5. Cleanliness measures how well-formatted, complete, and noise-free a text is; it concerns structural integrity rather than semantic content.
Model Details
- Base Model: ModernBERT-base
- Parameters: 149M
- Context Window: 4,096 tokens
- Task: Text quality rating (regression)
- Score Range: 0-5 (continuous)
- Performance: 87.88% F1 score, 92.25% accuracy
Rating Scale
The model uses a 5-point rating system, judged against the criteria listed under Evaluation Criteria:
- 1: Serious formatting/structural issues that significantly affect fluency
- 2: Obvious problems that noticeably affect reading fluency
- 3: Some problems present but don't seriously impact reading fluency
- 4: Minor issues that don't affect overall readability
- 5: Perfect formatting and structure across all criteria
Evaluation Criteria
The model assesses text against the following criteria:
1. Correct Formatting
- Text appears human-edited rather than machine-extracted
- No inappropriate or corrupted characters
- Proper text structure and layout
2. Appropriate Content
- No irrelevant links, advertisements, or spam
- Sufficient content length to extract clear structure and theme
- Content focused on the main topic
3. Content Completeness
- Complete sentences written naturally by humans
- Coherent opinions, facts, or stories rather than fragments
- Proper article structure and flow
Note: Text ending with $TRUNCATED$ is considered a manual ending flag and doesn't affect completeness scoring.
Usage
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Load the model and tokenizer
model_name = "opendatalab/meta-rater-cleanliness-rating"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Example text
text = "This is a well-formatted article about renewable energy. It contains complete sentences and proper structure."
# Tokenize; inputs longer than the 4,096-token context window are truncated
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
# Predict without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
# Take the index of the highest logit as the score and convert it to a Python int
score = outputs.logits.squeeze().argmax(dim=0).item()
print(f"Cleanliness Score: {score}")
Training Details
- Training Data: 747,422 examples from SlimPajama dataset
- Annotation Model: Llama-3.3-70B-Instruct
- Training Epochs: 10
- Evaluation Split: 93,428 test examples
- Data Split: 8:1:1 (train:dev:test)
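As a quick consistency check on these figures, the sketch below derives the split sizes implied by the 8:1:1 ratio from the training-set count alone. The dev-set size and the total are not stated above, so the values computed here are estimates that assume the ratio holds up to rounding.

```python
# Figures stated in the Training Details list
train_examples = 747_422
test_examples = 93_428

# Under an 8:1:1 split, one "part" is one eighth of the training set
one_part = train_examples / 8

# Implied sizes (estimates; the dev-set size is not reported above)
implied_part = round(one_part)
implied_total = train_examples + 2 * implied_part

print(f"one part      ~ {one_part:.2f}")
print(f"implied part  ~ {implied_part}")
print(f"implied total ~ {implied_total}")
```

The implied size of one part matches the reported test-set count, so the stated ratio and the two reported counts agree with each other.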
Applications
This model is particularly useful for:
- Web scraping quality control and content filtering
- Data preprocessing for machine learning datasets
- Content management systems for automated quality checks
- Data curation for language model pre-training
- Document digitization quality assessment
- Automated content moderation and filtering
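For the filtering and curation uses above, a minimal sketch of a threshold filter might look like the following. The names `filter_by_cleanliness`, `score_fn`, `min_score`, and `toy_score` are hypothetical and not part of this model's API; `score_fn` stands for any callable that returns a cleanliness score for a text, and the default threshold of 4.0 is an assumption borrowed from the score-range guidance later in this card.

```python
from typing import Callable, Iterable, List, Tuple


def filter_by_cleanliness(
    texts: Iterable[str],
    score_fn: Callable[[str], float],
    min_score: float = 4.0,
) -> Tuple[List[str], List[str]]:
    """Split texts into (kept, dropped) by comparing each score to min_score."""
    kept: List[str] = []
    dropped: List[str] = []
    for text in texts:
        if score_fn(text) >= min_score:
            kept.append(text)
        else:
            dropped.append(text)
    return kept, dropped


# Stand-in scorer for demonstration only; it is NOT the model.
def toy_score(text: str) -> float:
    return 1.0 if "http://" in text else 5.0


docs = [
    "A complete, well-structured paragraph.",
    "click here http://spam.example buy now",
]
kept, dropped = filter_by_cleanliness(docs, toy_score)
print(len(kept), len(dropped))
```

In practice `score_fn` would wrap the tokenizer and model calls from the Usage section; keeping the scorer as a parameter lets the filtering logic be tested without loading the model.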
Common Issues Detected
The model can identify various types of text quality problems:
- Formatting artifacts from web scraping or OCR
- Incomplete sentences or fragmented text
- Excessive links or promotional content
- Corrupted characters or encoding issues
- Poor structure with inadequate content organization
- Advertisement contamination and irrelevant insertions
What the Model Does NOT Consider
- The specific language the text is written in
- The length of the text
- Usage of placeholders for data privacy or safety
- Content topic, professionalism, or semantic meaning
- Writing style or grammatical sophistication
Use Cases by Score Range
- 4.0-5.0: High-quality content suitable for training data
- 3.0-3.9: Acceptable content with minor cleaning needed
- 2.0-2.9: Moderate issues requiring preprocessing
- 1.0-1.9: Significant problems, may need extensive cleaning
- 0.0-0.9: Poor quality, likely unsuitable for most applications
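The bands above can be applied programmatically. The sketch below is one possible mapping, not an official utility: `describe_score` is a hypothetical helper, and it assumes each band is half-open (for example, 3.0 up to but not including 4.0) so that values such as 3.95 fall into a band.

```python
def describe_score(score: float) -> str:
    """Map a cleanliness score in [0, 5] to the guidance band listed above."""
    if not 0.0 <= score <= 5.0:
        raise ValueError(f"score must be within [0, 5], got {score}")
    if score >= 4.0:
        return "High-quality content suitable for training data"
    if score >= 3.0:
        return "Acceptable content with minor cleaning needed"
    if score >= 2.0:
        return "Moderate issues requiring preprocessing"
    if score >= 1.0:
        return "Significant problems, may need extensive cleaning"
    return "Poor quality, likely unsuitable for most applications"


for example in (4.6, 3.95, 0.4):
    print(example, "->", describe_score(example))
```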
Limitations
- Designed primarily for English text
- May not detect all domain-specific formatting conventions
- Performance may vary for highly technical formats (code, mathematical notation)
- Should be used in conjunction with other quality metrics for comprehensive assessment
Citation
If you use this model in your research, please cite:
@article{zhuang2025meta,
title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
journal={arXiv preprint arXiv:2504.14194},
year={2025}
}
You can find more details about Meta-rater at https://github.com/opendatalab/Meta-rater.
License
This model is released under the same license as the base ModernBERT model.
Contact
For questions or issues, please contact the authors or open an issue in the repository.