Cleanliness Rating Model

Model Description

This model is a fine-tuned version of ModernBERT-base that rates the Cleanliness dimension of text quality on a scale from 0 to 5. Cleanliness measures how well-formatted, complete, and noise-free a text is; it concerns structural integrity rather than semantic content.

Model Details

Base Model: ModernBERT-base
Parameters: 149M
Context Window: 4,096 tokens
Task: Text quality rating (regression)
Score Range: 0-5 (continuous)
Performance: 87.88% F1 score, 92.25% accuracy

Rating Scale

The model uses a 5-point rating system based on the evaluation criteria listed in the next section:

1: Serious formatting/structural issues that significantly affect fluency
2: Obvious problems that noticeably affect reading fluency
3: Some problems are present but do not seriously affect reading fluency
4: Minor issues that don't affect overall readability
5: Perfect formatting and structure across all criteria

Evaluation Criteria

The model assesses text against the following criteria:

1. Correct Formatting

Text appears human-edited rather than machine-extracted
No inappropriate or corrupted characters
Proper text structure and layout

2. Appropriate Content

No irrelevant links, advertisements, or spam
Sufficient content length to extract clear structure and theme
Content focused on the main topic

3. Content Completeness

Complete sentences written naturally by humans
Coherent opinions, facts, or stories rather than fragments
Proper article structure and flow

Note: $TRUNCATED$ at the end of a text is treated as a manually added end marker and does not affect completeness scoring.
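If a pipeline needs to know whether an input carries this marker (for example, to log truncated documents separately), a small check like the one below may help. This is a minimal sketch: the helper name is hypothetical, and whether the marker should be kept or removed before scoring is not stated here, so the function only reports what it finds.

```python
from typing import Tuple

TRUNCATION_FLAG = "$TRUNCATED$"  # marker string quoted in the note above


def split_truncation_flag(text: str) -> Tuple[str, bool]:
    """Return (text without a trailing flag, whether the flag was present).

    Hypothetical helper for illustration; it is not part of the model's API.
    """
    stripped = text.rstrip()
    if stripped.endswith(TRUNCATION_FLAG):
        # Drop the marker and any whitespace that preceded it.
        return stripped[: -len(TRUNCATION_FLAG)].rstrip(), True
    return text, False


body, was_truncated = split_truncation_flag("The report ends mid $TRUNCATED$")
print(body, was_truncated)
```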

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "opendatalab/meta-rater-cleanliness-rating"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example text
text = "This is a well-formatted article about renewable energy. It contains complete sentences and proper structure."

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
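with torch.no_grad():
    logits = model(**inputs).logits.squeeze(0)

# One logit means a regression head (continuous score); several logits mean
# one score per rating class, so take the index of the largest.
if logits.numel() == 1:
    score = logits.item()
else:
    score = logits.argmax(dim=-1).item()

print(f"Cleanliness Score: {score}")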

Training Details

Training Data: 747,422 examples from the SlimPajama dataset
Annotation Model: Llama-3.3-70B-Instruct
Training Epochs: 10
Evaluation Split: 93,428 test examples
Data Split: 8:1:1 (train:dev:test)
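The reported counts (747,422 training and 93,428 test examples) are in a ratio of roughly 8:1, which matches the stated split. For readers who want to reproduce a split of the same shape on their own data, the sketch below shows one way to do it. The function name, the seed, and the shuffle-then-slice approach are assumptions for illustration, not the authors' documented procedure.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_8_1_1(items: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle items and split them 8:1:1 into train, dev, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # remainder goes to the test set
    return train, dev, test


train, dev, test = split_8_1_1(range(1000))
print(len(train), len(dev), len(test))  # 800 100 100
```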

Applications

This model is particularly useful for:

Web scraping quality control and content filtering
Data preprocessing for machine learning datasets
Content management systems for automated quality checks
Data curation for language model pre-training
Document digitization quality assessment
Automated content moderation and filtering
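For filtering and curation, the score is typically compared against a cutoff. The sketch below shows that pattern with a pluggable scoring function, so it runs without downloading the model. The function names and the default threshold of 4.0 are assumptions; in practice `score_fn` would wrap the model call from the Usage section and the threshold would be tuned for the dataset.

```python
from typing import Callable, Iterable, List, Tuple


def filter_by_cleanliness(
    texts: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 4.0,  # assumed cutoff, tune for your data
) -> List[Tuple[str, float]]:
    """Keep (text, score) pairs whose score is at or above the threshold."""
    kept = []
    for text in texts:
        score = score_fn(text)
        if score >= threshold:
            kept.append((text, score))
    return kept


# Stand-in scorer for demonstration only; replace it with a call to the model.
def toy_score(text: str) -> float:
    return 1.0 if "http" in text else 5.0


docs = ["A complete, well-formed paragraph.", "click here http://spam.example"]
kept = filter_by_cleanliness(docs, toy_score)
print(kept)
```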

Common Issues Detected

The model can identify various types of text quality problems:

Formatting artifacts from web scraping or OCR
Incomplete sentences or fragmented text
Excessive links or promotional content
Corrupted characters or encoding issues
Poor structure with inadequate content organization
Advertisement contamination and irrelevant insertions

What the Model Does NOT Consider

The specific language the text is written in
The length of the text
Usage of placeholders for data privacy or safety
Content topic, professionalism, or semantic meaning
Writing style or grammatical sophistication

Use Cases by Score Range

4.0-5.0: High-quality content suitable for training data
3.0-3.9: Acceptable content with minor cleaning needed
2.0-2.9: Moderate issues requiring preprocessing
1.0-1.9: Significant problems, may need extensive cleaning
0.0-0.9: Poor quality, likely unsuitable for most applications
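The ranges above can be turned into a small lookup, sketched below. The function name and the short labels are illustrative, and the bands are treated as half-open intervals (so 3.95 falls in the 3.0 band), which is an assumption the table does not spell out.

```python
def describe_score(score: float) -> str:
    """Map a cleanliness score in [0, 5] to the usage band listed above.

    Hypothetical helper; labels paraphrase the table and are not model output.
    """
    if not 0.0 <= score <= 5.0:
        raise ValueError(f"score must be in [0, 5], got {score}")
    if score >= 4.0:
        return "high quality"
    if score >= 3.0:
        return "acceptable"
    if score >= 2.0:
        return "moderate issues"
    if score >= 1.0:
        return "significant problems"
    return "poor quality"


for s in (4.6, 3.95, 2.0, 0.4):
    print(s, describe_score(s))
```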

Limitations

Designed primarily for English text
May not detect all domain-specific formatting conventions
Performance may vary for highly technical formats (code, mathematical notation)
Should be used in conjunction with other quality metrics for comprehensive assessment

Citation

If you use this model in your research, please cite:

@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}

You can find more details about Meta-rater at https://github.com/opendatalab/Meta-rater.

License

This model is released under the same license as the base ModernBERT model.

Contact

For questions or issues, please contact the authors or open an issue in the repository.
