Cleanliness Rating Model

Model Description

This model is a fine-tuned version of ModernBERT-base that rates the Cleanliness dimension of text quality on a 0-5 scale. Cleanliness measures how well-formatted, complete, and noise-free a text is, focusing on structural integrity rather than semantic content.

Model Details

  • Base Model: ModernBERT-base
  • Parameters: 149M
  • Context Window: 4,096 tokens
  • Task: Text quality rating (regression)
  • Score Range: 0-5 (continuous)
  • Performance: 87.88% F1 score, 92.25% accuracy

Rating Scale

The model uses a 5-point rating system (1-5), applied across the evaluation criteria listed below:

  • 1: Serious formatting/structural issues that significantly affect fluency
  • 2: Obvious problems that noticeably affect reading fluency
  • 3: Some problems present but don't seriously impact reading fluency
  • 4: Minor issues that don't affect overall readability
  • 5: Perfect formatting and structure across all criteria

Evaluation Criteria

The model assesses text across the following dimensions:

1. Correct Formatting

  • Text appears human-edited rather than machine-extracted
  • No inappropriate or corrupted characters
  • Proper text structure and layout

2. Appropriate Content

  • No irrelevant links, advertisements, or spam
  • Sufficient content length to extract clear structure and theme
  • Content focused on the main topic

3. Complete Content

  • Complete sentences written naturally by humans
  • Coherent opinions, facts, or stories rather than fragments
  • Proper article structure and flow

Note: A trailing $TRUNCATED$ is treated as a manual end-of-text flag, so text ending with it is not penalized for completeness.
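
The note above suggests that text cut off on purpose can carry the $TRUNCATED$ marker so the cut is not read as an incomplete ending. Below is a minimal sketch of tagging such text before scoring. The helper name `mark_truncated`, the character-based cutoff, and the separating space before the flag are assumptions made for illustration; the source does not describe how the marker was applied.

```python
TRUNCATION_FLAG = "$TRUNCATED$"  # marker named in the note above


def mark_truncated(text: str, max_chars: int) -> str:
    """Cut text to at most max_chars characters and flag the cut.

    Hypothetical helper: the character-based limit and the separating
    space before the flag are assumptions, not the authors' method.
    """
    if len(text) <= max_chars:
        return text  # nothing was cut, so no flag is added
    return text[:max_chars].rstrip() + " " + TRUNCATION_FLAG


print(mark_truncated("A long article body that keeps going.", 14))
```

Text that already fits within the limit is returned unchanged, so the flag only appears where a cut actually happened.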

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "opendatalab/meta-rater-cleanliness-rating"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example text
text = "This is a well-formatted article about renewable energy. It contains complete sentences and proper structure."

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    outputs = model(**inputs)
    # Take the index of the largest logit as the predicted rating
    score = outputs.logits.squeeze(0).argmax(dim=-1).item()

print(f"Cleanliness Score: {score}")

Training Details

  • Training Data: 747,422 examples from the SlimPajama dataset
  • Annotation Model: Llama-3.3-70B-Instruct
  • Training Epochs: 10
  • Evaluation Split: 93,428 test examples
  • Data Split: 8:1:1 (train:dev:test)
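
As a quick consistency check on the figures above: under an 8:1:1 split, one share is one eighth of the training set, which should roughly match the stated test size. The sketch below does that arithmetic. The overall total it prints is an estimate implied by the ratio, not a number reported in this card.

```python
TRAIN_EXAMPLES = 747_422  # stated training set size
TEST_EXAMPLES = 93_428    # stated test set size

# With an 8:1:1 split, train holds 8 shares and dev/test hold 1 share each.
one_share = TRAIN_EXAMPLES / 8
implied_total = one_share * 10  # estimate only; not reported in the card

print(f"one share: {one_share:,.2f}")
print(f"implied total (estimate): {implied_total:,.1f}")
print(f"stated test size differs from one share by {abs(TEST_EXAMPLES - one_share):.2f}")
```

The stated test size sits within one example of a single share, so the reported counts agree with the 8:1:1 ratio.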

Applications

This model is particularly useful for:

  • Web scraping quality control and content filtering
  • Data preprocessing for machine learning datasets
  • Content management systems for automated quality checks
  • Data curation for language model pre-training
  • Document digitization quality assessment
  • Automated content moderation and filtering
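
For the filtering and curation use cases above, a common pattern is to keep only documents whose score clears a threshold. Below is a minimal sketch, assuming a hypothetical `score_fn` callable that wraps the model and returns one score per text. The function name `filter_clean`, the default threshold of 4.0 (chosen to mirror the top band in the score ranges listed later in this card), and the toy scorer are illustrative assumptions, not part of the released code.

```python
from typing import Callable, Iterable, List, Tuple


def filter_clean(
    texts: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 4.0,
) -> List[Tuple[str, float]]:
    """Return (text, score) pairs whose score is at least threshold."""
    kept = []
    for text in texts:
        score = score_fn(text)
        if score >= threshold:
            kept.append((text, score))
    return kept


def toy_score(text: str) -> float:
    """Stand-in scorer for this demo only; NOT the real model."""
    return 1.0 if "http" in text else 5.0


docs = [
    "A complete paragraph about solar power.",
    "CLICK http://spam.example BUY NOW",
]
kept_docs = filter_clean(docs, toy_score)
print(kept_docs)
```

In practice `toy_score` would be replaced by a function that runs the tokenizer and model as in the Usage section; passing the scorer in as a callable keeps the filtering logic independent of how scores are produced.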

Common Issues Detected

The model can identify various types of text quality problems:

  • Formatting artifacts from web scraping or OCR
  • Incomplete sentences or fragmented text
  • Excessive links or promotional content
  • Corrupted characters or encoding issues
  • Poor structure with inadequate content organization
  • Advertisement contamination and irrelevant insertions

What the Model Does NOT Consider

  • The specific language the text is written in
  • The length of the text
  • Usage of placeholders for data privacy or safety
  • Content topic, professionalism, or semantic meaning
  • Writing style or grammatical sophistication

Use Cases by Score Range

  • 4.0-5.0: High-quality content suitable for training data
  • 3.0-3.9: Acceptable content with minor cleaning needed
  • 2.0-2.9: Moderate issues requiring preprocessing
  • 1.0-1.9: Significant problems, may need extensive cleaning
  • 0.0-0.9: Poor quality, likely unsuitable for most applications
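
The bands above can be turned into a small lookup. This is a sketch under one assumption worth stating: the listed ranges leave gaps (for example between 3.9 and 4.0), so the code treats each band as starting at its lower bound and running up to the next one. The name `describe_score` is hypothetical.

```python
# (lower bound, description) pairs, highest band first.
# Assumption: each band covers [lower, next lower), closing the gaps
# between e.g. 3.9 and 4.0 in the listed ranges.
BANDS = [
    (4.0, "High-quality content suitable for training data"),
    (3.0, "Acceptable content with minor cleaning needed"),
    (2.0, "Moderate issues requiring preprocessing"),
    (1.0, "Significant problems, may need extensive cleaning"),
    (0.0, "Poor quality, likely unsuitable for most applications"),
]


def describe_score(score: float) -> str:
    """Map a score in [0, 5] to its band description."""
    if not 0.0 <= score <= 5.0:
        raise ValueError(f"score must be in [0, 5], got {score}")
    for lower, description in BANDS:
        if score >= lower:
            return description
    raise AssertionError("unreachable: every valid score is >= 0.0")


print(describe_score(3.95))
```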

Limitations

  • Designed primarily for English text
  • May not detect all domain-specific formatting conventions
  • Performance may vary for highly technical formats (code, mathematical notation)
  • Should be used in conjunction with other quality metrics for comprehensive assessment

Citation

If you use this model in your research, please cite:

@article{zhuang2025meta,
  title={Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models},
  author={Zhuang, Xinlin and Peng, Jiahui and Ma, Ren and Wang, Yinfan and Bai, Tianyi and Wei, Xingjian and Qiu, Jiantao and Zhang, Chi and Qian, Ying and He, Conghui},
  journal={arXiv preprint arXiv:2504.14194},
  year={2025}
}

You can find more details about Meta-rater at https://github.com/opendatalab/Meta-rater.

License

This model is released under the same license as the base ModernBERT model.

Contact

For questions or issues, please contact the authors or open an issue in the repository.
