
CCI4.0-ZH-HQ-Classifiers

Overview

CCI4.0-ZH-HQ-Classifiers is our model-based quality labeling system, designed to score the quality of the CCI4.0 Chinese corpus. Because a single classifier has limited recall when identifying high-quality pretraining documents, we follow an approach similar to Nemotron-CC's treatment of English Common Crawl and build three separate quality classifiers to re-evaluate the Chinese pretraining data.

Quality Classifier Training

We used two large language models to annotate the quality of Chinese samples, constructing two separate 460K-sample training sets based on Qwen2.5-72B and DeepSeek-V3, respectively. These datasets were then used for fine-tuning, yielding two distinct quality classifiers. Additionally, we employed a fastText-based classifier trained on a combination of instruction-formatted data and high-scoring posts selected from the Chinese corpus.
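
For illustration, a fastText classifier of this kind can be trained with the official fasttext library. The following is a minimal sketch: the file names, label scheme, and hyperparameters are assumptions, not the released training setup, and Chinese text is typically word-segmented (e.g., with jieba) before training.

  import fasttext

  # Training file: one (pre-segmented) document per line, prefixed with a
  # quality label such as __label__hq or __label__lq. File names and labels
  # here are illustrative assumptions.
  model = fasttext.train_supervised(
    input="zh_quality_train.txt",
    lr=0.1,
    epoch=3,
    wordNgrams=2,
    dim=256,
  )
  model.save_model("zh_quality_fasttext.bin")

  # Predict the top label and its probability for a new document.
  labels, probs = model.predict("分词 后 的 中文 文档 文本", k=1)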

Quality Scoring and Bucketing

Following a similar approach to Nemotron-CC, we first use each of the three classifiers to predict quality scores for all documents in the corpus. For each classifier, we rank the documents by their predicted scores and discretize the results into integer buckets ranging from 0 to 19, with each bucket representing approximately 5% of the data. Bucket 19 corresponds to the top 5% of highest-quality documents. To obtain the final quality score for each document, we ensemble the integer scores from the three classifiers using a maximum aggregation strategy.
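
As a minimal sketch of this step, assuming the three classifiers' raw scores are held in NumPy arrays (the random placeholder scores below stand in for real predictions):

  import numpy as np

  rng = np.random.default_rng(0)
  num_docs = 10_000
  # Placeholder raw scores standing in for the three classifiers' predictions.
  scores = rng.random((3, num_docs))

  # Rank documents per classifier, then discretize the ranks into 20 integer
  # buckets (0-19) of ~5% each; bucket 19 holds the top-scoring 5%.
  ranks = scores.argsort(axis=1).argsort(axis=1)
  buckets = ranks * 20 // num_docs  # integer buckets 0..19

  # Ensemble: the final quality score is the maximum bucket across classifiers.
  final_quality = buckets.max(axis=0)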

Quality Labeling

To assign quality labels that better reflect the actual impact of data on downstream performance, we further evaluated each score bucket through pretraining. Specifically, we pretrained a 1B-parameter dense model on 100B tokens sampled from each bucket and measured its downstream task performance. The evaluation results show that the downstream performance trends are consistent with the classifier-based quality score rankings, validating the effectiveness of the quality labeling.

Figure: Curve of Chinese evaluation metrics across quality score buckets.

Usage

A minimal scoring example (assuming transformers and torch are installed, and model_dir points to one of the classifier checkpoints in this repository):

  import torch
  from transformers import AutoModelForSequenceClassification, AutoTokenizer

  model_dir = "BAAI/CCI4.0-ZH-HQ-Classifiers"  # or a local path to a classifier checkpoint

  # Load the quality classifier and move it to the GPU.
  model = AutoModelForSequenceClassification.from_pretrained(
    model_dir,
    trust_remote_code=False,
    ignore_mismatched_sizes=False,
  )
  model.cuda()
  model.eval()

  tokenizer = AutoTokenizer.from_pretrained(
    model_dir,
    use_fast=True,
    trust_remote_code=False,
  )

  sentence = "待打分的中文文档文本"  # the document to score

  # Tokenize; return_tensors="pt" already yields PyTorch tensors,
  # so no extra conversion loop is needed.
  inputs = tokenizer(
    [sentence],
    padding=False,
    max_length=512,
    truncation=True,
    return_tensors="pt",
  ).to("cuda")

  with torch.no_grad():
    model_out = model(**inputs)
  pred_score = float(model_out.logits[0][0])
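
The raw logit is a continuous quality score. To recover an integer bucket label like those described above, one can apply 5% quantile cuts computed over the scored corpus; the sketch below uses placeholder inputs (all_pred_scores and the resulting cut points are illustrative, not the released thresholds):

  import numpy as np

  # all_pred_scores: raw scores collected over the whole corpus (assumed).
  corpus_scores = np.asarray(all_pred_scores)
  cuts = np.quantile(corpus_scores, np.linspace(0.05, 0.95, 19))
  bucket = int(np.searchsorted(cuts, pred_score))  # integer bucket 0..19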

Citation

Please cite using:

@dataset{cci4_m2_v1,
  title={CCI4.0-M2 v1 Dataset Collection},
  author={OpenSeek Team},
  year={2025},
  publisher={Beijing Academy of Artificial Intelligence}
}