---
library_name: transformers
license: mit
base_model: agentlans/multilingual-e5-small-aligned-v2
tags:
- generated_from_trainer
metrics:
- accuracy
language:
- ar
- zh
- cs
- da
- nl
- fr
- de
- el
- hu
- id
- it
- ja
- fa
- pl
- pt
- ru
- es
- sv
- tr
- vi
datasets:
- agentlans/fineweb2hq-vs-c4
pipeline_tag: text-classification
---

# agentlans/multilingual-e5-small-fineweb2hq-vs-c4-classifier

> [!IMPORTANT]
> **Note:** This model is provided for reference and reproducibility, not for standalone use.

This model is a fine-tuned version of [agentlans/multilingual-e5-small-aligned-v2](https://huggingface.co/agentlans/multilingual-e5-small-aligned-v2) on the [agentlans/fineweb2hq-vs-c4](https://huggingface.co/datasets/agentlans/fineweb2hq-vs-c4) dataset. It classifies text as higher quality (FineWeb 2 HQ) or lower quality (C4) for AI training data curation.

On the validation set:

- Loss: 0.1983
- Accuracy: 0.9515
- Combined Score: 1.3494
- Num Input Tokens Seen: 122880000

## Example

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="agentlans/multilingual-e5-small-fineweb2hq-vs-c4-classifier",
)
classifier("Your text here.")
```

A sketch that thresholds on the raw class probabilities, rather than the argmax label, appears at the end of this card.

## Limitations

- **Not trained on English data**
- Tends to be overly permissive, labelling most texts outside the training distribution as high quality
- May be biased against some text types

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: linear
- num_epochs: 3.0

A sketch of how these settings map onto `TrainingArguments` appears at the end of this card.

### Training results

| Training Loss | Epoch | Step   | Validation Loss | Accuracy | Combined Score | Input Tokens Seen |
|:-------------:|:-----:|:------:|:---------------:|:--------:|:--------------:|:-----------------:|
| 0.1387        | 1.0   | 40000  | 0.1983          | 0.9515   | 1.3494         | 40960000          |
| 0.0682        | 2.0   | 80000  | 0.2264          | 0.9528   | 1.3270         | 81920000          |
| 0.0424        | 3.0   | 120000 | 0.2598          | 0.9552   | 1.2845         | 122880000         |

### Framework versions

- Transformers 4.51.3
- Pytorch 2.6.0+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0
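## Reproduction sketch

The exact training script is not part of this card, but the `generated_from_trainer` tag and the hyperparameters above point to a standard `Trainer` setup. Below is a minimal sketch of how those reported settings map onto `TrainingArguments`; the `output_dir`, the `eval_strategy` value, and the dataset preparation are assumptions for illustration, not the author's actual configuration.

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

base = "agentlans/multilingual-e5-small-aligned-v2"
tokenizer = AutoTokenizer.from_pretrained(base)
# Binary head: FineWeb 2 HQ vs C4.
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

args = TrainingArguments(
    output_dir="multilingual-e5-small-fineweb2hq-vs-c4-classifier",  # assumed
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",        # betas=(0.9, 0.999) and eps=1e-8 are the defaults
    lr_scheduler_type="linear",
    num_train_epochs=3.0,
    eval_strategy="epoch",      # assumed; the table above reports per-epoch validation
)

# Tokenizing agentlans/fineweb2hq-vs-c4 is omitted here; pass the tokenized
# splits to Trainer to run the actual fine-tuning.
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```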
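## Filtering sketch

Because the argmax label tends to be permissive (see Limitations), corpus filtering may work better by thresholding on the probability of the high-quality class. The sketch below assumes the high-quality class is exposed as `LABEL_1` (check `classifier.model.config.id2label` for the actual mapping) and uses an illustrative threshold of 0.9.

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="agentlans/multilingual-e5-small-fineweb2hq-vs-c4-classifier",
)

texts = [
    "A well-structured article with coherent argumentation.",
    "buy cheap watches click here best price!!!",
]

# top_k=None returns the score for every class instead of only the argmax;
# truncation guards against inputs longer than the encoder's maximum length.
results = classifier(texts, top_k=None, truncation=True)

THRESHOLD = 0.9  # illustrative cutoff; tune on held-out data for your corpus

kept = []
for text, scores in zip(texts, results):
    by_label = {s["label"]: s["score"] for s in scores}
    # "LABEL_1" is assumed to be the high-quality (FineWeb 2 HQ) class;
    # verify with classifier.model.config.id2label before relying on it.
    if by_label.get("LABEL_1", 0.0) >= THRESHOLD:
        kept.append(text)

print(f"Kept {len(kept)} of {len(texts)} documents")
```

Raising the threshold above 0.5 is one way to compensate for the permissiveness noted under Limitations, at the cost of discarding more borderline documents.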