davanstrien (HF Staff) committed · verified
Commit a107300 · 1 Parent(s): 85f5a12

Update README.md

Files changed (1):
  1. README.md +177 -67
README.md CHANGED
@@ -4,12 +4,33 @@ license: apache-2.0
  base_model: answerdotai/ModernBERT-base
  tags:
  - generated_from_trainer
  metrics:
  - accuracy
  - f1
  model-index:
- - name: modernbert-topics-1m
-   results: []
  datasets:
  - WebOrganizer/TopicAnnotations-Llama-3.1-8B
  language:
@@ -17,68 +38,157 @@ language:
  pipeline_tag: text-classification
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- # modernbert-topics-1m
-
- This model is a fine-tuned version of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) on the [WebOrganizer/TopicAnnotations-Llama-3.1-8B](https://huggingface.co/datasets/WebOrganizer/TopicAnnotations-Llama-3.1-8B) dataset.
-
- It achieves the following results on the evaluation set:
- - Loss: 0.5923
- - Accuracy: 0.7949
- - F1: 0.7948
- - Worst Group Accuracy: 0.5723
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 5e-05
- - train_batch_size: 64
- - eval_batch_size: 64
- - seed: 42
- - distributed_type: multi-GPU
- - num_devices: 4
- - gradient_accumulation_steps: 4
- - total_train_batch_size: 1024
- - total_eval_batch_size: 256
- - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_steps: 5000
- - num_epochs: 5
-
- ### Training results
-
- | Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 | Worst Group Accuracy |
- |:-------------:|:------:|:----:|:---------------:|:--------:|:------:|:--------------------:|
- | 6.7028 | 0.5223 | 500 | 0.8360 | 0.7279 | 0.7269 | 0.4539 |
- | 2.9939 | 1.0439 | 1000 | 0.7108 | 0.7635 | 0.7632 | 0.5876 |
- | 2.625 | 1.5662 | 1500 | 0.6393 | 0.7785 | 0.7778 | 0.6283 |
- | 2.425 | 2.0878 | 2000 | 0.6043 | 0.7886 | 0.7880 | 0.6098 |
- | 2.2422 | 2.6101 | 2500 | 0.5870 | 0.7908 | 0.7902 | 0.6272 |
- | 2.158 | 3.1316 | 3000 | 0.5723 | 0.7944 | 0.7939 | 0.6427 |
- | 1.9898 | 3.6540 | 3500 | 0.5684 | 0.7947 | 0.7945 | 0.6595 |
- | 1.8709 | 4.1755 | 4000 | 0.6023 | 0.7941 | 0.7938 | 0.6061 |
- | 1.6459 | 4.6978 | 4500 | 0.5923 | 0.7949 | 0.7948 | 0.5723 |
-
-
- ### Framework versions
-
- - Transformers 4.51.3
- - Pytorch 2.6.0+cu124
- - Datasets 3.5.0
- - Tokenizers 0.21.1
  base_model: answerdotai/ModernBERT-base
  tags:
  - generated_from_trainer
+ - text-classification
+ - topic-detection
+ - modernbert
+ - web-content-classification
  metrics:
  - accuracy
  - f1
+ - worst_group_accuracy
  model-index:
+ - name: davanstrien/modernbert-topics-1m
+   results:
+   - task:
+       type: text-classification
+       name: Topic Classification
+     dataset:
+       name: WebOrganizer/TopicAnnotations-Llama-3.1-8B
+       type: WebOrganizer/TopicAnnotations-Llama-3.1-8B
+     metrics:
+     - name: Accuracy
+       type: accuracy
+       value: 0.7949
+     - name: F1
+       type: f1
+       value: 0.7948
+     - name: Worst Group Accuracy
+       type: worst_group_accuracy
+       value: 0.5723
  datasets:
  - WebOrganizer/TopicAnnotations-Llama-3.1-8B
  language:
  pipeline_tag: text-classification
  ---

+ # ModernBERT Topics Classification Model (davanstrien/modernbert-topics-1m)
+
+ ## Model Description
+
+ This model is a fine-tuned version of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) on the [WebOrganizer/TopicAnnotations-Llama-3.1-8B](https://huggingface.co/datasets/WebOrganizer/TopicAnnotations-Llama-3.1-8B) dataset for multi-class topic classification. It classifies web content into 24 distinct topic categories, ranging from "Adult Content" to "Food & Dining", making it useful for content categorization, filtering, and organization tasks.
+
+ The model builds on ModernBERT's architecture, which uses efficient attention mechanisms and handles longer contexts than traditional BERT models (up to 8192 tokens). This checkpoint was created specifically to be compatible with vLLM, enabling faster and more efficient deployment, especially when processing large volumes of web content.
+
+ It serves as an alternative to the original [WebOrganizer/TopicClassifier](https://huggingface.co/WebOrganizer/TopicClassifier), the key difference being ModernBERT as the base architecture instead of a standard BERT-style model, which provides improved efficiency and longer context handling.
+
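+ The quickest way to try the model is the Transformers `pipeline` API (a minimal sketch; a fuller example is given in the Inference Information section below):
+
+ ```python
+ from transformers import pipeline
+
+ # Quick start: single-text topic prediction via the text-classification pipeline.
+ classifier = pipeline("text-classification", model="davanstrien/modernbert-topics-1m")
+ print(classifier("New research on coral reef ecosystems and ocean warming"))
+ # -> [{'label': '<one of the 24 topics>', 'score': ...}]
+ ```
+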
+ ## Performance
+
+ The model achieves the following results on the evaluation set:
+ - **Loss:** 0.5923
+ - **Accuracy:** 0.7949
+ - **F1 Score:** 0.7948
+ - **Worst Group Accuracy:** 0.5723
+
+ These metrics indicate strong overall performance, with nearly 80% accuracy across all categories. The "Worst Group Accuracy" metric of 57.23% suggests there is some variance in performance across different topic categories, which should be considered when using this model for specific domains.
+
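+ The card does not spell out how "worst group accuracy" is computed; the sketch below assumes the common definition of the minimum per-class accuracy, where each group is the set of examples sharing a gold topic label:
+
+ ```python
+ import numpy as np
+
+ def worst_group_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
+     """Minimum accuracy over the groups defined by the gold topic label."""
+     per_class = [
+         (y_pred[y_true == label] == label).mean()
+         for label in np.unique(y_true)
+     ]
+     return float(min(per_class))
+
+ # Toy example: class 2 is the weakest group (1 of 3 correct).
+ y_true = np.array([0, 0, 1, 1, 2, 2, 2])
+ y_pred = np.array([0, 0, 1, 0, 2, 1, 1])
+ print(worst_group_accuracy(y_true, y_pred))  # ~0.33
+ ```
+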
+ ## Intended Uses & Limitations
+
+ ### Intended Uses
+ - Web content categorization and organization
+ - Content filtering systems for various platforms and applications
+ - Topic-based content recommendation systems
+ - Research and analysis of web content distribution
+ - Automated content tagging for content management systems
+ - Information retrieval systems that benefit from topical categorization
+ - Pre-processing step for domain-specific training data curation
+
+ ### Limitations
+ - Performance varies across categories, with a worst group accuracy of 57.23%, indicating some topics may be classified less reliably than others
+ - The model may struggle with content that spans multiple categories or contains ambiguous topics
+ - Limited to English language content
+ - May not perform optimally on specialized domain-specific content that differs significantly from the training data
+ - Classification is limited to the 24 predefined categories; content outside these categories may be misclassified
+ - The model's training data was annotated by an LLM (Llama-3.1-8B), which may introduce systematic biases compared to human annotations
+ - While the model can process up to 8192 tokens, very long documents may lose important context if truncated
+
+ ## Training and Evaluation Data
+
+ This model was trained on the [WebOrganizer/TopicAnnotations-Llama-3.1-8B](https://huggingface.co/datasets/WebOrganizer/TopicAnnotations-Llama-3.1-8B) dataset, which contains 1 million web pages annotated with topic labels generated by the Llama-3.1-8B model. The dataset is derived from the DCLM RefinedWeb reproduction and was created as part of the research presented in ["Organize the Web: Constructing Domains Enhances Pre-Training Data Curation"](https://arxiv.org/abs/2502.10341).
+
+ Each sample in the dataset contains the full text content of a web page, its URL, the most likely topic label with its probability, probabilities for all possible topics, and additional metadata. The dataset was specifically designed for training topic classifiers and is used as first-stage training data for the WebOrganizer TopicClassifier.
+
+ The 24 topic categories covered by this model are listed below (a short snippet for reading the checkpoint's exact id-to-label mapping follows the list):
+
+ 1. Adult Content
+ 2. Politics (includes social issues, campaigns, legislation, geopolitics, protests, activism)
+ 3. History & Geography (includes archaeology)
+ 4. Health (includes medicine, wellness, mental health, veterinary science, nutrition)
+ 5. Home & Hobbies (includes real estate, DIY, gardening, pets, collecting)
+ 6. Travel & Tourism (includes hospitality, hotels, cruises)
+ 7. Religion (includes spirituality)
+ 8. Sports & Fitness (includes martial arts, motor sports, outdoor activities)
+ 9. Games (includes video games, board games, gambling)
+ 10. Entertainment (includes music, movies, TV, celebrities, humor)
+ 11. Literature (includes criticism, linguistics, philosophy, humanities)
+ 12. Art & Design (includes architecture)
+ 13. Science, Math & Technology (includes physics, chemistry, biology, mathematics, engineering)
+ 14. Education & Jobs (includes pedagogy, training, academia)
+ 15. Software Development (includes algorithms, coding, web development)
+ 16. Fashion & Beauty (includes clothing, accessories, cosmetics)
+ 17. Industrial (includes mining, agriculture, manufacturing, construction)
+ 18. Software (topics related to software use and the internet)
+ 19. Finance & Business (includes taxes, investments, insurance, marketing, HR)
+ 20. Electronics & Hardware (includes computer hardware, phones, consumer electronics)
+ 21. Crime & Law (includes law enforcement)
+ 22. Transportation (includes vehicles, public transit, aviation, logistics)
+ 23. Social Life (includes family, relationships, community)
+ 24. Food & Dining (includes recipes, groceries, beverages, restaurants)
+
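+ The label names shipped with the checkpoint can be read from the model config (a minimal sketch; it assumes `id2label` was populated during training, and the exact spellings may differ slightly from the list above):
+
+ ```python
+ from transformers import AutoConfig
+
+ # Inspect the id-to-label mapping without downloading the model weights.
+ config = AutoConfig.from_pretrained("davanstrien/modernbert-topics-1m")
+ for idx in sorted(config.id2label, key=int):
+     print(idx, config.id2label[idx])
+ ```
+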
+ ## Training Procedure
+
+ ### Training Hyperparameters
+
+ The model was trained with the following hyperparameters:
+ - **Learning rate:** 5e-05
+ - **Train batch size:** 64 per device (1024 effective with 4 GPUs and 4 gradient accumulation steps)
+ - **Eval batch size:** 64 per device (256 total across 4 GPUs)
+ - **Optimizer:** AdamW (fused) with betas=(0.9, 0.999) and epsilon=1e-08
+ - **LR scheduler:** Linear with 5000 warmup steps
+ - **Training epochs:** 5
+ - **Distributed training:** Multi-GPU with 4 devices
+ - **Gradient accumulation steps:** 4
+ - **Seed:** 42
+
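+ The training script itself is not part of this repository; as a rough guide, the hyperparameters above map onto `transformers.TrainingArguments` roughly as follows (a sketch only, with an illustrative output directory; multi-GPU settings come from the launcher rather than these arguments):
+
+ ```python
+ from transformers import TrainingArguments
+
+ # Sketch: per-device batch size 64 x 4 GPUs x 4 accumulation steps = 1024 effective.
+ training_args = TrainingArguments(
+     output_dir="modernbert-topics-1m",   # illustrative
+     learning_rate=5e-5,
+     per_device_train_batch_size=64,
+     per_device_eval_batch_size=64,
+     gradient_accumulation_steps=4,
+     num_train_epochs=5,
+     lr_scheduler_type="linear",
+     warmup_steps=5000,
+     optim="adamw_torch_fused",
+     seed=42,
+ )
+ ```
+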
+ ## Technical Specifications
+
+ ### Model Architecture
+ - **Base model:** ModernBertForSequenceClassification
+ - **Hidden size:** 768
+ - **Number of hidden layers:** 22
+ - **Number of attention heads:** 12
+ - **Intermediate size:** 1152
+ - **Max position embeddings:** 8192
+ - **Vocabulary size:** 50368
+
+ ### Framework Versions
+ - Transformers: 4.51.3
+ - PyTorch: 2.6.0+cu124
+ - Datasets: 3.5.0
+ - Tokenizers: 0.21.1
+
+ ## Inference Information
+
+ This model is compatible with vLLM and other inference engines, which can significantly improve inference speed, especially for batch processing. When using the model, use the ModernBERT tokenizer and respect the model's maximum sequence length of 8192 tokens.
+
+ Example usage with Transformers:
+ ```python
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+ import torch
+
+ # Load model and tokenizer
+ model_name = "davanstrien/modernbert-topics-1m"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # Prepare input text
+ text = "The impact of global warming on coral reef ecosystems"
+
+ # Tokenize and predict
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # Get prediction
+ prediction = outputs.logits.argmax(-1).item()
+ predicted_label = model.config.id2label[prediction]
+ print(f"Predicted topic: {predicted_label}")
+ ```
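+
+ For high-throughput batch classification, a vLLM-based setup along the following lines should work (a sketch only: it assumes a vLLM build with ModernBERT sequence-classification support and uses vLLM's `task="classify"` pooling interface, whose details may vary between versions):
+
+ ```python
+ from vllm import LLM
+
+ # Sketch: batch topic classification via vLLM's classification/pooling interface.
+ llm = LLM(model="davanstrien/modernbert-topics-1m", task="classify")
+
+ texts = [
+     "The impact of global warming on coral reef ecosystems",
+     "How to make sourdough bread at home",
+ ]
+ outputs = llm.classify(texts)
+
+ for text, output in zip(texts, outputs):
+     probs = output.outputs.probs                    # per-class probabilities
+     predicted_id = max(range(len(probs)), key=probs.__getitem__)
+     print(text, "->", predicted_id)
+ ```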
+
+ ## Ethical Considerations and Biases
+
+ - This model may inherit biases present in the training data, potentially leading to inconsistent classification across different demographic or cultural contexts.
+ - Topics with less representation in the training data may show lower accuracy.
+ - Users should be aware that fully automated content classification without human oversight may lead to inappropriate categorizations in edge cases.
+
+ ## Citation and Contact Information
+
+ If you use this model in your research or applications, please cite the original ModernBERT model, as well as the WebOrganizer dataset and paper:
+
+ ```bibtex
+ @article{wettig2025organize,
+   title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
+   author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
+   journal={arXiv preprint arXiv:2502.10341},
+   year={2025}
+ }
+ ```
+
+ For questions, issues, or contributions related to this model, please reach out through the [Hugging Face model repository](https://huggingface.co/davanstrien/modernbert-topics-1m).