---
library_name: transformers
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
- generated_from_trainer
- text-classification
- topic-detection
- modernbert
- web-content-classification
metrics:
- accuracy
- f1
- worst_group_accuracy
model-index:
- name: davanstrien/ModernBERT-web-topics-1m
  results:
  - task:
      type: text-classification
      name: Topic Classification
    dataset:
      name: WebOrganizer/TopicAnnotations-Llama-3.1-8B
      type: WebOrganizer/TopicAnnotations-Llama-3.1-8B
    metrics:
      - name: Accuracy
        type: accuracy
        value: 0.7949
      - name: F1
        type: f1
        value: 0.7948
      - name: Worst Group Accuracy
        type: worst_group_accuracy
        value: 0.5723
datasets:
- WebOrganizer/TopicAnnotations-Llama-3.1-8B
language:
- en
pipeline_tag: text-classification
---

# ModernBERT-web-topics-1m

## Model Description

This model is a fine-tuned version of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) on the [WebOrganizer/TopicAnnotations-Llama-3.1-8B](https://huggingface.co/datasets/WebOrganizer/TopicAnnotations-Llama-3.1-8B) dataset for multi-class topic classification. It is designed to classify web content into 24 distinct topic categories, ranging from "Adult Content" to "Food & Dining," making it useful for content categorization, filtering, and organization tasks.

The model leverages ModernBERT's architecture, which includes efficient attention mechanisms and supports much longer contexts than traditional BERT models (up to 8192 tokens). This implementation was created specifically to be compatible with vLLM, enabling faster and more efficient deployment, especially when processing large volumes of web content.

This model serves as an alternative to the original [WebOrganizer/TopicClassifier](https://huggingface.co/WebOrganizer/TopicClassifier), with the key difference being the use of ModernBERT as the base architecture instead of typical BERT models, providing improved efficiency and longer context handling.

## Performance

The model achieves the following results on the evaluation set:
- **Loss:** 0.5923
- **Accuracy:** 0.7949
- **F1 Score:** 0.7948
- **Worst Group Accuracy:** 0.5723

These metrics indicate strong overall performance, with nearly 80% accuracy across all categories. The "Worst Group Accuracy" metric of 57.23% suggests there is some variance in performance across different topic categories, which should be considered when using this model for specific domains.
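
For reference, below is a minimal sketch of how a worst-group accuracy like this can be computed, assuming each of the 24 topic labels is treated as one group (the exact grouping used in evaluation is not stated in this card):

```python
import numpy as np

def worst_group_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Return the minimum per-group accuracy, one group per gold label."""
    per_group = [
        (y_pred[y_true == label] == label).mean()
        for label in np.unique(y_true)
    ]
    return float(min(per_group))

# Toy example with three label ids
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2])
print(worst_group_accuracy(y_true, y_pred))  # 0.5 (label 0 is the hardest group)
```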

## Intended Uses & Limitations

### Intended Uses
- Web content categorization and organization
- Content filtering systems for various platforms and applications
- Topic-based content recommendation systems
- Research and analysis of web content distribution
- Automated content tagging for content management systems
- Information retrieval systems that benefit from topical categorization
- Pre-processing step for domain-specific training data curation

### Limitations
- Performance varies across categories, with a worst group accuracy of 57.23%, indicating some topics may be classified less reliably than others
- The model may struggle with content that spans multiple categories or contains ambiguous topics
- Limited to English language content
- May not perform optimally on specialized domain-specific content that differs significantly from the training data
- Classification is limited to the 24 predefined categories; content outside these categories may be misclassified
- The model's training data was annotated by an LLM (Llama-3.1-8B), which may introduce systematic biases compared to human annotations
- While the model can process up to 8192 tokens, very long documents may lose important context if truncated

## Training and Evaluation Data

This model was trained on the [WebOrganizer/TopicAnnotations-Llama-3.1-8B](https://huggingface.co/datasets/WebOrganizer/TopicAnnotations-Llama-3.1-8B) dataset, which contains 1 million web pages annotated with topic labels generated by the Llama-3.1-8B model. The dataset is derived from the DCLM RefinedWeb reproduction and was created as part of the research presented in ["Organize the Web: Constructing Domains Enhances Pre-Training Data Curation"](https://arxiv.org/abs/2502.10341).

Each sample in the dataset contains the full text content of a web page, its URL, the most likely topic label with its probability, probabilities for all possible topics, and additional metadata. The dataset was specifically designed for training topic classifiers and is used as first-stage training data for the WebOrganizer TopicClassifier.
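
To inspect the data yourself, the dataset can be streamed with the `datasets` library; this is a sketch only, and the exact column names should be checked against the dataset card rather than assumed from this example:

```python
from datasets import load_dataset

# Stream a few rows rather than downloading all 1M annotated pages
ds = load_dataset(
    "WebOrganizer/TopicAnnotations-Llama-3.1-8B",
    split="train",
    streaming=True,
)

example = next(iter(ds))
print(example.keys())  # inspect the available fields (text, url, label probabilities, ...)
```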

The 24 topic categories covered by this model are:

1. Adult Content
2. Politics (includes social issues, campaigns, legislation, geopolitics, protests, activism)
3. History & Geography (includes archaeology)
4. Health (includes medicine, wellness, mental health, veterinary science, nutrition)
5. Home & Hobbies (includes real estate, DIY, gardening, pets, collecting)
6. Travel & Tourism (includes hospitality, hotels, cruises)
7. Religion (includes spirituality)
8. Sports & Fitness (includes martial arts, motor sports, outdoor activities)
9. Games (includes video games, board games, gambling)
10. Entertainment (includes music, movies, TV, celebrities, humor)
11. Literature (includes criticism, linguistics, philosophy, humanities)
12. Art & Design (includes architecture)
13. Science, Math & Technology (includes physics, chemistry, biology, mathematics, engineering)
14. Education & Jobs (includes pedagogy, training, academia)
15. Software Development (includes algorithms, coding, web development)
16. Fashion & Beauty (includes clothing, accessories, cosmetics)
17. Industrial (includes mining, agriculture, manufacturing, construction)
18. Software (topics related to software use and the internet)
19. Finance & Business (includes taxes, investments, insurance, marketing, HR)
20. Electronics & Hardware (includes computer hardware, phones, consumer electronics)
21. Crime & Law (includes law enforcement)
22. Transportation (includes vehicles, public transit, aviation, logistics)
23. Social Life (includes family, relationships, community)
24. Food & Dining (includes recipes, groceries, beverages, restaurants)

## Training Procedure

### Training Hyperparameters

The model was trained with the following hyperparameters (a rough `TrainingArguments` equivalent is sketched after the list):
- **Learning rate:** 5e-05
- **Train batch size:** 64 per device (effective batch size 1024 with 4 GPUs and 4 gradient-accumulation steps)
- **Eval batch size:** 64 per device (256 total across 4 GPUs)
- **Optimizer:** AdamW with betas=(0.9, 0.999) and epsilon=1e-08
- **LR scheduler:** Linear with 5000 warmup steps
- **Training epochs:** 5
- **Distributed training:** Multi-GPU with 4 devices
- **Gradient accumulation steps:** 4
- **Seed:** 42
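
These settings map onto the standard `transformers` `Trainer` API. The block below is a hedged reconstruction of the reported hyperparameters, not the exact training script; dataset preprocessing, `compute_metrics`, and the model/label setup are omitted, and the `output_dir` name is made up for illustration:

```python
from transformers import TrainingArguments

# Illustrative reconstruction of the hyperparameters listed above
training_args = TrainingArguments(
    output_dir="modernbert-web-topics-1m",  # hypothetical path
    learning_rate=5e-5,
    per_device_train_batch_size=64,   # 4 GPUs x 4 accumulation steps -> 1024 effective
    per_device_eval_batch_size=64,    # 256 total across 4 GPUs
    gradient_accumulation_steps=4,
    num_train_epochs=5,
    lr_scheduler_type="linear",
    warmup_steps=5000,
    optim="adamw_torch",              # AdamW with betas=(0.9, 0.999), eps=1e-8 (defaults)
    seed=42,
)
```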


## Technical Specifications

### Model Architecture
- **Base model:** ModernBertForSequenceClassification
- **Hidden size:** 768
- **Number of hidden layers:** 22
- **Number of attention heads:** 12
- **Intermediate size:** 1152
- **Max position embeddings:** 8192
- **Vocabulary size:** 50368
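
A quick way to confirm these values (and the 24-label mapping) is to load the released config, for example:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("davanstrien/ModernBERT-web-topics-1m")
print(config.hidden_size)              # 768
print(config.num_hidden_layers)        # 22
print(config.num_attention_heads)      # 12
print(config.intermediate_size)        # 1152
print(config.max_position_embeddings)  # 8192
print(config.vocab_size)               # 50368
print(len(config.id2label))            # 24 topic labels
```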

### Framework Versions
- Transformers: 4.51.3
- PyTorch: 2.6.0+cu124
- Datasets: 3.5.0
- Tokenizers: 0.21.1

## Inference Information

This model is compatible with vLLM and other inference engines, which can significantly improve inference speed, especially for batch processing. When using the model, load the matching ModernBERT tokenizer and keep inputs within the model's maximum sequence length of 8192 tokens.



Example usage:
```python
# Option 1: use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="davanstrien/ModernBERT-web-topics-1m")
print(pipe("The impact of global warming on coral reef ecosystems"))

# Option 2: load the model and tokenizer directly
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model_name = "davanstrien/ModernBERT-web-topics-1m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Prepare input text
text = "The impact of global warming on coral reef ecosystems"

# Tokenize (truncating to the 8192-token context window) and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
with torch.no_grad():
    outputs = model(**inputs)

# Map the highest-scoring logit to its topic label
prediction = outputs.logits.argmax(-1).item()
predicted_label = model.config.id2label[prediction]
print(f"Predicted topic: {predicted_label}")
```

### Efficient Inference with vLLM
This model is compatible with vLLM for efficient, large-scale inference. vLLM is a high-performance inference engine that can significantly accelerate inference for ModernBERT classifiers.

#### Installation

To use vLLM with this model, install the latest version that supports ModernBERT (support was added in April 2025):
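
```bash
pip install -U vllm
```

(Any reasonably recent release should include ModernBERT support.)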

#### Basic Usage

Here's how to load and use the model with vLLM:

```python
from vllm import LLM
import torch
import torch.nn.functional as F

# Load the model with vLLM
llm = LLM(model="davanstrien/ModernBERT-web-topics-1m", task="classify")

# Single prediction
text = "This article discusses various approaches to content categorization using machine learning"
outputs = llm.classify(text)

# Process outputs
logits = torch.tensor(outputs[0].outputs.probs)
probabilities = F.softmax(logits, dim=0)
top_idx = torch.argmax(probabilities).item()
top_prob = probabilities[top_idx].item()

# Get label mapping from model config
import httpx
from huggingface_hub import hf_hub_url
from toolz import keymap

id2label = (
    httpx.get(
        hf_hub_url(
            "davanstrien/ModernBERT-web-topics-1m", 
            filename="config.json"
        )
    )
    .json()
    .get("id2label")
)
id2label = keymap(int, id2label)

# Get predicted label
predicted_label = id2label.get(top_idx)
print(f"Predicted topic: {predicted_label}")
print(f"Confidence: {top_prob:.4f}")
```

#### Batch Processing for Large Datasets

For large datasets, vLLM can process thousands of examples efficiently:

```python
import torch
import torch.nn.functional as F
from toolz import partition_all
from tqdm.auto import tqdm

# Reuses `llm` and `id2label` from the vLLM example above

# Load your dataset (could be from Hugging Face, Pandas, etc.)
# Example with a list of documents
documents = ["Document 1 content", "Document 2 content", ..., "Document N content"]

# Process in batches for very large datasets
batch_size = 10000
all_results = []

for batch in tqdm(list(partition_all(batch_size, documents))):
    all_results.extend(llm.classify(batch))

# Helper function to extract labels and confidence scores
def get_top_label(output, label_map):
    logits = torch.tensor(output.outputs.probs)
    probs = F.softmax(logits, dim=0)
    top_idx = torch.argmax(probs).item()
    top_prob = probs[top_idx].item()
    return label_map.get(top_idx), top_prob

# Process all results
predictions = [get_top_label(output, id2label) for output in all_results]
labels = [pred[0] for pred in predictions]
confidence_scores = [pred[1] for pred in predictions]
```



## Ethical Considerations and Biases

- This model may inherit biases present in the training data, potentially leading to inconsistent classification across different demographic or cultural contexts.
- Topics with less representation in the training data may show lower accuracy.
- Users should be aware that fully automated content classification without human oversight may lead to inappropriate categorizations in edge cases.

## Citation and Contact Information

If you use this model in your research or applications, please cite the original ModernBERT model, as well as the WebOrganizer dataset and paper:

```bibtex
@article{wettig2025organize,
  title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  journal={arXiv preprint arXiv:2502.10341},
  year={2025}
}
```

For questions, issues, or contributions related to this model, please reach out through the [Hugging Face model repository](https://huggingface.co/davanstrien/ModernBERT-web-topics-1m).