---
license: mit
tags:
- multilabel-classification
- multilingual
- twitter
- violence-prediction
datasets:
- m2im/multilingual-twitter-collective-violence-dataset
language:
- multilingual
---

# Model Card for m2im/smaller_labse_finetuned_twitter

This model is a fine-tuned version of smaller-LaBSE (a distilled variant of LaBSE), specifically adapted to detect collective violence signals in multilingual Twitter discourse. It was developed as part of a research project focused on early-warning systems for conflict prediction.

## Model Details

### Model Description

- **Developed by:** Dr. Milton Mendieta and Dr. Timothy Warren
- **Funded by:** Coalition for Open-Source Defense Analysis (CODA) Lab, Department of Defense Analysis, Naval Postgraduate School (NPS)
- **Shared by:** Dr. Milton Mendieta and Dr. Timothy Warren
- **Model type:** Transformer-based sentence encoder fine-tuned for multilabel classification
- **Language(s):** The smaller Language-agnostic BERT Sentence Encoder (smaller-LaBSE) is a distilled version of the original LaBSE model, initially trained on 15 languages. It was subsequently fine-tuned on multilingual social media data from X (formerly Twitter), covering 68 languages from 2014 onward, including the undefined `und` language category.
- **License:** MIT
- **Finetuned from model:** [setu4993/smaller-LaBSE](https://huggingface.co/setu4993/smaller-LaBSE)

### Model Sources

- **Repository:** [https://github.com/m2im/violence_prediction](https://github.com/m2im/violence_prediction)
- **Paper:** TBD

## Uses

### Direct Use

This model is intended to classify tweets in multiple languages into predefined categories related to proximity to collective violence events.

### Downstream Use

The model may be embedded into conflict early-warning systems, government monitoring platforms, or research pipelines analyzing social unrest.

### Out-of-Scope Use

- General-purpose sentiment analysis
- Legal, health, or financial decision-making
- Use in low-resource languages not covered by training data

## Bias, Risks, and Limitations

- **Geographic bias**: The model was primarily trained on short-duration violent events around the world, which limits its applicability to long-running conflicts (e.g., Russia-Ukraine) or high-noise environments (e.g., Washington, D.C.).
- **Temporal bias**: Performance degrades in pre-violence scenarios, especially at larger spatial scales (50 km), where signals are weaker and often masked by noise.
- **Sample size sensitivity**: The model underperforms when fewer than 5,000 observations are available per label, reducing reliability in low-data settings.
- **Spatial ambiguity**: Frequent misclassification between `pre7geo50` and `post7geo50` labels highlights the model’s challenge in distinguishing temporal contexts at broader spatial radii.
- **Language coverage limitations**: Although the model was fine-tuned on 67 languages (plus the undefined `und` category), performance may vary for underrepresented languages and informal language variants.

## Recommendations

- **Use with short-term events**: For best results, apply the model to short-term events with geographically concentrated discourse, aligning with the training data distribution.
- **Avoid low-sample inference**: Do not deploy the model in scenarios where fewer than 5,000 labeled observations are available per class.
- **Limit reliance on large-radius labels**: Exercise caution when interpreting predictions at 50 km radii, which tend to capture noisy or irrelevant information.
- **Contextual validation**: Evaluate model performance on local data before broader deployment, especially in unfamiliar regions or languages.
- **Consider post-processing**: Incorporate ensemble methods or threshold adjustments to improve label differentiation in ambiguous cases.
- **Batch predictions**: Avoid applying the model to isolated tweets; batch predictions are more reliable.
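
The threshold-adjustment and batch-prediction recommendations can be combined in a small post-processing step. The sketch below is illustrative only: the helper names, the averaging rule, and the threshold values are assumptions rather than part of the released model, and the scores are made up.

```python
from statistics import mean

LABELS = ["pre7geo10", "pre7geo30", "pre7geo50",
          "post7geo10", "post7geo30", "post7geo50"]

def aggregate_batch(batch_scores):
    """Average each label's score over a batch of tweets (assumed aggregation rule)."""
    return {label: mean(s[label] for s in batch_scores) for label in LABELS}

def apply_thresholds(scores, thresholds, default=0.5):
    """Keep labels whose score reaches their threshold; unlisted labels use `default`."""
    return [label for label in LABELS if scores[label] >= thresholds.get(label, default)]

# Made-up per-tweet scores for a batch of three tweets.
batch = [
    {"pre7geo10": 0.2, "pre7geo30": 0.3, "pre7geo50": 0.6,
     "post7geo10": 0.9, "post7geo30": 0.7, "post7geo50": 0.6},
    {"pre7geo10": 0.1, "pre7geo30": 0.2, "pre7geo50": 0.5,
     "post7geo10": 0.8, "post7geo30": 0.6, "post7geo50": 0.5},
    {"pre7geo10": 0.3, "pre7geo30": 0.4, "pre7geo50": 0.7,
     "post7geo10": 0.7, "post7geo30": 0.8, "post7geo50": 0.7},
]

# Hypothetical stricter cut-offs for the noisier 50 km labels.
thresholds = {"pre7geo50": 0.7, "post7geo50": 0.7}

batch_scores = aggregate_batch(batch)
predicted = apply_thresholds(batch_scores, thresholds)
print(predicted)
```

In practice the thresholds would be tuned per label on held-out local data rather than set by hand.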

## How to Get Started with the Model

```python
import html
import re

from transformers import pipeline


def clean_tweet(example):
    """Normalize a raw tweet before classification."""
    tweet = example["text"]
    tweet = tweet.replace("\n", " ")                # flatten line breaks
    tweet = html.unescape(tweet)                    # decode HTML entities such as &amp;
    tweet = re.sub(r"@[A-Za-z0-9_:]+", "", tweet)   # drop @mentions
    tweet = re.sub(r"http\S+", "", tweet)           # drop URLs
    tweet = re.sub(r"RT ", "", tweet)               # drop retweet markers
    return {"text": tweet.strip()}


# top_k=None returns a score for every label rather than only the top one.
pipe = pipeline(
    "text-classification",
    model="m2im/smaller_labse_finetuned_twitter",
    tokenizer="m2im/smaller_labse_finetuned_twitter",
    top_k=None,
)

example = {"text": "Protesta en Quito por medidas económicas."}
cleaned = clean_tweet(example)
print(pipe(cleaned["text"]))
```
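
The pipeline returns a list of `{"label": ..., "score": ...}` dictionaries, and for a single string that list may or may not be wrapped in an outer list, depending on how the pipeline is called and on the installed Transformers version. The helper below is a small sketch that flattens either shape into a dictionary. The function name and the mock output are hypothetical, and the label strings assume the model configuration exposes the six label names listed under Training Data.

```python
def scores_by_label(pipe_output):
    """Flatten the text-classification output for ONE input into {label: score}."""
    # Unwrap the outer list if the single input's results are nested.
    if pipe_output and isinstance(pipe_output[0], list):
        pipe_output = pipe_output[0]
    return {item["label"]: item["score"] for item in pipe_output}

# Mock output with made-up scores, standing in for pipe(cleaned["text"]).
mock_output = [[
    {"label": "post7geo10", "score": 0.81},
    {"label": "pre7geo50", "score": 0.34},
]]

scores = scores_by_label(mock_output)
print(scores["post7geo10"])
```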

## Training Details

### Training Data

- Dataset: [m2im/multilingual-twitter-collective-violence-dataset](https://huggingface.co/datasets/m2im/multilingual-twitter-collective-violence-dataset)
- Labels: the 6 most informative of the 40 available:
  - `pre7geo10`, `pre7geo30`, `pre7geo50`
  - `post7geo10`, `post7geo30`, `post7geo50`
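
The label names encode a temporal phase and a spatial radius. A minimal parser might look like the sketch below. It assumes the trailing number is the radius in kilometers (consistent with the 10 km and 50 km radii discussed in this card) and treats the middle number as a time window whose unit is not stated here. `LabelSpec` and `parse_label` are hypothetical names.

```python
import re
from typing import NamedTuple

class LabelSpec(NamedTuple):
    phase: str       # "pre" or "post" relative to the violent event
    window: int      # time window; the unit is an assumption, not stated in the card
    radius_km: int   # spatial radius around the event, assumed to be in km

_LABEL_RE = re.compile(r"^(pre|post)(\d+)geo(\d+)$")

def parse_label(name):
    """Split a label such as 'pre7geo50' into its phase, window, and radius."""
    match = _LABEL_RE.match(name)
    if match is None:
        raise ValueError(f"Unrecognized label: {name!r}")
    phase, window, radius = match.groups()
    return LabelSpec(phase, int(window), int(radius))

spec = parse_label("pre7geo50")
print(spec.phase, spec.window, spec.radius_km)
```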

### Training Procedure

- Text preprocessing using tweet normalization (removal of mentions, URLs, etc.)
- Tokenization with smaller-LaBSE tokenizer
- Multi-label head using `BCEWithLogitsLoss`
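
For intuition, `BCEWithLogitsLoss` treats each label as an independent binary decision and averages the binary cross-entropy over all labels. The pure-Python sketch below reproduces that computation for a single example using the standard numerically stable formulation. It is an illustration rather than the training code, and the logits and targets are made up.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed directly from raw logits."""
    total = 0.0
    for x, y in zip(logits, targets):
        # Stable form of -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))].
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

# One tweet, six labels: multi-hot targets and made-up logits.
targets = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
logits = [-2.0, -1.5, 0.3, 2.5, 1.8, 0.9]

loss = bce_with_logits(logits, targets)
print(round(loss, 4))
```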

#### Training Hyperparameters

- Model checkpoint: `setu4993/smaller-LaBSE`
- Head class: `AutoModelForSequenceClassification`
- Optimizer: AdamW
- Batch size (train/validation): 1024
- Epochs: 20
- Learning rate: 5e-5
- Learning rate scheduler: Cosine
- Weight decay: 0.1
- Max sequence length: 32
- Precision: Mixed fp16
- Random seed: 42
- Saving strategy: Save the best model only when the ROC-AUC score improves on the validation set
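
To make the cosine learning-rate schedule concrete, the sketch below computes the rate at a given step under the assumption of no warmup and a decay to zero. Neither detail is specified in this card, and `cosine_lr` is a hypothetical helper rather than the scheduler used in training.

```python
import math

def cosine_lr(step, total_steps, base_lr=5e-5, min_lr=0.0):
    """Cosine decay from base_lr to min_lr over total_steps (assumes no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rate at the start, midpoint, and end of a hypothetical 1,000-step run.
for step in (0, 500, 1000):
    print(step, cosine_lr(step, total_steps=1000))
```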

## Evaluation

### Testing Data, Factors & Metrics

- **Dataset**: Held-out portion of the multilingual Twitter collective violence dataset, including over 275,000 tweets labeled across six spatio-temporal categories (`pre7geo10`, `pre7geo30`, `pre7geo50`, `post7geo10`, `post7geo30`, `post7geo50`).
- **Metrics**: 
  - **ROC-AUC** (Receiver Operating Characteristic - Area Under Curve): Evaluates the model’s ability to distinguish between classes across all thresholds.
  - **Macro F1**: Harmonic mean of precision and recall, averaged equally across all classes.
  - **Micro F1**: Harmonic mean of precision and recall, aggregated globally across all predictions.
  - **Precision** and **Recall**: Standard classification metrics to assess false positive and false negative trade-offs.
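
The difference between macro and micro F1 is easiest to see on a small example. The sketch below computes both from multi-hot label matrices in plain Python. The data are made up, and the helper names are illustrative rather than taken from the evaluation code.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; equals the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f1(y_true, y_pred):
    """Return (macro F1, micro F1) for multi-hot label matrices."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    macro = sum(f1(*c) for c in counts) / n_labels      # average of per-label F1
    micro = f1(*(sum(col) for col in zip(*counts)))     # F1 of the pooled counts
    return macro, micro

# Four tweets, two labels: label 0 is always predicted correctly, label 1 is always missed.
y_true = [[1, 0], [1, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 0]]

macro, micro = macro_micro_f1(y_true, y_pred)
print(macro, micro)
```

Here macro F1 is pulled down by the label that is never detected, while micro F1 is dominated by the more frequent, well-predicted label.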

### Results

- Classical ML models (Random Forest, SVM, Bagging, Boosting, and Decision Trees) were trained on smaller-LaBSE-generated sentence embeddings. The best performing classical model—Random Forest—achieved a **macro F1 score of approximately 0.61**, indicating that embeddings alone provide meaningful but limited discrimination for the multilabel classification task.
- In contrast, the **fine-tuned smaller-LaBSE model**, trained end-to-end with a classification head, outperformed all baseline classical models by achieving a **ROC-AUC score of 0.7246** on the validation set.
- These results demonstrate the value of supervised fine-tuning over using frozen embeddings with classical classifiers, particularly in tasks involving subtle multilingual and spatio-temporal signal detection.

## Model Examination

- Embedding analysis was conducted using a two-stage dimensionality reduction process: Principal Component Analysis (PCA) reduced the 768-dimensional smaller-LaBSE sentence embeddings to 50 dimensions, followed by Uniform Manifold Approximation and Projection (UMAP) to reduce to 2 dimensions for visualization.
- The resulting 2D projections revealed coherent clustering of sentence embeddings by label, particularly in post-violence scenarios and at smaller spatial scales (10 km), indicating that the model effectively captures latent structure related to spatio-temporal patterns of collective violence.
- Examination of classification performance across labels further confirmed that the model is most reliable when predicting post-violence instances near the epicenter of an event, while its ability to detect pre-violence signals—especially at broader spatial radii (50 km)—is weaker and more prone to noise.

## Environmental Impact

- **Hardware Type:** 16 NVIDIA Tesla V100 GPUs
- **Hours used:** ~10 hours
- **Cloud Provider:** University research computing cluster
- **Compute Region:** North America
- **Carbon Emitted:** Not formally calculated

## Technical Specifications

### Model Architecture and Objective

- Transformer encoder (BERT-based)
- Objective: Multilabel binary classification with sentence embeddings

### Compute Infrastructure

- **Hardware:** One server with 16 × V100 GPUs and one server with 3 TB of RAM, both available at the CODA Lab.
- **Software:** PyTorch 2.0, Hugging Face Transformers 4.x, KV-Swarm (an in-memory database also hosted at the CODA Lab), and Weights & Biases for experiment tracking and model management.

## Citation

**BibTeX:**  

```bibtex
@misc{mendieta2025labseviolence,
  author       = {Milton Mendieta and Timothy Warren},
  title        = {Fine-Tuning Multilingual Language Models to Predict Collective Violence Using Twitter Data},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/m2im/smaller_labse_finetuned_twitter}},
  note         = {Research on multilingual NLP and conflict prediction}
}
```

**APA:**  
Mendieta, M., & Warren, T. (2025). *Fine-tuning multilingual language models to predict collective violence using Twitter data* [Model]. Hugging Face. https://huggingface.co/m2im/smaller_labse_finetuned_twitter

## Model Card Authors

Dr. Milton Mendieta and Dr. Timothy Warren

## Model Card Contact

[email protected]