---
license: mit
tags:
- multilabel-classification
- multilingual
- twitter
- violence-prediction
datasets:
- m2im/multilingual-twitter-collective-violence-dataset
language:
- multilingual
---
# Model Card for m2im/smaller_labse_finetuned_twitter
This model is a fine-tuned version of smaller-LaBSE (a distilled variant of LaBSE), specifically adapted to detect collective violence signals in multilingual Twitter discourse. It was developed as part of a research project focused on early-warning systems for conflict prediction.
## Model Details
### Model Description
- **Developed by:** Dr. Milton Mendieta and Dr. Timothy Warren
- **Funded by:** Coalition for Open-Source Defense Analysis (CODA) Lab, Department of Defense Analysis, Naval Postgraduate School (NPS)
- **Shared by:** Dr. Milton Mendieta and Dr. Timothy Warren
- **Model type:** Transformer-based sentence encoder fine-tuned for multilabel classification
- **Language(s):** The smaller Language-agnostic BERT Sentence Encoder (smaller-LaBSE) is a distilled version of the original LaBSE model, trained on 15 languages. This checkpoint was subsequently fine-tuned on multilingual social media data from X (formerly Twitter) posted from 2014 onward, covering 68 language tags (67 languages plus the undefined `und` code).
- **License:** MIT
- **Finetuned from model:** [setu4993/smaller-LaBSE](https://huggingface.co/setu4993/smaller-LaBSE)
### Model Sources
- **Repository:** [https://github.com/m2im/violence_prediction](https://github.com/m2im/violence_prediction)
- **Paper:** TBD
## Uses
### Direct Use
This model is intended to classify tweets in multiple languages into predefined categories related to proximity to collective violence events.
### Downstream Use
The model may be embedded into conflict early-warning systems, government monitoring platforms, or research pipelines analyzing social unrest.
### Out-of-Scope Use
- General-purpose sentiment analysis
- Legal, health, or financial decision-making
- Use in low-resource languages not covered by training data
## Bias, Risks, and Limitations
- **Event-type and geographic bias**: The model was trained primarily on short-duration violent events around the world, which limits its applicability to long-running conflicts (e.g., Russia-Ukraine) or high-noise environments (e.g., Washington, D.C.).
- **Temporal bias**: Performance degrades in pre-violence scenarios, especially at larger spatial scales (50 km), where signals are weaker and often masked by noise.
- **Sample size sensitivity**: The model underperforms when fewer than 5,000 observations are available per label, reducing reliability in low-data settings.
- **Spatial ambiguity**: Frequent misclassification between `pre7geo50` and `post7geo50` labels highlights the model’s challenge in distinguishing temporal contexts at broader spatial radii.
- **Language coverage limitations**: While fine-tuned on 67 languages, performance may vary for underrepresented or informal language variants.
## Recommendations
- **Use with short-term events**: For best results, apply the model to short-term events with geographically concentrated discourse, aligning with the training data distribution.
- **Avoid low-sample inference**: Do not deploy the model in scenarios where fewer than 5,000 labeled observations are available per class.
- **Limit reliance on large-radius labels**: Exercise caution when interpreting predictions at 50 km radii, which tend to capture noisy or irrelevant information.
- **Contextual validation**: Evaluate model performance on local data before broader deployment, especially in unfamiliar regions or languages.
- **Consider post-processing**: Incorporate ensemble methods or per-label threshold adjustments to improve label differentiation in ambiguous cases (a minimal thresholding sketch follows the quick-start example below).
- **Prefer batch predictions**: Avoid relying on single isolated tweets; predictions aggregated over batches of tweets are more reliable.
## How to Get Started with the Model
```python
from transformers import pipeline
import html
import re

def clean_tweet(example):
    """Normalize a tweet: drop newlines, HTML entities, mentions, URLs, and RT markers."""
    tweet = example["text"]
    tweet = tweet.replace("\n", " ")
    tweet = html.unescape(tweet)
    tweet = re.sub(r"@[A-Za-z0-9_:]+", "", tweet)  # remove @mentions
    tweet = re.sub(r"http\S+", "", tweet)          # remove URLs
    tweet = re.sub(r"RT ", "", tweet)              # remove retweet prefix
    return {"text": tweet.strip()}

# top_k=None returns scores for all six labels (multilabel output)
pipe = pipeline(
    "text-classification",
    model="m2im/smaller_labse_finetuned_twitter",
    tokenizer="m2im/smaller_labse_finetuned_twitter",
    top_k=None,
)

example = {"text": "Protesta en Quito por medidas económicas."}  # "Protest in Quito over economic measures."
cleaned = clean_tweet(example)
print(pipe(cleaned["text"]))
```
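As noted in the recommendations above, per-label decision thresholds can help differentiate ambiguous labels. The following is a minimal sketch, assuming the pipeline returns one list of `{label, score}` dictionaries per input; the threshold values are illustrative placeholders, not tuned values from the original research.

```python
# Hypothetical per-label thresholds; tune on validation data before use.
THRESHOLDS = {
    "pre7geo10": 0.50, "pre7geo30": 0.50, "pre7geo50": 0.60,
    "post7geo10": 0.50, "post7geo30": 0.50, "post7geo50": 0.60,
}

def predict_labels(texts):
    """Batch-predict and keep only labels whose score clears its threshold."""
    results = pipe(texts)  # one list of {"label", "score"} dicts per text
    return [
        [s["label"] for s in scores if s["score"] >= THRESHOLDS.get(s["label"], 0.5)]
        for scores in results
    ]

print(predict_labels([cleaned["text"]]))
```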
## Training Details
### Training Data
- Dataset: [m2im/multilingual-twitter-collective-violence-dataset](https://huggingface.co/datasets/m2im/multilingual-twitter-collective-violence-dataset)
- Labels: the 6 most informative of the 40 available:
- `pre7geo10`, `pre7geo30`, `pre7geo50`
- `post7geo10`, `post7geo30`, `post7geo50`
### Training Procedure
- Text preprocessing using tweet normalization (removal of mentions, URLs, etc.)
- Tokenization with smaller-LaBSE tokenizer
- Multi-label classification head trained with `BCEWithLogitsLoss` (see the sketch below)
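A minimal sketch of how such a head can be attached, using the standard Hugging Face API (illustrative, not necessarily the authors' exact training script). Setting `problem_type="multi_label_classification"` makes Transformers use `BCEWithLogitsLoss` internally:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

labels = ["pre7geo10", "pre7geo30", "pre7geo50",
          "post7geo10", "post7geo30", "post7geo50"]

# problem_type="multi_label_classification" selects BCEWithLogitsLoss
model = AutoModelForSequenceClassification.from_pretrained(
    "setu4993/smaller-LaBSE",
    num_labels=len(labels),
    problem_type="multi_label_classification",
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
tokenizer = AutoTokenizer.from_pretrained("setu4993/smaller-LaBSE")

# Tokenize with the max sequence length used in training (32)
batch = tokenizer(["example tweet"], truncation=True, max_length=32,
                  padding="max_length", return_tensors="pt")
```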
#### Training Hyperparameters
- Model checkpoints: `setu4993/smaller-LaBSE`
- Head class: `AutoModelForSequenceClassification`
- Optimizer: AdamW
- Batch size (train/validation): 1024
- Epochs: 20
- Learning rate: 5e-5
- Learning rate scheduler: Cosine
- Weight decay: 0.1
- Max sequence length: 32
- Precision: Mixed fp16
- Random seed: 42
- Saving strategy: Save the best checkpoint only when the ROC-AUC score improves on the validation set (see the `TrainingArguments` sketch below)
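The hyperparameters above map onto Hugging Face `TrainingArguments` roughly as follows. This is a hedged sketch: it assumes the effective batch size of 1024 was split across the 16 V100s (64 per device) and that the validation metric was registered under the name `roc_auc`; neither detail is stated in the original training setup.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="smaller_labse_finetuned_twitter",
    per_device_train_batch_size=64,   # assumed: 64 x 16 GPUs = 1024 effective
    per_device_eval_batch_size=64,
    num_train_epochs=20,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    weight_decay=0.1,
    fp16=True,                        # mixed-precision training
    seed=42,
    evaluation_strategy="epoch",      # renamed eval_strategy in Transformers >= 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,      # keep only the best checkpoint
    metric_for_best_model="roc_auc",  # assumed metric name
    greater_is_better=True,
)
```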
## Evaluation
### Testing Data, Factors & Metrics
- **Dataset**: Held-out portion of the multilingual Twitter collective violence dataset, including over 275,000 tweets labeled across six spatio-temporal categories (`pre7geo10`, `pre7geo30`, `pre7geo50`, `post7geo10`, `post7geo30`, `post7geo50`).
- **Metrics**:
  - **ROC-AUC** (area under the receiver operating characteristic curve): Evaluates the model’s ability to distinguish between classes across all thresholds.
  - **Macro F1**: Per-class F1 scores (the harmonic mean of precision and recall), averaged equally across all classes.
  - **Micro F1**: F1 computed globally by aggregating predictions across all classes.
  - **Precision** and **Recall**: Standard classification metrics to assess false positive and false negative trade-offs (a computation sketch for these multilabel metrics follows this list).
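One common way to compute these metrics for multilabel outputs is with scikit-learn (an assumption here, since the card does not name the evaluation library). `y_true` and `y_prob` are assumed to be `(n_samples, 6)` arrays of ground-truth labels and sigmoid scores:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

def compute_metrics(y_true, y_prob, threshold=0.5):
    """Multilabel metrics from 0/1 ground truth and predicted probabilities."""
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "roc_auc": roc_auc_score(y_true, y_prob, average="macro"),
        "macro_f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "micro_f1": f1_score(y_true, y_pred, average="micro", zero_division=0),
        "precision": precision_score(y_true, y_pred, average="micro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="micro", zero_division=0),
    }
```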
### Results
- Classical ML models (Random Forest, SVM, Bagging, Boosting, and Decision Trees) were trained on smaller-LaBSE-generated sentence embeddings. The best-performing classical model, Random Forest, achieved a **macro F1 score of approximately 0.61**, indicating that frozen embeddings alone provide meaningful but limited discrimination for the multilabel classification task.
- In contrast, the **fine-tuned smaller-LaBSE model**, trained end-to-end with a classification head, outperformed all baseline classical models by achieving a **ROC-AUC score of 0.7246** on the validation set.
- These results demonstrate the value of supervised fine-tuning over using frozen embeddings with classical classifiers, particularly in tasks involving subtle multilingual and spatio-temporal signal detection.
## Model Examination
- Embedding analysis was conducted using a two-stage dimensionality reduction process: Principal Component Analysis (PCA) reduced the 768-dimensional smaller-LaBSE sentence embeddings to 50 dimensions, followed by Uniform Manifold Approximation and Projection (UMAP) down to 2 dimensions for visualization (a code sketch follows this list).
- The resulting 2D projections revealed coherent clustering of sentence embeddings by label, particularly in post-violence scenarios and at smaller spatial scales (10 km), indicating that the model effectively captures latent structure related to spatio-temporal patterns of collective violence.
- Examination of classification performance across labels further confirmed that the model is most reliable when predicting post-violence instances near the epicenter of an event, while its ability to detect pre-violence signals—especially at broader spatial radii (50 km)—is weaker and more prone to noise.
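A sketch of the two-stage reduction described above, assuming the `umap-learn` package and random placeholder embeddings in place of the actual smaller-LaBSE outputs:

```python
import numpy as np
from sklearn.decomposition import PCA
import umap  # from the umap-learn package

# Placeholder for a (n_tweets, 768) array of smaller-LaBSE sentence embeddings
embeddings = np.random.default_rng(42).random((1000, 768), dtype=np.float32)

# Stage 1: PCA from 768 to 50 dimensions
reduced_50 = PCA(n_components=50, random_state=42).fit_transform(embeddings)

# Stage 2: UMAP from 50 to 2 dimensions for visualization
proj_2d = umap.UMAP(n_components=2, random_state=42).fit_transform(reduced_50)
print(proj_2d.shape)  # (1000, 2)
```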
## Environmental Impact
- **Hardware Type:** 16 NVIDIA Tesla V100 GPUs
- **Hours used:** ~10
- **Cloud Provider:** University research computing cluster
- **Compute Region:** North America
- **Carbon Emitted:** Not formally calculated
## Technical Specifications
### Model Architecture and Objective
- Transformer encoder (BERT-based)
- Objective: Multilabel binary classification with sentence embeddings
### Compute Infrastructure
- **Hardware:** One server with 16 × V100 GPUs and one server with 3 TB of RAM, both available at the CODA Lab.
- **Software:** PyTorch 2.0, Hugging Face Transformers 4.x, KV-Swarm (an in-memory database also hosted at the CODA Lab), and Weights & Biases for experiment tracking and model management
## Citation
**BibTeX:**
```bibtex
@misc{mendieta2025labseviolence,
  author       = {Mendieta, Milton and Warren, Timothy},
  title        = {Fine-Tuning Multilingual Language Models to Predict Collective Violence Using Twitter Data},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/m2im/smaller_labse_finetuned_twitter}},
  note         = {Research on multilingual NLP and conflict prediction}
}
```
**APA:**
Mendieta, M., & Warren, T. (2025). *Fine-tuning multilingual language models to predict collective violence using Twitter data* [Model]. Hugging Face. https://huggingface.co/m2im/smaller_labse_finetuned_twitter
## Model Card Authors
Dr. Milton Mendieta and Dr. Timothy Warren
## Model Card Contact
[email protected]