|
--- |
|
license: cc-by-nc-3.0 |
|
language: |
|
- es |
|
base_model: |
|
- allenai/longformer-base-4096 |
|
pipeline_tag: text-classification |
|
library_name: transformers |
|
tags: |
|
- sgd

- gestion-documental

- document-type
|
- text-classification |
|
- longformer |
|
- spanish |
|
- document-management |
|
- smote |
|
- multi-class |
|
- fine-tuned |
|
- transformers |
|
- gpu |
|
- a100 |
|
- cc-by-nc-3.0 |
|
--- |
|
# Excribe Classifier SGD Longformer 4096 |
|
|
|
## Model Overview |
|
|
|
**Excribe/Classifier_SGD_Longformer_4099** is a fine-tuned version of the `allenai/longformer-base-4096` model for text classification in document management: it classifies Spanish-language input documents into document type categories (`tipo_documento_codigo`). Developed by **Excribe.co**, the model leverages the Longformer architecture to handle long texts (up to 4096 tokens) and is optimized for GPU environments such as the NVIDIA A100.
|
|
|
The model was trained on a Spanish dataset (`final.parquet`) containing 8,850 samples across 109 document type classes. Class imbalance was addressed by applying SMOTE (Synthetic Minority Over-sampling Technique) to the training set to improve performance on minority classes. Fine-tuning achieved a macro F1-score of **0.4855**, accuracy of **0.6096**, macro precision of **0.5212**, and macro recall of **0.5006** on a validation set of 1,770 samples.
|
|
|
### Key Features |
|
- **Task**: Multi-class text classification for document type identification. |
|
- **Language**: Spanish. |
|
- **Input**: Raw text (`texto_entrada`) from documents. |
|
- **Output**: Predicted document type code (`tipo_documento_codigo`) from 109 classes. |
|
- **Handling Long Texts**: Processes the first 4096-token chunk of input text. |
|
- **Class Imbalance**: Mitigated using SMOTE on the training set. |
|
- **Hardware Optimization**: Fine-tuned with mixed precision (fp16) and gradient accumulation for A100 GPUs. |
|
|
|
## Dataset |
|
|
|
The training dataset (`final.parquet`) consists of 8,850 Spanish text samples, each labeled with a document type code (`tipo_documento_codigo`). The dataset exhibits significant class imbalance, with class frequencies ranging from 10 to 2,363 samples per class. The dataset was split into: |
|
- **Training set**: 7,080 samples (before SMOTE, expanded to 9,903 after SMOTE). |
|
- **Validation set**: 1,770 samples (untouched by SMOTE for unbiased evaluation). |
|
|
|
SMOTE was applied to the training set to oversample minority classes (those with fewer than 30 samples) to a target of 40 samples per class, generating 2,823 synthetic samples. Single-instance classes were excluded from SMOTE to avoid resampling errors and were included in the training set as-is. |
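
The per-class oversampling targets described above can be expressed as a `sampling_strategy` dictionary for `imbalanced-learn`'s `SMOTE`. The sketch below is illustrative only: the card does not state which numeric representation of the texts SMOTE operated on, so `X_train` and `y_train` are synthetic stand-ins.

```python
# Illustrative sketch only: the card does not specify the feature representation
# SMOTE was applied to, so X_train/y_train below are synthetic placeholders.
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 8))                  # placeholder feature matrix
y_train = np.array([0] * 150 + [1] * 40 + [2] * 10)  # imbalanced placeholder labels

counts = Counter(y_train)

# Oversample classes with fewer than 30 samples up to 40; single-instance classes
# are excluded because SMOTE needs at least two samples to interpolate between.
sampling_strategy = {
    label: 40 for label, count in counts.items() if 1 < count < 30
}

smote = SMOTE(sampling_strategy=sampling_strategy, k_neighbors=1, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(Counter(y_resampled))  # the minority class is now oversampled to 40
```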
|
|
|
## Model Training |
|
|
|
### Base Model |
|
The model is based on `allenai/longformer-base-4096`, a transformer model designed for long-document processing with a sparse attention mechanism, allowing efficient handling of sequences up to 4096 tokens. |
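
As a rough sketch of how a 109-class classification head is attached to this base model (assuming the standard Hugging Face API; the exact training script is not included with this card):

```python
# Minimal sketch: initialise the base Longformer with a 109-class classification head.
from transformers import LongformerTokenizer, LongformerForSequenceClassification

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=109,  # one logit per document type class
)
```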
|
|
|
### Fine-Tuning |
|
The fine-tuning process was conducted using the Hugging Face `Trainer` API with the following configuration (sketched in code after the list):
|
- **Epochs**: 3 |
|
- **Learning Rate**: 2e-5 |
|
- **Batch Size**: Effective batch size of 16 (per_device_train_batch_size=2, gradient_accumulation_steps=8) |
|
- **Optimizer**: AdamW with weight decay (0.01) |
|
- **Warmup Steps**: 50 |
|
- **Mixed Precision**: fp16 for GPU efficiency |
|
- **Evaluation Strategy**: Per epoch, with the best model selected based on the macro F1-score |
|
- **SMOTE**: Applied to the training set to balance classes |
|
- **Hardware**: NVIDIA A100 GPU |
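
The exact training script is not distributed with this card, but the settings above map roughly onto the following `TrainingArguments` (output paths are taken from the section below; treat this as an approximate reconstruction):

```python
# Approximate reconstruction of the training configuration listed above.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    logging_dir="./logs",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size of 16
    weight_decay=0.01,
    warmup_steps=50,
    fp16=True,
    eval_strategy="epoch",           # "evaluation_strategy" in older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",      # macro F1, see the compute_metrics sketch below
)
```

A `Trainer` would then be built from the model, these arguments, the SMOTE-balanced training set, the untouched validation set, and a `compute_metrics` callback such as the one sketched below.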
|
|
|
The training process took approximately 159.09 minutes (9,545.32 seconds) and produced the following evaluation metrics on the validation set: |
|
- **Eval Loss**: 1.5475 |
|
- **Eval Accuracy**: 0.6096 |
|
- **Eval F1 (macro)**: 0.4855 |
|
- **Eval Precision (macro)**: 0.5212 |
|
- **Eval Recall (macro)**: 0.5006 |
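
These macro-averaged values are what a `compute_metrics` callback along the following lines would report (a sketch, not the exact implementation used in training):

```python
# Sketch of a compute_metrics callback producing the macro-averaged metrics above.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }
```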
|
|
|
Training logs and checkpoints are saved in `./results`, with TensorBoard logs in `./logs`. The final model and tokenizer are saved in `./fine_tuned_longformer`. |
|
|
|
## Usage |
|
|
|
### Installation |
|
To use the model, install the required dependencies: |
|
```bash
pip install transformers torch pandas scikit-learn numpy
```
|
|
|
### Inference Example |
|
Below is a Python script to load and use the fine-tuned model for inference: |
|
|
|
```python
from transformers import LongformerTokenizer, LongformerForSequenceClassification
import torch
import numpy as np

# Load the model and tokenizer
model_path = "excribe/classifier_sgd_longformer_4099"
tokenizer = LongformerTokenizer.from_pretrained(model_path)
model = LongformerForSequenceClassification.from_pretrained(model_path)

# Load label encoder classes
label_encoder_classes = np.load("label_encoder_classes.npy", allow_pickle=True)
id2label = {i: int(label) for i, label in enumerate(label_encoder_classes)}

# Example text
text = "Your Spanish document text here..."

# Tokenize input
inputs = tokenizer(
    text,
    add_special_tokens=True,
    max_length=4096,
    padding="max_length",
    truncation=True,
    return_tensors="pt"
)

# Move the model and inputs to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Perform inference
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_id = torch.argmax(logits, dim=1).item()

# Map prediction to label
predicted_label = id2label[predicted_id]
print(f"Predicted document type code: {predicted_label}")
```
|
|
|
### Notes |
|
- The model processes only the first 4096 tokens of the input text. For longer documents, consider chunking strategies (a sketch follows these notes) or alternative models.
|
- Ensure the input text is in Spanish, as the model was trained exclusively on Spanish data. |
|
- The label encoder classes (`label_encoder_classes.npy`) must be available to map predicted IDs to document type codes. |
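
For the chunking strategy mentioned above, one simple approach (illustrative only, not part of the released model) is to split the token sequence into 4096-token windows, classify each window, and average the logits:

```python
# Illustrative chunked inference for documents longer than 4096 tokens.
import torch

def classify_long_text(text, tokenizer, model, device, max_length=4096):
    # Tokenize without truncation, then slice the ids into fixed-size windows,
    # leaving room for the <s> and </s> special tokens in each window.
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    window = max_length - 2
    chunks = [token_ids[i:i + window] for i in range(0, len(token_ids), window)]

    all_logits = []
    model.eval()
    with torch.no_grad():
        for chunk in chunks:
            ids = [tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]
            inputs = {
                "input_ids": torch.tensor([ids], device=device),
                "attention_mask": torch.ones(1, len(ids), dtype=torch.long, device=device),
            }
            all_logits.append(model(**inputs).logits)

    # Average the per-chunk logits and pick the most likely class id.
    mean_logits = torch.cat(all_logits, dim=0).mean(dim=0)
    return int(torch.argmax(mean_logits).item())

# Usage, reusing tokenizer, model, device, and id2label from the inference example:
# predicted_id = classify_long_text(long_text, tokenizer, model, device)
# predicted_label = id2label[predicted_id]
```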
|
|
|
## Limitations |
|
- **First Chunk Limitation**: The model uses only the first 4096-token chunk, which may miss relevant information in longer documents. |
|
- **Class Imbalance**: While SMOTE improves minority class performance, some classes (e.g., single-instance classes) may still be underrepresented. |
|
- **Macro Metrics**: The reported F1-score (0.4855) is macro-averaged, meaning it treats all classes equally, which may mask performance disparities across imbalanced classes. |
|
- **Hardware Requirements**: Inference on CPU is possible but slower; a GPU is recommended for efficiency. |
|
|
|
## License |
|
This model is licensed under the **Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0)** license. You are free to share and adapt the model for non-commercial purposes, provided appropriate credit is given to Excribe.co. |
|
|
|
## Author |
|
- **Organization**: Excribe.co |
|
- **Contact**: Reach out via Hugging Face (https://huggingface.co/excribe) |
|
|
|
## Citation |
|
If you use this model in your work, please cite: |
|
```
@misc{excribe_classifier_sgd_longformer_4099,
  author    = {Excribe.co},
  title     = {Classifier SGD Longformer 4099: A Fine-Tuned Model for Spanish Document Type Classification},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/excribe/classifier_sgd_longformer_4099}
}
```
|
|
|
## Acknowledgments |
|
- Built upon the `allenai/longformer-base-4096` model. |
|
- Utilizes the Hugging Face `transformers` library and `Trainer` API. |
|
- Thanks to the open-source community for tools like `imbalanced-learn` and `scikit-learn`. |