---
license: cc-by-nc-3.0
language:
- es
base_model:
- allenai/longformer-base-4096
pipeline_tag: text-classification
library_name: transformers
tags:
- sgd
- documental
- gestion
- document-type
- text-classification
- longformer
- spanish
- document-management
- smote
- multi-class
- fine-tuned
- transformers
- gpu
- a100
- cc-by-nc-3.0
---
# Excribe Classifier SGD Longformer 4096

## Model Overview

**Excribe/Classifier_SGD_Longformer_4099** is a fine-tuned version of `allenai/longformer-base-4096` for text classification in document management: it classifies Spanish-language input documents into document type categories (`tipo_documento_codigo`). Developed by **Excribe.co**, the model uses the Longformer architecture to handle long texts (up to 4096 tokens) and was fine-tuned for GPU environments such as the NVIDIA A100.

The model was trained on a Spanish dataset (`final.parquet`) containing 8,850 samples across 109 document type classes. Class imbalance is addressed with SMOTE (Synthetic Minority Over-sampling Technique) applied to the training set only, improving robustness on minority classes. Fine-tuning achieved a macro F1-score of **0.4855**, accuracy of **0.6096**, macro precision of **0.5212**, and macro recall of **0.5006** on a validation set of 1,770 samples.

### Key Features
- **Task**: Multi-class text classification for document type identification.
- **Language**: Spanish.
- **Input**: Raw text (`texto_entrada`) from documents.
- **Output**: Predicted document type code (`tipo_documento_codigo`) from 109 classes.
- **Handling Long Texts**: Processes the first 4096-token chunk of input text.
- **Class Imbalance**: Mitigated using SMOTE on the training set.
- **Hardware Optimization**: Fine-tuned with mixed precision (fp16) and gradient accumulation for A100 GPUs.

## Dataset

The training dataset (`final.parquet`) consists of 8,850 Spanish text samples, each labeled with a document type code (`tipo_documento_codigo`). The dataset exhibits significant class imbalance, with class frequencies ranging from 10 to 2,363 samples per class. The dataset was split into:
- **Training set**: 7,080 samples (before SMOTE, expanded to 9,903 after SMOTE).
- **Validation set**: 1,770 samples (untouched by SMOTE for unbiased evaluation).

SMOTE was applied to the training set to oversample minority classes (those with fewer than 30 samples) to a target of 40 samples per class, generating 2,823 synthetic samples. Single-instance classes were excluded from SMOTE to avoid resampling errors and were included in the training set as-is.
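The resampling code is not published, but the strategy described above maps directly onto `imblearn.over_sampling.SMOTE` with a per-class `sampling_strategy` dictionary. The sketch below is illustrative only: it assumes the training texts have already been converted to a fixed-length numeric representation `X_train` (e.g., padded token-id vectors or TF-IDF features; the card does not state which representation was used) with integer labels `y_train`, and that `imbalanced-learn` is installed.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE

# X_train: 2-D numeric matrix representing the training texts (assumption --
# the card does not state which representation SMOTE was applied to).
# y_train: integer-encoded document type labels for the training split.
counts = Counter(y_train)

# Oversample only classes with fewer than 30 samples up to 40 samples each;
# single-instance classes are skipped because SMOTE needs at least one real neighbour.
sampling_strategy = {label: 40 for label, n in counts.items() if 1 < n < 30}

smote = SMOTE(sampling_strategy=sampling_strategy, k_neighbors=1, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
```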

## Model Training

### Base Model
The model is based on `allenai/longformer-base-4096`, a transformer model designed for long-document processing with a sparse attention mechanism, allowing efficient handling of sequences up to 4096 tokens.

### Fine-Tuning
The fine-tuning process was conducted using the Hugging Face `Trainer` API with the following configuration (a sketch of the setup follows the list):
- **Epochs**: 3
- **Learning Rate**: 2e-5
- **Batch Size**: Effective batch size of 16 (per_device_train_batch_size=2, gradient_accumulation_steps=8)
- **Optimizer**: AdamW with weight decay (0.01)
- **Warmup Steps**: 50
- **Mixed Precision**: fp16 for GPU efficiency
- **Evaluation Strategy**: Per epoch, with the best model selected based on the macro F1-score
- **SMOTE**: Applied to the training set to balance classes
- **Hardware**: NVIDIA A100 GPU
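
The exact training script is not published; the following is a minimal sketch of how the configuration above maps onto `TrainingArguments` and `Trainer`. Output paths mirror those mentioned below, while `model`, `train_dataset`, `eval_dataset`, and the seed are placeholders, and argument names may differ slightly across `transformers` versions.

```python
from transformers import Trainer, TrainingArguments

# Illustrative configuration mirroring the list above.
training_args = TrainingArguments(
    output_dir="./results",
    logging_dir="./logs",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size of 16
    weight_decay=0.01,               # AdamW weight decay
    warmup_steps=50,
    fp16=True,                       # mixed precision on the A100
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",      # macro F1 from the metrics callback below
    seed=42,
)

trainer = Trainer(
    model=model,                      # LongformerForSequenceClassification with 109 labels
    args=training_args,
    train_dataset=train_dataset,      # SMOTE-balanced training split
    eval_dataset=eval_dataset,        # 1,770-sample validation split (no SMOTE)
    compute_metrics=compute_metrics,  # sketched after the evaluation metrics below
)
trainer.train()
```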

The training process took approximately 159.09 minutes (9,545.32 seconds) and produced the following evaluation metrics on the validation set (a sketch of the metrics callback follows the list):
- **Eval Loss**: 1.5475
- **Eval Accuracy**: 0.6096
- **Eval F1 (macro)**: 0.4855
- **Eval Precision (macro)**: 0.5212
- **Eval Recall (macro)**: 0.5006
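
The macro-averaged metrics above can be reproduced with a `compute_metrics` callback along these lines (an illustrative sketch using `scikit-learn`, not the exact training code):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }
```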

Training logs and checkpoints are saved in `./results`, with TensorBoard logs in `./logs`. The final model and tokenizer are saved in `./fine_tuned_longformer`.

## Usage

### Installation
To use the model, install the required dependencies:
```bash
pip install transformers torch pandas scikit-learn numpy
```

### Inference Example
Below is a Python script to load and use the fine-tuned model for inference:

```python
from transformers import LongformerTokenizer, LongformerForSequenceClassification
import torch
import numpy as np

# Load the model and tokenizer
model_path = "excribe/classifier_sgd_longformer_4099"
tokenizer = LongformerTokenizer.from_pretrained(model_path)
model = LongformerForSequenceClassification.from_pretrained(model_path)

# Load label encoder classes
label_encoder_classes = np.load("label_encoder_classes.npy", allow_pickle=True)
id2label = {i: int(label) for i, label in enumerate(label_encoder_classes)}

# Example text
text = "Your Spanish document text here..."

# Tokenize input
inputs = tokenizer(
    text,
    add_special_tokens=True,
    max_length=4096,
    padding="max_length",
    truncation=True,
    return_tensors="pt"
)

# Move inputs to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Perform inference
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_id = torch.argmax(logits, dim=1).item()

# Map prediction to label
predicted_label = id2label[predicted_id]
print(f"Predicted document type code: {predicted_label}")
```

### Notes
- The model processes only the first 4096 tokens of the input text. For longer documents, consider a chunking strategy (a sketch follows these notes) or an alternative model.
- Ensure the input text is in Spanish, as the model was trained exclusively on Spanish data.
- The label encoder classes (`label_encoder_classes.npy`) must be available to map predicted IDs to document type codes.
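
One possible chunking strategy, not part of the released model, is to split the token sequence into 4096-token windows, classify each window, and average the logits. The helper below is a hedged sketch built on the tokenizer and model loaded in the inference example; the function name and the averaging rule are illustrative choices.

```python
import torch

def classify_long_document(text, tokenizer, model, device, max_length=4096):
    """Illustrative sketch: classify a document longer than 4096 tokens by
    splitting it into windows and averaging the per-window logits."""
    # Tokenize the full document without truncation or special tokens.
    input_ids = tokenizer(text, add_special_tokens=False)["input_ids"]

    # Leave room for the <s> and </s> tokens added to each window.
    window = max_length - 2
    chunks = [input_ids[i:i + window] for i in range(0, len(input_ids), window)]

    model.eval()
    all_logits = []
    with torch.no_grad():
        for chunk in chunks:
            ids = [tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]
            batch = {
                "input_ids": torch.tensor([ids], device=device),
                "attention_mask": torch.ones(1, len(ids), dtype=torch.long, device=device),
            }
            all_logits.append(model(**batch).logits)

    # Average the per-window logits and return the most likely class id.
    mean_logits = torch.cat(all_logits, dim=0).mean(dim=0)
    return int(torch.argmax(mean_logits).item())
```

Calling `id2label[classify_long_document(long_text, tokenizer, model, device)]` then maps the aggregated prediction back to a document type code, as in the single-chunk example above.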

## Limitations
- **First Chunk Limitation**: The model uses only the first 4096-token chunk, which may miss relevant information in longer documents.
- **Class Imbalance**: While SMOTE improves minority class performance, some classes (e.g., single-instance classes) may still be underrepresented.
- **Macro Metrics**: The reported F1-score (0.4855) is macro-averaged, meaning it treats all classes equally, which may mask performance disparities across imbalanced classes.
- **Hardware Requirements**: Inference on CPU is possible but slower; a GPU is recommended for efficiency.

## License
This model is licensed under the **Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0)** license. You are free to share and adapt the model for non-commercial purposes, provided appropriate credit is given to Excribe.co.

## Author
- **Organization**: Excribe.co
- **Contact**: Reach out via Hugging Face (https://huggingface.co/excribe)

## Citation
If you use this model in your work, please cite:
```
@misc{excribe_classifier_sgd_longformer_4099,
  author = {Excribe.co},
  title = {Classifier SGD Longformer 4099: A Fine-Tuned Model for Spanish Document Type Classification},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/excribe/classifier_sgd_longformer_4099}
}
```

## Acknowledgments
- Built upon the `allenai/longformer-base-4096` model.
- Utilizes the Hugging Face `transformers` library and `Trainer` API.
- Thanks to the open-source community for tools like `imbalanced-learn` and `scikit-learn`.