---

license: cc-by-nc-3.0
base_model: PlanTL-GOB-ES/roberta-base-bne
tags:
- ner
- named-entity-recognition
- spanish
- roberta
- huggingface
- excribe 
datasets:
- excribe/ner_sgd_dataset 
metrics:
- precision
- recall
- f1
- accuracy 
language:
- es 
pipeline_tag: token-classification

---

# Model Card for excribe/ner_sgd_roberta

## Model Details

### Model Description

This model is a fine-tuned version of PlanTL-GOB-ES/roberta-base-bne for **Named Entity Recognition (NER)** in Spanish. It is designed to identify entities such as `direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, and `municipio` in texts related to administrative or governmental correspondence. The model uses the BIO (Beginning, Inside, Outside) tagging scheme and was trained on a custom dataset derived from a Parquet file (`final.parquet`).

- **Developed by:** Exscribe
- **Model type:** Token Classification (NER)
- **Language(s):** Spanish (es)
- **License:** CC-BY-NC-3.0
- **Base Model:** PlanTL-GOB-ES/roberta-base-bne
- **Finetuned Model Repository:** excribe/ner_sgd_roberta

### Model Architecture

The model is based on the RoBERTa architecture, specifically the `PlanTL-GOB-ES/roberta-base-bne` checkpoint, which is pre-trained on a large corpus of Spanish texts. It has been fine-tuned for token classification with a custom classification head tailored to the defined entity labels.

- **Number of Labels:** 17 (including `O` and BIO tags for 8 entity types: `DIRECCION`, `TELEFONO`, `MAIL`, `NOMBRE`, `DOCUMENTO`, `REFERENCIA`, `DEPARTAMENTO`, `MUNICIPIO`)
- **Label Schema:** BIO (e.g., `B-DIRECCION`, `I-DIRECCION`, `O`)
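
The count of 17 follows from one `O` tag plus a `B-` and an `I-` tag for each of the 8 entity types. A minimal sketch of how such a label set could be built is shown below; the ordering of the ids is an assumption, and the `config.json` shipped with the model remains the authoritative mapping.

```python
# Entity types as listed in this card. The id ordering below is an assumption;
# the released config.json is the authoritative label mapping.
ENTITY_TYPES = [
    "DIRECCION", "TELEFONO", "MAIL", "NOMBRE",
    "DOCUMENTO", "REFERENCIA", "DEPARTAMENTO", "MUNICIPIO",
]

# "O" plus a B- and an I- tag per entity type.
labels = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]

label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

print(len(labels))  # 17
print(labels[:3])   # ['O', 'B-DIRECCION', 'I-DIRECCION']
```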

## Training Details

### Training Data

The model was trained on a custom dataset derived from a Parquet file (`final.parquet`) containing administrative texts. The dataset includes:

- **Number of Rows:** 27,807
- **Number of Columns:** 32
- **Key Columns Used for NER:**
  - `texto_entrada` (input text)
  - Entity columns: `direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, `municipio`
- **Null Values per Entity Column:**
  - `direccion`: 82
  - `telefono`: 10,073
  - `mail`: 1,086
  - `nombre`: 0
  - `documento`: 6,407
  - `referencia`: 200
  - `departamento`: 0
  - `municipio`: 0
- **Dataset Description:** The dataset contains administrative correspondence data with fields like case IDs (`radicado`), dates (`fecha_radicacion`), document paths, and text inputs (`texto_entrada`). The entity columns were used to generate BIO tags for NER training.
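
The card does not include the code that turns entity columns into BIO tags, so the following is only a sketch of one plausible approach. It assumes whitespace tokenization, tags the first exact occurrence of each entity value in the text, and assumes the entity values do not overlap. The function name `bio_tags` is hypothetical.

```python
def bio_tags(text, entities):
    """Return (words, tags), tagging the first exact occurrence of each entity.

    `entities` maps an entity column name to its string value, or None when
    the column is null. Assumes entity values do not overlap in the text.
    """
    # Character spans (start, end, TYPE) for every entity found in the text.
    spans = []
    for etype, value in entities.items():
        if not value:
            continue  # null or empty column: nothing to tag
        start = text.find(value)
        if start != -1:
            spans.append((start, start + len(value), etype.upper()))

    words, tags = [], []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        pos = end
        tag = "O"
        for s, e, etype in spans:
            if start < e and end > s:  # word overlaps the entity span
                # The word that contains the span start gets B-, later words I-.
                tag = ("B-" if start <= s else "I-") + etype
                break
        words.append(word)
        tags.append(tag)
    return words, tags


words, tags = bio_tags(
    "Contactar a Juan Pérez en Calle Falsa 123, Bogotá.",
    {"nombre": "Juan Pérez", "direccion": "Calle Falsa 123, Bogotá", "telefono": None},
)
for word, tag in zip(words, tags):
    print(f"{word}\t{tag}")
```

Rows where an entity column is null simply contribute no tagged span for that type, which is how the null counts listed above translate into fewer training examples.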

The dataset was preprocessed to convert raw text and entity annotations into BIO format, tokenized using the `PlanTL-GOB-ES/roberta-base-bne` tokenizer, and split into training (81%), validation (9%), and test (10%) sets.
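
The 81% / 9% / 10% proportions are what a two-stage split produces: first hold out 10% for testing, then hold out 10% of the remaining 90% for validation. The card does not state that this is how the split was made, so the sketch below is an assumption, as is reusing the training seed (42) for shuffling. It uses only the standard library.

```python
import random


def split_indices(n_rows, seed=42):
    """Shuffle row indices and split them roughly 81% / 9% / 10%."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # Stage 1: hold out 10% of all rows for testing.
    n_test = round(n_rows * 0.10)
    test, rest = indices[:n_test], indices[n_test:]

    # Stage 2: hold out 10% of the remainder (9% overall) for validation.
    n_val = round(len(rest) * 0.10)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


train, val, test = split_indices(27_807)
print(len(train), len(val), len(test))  # 22523 2503 2781
```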

### Training Procedure

The model was fine-tuned using the Hugging Face `transformers` library with the following configuration:

- **Training Arguments:**
  - Epochs: 3
  - Learning Rate: 2e-5
  - Batch Size: 8 (per device)
  - Weight Decay: 0.01
  - Evaluation Strategy: Per epoch
  - Save Strategy: Per epoch
  - Load Best Model at End: True (based on F1 score)
  - Optimizer: AdamW
  - Precision: Mixed precision (FP16) on GPU
  - Seed: 42
- **Hardware:** GPU (CUDA-enabled, if available) or CPU
- **Libraries Used:**
  - `transformers`
  - `datasets`
  - `evaluate`
  - `seqeval`
  - `pandas`
  - `pyarrow`
  - `torch`
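
As a rough illustration, the configuration above could be expressed as keyword arguments for `transformers.TrainingArguments`. The values are taken from this card; `output_dir`, `metric_for_best_model`, and the evaluation batch size are assumptions that the card does not state. AdamW is the `Trainer` default, so it needs no explicit argument.

```python
# Keyword arguments mirroring the configuration listed in this card.
# Entries marked "assumed" are not stated in the card.
training_kwargs = {
    "output_dir": "ner_sgd_roberta",      # assumed, hypothetical path
    "num_train_epochs": 3,
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,      # assumed equal to the training batch size
    "weight_decay": 0.01,
    "eval_strategy": "epoch",             # named `evaluation_strategy` in older releases
    "save_strategy": "epoch",
    "load_best_model_at_end": True,
    "metric_for_best_model": "f1",        # assumed key name for the F1 score
    "fp16": True,                         # only valid when training on a CUDA GPU
    "seed": 42,
}

# With `transformers` installed, these would be passed as:
#   from transformers import TrainingArguments
#   args = TrainingArguments(**training_kwargs)
for name, value in training_kwargs.items():
    print(f"{name} = {value!r}")
```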

The training process included:

1. Loading and preprocessing the Parquet dataset.
2. Converting text and entity annotations to BIO format.
3. Tokenizing and aligning labels with sub-tokens.
4. Fine-tuning the model with a custom classification head.
5. Evaluating on the validation set after each epoch.
6. Saving the best model based on the F1 score.
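
Step 3 deserves a closer look, because the tokenizer splits words into sub-tokens while the BIO tags are defined per word. A common approach, sketched below, keeps the label on the first sub-token of each word and masks the rest with `-100`, the index that PyTorch's cross-entropy loss ignores by default. Whether this model masked or repeated labels on later sub-tokens is not stated in the card; the helper name `align_labels` and its `label_all_subtokens` switch are hypothetical.

```python
def align_labels(word_ids, word_label_ids, label_all_subtokens=False):
    """Expand word-level label ids to sub-token level.

    `word_ids` is the per-token list that a fast tokenizer returns from
    `BatchEncoding.word_ids()`: None for special tokens, otherwise the index
    of the word the token came from.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)  # special tokens are ignored by the loss
        elif word_id != previous:
            aligned.append(word_label_ids[word_id])  # first sub-token of a word
        else:
            # Later sub-tokens: repeat the word's label or mask them out.
            aligned.append(word_label_ids[word_id] if label_all_subtokens else -100)
        previous = word_id
    return aligned


# Three words tokenized as: <s> w0 w1 w1 w2 w2 w2 </s>
word_ids = [None, 0, 1, 1, 2, 2, 2, None]
print(align_labels(word_ids, [0, 7, 8]))
# [-100, 0, 7, -100, 8, -100, -100, -100]
```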

### Training Metrics

The model was evaluated on the test set after training, achieving the following metrics:

- **Precision:** 0.8948
- **Recall:** 0.9052
- **F1-Score:** 0.9000
- **Accuracy:** 0.9857
- **Evaluation Loss:** 0.0455
- **Runtime:** 12.16 seconds
- **Samples per Second:** 228.612
- **Steps per Second:** 28.607
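
The throughput figures can be cross-checked against the dataset size with simple arithmetic. The sketch below assumes an evaluation batch size of 8 and that no rows were dropped during preprocessing; neither is stated in the card.

```python
import math

runtime_s = 12.16
samples_per_s = 228.612
steps_per_s = 28.607

approx_samples = runtime_s * samples_per_s  # examples evaluated, from throughput
approx_steps = runtime_s * steps_per_s      # evaluation batches, from throughput

expected_test_rows = 0.10 * 27_807                  # 10% split of the full dataset
expected_steps = math.ceil(expected_test_rows / 8)  # assumes eval batch size 8

print(round(approx_samples), round(approx_steps))   # 2780 348
print(round(expected_test_rows), expected_steps)    # 2781 348
```

The two estimates agree to within rounding of the reported runtime, which is consistent with the metrics having been computed on the 10% test split.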

## Evaluation

### Evaluation Metrics

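The model was evaluated with the `seqeval` metric in strict IOB2 mode. Precision, recall, and F1 are computed at the entity level (a prediction counts as correct only if both its span and its type match the reference exactly), while accuracy is computed per token:

- **Precision:** Proportion of predicted entities that exactly match a reference entity.
- **Recall:** Proportion of reference entities that were predicted exactly.
- **F1-Score:** Harmonic mean of precision and recall.
- **Accuracy:** Proportion of correctly classified tokens (including non-entity tokens).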

**Test Set Performance:**

- Precision: 0.8948
- Recall: 0.9052
- F1-Score: 0.9000
- Accuracy: 0.9857
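
To make the scoring concrete: `seqeval` scores whole entity spans rather than individual tokens, so a prediction that gets only part of an entity right earns no credit. The following is a simplified pure-Python re-implementation of that idea for illustration, not `seqeval` itself; the function names are hypothetical.

```python
def extract_spans(tags):
    """Return the set of (type, start, end) entity spans in a BIO sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):  # only B- may open a span (IOB2)
            start, etype = i, tag[2:]
    return spans


def prf(gold, pred):
    """Entity-level precision, recall and F1 over parallel lists of tag sequences."""
    tp = n_pred = n_gold = 0
    for g, p in zip(gold, pred):
        gs, ps = extract_spans(g), extract_spans(p)
        tp += len(gs & ps)
        n_pred += len(ps)
        n_gold += len(gs)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gold = [["B-NOMBRE", "I-NOMBRE", "O", "B-MUNICIPIO"]]
pred = [["B-NOMBRE", "O", "O", "B-MUNICIPIO"]]
print(prf(gold, pred))  # (0.5, 0.5, 0.5): the truncated NOMBRE span earns no credit

# The reported F1 is the harmonic mean of the reported precision and recall.
p, r = 0.8948, 0.9052
print(round(2 * p * r / (p + r), 4))  # 0.9
```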

### Example Inference

Below are example outputs from the model using the `pipeline` for NER:

**Input Text 1:**\
"Se informa que el asunto principal es la Factura #REF123. Contactar a Juan Pérez en la dirección Calle Falsa 123, Bogotá. Teléfono 555-9876 o al mail [email protected]. El documento asociado es el ID-98765."

**Output:**

- Entidad: "Calle Falsa 123, Bogotá" → Tipo: DIRECCION (Confianza: \~0.99)
- Entidad: "555-9876" → Tipo: TELEFONO (Confianza: \~0.98)
- Entidad: "[email protected]" → Tipo: MAIL (Confianza: \~0.99)
- Entidad: "Juan Pérez" → Tipo: NOMBRE (Confianza: \~0.99)
- Entidad: "ID-98765" → Tipo: DOCUMENTO (Confianza: \~0.97)
- Entidad: "#REF123" → Tipo: REFERENCIA (Confianza: \~0.98)

**Input Text 2:**\
"Referencia: EXP-002. Municipio de Chía, departamento Cundinamarca. Necesitamos hablar sobre el pago pendiente. Email de contacto: [email protected]. Tel: 3001234567"

**Output:**

- Entidad: "EXP-002" → Tipo: REFERENCIA (Confianza: \~0.98)
- Entidad: "Chía" → Tipo: MUNICIPIO (Confianza: \~0.99)
- Entidad: "Cundinamarca" → Tipo: DEPARTAMENTO (Confianza: \~0.99)
- Entidad: "[email protected]" → Tipo: MAIL (Confianza: \~0.99)
- Entidad: "3001234567" → Tipo: TELEFONO (Confianza: \~0.98)

## Usage

### Using the Model with Hugging Face Transformers

To use the model for inference, you can load it with the `transformers` library and create a `pipeline` for NER:

```python
import torch
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Load the model and tokenizer
model = AutoModelForTokenClassification.from_pretrained("excribe/ner_sgd_roberta")
tokenizer = AutoTokenizer.from_pretrained("excribe/ner_sgd_roberta")

# Create NER pipeline
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
    device=0 if torch.cuda.is_available() else -1
)

# Example text
text = "Contactar a Juan Pérez en Calle Falsa 123, Bogotá. Teléfono 555-9876."

# Perform inference
entities = ner_pipeline(text)
for entity in entities:
    print(f"Entidad: {entity['word']} → Tipo: {entity['entity_group']} (Confianza: {entity['score']:.4f})")
```
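
With `aggregation_strategy="simple"`, the pipeline returns a list of dictionaries that includes the keys `entity_group`, `word`, and `score`. A small post-processing step can group the results by entity type and drop low-confidence hits. The sketch below uses a hand-written stand-in for the pipeline output, so it runs without downloading the model; the 0.90 threshold and the scores are illustrative assumptions, not values from the model.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.90):
    """Group aggregated pipeline output by entity type, dropping low-confidence hits."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            # RoBERTa-style tokenizers may leave a leading space on `word`.
            grouped[ent["entity_group"]].append(ent["word"].strip())
    return dict(grouped)


# Stand-in for `ner_pipeline(text)`; scores are made up for illustration.
sample = [
    {"entity_group": "NOMBRE", "word": " Juan Pérez", "score": 0.99},
    {"entity_group": "DIRECCION", "word": " Calle Falsa 123, Bogotá", "score": 0.98},
    {"entity_group": "TELEFONO", "word": " 555-9876", "score": 0.62},
]

print(group_entities(sample))
# {'NOMBRE': ['Juan Pérez'], 'DIRECCION': ['Calle Falsa 123, Bogotá']}
```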

### Installation Requirements

To run the model, install the required libraries:

```bash
pip install "transformers[torch]" datasets evaluate seqeval accelerate pandas pyarrow
```

### Hardware Requirements

- **Inference:** Can run on CPU or GPU. GPU (e.g., NVIDIA with CUDA) is recommended for faster processing.
- **Training:** GPU with at least 8GB VRAM is recommended for fine-tuning. The model was trained with mixed precision (FP16) to optimize memory usage.

## Limitations

- **Dataset Bias:** The model was trained on administrative texts, so it may not generalize well to other domains (e.g., social media, literature).
- **Entity Overlap:** The preprocessing handles overlapping entities by prioritizing earlier matches, which may lead to missed entities in complex cases.
- **Null Values:** High null rates in some entity columns (e.g., `telefono` is null in 10,073 of 27,807 rows, about 36%) leave fewer training examples for those entities, which may reduce performance on them.
- **Language:** The model is optimized for Spanish and may not perform well on other languages.
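
The entity-overlap limitation can be illustrated with a short sketch. It keeps spans in order of their start position and discards any span that overlaps one already kept. The card only says that earlier matches are prioritized; breaking ties in favor of the longer span is an assumption, and the function name is hypothetical.

```python
def resolve_overlaps(spans):
    """Keep non-overlapping (start, end, type) spans, preferring earlier starts."""
    kept = []
    # Sort by start offset; on ties, prefer the longer span (assumed tie-break).
    for span in sorted(spans, key=lambda s: (s[0], -(s[1] - s[0]))):
        if not kept or span[0] >= kept[-1][1]:
            kept.append(span)
    return kept


# Character offsets in "Contactar a Juan Pérez en Calle Falsa 123, Bogotá."
spans = [
    (26, 49, "DIRECCION"),  # "Calle Falsa 123, Bogotá"
    (43, 49, "MUNICIPIO"),  # "Bogotá", nested inside the address
    (12, 22, "NOMBRE"),     # "Juan Pérez"
]
print(resolve_overlaps(spans))
# [(12, 22, 'NOMBRE'), (26, 49, 'DIRECCION')]
```

Here the municipality nested inside the address is dropped, which is the kind of missed entity this limitation describes.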

## Citation

If you use this model, please cite:

```bibtex
@misc{excribe_ner_sgd_roberta,
  author = {Exscribe},
  title = {NER Model for Spanish Administrative Texts},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/excribe/ner_sgd_roberta}}
}
```

## Contact

For questions or issues, please contact the maintainers via the Hugging Face repository or open an issue.