---
license: cc-by-nc-3.0
base_model: PlanTL-GOB-ES/roberta-base-bne
tags:
- ner
- named-entity-recognition
- spanish
- roberta
- huggingface
- excribe
datasets:
- excribe/ner_sgd_dataset
metrics:
- precision
- recall
- f1
- accuracy
language:
- es
pipeline_tag: token-classification
---

# Model Card for excribe/ner_sgd_roberta

## Model Details

### Model Description

This model is a fine-tuned version of PlanTL-GOB-ES/roberta-base-bne for **Named Entity Recognition (NER)** in Spanish. It is designed to identify entities such as `direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, and `municipio` in texts related to administrative or governmental correspondence. The model uses the BIO (Beginning, Inside, Outside) tagging scheme and was trained on a custom dataset derived from a Parquet file (`final.parquet`).

- **Developed by:** Exscribe
- **Model type:** Token Classification (NER)
- **Language(s):** Spanish (es)
- **License:** CC-BY-NC-3.0
- **Base Model:** PlanTL-GOB-ES/roberta-base-bne
- **Fine-tuned Model Repository:** excribe/ner_sgd_roberta

### Model Architecture

The model is based on the RoBERTa architecture, specifically the `PlanTL-GOB-ES/roberta-base-bne` checkpoint, which is pre-trained on a large corpus of Spanish texts. It has been fine-tuned for token classification with a custom classification head tailored to the defined entity labels.

- **Number of Labels:** 17 (including `O` and BIO tags for 8 entity types: `DIRECCION`, `TELEFONO`, `MAIL`, `NOMBRE`, `DOCUMENTO`, `REFERENCIA`, `DEPARTAMENTO`, `MUNICIPIO`)
- **Label Schema:** BIO (e.g., `B-DIRECCION`, `I-DIRECCION`, `O`)
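
The label inventory is small enough to enumerate. The sketch below reconstructs it in Python; the integer ordering is an assumption, and the authoritative mapping is the `id2label` entry in the model's `config.json`:

```python
# Build the BIO label set for the 8 entity types. The resulting integer
# order is an assumption -- check id2label in the model's config.json
# for the authoritative mapping.
ENTITY_TYPES = [
    "DIRECCION", "TELEFONO", "MAIL", "NOMBRE",
    "DOCUMENTO", "REFERENCIA", "DEPARTAMENTO", "MUNICIPIO",
]
labels = ["O"] + [f"{prefix}-{ent}" for ent in ENTITY_TYPES for prefix in ("B", "I")]
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in enumerate(labels)}
print(len(labels))  # 17
```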

## Training Details

### Training Data

The model was trained on a custom dataset derived from a Parquet file (`final.parquet`) containing administrative texts. The dataset includes:

- **Number of Rows:** 27,807
- **Number of Columns:** 32
- **Key Columns Used for NER:**
  - `texto_entrada` (input text)
  - Entity columns: `direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, `municipio`
- **Null Values per Entity Column:**
  - `direccion`: 82
  - `telefono`: 10,073
  - `mail`: 1,086
  - `nombre`: 0
  - `documento`: 6,407
  - `referencia`: 200
  - `departamento`: 0
  - `municipio`: 0
- **Dataset Description:** The dataset contains administrative correspondence data with fields like case IDs (`radicado`), dates (`fecha_radicacion`), document paths, and text inputs (`texto_entrada`). The entity columns were used to generate BIO tags for NER training.

The dataset was preprocessed to convert raw text and entity annotations into BIO format, tokenized using the `PlanTL-GOB-ES/roberta-base-bne` tokenizer, and split into training (81%), validation (9%), and test (10%) sets.
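
The step from column-level annotations to BIO tags can be sketched as follows. The whitespace tokenization, the `tag_bio` helper, and the first-match rule are illustrative assumptions about the preprocessing, not the published code:

```python
def tag_bio(text, entities):
    """Assign BIO tags to whitespace tokens by matching each non-null
    entity value against the text. Earlier matches take priority and
    already-tagged tokens are never overwritten, mirroring the overlap
    rule noted under Limitations. Illustrative sketch only."""
    tokens = text.split()
    tags = ["O"] * len(tokens)
    for ent_type, value in entities.items():
        if not value:
            continue  # skip null entity columns
        ent_tokens = value.split()
        n = len(ent_tokens)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == ent_tokens and all(t == "O" for t in tags[i:i + n]):
                tags[i] = "B-" + ent_type.upper()
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + ent_type.upper()
                break  # tag only the first occurrence
    return tokens, tags

tokens, tags = tag_bio(
    "Contactar a Juan Pérez en Chía",
    {"nombre": "Juan Pérez", "municipio": "Chía", "telefono": None},
)
# tags == ["O", "O", "B-NOMBRE", "I-NOMBRE", "O", "B-MUNICIPIO"]
```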

### Training Procedure

The model was fine-tuned using the Hugging Face `transformers` library with the following configuration:

- **Training Arguments:**
  - Epochs: 3
  - Learning Rate: 2e-5
  - Batch Size: 8 (per device)
  - Weight Decay: 0.01
  - Evaluation Strategy: Per epoch
  - Save Strategy: Per epoch
  - Load Best Model at End: True (based on F1 score)
  - Optimizer: AdamW
  - Precision: Mixed precision (FP16) on GPU
  - Seed: 42
- **Hardware:** GPU (CUDA-enabled, if available) or CPU
- **Libraries Used:**
  - `transformers`
  - `datasets`
  - `evaluate`
  - `seqeval`
  - `pandas`
  - `pyarrow`
  - `torch`
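
The hyperparameters above map onto `TrainingArguments` roughly as in this sketch; it is not the authors' script, and `output_dir` and the best-model metric name are assumptions:

```python
from transformers import TrainingArguments

# Sketch of the configuration listed above. "ner_sgd_roberta" as
# output_dir is an assumption, not the authors' actual path.
training_args = TrainingArguments(
    output_dir="ner_sgd_roberta",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    eval_strategy="epoch",        # `evaluation_strategy` on older versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",   # select the checkpoint by F1
    fp16=True,                    # mixed precision on GPU
    seed=42,
)
```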

The training process included:

1. Loading and preprocessing the Parquet dataset.
2. Converting text and entity annotations to BIO format.
3. Tokenizing and aligning labels with sub-tokens.
4. Fine-tuning the model with a custom classification head.
5. Evaluating on the validation set after each epoch.
6. Saving the best model based on the F1 score.
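
Step 3 follows the standard Hugging Face recipe for token classification: special tokens are labeled -100 (ignored by the loss), the first sub-token of each word keeps the word's label, and later sub-tokens are masked. A sketch over a precomputed `word_ids` sequence (the real code would obtain it from the fast tokenizer's `word_ids()` method):

```python
def align_labels_with_tokens(word_labels, word_ids):
    """Map word-level label ids onto sub-token positions.
    word_ids mimics tokenizer(..., is_split_into_words=True).word_ids():
    None for special tokens, otherwise the index of the source word."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)                  # special tokens: ignored by the loss
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first sub-token keeps the label
        else:
            aligned.append(-100)                  # later sub-tokens are masked
        previous = word_id
    return aligned

# "Pérez" split into two sub-tokens: only the first keeps its label id
labels = align_labels_with_tokens(
    [0, 3, 4],                 # e.g. O, B-NOMBRE, I-NOMBRE as label ids
    [None, 0, 1, 2, 2, None],  # word_ids for [CLS] w0 w1 w2a w2b [SEP]
)
# labels == [-100, 0, 3, 4, -100, -100]
```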

### Training Metrics

The model was evaluated on the test set after training, achieving the following metrics:

- **Precision:** 0.8948
- **Recall:** 0.9052
- **F1-Score:** 0.9000
- **Accuracy:** 0.9857
- **Evaluation Loss:** 0.0455
- **Runtime:** 12.16 seconds
- **Samples per Second:** 228.612
- **Steps per Second:** 28.607

## Evaluation

### Evaluation Metrics

The model was evaluated using the `seqeval` metric in strict IOB2 mode, which computes:

- **Precision:** Proportion of predicted entities that exactly match a reference entity.
- **Recall:** Proportion of reference entities that were correctly predicted.
- **F1-Score:** Harmonic mean of precision and recall.
- **Accuracy:** Proportion of correctly classified tokens (including non-entity tokens).
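
In strict entity-level evaluation a predicted entity counts as correct only when both its type and its exact span match the reference; `seqeval` implements this. The following pure-Python sketch is a simplification of seqeval's strict IOB2 mode, for intuition only:

```python
def extract_entities(tags):
    """Collect (type, start, end) spans from an IOB2 tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        inside = tag.startswith("I-") and tag[2:] == etype
        if not inside and etype is not None:
            spans.append((etype, start, i))  # close the open span
            etype = None
        if tag.startswith("B-"):
            etype, start = tag[2:], i        # open a new span
    if etype is not None:
        spans.append((etype, start, len(tags)))
    return spans

def entity_f1(true_tags, pred_tags):
    """Entity-level precision/recall/F1: a prediction counts only if
    both its type and its exact span appear in the reference."""
    gold = set(extract_entities(true_tags))
    pred = set(extract_entities(pred_tags))
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```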

**Test Set Performance:**

- Precision: 0.8948
- Recall: 0.9052
- F1-Score: 0.9000
- Accuracy: 0.9857

### Example Inference

Below are example outputs from the model using the `pipeline` for NER:

**Input Text 1:**
"Se informa que el asunto principal es la Factura #REF123. Contactar a Juan Pérez en la dirección Calle Falsa 123, Bogotá. Teléfono 555-9876 o al mail [email protected]. El documento asociado es el ID-98765."

**Output:**

- Entidad: "Calle Falsa 123, Bogotá" → Tipo: DIRECCION (Confianza: ~0.99)
- Entidad: "555-9876" → Tipo: TELEFONO (Confianza: ~0.98)
- Entidad: "[email protected]" → Tipo: MAIL (Confianza: ~0.99)
- Entidad: "Juan Pérez" → Tipo: NOMBRE (Confianza: ~0.99)
- Entidad: "ID-98765" → Tipo: DOCUMENTO (Confianza: ~0.97)
- Entidad: "#REF123" → Tipo: REFERENCIA (Confianza: ~0.98)

**Input Text 2:**
"Referencia: EXP-002. Municipio de Chía, departamento Cundinamarca. Necesitamos hablar sobre el pago pendiente. Email de contacto: [email protected]. Tel: 3001234567"

**Output:**

- Entidad: "EXP-002" → Tipo: REFERENCIA (Confianza: ~0.98)
- Entidad: "Chía" → Tipo: MUNICIPIO (Confianza: ~0.99)
- Entidad: "Cundinamarca" → Tipo: DEPARTAMENTO (Confianza: ~0.99)
- Entidad: "[email protected]" → Tipo: MAIL (Confianza: ~0.99)
- Entidad: "3001234567" → Tipo: TELEFONO (Confianza: ~0.98)

## Usage

### Using the Model with Hugging Face Transformers

To use the model for inference, load it with the `transformers` library and create a `pipeline` for NER:

```python
import torch  # needed for the CUDA availability check below
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Load the model and tokenizer
model = AutoModelForTokenClassification.from_pretrained("excribe/ner_sgd_roberta")
tokenizer = AutoTokenizer.from_pretrained("excribe/ner_sgd_roberta")

# Create the NER pipeline
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
    device=0 if torch.cuda.is_available() else -1,
)

# Example text
text = "Contactar a Juan Pérez en Calle Falsa 123, Bogotá. Teléfono 555-9876."

# Run inference
entities = ner_pipeline(text)
for entity in entities:
    print(f"Entidad: {entity['word']} → Tipo: {entity['entity_group']} (Confianza: {entity['score']:.4f})")
```

### Installation Requirements

To run the model, install the required libraries:

```bash
pip install transformers[torch] datasets evaluate seqeval accelerate pandas pyarrow
```

### Hardware Requirements

- **Inference:** Can run on CPU or GPU. A GPU (e.g., NVIDIA with CUDA) is recommended for faster processing.
- **Training:** A GPU with at least 8 GB VRAM is recommended for fine-tuning. The model was trained with mixed precision (FP16) to optimize memory usage.

## Limitations

- **Dataset Bias:** The model was trained on administrative texts, so it may not generalize well to other domains (e.g., social media, literature).
- **Entity Overlap:** The preprocessing handles overlapping entities by prioritizing earlier matches, which may lead to missed entities in complex cases.
- **Null Values:** High null rates in some entity columns (e.g., `telefono`: 10,073 nulls) may reduce performance for those entities.
- **Language:** The model is optimized for Spanish and may not perform well on other languages.

## Citation

If you use this model, please cite:

```bibtex
@misc{excribe_ner_sgd_roberta,
  author       = {Exscribe},
  title        = {NER Model for Spanish Administrative Texts},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/excribe/ner_sgd_roberta}}
}
```

## Contact

For questions or issues, please contact the maintainers via the Hugging Face repository or open an issue.