---
license: cc-by-nc-3.0
base_model: PlanTL-GOB-ES/roberta-base-bne
tags:
- ner
- named-entity-recognition
- spanish
- roberta
- huggingface
- excribe
datasets:
- excribe/ner_sgd_dataset
metrics:
- precision
- recall
- f1
- accuracy
language:
- es
pipeline_tag: token-classification
---

# Model Card for excribe/ner_sgd_roberta

## Model Details

### Model Description

This model is a fine-tuned version of PlanTL-GOB-ES/roberta-base-bne for **Named Entity Recognition (NER)** in Spanish. It identifies the entities `direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, and `municipio` in administrative and governmental correspondence. The model uses the BIO (Beginning, Inside, Outside) tagging scheme and was trained on a custom dataset derived from a Parquet file (`final.parquet`).

- **Developed by:** Exscribe
- **Model type:** Token Classification (NER)
- **Language(s):** Spanish (es)
- **License:** CC-BY-NC-3.0
- **Base Model:** PlanTL-GOB-ES/roberta-base-bne
- **Finetuned Model Repository:** excribe/ner_sgd_roberta

### Model Architecture

The model is based on the RoBERTa architecture, specifically the `PlanTL-GOB-ES/roberta-base-bne` checkpoint, which is pre-trained on a large corpus of Spanish texts. It was fine-tuned for token classification with a classification head sized to the 17-label schema below.

- **Number of Labels:** 17 (the `O` tag plus B/I tags for 8 entity types: `DIRECCION`, `TELEFONO`, `MAIL`, `NOMBRE`, `DOCUMENTO`, `REFERENCIA`, `DEPARTAMENTO`, `MUNICIPIO`)
- **Label Schema:** BIO (e.g., `B-DIRECCION`, `I-DIRECCION`, `O`); the full set is sketched below
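
For reference, the full label set can be reconstructed from the entity types; a minimal sketch (the label ordering in the released `config.json` may differ):

```python
# Build the 17-label BIO schema: "O" plus B-/I- tags for each of the 8 entity types.
ENTITY_TYPES = [
    "DIRECCION", "TELEFONO", "MAIL", "NOMBRE",
    "DOCUMENTO", "REFERENCIA", "DEPARTAMENTO", "MUNICIPIO",
]

LABELS = ["O"] + [f"{prefix}-{ent}" for ent in ENTITY_TYPES for prefix in ("B", "I")]
id2label = dict(enumerate(LABELS))
label2id = {label: i for i, label in id2label.items()}

assert len(LABELS) == 17
```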

## Training Details

### Training Data

The model was trained on a custom dataset derived from a Parquet file (`final.parquet`) containing administrative texts. The dataset includes:

- **Number of Rows:** 27,807
- **Number of Columns:** 32
- **Key Columns Used for NER:**
  - `texto_entrada` (input text)
  - Entity columns: `direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, `municipio`
- **Null Values per Entity Column:**
  - `direccion`: 82
  - `telefono`: 10,073
  - `mail`: 1,086
  - `nombre`: 0
  - `documento`: 6,407
  - `referencia`: 200
  - `departamento`: 0
  - `municipio`: 0
- **Dataset Description:** The dataset contains administrative correspondence with fields such as case IDs (`radicado`), filing dates (`fecha_radicacion`), document paths, and input texts (`texto_entrada`). The entity columns were used to generate BIO tags for NER training.

The dataset was preprocessed to convert raw text and entity annotations into BIO format (sketched below), tokenized with the `PlanTL-GOB-ES/roberta-base-bne` tokenizer, and split into training (81%), validation (9%), and test (10%) sets.
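
A minimal sketch of this span-to-BIO conversion, assuming whitespace tokenization and exact string matching (the helper name and matching details are illustrative, not the exact preprocessing code); it keeps the earliest match when spans would overlap, as noted under Limitations:

```python
def spans_to_bio(text, entities):
    """Turn entity column values (e.g. {"NOMBRE": "Juan Pérez"}) into word-level BIO tags."""
    words = text.split()
    tags = ["O"] * len(words)
    for ent_type, value in entities.items():
        if not isinstance(value, str) or not value.strip():
            continue  # skip null entity columns (frequent for e.g. `telefono`)
        ent_words = value.split()
        n = len(ent_words)
        for i in range(len(words) - n + 1):
            # Tag the first untagged occurrence; earlier matches take priority.
            if words[i : i + n] == ent_words and all(t == "O" for t in tags[i : i + n]):
                tags[i] = f"B-{ent_type}"
                tags[i + 1 : i + n] = [f"I-{ent_type}"] * (n - 1)
                break
    return words, tags

words, tags = spans_to_bio(
    "Contactar a Juan Pérez en Calle Falsa 123",
    {"NOMBRE": "Juan Pérez", "DIRECCION": "Calle Falsa 123"},
)
# tags == ["O", "O", "B-NOMBRE", "I-NOMBRE", "O", "B-DIRECCION", "I-DIRECCION", "I-DIRECCION"]
```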

### Training Procedure

The model was fine-tuned with the Hugging Face `transformers` library using the following configuration (a `TrainingArguments` sketch follows this list):

- **Training Arguments:**
  - Epochs: 3
  - Learning Rate: 2e-5
  - Batch Size: 8 (per device)
  - Weight Decay: 0.01
  - Evaluation Strategy: per epoch
  - Save Strategy: per epoch
  - Load Best Model at End: true (based on F1 score)
  - Optimizer: AdamW
  - Precision: mixed precision (FP16) on GPU
  - Seed: 42
- **Hardware:** GPU (CUDA-enabled, if available) or CPU
- **Libraries Used:**
  - `transformers`
  - `datasets`
  - `evaluate`
  - `seqeval`
  - `pandas`
  - `pyarrow`
  - `torch`
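
The configuration above maps onto `transformers.TrainingArguments` roughly as follows; a sketch under the stated hyperparameters, not the exact training script (`output_dir` is illustrative, and recent `transformers` releases rename `evaluation_strategy` to `eval_strategy`):

```python
import torch
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="ner_sgd_roberta",      # illustrative output path
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    evaluation_strategy="epoch",       # evaluate after each epoch
    save_strategy="epoch",             # checkpoint after each epoch
    load_best_model_at_end=True,
    metric_for_best_model="f1",        # best checkpoint selected by F1
    fp16=torch.cuda.is_available(),    # mixed precision on GPU only
    seed=42,
)
```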

The training process included:

1. Loading and preprocessing the Parquet dataset.
2. Converting text and entity annotations to BIO format.
3. Tokenizing and aligning labels with sub-tokens (see the sketch after this list).
4. Fine-tuning the model with a custom classification head.
5. Evaluating on the validation set after each epoch.
6. Saving the best model based on the F1 score.
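
Step 3 follows the standard `transformers` pattern for token classification: only the first sub-token of each word keeps its tag, and all other positions are set to `-100` so the loss ignores them. A minimal sketch, assuming word-level `tags` from the BIO conversion and the `label2id` mapping sketched earlier:

```python
def tokenize_and_align_labels(examples, tokenizer, label2id):
    """Tokenize pre-split words and align word-level BIO tags with sub-tokens."""
    tokenized = tokenizer(examples["words"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(examples["tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)                     # special tokens: ignored by the loss
            elif word_id != previous_word:
                label_ids.append(label2id[tags[word_id]])  # first sub-token carries the tag
            else:
                label_ids.append(-100)                     # later sub-tokens are masked out
            previous_word = word_id
        all_labels.append(label_ids)
    tokenized["labels"] = all_labels
    return tokenized
```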

### Training Metrics

The model was evaluated on the test set after training, achieving:

- **Precision:** 0.8948
- **Recall:** 0.9052
- **F1-Score:** 0.9000
- **Accuracy:** 0.9857
- **Evaluation Loss:** 0.0455
- **Runtime:** 12.16 seconds
- **Samples per Second:** 228.612
- **Steps per Second:** 28.607

## Evaluation

### Evaluation Metrics

The model was evaluated with the `seqeval` metric in strict IOB2 mode (a `compute_metrics` sketch appears at the end of this section), which computes:

- **Precision:** Proportion of predicted entities that exactly match a gold entity in both span and type.
- **Recall:** Proportion of gold entities that are exactly recovered.
- **F1-Score:** Harmonic mean of precision and recall.
- **Accuracy:** Proportion of correctly classified tokens (including non-entity tokens).

**Test Set Performance:**

- Precision: 0.8948
- Recall: 0.9052
- F1-Score: 0.9000
- Accuracy: 0.9857
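
A `compute_metrics` sketch matching this setup, using the `evaluate` wrapper around `seqeval` in strict IOB2 mode (the `id2label` mapping is assumed from the schema sketched earlier; bind it with `functools.partial` when passing the function to `Trainer`):

```python
import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_pred, id2label):
    """Drop ignored positions (-100), then score entities with seqeval in strict IOB2 mode."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    true_tags = [[id2label[l] for l in row if l != -100] for row in labels]
    pred_tags = [
        [id2label[p] for p, l in zip(p_row, l_row) if l != -100]
        for p_row, l_row in zip(predictions, labels)
    ]
    results = seqeval.compute(
        predictions=pred_tags, references=true_tags, mode="strict", scheme="IOB2"
    )
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
```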

### Example Inference

Below are example outputs from the model using the `pipeline` for NER:

**Input Text 1:**
"Se informa que el asunto principal es la Factura #REF123. Contactar a Juan Pérez en la dirección Calle Falsa 123, Bogotá. Teléfono 555-9876 o al mail [email protected]. El documento asociado es el ID-98765."

**Output:**

- Entidad: "Calle Falsa 123, Bogotá" → Tipo: DIRECCION (Confianza: ~0.99)
- Entidad: "555-9876" → Tipo: TELEFONO (Confianza: ~0.98)
- Entidad: "[email protected]" → Tipo: MAIL (Confianza: ~0.99)
- Entidad: "Juan Pérez" → Tipo: NOMBRE (Confianza: ~0.99)
- Entidad: "ID-98765" → Tipo: DOCUMENTO (Confianza: ~0.97)
- Entidad: "#REF123" → Tipo: REFERENCIA (Confianza: ~0.98)

**Input Text 2:**
"Referencia: EXP-002. Municipio de Chía, departamento Cundinamarca. Necesitamos hablar sobre el pago pendiente. Email de contacto: [email protected]. Tel: 3001234567"

**Output:**

- Entidad: "EXP-002" → Tipo: REFERENCIA (Confianza: ~0.98)
- Entidad: "Chía" → Tipo: MUNICIPIO (Confianza: ~0.99)
- Entidad: "Cundinamarca" → Tipo: DEPARTAMENTO (Confianza: ~0.99)
- Entidad: "[email protected]" → Tipo: MAIL (Confianza: ~0.99)
- Entidad: "3001234567" → Tipo: TELEFONO (Confianza: ~0.98)

## Usage

### Using the Model with Hugging Face Transformers

To use the model for inference, load it with the `transformers` library and create a `pipeline` for NER:

```python
import torch
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hub
model = AutoModelForTokenClassification.from_pretrained("excribe/ner_sgd_roberta")
tokenizer = AutoTokenizer.from_pretrained("excribe/ner_sgd_roberta")

# Create the NER pipeline; "simple" aggregation merges sub-tokens into whole entities
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
    device=0 if torch.cuda.is_available() else -1,
)

# Example text
text = "Contactar a Juan Pérez en Calle Falsa 123, Bogotá. Teléfono 555-9876."

# Perform inference
entities = ner_pipeline(text)
for entity in entities:
    print(f"Entidad: {entity['word']} → Tipo: {entity['entity_group']} (Confianza: {entity['score']:.4f})")
```

### Installation Requirements

To run the model, install the required libraries:

```bash
pip install "transformers[torch]" datasets evaluate seqeval accelerate pandas pyarrow
```

### Hardware Requirements

- **Inference:** Runs on CPU or GPU; a CUDA-enabled NVIDIA GPU is recommended for faster processing.
- **Training:** A GPU with at least 8 GB of VRAM is recommended for fine-tuning. The model was trained with mixed precision (FP16) to reduce memory usage.

## Limitations

- **Dataset Bias:** The model was trained on administrative texts and may not generalize to other domains (e.g., social media, literature).
- **Entity Overlap:** Preprocessing resolves overlapping entities by keeping the earliest match, which can miss entities in complex cases.
- **Null Values:** High null rates in some entity columns (e.g., `telefono` is null in 10,073 of 27,807 rows) may reduce performance for those entities.
- **Language:** The model is optimized for Spanish and may not perform well on other languages.

## Citation

If you use this model, please cite:

```bibtex
@misc{excribe_ner_sgd_roberta,
  author       = {Exscribe},
  title        = {NER Model for Spanish Administrative Texts},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/excribe/ner_sgd_roberta}}
}
```

## Contact

For questions or issues, contact the maintainers via the Hugging Face repository or open an issue there.