---
license: cc-by-nc-3.0
base_model: PlanTL-GOB-ES/roberta-base-bne
tags:
- ner
- named-entity-recognition
- spanish
- roberta
- huggingface
- excribe
datasets:
- excribe/ner_sgd_dataset
metrics:
- precision
- recall
- f1
- accuracy
language:
- es
pipeline_tag: token-classification
---
# Model Card for excribe/ner_sgd_roberta
## Model Details
### Model Description
This model is a fine-tuned version of PlanTL-GOB-ES/roberta-base-bne for **Named Entity Recognition (NER)** in Spanish. It is designed to identify entities such as `direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, and `municipio` in texts related to administrative or governmental correspondence. The model uses the BIO (Beginning, Inside, Outside) tagging scheme and was trained on a custom dataset derived from a Parquet file (`final.parquet`).
- **Developed by:** Exscribe
- **Model type:** Token Classification (NER)
- **Language(s):** Spanish (es)
- **License:** CC-BY-NC-3.0
- **Base Model:** PlanTL-GOB-ES/roberta-base-bne
- **Fine-tuned Model Repository:** excribe/ner_sgd_roberta
### Model Architecture
The model is based on the RoBERTa architecture, specifically the `PlanTL-GOB-ES/roberta-base-bne` checkpoint, which is pre-trained on a large corpus of Spanish texts. It has been fine-tuned for token classification with a custom classification head tailored to the defined entity labels.
- **Number of Labels:** 17 (including `O` and BIO tags for 8 entity types: `DIRECCION`, `TELEFONO`, `MAIL`, `NOMBRE`, `DOCUMENTO`, `REFERENCIA`, `DEPARTAMENTO`, `MUNICIPIO`)
- **Label Schema:** BIO (e.g., `B-DIRECCION`, `I-DIRECCION`, `O`)
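The label inventory can be reconstructed programmatically. The snippet below is a reference sketch only: the label ordering shown here is illustrative, and the authoritative `id2label` mapping should be read from the published checkpoint's config.
```python
from transformers import AutoConfig

# The eight entity types listed above.
ENTITY_TYPES = [
    "DIRECCION", "TELEFONO", "MAIL", "NOMBRE",
    "DOCUMENTO", "REFERENCIA", "DEPARTAMENTO", "MUNICIPIO",
]

# Illustrative BIO label set: "O" plus B-/I- tags for each type (17 labels in total).
labels = ["O"] + [f"{prefix}-{ent}" for ent in ENTITY_TYPES for prefix in ("B", "I")]

# The authoritative mapping ships with the checkpoint itself.
config = AutoConfig.from_pretrained("excribe/ner_sgd_roberta")
print(config.id2label)
```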
## Training Details
### Training Data
The model was trained on a custom dataset derived from a Parquet file (`final.parquet`) containing administrative texts. The dataset includes:
- **Number of Rows:** 27,807
- **Number of Columns:** 32
- **Key Columns Used for NER:**
- `texto_entrada` (input text)
- Entity columns: `direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, `municipio`
- **Null Values per Entity Column:**
- `direccion`: 82
- `telefono`: 10,073
- `mail`: 1,086
- `nombre`: 0
- `documento`: 6,407
- `referencia`: 200
- `departamento`: 0
- `municipio`: 0
- **Dataset Description:** The dataset contains administrative correspondence data with fields like case IDs (`radicado`), dates (`fecha_radicacion`), document paths, and text inputs (`texto_entrada`). The entity columns were used to generate BIO tags for NER training.
The dataset was preprocessed to convert raw text and entity annotations into BIO format, tokenized using the `PlanTL-GOB-ES/roberta-base-bne` tokenizer, and split into training (81%), validation (9%), and test (10%) sets.
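The original preprocessing script is not published with this card. The sketch below illustrates the general approach under stated assumptions: the span-matching step that turns the entity columns into word-level BIO tags is elided, the helper name `tokenize_and_align_labels` and the variable names are hypothetical, and sub-tokens after the first one of each word are masked with `-100` so the loss ignores them.
```python
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-base-bne")

def tokenize_and_align_labels(words, word_label_ids):
    """Align word-level BIO label ids with sub-word tokens (hypothetical helper)."""
    encoding = tokenizer(words, is_split_into_words=True, truncation=True)
    aligned, previous_word = [], None
    for word_id in encoding.word_ids():
        if word_id is None:
            aligned.append(-100)                     # special tokens: ignored by the loss
        elif word_id != previous_word:
            aligned.append(word_label_ids[word_id])  # first sub-token carries the label
        else:
            aligned.append(-100)                     # remaining sub-tokens are masked out
        previous_word = word_id
    encoding["labels"] = aligned
    return encoding

# Load the Parquet file and split 90/10, then 90/10 again → 81% / 9% / 10% overall.
df = pd.read_parquet("final.parquet")
dataset = Dataset.from_pandas(df)
splits = dataset.train_test_split(test_size=0.10, seed=42)
train_val = splits["train"].train_test_split(test_size=0.10, seed=42)
train_ds, val_ds, test_ds = train_val["train"], train_val["test"], splits["test"]
```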
### Training Procedure
The model was fine-tuned using the Hugging Face `transformers` library with the following configuration:
- **Training Arguments:**
- Epochs: 3
- Learning Rate: 2e-5
- Batch Size: 8 (per device)
- Weight Decay: 0.01
- Evaluation Strategy: Per epoch
- Save Strategy: Per epoch
- Load Best Model at End: True (based on F1 score)
- Optimizer: AdamW
- Precision: Mixed precision (FP16) on GPU
- Seed: 42
- **Hardware:** GPU (CUDA-enabled, if available) or CPU
- **Libraries Used:**
- `transformers`
- `datasets`
- `evaluate`
- `seqeval`
- `pandas`
- `pyarrow`
- `torch`
The training process included:
1. Loading and preprocessing the Parquet dataset.
2. Converting text and entity annotations to BIO format.
3. Tokenizing and aligning labels with sub-tokens.
4. Fine-tuning the model with a custom classification head.
5. Evaluating on the validation set after each epoch.
6. Saving the best model based on the F1 score.
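The exact training script is likewise not included. The following is a minimal sketch of the configuration listed above using the standard `Trainer` API; `labels`, `tokenizer`, `train_ds`, and `val_ds` are carried over from the sketches above, and `compute_metrics` is defined in the evaluation sketch further below.
```python
import torch
from transformers import (
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

# Label mappings built from the 17 BIO labels (see the label sketch above).
id2label = dict(enumerate(labels))
label2id = {lab: i for i, lab in id2label.items()}

model = AutoModelForTokenClassification.from_pretrained(
    "PlanTL-GOB-ES/roberta-base-bne",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)

args = TrainingArguments(
    output_dir="ner_sgd_roberta",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    eval_strategy="epoch",           # "evaluation_strategy" in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    fp16=torch.cuda.is_available(),  # mixed precision only when a GPU is present
    seed=42,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,          # assumed already tokenized and label-aligned
    eval_dataset=val_ds,
    data_collator=DataCollatorForTokenClassification(tokenizer),
    compute_metrics=compute_metrics,
)
trainer.train()
```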
### Training Metrics
The model was evaluated on the test set after training, achieving the following metrics:
- **Precision:** 0.8948
- **Recall:** 0.9052
- **F1-Score:** 0.9000
- **Accuracy:** 0.9857
- **Evaluation Loss:** 0.0455
- **Runtime:** 12.16 seconds
- **Samples per Second:** 228.612
- **Steps per Second:** 28.607
## Evaluation
### Evaluation Metrics
The model was evaluated using the `seqeval` metric in strict IOB2 mode, which scores predictions at the entity level:
- **Precision:** Proportion of predicted entity spans that exactly match a gold span.
- **Recall:** Proportion of gold entity spans that are exactly recovered.
- **F1-Score:** Harmonic mean of precision and recall.
- **Accuracy:** Proportion of correctly classified tokens (including non-entity `O` tokens).
**Test Set Performance:**
- Precision: 0.8948
- Recall: 0.9052
- F1-Score: 0.9000
- Accuracy: 0.9857
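The evaluation code used to produce these numbers is not published; a typical `compute_metrics` implementation with `evaluate` and `seqeval` looks like the sketch below, which reuses the `labels` list from the label sketch above.
```python
import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_pred):
    """Entity-level precision/recall/F1 plus token accuracy via seqeval."""
    logits, label_ids = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Drop positions labelled -100 (special tokens and non-first sub-tokens).
    true_labels = [
        [labels[l] for l in row if l != -100]
        for row in label_ids
    ]
    true_preds = [
        [labels[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, label_ids)
    ]

    results = seqeval.compute(
        predictions=true_preds, references=true_labels,
        scheme="IOB2", mode="strict",
    )
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
```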
### Example Inference
Below are example outputs from the model using the `pipeline` for NER:
**Input Text 1:**\
"Se informa que el asunto principal es la Factura #REF123. Contactar a Juan Pérez en la dirección Calle Falsa 123, Bogotá. Teléfono 555-9876 o al mail [email protected]. El documento asociado es el ID-98765."
**Output:**
- Entidad: "Calle Falsa 123, Bogotá" → Tipo: DIRECCION (Confianza: \~0.99)
- Entidad: "555-9876" → Tipo: TELEFONO (Confianza: \~0.98)
- Entidad: "[email protected]" → Tipo: MAIL (Confianza: \~0.99)
- Entidad: "Juan Pérez" → Tipo: NOMBRE (Confianza: \~0.99)
- Entidad: "ID-98765" → Tipo: DOCUMENTO (Confianza: \~0.97)
- Entidad: "#REF123" → Tipo: REFERENCIA (Confianza: \~0.98)
**Input Text 2:**\
"Referencia: EXP-002. Municipio de Chía, departamento Cundinamarca. Necesitamos hablar sobre el pago pendiente. Email de contacto: [email protected]. Tel: 3001234567"
**Output:**
- Entidad: "EXP-002" → Tipo: REFERENCIA (Confianza: \~0.98)
- Entidad: "Chía" → Tipo: MUNICIPIO (Confianza: \~0.99)
- Entidad: "Cundinamarca" → Tipo: DEPARTAMENTO (Confianza: \~0.99)
- Entidad: "[email protected]" → Tipo: MAIL (Confianza: \~0.99)
- Entidad: "3001234567" → Tipo: TELEFONO (Confianza: \~0.98)
## Usage
### Using the Model with Hugging Face Transformers
To use the model for inference, you can load it with the `transformers` library and create a `pipeline` for NER:
```python
import torch
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Load the fine-tuned model and its tokenizer
model = AutoModelForTokenClassification.from_pretrained("excribe/ner_sgd_roberta")
tokenizer = AutoTokenizer.from_pretrained("excribe/ner_sgd_roberta")

# Create the NER pipeline; "simple" aggregation merges B-/I- sub-tokens into whole entities
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
    device=0 if torch.cuda.is_available() else -1,
)

# Example text
text = "Contactar a Juan Pérez en Calle Falsa 123, Bogotá. Teléfono 555-9876."

# Perform inference
entities = ner_pipeline(text)
for entity in entities:
    print(f"Entidad: {entity['word']} → Tipo: {entity['entity_group']} (Confianza: {entity['score']:.4f})")
```
### Installation Requirements
To run the model, install the required libraries:
```bash
pip install transformers[torch] datasets evaluate seqeval accelerate pandas pyarrow
```
### Hardware Requirements
- **Inference:** Can run on CPU or GPU. GPU (e.g., NVIDIA with CUDA) is recommended for faster processing.
- **Training:** GPU with at least 8GB VRAM is recommended for fine-tuning. The model was trained with mixed precision (FP16) to optimize memory usage.
## Limitations
- **Dataset Bias:** The model was trained on administrative texts, so it may not generalize well to other domains (e.g., social media, literature).
- **Entity Overlap:** The preprocessing handles overlapping entities by prioritizing earlier matches, which may lead to missed entities in complex cases.
- **Null Values:** High null rates in some entity columns (e.g., `telefono` is missing in 10,073 of the 27,807 rows) may reduce performance for those entities.
- **Language:** The model is optimized for Spanish and may not perform well on other languages.
## Citation
If you use this model, please cite:
```bibtex
@misc{excribe_ner_sgd_roberta,
  author       = {Exscribe},
  title        = {NER Model for Spanish Administrative Texts},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/excribe/ner_sgd_roberta}}
}
```
## Contact
For questions or issues, please contact the maintainers via the Hugging Face repository or open an issue.