---
license: cc-by-nc-3.0
base_model: PlanTL-GOB-ES/roberta-base-bne
tags:
- ner
- named-entity-recognition
- spanish
- roberta
- huggingface
- excribe
datasets:
- excribe/ner_sgd_dataset
metrics:
- precision
- recall
- f1
- accuracy
language:
- es
pipeline_tag: token-classification
---

# Model Card for excribe/ner_sgd_roberta

## Model Details

### Model Description

This model is a fine-tuned version of `PlanTL-GOB-ES/roberta-base-bne` for **Named Entity Recognition (NER)** in Spanish. It identifies eight entity types (`direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, and `municipio`) in texts related to administrative or governmental correspondence. The model uses the BIO (Beginning, Inside, Outside) tagging scheme and was trained on a custom dataset derived from a Parquet file (`final.parquet`).

- **Developed by:** Exscribe
- **Model type:** Token Classification (NER)
- **Language(s):** Spanish (es)
- **License:** CC-BY-NC-3.0
- **Base Model:** PlanTL-GOB-ES/roberta-base-bne
- **Finetuned Model Repository:** excribe/ner_sgd_roberta

### Model Architecture

The model is based on the RoBERTa architecture, specifically the `PlanTL-GOB-ES/roberta-base-bne` checkpoint, which is pre-trained on a large corpus of Spanish texts. It has been fine-tuned for token classification with a custom classification head tailored to the defined entity labels.

- **Number of Labels:** 17 (`O` plus one `B-` and one `I-` tag for each of the 8 entity types: `DIRECCION`, `TELEFONO`, `MAIL`, `NOMBRE`, `DOCUMENTO`, `REFERENCIA`, `DEPARTAMENTO`, `MUNICIPIO`)
- **Label Schema:** BIO (e.g., `B-DIRECCION`, `I-DIRECCION`, `O`)
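
The label count above follows directly from the BIO scheme: one `O` tag plus two tags per entity type. The sketch below rebuilds such a label list. The ordering of the labels and the helper name `build_label_maps` are assumptions made for illustration; the mapping actually used by the published model is whatever is stored in its `config.id2label`.

```python
# Entity types as listed in this card; the ordering below is an assumption.
ENTITY_TYPES = [
    "DIRECCION", "TELEFONO", "MAIL", "NOMBRE",
    "DOCUMENTO", "REFERENCIA", "DEPARTAMENTO", "MUNICIPIO",
]


def build_label_maps(entity_types):
    """Return (labels, label2id, id2label) for a BIO scheme with 'O' first."""
    labels = ["O"]
    for entity in entity_types:
        labels.extend([f"B-{entity}", f"I-{entity}"])
    label2id = {label: idx for idx, label in enumerate(labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return labels, label2id, id2label


labels, label2id, id2label = build_label_maps(ENTITY_TYPES)
print(len(labels))  # 17 = 1 + 2 * 8
```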

## Training Details

### Training Data

The model was trained on a custom dataset derived from a Parquet file (`final.parquet`) containing administrative texts. The dataset includes:

- **Number of Rows:** 27,807
- **Number of Columns:** 32
- **Key Columns Used for NER:**
  - `texto_entrada` (input text)
  - Entity columns: `direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, `municipio`
- **Null Values per Entity Column:**
  - `direccion`: 82
  - `telefono`: 10,073
  - `mail`: 1,086
  - `nombre`: 0
  - `documento`: 6,407
  - `referencia`: 200
  - `departamento`: 0
  - `municipio`: 0
- **Dataset Description:** The dataset contains administrative correspondence data with fields like case IDs (`radicado`), dates (`fecha_radicacion`), document paths, and text inputs (`texto_entrada`). The entity columns were used to generate BIO tags for NER training.
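
To make the step from entity columns to BIO tags concrete, here is a minimal sketch of one way it could work. It is not the original preprocessing code: the function name `bio_tag`, the exact-substring lookup, and the whitespace word splitting are all assumptions. The only detail taken from this card is that overlapping matches are resolved in favor of the earlier one.

```python
import re


def bio_tag(text, entities):
    """Tag whitespace-delimited words of `text` with BIO labels.

    `entities` maps an entity column name (e.g. "nombre") to its annotated
    string, or None when the column is null for this row.
    """
    # Locate each non-null entity value by exact substring search (assumption).
    spans = []
    for name, value in entities.items():
        if not value:
            continue
        start = text.find(value)
        if start != -1:
            spans.append((start, start + len(value), name.upper()))

    # Resolve overlaps by keeping the span that starts earliest.
    spans.sort()
    kept, last_end = [], 0
    for start, end, label in spans:
        if start >= last_end:
            kept.append((start, end, label))
            last_end = end

    words, tags = [], []
    for match in re.finditer(r"\S+", text):
        w_start, w_end = match.span()
        tag = "O"
        for start, end, label in kept:
            if w_start < end and w_end > start:  # word overlaps this span
                # The first overlapping word starts at or before the span start.
                tag = ("B-" if w_start <= start else "I-") + label
                break
        words.append(match.group())
        tags.append(tag)
    return words, tags


words, tags = bio_tag(
    "Contactar a Juan Pérez en Calle Falsa 123, Bogotá.",
    {"nombre": "Juan Pérez", "direccion": "Calle Falsa 123, Bogotá", "telefono": None},
)
for word, tag in zip(words, tags):
    print(f"{word}\t{tag}")
```

Rows whose entity column is null simply contribute no span for that type, which is why columns with many nulls yield fewer positive examples.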

The dataset was preprocessed to convert raw text and entity annotations into BIO format, tokenized using the `PlanTL-GOB-ES/roberta-base-bne` tokenizer, and split into training (81%), validation (9%), and test (10%) sets.
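
The card does not say how the split was produced. One procedure that yields exactly these proportions is a two-stage split: hold out 10% for testing, then hold out 10% of the remainder for validation (0.9 × 0.1 = 9%). The sketch below illustrates that arithmetic on row indices. The two-stage approach, the helper name `split_indices`, and the reuse of seed 42 are assumptions.

```python
import random


def split_indices(n_rows, test_frac=0.10, val_frac_of_rest=0.10, seed=42):
    """Shuffle row indices, carve out the test set, then the validation set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_frac)
    n_val = round((n_rows - n_test) * val_frac_of_rest)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


N_ROWS = 27_807  # row count reported in this card
train, val, test = split_indices(N_ROWS)
for name, part in [("train", train), ("validation", val), ("test", test)]:
    print(f"{name}: {len(part)} rows ({len(part) / N_ROWS:.1%})")
```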

### Training Procedure

The model was fine-tuned using the Hugging Face `transformers` library with the following configuration:

- **Training Arguments:**
  - Epochs: 3
  - Learning Rate: 2e-5
  - Batch Size: 8 (per device)
  - Weight Decay: 0.01
  - Evaluation Strategy: Per epoch
  - Save Strategy: Per epoch
  - Load Best Model at End: True (based on F1 score)
  - Optimizer: AdamW
  - Precision: Mixed precision (FP16) on GPU
  - Seed: 42
- **Hardware:** GPU (CUDA-enabled, if available) or CPU
- **Libraries Used:**
  - `transformers`
  - `datasets`
  - `evaluate`
  - `seqeval`
  - `pandas`
  - `pyarrow`
  - `torch`
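
For readers who want to reproduce a similar setup, the configuration above can be collected as keyword arguments for `transformers.TrainingArguments`. This is a sketch rather than the original training script: `output_dir` is a hypothetical path, the evaluation batch size is assumed to equal the training batch size, and the key `eval_strategy` is spelled `evaluation_strategy` in older `transformers` releases. AdamW is the `Trainer` default, so it needs no explicit entry.

```python
def build_training_kwargs(use_gpu: bool, output_dir: str = "./ner_sgd_roberta_out") -> dict:
    """Collect the hyperparameters listed in this card as keyword arguments."""
    return {
        "output_dir": output_dir,            # hypothetical path
        "num_train_epochs": 3,
        "learning_rate": 2e-5,
        "per_device_train_batch_size": 8,
        "per_device_eval_batch_size": 8,     # assumed equal to the training batch size
        "weight_decay": 0.01,
        "eval_strategy": "epoch",            # `evaluation_strategy` in older releases
        "save_strategy": "epoch",
        "load_best_model_at_end": True,
        "metric_for_best_model": "f1",
        "fp16": use_gpu,                     # mixed precision only when training on GPU
        "seed": 42,
    }


kwargs = build_training_kwargs(use_gpu=False)
print(kwargs)
# With `transformers` installed: training_args = TrainingArguments(**kwargs)
```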

The training process included:

1. Loading and preprocessing the Parquet dataset.
2. Converting text and entity annotations to BIO format.
3. Tokenizing and aligning labels with sub-tokens.
4. Fine-tuning the model with a custom classification head.
5. Evaluating on the validation set after each epoch.
6. Saving the best model based on the F1 score.
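
Step 3 is the least obvious part of the procedure: the tokenizer may split one word into several sub-tokens, while the BIO labels are defined per word. A common convention, assumed here and not confirmed by this card, is to label only the first sub-token of each word and mask the rest with `-100`, which PyTorch's cross-entropy loss ignores by default. The `word_ids` input has the shape returned by the `word_ids()` method of a fast tokenizer's encoding; the helper name `align_labels` is hypothetical.

```python
IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss by default


def align_labels(word_ids, word_label_ids):
    """Expand word-level label ids to sub-token level.

    `word_ids` has one entry per sub-token: the index of the source word,
    or None for special tokens such as <s> and </s>.
    """
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)             # special token
        elif word_id != previous:
            aligned.append(word_label_ids[word_id])  # first sub-token of a word
        else:
            aligned.append(IGNORE_INDEX)             # continuation sub-token
        previous = word_id
    return aligned


# Three words; the second one was split into two sub-tokens.
word_ids = [None, 0, 1, 1, 2, None]
word_label_ids = [0, 3, 4]
print(align_labels(word_ids, word_label_ids))  # [-100, 0, 3, -100, 4, -100]
```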

### Training Metrics

After training, the best checkpoint was evaluated on the held-out test set, with the following results:

- **Precision:** 0.8948
- **Recall:** 0.9052
- **F1-Score:** 0.9000
- **Accuracy:** 0.9857
- **Evaluation Loss:** 0.0455
- **Runtime:** 12.16 seconds
- **Samples per Second:** 228.612
- **Steps per Second:** 28.607
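
The reported figures are internally consistent, which a few lines of arithmetic can confirm. The F1 score is the harmonic mean of the reported precision and recall, and the throughput numbers imply roughly 2,780 evaluated samples (about 10% of the 27,807 rows) in batches of 8. The variable names below are illustrative only.

```python
precision, recall = 0.8948, 0.9052
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.4f}")  # F1 = 0.9000

runtime_s, samples_per_s, steps_per_s = 12.16, 228.612, 28.607
n_samples = runtime_s * samples_per_s   # evaluated samples implied by the throughput
n_steps = runtime_s * steps_per_s       # evaluation steps implied by the throughput
print(f"~{n_samples:.0f} samples in ~{n_steps:.0f} steps")
print(f"~{samples_per_s / steps_per_s:.1f} samples per step")
```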

## Evaluation

### Evaluation Metrics

The model was evaluated using the `seqeval` metric in strict IOB2 mode. Precision, recall, and F1 are computed at the entity level, where a predicted entity counts as correct only if both its boundaries and its type match the reference exactly; accuracy is computed per token:

- **Precision:** Proportion of predicted entities that exactly match a reference entity.
- **Recall:** Proportion of reference entities that were predicted exactly.
- **F1-Score:** Harmonic mean of precision and recall.
- **Accuracy:** Proportion of correctly classified tokens (including non-entity tokens).
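
The difference between entity-level and token-level scoring is easy to see on a small example. The sketch below is a simplified, single-sequence re-implementation written for illustration; it is not `seqeval` itself, and the function names are hypothetical. Here a prediction that truncates one entity by a single token costs a full entity in precision and recall, but only one token in accuracy.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans encoded by an IOB2 tag sequence."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            spans.add((current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return spans


def entity_scores(gold, pred):
    """Entity-level precision and recall, plus token-level accuracy."""
    gold_spans, pred_spans = extract_entities(gold), extract_entities(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return precision, recall, accuracy


gold = ["B-NOMBRE", "I-NOMBRE", "O", "B-MUNICIPIO"]
pred = ["B-NOMBRE", "O", "O", "B-MUNICIPIO"]
print(entity_scores(gold, pred))  # (0.5, 0.5, 0.75)
```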

**Test Set Performance:**

- Precision: 0.8948
- Recall: 0.9052
- F1-Score: 0.9000
- Accuracy: 0.9857

### Example Inference

Below are example outputs from the model using the `pipeline` for NER:

**Input Text 1:**\
"Se informa que el asunto principal es la Factura #REF123. Contactar a Juan Pérez en la dirección Calle Falsa 123, Bogotá. Teléfono 555-9876 o al mail [email protected]. El documento asociado es el ID-98765."

**Output:**

- Entidad: "Calle Falsa 123, Bogotá" → Tipo: DIRECCION (Confianza: \~0.99)
- Entidad: "555-9876" → Tipo: TELEFONO (Confianza: \~0.98)
- Entidad: "[email protected]" → Tipo: MAIL (Confianza: \~0.99)
- Entidad: "Juan Pérez" → Tipo: NOMBRE (Confianza: \~0.99)
- Entidad: "ID-98765" → Tipo: DOCUMENTO (Confianza: \~0.97)
- Entidad: "#REF123" → Tipo: REFERENCIA (Confianza: \~0.98)

**Input Text 2:**\
"Referencia: EXP-002. Municipio de Chía, departamento Cundinamarca. Necesitamos hablar sobre el pago pendiente. Email de contacto: [email protected]. Tel: 3001234567"

**Output:**

- Entidad: "EXP-002" → Tipo: REFERENCIA (Confianza: \~0.98)
- Entidad: "Chía" → Tipo: MUNICIPIO (Confianza: \~0.99)
- Entidad: "Cundinamarca" → Tipo: DEPARTAMENTO (Confianza: \~0.99)
- Entidad: "[email protected]" → Tipo: MAIL (Confianza: \~0.99)
- Entidad: "3001234567" → Tipo: TELEFONO (Confianza: \~0.98)

## Usage

### Using the Model with Hugging Face Transformers

To use the model for inference, load it with the `transformers` library and create a `pipeline` for NER:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the fine-tuned model and its tokenizer
model = AutoModelForTokenClassification.from_pretrained("excribe/ner_sgd_roberta")
tokenizer = AutoTokenizer.from_pretrained("excribe/ner_sgd_roberta")

# Create the NER pipeline; "simple" aggregation merges sub-tokens into entity spans
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
    device=0 if torch.cuda.is_available() else -1,  # first GPU if available, else CPU
)

# Example text
text = "Contactar a Juan Pérez en Calle Falsa 123, Bogotá. Teléfono 555-9876."

# Perform inference and print each detected entity
entities = ner_pipeline(text)
for entity in entities:
    print(f"Entidad: {entity['word']} → Tipo: {entity['entity_group']} (Confianza: {entity['score']:.4f})")
```
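
Downstream code often needs the detected entities grouped by type rather than as a flat list. The sketch below post-processes a list shaped like the pipeline's aggregated output, using only the `entity_group`, `word`, and `score` keys. The helper name `group_entities`, the `min_score` threshold, and the sample records are illustrative assumptions; the sample is hand-written mock data, not real model output.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.5):
    """Group pipeline results by entity type, dropping low-confidence spans."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            # Strip whitespace, since decoded spans may carry a leading space.
            grouped[entity["entity_group"]].append(entity["word"].strip())
    return dict(grouped)


# Hand-written mock of aggregated pipeline output (other keys omitted).
sample = [
    {"entity_group": "NOMBRE", "word": " Juan Pérez", "score": 0.99},
    {"entity_group": "DIRECCION", "word": " Calle Falsa 123, Bogotá", "score": 0.98},
    {"entity_group": "TELEFONO", "word": " 555-9876", "score": 0.40},
]
print(group_entities(sample))
# {'NOMBRE': ['Juan Pérez'], 'DIRECCION': ['Calle Falsa 123, Bogotá']}
```

The threshold trades recall for precision, so a suitable value depends on how costly a false positive is in the target workflow.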

### Installation Requirements

To run the model, install the required libraries:

```bash
pip install "transformers[torch]" datasets evaluate seqeval accelerate pandas pyarrow
```

### Hardware Requirements

- **Inference:** Can run on CPU or GPU. GPU (e.g., NVIDIA with CUDA) is recommended for faster processing.
- **Training:** GPU with at least 8GB VRAM is recommended for fine-tuning. The model was trained with mixed precision (FP16) to optimize memory usage.

## Limitations

- **Dataset Bias:** The model was trained on administrative texts, so it may not generalize well to other domains (e.g., social media, literature).
- **Entity Overlap:** The preprocessing handles overlapping entities by prioritizing earlier matches, which may lead to missed entities in complex cases.
- **Null Values:** Some entity columns are frequently null (e.g., `telefono` in 10,073 of 27,807 rows, about 36%; `documento` in 6,407, about 23%), so those entities have fewer training examples, which may reduce performance on them.
- **Language:** The model is optimized for Spanish and may not perform well on other languages.
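
To judge which entity types are most affected by missing annotations, the null counts reported under Training Data can be converted into per-column rates. The sketch below uses only figures from this card; the helper name `null_rates` is hypothetical.

```python
TOTAL_ROWS = 27_807
NULL_COUNTS = {
    "direccion": 82,
    "telefono": 10_073,
    "mail": 1_086,
    "nombre": 0,
    "documento": 6_407,
    "referencia": 200,
    "departamento": 0,
    "municipio": 0,
}


def null_rates(counts, total):
    """Return the fraction of rows in which each column is null."""
    return {column: n_null / total for column, n_null in counts.items()}


rates = null_rates(NULL_COUNTS, TOTAL_ROWS)
# Print columns from most to least affected.
for column, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{column:<13}{rate:6.1%}")
```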

## Citation

If you use this model, please cite:

```bibtex
@misc{excribe_ner_sgd_roberta,
  author = {Exscribe},
  title = {NER Model for Spanish Administrative Texts},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/excribe/ner_sgd_roberta}}
}
```

## Contact

For questions or issues, please contact the maintainers via the Hugging Face repository or open an issue.