---
license: cc-by-nc-3.0
base_model: PlanTL-GOB-ES/roberta-base-bne
tags:
- ner
- named-entity-recognition
- spanish
- roberta
- huggingface
- excribe
datasets:
- excribe/ner_sgd_dataset
metrics:
- precision
- recall
- f1
- accuracy
language:
- es
pipeline_tag: token-classification
---

# Model Card for excribe/ner_sgd_roberta

## Model Details

### Model Description

This model is a fine-tuned version of PlanTL-GOB-ES/roberta-base-bne for **Named Entity Recognition (NER)** in Spanish. It is designed to identify entities such as `direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, and `municipio` in texts related to administrative or governmental correspondence. The model uses the BIO (Beginning, Inside, Outside) tagging scheme and was trained on a custom dataset derived from a Parquet file (`final.parquet`).

- **Developed by:** Exscribe
- **Model type:** Token Classification (NER)
- **Language(s):** Spanish (es)
- **License:** CC-BY-NC-3.0
- **Base Model:** PlanTL-GOB-ES/roberta-base-bne
- **Fine-tuned Model Repository:** excribe/ner_sgd_roberta

### Model Architecture

The model is based on the RoBERTa architecture, specifically the `PlanTL-GOB-ES/roberta-base-bne` checkpoint, which is pre-trained on a large corpus of Spanish texts. It has been fine-tuned for token classification with a custom classification head tailored to the defined entity labels.

- **Number of Labels:** 17 (including `O` and BIO tags for 8 entity types: `DIRECCION`, `TELEFONO`, `MAIL`, `NOMBRE`, `DOCUMENTO`, `REFERENCIA`, `DEPARTAMENTO`, `MUNICIPIO`)
- **Label Schema:** BIO (e.g., `B-DIRECCION`, `I-DIRECCION`, `O`)
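
The label inventory is small enough to enumerate. The sketch below reconstructs it in Python; the integer ordering is an assumption, and the authoritative mapping is the `id2label` entry in the model's `config.json`:

```python
# Build the BIO label set for the 8 entity types. The resulting integer
# order is an assumption -- check id2label in the model's config.json
# for the authoritative mapping.
ENTITY_TYPES = [
    "DIRECCION", "TELEFONO", "MAIL", "NOMBRE",
    "DOCUMENTO", "REFERENCIA", "DEPARTAMENTO", "MUNICIPIO",
]
labels = ["O"] + [f"{prefix}-{ent}" for ent in ENTITY_TYPES for prefix in ("B", "I")]
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in enumerate(labels)}
print(len(labels))  # 17
```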

## Training Details

### Training Data

The model was trained on a custom dataset derived from a Parquet file (`final.parquet`) containing administrative texts. The dataset includes:

- **Number of Rows:** 27,807
- **Number of Columns:** 32
- **Key Columns Used for NER:**
  - `texto_entrada` (input text)
  - Entity columns: `direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, `municipio`
- **Null Values per Entity Column:**
  - `direccion`: 82
  - `telefono`: 10,073
  - `mail`: 1,086
  - `nombre`: 0
  - `documento`: 6,407
  - `referencia`: 200
  - `departamento`: 0
  - `municipio`: 0
- **Dataset Description:** The dataset contains administrative correspondence data with fields like case IDs (`radicado`), dates (`fecha_radicacion`), document paths, and text inputs (`texto_entrada`). The entity columns were used to generate BIO tags for NER training.

The dataset was preprocessed to convert raw text and entity annotations into BIO format, tokenized using the `PlanTL-GOB-ES/roberta-base-bne` tokenizer, and split into training (81%), validation (9%), and test (10%) sets.
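
The step from column-level annotations to BIO tags can be sketched as follows. The whitespace tokenization, the `tag_bio` helper, and the first-match rule are illustrative assumptions about the preprocessing, not the published code:

```python
def tag_bio(text, entities):
    """Assign BIO tags to whitespace tokens by matching each non-null
    entity value against the text. Earlier matches take priority and
    already-tagged tokens are never overwritten, mirroring the overlap
    rule noted under Limitations. Illustrative sketch only."""
    tokens = text.split()
    tags = ["O"] * len(tokens)
    for ent_type, value in entities.items():
        if not value:
            continue  # skip null entity columns
        ent_tokens = value.split()
        n = len(ent_tokens)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == ent_tokens and all(t == "O" for t in tags[i:i + n]):
                tags[i] = "B-" + ent_type.upper()
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + ent_type.upper()
                break  # tag only the first occurrence
    return tokens, tags

tokens, tags = tag_bio(
    "Contactar a Juan Pérez en Chía",
    {"nombre": "Juan Pérez", "municipio": "Chía", "telefono": None},
)
# tags == ["O", "O", "B-NOMBRE", "I-NOMBRE", "O", "B-MUNICIPIO"]
```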

### Training Procedure

The model was fine-tuned using the Hugging Face `transformers` library with the following configuration:

- **Training Arguments:**
  - Epochs: 3
  - Learning Rate: 2e-5
  - Batch Size: 8 (per device)
  - Weight Decay: 0.01
  - Evaluation Strategy: Per epoch
  - Save Strategy: Per epoch
  - Load Best Model at End: True (based on F1 score)
  - Optimizer: AdamW
  - Precision: Mixed precision (FP16) on GPU
  - Seed: 42
- **Hardware:** GPU (CUDA-enabled, if available) or CPU
- **Libraries Used:**
  - `transformers`
  - `datasets`
  - `evaluate`
  - `seqeval`
  - `pandas`
  - `pyarrow`
  - `torch`
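
The hyperparameters above map onto `TrainingArguments` roughly as in this sketch; it is not the authors' script, and `output_dir` and the best-model metric name are assumptions:

```python
from transformers import TrainingArguments

# Sketch of the configuration listed above. "ner_sgd_roberta" as
# output_dir is an assumption, not the authors' actual path.
training_args = TrainingArguments(
    output_dir="ner_sgd_roberta",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    eval_strategy="epoch",        # `evaluation_strategy` on older versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",   # select the checkpoint by F1
    fp16=True,                    # mixed precision on GPU
    seed=42,
)
```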

The training process included:

1. Loading and preprocessing the Parquet dataset.
2. Converting text and entity annotations to BIO format.
3. Tokenizing and aligning labels with sub-tokens.
4. Fine-tuning the model with a custom classification head.
5. Evaluating on the validation set after each epoch.
6. Saving the best model based on the F1 score.
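
Step 3 follows the standard Hugging Face recipe for token classification: special tokens are labeled -100 (ignored by the loss), the first sub-token of each word keeps the word's label, and later sub-tokens are masked. A sketch over a precomputed `word_ids` sequence (the real code would obtain it from the fast tokenizer's `word_ids()` method):

```python
def align_labels_with_tokens(word_labels, word_ids):
    """Map word-level label ids onto sub-token positions.
    word_ids mimics tokenizer(..., is_split_into_words=True).word_ids():
    None for special tokens, otherwise the index of the source word."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)                  # special tokens: ignored by the loss
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first sub-token keeps the label
        else:
            aligned.append(-100)                  # later sub-tokens are masked
        previous = word_id
    return aligned

# "Pérez" split into two sub-tokens: only the first keeps its label id
labels = align_labels_with_tokens(
    [0, 3, 4],                 # e.g. O, B-NOMBRE, I-NOMBRE as label ids
    [None, 0, 1, 2, 2, None],  # word_ids for [CLS] w0 w1 w2a w2b [SEP]
)
# labels == [-100, 0, 3, 4, -100, -100]
```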

### Training Metrics

The model was evaluated on the test set after training, achieving the following metrics:

- **Precision:** 0.8948
- **Recall:** 0.9052
- **F1-Score:** 0.9000
- **Accuracy:** 0.9857
- **Evaluation Loss:** 0.0455
- **Runtime:** 12.16 seconds
- **Samples per Second:** 228.612
- **Steps per Second:** 28.607

## Evaluation

### Evaluation Metrics

The model was evaluated using the `seqeval` metric in strict IOB2 mode, which computes:

- **Precision:** Proportion of predicted entities that exactly match a reference entity.
- **Recall:** Proportion of reference entities that were correctly predicted.
- **F1-Score:** Harmonic mean of precision and recall.
- **Accuracy:** Proportion of correctly classified tokens (including non-entity tokens).
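
In strict entity-level evaluation a predicted entity counts as correct only when both its type and its exact span match the reference; `seqeval` implements this. The following pure-Python sketch is a simplification of seqeval's strict IOB2 mode, for intuition only:

```python
def extract_entities(tags):
    """Collect (type, start, end) spans from an IOB2 tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        inside = tag.startswith("I-") and tag[2:] == etype
        if not inside and etype is not None:
            spans.append((etype, start, i))  # close the open span
            etype = None
        if tag.startswith("B-"):
            etype, start = tag[2:], i        # open a new span
    if etype is not None:
        spans.append((etype, start, len(tags)))
    return spans

def entity_f1(true_tags, pred_tags):
    """Entity-level precision/recall/F1: a prediction counts only if
    both its type and its exact span appear in the reference."""
    gold = set(extract_entities(true_tags))
    pred = set(extract_entities(pred_tags))
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```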

**Test Set Performance:**

- Precision: 0.8948
- Recall: 0.9052
- F1-Score: 0.9000
- Accuracy: 0.9857

### Example Inference

Below are example outputs from the model using the `pipeline` for NER:

**Input Text 1:**
"Se informa que el asunto principal es la Factura #REF123. Contactar a Juan Pérez en la dirección Calle Falsa 123, Bogotá. Teléfono 555-9876 o al mail [email protected]. El documento asociado es el ID-98765."

**Output:**

- Entidad: "Calle Falsa 123, Bogotá" → Tipo: DIRECCION (Confianza: ~0.99)
- Entidad: "555-9876" → Tipo: TELEFONO (Confianza: ~0.98)
- Entidad: "[email protected]" → Tipo: MAIL (Confianza: ~0.99)
- Entidad: "Juan Pérez" → Tipo: NOMBRE (Confianza: ~0.99)
- Entidad: "ID-98765" → Tipo: DOCUMENTO (Confianza: ~0.97)
- Entidad: "#REF123" → Tipo: REFERENCIA (Confianza: ~0.98)

**Input Text 2:**
"Referencia: EXP-002. Municipio de Chía, departamento Cundinamarca. Necesitamos hablar sobre el pago pendiente. Email de contacto: [email protected]. Tel: 3001234567"

**Output:**

- Entidad: "EXP-002" → Tipo: REFERENCIA (Confianza: ~0.98)
- Entidad: "Chía" → Tipo: MUNICIPIO (Confianza: ~0.99)
- Entidad: "Cundinamarca" → Tipo: DEPARTAMENTO (Confianza: ~0.99)
- Entidad: "[email protected]" → Tipo: MAIL (Confianza: ~0.99)
- Entidad: "3001234567" → Tipo: TELEFONO (Confianza: ~0.98)

## Usage

### Using the Model with Hugging Face Transformers

To use the model for inference, load it with the `transformers` library and create a `pipeline` for NER:

```python
import torch  # needed for the CUDA availability check below
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Load the model and tokenizer
model = AutoModelForTokenClassification.from_pretrained("excribe/ner_sgd_roberta")
tokenizer = AutoTokenizer.from_pretrained("excribe/ner_sgd_roberta")

# Create the NER pipeline
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
    device=0 if torch.cuda.is_available() else -1,
)

# Example text
text = "Contactar a Juan Pérez en Calle Falsa 123, Bogotá. Teléfono 555-9876."

# Run inference
entities = ner_pipeline(text)
for entity in entities:
    print(f"Entidad: {entity['word']} → Tipo: {entity['entity_group']} (Confianza: {entity['score']:.4f})")
```

### Installation Requirements

To run the model, install the required libraries:

```bash
pip install transformers[torch] datasets evaluate seqeval accelerate pandas pyarrow
```

### Hardware Requirements

- **Inference:** Can run on CPU or GPU. A GPU (e.g., NVIDIA with CUDA) is recommended for faster processing.
- **Training:** A GPU with at least 8 GB VRAM is recommended for fine-tuning. The model was trained with mixed precision (FP16) to optimize memory usage.

## Limitations

- **Dataset Bias:** The model was trained on administrative texts, so it may not generalize well to other domains (e.g., social media, literature).
- **Entity Overlap:** The preprocessing handles overlapping entities by prioritizing earlier matches, which may lead to missed entities in complex cases.
- **Null Values:** High null rates in some entity columns (e.g., `telefono`: 10,073 nulls) may reduce performance for those entities.
- **Language:** The model is optimized for Spanish and may not perform well on other languages.

## Citation

If you use this model, please cite:

```bibtex
@misc{excribe_ner_sgd_roberta,
  author       = {Exscribe},
  title        = {NER Model for Spanish Administrative Texts},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/excribe/ner_sgd_roberta}}
}
```

## Contact

For questions or issues, please contact the maintainers via the Hugging Face repository or open an issue.