---
license: cc-by-nc-3.0
base_model: PlanTL-GOB-ES/roberta-base-bne
tags:
- ner
- named-entity-recognition
- spanish
- roberta
- huggingface
- excribe
datasets:
- excribe/ner_sgd_dataset
metrics:
- precision
- recall
- f1
- accuracy
language:
- es
pipeline_tag: token-classification
---

# Model Card for excribe/ner_sgd_roberta

## Model Details

### Model Description

This model is a fine-tuned version of PlanTL-GOB-ES/roberta-base-bne for **Named Entity Recognition (NER)** in Spanish. It is designed to identify entities such as `direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, and `municipio` in texts related to administrative or governmental correspondence. The model uses the BIO (Beginning, Inside, Outside) tagging scheme and was trained on a custom dataset derived from a Parquet file (`final.parquet`).

- **Developed by:** Exscribe
- **Model type:** Token Classification (NER)
- **Language(s):** Spanish (es)
- **License:** CC-BY-NC-3.0
- **Base Model:** PlanTL-GOB-ES/roberta-base-bne
- **Finetuned Model Repository:** excribe/ner_sgd_roberta

### Model Architecture

The model is based on the RoBERTa architecture, specifically the `PlanTL-GOB-ES/roberta-base-bne` checkpoint, which is pre-trained on a large corpus of Spanish texts. It has been fine-tuned for token classification with a custom classification head tailored to the defined entity labels.

- **Number of Labels:** 17 (including `O` and BIO tags for 8 entity types: `DIRECCION`, `TELEFONO`, `MAIL`, `NOMBRE`, `DOCUMENTO`, `REFERENCIA`, `DEPARTAMENTO`, `MUNICIPIO`)
- **Label Schema:** BIO (e.g., `B-DIRECCION`, `I-DIRECCION`, `O`)

## Training Details

### Training Data

The model was trained on a custom dataset derived from a Parquet file (`final.parquet`) containing administrative texts.
The dataset includes:

- **Number of Rows:** 27,807
- **Number of Columns:** 32
- **Key Columns Used for NER:**
  - `texto_entrada` (input text)
  - Entity columns: `direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, `municipio`
- **Null Values per Entity Column:**
  - `direccion`: 82
  - `telefono`: 10,073
  - `mail`: 1,086
  - `nombre`: 0
  - `documento`: 6,407
  - `referencia`: 200
  - `departamento`: 0
  - `municipio`: 0
- **Dataset Description:** The dataset contains administrative correspondence data with fields like case IDs (`radicado`), dates (`fecha_radicacion`), document paths, and text inputs (`texto_entrada`). The entity columns were used to generate BIO tags for NER training.

The dataset was preprocessed to convert raw text and entity annotations into BIO format, tokenized using the `PlanTL-GOB-ES/roberta-base-bne` tokenizer, and split into training (81%), validation (9%), and test (10%) sets.

### Training Procedure

The model was fine-tuned using the Hugging Face `transformers` library with the following configuration:

- **Training Arguments:**
  - Epochs: 3
  - Learning Rate: 2e-5
  - Batch Size: 8 (per device)
  - Weight Decay: 0.01
  - Evaluation Strategy: Per epoch
  - Save Strategy: Per epoch
  - Load Best Model at End: True (based on F1 score)
  - Optimizer: AdamW
  - Precision: Mixed precision (FP16) on GPU
  - Seed: 42
- **Hardware:** GPU (CUDA-enabled, if available) or CPU
- **Libraries Used:**
  - `transformers`
  - `datasets`
  - `evaluate`
  - `seqeval`
  - `pandas`
  - `pyarrow`
  - `torch`

The training process included:

1. Loading and preprocessing the Parquet dataset.
2. Converting text and entity annotations to BIO format.
3. Tokenizing and aligning labels with sub-tokens.
4. Fine-tuning the model with a custom classification head.
5. Evaluating on the validation set after each epoch.
6. Saving the best model based on the F1 score.
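
The conversion of entity annotations to BIO format (step 2 of the training process) can be sketched as below. This is an illustrative sketch, not the actual preprocessing code, which this card does not include. It assumes whitespace tokenization, exact substring matching of each entity column value against the input text, and that overlapping matches are resolved in favor of the earlier one. The names `ENTITY_TYPES`, `build_label_list`, and `bio_tag`, as well as the label ordering, are hypothetical.

```python
import re

# Entity columns named in this card; the ordering here is an assumption.
ENTITY_TYPES = ["direccion", "telefono", "mail", "nombre",
                "documento", "referencia", "departamento", "municipio"]


def build_label_list(entity_types):
    """Return 'O' plus a B-/I- pair per entity type."""
    labels = ["O"]
    for name in entity_types:
        labels += [f"B-{name.upper()}", f"I-{name.upper()}"]
    return labels


def bio_tag(text, entities):
    """Tag whitespace tokens of `text` with BIO labels.

    `entities` maps an entity column name to its annotated value
    (or None when the column is null for this row).
    """
    # Whitespace tokens with their character offsets.
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)

    # Locate each non-null entity value by exact substring match.
    spans = []
    for name, value in entities.items():
        if not value:
            continue
        start = text.find(value)
        if start != -1:
            spans.append((start, start + len(value), name.upper()))

    # Earlier matches win; a span overlapping an accepted one is skipped.
    spans.sort()
    last_end = 0
    for start, end, label in spans:
        if start < last_end:
            continue
        first = True
        for i, (_, t_start, t_end) in enumerate(tokens):
            if t_start < end and t_end > start and tags[i] == "O":
                tags[i] = ("B-" if first else "I-") + label
                first = False
        last_end = end

    return [tok for tok, _, _ in tokens], tags


tokens, tags = bio_tag(
    "Contactar a Juan Pérez en Calle Falsa 123, Bogotá.",
    {"nombre": "Juan Pérez", "direccion": "Calle Falsa 123", "telefono": None},
)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

With 8 entity types, `build_label_list(ENTITY_TYPES)` yields 17 labels, which matches the label count stated for the classification head. Aligning these word-level tags with sub-tokens (step 3) is a separate step that this sketch does not cover.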

### Training Metrics

The model was evaluated on the test set after training, achieving the following metrics:

- **Precision:** 0.8948
- **Recall:** 0.9052
- **F1-Score:** 0.9000
- **Accuracy:** 0.9857
- **Evaluation Loss:** 0.0455
- **Runtime:** 12.16 seconds
- **Samples per Second:** 228.612
- **Steps per Second:** 28.607

## Evaluation

### Evaluation Metrics

The model was evaluated using the `seqeval` metric in strict IOB2 mode, which scores precision, recall, and F1 at the entity level and accuracy at the token level:

- **Precision:** Proportion of predicted entities that exactly match a reference entity (same span and type).
- **Recall:** Proportion of reference entities that the model predicts exactly.
- **F1-Score:** Harmonic mean of precision and recall.
- **Accuracy:** Proportion of correctly classified tokens (including non-entity tokens).

**Test Set Performance:**

- Precision: 0.8948
- Recall: 0.9052
- F1-Score: 0.9000
- Accuracy: 0.9857

### Example Inference

Below are example outputs from the model using the `pipeline` for NER:

**Input Text 1:**\
"Se informa que el asunto principal es la Factura #REF123. Contactar a Juan Pérez en la dirección Calle Falsa 123, Bogotá. Teléfono 555-9876 o al mail juan.p@correo.co. El documento asociado es el ID-98765."

**Output:**

- Entidad: "Calle Falsa 123, Bogotá" → Tipo: DIRECCION (Confianza: \~0.99)
- Entidad: "555-9876" → Tipo: TELEFONO (Confianza: \~0.98)
- Entidad: "juan.p@correo.co" → Tipo: MAIL (Confianza: \~0.99)
- Entidad: "Juan Pérez" → Tipo: NOMBRE (Confianza: \~0.99)
- Entidad: "ID-98765" → Tipo: DOCUMENTO (Confianza: \~0.97)
- Entidad: "#REF123" → Tipo: REFERENCIA (Confianza: \~0.98)

**Input Text 2:**\
"Referencia: EXP-002. Municipio de Chía, departamento Cundinamarca. Necesitamos hablar sobre el pago pendiente. Email de contacto: info@empresa.com.
Tel: 3001234567"

**Output:**

- Entidad: "EXP-002" → Tipo: REFERENCIA (Confianza: \~0.98)
- Entidad: "Chía" → Tipo: MUNICIPIO (Confianza: \~0.99)
- Entidad: "Cundinamarca" → Tipo: DEPARTAMENTO (Confianza: \~0.99)
- Entidad: "info@empresa.com" → Tipo: MAIL (Confianza: \~0.99)
- Entidad: "3001234567" → Tipo: TELEFONO (Confianza: \~0.98)

## Usage

### Using the Model with Hugging Face Transformers

To use the model for inference, load it with the `transformers` library and create a `pipeline` for NER:

```python
import torch
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Load the model and tokenizer
model = AutoModelForTokenClassification.from_pretrained("excribe/ner_sgd_roberta")
tokenizer = AutoTokenizer.from_pretrained("excribe/ner_sgd_roberta")

# Create the NER pipeline; use the first GPU if available, otherwise the CPU
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # merge sub-tokens into whole entities
    device=0 if torch.cuda.is_available() else -1
)

# Example text
text = "Contactar a Juan Pérez en Calle Falsa 123, Bogotá. Teléfono 555-9876."

# Perform inference and print each detected entity
entities = ner_pipeline(text)
for entity in entities:
    print(f"Entidad: {entity['word']} → Tipo: {entity['entity_group']} (Confianza: {entity['score']:.4f})")
```

### Installation Requirements

To run the model, install the required libraries:

```bash
pip install transformers[torch] datasets evaluate seqeval accelerate pandas pyarrow
```

### Hardware Requirements

- **Inference:** Can run on CPU or GPU. GPU (e.g., NVIDIA with CUDA) is recommended for faster processing.
- **Training:** GPU with at least 8GB VRAM is recommended for fine-tuning. The model was trained with mixed precision (FP16) to optimize memory usage.

## Limitations

- **Dataset Bias:** The model was trained on administrative texts, so it may not generalize well to other domains (e.g., social media, literature).
- **Entity Overlap:** The preprocessing handles overlapping entities by prioritizing earlier matches, which may lead to missed entities in complex cases.
- **Null Values:** High null rates in some entity columns (e.g., `telefono`: 10,073) may reduce performance for those entities.
- **Language:** The model is optimized for Spanish and may not perform well on other languages.

## Citation

If you use this model, please cite:

```bibtex
@misc{excribe_ner_sgd_roberta,
  author = {Exscribe},
  title = {NER Model for Spanish Administrative Texts},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/excribe/ner_sgd_roberta}}
}
```

## Contact

For questions or issues, please contact the maintainers via the Hugging Face repository or open an issue.