excribe committed
Commit 8793054 · verified · Parent: 1c8cb15

Update README.md (#1)


- Update README.md (27607bdd80d0bbfbffb8ce25ddf35c397dc5df12)

Files changed (1): README.md (+214 -0)
README.md CHANGED
---
license: cc-by-nc-3.0
language:
- es
tags:
- ner
- spanish
- transformers
- roberta
- token-classification
- document-management
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: ner_sgd_bertina_roberta
  results:
  - task:
      type: named-entity-recognition
      name: Named Entity Recognition
    dataset:
      name: Custom SGD Dataset
      type: custom
      args: final.parquet
    metrics:
    - name: Precision
      type: precision
      value: 0.9031
    - name: Recall
      type: recall
      value: 0.9149
    - name: F1
      type: f1
      value: 0.909
    - name: Accuracy
      type: accuracy
      value: 0.9869
base_model:
- bertin-project/bertin-roberta-base-spanish
pipeline_tag: token-classification
---

# NER SGD Bertina RoBERTa

## Model Overview

This is a fine-tuned Named Entity Recognition (NER) model for extracting specific entities from Spanish text, designed for administrative and document management contexts. It is based on the `bertin-project/bertin-roberta-base-spanish` model and trained to identify entities such as `direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, and `municipio` using the BIO tagging scheme (`B-TAG`, `I-TAG`, `O`).

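For illustration, under this scheme a name and a municipality in a short sentence would be tagged as follows (a hypothetical word-level tokenization; the model itself operates on sub-tokens):

```
Contactar   O
a           O
Juan        B-NOMBRE
Pérez       I-NOMBRE
en          O
Bogotá      B-MUNICIPIO
```
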
The model was trained on a custom dataset stored in a Parquet file (`final.parquet`) containing Spanish text and labeled entities, likely from a document management system (SGD). Training and inference use the Hugging Face `transformers` library.

### Model Details

- **Base Model**: `bertin-project/bertin-roberta-base-spanish`
- **Task**: Named Entity Recognition (NER)
- **Language**: Spanish
- **Labels**:
  - `O`: Outside of an entity
  - `B-DIRECCION`, `I-DIRECCION`: Address
  - `B-TELEFONO`, `I-TELEFONO`: Phone number
  - `B-MAIL`, `I-MAIL`: Email
  - `B-NOMBRE`, `I-NOMBRE`: Name
  - `B-DOCUMENTO`, `I-DOCUMENTO`: Document ID
  - `B-REFERENCIA`, `I-REFERENCIA`: Reference
  - `B-DEPARTAMENTO`, `I-DEPARTAMENTO`: Department
  - `B-MUNICIPIO`, `I-MUNICIPIO`: Municipality
- **Training Framework**: Hugging Face `transformers`, `datasets`, `evaluate`, `seqeval`
- **Training Hardware**: GPU (if available) or CPU

## Intended Use

This model is designed for extracting structured information from unstructured Spanish text in administrative or document management contexts, such as extracting contact details or references from official correspondence. It is intended for **non-commercial use only**, as per the CC-BY-NC-3.0 license.

### Example Usage

Below is an example of how to use the model with the Hugging Face `pipeline` for NER:

```python
from transformers import pipeline

# Load the model and tokenizer
ner_pipeline = pipeline("ner", model="excribe/ner_sgd_bertina_roberta", aggregation_strategy="simple")

# Example text
text = "Contactar a Juan Pérez en Calle Falsa 123, Bogotá. Teléfono 555-9876 o al mail [email protected]"

# Run inference
entities = ner_pipeline(text)

# Print results
for entity in entities:
    print(f"Entity: {entity['word']} | Type: {entity['entity_group']} | Score: {entity['score']:.4f}")
```

**Example Output**:

```
Entity: Juan Pérez | Type: NOMBRE | Score: 0.9876
Entity: Calle Falsa 123 | Type: DIRECCION | Score: 0.9789
Entity: Bogotá | Type: MUNICIPIO | Score: 0.9654
Entity: 555-9876 | Type: TELEFONO | Score: 0.9921
Entity: [email protected] | Type: MAIL | Score: 0.9890
```

## Training Data

The model was trained on a custom dataset (`final.parquet`) containing **27,807 rows** and **32 columns**, likely sourced from a document management system (SGD). The dataset includes Spanish texts in the `texto_entrada` column and labeled entities in the following columns: `direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, and `municipio`. Other columns, such as `radicado`, `fecha_radicacion`, and `sgd_tpr_descrip`, suggest the data is related to administrative or official documents.

### Dataset Details

- **Number of Rows**: 27,807
- **Relevant Columns for NER**:
  - `texto_entrada`: Source text
  - Entity columns: `direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, `municipio`
- **Missing Values**:
  - `direccion`: 82 missing
  - `telefono`: 10,073 missing
  - `mail`: 1,086 missing
  - `nombre`: 0 missing
  - `documento`: 6,407 missing
  - `referencia`: 200 missing
  - `departamento`: 0 missing
  - `municipio`: 0 missing
- **Preprocessing**:
  - A custom function (`convert_row_to_bio_optimized`) converted entity columns into BIO tags, handling overlaps by prioritizing earlier entities (see the sketch after this list).
  - The dataset was tokenized using the base model's tokenizer, with labels aligned to sub-tokens.
  - Split: Training (~80%), Validation (~10%), Test (~10%).

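The `convert_row_to_bio_optimized` implementation itself is not published with this card; the sketch below shows one way such a row-to-BIO conversion can work, using simplified whitespace tokenization and illustrative names (`convert_row_to_bio`, `entities`):

```python
import re

def convert_row_to_bio(text, entities):
    """Convert one dataset row to word-level BIO tags.

    `entities` maps a label (e.g. "NOMBRE") to the entity string found in
    `text`; overlaps are resolved in favor of the earliest match.
    """
    tokens = text.split()
    tags = ["O"] * len(tokens)

    # Character offsets of each whitespace token.
    offsets, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)
        offsets.append((start, start + len(tok)))
        pos = start + len(tok)

    # Locate each entity's first occurrence, then process spans left to right.
    spans = []
    for label, value in entities.items():
        if not value:
            continue  # missing entity column in this row
        match = re.search(re.escape(value), text)
        if match:
            spans.append((match.start(), match.end(), label))
    spans.sort()

    taken_until = -1
    for start, end, label in spans:
        if start < taken_until:
            continue  # overlap: the earlier entity wins
        first = True
        for i, (ts, te) in enumerate(offsets):
            if ts >= start and te <= end:
                tags[i] = ("B-" if first else "I-") + label
                first = False
        taken_until = end
    return tokens, tags

tokens, tags = convert_row_to_bio(
    "Contactar a Juan Pérez en Bogotá",
    {"NOMBRE": "Juan Pérez", "MUNICIPIO": "Bogotá"},
)
# tags -> ['O', 'O', 'B-NOMBRE', 'I-NOMBRE', 'O', 'B-MUNICIPIO']
```
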
## Training Procedure

The model was fine-tuned using the Hugging Face `Trainer` API with the following hyperparameters (a configuration sketch follows the list):

- **Epochs**: 3
- **Learning Rate**: 2e-5
- **Batch Size**: 8 (per device)
- **Weight Decay**: 0.01
- **Evaluation Strategy**: Per epoch
- **Optimizer**: AdamW (default in `transformers`)
- **Mixed Precision**: Enabled if GPU is available
- **Metrics**: Precision, Recall, F1, Accuracy (via `seqeval`)

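The original training script is not included in the repository, so the following `TrainingArguments` setup is only a sketch consistent with the hyperparameters above; the output path is a placeholder, and older `transformers` versions spell `eval_strategy` as `evaluation_strategy`:

```python
import torch
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./ner_sgd_bertina_roberta",  # placeholder path
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    eval_strategy="epoch",           # evaluate once per epoch
    save_strategy="epoch",
    load_best_model_at_end=True,     # keep the checkpoint with the best...
    metric_for_best_model="f1",      # ...F1, as described in the steps below
    fp16=torch.cuda.is_available(),  # mixed precision when a GPU is present
)
```
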
The training process included:

1. Loading the dataset from Parquet and converting it to a Hugging Face `Dataset`.
2. Generating BIO tags for each text.
3. Tokenizing and aligning labels with the model's tokenizer (sketched below).
4. Fine-tuning the model with the `Trainer` API.
5. Evaluating on the validation set and saving the best model based on F1 score.
6. Final evaluation on the test set.

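Step 3 follows the standard `transformers` pattern for token classification: word-level BIO tags are projected onto sub-tokens via `word_ids()`, with special tokens masked as `-100` so the loss ignores them. A minimal sketch (the `label2id` mapping is assumed to cover the labels listed above; whether non-first sub-tokens keep their word's tag or are masked is an implementation choice, and this sketch masks them):

```python
from transformers import AutoTokenizer

# add_prefix_space=True is required for RoBERTa-style tokenizers
# when passing pre-split words.
tokenizer = AutoTokenizer.from_pretrained(
    "bertin-project/bertin-roberta-base-spanish", add_prefix_space=True
)

def tokenize_and_align_labels(words, word_tags, label2id):
    """Tokenize pre-split words and align word-level BIO tags to sub-tokens."""
    encoding = tokenizer(words, is_split_into_words=True, truncation=True)
    labels, previous_word = [], None
    for word_idx in encoding.word_ids():
        if word_idx is None:
            labels.append(-100)  # special tokens: ignored by the loss
        elif word_idx != previous_word:
            labels.append(label2id[word_tags[word_idx]])  # first sub-token
        else:
            labels.append(-100)  # remaining sub-tokens of the same word
        previous_word = word_idx
    encoding["labels"] = labels
    return encoding
```
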
### Evaluation Metrics

The model was evaluated on the test set after 3 epochs, achieving the following metrics:

- **Precision**: 0.9031
- **Recall**: 0.9149
- **F1-Score**: 0.9090
- **Accuracy**: 0.9869
- **Loss**: 0.0465
- **Runtime**: 12.22 seconds
- **Samples per Second**: 227.546
- **Steps per Second**: 28.474

These metrics were computed using the `seqeval` library with the IOB2 scheme in strict mode, ensuring accurate entity boundary and type matching.

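A `compute_metrics` function consistent with this setup, using the `evaluate` wrapper around `seqeval`, might look like the sketch below; the `id2label` mapping shown is abbreviated (the full one lives in `model.config.id2label`):

```python
import evaluate
import numpy as np

seqeval = evaluate.load("seqeval")

# Abbreviated mapping; the trained model stores the complete one.
id2label = {0: "O", 1: "B-NOMBRE", 2: "I-NOMBRE"}

def compute_metrics(eval_pred):
    """Drop -100 positions, map ids to tag strings, and score with seqeval."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    true_tags, pred_tags = [], []
    for pred_row, label_row in zip(predictions, labels):
        true_tags.append([id2label[l] for l in label_row if l != -100])
        pred_tags.append(
            [id2label[p] for p, l in zip(pred_row, label_row) if l != -100]
        )

    results = seqeval.compute(
        predictions=pred_tags,
        references=true_tags,
        scheme="IOB2",
        mode="strict",
    )
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
```
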
## How to Use

To use the model, install the required dependencies and load it with the Hugging Face `transformers` library:

```bash
pip install transformers torch
```

Then, use the `pipeline` as shown in the example above, or load the model manually:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("excribe/ner_sgd_bertina_roberta")
tokenizer = AutoTokenizer.from_pretrained("excribe/ner_sgd_bertina_roberta")

# Tokenize input text
inputs = tokenizer("Calle Falsa 123, Bogotá", return_tensors="pt")

# Get predictions
outputs = model(**inputs)
logits = outputs.logits
predictions = logits.argmax(dim=-1)
```

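The `predictions` tensor above holds one label id per sub-token; to read them as BIO tags, map the ids through the label table stored in the model config:

```python
# Map predicted ids back to tag strings, one per sub-token.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
tags = [model.config.id2label[i] for i in predictions[0].tolist()]
for token, tag in zip(tokens, tags):
    print(token, tag)
```
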
## Limitations

- The model is trained on a dataset from an administrative document management context and may not generalize well to other domains (e.g., social media or informal texts).
- Overlapping entities are resolved by prioritizing earlier matches, which may miss some valid entities.
- Missing values in entity columns (e.g., 10,073 missing `telefono` values) may reduce performance for certain entity types.
- The model is optimized for Spanish and may not perform well on other languages.
- Due to the CC-BY-NC-3.0 license, the model cannot be used for commercial purposes.

## Ethical Considerations

- **Bias**: The model may reflect biases in the training data, such as underrepresentation of certain entity types (e.g., `telefono` has many missing values) or overrepresentation of formal administrative language.
- **Privacy**: The model extracts sensitive entities like names, addresses, and phone numbers. Ensure input texts do not contain personal data unless processing is authorized.
- **Non-Commercial Use**: The model is licensed for non-commercial use only, as per CC-BY-NC-3.0.

## License

This model is licensed under the [Creative Commons Attribution-NonCommercial 3.0 Unported License (CC-BY-NC-3.0)](https://creativecommons.org/licenses/by-nc/3.0/). You are free to share and adapt the model for non-commercial purposes, provided you give appropriate credit to the author.

## Contact

For issues or questions, please contact the model author via the Hugging Face repository or open an issue.

## Acknowledgments

This model was trained using the Hugging Face ecosystem (`transformers`, `datasets`, `evaluate`, `seqeval`). Thanks to the `bertin-project` team for providing the base model `bertin-roberta-base-spanish`.