---
license: cc-by-nc-3.0
base_model: PlanTL-GOB-ES/roberta-base-bne
tags:
- ner
- named-entity-recognition
- spanish
- roberta
- huggingface
- excribe
datasets:
- excribe/ner_sgd_dataset
metrics:
- precision
- recall
- f1
- accuracy
language:
- es
pipeline_tag: token-classification
---
# Model Card for excribe/ner_sgd_roberta
## Model Details
### Model Description
This model is a fine-tuned version of PlanTL-GOB-ES/roberta-base-bne for **Named Entity Recognition (NER)** in Spanish. It is designed to identify entities such as `direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, and `municipio` in texts related to administrative or governmental correspondence. The model uses the BIO (Beginning, Inside, Outside) tagging scheme and was trained on a custom dataset derived from a Parquet file (`final.parquet`).
- **Developed by:** Exscribe
- **Model type:** Token Classification (NER)
- **Language(s):** Spanish (es)
- **License:** CC-BY-NC-3.0
- **Base Model:** PlanTL-GOB-ES/roberta-base-bne
- **Finetuned Model Repository:** excribe/ner_sgd_roberta
### Model Architecture
The model is based on the RoBERTa architecture, specifically the `PlanTL-GOB-ES/roberta-base-bne` checkpoint, which is pre-trained on a large corpus of Spanish texts. It has been fine-tuned for token classification with a custom classification head tailored to the defined entity labels.
- **Number of Labels:** 17 (including `O` and BIO tags for 8 entity types: `DIRECCION`, `TELEFONO`, `MAIL`, `NOMBRE`, `DOCUMENTO`, `REFERENCIA`, `DEPARTAMENTO`, `MUNICIPIO`)
- **Label Schema:** BIO (e.g., `B-DIRECCION`, `I-DIRECCION`, `O`)
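For reference, a minimal sketch of how the 17-label BIO mapping can be built (the authoritative label order is the one stored in the model's `config.json`; this construction is illustrative):

```python
# "O" plus B-/I- pairs for the 8 entity types = 17 labels in total.
ENTITY_TYPES = [
    "DIRECCION", "TELEFONO", "MAIL", "NOMBRE",
    "DOCUMENTO", "REFERENCIA", "DEPARTAMENTO", "MUNICIPIO",
]
labels = ["O"] + [f"{prefix}-{ent}" for ent in ENTITY_TYPES for prefix in ("B", "I")]
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in id2label.items()}
assert len(labels) == 17
```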
## Training Details
### Training Data
The model was trained on a custom dataset derived from a Parquet file (`final.parquet`) containing administrative texts. The dataset includes:
- **Number of Rows:** 27,807
- **Number of Columns:** 32
- **Key Columns Used for NER:**
- `texto_entrada` (input text)
- Entity columns: `direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, `municipio`
- **Null Values per Entity Column:**
- `direccion`: 82
- `telefono`: 10,073
- `mail`: 1,086
- `nombre`: 0
- `documento`: 6,407
- `referencia`: 200
- `departamento`: 0
- `municipio`: 0
- **Dataset Description:** The dataset contains administrative correspondence data with fields like case IDs (`radicado`), dates (`fecha_radicacion`), document paths, and text inputs (`texto_entrada`). The entity columns were used to generate BIO tags for NER training.
The dataset was preprocessed to convert raw text and entity annotations into BIO format, tokenized using the `PlanTL-GOB-ES/roberta-base-bne` tokenizer, and split into training (81%), validation (9%), and test (10%) sets.
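The exact preprocessing script is not published. A minimal sketch of the loading and splitting steps, assuming the 81%/9%/10% split is produced by two consecutive 90/10 splits:

```python
import pandas as pd
from datasets import Dataset

# Load the annotated administrative texts (file name as described above).
df = pd.read_parquet("final.parquet")

# Keep the input text plus the 8 entity columns used to derive BIO tags.
columns = [
    "texto_entrada", "direccion", "telefono", "mail", "nombre",
    "documento", "referencia", "departamento", "municipio",
]
dataset = Dataset.from_pandas(df[columns])

# 90/10 test split, then 90/10 of the remainder -> 81% train / 9% val / 10% test.
split = dataset.train_test_split(test_size=0.10, seed=42)
train_val = split["train"].train_test_split(test_size=0.10, seed=42)
train_ds, val_ds, test_ds = train_val["train"], train_val["test"], split["test"]
```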
### Training Procedure
The model was fine-tuned using the Hugging Face `transformers` library with the following configuration:
- **Training Arguments:**
- Epochs: 3
- Learning Rate: 2e-5
- Batch Size: 8 (per device)
- Weight Decay: 0.01
- Evaluation Strategy: Per epoch
- Save Strategy: Per epoch
- Load Best Model at End: True (based on F1 score)
- Optimizer: AdamW
- Precision: Mixed precision (FP16) on GPU
- Seed: 42
- **Hardware:** GPU (CUDA-enabled, if available) or CPU
- **Libraries Used:**
- `transformers`
- `datasets`
- `evaluate`
- `seqeval`
- `pandas`
- `pyarrow`
- `torch`
The training process included:
1. Loading and preprocessing the Parquet dataset.
2. Converting text and entity annotations to BIO format.
3. Tokenizing and aligning labels with sub-tokens.
4. Fine-tuning the model with a custom classification head.
5. Evaluating on the validation set after each epoch.
6. Saving the best model based on the F1 score.
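The training script itself is not published. A condensed sketch of steps 3-6 using the configuration listed above, reusing `id2label`/`label2id` and the dataset splits from the earlier sketches, and assuming the BIO-conversion step (step 2) has already added `tokens` and `ner_tags` columns:

```python
import numpy as np
import evaluate
from transformers import (
    AutoTokenizer, AutoModelForTokenClassification,
    DataCollatorForTokenClassification, TrainingArguments, Trainer,
)

tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-base-bne")
model = AutoModelForTokenClassification.from_pretrained(
    "PlanTL-GOB-ES/roberta-base-bne",
    num_labels=17, id2label=id2label, label2id=label2id,
)

def tokenize_and_align_labels(batch):
    # Step 3: tokenize pre-split words; special tokens and continuation
    # sub-tokens get label -100 so the loss ignores them.
    tokenized = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, word_labels in enumerate(batch["ner_tags"]):
        previous, labels = None, []
        for word_id in tokenized.word_ids(batch_index=i):
            labels.append(-100 if word_id is None or word_id == previous
                          else word_labels[word_id])
            previous = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

seqeval_metric = evaluate.load("seqeval")

def compute_metrics(eval_pred):
    # Steps 5-6: entity-level precision/recall/F1 via seqeval, computed on
    # the non-ignored positions mapped back to BIO tag strings.
    logits, label_ids = eval_pred
    predictions = np.argmax(logits, axis=-1)
    true_preds = [[id2label[p] for p, l in zip(pred, lab) if l != -100]
                  for pred, lab in zip(predictions, label_ids)]
    true_labels = [[id2label[l] for p, l in zip(pred, lab) if l != -100]
                   for pred, lab in zip(predictions, label_ids)]
    results = seqeval_metric.compute(predictions=true_preds, references=true_labels)
    return {k: results[f"overall_{k}"] for k in ("precision", "recall", "f1", "accuracy")}

args = TrainingArguments(
    output_dir="ner_sgd_roberta",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    fp16=True,  # mixed precision; requires a CUDA GPU
    seed=42,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds.map(tokenize_and_align_labels, batched=True),
    eval_dataset=val_ds.map(tokenize_and_align_labels, batched=True),
    data_collator=DataCollatorForTokenClassification(tokenizer),
    compute_metrics=compute_metrics,
)
trainer.train()
```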
### Test Results
After training, the model was evaluated on the held-out test set, achieving the following metrics:
- **Precision:** 0.8948
- **Recall:** 0.9052
- **F1-Score:** 0.9000
- **Accuracy:** 0.9857
- **Evaluation Loss:** 0.0455
- **Runtime:** 12.16 seconds
- **Samples per Second:** 228.612
- **Steps per Second:** 28.607
## Evaluation
### Evaluation Metrics
The model was evaluated using the `seqeval` metric in strict IOB2 mode. Note that `seqeval` scores entities at the span level rather than the token level:
- **Precision:** Proportion of predicted entity spans that exactly match a gold span.
- **Recall:** Proportion of gold entity spans that are exactly predicted.
- **F1-Score:** Harmonic mean of precision and recall.
- **Accuracy:** Proportion of correctly classified tokens (computed at the token level, including non-entity tokens).
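As a toy illustration of strict IOB2 scoring through the `evaluate` wrapper around `seqeval` (the sequences below are invented for the example):

```python
import evaluate

seqeval_metric = evaluate.load("seqeval")

# One gold sequence and one prediction that mislabels an entity boundary.
references  = [["B-NOMBRE", "I-NOMBRE", "O", "B-MUNICIPIO"]]
predictions = [["B-NOMBRE", "I-NOMBRE", "O", "I-MUNICIPIO"]]

# In strict IOB2 mode a lone "I-MUNICIPIO" is not a valid span, so the
# MUNICIPIO entity counts as missed: precision 1.0, recall 0.5, accuracy 0.75.
results = seqeval_metric.compute(
    predictions=predictions, references=references, mode="strict", scheme="IOB2"
)
print(results["overall_precision"], results["overall_recall"], results["overall_accuracy"])
```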
**Test Set Performance:**
- Precision: 0.8948
- Recall: 0.9052
- F1-Score: 0.9000
- Accuracy: 0.9857
### Example Inference
Below are example outputs from the model using the `pipeline` for NER:
**Input Text 1:**\
"Se informa que el asunto principal es la Factura #REF123. Contactar a Juan Pérez en la dirección Calle Falsa 123, Bogotá. Teléfono 555-9876 o al mail [email protected]. El documento asociado es el ID-98765."
**Output:**
- Entity: "Calle Falsa 123, Bogotá" → Type: DIRECCION (Confidence: ~0.99)
- Entity: "555-9876" → Type: TELEFONO (Confidence: ~0.98)
- Entity: "[email protected]" → Type: MAIL (Confidence: ~0.99)
- Entity: "Juan Pérez" → Type: NOMBRE (Confidence: ~0.99)
- Entity: "ID-98765" → Type: DOCUMENTO (Confidence: ~0.97)
- Entity: "#REF123" → Type: REFERENCIA (Confidence: ~0.98)
**Input Text 2:**\
"Referencia: EXP-002. Municipio de Chía, departamento Cundinamarca. Necesitamos hablar sobre el pago pendiente. Email de contacto: [email protected]. Tel: 3001234567"
**Output:**
- Entity: "EXP-002" → Type: REFERENCIA (Confidence: ~0.98)
- Entity: "Chía" → Type: MUNICIPIO (Confidence: ~0.99)
- Entity: "Cundinamarca" → Type: DEPARTAMENTO (Confidence: ~0.99)
- Entity: "[email protected]" → Type: MAIL (Confidence: ~0.99)
- Entity: "3001234567" → Type: TELEFONO (Confidence: ~0.98)
## Usage
### Using the Model with Hugging Face Transformers
To use the model for inference, you can load it with the `transformers` library and create a `pipeline` for NER:
```python
import torch
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Load the model and tokenizer
model = AutoModelForTokenClassification.from_pretrained("excribe/ner_sgd_roberta")
tokenizer = AutoTokenizer.from_pretrained("excribe/ner_sgd_roberta")

# Create the NER pipeline (GPU if available, otherwise CPU)
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
    device=0 if torch.cuda.is_available() else -1,
)

# Example text
text = "Contactar a Juan Pérez en Calle Falsa 123, Bogotá. Teléfono 555-9876."

# Perform inference
entities = ner_pipeline(text)
for entity in entities:
    print(f"Entity: {entity['word']} → Type: {entity['entity_group']} (Confidence: {entity['score']:.4f})")
```
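With `aggregation_strategy="simple"`, the pipeline merges consecutive sub-tokens predicted with the same entity type into single spans, which is what yields whole-entity outputs such as "Calle Falsa 123, Bogotá" in the examples above; pass `aggregation_strategy=None` to inspect raw per-token predictions instead.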
### Installation Requirements
To run the model, install the required libraries:
```bash
pip install transformers[torch] datasets evaluate seqeval accelerate pandas pyarrow
```
### Hardware Requirements
- **Inference:** Can run on CPU or GPU. GPU (e.g., NVIDIA with CUDA) is recommended for faster processing.
- **Training:** GPU with at least 8GB VRAM is recommended for fine-tuning. The model was trained with mixed precision (FP16) to optimize memory usage.
## Limitations
- **Dataset Bias:** The model was trained on administrative texts, so it may not generalize well to other domains (e.g., social media, literature).
- **Entity Overlap:** The preprocessing handles overlapping entities by prioritizing earlier matches, which may lead to missed entities in complex cases.
- **Null Values:** High null rates in some entity columns (e.g., `telefono` is missing in 10,073 of the 27,807 rows) may reduce performance for those entities.
- **Language:** The model is optimized for Spanish and may not perform well on other languages.
## Citation
If you use this model, please cite:
```bibtex
@misc{excribe_ner_sgd_roberta,
author = {Exscribe},
title = {NER Model for Spanish Administrative Texts},
year = {2025},
publisher = {Hugging Face},
journal = {Hugging Face Model Hub},
howpublished = {\url{https://huggingface.co/excribe/ner_sgd_roberta}}
}
```
## Contact
For questions or issues, please contact the maintainers via the Hugging Face repository or open an issue.