excribe committed
Commit 8793054 · verified · Parent: 1c8cb15

Update README.md (#1)


- Update README.md (27607bdd80d0bbfbffb8ce25ddf35c397dc5df12)

Files changed (1): README.md (+214 -0)
README.md CHANGED
---
license: cc-by-nc-3.0
language:
- es
tags:
- ner
- spanish
- transformers
- roberta
- token-classification
- document-management
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: ner_sgd_bertina_roberta
  results:
  - task:
      type: named-entity-recognition
      name: Named Entity Recognition
    dataset:
      name: Custom SGD Dataset
      type: custom
      args: final.parquet
    metrics:
    - name: Precision
      type: precision
      value: 0.9031
    - name: Recall
      type: recall
      value: 0.9149
    - name: F1
      type: f1
      value: 0.909
    - name: Accuracy
      type: accuracy
      value: 0.9869
base_model:
- bertin-project/bertin-roberta-base-spanish
pipeline_tag: token-classification
---

# NER SGD Bertina RoBERTa

## Model Overview

This is a fine-tuned Named Entity Recognition (NER) model for extracting specific entities from Spanish text, designed for administrative and document management contexts. It is based on the `bertin-project/bertin-roberta-base-spanish` model and trained to identify entities such as `direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, and `municipio` using the BIO tagging scheme (`B-TAG`, `I-TAG`, `O`).

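For illustration, under this scheme a name and a municipality in a short sentence would be tagged as follows (a hypothetical word-level tokenization; the model itself operates on sub-tokens):

```
Contactar   O
a           O
Juan        B-NOMBRE
Pérez       I-NOMBRE
en          O
Bogotá      B-MUNICIPIO
```
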
The model was trained on a custom dataset stored in a Parquet file (`final.parquet`) containing Spanish text and labeled entities, likely from a document management system (SGD). Training and inference use the Hugging Face `transformers` library.

### Model Details

- **Base Model**: `bertin-project/bertin-roberta-base-spanish`
- **Task**: Named Entity Recognition (NER)
- **Language**: Spanish
- **Labels**:
  - `O`: Outside of an entity
  - `B-DIRECCION`, `I-DIRECCION`: Address
  - `B-TELEFONO`, `I-TELEFONO`: Phone number
  - `B-MAIL`, `I-MAIL`: Email
  - `B-NOMBRE`, `I-NOMBRE`: Name
  - `B-DOCUMENTO`, `I-DOCUMENTO`: Document ID
  - `B-REFERENCIA`, `I-REFERENCIA`: Reference
  - `B-DEPARTAMENTO`, `I-DEPARTAMENTO`: Department
  - `B-MUNICIPIO`, `I-MUNICIPIO`: Municipality
- **Training Framework**: Hugging Face `transformers`, `datasets`, `evaluate`, `seqeval`
- **Training Hardware**: GPU (if available) or CPU

## Intended Use

This model is designed for extracting structured information from unstructured Spanish text in administrative or document management contexts, such as extracting contact details or references from official correspondence. It is intended for **non-commercial use only**, as per the CC-BY-NC-3.0 license.

### Example Usage

Below is an example of how to use the model with the Hugging Face `pipeline` for NER:

```python
from transformers import pipeline

# Load the model and tokenizer
ner_pipeline = pipeline("ner", model="excribe/ner_sgd_bertina_roberta", aggregation_strategy="simple")

# Example text
text = "Contactar a Juan Pérez en Calle Falsa 123, Bogotá. Teléfono 555-9876 o al mail [email protected]"

# Run inference
entities = ner_pipeline(text)

# Print results
for entity in entities:
    print(f"Entity: {entity['word']} | Type: {entity['entity_group']} | Score: {entity['score']:.4f}")
```

**Example Output**:

```
Entity: Juan Pérez | Type: NOMBRE | Score: 0.9876
Entity: Calle Falsa 123 | Type: DIRECCION | Score: 0.9789
Entity: Bogotá | Type: MUNICIPIO | Score: 0.9654
Entity: 555-9876 | Type: TELEFONO | Score: 0.9921
Entity: [email protected] | Type: MAIL | Score: 0.9890
```

## Training Data

The model was trained on a custom dataset (`final.parquet`) containing **27,807 rows** and **32 columns**, likely sourced from a document management system (SGD). The dataset includes Spanish texts in the `texto_entrada` column and labeled entities in the following columns: `direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, and `municipio`. Other columns, such as `radicado`, `fecha_radicacion`, and `sgd_tpr_descrip`, suggest the data is related to administrative or official documents.

### Dataset Details

- **Number of Rows**: 27,807
- **Relevant Columns for NER**:
  - `texto_entrada`: Source text
  - Entity columns: `direccion`, `telefono`, `mail`, `nombre`, `documento`, `referencia`, `departamento`, `municipio`
- **Missing Values**:
  - `direccion`: 82 missing
  - `telefono`: 10,073 missing
  - `mail`: 1,086 missing
  - `nombre`: 0 missing
  - `documento`: 6,407 missing
  - `referencia`: 200 missing
  - `departamento`: 0 missing
  - `municipio`: 0 missing
- **Preprocessing**:
  - A custom function (`convert_row_to_bio_optimized`) converted entity columns into BIO tags, handling overlaps by prioritizing earlier entities (see the sketch after this list).
  - The dataset was tokenized using the base model's tokenizer, with labels aligned to sub-tokens.
  - Split: Training (~80%), Validation (~10%), Test (~10%).

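The `convert_row_to_bio_optimized` implementation itself is not published with this card; the sketch below shows one way such a row-to-BIO conversion can work, using simplified whitespace tokenization and illustrative names (`convert_row_to_bio`, `entities`):

```python
import re

def convert_row_to_bio(text, entities):
    """Convert one dataset row to word-level BIO tags.

    `entities` maps a label (e.g. "NOMBRE") to the entity string found in
    `text`; overlaps are resolved in favor of the earliest match.
    """
    tokens = text.split()
    tags = ["O"] * len(tokens)

    # Character offsets of each whitespace token.
    offsets, pos = [], 0
    for tok in tokens:
        start = text.index(tok, pos)
        offsets.append((start, start + len(tok)))
        pos = start + len(tok)

    # Locate each entity's first occurrence, then process spans left to right.
    spans = []
    for label, value in entities.items():
        if not value:
            continue  # missing entity column in this row
        match = re.search(re.escape(value), text)
        if match:
            spans.append((match.start(), match.end(), label))
    spans.sort()

    taken_until = -1
    for start, end, label in spans:
        if start < taken_until:
            continue  # overlap: the earlier entity wins
        first = True
        for i, (ts, te) in enumerate(offsets):
            if ts >= start and te <= end:
                tags[i] = ("B-" if first else "I-") + label
                first = False
        taken_until = end
    return tokens, tags

tokens, tags = convert_row_to_bio(
    "Contactar a Juan Pérez en Bogotá",
    {"NOMBRE": "Juan Pérez", "MUNICIPIO": "Bogotá"},
)
# tags -> ['O', 'O', 'B-NOMBRE', 'I-NOMBRE', 'O', 'B-MUNICIPIO']
```
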
## Training Procedure

The model was fine-tuned using the Hugging Face `Trainer` API with the following hyperparameters (a configuration sketch follows the list):

- **Epochs**: 3
- **Learning Rate**: 2e-5
- **Batch Size**: 8 (per device)
- **Weight Decay**: 0.01
- **Evaluation Strategy**: Per epoch
- **Optimizer**: AdamW (default in `transformers`)
- **Mixed Precision**: Enabled if GPU is available
- **Metrics**: Precision, Recall, F1, Accuracy (via `seqeval`)

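The original training script is not included in the repository, so the following `TrainingArguments` setup is only a sketch consistent with the hyperparameters above; the output path is a placeholder, and older `transformers` versions spell `eval_strategy` as `evaluation_strategy`:

```python
import torch
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./ner_sgd_bertina_roberta",  # placeholder path
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    eval_strategy="epoch",           # evaluate once per epoch
    save_strategy="epoch",
    load_best_model_at_end=True,     # keep the checkpoint with the best...
    metric_for_best_model="f1",      # ...F1, as described in the steps below
    fp16=torch.cuda.is_available(),  # mixed precision when a GPU is present
)
```
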
The training process included:

1. Loading the dataset from Parquet and converting it to a Hugging Face `Dataset`.
2. Generating BIO tags for each text.
3. Tokenizing and aligning labels with the model's tokenizer (sketched below).
4. Fine-tuning the model with the `Trainer` API.
5. Evaluating on the validation set and saving the best model based on F1 score.
6. Final evaluation on the test set.

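Step 3 follows the standard `transformers` pattern for token classification: word-level BIO tags are projected onto sub-tokens via `word_ids()`, with special tokens masked as `-100` so the loss ignores them. A minimal sketch (the `label2id` mapping is assumed to cover the labels listed above; whether non-first sub-tokens keep their word's tag or are masked is an implementation choice, and this sketch masks them):

```python
from transformers import AutoTokenizer

# add_prefix_space=True is required for RoBERTa-style tokenizers
# when passing pre-split words.
tokenizer = AutoTokenizer.from_pretrained(
    "bertin-project/bertin-roberta-base-spanish", add_prefix_space=True
)

def tokenize_and_align_labels(words, word_tags, label2id):
    """Tokenize pre-split words and align word-level BIO tags to sub-tokens."""
    encoding = tokenizer(words, is_split_into_words=True, truncation=True)
    labels, previous_word = [], None
    for word_idx in encoding.word_ids():
        if word_idx is None:
            labels.append(-100)  # special tokens: ignored by the loss
        elif word_idx != previous_word:
            labels.append(label2id[word_tags[word_idx]])  # first sub-token
        else:
            labels.append(-100)  # remaining sub-tokens of the same word
        previous_word = word_idx
    encoding["labels"] = labels
    return encoding
```
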
### Evaluation Metrics

The model was evaluated on the test set after 3 epochs, achieving the following metrics:

- **Precision**: 0.9031
- **Recall**: 0.9149
- **F1-Score**: 0.9090
- **Accuracy**: 0.9869
- **Loss**: 0.0465
- **Runtime**: 12.22 seconds
- **Samples per Second**: 227.546
- **Steps per Second**: 28.474

These metrics were computed using the `seqeval` library with the IOB2 scheme in strict mode, ensuring accurate entity boundary and type matching.

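A `compute_metrics` function consistent with this setup, using the `evaluate` wrapper around `seqeval`, might look like the sketch below; the `id2label` mapping shown is abbreviated (the full one lives in `model.config.id2label`):

```python
import evaluate
import numpy as np

seqeval = evaluate.load("seqeval")

# Abbreviated mapping; the trained model stores the complete one.
id2label = {0: "O", 1: "B-NOMBRE", 2: "I-NOMBRE"}

def compute_metrics(eval_pred):
    """Drop -100 positions, map ids to tag strings, and score with seqeval."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    true_tags, pred_tags = [], []
    for pred_row, label_row in zip(predictions, labels):
        true_tags.append([id2label[l] for l in label_row if l != -100])
        pred_tags.append(
            [id2label[p] for p, l in zip(pred_row, label_row) if l != -100]
        )

    results = seqeval.compute(
        predictions=pred_tags,
        references=true_tags,
        scheme="IOB2",
        mode="strict",
    )
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
```
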
## How to Use

To use the model, install the required dependencies and load it with the Hugging Face `transformers` library:

```bash
pip install transformers torch
```

Then, use the `pipeline` as shown in the example above, or load the model manually:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("excribe/ner_sgd_bertina_roberta")
tokenizer = AutoTokenizer.from_pretrained("excribe/ner_sgd_bertina_roberta")

# Tokenize input text
inputs = tokenizer("Calle Falsa 123, Bogotá", return_tensors="pt")

# Get predictions
outputs = model(**inputs)
logits = outputs.logits
predictions = logits.argmax(dim=-1)
```

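The `predictions` tensor above holds one label id per sub-token; to read them as BIO tags, map the ids through the label table stored in the model config:

```python
# Map predicted ids back to tag strings, one per sub-token.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
tags = [model.config.id2label[i] for i in predictions[0].tolist()]
for token, tag in zip(tokens, tags):
    print(token, tag)
```
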
## Limitations

- The model is trained on a dataset from an administrative document management context and may not generalize well to other domains (e.g., social media or informal texts).
- Overlapping entities are resolved by prioritizing earlier matches, which may miss some valid entities.
- Missing values in entity columns (e.g., 10,073 missing `telefono` values) may reduce performance for certain entity types.
- The model is optimized for Spanish and may not perform well on other languages.
- Due to the CC-BY-NC-3.0 license, the model cannot be used for commercial purposes.

## Ethical Considerations

- **Bias**: The model may reflect biases in the training data, such as underrepresentation of certain entity types (e.g., `telefono` has many missing values) or overrepresentation of formal administrative language.
- **Privacy**: The model extracts sensitive entities like names, addresses, and phone numbers. Ensure input texts do not contain personal data unless processing is authorized.
- **Non-Commercial Use**: The model is licensed for non-commercial use only, as per CC-BY-NC-3.0.

## License

This model is licensed under the [Creative Commons Attribution-NonCommercial 3.0 Unported License (CC-BY-NC-3.0)](https://creativecommons.org/licenses/by-nc/3.0/). You are free to share and adapt the model for non-commercial purposes, provided you give appropriate credit to the author.

## Contact

For issues or questions, please contact the model author via the Hugging Face repository or open an issue.

## Acknowledgments

This model was trained using the Hugging Face ecosystem (`transformers`, `datasets`, `evaluate`, `seqeval`). Thanks to the `bertin-project` team for providing the base model `bertin-roberta-base-spanish`.