---
language:
- amh
- hau
- ibo
- kin
- lug
- luo
- pcm
- swa
- wol
- yor
datasets:
- masakhaner
---
# xlm-roberta-large-masakhaner
## Model description
**xlm-roberta-large-masakhaner** is the first **Named Entity Recognition** model for 10 African languages (Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian Pidgin, Swahili, Wolof, and Yorùbá) based on a fine-tuned XLM-RoBERTa large model. It achieves **state-of-the-art performance** on the NER task. It has been trained to recognize four types of entities: dates & times (DATE), location (LOC), organizations (ORG), and person (PER).
Specifically, this model is an *xlm-roberta-large* model that was fine-tuned on an aggregation of African language datasets obtained from the Masakhane [MasakhaNER](https://github.com/masakhane-io/masakhane-ner) dataset.
## Intended uses & limitations
#### How to use
You can use this model with the Transformers *pipeline* for NER.
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Davlan/xlm-roberta-large-masakhaner")
model = AutoModelForTokenClassification.from_pretrained("Davlan/xlm-roberta-large-masakhaner")
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

# Example sentence in Nigerian Pidgin
example = "Emir of Kano turban Zhang wey don spend 18 years for Nigeria"
ner_results = nlp(example)
print(ner_results)
```
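By default, the *ner* pipeline returns one prediction per sub-word token. If you prefer whole entity spans, recent Transformers releases accept an `aggregation_strategy` argument; a minimal sketch, assuming a Transformers version that supports it:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("Davlan/xlm-roberta-large-masakhaner")
model = AutoModelForTokenClassification.from_pretrained("Davlan/xlm-roberta-large-masakhaner")

# "simple" merges consecutive sub-word tokens that share a tag into one entity span
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

for entity in nlp("Emir of Kano turban Zhang wey don spend 18 years for Nigeria"):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```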
#### Limitations and bias
This model is limited by its training dataset of entity-annotated news articles from a specific span of time. This may not generalize well for all use cases in different domains.
## Training data
This model was fine-tuned on 10 African NER datasets (Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Nigerian Pidgin, Swahili, Wolof, and Yorùbá) from the Masakhane [MasakhaNER](https://github.com/masakhane-io/masakhane-ner) dataset.

The training dataset distinguishes between the beginning and continuation of an entity so that if there are back-to-back entities of the same type, the model can output where the second entity begins. As in the dataset, each token will be classified as one of the following classes:
Abbreviation|Description
-|-
O|Outside of a named entity
B-DATE|Beginning of a DATE entity right after another DATE entity
I-DATE|DATE entity
B-PER|Beginning of a person’s name right after another person’s name
I-PER|Person’s name
B-ORG|Beginning of an organisation right after another organisation
I-ORG|Organisation
B-LOC|Beginning of a location right after another location
I-LOC|Location
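If you call the model directly rather than through the pipeline, `model.config.id2label` maps each predicted class index to one of the tags above; a minimal sketch of decoding raw logits into per-token labels:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("Davlan/xlm-roberta-large-masakhaner")
model = AutoModelForTokenClassification.from_pretrained("Davlan/xlm-roberta-large-masakhaner")

inputs = tokenizer("Emir of Kano turban Zhang wey don spend 18 years for Nigeria",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# One predicted tag per sub-word token, mapped back through id2label
predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    print(token, model.config.id2label[pred.item()])
```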
## Training procedure
This model was trained on a single NVIDIA V100 GPU with the recommended hyperparameters from the [original MasakhaNER paper]() which trained & evaluated the model on the MasakhaNER corpus.
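The card does not reproduce the exact hyperparameters, so the sketch below only illustrates a generic token-classification fine-tuning setup with the Hugging Face `Trainer` on one MasakhaNER language; the learning rate, epoch count, and batch size are placeholder assumptions, not the settings used for this model:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

# Single-language config for illustration; the released model was fine-tuned
# on an aggregation of all ten MasakhaNER languages.
dataset = load_dataset("masakhaner", "yor")
label_names = dataset["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-large", num_labels=len(label_names))

def tokenize_and_align(batch):
    tokenized = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        example_labels, previous = [], None
        for word_id in word_ids:
            if word_id is None or word_id == previous:
                example_labels.append(-100)  # ignore special tokens and trailing sub-words
            else:
                example_labels.append(tags[word_id])
            previous = word_id
        labels.append(example_labels)
    tokenized["labels"] = labels
    return tokenized

tokenized_dataset = dataset.map(tokenize_and_align, batched=True)

# Hyperparameter values are illustrative assumptions, not the published settings
args = TrainingArguments(output_dir="xlmr-large-masakhaner", learning_rate=2e-5,
                         num_train_epochs=5, per_device_train_batch_size=8)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized_dataset["train"],
                  eval_dataset=tokenized_dataset["validation"],
                  data_collator=DataCollatorForTokenClassification(tokenizer))
trainer.train()
```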
## Eval results on Test set (F-score)
language|F1-score
-|-
amh|75.76
hau|91.75
ibo|86.26
kin|76.38
lug|84.64
luo|80.65
pcm|89.55
swa|89.48
wol|70.70
yor|82.05
57
+ ### BibTeX entry and citation info
58
+ ```
59
+
60
+ ```
61
+
62
+