israel committed on
Commit 1a45aa2 · verified · 1 Parent(s): 43097de

Update README.md

Files changed (1)
  1. README.md +42 -55
README.md CHANGED
@@ -5,80 +5,67 @@ language:
  pipeline_tag: text-generation
  ---
 
- # Walia Instruction Dataset for Amharic
-
- This repository contains instruction-tuning datasets used in [Walia-LLM](https://aclanthology.org/2024.findings-emnlp.25/), a fine-tuned LLaMA-2 model for the Amharic language. The dataset was carefully constructed by integrating task-specific and generative datasets, and it supports a variety of natural language processing tasks in Amharic.
-
- ## Dataset Summary
-
- The Walia dataset is designed to enhance large language models for the Amharic language by:
-
- - Converting existing task-specific datasets (e.g., sentiment analysis, QA, NER) into instruction format.
- - Creating new generative datasets (e.g., poem generation, religious lyrics, story generation).
- - Translating English instruction datasets (e.g., Alpaca, Dolly) into Amharic for comparative studies.
-
- Each data point follows a structured instruction format with:
- - `"instruction"` – a natural language task description,
- - `"input"` – optional input text for the task,
- - `"output"` – the expected model output in Amharic.
-
- ## Supported Tasks
-
- | Task | Source/Type | Notes |
- |---------------------------|-------------------|----------------------------|
- | Sentiment Analysis | AfriSenti | 3-class sentiment |
- | Named Entity Recognition | MasakhaNER | Personal name extraction |
- | News Classification | MasakhaNews | Multilingual topic classes |
- | QA | AmharicQA | Wikipedia-based |
- | Summarization | XL-Sum | Amharic summaries |
- | Machine Translation | NLLB, WMT19 | Both directions supported |
- | Poem/Lyrics/Story Gen | Custom | Sourced from web/Telegram |
- | Spelling Correction | Synthetic | Character perturbations |
-
- ## Dataset Structure
-
- ```json
- {
-   "instruction": "Translate the following sentence to Amharic.",
-   "input": "Hello, how are you?",
-   "output": "ሰላም፣ እንዴት ነህ?"
- }
- ```
-
- ## Data Statistics
-
- - ~122,000 instruction samples for training
- - ~15,000 for validation and test
- - 16+ task types and instruction templates
- - All responses are in Amharic (except source text in MT)
-
- ## How to Use
-
- You can load the dataset using the Hugging Face `datasets` library:
-
  ```python
- from datasets import load_dataset
-
- dataset = load_dataset("EthioNLP/walia-amharic-instructions")
- print(dataset["train"][0])
- ```
-
- ## Applications
-
- - Supervised fine-tuning (SFT) of LLMs for Amharic
- - Cross-lingual instruction tuning experiments
- - Evaluation of generative capabilities in low-resource languages
-
- ## Related Models
-
- The dataset is used to fine-tune:
- - [`EthioNLP/walia-llama-2`](https://huggingface.co/EthioNLP/walia-llama-2)
- - Other LLaMA variants for Amharic
-
+ # Walia-LLM: Fine-Tuned LLaMA-2 for Amharic
+
+ `Walia-LLM` is a fine-tuned LLaMA-2 model for the Amharic language, created by instruction tuning with task-specific and generative datasets. It is part of our effort to adapt and improve LLMs for low-resource languages.
+
+ This model was introduced in the EMNLP 2024 Findings paper:
+ > [Walia-LLM: Enhancing Amharic-LLaMA by Integrating Task-Specific and Generative Datasets](https://aclanthology.org/2024.findings-emnlp.25/)
+
+ ## Model Details
+
+ - Base model: LLaMA-2
+ - Fine-tuning method: Supervised fine-tuning (SFT) using LoRA (see the sketch after this list)
+ - Language: Amharic
+ - Tasks:
+   - Sentiment analysis
+   - Question answering
+   - Named entity recognition
+   - News classification
+   - Summarization
+   - Machine translation
+   - Poem/story/lyrics generation
+   - Spelling correction
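+
+ As a rough illustration of this setup, here is a minimal LoRA sketch using the Hugging Face `peft` library. The hyperparameters, target modules, and base checkpoint path are illustrative assumptions, not the exact configuration used for Walia-LLM:
+
+ ```python
+ from peft import LoraConfig, get_peft_model
+ from transformers import AutoModelForCausalLM
+
+ # Base LLaMA-2 checkpoint (illustrative; access to the gated repo is required).
+ base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
+
+ # Illustrative LoRA settings; the paper's exact rank/alpha/targets may differ.
+ lora_config = LoraConfig(
+     r=16,                                 # low-rank adapter dimension
+     lora_alpha=32,                        # adapter scaling factor
+     target_modules=["q_proj", "v_proj"],  # attention projections to adapt
+     lora_dropout=0.05,
+     task_type="CAUSAL_LM",
+ )
+
+ # Only the adapter weights are trained during SFT; the base model stays frozen.
+ model = get_peft_model(base_model, lora_config)
+ model.print_trainable_parameters()
+ ```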
+
+ ## Training Data
+
+ The model was trained on a custom instruction dataset derived from:
+ - Existing NLP benchmarks (e.g., AfriSenti, AmharicQA, MasakhaNER, MasakhaNews, XL-Sum)
+ - Manually collected generative datasets (e.g., religious lyrics, stories, poems)
+ - Translated instruction datasets (e.g., Alpaca, Dolly)
+
+ See [EthioNLP/walia-amharic-instructions](https://huggingface.co/datasets/EthioNLP/walia-amharic-instructions) for the dataset used.
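+
+ To make the conversion from task-specific data to instructions concrete, the sketch below wraps a hypothetical sentiment record in the `instruction`/`input`/`output` schema used by the dataset; the record and the instruction wording are made up for illustration:
+
+ ```python
+ # Hypothetical raw record from a sentiment benchmark ("It is good." / positive).
+ raw = {"text": "ጥሩ ነው።", "label": "positive"}
+
+ # Re-cast it in the instruction/input/output schema.
+ example = {
+     "instruction": "Classify the sentiment of the following Amharic text as positive, negative, or neutral.",
+     "input": raw["text"],
+     # In the released dataset, responses are verbalized in Amharic.
+     "output": raw["label"],
+ }
+ print(example)
+ ```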
+
+ ## Intended Use
+
+ This model is intended for:
+ - Research on instruction tuning in low-resource languages
+ - Generative NLP tasks in Amharic
+ - Evaluating multilingual LLM capabilities
+
+ ## Limitations
+
+ - Some generative outputs may be verbose or imprecise.
+ - Limited understanding of highly specific Amharic poetic or lyrical structures.
+ - Performance on spelling correction and NER is still under exploration.
+
+ ## Example Usage
+
  ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ # Load the fine-tuned model and its tokenizer from the Hub.
+ model = AutoModelForCausalLM.from_pretrained("EthioNLP/Amharic-LLAMA-all-data")
+ tokenizer = AutoTokenizer.from_pretrained("EthioNLP/Amharic-LLAMA-all-data")
+
+ # Amharic prompt: "Provide a description of the Amharic language."
+ prompt = "ሡለ አማርኛ ቋንቋ መግለጫ አቅርብ።"
+ inputs = tokenizer(prompt, return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=100)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
 
  ## Citation
 
- Please cite the following paper if you use this dataset:
-
  ```bibtex
  @inproceedings{azime-etal-2024-walia,
     title = "Walia-{LLM}: Enhancing {A}mharic-{LL}a{MA} by Integrating Task-Specific and Generative Datasets",
@@ -93,4 +80,4 @@ Please cite the following paper if you use this dataset:
     doi = "10.18653/v1/2024.findings-emnlp.25",
     pages = "432--444"
  }
- ```
+ ```