---
base_model: llama-2-amharic-combined
language:
- am
pipeline_tag: text-generation
---


## Walia-LLM: Enhancing Amharic-LLaMA by Integrating Task-Specific and Generative Datasets

`Walia-LLM` is a fine-tuned LLaMA-2 model for the Amharic language, created by instruction tuning on task-specific and generative datasets. It is part of our effort to adapt and improve LLMs for low-resource languages.

This model was introduced in the EMNLP 2024 Findings paper:
> [Walia-LLM: Enhancing Amharic-LLaMA by Integrating Task-Specific and Generative Datasets](https://aclanthology.org/2024.findings-emnlp.25/)

## Model Details

- Base model: LLaMA-2
- Fine-tuning method: Supervised fine-tuning (SFT) using LoRA (see the sketch after this list)
- Language: Amharic
- Tasks:
  - Sentiment analysis
  - Question answering
  - Named entity recognition
  - News classification
  - Summarization
  - Machine translation
  - Poem/story/lyrics generation
  - Spelling correction
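
The exact training configuration is not reproduced in this card; the snippet below is a minimal sketch of how a LoRA-based SFT setup is commonly wired with `peft` and `transformers`. The base checkpoint name, rank, scaling, dropout, and target modules are illustrative assumptions, not the values used for Walia-LLM.

```python
# Hedged sketch of a LoRA SFT setup with peft + transformers.
# All hyperparameters and the base checkpoint are illustrative assumptions,
# not the exact configuration used for Walia-LLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # assumed base; the paper starts from an Amharic-adapted LLaMA-2
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16)

lora_config = LoraConfig(
    r=16,                                  # low-rank dimension (illustrative)
    lora_alpha=32,                         # scaling factor (illustrative)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # common choice for LLaMA-style attention
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```

Because LoRA keeps the base weights frozen and trains only low-rank adapter matrices, this kind of fine-tuning remains feasible on modest hardware while still adapting the model to Amharic instructions.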

## Training Data

The model was trained on a custom instruction dataset derived from:
- Existing NLP benchmarks (e.g., AfriSenti, AmharicQA, MasakhaNER, MasakhaNews, XL-Sum)
- Manually collected generative datasets (e.g., religious lyrics, stories, poems)
- Translated instruction datasets (e.g., Alpaca, Dolly)

See [EthioNLP/walia-amharic-instructions](https://huggingface.co/datasets/EthioNLP/walia-amharic-instructions) for the dataset used.
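
For a quick look at the instruction data, the snippet below loads the published dataset with the `datasets` library. The split name and record fields are assumptions; check the dataset card for the actual schema.

```python
# Hedged sketch: inspect the Walia instruction dataset.
# The split name is an assumption; consult the dataset card for the schema.
from datasets import load_dataset

ds = load_dataset("EthioNLP/walia-amharic-instructions", split="train")
print(ds)     # dataset size and column names
print(ds[0])  # one instruction/response record
```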

## Intended Use

This model is intended for:
- Research on instruction tuning in low-resource languages
- Generative NLP tasks in Amharic
- Evaluating multilingual LLM capabilities

## Limitations

- Some generative outputs may be verbose or imprecise.
- Limited understanding of highly specific Amharic poetic or lyrical structures.
- Spelling correction and NER performance are still under exploration.

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained("EthioNLP/Amharic-LLAMA-all-data")
tokenizer = AutoTokenizer.from_pretrained("EthioNLP/Amharic-LLAMA-all-data")

# Amharic prompt: "Provide a description of the Amharic language."
prompt = "ሡለ αŠ αˆ›αˆ­αŠ› α‰‹αŠ•α‰‹ መግለጫ αŠ α‰…αˆ­α‰₯ፒ"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
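
Instruction-tuned models often respond best when the prompt follows the template seen during fine-tuning. The wrapper below sketches an Alpaca-style format as one plausible choice; the exact template Walia-LLM expects is not documented in this card, so treat this as an assumption and verify it against the instruction dataset.

```python
# Hedged sketch: Alpaca-style prompt wrapper. The exact template used for
# Walia-LLM is an assumption here; verify against the instruction dataset.
def build_prompt(instruction: str, input_text: str = "") -> str:
    if input_text:
        return (
            "Below is an instruction that describes a task, paired with an input.\n\n"
            f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
        )
    return (
        "Below is an instruction that describes a task.\n\n"
        f"### Instruction:\n{instruction}\n\n### Response:\n"
    )

prompt = build_prompt("ሡለ αŠ αˆ›αˆ­αŠ› α‰‹αŠ•α‰‹ መግለጫ αŠ α‰…αˆ­α‰₯ፒ")
```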

## Citation

```bibtex
@inproceedings{azime-etal-2024-walia,
    title = "Walia-{LLM}: Enhancing {A}mharic-{LL}a{MA} by Integrating Task-Specific and Generative Datasets",
    author = "Azime, Israel Abebe  and Tonja, Atnafu Lambebo  and Belay, Tadesse Destaw  and Fuge, Mitiku Yohannes  and Wassie, Aman Kassahun  and Jada, Eyasu Shiferaw  and Chanie, Yonas  and Sewunetie, Walelign Tewabe  and Yimam, Seid Muhie",
    editor = "Al-Onaizan, Yaser  and Bansal, Mohit  and Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.25/",
    doi = "10.18653/v1/2024.findings-emnlp.25",
    pages = "432--444"
}
```