israel committed on
Commit 1a45aa2 · verified · 1 Parent(s): 43097de

Update README.md

Files changed (1)
  1. README.md +42 -55
README.md CHANGED
@@ -5,80 +5,67 @@ language:
  pipeline_tag: text-generation
  ---
 
- # Walia Instruction Dataset for Amharic
-
- This repository contains instruction-tuning datasets used in [Walia-LLM](https://aclanthology.org/2024.findings-emnlp.25/), a fine-tuned LLaMA-2 model for the Amharic language. The dataset was carefully constructed by integrating task-specific and generative datasets, and it supports a variety of natural language processing tasks in Amharic.
-
- ## Dataset Summary
-
- The Walia dataset is designed to enhance large language models for the Amharic language by:
-
- - Converting existing task-specific datasets (e.g., sentiment analysis, QA, NER) into instruction format.
- - Creating new generative datasets (e.g., poem generation, religious lyrics, story generation).
- - Translating English instruction datasets (e.g., Alpaca, Dolly) into Amharic for comparative studies.
-
- Each data point follows a structured instruction format with:
- - `"instruction"` – a natural language task description,
- - `"input"` – optional input text for the task,
- - `"output"` – the expected model output in Amharic.
-
- ## Supported Tasks
-
- | Task | Source/Type | Notes |
- |---------------------------|-------------------|----------------------------|
- | Sentiment Analysis | AfriSenti | 3-class sentiment |
- | Named Entity Recognition | MasakhaNER | Personal name extraction |
- | News Classification | MasakhaNews | Multilingual topic classes |
- | QA | AmharicQA | Wikipedia-based |
- | Summarization | XL-Sum | Amharic summaries |
- | Machine Translation | NLLB, WMT19 | Both directions supported |
- | Poem/Lyrics/Story Gen | Custom | Sourced from web/Telegram |
- | Spelling Correction | Synthetic | Character perturbations |
-
- ## Dataset Structure
-
- ```json
- {
-   "instruction": "Translate the following sentence to Amharic.",
-   "input": "Hello, how are you?",
-   "output": "ሰላም፣ እንዴት ነህ?"
- }
- ```
-
- ## Data Statistics
-
- - ~122,000 instruction samples for training
- - ~15,000 for validation and test
- - 16+ task types and instruction templates
- - All responses are in Amharic (except source text in MT)
-
- ## How to Use
-
- You can load the dataset using the Hugging Face `datasets` library:
-
  ```python
- from datasets import load_dataset
-
- dataset = load_dataset("EthioNLP/walia-amharic-instructions")
- print(dataset["train"][0])
- ```
-
- ## Applications
-
- - Supervised fine-tuning (SFT) of LLMs for Amharic
- - Cross-lingual instruction tuning experiments
- - Evaluation of generative capabilities in low-resource languages
-
- ## Related Models
-
- The dataset is used to fine-tune:
- - [`EthioNLP/walia-llama-2`](https://huggingface.co/EthioNLP/walia-llama-2)
- - Other LLaMA variants for Amharic
-
+ # Walia-LLM: Fine-Tuned LLaMA-2 for Amharic
+
+ `Walia-LLM` is a fine-tuned LLaMA-2 model for the Amharic language, created by instruction tuning with task-specific and generative datasets. It is part of our effort to adapt and improve LLMs for low-resource languages.
+
+ This model was introduced in the EMNLP 2024 Findings paper:
+ > [Walia-LLM: Enhancing Amharic-LLaMA by Integrating Task-Specific and Generative Datasets](https://aclanthology.org/2024.findings-emnlp.25/)
+
+ ## Model Details
+
+ - Base model: LLaMA-2
+ - Fine-tuning method: Supervised fine-tuning (SFT) using LoRA (see the sketch after this list)
+ - Language: Amharic
+ - Tasks:
+   - Sentiment analysis
+   - Question answering
+   - Named entity recognition
+   - News classification
+   - Summarization
+   - Machine translation
+   - Poem/story/lyrics generation
+   - Spelling correction
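+
+ As a rough illustration of this setup, here is a minimal LoRA sketch using the Hugging Face `peft` library. The hyperparameters, target modules, and base checkpoint path are illustrative assumptions, not the exact configuration used for Walia-LLM:
+
+ ```python
+ from peft import LoraConfig, get_peft_model
+ from transformers import AutoModelForCausalLM
+
+ # Base LLaMA-2 checkpoint (illustrative; access to the gated repo is required).
+ base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
+
+ # Illustrative LoRA settings; the paper's exact rank/alpha/targets may differ.
+ lora_config = LoraConfig(
+     r=16,                                 # low-rank adapter dimension
+     lora_alpha=32,                        # adapter scaling factor
+     target_modules=["q_proj", "v_proj"],  # attention projections to adapt
+     lora_dropout=0.05,
+     task_type="CAUSAL_LM",
+ )
+
+ # Only the adapter weights are trained during SFT; the base model stays frozen.
+ model = get_peft_model(base_model, lora_config)
+ model.print_trainable_parameters()
+ ```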
+
+ ## Training Data
+
+ The model was trained on a custom instruction dataset derived from:
+ - Existing NLP benchmarks (e.g., AfriSenti, AmharicQA, MasakhaNER, MasakhaNews, XL-Sum)
+ - Manually collected generative datasets (e.g., religious lyrics, stories, poems)
+ - Translated instruction datasets (e.g., Alpaca, Dolly)
+
+ See [EthioNLP/walia-amharic-instructions](https://huggingface.co/datasets/EthioNLP/walia-amharic-instructions) for the dataset used.
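+
+ To make the conversion from task-specific data to instructions concrete, the sketch below wraps a hypothetical sentiment record in the `instruction`/`input`/`output` schema used by the dataset; the record and the instruction wording are made up for illustration:
+
+ ```python
+ # Hypothetical raw record from a sentiment benchmark ("It is good." / positive).
+ raw = {"text": "ጥሩ ነው።", "label": "positive"}
+
+ # Re-cast it in the instruction/input/output schema.
+ example = {
+     "instruction": "Classify the sentiment of the following Amharic text as positive, negative, or neutral.",
+     "input": raw["text"],
+     # In the released dataset, responses are verbalized in Amharic.
+     "output": raw["label"],
+ }
+ print(example)
+ ```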
+
+ ## Intended Use
+
+ This model is intended for:
+ - Research on instruction tuning in low-resource languages
+ - Generative NLP tasks in Amharic
+ - Evaluating multilingual LLM capabilities
+
+ ## Limitations
+
+ - Some generative outputs may be verbose or imprecise.
+ - Limited understanding of highly specific Amharic poetic or lyrical structures.
+ - Performance on spelling correction and NER is still under exploration.
+
+ ## Example Usage
+
  ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ # Load the fine-tuned model and its tokenizer from the Hub.
+ model = AutoModelForCausalLM.from_pretrained("EthioNLP/Amharic-LLAMA-all-data")
+ tokenizer = AutoTokenizer.from_pretrained("EthioNLP/Amharic-LLAMA-all-data")
+
+ # Amharic prompt: "Provide a description of the Amharic language."
+ prompt = "ሡለ አማርኛ ቋንቋ መግለጫ አቅርብ።"
+ inputs = tokenizer(prompt, return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=100)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
 
  ## Citation
 
- Please cite the following paper if you use this dataset:
-
  ```bibtex
  @inproceedings{azime-etal-2024-walia,
     title = "Walia-{LLM}: Enhancing {A}mharic-{LL}a{MA} by Integrating Task-Specific and Generative Datasets",
@@ -93,4 +80,4 @@ Please cite the following paper if you use this dataset:
     doi = "10.18653/v1/2024.findings-emnlp.25",
     pages = "432--444"
  }
- ```
+ ```