MaLA-LM/emma-500-llama2-7b

jisx committed on Sep 26, 2024

Commit 6e295ea · verified · 1 Parent(s): 8effe99

Update README.md

Files changed (1)

README.md +61 -0

@@ -1,3 +1,64 @@
 ---
 license: llama2
+datasets:
+- MaLA-LM/PolyWrite
+- Davlan/sib200
+base_model:
+- meta-llama/Llama-2-7b-hf
 ---
+# EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models
+## Model Description
+**EMMA-500** is a multilingual language model designed to improve language representation, especially for low-resource languages, through continual pre-training of **Llama 2 7B**. It is trained on the **MaLA Corpus**, which spans more than 500 languages and 74 billion tokens, and performs well on multilingual tasks such as commonsense reasoning, machine translation, open-ended generation, and text classification.
+**EMMA-500** outperforms other Llama 2-based models in diverse multilingual settings while maintaining robustness in specialized tasks.
+---
+## Model Details
+- **Architecture**: Built on Llama 2 7B with enhanced language adaptation through continual pre-training.
+- **Languages**: Covers **546 languages**, each with more than 100k tokens of training data.
+- **Data Mix**: A diverse mix of text from domains like code, books, instruction data, and more.
+- **Key Tasks**: Commonsense reasoning, machine translation, text classification, natural language inference, code generation, and open-ended generation.
+### Data Access
+- [MaLA Corpus](https://huggingface.co/datasets/MaLA-LM/MaLA)
+- [PolyWrite Benchmark](https://huggingface.co/datasets/MaLA-LM/PolyWrite)
+---
+## Usage
+You can use **EMMA-500** for multilingual text generation. The example below generates text with the model:
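+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+# Load the tokenizer and model weights from the Hugging Face Hub.
+model_name = "MaLA-LM/emma-500-llama2-7b"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForCausalLM.from_pretrained(model_name)
+# Tokenize the prompt and generate a continuation.
+input_text = "Once upon a time"
+inputs = tokenizer(input_text, return_tensors="pt")
+# max_new_tokens=50 is an arbitrary cap; without it, generate() falls back to a short default length.
+outputs = model.generate(**inputs, max_new_tokens=50)
+# Decode the generated token IDs back into text.
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```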
+---
+## Model Performance
+**EMMA-500** was evaluated across multiple benchmarks and tasks, demonstrating:
+- **Lowest negative log-likelihood** in intrinsic evaluations.
+- Significant improvements in **commonsense reasoning**, **machine translation**, and **open-ended generation**.
+- **Better results** than all other Llama 2-based models in **text classification** and **natural language inference**.
+- Enhanced performance in **code generation** and **machine reading comprehension (MRC)**.
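The intrinsic evaluation in the list above reports negative log-likelihood (NLL). As a rough illustration of what that quantity measures, the sketch below computes the NLL of a target token sequence from per-step logits in pure Python. The helper names (`log_softmax`, `sequence_nll`) and the toy logits are hypothetical, and this is not the evaluation code used for EMMA-500; in practice the logits would come from the model's forward pass.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sequence_nll(step_logits, target_ids):
    """Sum of -log p(target_t) over all positions.

    step_logits: one list of vocabulary logits per position (hypothetical input).
    target_ids:  the token id that actually occurs at each position.
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("need exactly one target id per logit vector")
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total


# Toy example with a 3-token vocabulary (made-up numbers).
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
toy_targets = [0, 2]
print(sequence_nll(toy_logits, toy_targets))
```

A lower value means the model assigns higher probability to the reference text, which is why the lowest NLL is the best result in this kind of comparison.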
+Challenges remain in low-resource languages, where the model tends to have higher **Self-BLEU** scores, indicating reduced output diversity.
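To make the Self-BLEU remark concrete: Self-BLEU scores each generated sample against the other samples as references and averages the results, so a higher value means the outputs resemble each other more. The following is a minimal sketch under stated assumptions: whitespace tokenization, n-grams up to order 2, and add-one smoothing. These choices, and the function names `bleu` and `self_bleu`, are illustrative and may differ from the setup used to evaluate EMMA-500.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, references, max_n=2):
    """Simplified sentence BLEU: clipped n-gram precision, add-one smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        # Maximum count of each n-gram across all references (for clipping).
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        log_precisions.append(math.log((clipped + 1) / (total + 1)))
    # Brevity penalty against the reference closest in length.
    hyp_len = len(hypothesis)
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - hyp_len), length))
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return bp * math.exp(sum(log_precisions) / max_n)


def self_bleu(samples, max_n=2):
    """Average BLEU of each sample against all the other samples."""
    if len(samples) < 2:
        raise ValueError("Self-BLEU needs at least two samples")
    tokenized = [s.split() for s in samples]  # assumption: whitespace tokens
    scores = []
    for i, hyp in enumerate(tokenized):
        refs = tokenized[:i] + tokenized[i + 1:]
        scores.append(bleu(hyp, refs, max_n))
    return sum(scores) / len(scores)


# Made-up generations: repetitive outputs score higher than varied ones.
print(self_bleu(["the cat sat", "the cat sat", "the cat sat"]))
print(self_bleu(["the cat sat", "a dog ran off", "birds sing loudly"]))
```

Because of the smoothing, completely disjoint samples still get a small non-zero score; only the relative ordering between sets of generations is meaningful here.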
+---