---
license: llama2
datasets:
- MaLA-LM/PolyWrite
- Davlan/sib200
base_model:
- meta-llama/Llama-2-7b-hf
---

# EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models

## Model Description

**EMMA-500** is a state-of-the-art multilingual language model that improves language representation, especially for low-resource languages, through continual pre-training of **Llama 2 7B**. Trained on the **MaLA Corpus**, which spans over 500 languages and 74 billion tokens, EMMA-500 performs strongly on multilingual tasks such as commonsense reasoning, machine translation, open-ended generation, and text classification.

**EMMA-500** outperforms other Llama 2-based models in diverse multilingual settings while remaining robust on specialized tasks.

---

## Model Details

- **Architecture**: Built on Llama 2 7B, with language adaptation achieved through continual pre-training.
- **Languages**: Supports **546 languages**, each with substantial training data (over 100k tokens).
- **Data Mix**: A diverse mix of text spanning code, books, instruction data, and other domains.
- **Key Tasks**: Commonsense reasoning, machine translation, text classification, natural language inference, code generation, and open-ended generation.

### Data Access
- [MaLA Corpus](https://huggingface.co/datasets/MaLA-LM/MaLA)
- [PolyWrite Benchmark](https://huggingface.co/datasets/MaLA-LM/PolyWrite)
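
Both resources are hosted on the Hugging Face Hub and can be loaded with the `datasets` library. A minimal sketch follows; the config and split names are assumptions, so check each dataset card for the available options:

```python
from datasets import load_dataset

# Load the PolyWrite benchmark (split name is an assumption; see the dataset card).
polywrite = load_dataset("MaLA-LM/PolyWrite", split="train")

# Stream the MaLA corpus rather than downloading all 74B tokens at once.
mala = load_dataset("MaLA-LM/MaLA", streaming=True)
```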

---

## Usage

You can use **EMMA-500** for multilingual text generation. Below is an example of generating text with the model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MaLA-LM/emma-500-llama2-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt")
# Bound the continuation length; generate() defaults to a short output otherwise.
outputs = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
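
For quick experiments, the same model can also be driven through the `pipeline` API; this is a minimal sketch and the French prompt is purely illustrative:

```python
from transformers import pipeline

# Wraps tokenizer and model loading plus generation in a single call.
generator = pipeline("text-generation", model="MaLA-LM/emma-500-llama2-7b")

# Multilingual prompt; max_new_tokens bounds the continuation length.
print(generator("Il était une fois", max_new_tokens=50)[0]["generated_text"])
```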

---

## Model Performance

**EMMA-500** was evaluated across multiple benchmarks and tasks, demonstrating:

- The **lowest negative log-likelihood** among compared models in intrinsic evaluations.
- Significant improvements in **commonsense reasoning**, **machine translation**, and **open-ended generation**.
- The **best results among Llama 2-based models** in **text classification** and **natural language inference**.
- Enhanced performance in **code generation** and **machine reading comprehension (MRC)**.

Challenges remain in low-resource languages, where the model tends to produce higher **Self-BLEU** scores, indicating reduced output diversity.
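
For reference, Self-BLEU scores a set of generations by treating each sample as a hypothesis and the remaining samples as references, so higher values mean the outputs resemble each other more. A minimal sketch using `sacrebleu` (this is not the evaluation code used for the model; it only illustrates the metric):

```python
import sacrebleu

def self_bleu(samples: list[str]) -> float:
    """Average BLEU of each sample against all others (higher = less diverse)."""
    scores = []
    for i, hyp in enumerate(samples):
        refs = [s for j, s in enumerate(samples) if j != i]
        scores.append(sacrebleu.sentence_bleu(hyp, refs).score)
    return sum(scores) / len(scores)

# Near-identical generations yield a high Self-BLEU.
print(self_bleu([
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a dog ran in the park",
]))
```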

---