	---
	license: apache-2.0
	language:
	- en
	- de
	- es
	- fr
	- it
	- pt
	- pl
	- nl
	- tr
	- sv
	- cs
	- el
	- hu
	- ro
	- fi
	- uk
	- sl
	- sk
	- da
	- lt
	- lv
	- et
	- bg
	- no
	- ca
	- hr
	- ga
	- mt
	- gl
	- zh
	- ru
	- ko
	- ja
	- ar
	- hi
	---
	# Model Card for EuroLLM-1.7B


	This is the model card for the first pre-trained model of the EuroLLM series: EuroLLM-1.7B. An instruction-tuned version is also available: [EuroLLM-1.7B-Instruct](https://huggingface.co/utter-project/EuroLLM-1.7B-Instruct).

	- Developed by: Unbabel, Instituto Superior Técnico, University of Edinburgh, Aveni, University of Paris-Saclay, University of Amsterdam, Naver Labs, Sorbonne Université, University of Turku, University of Oslo.
	- Funded by: European Union.
	- Model type: A 1.7B parameter multilingual transformer LLM.
	- Language(s) (NLP): Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Arabic, Catalan, Chinese, Galician, Hindi, Japanese, Korean, Norwegian, Russian, Turkish, and Ukrainian.
	- License: Apache License 2.0.

	## Model Details

	The EuroLLM project aims to create a suite of LLMs capable of understanding and generating text in all European Union languages as well as some additional relevant languages.
	EuroLLM-1.7B is a 1.7B parameter model trained on 4 trillion tokens divided across the considered languages and several data sources: web data, parallel data (en-xx and xx-en), and high-quality datasets.
	EuroLLM-1.7B-Instruct was further instruction-tuned on EuroBlocks, an instruction tuning dataset with a focus on general instruction-following and machine translation.


	### Model Description

	EuroLLM uses a standard, dense Transformer architecture:
	- We use grouped query attention (GQA) with 8 key-value heads, since it has been shown to increase speed at inference time while maintaining downstream performance.
	- We perform pre-layer normalization, since it improves training stability, and use RMSNorm, which is faster.
	- We use the SwiGLU activation function, since it has been shown to lead to good results on downstream tasks.
	- We use rotary positional embeddings (RoPE) in every layer, since these have been shown to lead to good performance while allowing the extension of the context length.
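
To make the GQA bullet concrete, here is a minimal arithmetic sketch of how 16 query heads can share 8 key-value heads and what that does to the key-value cache size. The layer count, head counts, embedding size, and sequence length are taken from the hyper-parameter table in this card. The head dimension (embedding size divided by the number of heads), the contiguous grouping of query heads, and a cache stored in BF16 are assumptions made for illustration, and the function names are hypothetical.

```python
# Hyper-parameters taken from the table in this model card.
NUM_LAYERS = 24
EMBEDDING_SIZE = 2048
NUM_HEADS = 16
NUM_KV_HEADS = 8
SEQUENCE_LENGTH = 4096

# Assumption: head dimension = embedding size / number of query heads.
HEAD_DIM = EMBEDDING_SIZE // NUM_HEADS
# Assumption: cached keys and values are stored in BF16 (2 bytes per value).
BYTES_PER_VALUE = 2


def kv_head_for_query_head(query_head: int) -> int:
    """Return the KV head shared by a query head (assumes contiguous groups)."""
    group_size = NUM_HEADS // NUM_KV_HEADS
    return query_head // group_size


def kv_cache_bytes(num_kv_heads: int, num_tokens: int) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    keys_and_values = 2
    return (keys_and_values * NUM_LAYERS * num_kv_heads
            * HEAD_DIM * BYTES_PER_VALUE * num_tokens)


mapping = [kv_head_for_query_head(h) for h in range(NUM_HEADS)]
gqa_bytes = kv_cache_bytes(NUM_KV_HEADS, SEQUENCE_LENGTH)
mha_bytes = kv_cache_bytes(NUM_HEADS, SEQUENCE_LENGTH)

print("query head -> KV head:", mapping)
print(f"KV cache with 8 KV heads:  {gqa_bytes / 2**20:.0f} MiB")
print(f"KV cache with 16 KV heads: {mha_bytes / 2**20:.0f} MiB")
```

Under these assumptions, each pair of query heads reads the same cached keys and values, so the cache for a full 4,096-token sequence is half the size it would be with one KV head per query head.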

	For pre-training, we use 256 NVIDIA H100 GPUs of the MareNostrum 5 supercomputer. We train the model with a constant batch size of 3,072 sequences, which corresponds to approximately 12 million tokens, using the Adam optimizer and BF16 precision.
	Here is a summary of the model hyper-parameters:
	\| Hyper-parameter \| Value \|
	\|--------------------------------------\|----------------------\|
	\| Sequence Length \| 4,096 \|
	\| Number of Layers \| 24 \|
	\| Embedding Size \| 2,048 \|
	\| FFN Hidden Size \| 5,632 \|
	\| Number of Heads \| 16 \|
	\| Number of KV Heads (GQA) \| 8 \|
	\| Activation Function \| SwiGLU \|
	\| Position Encodings \| RoPE (\Theta=10,000) \|
	\| Layer Norm \| RMSNorm \|
	\| Tied Embeddings \| No \|
	\| Embedding Parameters \| 0.262B \|
	\| LM Head Parameters \| 0.262B \|
	\| Non-embedding Parameters \| 1.133B \|
	\| Total Parameters \| 1.657B \|
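
The parameter counts and the batch size in tokens can be reproduced from the other rows of the table. The sketch below is a back-of-the-envelope check, not the authors' accounting: it assumes no bias terms, two RMSNorm weight vectors per layer plus one final norm, a SwiGLU feed-forward block with three weight matrices, and a vocabulary of 128,000 tokens (inferred here from 0.262B embedding parameters divided by an embedding size of 2,048).

```python
# Values from the hyper-parameter table and the pre-training description.
NUM_LAYERS = 24
EMBEDDING_SIZE = 2048
FFN_HIDDEN_SIZE = 5632
NUM_HEADS = 16
NUM_KV_HEADS = 8
SEQUENCE_LENGTH = 4096
BATCH_SIZE_SEQUENCES = 3072

# Assumption: vocabulary size inferred from 0.262B / 2,048.
VOCAB_SIZE = 128_000

head_dim = EMBEDDING_SIZE // NUM_HEADS   # assumed: 128
kv_width = NUM_KV_HEADS * head_dim       # width of the K and V projections

# Attention: Q and output projections are square; K and V are narrower (GQA).
attention = 2 * EMBEDDING_SIZE * EMBEDDING_SIZE + 2 * EMBEDDING_SIZE * kv_width
# SwiGLU feed-forward: gate, up, and down projections (assumed, no biases).
ffn = 3 * EMBEDDING_SIZE * FFN_HIDDEN_SIZE
# Two RMSNorm weight vectors per layer (assumed).
norms = 2 * EMBEDDING_SIZE

non_embedding = NUM_LAYERS * (attention + ffn + norms) + EMBEDDING_SIZE  # + final norm
embedding = VOCAB_SIZE * EMBEDDING_SIZE
lm_head = VOCAB_SIZE * EMBEDDING_SIZE    # separate matrix: embeddings are not tied
total = non_embedding + embedding + lm_head

tokens_per_batch = BATCH_SIZE_SEQUENCES * SEQUENCE_LENGTH

print(f"Embedding parameters:     {embedding / 1e9:.3f}B")
print(f"LM head parameters:       {lm_head / 1e9:.3f}B")
print(f"Non-embedding parameters: {non_embedding / 1e9:.3f}B")
print(f"Total parameters:         {total / 1e9:.3f}B")
print(f"Tokens per batch:         {tokens_per_batch:,}")
```

Under these assumptions the totals round to the values in the table (0.262B, 0.262B, 1.133B, and 1.657B), and 3,072 sequences of 4,096 tokens give 12,582,912 tokens per batch, which is the "approximately 12 million" quoted above.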

	## Run the model

	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_id = "utter-project/EuroLLM-1.7B"
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(model_id)

	text = "English: My name is EuroLLM. Portuguese:"

	inputs = tokenizer(text, return_tensors="pt")
	outputs = model.generate(**inputs, max_new_tokens=20)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
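
The snippet above prompts the base model with a `<source language>: <text> <target language>:` pattern. A small helper can build such prompts for other language pairs. This is only a sketch: `build_translation_prompt` is a hypothetical name, and the assumption that the same pattern works for other languages is extrapolated from the single example in this card.

```python
def build_translation_prompt(source_language: str, target_language: str, text: str) -> str:
    """Build a prompt in the same pattern as the example in this card.

    Hypothetical helper: the pattern is copied from the card's single
    English-to-Portuguese example and is not a documented prompt format.
    """
    return f"{source_language}: {text.strip()} {target_language}:"


prompt = build_translation_prompt("English", "Portuguese", "My name is EuroLLM.")
print(prompt)
```

The resulting string can be passed to `tokenizer(...)` in place of `text` in the snippet above.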


	## Results

	### Machine Translation

	We evaluate EuroLLM-1.7B-Instruct on several machine translation benchmarks (FLORES-200, WMT-23, and WMT-24), comparing it with [Gemma-2B](https://huggingface.co/google/gemma-2b) and [Gemma-7B](https://huggingface.co/google/gemma-7b), both also instruction-tuned on EuroBlocks.
	The results show that EuroLLM-1.7B-Instruct is substantially better than Gemma-2B in machine translation and competitive with Gemma-7B.

	#### Flores-200
	\| Model \| AVG \| AVG en-xx \| AVG xx-en \| en-ar \| en-bg \| en-ca \| en-cs \| en-da \| en-de \| en-el \| en-es-latam \| en-et \| en-fi \| en-fr \| en-ga \| en-gl \| en-hi \| en-hr \| en-hu \| en-it \| en-ja \| en-ko \| en-lt \| en-lv \| en-mt \| en-nl \| en-no \| en-pl \| en-pt-br \| en-ro \| en-ru \| en-sk \| en-sl \| en-sv \| en-tr \| en-uk \| en-zh-cn \| ar-en \| bg-en \| ca-en \| cs-en \| da-en \| de-en \| el-en \| es-latam-en \| et-en \| fi-en \| fr-en \| ga-en \| gl-en \| hi-en \| hr-en \| hu-en \| it-en \| ja-en \| ko-en \| lt-en \| lv-en \| mt-en \| nl-en \| no-en \| pl-en \| pt-br-en \| ro-en \| ru-en \| sk-en \| sl-en \| sv-en \| tr-en \| uk-en \| zh-cn-en \|
	\|--------------------------------\|------\|-----------\|-----------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|--------------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|----------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|----------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|--------------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|----------\|-------\|-------\|-------\|-------\|-------\|-------\|-------\|----------\|
	\| EuroLLM-1.7B-Instruct \| 86.10\| 85.53 \| 86.67 \| 83.87 \| 88.36 \| 84.42 \| 88.34 \| 88.77 \| 86.63 \| 86.71 \| 85.99 \| 86.98 \| 87.13 \| 87.21 \| 72.25 \| 85.97 \| 74.78 \| 82.96 \| 85.51 \| 87.77 \| 89.26 \| 86.27 \| 86.31 \| 86.22 \| 67.38 \| 86.95 \| 88.68 \| 87.38 \| 89.13 \| 88.39 \| 87.47 \| 87.51 \| 85.32 \| 89.20 \| 86.24 \| 86.33 \| 86.17 \| 85.80 \| 87.20 \| 87.53 \| 87.53 \| 89.26 \| 88.71 \| 86.49 \| 86.55 \| 87.60 \| 88.17 \| 88.90 \| 79.89 \| 87.59 \| 87.53 \| 86.10 \| 86.34 \| 87.54 \| 86.25 \| 86.08 \| 85.03 \| 85.60 \| 78.16 \| 86.80 \| 89.96 \| 85.24 \| 88.85 \| 88.42 \| 85.86 \| 87.17 \| 86.36 \| 89.48 \| 86.76 \| 86.06 \| 85.88 \|
	\| Gemma-2B-EuroBlocks \| 81.56\| 78.93 \| 84.18 \| 75.25 \| 82.46 \| 83.17 \| 82.17 \| 84.40 \| 83.20 \| 79.63 \| 84.15 \| 72.63 \| 81.00 \| 85.12 \| 38.79 \| 82.00 \| 67.00 \| 81.18 \| 78.24 \| 84.80 \| 87.08 \| 82.04 \| 73.02 \| 68.41 \| 56.67 \| 83.30 \| 86.69 \| 83.07 \| 86.82 \| 84.00 \| 84.55 \| 77.93 \| 76.19 \| 80.77 \| 79.76 \| 84.19 \| 84.10 \| 83.67 \| 85.73 \| 86.89 \| 86.38 \| 88.39 \| 88.11 \| 84.68 \| 86.11 \| 83.45 \| 86.45 \| 88.22 \| 50.88 \| 86.44 \| 85.87 \| 85.33 \| 85.16 \| 86.75 \| 85.62 \| 85.00 \| 81.55 \| 81.45 \| 67.90 \| 85.95 \| 89.05 \| 84.18 \| 88.27 \| 87.38 \| 85.13 \| 85.22 \| 83.86 \| 87.83 \| 84.96 \| 85.15 \| 85.10 \|
	\| Gemma-7B-EuroBlocks \| 86.16\| 85.49 \| 86.82 \| 83.39 \| 88.32 \| 85.82 \| 88.88 \| 89.01 \| 86.96 \| 86.62 \| 86.31 \| 84.42 \| 88.11 \| 87.46 \| 61.85 \| 86.10 \| 77.91 \| 87.01 \| 85.81 \| 87.57 \| 89.88 \| 87.24 \| 84.47 \| 83.15 \| 67.13 \| 86.50 \| 90.44 \| 87.57 \| 89.22 \| 89.13 \| 88.58 \| 86.73 \| 84.68 \| 88.16 \| 86.87 \| 88.40 \| 87.11 \| 86.65 \| 87.25 \| 88.17 \| 87.47 \| 89.59 \| 88.44 \| 86.76 \| 86.66 \| 87.55 \| 88.88 \| 88.86 \| 73.46 \| 87.63 \| 88.43 \| 87.12 \| 87.31 \| 87.49 \| 87.20 \| 87.15 \| 85.16 \| 85.96 \| 78.39 \| 86.73 \| 90.52 \| 85.38 \| 89.17 \| 88.75 \| 86.35 \| 86.82 \| 86.21 \| 89.39 \| 88.20 \| 86.45 \| 86.28 \|


	#### WMT-23
	\| Model \| AVG \| AVG en-xx \| AVG xx-en \| AVG xx-xx \| en-de \| en-cs \| en-uk \| en-ru \| en-zh-cn \| de-en \| uk-en \| ru-en \| zh-cn-en \| cs-uk \|
	\|--------------------------------\|------\|-----------\|-----------\|-----------\|-------\|-------\|-------\|-------\|----------\|-------\|-------\|-------\|----------\|-------\|
	\| EuroLLM-1.7B-Instruct \| 82.56\| 82.30 \| 82.07 \| 85.81 \| 80.99 \| 84.42 \| 80.74 \| 81.94 \| 83.42 \| 83.74 \| 85.06 \| 81.00 \| 78.49 \| 85.81 \|
	\| Gemma-2B-EuroBlocks \| 79.86\| 78.35 \| 81.32 \| 81.56 \| 76.54 \| 76.35 \| 77.62 \| 78.88 \| 82.36 \| 82.85 \| 83.83 \| 80.17 \| 78.42 \| 81.56 \|
	\| Gemma-7B-EuroBlocks \| 83.90\| 83.70 \| 83.21 \| 87.61 \| 82.15 \| 84.68 \| 83.05 \| 83.85 \| 84.79 \| 84.40 \| 85.86 \| 82.55 \| 80.01 \| 87.61 \|
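
The aggregate columns in the table above appear to be plain means over language pairs. As a worked check, the sketch below recomputes them for the EuroLLM-1.7B-Instruct row from its per-pair scores. The grouping rule (pairs starting with `en-`, pairs ending with `-en`, and everything else) is an assumption inferred from the column names.

```python
from statistics import mean

# Per-pair scores for EuroLLM-1.7B-Instruct, copied from the WMT-23 table.
scores = {
    "en-de": 80.99, "en-cs": 84.42, "en-uk": 80.74, "en-ru": 81.94, "en-zh-cn": 83.42,
    "de-en": 83.74, "uk-en": 85.06, "ru-en": 81.00, "zh-cn-en": 78.49,
    "cs-uk": 85.81,
}

# Assumed grouping: out of English, into English, and neither.
en_xx = [s for pair, s in scores.items() if pair.startswith("en-")]
xx_en = [s for pair, s in scores.items() if pair.endswith("-en")]
xx_xx = [s for pair, s in scores.items()
         if not pair.startswith("en-") and not pair.endswith("-en")]

avg_en_xx = mean(en_xx)
avg_xx_en = mean(xx_en)
avg_xx_xx = mean(xx_xx)
avg_all = mean(scores.values())  # mean over all pairs, not over the three group means

print(f"AVG {avg_all:.2f} | en-xx {avg_en_xx:.2f} | xx-en {avg_xx_en:.2f} | xx-xx {avg_xx_xx:.2f}")
```

The recomputed values match the row's reported 82.56, 82.30, 82.07, and 85.81, which suggests that the overall AVG weights every language pair equally.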


	#### WMT-24
	\| Model \| AVG \| AVG en-xx \| AVG xx-xx \| en-es-latam \| en-cs \| en-ru \| en-uk \| en-ja \| en-zh-cn \| en-hi \| cs-uk \| ja-zh-cn \|
	\|---------\|------\|------\|-------\|-------\|-------\|-------\|--------\|--------\|-------\|-------\|-------\|-----\|
	\| EuroLLM-1.7B-Instruct\| 78.45\|78.65\|77.67\|79.05\|80.93\|80.33\|78.05\|78.72\|81.87\|80.15\|70.10\|82.65\|72.69\|
	\|Gemma-2B-EuroBlocks\| 74.71\|74.25\|76.57\|75.21\|78.84\|70.40\|74.44\|75.55\|78.32\|78.70\|62.51\|79.97\|73.17\|
	\|Gemma-7B-EuroBlocks\| 80.88\|80.45\|82.60\|80.43\|81.91\|80.14\|80.32\|82.17\|84.08\|81.86\|72.71\|85.55\|79.65\|

	### General Benchmarks
	We also compare EuroLLM-1.7B with [TinyLlama-1.1-3T](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T) and [Gemma-2B](https://huggingface.co/google/gemma-2b) on three general benchmarks: Arc Challenge, Hellaswag, and MMLU.
	For the non-English languages, we use the [Okapi](https://aclanthology.org/2023.emnlp-demo.28.pdf) datasets.
	The results show that EuroLLM-1.7B outperforms TinyLlama-1.1-3T on all three benchmarks. Compared with Gemma-2B, it is similar on Hellaswag but worse on Arc Challenge and MMLU. This may be due to the lower number of parameters of EuroLLM-1.7B (1.133B non-embedding parameters against 1.981B for Gemma-2B).

	#### Arc Challenge
	\| Model \| Average \| English \| German \| Spanish \| French \| Italian \| Portuguese \| Chinese \| Russian \| Dutch \| Arabic \| Swedish \| Hindi \| Hungarian \| Romanian \| Ukrainian \| Danish \| Catalan \|
	\|--------------------\|---------\|---------\|--------\|---------\|--------\|---------\|------------\|---------\|---------\|-------\|--------\|---------\|--------\|-----------\|----------\|-----------\|--------\|---------\|
	\| EuroLLM-1.7B \| 0.3130 \| 0.4215 \| 0.3148 \| 0.3376 \| 0.3259 \| 0.3396 \| 0.3410 \| 0.3068 \| 0.2626 \| 0.3037\| 0.2652 \| 0.3279 \| 0.2688 \| 0.3039 \| 0.3085 \| 0.2943 \| 0.2956 \| 0.3027 \|
	\| TinyLlama-1.1-3T \| 0.2621 \| 0.3473 \| 0.2541 \| 0.2726 \| 0.2797 \| 0.2643 \| 0.2829 \| 0.2573 \| 0.2421 \| 0.2404\| 0.2335 \| 0.2661 \| 0.2337 \| 0.2440 \| 0.2536 \| 0.2626 \| 0.2476 \| 0.2736 \|
	\| Gemma-2B \| 0.3617 \| 0.4846 \| 0.3755 \| 0.3940 \| 0.4080 \| 0.3687 \| 0.3872 \| 0.3726 \| 0.3456 \| 0.3328\| 0.3122 \| 0.3519 \| 0.2851 \| 0.3039 \| 0.3590 \| 0.3601 \| 0.3565 \| 0.3516 \|

	#### Hellaswag
	\| Model \| Average \| English \| German \| Spanish \| French \| Italian \| Portuguese \| Russian \| Dutch \| Arabic \| Swedish \| Hindi \| Hungarian \| Romanian \| Ukrainian \| Danish \| Catalan \|
	\|--------------------\|---------\|---------\|--------\|---------\|--------\|---------\|------------\|---------\|--------\|--------\|---------\|--------\|-----------\|----------\|-----------\|--------\|---------\|
	\| EuroLLM-1.7B \| 0.4653 \| 0.6199 \| 0.4653 \| 0.5187 \| 0.5173 \| 0.5024 \| 0.5116 \| 0.4582 \| 0.4821 \| 0.3939 \| 0.4722 \| 0.3505 \| 0.3970 \| 0.4441 \| 0.4224 \| 0.4556 \| 0.4329 \|
	\| TinyLlama-1.1-3T \| 0.3710 \| 0.6027 \| 0.3652 \| 0.4136 \| 0.4104 \| 0.3780 \| 0.4008 \| 0.3544 \| 0.3637 \| 0.2981 \| 0.3569 \| 0.2904 \| 0.3147 \| 0.3337 \| 0.3440 \| 0.3464 \| 0.3628 \|
	\| Gemma-2B \| 0.4666 \| 0.7165 \| 0.4756 \| 0.5414 \| 0.5180 \| 0.4841 \| 0.5081 \| 0.4664 \| 0.4655 \| 0.3868 \| 0.4383 \| 0.3413 \| 0.3710 \| 0.4316 \| 0.4291 \| 0.4471 \| 0.4448 \|

	#### MMLU
	\| Model \| Average \| English \| German \| Spanish \| French \| Italian \| Portuguese \| Chinese \| Russian \| Dutch \| Arabic \| Swedish \| Hindi \| Hungarian \| Romanian \| Ukrainian \| Danish \| Catalan \|
	\|--------------------\|---------\|---------\|--------\|---------\|--------\|---------\|------------\|---------\|---------\|--------\|--------\|---------\|--------\|-----------\|----------\|-----------\|--------\|---------\|
	\| EuroLLM-1.7B \| 0.2631 \| 0.2553 \| 0.2626 \| 0.2653 \| 0.2589 \| 0.2628 \| 0.2634 \| 0.2546 \| 0.2626 \| 0.2677 \| 0.2608 \| 0.2656 \| 0.2690 \| 0.2551 \| 0.2677 \| 0.2655 \| 0.2675 \| 0.2689 \|
	\| TinyLlama-1.1-3T \| 0.2546 \| 0.2604 \| 0.2498 \| 0.2528 \| 0.2535 \| 0.2531 \| 0.2511 \| 0.2629 \| 0.2541 \| 0.2521 \| 0.2591 \| 0.2528 \| 0.2550 \| 0.2566 \| 0.2548 \| 0.2651 \| 0.2419 \| 0.2528 \|
	\| Gemma-2B \| 0.3356 \| 0.4168 \| 0.3519 \| 0.3475 \| 0.3463 \| 0.3433 \| 0.3383 \| 0.3345 \| 0.3261 \| 0.3429 \| 0.3158 \| 0.3318 \| 0.2842 \| 0.3185 \| 0.3243 \| 0.3152 \| 0.3377 \| 0.3307 \|