---
datasets:
- HuggingFaceFW/fineweb-2
- HuggingFaceFW/fineweb-edu
- bigcode/starcoderdata
- HuggingFaceTB/finemath
language:
- fi
- en
license: llama3.1
library_name: transformers
pipeline_tag: text-generation
---

# Poro 2 70B Base Model Card

Poro 2 70B Base is a 70B-parameter decoder-only transformer created through continued pretraining of Llama 3.1 70B to add Finnish language capabilities. It was trained on 165B tokens using a carefully balanced mix of Finnish, English, code, and math data. Poro 2 is a fully open source model and is made available under the Llama 3.1 Community License.

Poro 2 was created in a collaboration between [AMD Silo AI](https://www.amd.com/en/solutions/ai/silo-ai.html), the [TurkuNLP group](https://turkunlp.org/) of the University of Turku, and [High Performance Language Technologies](https://hplt-project.org/) (HPLT). Training was conducted on the [LUMI supercomputer](https://www.lumi-supercomputer.eu/), using compute resources generously provided by [CSC](https://csc.fi/) - IT Center for Science, Finland.

This model demonstrates how continued pretraining can efficiently add new language capabilities to existing models while maintaining performance in the original domains. Through the combination of English and Finnish training data, we achieve a model that substantially outperforms the base Llama 3.1 70B model in Finnish while maintaining excellent English proficiency.

For more details on our training and data curation process, check out our [Continued Pretraining Playbook](https://rocm.blogs.amd.com/artificial-intelligence/multilingual-continued-pretraining/README.html).

## Poro 2 Model Family

The Poro 2 model family includes both 8B and 70B models, each released in three versions: a base model, a post-training SFT-only checkpoint, and the final instruct model, which is the SFT model plus a round of DPO.

| Model | Based on | Base Model | SFT | Instruct |
| :---: | :------: | :--------: | :-: | :------- |
| Poro 2 8B | Llama 3.1 8B | [Poro 2 8B Base](https://huggingface.co/LumiOpen/Llama-Poro-2-8B-base) | [Poro 2 8B SFT](https://huggingface.co/LumiOpen/Llama-Poro-2-8B-SFT) | [Poro 2 8B Instruct](https://huggingface.co/LumiOpen/Llama-Poro-2-8B-Instruct) |
| Poro 2 70B | Llama 3.1 70B | [Poro 2 70B Base](https://huggingface.co/LumiOpen/Llama-Poro-2-70B-base) | [Poro 2 70B SFT](https://huggingface.co/LumiOpen/Llama-Poro-2-70B-SFT) | [Poro 2 70B Instruct](https://huggingface.co/LumiOpen/Llama-Poro-2-70B-Instruct) |

_What does Poro mean?_ Poro is the Finnish word for reindeer! 🦌 These animals are native to Finland and hold a significant and historical role in Finnish culture.

## Model Overview

**NOTE:** This is a base model which needs further fine-tuning for most use cases.

Poro 2 70B is based on the Llama 3.1 70B architecture and uses continued pretraining to add Finnish language capabilities.

| Hyperparameter | Value |
| :------------- | :----: |
| n_parameters | 70.55B |
| n_layers | 80 |
| n_heads | 64 |
| n_kv_heads | 8 |
| d_model | 8192 |
| vocab_size | 128256 |
| max_sequence_length | 8192 |
| base_model | Llama-3.1-70B |

## Training

Poro 2 70B was created through continued pretraining on the LUMI supercomputer, using AMD MI250X GPUs. Training used a 3D parallelism strategy with TP=8 and PP=8, and was conducted using a custom version of the Megatron-LM framework. Our code is available at [https://github.com/LumiOpen/Megatron-LM-lumi](https://github.com/LumiOpen/Megatron-LM-lumi).
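To put this setup in perspective, the sketch below works through what the TP=8, PP=8 layout and the 165B-token budget imply. It is rough, illustrative arithmetic based only on the numbers reported in this card, not an excerpt from the actual training scripts.

```python
# Back-of-the-envelope arithmetic only -- not part of the training code.
n_params = 70.55e9             # total parameters (Model Overview table)
tp, pp = 8, 8                  # tensor- and pipeline-parallel sizes
model_parallel_gpus = tp * pp  # GPUs holding one copy of the model

# Each model replica is sharded across 64 GPUs, so every GPU holds on the
# order of ~1.1B parameters (ignoring embeddings and uneven splits).
print(f"~{n_params / model_parallel_gpus / 1e9:.1f}B parameters per GPU shard")

# With a global batch size of 512 sequences of 8192 tokens
# (see Training Hyperparameters below), one optimizer step consumes:
tokens_per_step = 512 * 8192
print(f"~{tokens_per_step / 1e6:.1f}M tokens per optimizer step")

# ...so the 165B-token budget corresponds to roughly this many steps:
print(f"~{165e9 / tokens_per_step:,.0f} optimizer steps in total")
```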
## Training Hyperparameters

| Hyperparameter | Value | Comment |
| :------------: | :---: | :------:|
| Precision | bfloat16 | |
| Optimizer | AdamW | |
| Learning rate | 1.5e-4 | |
| LR scheduler | cosine | Warmup ratio 0.05, min LR 1e-8 |
| Weight decay | 1e-1 | |
| Global batch size | 512 | |
| Micro batch size | 1 | |
| Max sequence length | 8192 | |
| Total tokens | 165B | 1 epoch |

## Dataset

Poro 2 70B was trained on a balanced 165B-token dataset designed to maintain English, code, and math capabilities while adding Finnish proficiency.

| Dataset | Source | Percentage | Tokens |
| :-----: | :----: | :--------: | :----: |
| Finnish | FineWeb2 | 30% | 50B |
| English | FineWeb-Edu | 30% | 50B |
| Code | StarCoder | 30% | 50B |
| Math | FineMath | 10% | 16B |
| **Total** | | **100%** | **165B** |

## Evaluation Results

Poro 2 70B shows substantial improvements in Finnish capabilities over Llama 3.1 70B, while maintaining, and in some cases improving, English performance.

### Finnish Performance

| | Poro 2 70B | Llama 3.1 70B |
|-----------------|------------------|----------------|
| ARC Challenge | **61.01** | 54.52 |
| HellaSwag | **58.07** | 52.10 |
| MMLU | **73.76** | 71.29 |
| TruthfulQA | **55.53** | 53.64 |
| GSM8K | **72.78** | 69.90 |

### English Performance

| | Poro 2 70B | Llama 3.1 70B |
|-----------------|------------------|----------------|
| ARC Challenge | **69.97** | 69.45 |
| HellaSwag | **87.85** | 87.81 |
| MMLU | 78.20 | **78.59** |
| TruthfulQA | **51.43** | 49.78 |
| GSM8K | **81.35** | 81.05 |

### Translation Performance

| | Poro 2 70B | Llama 3.1 70B |
|----------------------------|--------|----------------|
| EN→FI BLEU | **40.03** | 35.02 |
| FI→EN BLEU | **43.04** | 41.67 |
| EN→FI chrF | **62.50** | 59.16 |
| FI→EN chrF | **64.16** | 63.03 |

### Code Performance

| | Poro 2 70B | Llama 3.1 70B |
|----------------------------|--------|----------------|
| HumanEval pass@10 | **71.34** | 64.63 |

**Overall**: a ~4 percentage point average improvement on the Finnish benchmarks, while excellent English performance is maintained (a slight average improvement of ~0.4 percentage points).

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Repository name as linked in the model family table above.
model_name = "LumiOpen/Llama-Poro-2-70B-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Example usage: plain text completion (this is a base model, so no chat template).
prompt = "Kerro minulle Suomesta."  # "Tell me about Finland" in Finnish
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
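If you prefer the higher-level `pipeline` API from `transformers`, the minimal sketch below runs the same completion. It assumes the same repository name and hardware settings as the example above; adjust the dtype, device map, and sampling parameters to your environment.

```python
from transformers import pipeline
import torch

# Minimal sketch using the high-level pipeline API (same assumptions as above).
generator = pipeline(
    "text-generation",
    model="LumiOpen/Llama-Poro-2-70B-base",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Base model: plain continuation of a Finnish prompt, no chat template.
result = generator(
    "Kerro minulle Suomesta.",  # "Tell me about Finland" in Finnish
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
)
print(result[0]["generated_text"])
```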
## Ethical Considerations and Limitations

Poro 2 70B is an advanced language model optimized for English and Finnish, with additional capabilities in code and mathematics. As with most AI-driven systems, Poro 2 is a product of the vast data it has been trained on, which may reflect the imperfections, biases, and idiosyncrasies of the wider web. The model may, at times, produce outputs that can be considered inaccurate, prejudiced, or controversial.

Key limitations:

- Limited proficiency in languages other than English and Finnish
- Potential for generating biased or inappropriate content
- May produce factually incorrect information

## License

Built with Llama

Poro 2 70B is released under the Llama 3.1 Community License. Please review the license terms before use.

## Citation

```bibtex
@misc{poro2_2025,
  title={Poro 2: Continued Pretraining for Language Acquisition},
  author={Elaine Zosa and Jouni Louma and Kai Hakala and Antti Virtanen and Mika Koistinen and Risto Luukkonen and Akseli Reunamo and Sampo Pyysalo and Jonathan Burdge},
  year={2025},
  howpublished={LumiOpen}
}
```

## Acknowledgments

We thank CSC - IT Center for Science, Finland, for providing access to the LUMI supercomputer. This work was supported by the High Performance Language Technologies (HPLT) project and conducted in collaboration with TurkuNLP from the University of Turku. This project has received funding from the European Union's Horizon Europe research and innovation programme under grant agreement No 101070350.