---
license: apache-2.0
datasets:
- faur-ai/fulg
language:
- ro
---

# LLMic Model Card

[LLMic: Romanian Foundation Language Model](https://arxiv.org/abs/2501.07721)

## Model Summary

LLMic is a bilingual Romanian-English foundation model: a 3B-parameter dense decoder-only Transformer based on the Llama2 architecture.

## Architecture

| Parameter | Value |
|-----------|-------|
| Sequence Length | 2048 |
| Number of Layers | 24 |
| Embedding Size | 2,560 |
| FFN Hidden Size | 10,240 |
| Number of Heads | 20 |
| Number of KV Heads | 5 |
| Activation Function | SiLU |
| Position Encodings | RoPE (Θ=500,000) |
| Layer Norm | RMSNorm (ε=10⁻⁵) |
| Tied Embeddings | No |

## Intended Use

Our model is designed to accelerate research on Romanian language models and to serve as a building block for generative AI applications.

## Use with transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

device = "cuda"
model_id = "faur-ai/LLMic"
prompt = "Capitala României este"

model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
streamer = TextStreamer(tokenizer)

inputs = tokenizer.encode(
    prompt,
    add_special_tokens=False,
    return_tensors="pt",
).to(device)

outputs = model.generate(
    input_ids=inputs,
    streamer=streamer,
    max_new_tokens=64,  # cap the generated continuation
    temperature=0.8,
    do_sample=True,
)
```

## Data Overview

### Training Datasets

| Source | Size |
|--------|------|
| *Romanian (300B)* | |
| Web Sources | 621 GB |
| Discussions, Curated & Parallel | 10 GB |
| *English (700B)* | |
| FineWebEdu | -- |
| Dolma Subset | 109 GB |

#### Benchmark datasets

We evaluated LLMic on the WMT16 English-to-Romanian machine translation benchmark.
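Scores on WMT16-style translation benchmarks are conventionally corpus BLEU. As an illustration only — this is not the paper's evaluation tooling, and it is a simplified single-reference, whitespace-tokenized variant — a minimal sketch of how such a score is computed:

```python
import math
from collections import Counter


def _ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Toy single-reference corpus BLEU (0-100) over whitespace tokens."""
    matches, totals = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = _ngrams(h, n), _ngrams(r, n)
            # clipped n-gram matches against the reference
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(matches) == 0:  # no smoothing: any empty order zeroes the score
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)


# A hypothesis identical to its reference scores 100.
print(corpus_bleu(["capitala româniei este bucurești"],
                  ["capitala româniei este bucurești"]))
```

Production evaluations would instead use a standardized tool such as sacreBLEU, which fixes tokenization and supports multiple references.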
| Model | Score |
|-------|-------|
| LLMic | 41.01 |
| mBART | 38.50 |
| Llama-3.1-8B-Instruct | 29.02 |
| RoMistral-7b-Instruct | 27.70 |
| RoLlama3-8b-Instruct | 27.31 |
| Mistral-7B-Instruct-v0.2 | 26.19 |
| RoGemma-7b-Instruct | 25.96 |
| Gemma-1.1-7b-it | 25.48 |

## Citation

**BibTeX:**

```
@misc{bădoiu2025llmicromanianfoundationlanguage,
      title={LLMic: Romanian Foundation Language Model},
      author={Vlad-Andrei Bădoiu and Mihai-Valentin Dumitru and Alexandru M. Gherghescu and Alexandru Agache and Costin Raiciu},
      year={2025},
      eprint={2501.07721},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.07721},
}
```
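For reference, the architecture table above fixes every non-embedding parameter of the Llama-style network; only the vocabulary size, which the table does not list, is needed to complete the count. A sketch, using Llama2's 32,000-token vocabulary purely as a placeholder (the remaining gap to the reported 3B presumably comes from a larger vocabulary):

```python
def llama_param_count(d_model=2560, ffn=10240, n_layers=24,
                      n_heads=20, n_kv_heads=5, vocab_size=32_000,
                      tied_embeddings=False):
    """Approximate parameter count for a Llama-style decoder-only model."""
    head_dim = d_model // n_heads            # 128
    kv_dim = n_kv_heads * head_dim           # 640 (grouped-query attention)
    attn = 2 * d_model * d_model             # q_proj + o_proj
    attn += 2 * d_model * kv_dim             # k_proj + v_proj
    mlp = 3 * d_model * ffn                  # gate, up, down projections (SiLU MLP)
    norms = 2 * d_model                      # two RMSNorms per layer
    per_layer = attn + mlp + norms
    emb = vocab_size * d_model * (1 if tied_embeddings else 2)
    return n_layers * per_layer + emb + d_model  # + final RMSNorm

print(f"{llama_param_count():,}")  # → 2,444,618,240 with the placeholder vocab
```

Note that with untied embeddings (as in the table), the input embedding and the output projection are counted separately.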