---
license: gemma
pipeline_tag: text-generation
language:
- en
- es
- fr
- de
- pt
- ja
- ko
- zh
- ar
- ru
- hi
library_name: transformers
tags:
- gemma
- eeve
- rosetta
- grpo
- yanolja
---

# yanolja/EEVE-Rosetta-4B-2507

This model is a fine-tuned version of [`google/gemma-3-4b-pt`](https://huggingface.co/google/gemma-3-4b-pt). Because it is intended solely for text generation, we extracted and use only the `Gemma3ForCausalLM` component from the original architecture.

While the model name includes "EEVE," our well-known model brand, this specific model does not feature an expanded tokenizer. The `EEVE` branding reflects our commitment to developing high-quality, multilingual models.

- **Model Name:** `yanolja/EEVE-Rosetta-4B-2507`
- **Base Model:** `google/gemma-3-4b-pt`

## Model Description

This model is a 4-billion-parameter, decoder-only language model built on the Gemma3 architecture and fine-tuned by Yanolja NEXT. It is specifically designed to translate structured data (JSON format) while preserving the original data structure.

The model was trained on a multilingual dataset covering the following languages:

- English
- Spanish
- French
- German
- Portuguese
- Japanese
- Korean
- Chinese
- Arabic
- Russian
- Hindi

While optimized for these languages, it may also perform effectively on other languages supported by the base Gemma3 model.

## How to use

You can use this model with the `transformers` library as follows:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "yanolja/EEVE-Rosetta-4B-2507"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Example prompt
target_language = "Spanish"
messages = [
    {
        "role": "system",
        "content": (
            f"Translate the user's text to {target_language}.\n"
            "Think through the translation step by step: first, consider the overall context, "
            "then cultural nuances, terminology, initial translation, and self-review.\n"
            "After this thought process, provide the final translation immediately."
        ),
    },
    {
        "role": "user",
        "content": "Yanolja NEXT is a company that provides global cutting-edge technology for the hospitality industry.",
    },
]

# Build the prompt with the chat template and generate the translation
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The model first outputs its thought process, followed by the final `{JSON translation}`. The output format is as follows:

```
thought process will be here
{JSON translation}
```

## Training Procedure

### Training Data

The translation datasets were compiled from several sources, including:

- [AI Hub](https://aihub.or.kr/)
- [Europarl](https://www.statmt.org/europarl/)

To enhance the model's performance with chain-of-thought capabilities, we generated a synthetic reasoning dataset. The process involved:

1. Using `DeepSeek-R1` to translate text from a source to a target language.
2. Capturing the internal reasoning steps from `DeepSeek-R1` *only* when its translation exactly matched the ground-truth target text.
3. Using this collected reasoning data to fine-tune `google/gemma-3-27b-it`. This fine-tuned model was then used to generate a comprehensive reasoning dataset for training `EEVE-Rosetta-4B-2507`.
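
To make step 2 concrete, the sketch below shows one way such exact-match filtering could be implemented. The `generate_with_reasoning` callable and the record fields are hypothetical placeholders standing in for the actual DeepSeek-R1 pipeline, which is not released.

```python
# Hypothetical sketch of the exact-match filtering used to collect reasoning traces.
# `generate_with_reasoning` stands in for a call that returns both the model's
# reasoning and its final translation; it is not a real API.

def collect_reasoning_examples(pairs, generate_with_reasoning):
    """Keep a reasoning trace only when the model's translation matches the reference."""
    kept = []
    for source_text, reference_translation, target_language in pairs:
        reasoning, translation = generate_with_reasoning(source_text, target_language)
        # Retain the trace only on an exact match with the ground-truth target text.
        if translation.strip() == reference_translation.strip():
            kept.append(
                {
                    "source": source_text,
                    "target_language": target_language,
                    "reasoning": reasoning,
                    "translation": translation,
                }
            )
    return kept
```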
## Intended Uses & Limitations

This model is intended for translating structured data (JSON format) while preserving the original structure. It is particularly well suited to tasks such as localizing product catalogs, translating hotel reviews, or handling any other structured content that requires accurate translation.

### Limitations

The model's primary focus is JSON data. Performance on unstructured text or other data formats may vary.

### License

This model is released under the Gemma license, inherited from its base model, [`google/gemma-3-4b-pt`](https://huggingface.co/google/gemma-3-4b-pt). Please consult the official [Gemma license terms](https://ai.google.dev/gemma/terms) for detailed usage guidelines.

## Citation

If you use this model, please consider citing:

```
@misc{yanolja2025eeverosetta,
  author       = {Yanolja NEXT},
  title        = {EEVE-Rosetta-4B-2507},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/yanolja/EEVE-Rosetta-4B-2507}}
}
```

## References

This work utilizes several models and datasets. We would like to acknowledge the original authors for their valuable contributions to the field.

```
@misc{gemma3,
  author       = {Google},
  title        = {Gemma 3},
  year         = {2024},
  publisher    = {Google DeepMind},
  howpublished = {\url{https://deepmind.google/models/gemma/gemma-3/}}
}

@misc{deepseekai2025deepseekr1incentivizingreasoningcapability,
  title         = {DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning},
  author        = {DeepSeek-AI},
  year          = {2025},
  eprint        = {2501.12948},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2501.12948}
}

@misc{aihub,
  author       = {National Information Society Agency (NIA)},
  title        = {AI-Hub: AI Integrated Platform},
  year         = {2025},
  publisher    = {National Information Society Agency},
  howpublished = {\url{https://aihub.or.kr}}
}

@article{europarl,
  author  = {Koehn, Philipp},
  title   = {Europarl: A Parallel Corpus for Statistical Machine Translation},
  journal = {MT Summit},
  year    = {2005},
  pages   = {79--86}
}
```