---
license: gemma
pipeline_tag: text-generation
language:
- en
- es
- fr
- de
- pt
- ja
- ko
- zh
- ar
- ru
- hi
library_name: transformers
tags:
- gemma
- eeve
- rosetta
- grpo
- yanolja
---

# yanolja/EEVE-Rosetta-4B-2507

This model is a fine-tuned version of [`google/gemma-3-4b-pt`](https://huggingface.co/google/gemma-3-4b-pt). Because it is intended solely for text generation, we extracted and use only the `Gemma3ForCausalLM` component from the original architecture.

While the model name includes "EEVE," our well-known model brand, this specific model does not feature an expanded tokenizer. The `EEVE` branding reflects our commitment to developing high-quality, multilingual models.

- **Model Name:** `yanolja/EEVE-Rosetta-4B-2507`
- **Base Model:** `google/gemma-3-4b-pt`

## Model Description

This model is a 4-billion-parameter, decoder-only language model built on the Gemma3 architecture and fine-tuned by Yanolja NEXT. It is specifically designed to translate structured data (JSON format) while preserving the original data structure.

The model was trained on a multilingual dataset covering the following languages:

- English
- Spanish
- French
- German
- Portuguese
- Japanese
- Korean
- Chinese
- Arabic
- Russian
- Hindi

While optimized for these languages, it may also perform effectively on other languages supported by the base Gemma3 model.

## How to use

You can use this model with the `transformers` library as follows:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "yanolja/EEVE-Rosetta-4B-2507"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Example prompt
target_language = "Spanish"
messages = [
    {
        "role": "system",
        "content": (
            f"Translate the user's text to {target_language}.\n"
            "Think through the translation step by step: first, consider the overall context, "
            "then cultural nuances, terminology, initial translation, and self-review.\n"
            "After this thought process, provide the final translation immediately."
        ),
    },
    {
        "role": "user",
        "content": "Yanolja NEXT is a company that provides global cutting-edge technology for the hospitality industry.",
    },
]

# Build the prompt with the chat template and generate the translation
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=4096)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The model first outputs its thought process, followed by the final `{JSON translation}`. The output format is as follows:

```
thought process will be here
{JSON translation}
```

## Training Procedure

### Training Data

The translation datasets were compiled from several sources, including:

- [AI Hub](https://aihub.or.kr/)
- [Europarl](https://www.statmt.org/europarl/)

To enhance the model's performance with chain-of-thought capabilities, we generated a synthetic reasoning dataset. The process involved:

1. Using `DeepSeek-R1` to translate text from a source to a target language.
2. Capturing the internal reasoning steps from `DeepSeek-R1` *only* when its translation exactly matched the ground-truth target text.
3. Using this collected reasoning data to fine-tune `google/gemma-3-27b-it`. This fine-tuned model was then used to generate a comprehensive reasoning dataset for training `EEVE-Rosetta-4B-2507`.
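
To make step 2 concrete, the sketch below shows one way such exact-match filtering could be implemented. The `generate_with_reasoning` callable and the record fields are hypothetical placeholders standing in for the actual DeepSeek-R1 pipeline, which is not released.

```python
# Hypothetical sketch of the exact-match filtering used to collect reasoning traces.
# `generate_with_reasoning` stands in for a call that returns both the model's
# reasoning and its final translation; it is not a real API.

def collect_reasoning_examples(pairs, generate_with_reasoning):
    """Keep a reasoning trace only when the model's translation matches the reference."""
    kept = []
    for source_text, reference_translation, target_language in pairs:
        reasoning, translation = generate_with_reasoning(source_text, target_language)
        # Retain the trace only on an exact match with the ground-truth target text.
        if translation.strip() == reference_translation.strip():
            kept.append(
                {
                    "source": source_text,
                    "target_language": target_language,
                    "reasoning": reasoning,
                    "translation": translation,
                }
            )
    return kept
```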
## Intended Uses & Limitations

This model is intended for translating structured data (JSON format) while preserving the original structure. It is particularly well suited to tasks such as localizing product catalogs, translating hotel reviews, or handling any other structured content that requires accurate translation.

### Limitations

The model's primary focus is JSON data. Performance on unstructured text or other data formats may vary.

### License

This model is released under the Gemma license, inherited from its base model, [`google/gemma-3-4b-pt`](https://huggingface.co/google/gemma-3-4b-pt). Please consult the official [Gemma license terms](https://ai.google.dev/gemma/terms) for detailed usage guidelines.

## Citation

If you use this model, please consider citing:

```
@misc{yanolja2025eeverosetta,
  author       = {Yanolja NEXT},
  title        = {EEVE-Rosetta-4B-2507},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/yanolja/EEVE-Rosetta-4B-2507}}
}
```

## References

This work utilizes several models and datasets. We would like to acknowledge the original authors for their valuable contributions to the field.

```
@misc{gemma3,
  author       = {Google},
  title        = {Gemma 3},
  year         = {2024},
  publisher    = {Google DeepMind},
  howpublished = {\url{https://deepmind.google/models/gemma/gemma-3/}}
}

@misc{deepseekai2025deepseekr1incentivizingreasoningcapability,
  title         = {DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning},
  author        = {DeepSeek-AI},
  year          = {2025},
  eprint        = {2501.12948},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2501.12948}
}

@misc{aihub,
  author       = {National Information Society Agency (NIA)},
  title        = {AI-Hub: AI Integrated Platform},
  year         = {2025},
  publisher    = {National Information Society Agency},
  howpublished = {\url{https://aihub.or.kr}}
}

@article{europarl,
  author  = {Koehn, Philipp},
  title   = {Europarl: A Parallel Corpus for Statistical Machine Translation},
  journal = {MT Summit},
  year    = {2005},
  pages   = {79--86}
}
```