Llama 3.1 Future Code Ja
Llama 3.1 Future Code Ja is a large language model with 8B parameters built on top of the Meta Llama 3.1 model. The model first underwent continual pre-training on a mixture of code and mostly-Japanese natural language data. The training data comes mainly from The Stack V2 dataset and a subset of the LLM-jp Corpus v3, comprising 204.9B code tokens and 85.7B natural language tokens after carefully designed data cleaning. The model was then merged with the instruct variant of the Meta Llama 3.1 model to acquire the ability to follow general task instructions, followed by supervised fine-tuning (SFT) and direct preference optimization (DPO) on our own magpie-generated code instruction data.
The model officially supports Japanese and English as natural languages, and more than 40 programming languages ranging from popular ones such as Python and Java to legacy languages such as COBOL. In addition to causal (left-to-right) inference, the model supports Fill-in-the-Middle (FIM), in which it fills in a blank while attending to bidirectional context, a common use case in IDEs.
The model outperforms the original Llama 3.1 model in both Japanese- and English-instructed code completion tasks across various programming languages, and outperforms the Qwen model families in Japanese generation tasks, striking a good balance between specialization in code-related tasks and general ability in Japanese.
Usage
Here are sample inference scripts using transformers. We recommend using vLLM for faster inference; a minimal vLLM sketch is shown after the Chat example below.
pip install torch transformers accelerate
Chat
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "future-architect/Llama-3.1-Future-Code-Ja-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
# we recommend using the following system prompt:
# for Japanese completion : "あなたは様々なソフトウェア開発タスクをサポートするAIアシスタントです。"
# for English completion : "You are an AI assistant who support various software development tasks."
message = [
    {
        "role": "system",
        "content": "あなたは様々なソフトウェア開発タスクをサポートするAIアシスタントです。"
    },
    {
        "role": "user",
        "content": "PythonでFizzBuzzを書いてください。",
    },
]
input_ids = tokenizer.apply_chat_template(
    message, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
output = model.generate(**input_ids, max_new_tokens=1024)
print(tokenizer.decode(output[0, input_ids["input_ids"].shape[1]:]))
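For the vLLM path recommended above, here is a minimal chat sketch. It assumes a recent vLLM release that provides the LLM.chat API; the exact arguments may need adjustment for your environment.
from vllm import LLM, SamplingParams
# load the model with vLLM (assumes vLLM is installed separately: pip install vllm)
llm = LLM(model="future-architect/Llama-3.1-Future-Code-Ja-8B", dtype="bfloat16")
sampling_params = SamplingParams(temperature=0.2, top_p=0.95, max_tokens=1024)
message = [
    {"role": "system", "content": "あなたは様々なソフトウェア開発タスクをサポートするAIアシスタントです。"},
    {"role": "user", "content": "PythonでFizzBuzzを書いてください。"},
]
# LLM.chat applies the model's chat template internally
outputs = llm.chat(message, sampling_params)
print(outputs[0].outputs[0].text)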
Fill-in-the-Middle
With the idea that users may not want a line break immediately after their cursor position, we did not create any middle splits that start with a newline symbol (\n); instead, we included such newlines at the end of the prefix.
The same holds for the boundary between the suffix and middle splits, which makes the model quite sensitive to which split contains the newline symbol.
Please remove one newline symbol (if present) from the beginning of the suffix for improved performance.
You may also set a larger repetition penalty to avoid nonsensical generations consisting of repeated symbols (an example is given after the script below).
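As a concrete illustration of the newline handling described above, here is a minimal sketch of splitting the text around a cursor position before building the FIM prompt; the helper name and signature are ours, not part of the model or any library.
def split_at_cursor(text, cursor):
    """Split text at the cursor into (prefix, suffix) for FIM prompting (illustrative helper)."""
    prefix, suffix = text[:cursor], text[cursor:]
    # drop a single newline right after the cursor, mirroring how the training splits were built
    if suffix.startswith("\n"):
        suffix = suffix[1:]
    return prefix, suffix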
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
FIM_PREFIX = "<|fim_prefix|>"
FIM_MIDDLE = "<|fim_middle|>"
FIM_SUFFIX = "<|fim_suffix|>"
model_name = "future-architect/Llama-3.1-Future-Code-Ja-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
# prepend <|begin_of_text|> to indicate the beginning of the content (not of the whole sequence with special tokens)
prefix = "<|begin_of_text|>def fizzbuzz(n"
suffix = "return n"
# PSM mode (infilling)
input_txt = FIM_PREFIX + prefix + FIM_SUFFIX + suffix + FIM_MIDDLE
# SPM mode (reverse infilling)
# input_txt = FIM_PREFIX + FIM_SUFFIX + suffix + FIM_MIDDLE + prefix
# set add_special_tokens to False, so that the tokenizer does NOT add <|begin_of_text|> before special tokens
input_ids = tokenizer(input_txt, add_special_tokens=False, return_tensors="pt").to(model.device)
output = model.generate(**input_ids, max_new_tokens=1024, temperature=0.2, top_p=0.95)
print(tokenizer.decode(output[0, input_ids["input_ids"].shape[1]:]))
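If the generation degenerates into long runs of repeated symbols, you can pass a repetition penalty to generate() as suggested above; the value below is only an illustrative starting point, not a tuned recommendation.
output = model.generate(**input_ids, max_new_tokens=1024, temperature=0.2, top_p=0.95, repetition_penalty=1.1)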
Model Performance
Code completion (Japanese)
- JHumanEval (Sato et al., 2024)
- JMultiPL-E (Taneguchi et al., 2025)
Note: We do not report scores for two programming languages (Julia and Racket), which we did not include in the training data. All the scores below are pass@1 with 10 trials.
model | size | py | cpp | cs | d | go | java | js | php | pl | r | rb | rs | scala | sh | swift | ts |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Llama 3.1 Future Code Ja | 8B | 0.6335 | 0.5267 | 0.3633 | 0.1564 | 0.6286 | 0.4696 | 0.5528 | 0.4814 | 0.2919 | 0.2969 | 0.1870 | 0.4487 | 0.4425 | 0.3285 | 0.3861 | 0.5623 |
Llama 3.1 | 8B | 0.5061 | 0.4391 | 0.2835 | 0.2147 | 0.5519 | 0.3753 | 0.4640 | 0.4248 | 0.2584 | 0.2360 | 0.3112 | 0.3269 | 0.3175 | 0.2665 | 0.3323 | 0.4799 |
Llama 3.1 Swallow | 8B | 0.4213 | 0.3329 | 0.2456 | 0.1026 | 0.6370 | 0.3468 | 0.3112 | 0.3273 | 0.1758 | 0.1807 | 0.0503 | 0.2090 | 0.2487 | 0.1525 | 0.2354 | 0.3258 |
Qwen2.5 | 7B | 0.6018 | 0.5106 | 0.3601 | 0.2353 | 0.7500 | 0.5044 | 0.5416 | 0.5267 | 0.3075 | 0.3466 | 0.3683 | 0.5071 | 0.3969 | 0.3380 | 0.4576 | 0.6025 |
Qwen2.5-Coder | 7B | 0.6695 | 0.6379 | 0.4601 | 0.1660 | 0.7110 | 0.5468 | 0.6696 | 0.5894 | 0.3497 | 0.4174 | 0.3565 | 0.6032 | 0.4950 | 0.3544 | 0.5285 | 0.6358 |
Qwen3 | 8B | 0.6256 | 0.5683 | 0.3709 | 0.1583 | 0.5156 | 0.4778 | 0.5814 | 0.5547 | 0.3969 | 0.2466 | 0.3217 | 0.4763 | 0.4075 | 0.3418 | 0.3715 | 0.5239 |
Gemma 2 | 9B | 0.5549 | 0.4590 | 0.3608 | 0.0897 | 0.7052 | 0.4601 | 0.2863 | 0.4733 | 0.1099 | 0.1615 | 0.1205 | 0.3417 | 0.3850 | 0.1209 | 0.3272 | 0.2346 |
Code completion (English)
Note: We do not report scores for two programming languages (Julia and Racket), which we did not include in the training data. All the scores below are pass@1 with 10 trials.
model | size | py | cpp | cs | d | go | java | js | php | pl | r | rb | rs | scala | sh | swift | ts |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Llama 3.1 Future Code Ja | 8B | 0.6835 | 0.5795 | 0.3829 | 0.1692 | 0.6279 | 0.4987 | 0.6149 | 0.5565 | 0.3652 | 0.3317 | 0.1752 | 0.4846 | 0.4662 | 0.3595 | 0.4525 | 0.6390 |
Llama 3.1 | 8B | 0.6311 | 0.4795 | 0.3184 | 0.2083 | 0.5909 | 0.4715 | 0.5571 | 0.4658 | 0.3236 | 0.2696 | 0.4267 | 0.3744 | 0.3856 | 0.2994 | 0.3741 | 0.5717 |
Llama 3.1 Swallow | 8B | 0.4701 | 0.3720 | 0.2646 | 0.1224 | 0.6519 | 0.3759 | 0.3006 | 0.3733 | 0.1752 | 0.1447 | 0.0590 | 0.2103 | 0.2744 | 0.1614 | 0.2190 | 0.3786 |
Qwen2.5 | 7B | 0.6732 | 0.5491 | 0.4253 | 0.2455 | 0.7000 | 0.6013 | 0.6137 | 0.5913 | 0.3373 | 0.3832 | 0.4429 | 0.5923 | 0.4263 | 0.3715 | 0.5095 | 0.6535 |
Qwen2.5-Coder | 7B | 0.7890 | 0.7373 | 0.5152 | 0.1936 | 0.3935 | 0.6184 | 0.7385 | 0.6528 | 0.3969 | 0.4224 | 0.4230 | 0.6545 | 0.5725 | 0.4158 | 0.5797 | 0.7434 |
Qwen3 | 8B | 0.7134 | 0.6702 | 0.4285 | 0.2295 | 0.4721 | 0.5747 | 0.6602 | 0.6236 | 0.4441 | 0.3627 | 0.4261 | 0.6154 | 0.5363 | 0.4089 | 0.4304 | 0.6082 |
Gemma 2 | 9B | 0.6128 | 0.5118 | 0.3728 | 0.1045 | 0.6552 | 0.4791 | 0.3758 | 0.4863 | 0.0783 | 0.1186 | 0.0795 | 0.3853 | 0.4162 | 0.1437 | 0.3506 | 0.3723 |
Fill-in-the-Middle
- SantaCoder-FIM (Allal et al., 2023)
Note: The models marked with an asterisk (*) do not natively support FIM. We used the SPM prompt from Gong et al., 2024 and truncated the generated output just before the point that matched the beginning of the provided suffix (a minimal sketch of this truncation is given after the table). The scores of the Llama models in PSM mode are not reported here because they were close to zero in all of those settings. All the scores below are exact match (EM) with 1 trial.
model | size | PSM (py) | SPM (py) | PSM (js) | SPM (js) | PSM (java) | SPM (java) |
---|---|---|---|---|---|---|---|
Llama 3.1 Future Code Ja | 8B | 0.5216 | 0.5139 | 0.6018 | 0.6049 | 0.5517 | 0.5478 |
Qwen2.5-Coder | 7B | 0.5829 | 0.4084 | 0.6612 | 0.5597 | 0.6433 | 0.6180 |
Llama 3.1 8B * | 8B | - | 0.4468 | - | 0.3951 | - | 0.3506 |
Llama 3.1 70B * | 70B | - | 0.5964 | - | 0.5084 | - | 0.2910 |
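For reference, a minimal sketch of the truncation described in the note above; the helper name and the probe length are our own illustrative choices.
def truncate_before_suffix(generated, suffix, probe_len=16):
    # cut the generation just before the first point that matches the beginning of the provided suffix
    probe = suffix[:probe_len]
    idx = generated.find(probe) if probe else -1
    return generated if idx == -1 else generated[:idx]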
Japanese tasks
- JCommonSenseQA (Kurihara et al., 2022, Exact Match)
- JEMHopQA (Ishii et al., 2024, chr-F1)
- NIILC (Sekine, 2003, chr-F1)
- JSQuAD (Kurihara et al., 2022, chr-F1)
- XL-Sum (Hasan et al., 2021, ROUGE-2)
- MGSM (Shi et al., 2023, Exact Match)
- WMT20 en-ja (Barrault et al., 2020, BLEU)
- WMT20 ja-en (Barrault et al., 2020, BLEU)
model | size | JCommonsenseQA | JEMHopQA | NIILC | JSQuAD | XL-SUM | MGSM | WMT20 en-ja | WMT20 ja-en |
---|---|---|---|---|---|---|---|---|---|
Llama 3.1 Future Code Ja | 8B | 0.9124 | 0.4983 | 0.5118 | 0.8758 | 0.1779 | 0.5480 | 0.2624 | 0.2028 |
Llama 3.1 | 8B | 0.8829 | 0.4537 | 0.4050 | 0.8868 | 0.1486 | 0.5080 | 0.2195 | 0.2008 |
Llama 3.1 Swallow | 8B | 0.9240 | 0.5228 | 0.5805 | 0.8957 | 0.1920 | 0.5480 | 0.2818 | 0.2263 |
Qwen2.5 | 7B | 0.9142 | 0.4394 | 0.3998 | 0.8908 | 0.1690 | 0.6240 | 0.2091 | 0.1909 |
Qwen2.5-Coder | 7B | 0.8472 | 0.3014 | 0.3045 | 0.8906 | 0.1533 | 0.5360 | 0.1816 | 0.1598 |
Qwen3 | 8B | 0.9169 | 0.4265 | 0.4197 | 0.8943 | 0.1882 | 0.7720 | 0.2450 | 0.2133 |
Gemma 2 | 9B | 0.9312 | 0.5288 | 0.5306 | 0.8774 | 0.0873 | 0.4680 | 0.2305 | 0.2017 |
English tasks
- TriviaQA (Joshi et al., 2017, Exact Match)
- SQuAD2 (Rajpurkar et al., 2018, Exact Match)
- GSM8K (Cobbe et al., 2021, Exact Match)
model | size | TriviaQA | SQuAD2 | GSM8K |
---|---|---|---|---|
Llama 3.1 Future Code Ja | 8B | 0.6233 | 0.3754 | 0.7111 |
Llama 3.1 | 8B | 0.6991 | 0.3784 | 0.7475 |
Llama 3.1 Swallow | 8B | 0.6296 | 0.3628 | 0.6126 |
Qwen2.5 | 7B | 0.5176 | 0.2624 | 0.7430 |
Qwen2.5-Coder | 7B | 0.4517 | 0.3388 | 0.7020 |
Qwen3 | 8B | 0.5631 | 0.3922 | 0.8749 |
Gemma 2 | 9B | 0.6573 | 0.3944 | 0.7908 |
Evaluation Details
We used the Code Generation LM Evaluation Harness toolkit to evaluate code completion and FIM capabilities.
We adopted the settings below for decoding.
We mostly followed the recommended settings; however, we set max_new_tokens instead of max_tokens to avoid truncation when handling long input sequences (the settings are also expressed as generate() keyword arguments after the list below).
- Temperature: 0.2
- Top-p: 0.95
- Number of completions to generate: 10 (for completion tasks), 1 (for FIM tasks)
- Maximum number of new tokens: 512
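For clarity, the decoding settings above can be expressed roughly as transformers generate() keyword arguments as follows; this is a sketch of the settings, not the exact harness invocation we used.
gen_kwargs = dict(
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    num_return_sequences=10,  # 1 for FIM tasks
    max_new_tokens=512,  # max_new_tokens rather than max_tokens, to avoid truncating long inputs
)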
We followed the evaluation strategy adopted in the Swallow project for Japanese and English tasks. More specifically, we used the llm-jp-eval toolkit for Japanese tasks and the Language Model Evaluation Harness toolkit for English (and some Japanese) tasks.
We adopted the default decoding strategy for all the tasks.
Risks and Limitations
The model is trained on general tasks related to software development, not on organization-specific or otherwise non-standardized tasks. We recommend further fine-tuning the model to make it work better on such tasks. The model may produce incorrect output, and all suggestions from the model must be carefully examined before being adopted in real-world applications.
Acknowledgements
The model was developed as part of the Generative AI Accelerator Challenge (GENIAC) project. We are grateful to the New Energy and Industrial Technology Development Organization (NEDO) and the Ministry of Economy, Trade and Industry (METI) for their financial support.
Contact
- pj-geniac at future.co.jp
License
META LLAMA 3.1 COMMUNITY LICENSE
Copyright © 2025 by Future Corporation