Llama 3.1 Future Code Ja
Llama 3.1 Future Code Ja is a large language model with 8B parameters built on top of the Meta Llama 3.1 model. The model first underwent continual pre-training on a mixture of code and mostly-Japanese natural language data. The training data comes mainly from The Stack V2 dataset and a subset of the LLM-jp Corpus v3, comprising 204.9B code tokens and 85.7B natural language tokens after carefully designed data cleaning. The model was then merged with the instruct variant of the Meta Llama 3.1 model to acquire the ability to follow general task instructions, followed by supervised fine-tuning (SFT) and direct preference optimization (DPO) on our own magpie-generated code instruction data.
The model officially supports Japanese and English as natural languages, and more than 40 programming languages ranging from popular ones such as Python and Java to legacy languages such as COBOL. In addition to causal (left-to-right) inference, the model supports Fill-in-the-Middle (FIM), in which it fills in a blank while attending to bidirectional context, a common use case in IDEs.
The model outperforms the original Llama 3.1 model in both Japanese- and English-instructed code completion tasks across various programming languages, and outperforms the Qwen model families in Japanese generation tasks, striking a good balance between specialization in code-related tasks and general ability in Japanese.
Usage
Here are sample inference scripts using transformers. We recommend using vLLM for faster inference; a minimal vLLM sketch is shown after the Chat example below.
pip install torch transformers accelerate
Chat
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "future-architect/Llama-3.1-Future-Code-Ja-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
# we recommend using the following system prompt:
# for Japanese completion : "あなたは様々なソフトウェア開発タスクをサポートするAIアシスタントです。"
# for English completion : "You are an AI assistant who support various software development tasks."
message = [
    {
        "role": "system",
        "content": "あなたは様々なソフトウェア開発タスクをサポートするAIアシスタントです。"
    },
    {
        "role": "user",
        "content": "PythonでFizzBuzzを書いてください。",
    },
]
input_ids = tokenizer.apply_chat_template(
    message, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
output = model.generate(**input_ids, max_new_tokens=1024)
print(tokenizer.decode(output[0, input_ids["input_ids"].shape[1]:]))
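For the vLLM path recommended above, here is a minimal chat sketch. It assumes a recent vLLM release that provides the LLM.chat API; the exact arguments may need adjustment for your environment.
from vllm import LLM, SamplingParams
# load the model with vLLM (assumes vLLM is installed separately: pip install vllm)
llm = LLM(model="future-architect/Llama-3.1-Future-Code-Ja-8B", dtype="bfloat16")
sampling_params = SamplingParams(temperature=0.2, top_p=0.95, max_tokens=1024)
message = [
    {"role": "system", "content": "あなたは様々なソフトウェア開発タスクをサポートするAIアシスタントです。"},
    {"role": "user", "content": "PythonでFizzBuzzを書いてください。"},
]
# LLM.chat applies the model's chat template internally
outputs = llm.chat(message, sampling_params)
print(outputs[0].outputs[0].text)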
Fill-in-the-Middle
With the idea that users may not want a line break immediately after their cursor position, we did not create any middle splits that start with a newline symbol (\n); instead, we included such newlines at the end of the prefix.
The same holds for the boundary between the suffix and middle splits, which makes the model quite sensitive to which split contains the newline symbol.
Please remove one newline symbol (if present) from the beginning of the suffix for improved performance.
You may also set a larger repetition penalty to avoid nonsensical generations consisting of repeated symbols (an example is given after the script below).
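As a concrete illustration of the newline handling described above, here is a minimal sketch of splitting the text around a cursor position before building the FIM prompt; the helper name and signature are ours, not part of the model or any library.
def split_at_cursor(text, cursor):
    """Split text at the cursor into (prefix, suffix) for FIM prompting (illustrative helper)."""
    prefix, suffix = text[:cursor], text[cursor:]
    # drop a single newline right after the cursor, mirroring how the training splits were built
    if suffix.startswith("\n"):
        suffix = suffix[1:]
    return prefix, suffix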
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
FIM_PREFIX = "<|fim_prefix|>"
FIM_MIDDLE = "<|fim_middle|>"
FIM_SUFFIX = "<|fim_suffix|>"
model_name = "future-architect/Llama-3.1-Future-Code-Ja-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
# prepend <|begin_of_text|> to indicate the beginning of the content (not of the whole sequence with special tokens)
prefix = "<|begin_of_text|>def fizzbuzz(n"
suffix = "return n"
# PSM mode (infilling)
input_txt = FIM_PREFIX + prefix + FIM_SUFFIX + suffix + FIM_MIDDLE
# SPM mode (reverse infilling)
# input_txt = FIM_PREFIX + FIM_SUFFIX + suffix + FIM_MIDDLE + prefix
# set add_special_tokens to False, so that the tokenizer does NOT add <|begin_of_text|> before special tokens
input_ids = tokenizer(input_txt, add_special_tokens=False, return_tensors="pt").to(model.device)
output = model.generate(**input_ids, max_new_tokens=1024, temperature=0.2, top_p=0.95)
print(tokenizer.decode(output[0, input_ids["input_ids"].shape[1]:]))
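If the generation degenerates into long runs of repeated symbols, you can pass a repetition penalty to generate() as suggested above; the value below is only an illustrative starting point, not a tuned recommendation.
output = model.generate(**input_ids, max_new_tokens=1024, temperature=0.2, top_p=0.95, repetition_penalty=1.1)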
Model Performance
Code completion (Japanese)
- JHumanEval (Sato et al., 2024)
- JMultiPL-E (Taneguchi et al., 2025)
Note: We do not report scores for two programming languages (Julia and Racket), which we did not include in the training data. All the scores below are pass@1 with 10 trials.
model | size | py | cpp | cs | d | go | java | js | php | pl | r | rb | rs | scala | sh | swift | ts |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Llama 3.1 Future Code Ja | 8B | 0.6335 | 0.5267 | 0.3633 | 0.1564 | 0.6286 | 0.4696 | 0.5528 | 0.4814 | 0.2919 | 0.2969 | 0.1870 | 0.4487 | 0.4425 | 0.3285 | 0.3861 | 0.5623 |
Llama 3.1 | 8B | 0.5061 | 0.4391 | 0.2835 | 0.2147 | 0.5519 | 0.3753 | 0.4640 | 0.4248 | 0.2584 | 0.2360 | 0.3112 | 0.3269 | 0.3175 | 0.2665 | 0.3323 | 0.4799 |
Llama 3.1 Swallow | 8B | 0.4213 | 0.3329 | 0.2456 | 0.1026 | 0.6370 | 0.3468 | 0.3112 | 0.3273 | 0.1758 | 0.1807 | 0.0503 | 0.2090 | 0.2487 | 0.1525 | 0.2354 | 0.3258 |
Qwen2.5 | 7B | 0.6018 | 0.5106 | 0.3601 | 0.2353 | 0.7500 | 0.5044 | 0.5416 | 0.5267 | 0.3075 | 0.3466 | 0.3683 | 0.5071 | 0.3969 | 0.3380 | 0.4576 | 0.6025 |
Qwen2.5-Coder | 7B | 0.6695 | 0.6379 | 0.4601 | 0.1660 | 0.7110 | 0.5468 | 0.6696 | 0.5894 | 0.3497 | 0.4174 | 0.3565 | 0.6032 | 0.4950 | 0.3544 | 0.5285 | 0.6358 |
Qwen3 | 8B | 0.6256 | 0.5683 | 0.3709 | 0.1583 | 0.5156 | 0.4778 | 0.5814 | 0.5547 | 0.3969 | 0.2466 | 0.3217 | 0.4763 | 0.4075 | 0.3418 | 0.3715 | 0.5239 |
Gemma 2 | 9B | 0.5549 | 0.4590 | 0.3608 | 0.0897 | 0.7052 | 0.4601 | 0.2863 | 0.4733 | 0.1099 | 0.1615 | 0.1205 | 0.3417 | 0.3850 | 0.1209 | 0.3272 | 0.2346 |
Code completion (English)
Note: We do not report scores for two programming languages (Julia and Racket), which we did not include in the training data. All the scores below are pass@1 with 10 trials.
model | size | py | cpp | cs | d | go | java | js | php | pl | r | rb | rs | scala | sh | swift | ts |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Llama 3.1 Future Code Ja | 8B | 0.6835 | 0.5795 | 0.3829 | 0.1692 | 0.6279 | 0.4987 | 0.6149 | 0.5565 | 0.3652 | 0.3317 | 0.1752 | 0.4846 | 0.4662 | 0.3595 | 0.4525 | 0.6390 |
Llama 3.1 | 8B | 0.6311 | 0.4795 | 0.3184 | 0.2083 | 0.5909 | 0.4715 | 0.5571 | 0.4658 | 0.3236 | 0.2696 | 0.4267 | 0.3744 | 0.3856 | 0.2994 | 0.3741 | 0.5717 |
Llama 3.1 Swallow | 8B | 0.4701 | 0.3720 | 0.2646 | 0.1224 | 0.6519 | 0.3759 | 0.3006 | 0.3733 | 0.1752 | 0.1447 | 0.0590 | 0.2103 | 0.2744 | 0.1614 | 0.2190 | 0.3786 |
Qwen2.5 | 7B | 0.6732 | 0.5491 | 0.4253 | 0.2455 | 0.7000 | 0.6013 | 0.6137 | 0.5913 | 0.3373 | 0.3832 | 0.4429 | 0.5923 | 0.4263 | 0.3715 | 0.5095 | 0.6535 |
Qwen2.5-Coder | 7B | 0.7890 | 0.7373 | 0.5152 | 0.1936 | 0.3935 | 0.6184 | 0.7385 | 0.6528 | 0.3969 | 0.4224 | 0.4230 | 0.6545 | 0.5725 | 0.4158 | 0.5797 | 0.7434 |
Qwen3 | 8B | 0.7134 | 0.6702 | 0.4285 | 0.2295 | 0.4721 | 0.5747 | 0.6602 | 0.6236 | 0.4441 | 0.3627 | 0.4261 | 0.6154 | 0.5363 | 0.4089 | 0.4304 | 0.6082 |
Gemma 2 | 9B | 0.6128 | 0.5118 | 0.3728 | 0.1045 | 0.6552 | 0.4791 | 0.3758 | 0.4863 | 0.0783 | 0.1186 | 0.0795 | 0.3853 | 0.4162 | 0.1437 | 0.3506 | 0.3723 |
Fill-in-the-Middle
- SantaCoder-FIM (Allal et al., 2023)
Note: The models marked with an asterisk (*) do not natively support FIM. We used the SPM prompt from Gong et al., 2024 and truncated the generated output just before the point that matched the beginning of the provided suffix (a minimal sketch of this truncation is given after the table). The scores of the Llama models in PSM mode are not reported here because they were close to zero in all of those settings. All the scores below are exact match (EM) with 1 trial.
model | size | PSM (py) | SPM (py) | PSM (js) | SPM (js) | PSM (java) | SPM (java) |
---|---|---|---|---|---|---|---|
Llama 3.1 Future Code Ja | 8B | 0.5216 | 0.5139 | 0.6018 | 0.6049 | 0.5517 | 0.5478 |
Qwen2.5-Coder | 7B | 0.5829 | 0.4084 | 0.6612 | 0.5597 | 0.6433 | 0.6180 |
Llama 3.1 8B * | 8B | - | 0.4468 | - | 0.3951 | - | 0.3506 |
Llama 3.1 70B * | 70B | - | 0.5964 | - | 0.5084 | - | 0.2910 |
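For reference, a minimal sketch of the truncation described in the note above; the helper name and the probe length are our own illustrative choices.
def truncate_before_suffix(generated, suffix, probe_len=16):
    # cut the generation just before the first point that matches the beginning of the provided suffix
    probe = suffix[:probe_len]
    idx = generated.find(probe) if probe else -1
    return generated if idx == -1 else generated[:idx]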
Japanese tasks
- JCommonSenseQA (Kurihara et al., 2022, Exact Match)
- JEMHopQA (Ishii et al., 2024, chr-F1)
- NIILC (Sekine, 2003, chr-F1)
- JSQuAD (Kurihara et al., 2022, chr-F1)
- XL-Sum (Hasan et al., 2021, ROUGE-2)
- MGSM (Shi et al., 2023, Exact Match)
- WMT20 en-ja (Barrault et al., 2020, BLEU)
- WMT20 ja-en (Barrault et al., 2020, BLEU)
model | size | JCommonsenseQA | JEMHopQA | NIILC | JSQuAD | XL-SUM | MGSM | WMT20 en-ja | WMT20 ja-en |
---|---|---|---|---|---|---|---|---|---|
Llama 3.1 Future Code Ja | 8B | 0.9124 | 0.4983 | 0.5118 | 0.8758 | 0.1779 | 0.5480 | 0.2624 | 0.2028 |
Llama 3.1 | 8B | 0.8829 | 0.4537 | 0.4050 | 0.8868 | 0.1486 | 0.5080 | 0.2195 | 0.2008 |
Llama 3.1 Swallow | 8B | 0.9240 | 0.5228 | 0.5805 | 0.8957 | 0.1920 | 0.5480 | 0.2818 | 0.2263 |
Qwen2.5 | 7B | 0.9142 | 0.4394 | 0.3998 | 0.8908 | 0.1690 | 0.6240 | 0.2091 | 0.1909 |
Qwen2.5-Coder | 7B | 0.8472 | 0.3014 | 0.3045 | 0.8906 | 0.1533 | 0.5360 | 0.1816 | 0.1598 |
Qwen3 | 8B | 0.9169 | 0.4265 | 0.4197 | 0.8943 | 0.1882 | 0.7720 | 0.2450 | 0.2133 |
Gemma 2 | 9B | 0.9312 | 0.5288 | 0.5306 | 0.8774 | 0.0873 | 0.4680 | 0.2305 | 0.2017 |
English tasks
- TriviaQA (Joshi et al., 2017, Exact Match)
- SQuAD2 (Rajpurkar et al., 2018, Exact Match)
- GSM8K (Cobbe et al., 2021, Exact Match)
model | size | TriviaQA | SQuAD2 | GSM8K |
---|---|---|---|---|
Llama 3.1 Future Code Ja | 8B | 0.6233 | 0.3754 | 0.7111 |
Llama 3.1 | 8B | 0.6991 | 0.3784 | 0.7475 |
Llama 3.1 Swallow | 8B | 0.6296 | 0.3628 | 0.6126 |
Qwen2.5 | 7B | 0.5176 | 0.2624 | 0.7430 |
Qwen2.5-Coder | 7B | 0.4517 | 0.3388 | 0.7020 |
Qwen3 | 8B | 0.5631 | 0.3922 | 0.8749 |
Gemma 2 | 9B | 0.6573 | 0.3944 | 0.7908 |
Evaluation Details
We used the Code Generation LM Evaluation Harness toolkit to evaluate code completion and FIM capabilities.
We adopted the settings below for decoding.
We mostly followed the recommended settings; however, we set max_new_tokens instead of max_tokens to avoid truncation when handling long input sequences (the settings are also expressed as generate() keyword arguments after the list below).
- Temperature: 0.2
- Top-p: 0.95
- Number of completions to generate: 10 (for completion tasks), 1 (for FIM tasks)
- Maximum number of new tokens: 512
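For clarity, the decoding settings above can be expressed roughly as transformers generate() keyword arguments as follows; this is a sketch of the settings, not the exact harness invocation we used.
gen_kwargs = dict(
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    num_return_sequences=10,  # 1 for FIM tasks
    max_new_tokens=512,  # max_new_tokens rather than max_tokens, to avoid truncating long inputs
)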
We followed the evaluation strategy adopted in the Swallow project for Japanese and English tasks. More specifically, we used the llm-jp-eval toolkit for Japanese tasks and the Language Model Evaluation Harness toolkit for English (and some Japanese) tasks.
We adopted the default decoding strategy for all the tasks.
Risks and Limitations
The model is trained on general tasks related to software development, not on organization-specific or otherwise non-standardized tasks. We recommend further fine-tuning the model to make it work better on such tasks. The model may produce incorrect output, and all suggestions from the model must be carefully examined before being adopted in real-world applications.
Acknowledgements
The model was developed as part of the Generative AI Accelerator Challenge (GENIAC) project. We are grateful to the New Energy and Industrial Technology Development Organization (NEDO) and the Ministry of Economy, Trade and Industry (METI) for their financial support.
Contact
- pj-geniac at future.co.jp
License
META LLAMA 3.1 COMMUNITY LICENSE
Copyright © 2025 by Future Corporation