Tri-70B-preview-SFT

Introduction

We introduce Tri-70B-preview-SFT, a research preview of our latest and largest flagship language model, built to push the efficiency frontier in LLM training. By achieving frontier-level performance for its compute budget (trained from scratch on 1.5T tokens), we demonstrate that exceptional capabilities don't require excessive computational resources.

We are releasing a minimally post-trained version to enable open research and community experimentation. This preview has undergone only supervised fine-tuning, without extensive RLHF, so researchers can explore their own RL-based alignment techniques on top of it. Stay tuned for the base model release coming soon!

Key Highlights

  • Architecture optimized for long context (see the attention sketch after this list)
    • 32k context window
    • Sliding window attention with window size 4096
    • iRoPE: Interleaved local (RoPE) and global (temperature-scaled) attention
    • Scalable softmax
  • Multi-lingual capabilities: Specially optimized for English, Korean, and Japanese
  • Enhanced reasoning: Modified training dataset mixture specifically designed for reasoning capabilities, with emphasis on step-by-step problem solving
  • Minimal post-training: This preview release features only supervised fine-tuning, enabling researchers to explore custom alignment techniques and RLHF/RLVR approaches
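
The long-context recipe above combines local sliding-window attention with periodic global layers. The sketch below is only a loose illustration of how such interleaving can be wired up, not the model's actual implementation: the window size and global-layer frequency follow the spec above, while the scalable-softmax form (scaling logits by the log of the sequence length) and the ssmax_scale constant are assumptions for illustration.

```python
# Illustrative sketch (NOT the official implementation) of iRoPE-style
# interleaving: most layers use RoPE + a 4096-token sliding window, while
# every 4th layer attends globally without RoPE and temperature-scales its
# logits with sequence length ("scalable softmax").
import math
import torch
import torch.nn.functional as F

WINDOW = 4096        # sliding-window size for local layers
GLOBAL_EVERY = 4     # one global layer out of every four

def attend(q, k, v, layer_idx, ssmax_scale=0.43):
    # q, k, v: [batch, heads, seq, head_dim]. RoPE is assumed to have been
    # applied to q/k on local layers already; global layers skip it.
    seq, dim = q.shape[-2], q.shape[-1]
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(dim)
    causal = torch.ones(seq, seq, dtype=torch.bool, device=q.device).tril()

    if layer_idx % GLOBAL_EVERY == GLOBAL_EVERY - 1:
        # Global layer: sharpen the softmax as context grows (assumed SSMax form).
        scores = scores * (ssmax_scale * math.log(seq))
        mask = causal
    else:
        # Local layer: restrict the causal mask to the most recent 4096 positions.
        window = torch.ones(seq, seq, dtype=torch.bool, device=q.device).triu(-(WINDOW - 1))
        mask = causal & window

    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```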

Model Specifications

Tri-70B-preview-SFT

| Specification | Value |
|---|---|
| Type | Causal Language Model |
| Training Stage | Pre-training & Supervised Fine-Tuning |
| Architecture | Transformer Decoder with iRoPE (global attention frequency of 4), SwiGLU, RMSNorm, and GQA |
| Number of Parameters | 70B |
| Number of Layers | 80 |
| Number of Attention Heads | 64 (Query) / 8 (Key, Value) |
| Context Length | 32,768 |
| Number of Tokens Seen | 1.5T |
| Vocab Size | 124,416 |
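
The table lists grouped-query attention (GQA): 64 query heads share 8 key/value heads, so each KV head serves a group of 8 query heads and the KV cache is 8x smaller than full multi-head attention. The sketch below only illustrates that sharing; the batch size, sequence length, and head dimension are made-up values, not the model's actual shapes.

```python
# Illustrative GQA sketch: 8 query heads per KV head (64 / 8 from the spec).
# Shapes here are arbitrary toy values, not the model's real dimensions.
import torch
import torch.nn.functional as F

batch, seq, head_dim = 2, 16, 128
n_q_heads, n_kv_heads = 64, 8
group = n_q_heads // n_kv_heads          # 8 query heads share each KV head

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Broadcast each KV head across its group of query heads, then attend as usual.
k = k.repeat_interleave(group, dim=1)    # [batch, 64, seq, head_dim]
v = v.repeat_interleave(group, dim=1)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)                         # torch.Size([2, 64, 16, 128])
```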

Quickstart

Here is a code snippet that shows how to load the tokenizer and model and generate text with apply_chat_template:

Tri-70B-SFT Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "trillionlabs/Tri-70B-preview-SFT"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Explain the concept of central limit theorem in simple terms."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
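
The snippet above uses generate's defaults (greedy decoding, unless the model's generation config specifies otherwise). If you prefer sampled, more varied responses, you can pass sampling arguments explicitly; the values below are illustrative, not an official recommendation:

```python
# Continuation of the snippet above; sampling hyperparameters are illustrative.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,     # sample instead of greedy decoding
    temperature=0.7,
    top_p=0.9,
)
```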

vLLM Deployment

We are working with the vLLM team on native vLLM support for our Trillion model. In the meantime, we offer an out-of-tree integration method (link).

```bash
# install the out-of-tree model integration plugin
git clone https://github.com/trillion-labs/trillion-deployment-utils.git
cd trillion-deployment-utils && pip install -e .

vllm serve trillionlabs/Tri-70B-preview-SFT --trust-remote-code
```

The plugin has been tested with vLLM version 0.10.1.1. Check out the repository for detailed usage.
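
Once the server is up, it exposes vLLM's OpenAI-compatible API (on port 8000 by default). A minimal client sketch, with the prompt and token limit purely illustrative:

```python
# Minimal client sketch against vLLM's OpenAI-compatible endpoint
# (default http://localhost:8000/v1); adjust host/port to your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.chat.completions.create(
    model="trillionlabs/Tri-70B-preview-SFT",
    messages=[{"role": "user", "content": "Explain the central limit theorem in simple terms."}],
    max_tokens=256,
)
print(completion.choices[0].message.content)
```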

Evaluation

We evaluated Tri-70B-preview-SFT across a suite of benchmarks assessing general reasoning, knowledge recall, coding abilities, mathematical reasoning, and instruction-following capabilities. We compare our model against state-of-the-art models of similar scale: Qwen-2.5-72B-instruct and Llama-3.1-70B.

Full evaluation settings

Benchmark Evaluation Settings

| Benchmark | Language | Evaluation Setting | Metric |
|---|---|---|---|
| HAERAE | Korean | 3-shot | accuracy |
| KMMLU | Korean | 0-shot, CoT | accuracy (exact-match) |
| MMLU | English | 0-shot, CoT | accuracy (exact-match) |
| MMLU-Pro | English | 0-shot, CoT | exact-match |
| HumanEval | English | 0-shot | pass@1 |
| MBPPPlus | English | 0-shot | pass@1 |
| GSM8k | English | 0-shot, CoT | exact-match |
| MATH | English | 0-shot, CoT | exact-match |
| GPQA Diamond | English | 0-shot, CoT | accuracy |
| HRM8k | Korean | 0-shot, CoT | exact-match |
| MT-Bench | English | LLM-as-a-judge (gpt-4o) | LLM score |

Note that MT-Bench uses a 10-point scale.

Benchmark Results

Models compared:

  • Tri-70B-preview-SFT: Our flagship 70B parameter model
  • Qwen-2.5-72B-instruct: Qwen's 72B parameter instruction-tuned model
  • Llama-3.1-70B: Meta's instruction-tuned 70B model

| Benchmark | Tri-70B-SFT | Qwen-2.5-72B-instruct | Llama-3.1-70B |
|---|---|---|---|
| HAERAE | 83.96 | 75.44 | 78.09 |
| KMMLU | 62.38 | 65.07 | 54.62 |
| MMLU | 74.42 | 87.29 | 85.47 |
| MMLU-Pro | 62.48 | 69.40 | 62.79 |
| HumanEval | - | 89.02 | 82.93 |
| MBPPPlus | 68.52 | 88.2 | 84.13 |
| GSM8k | 87.37 | 91.51 | 72.48 |
| MATH | 64.40 | 80.80 | 62.40 |
| GPQA-Diamond | - | 54.04 | 44.44 |
| HRM8k | 82.26 | 66.24 | 63.90 |
| MT-Bench | 7.54 | 8.71 | 8.2 |

Limitations

  • Language Support: The model is optimized for English, Korean, and Japanese. Usage with other languages may result in degraded performance.
  • Knowledge Cutoff: The model's knowledge is limited to information available up to February 2025.
  • Minimal Post-Training: As this is a supervised fine-tuning (SFT) release without RLHF, responses may occasionally lack the polish and safety alignment of fully post-trained models.

License

This model repository is licensed under the Trillion License.

Contact

For inquiries, please contact: [email protected]
