Tri-70B-preview-SFT

Introduction

We introduce Tri-70B-preview-SFT, a research preview of our latest and largest flagship language model, built to push the efficiency frontier in LLM training. By achieving frontier-level performance for its compute budget (trained from scratch on 1.5T tokens), we demonstrate that exceptional capabilities don't require excessive computational resources.

We are releasing a minimally post-trained version to enable open research and community experimentation. This preview has undergone only supervised fine-tuning, without extensive RLHF, so researchers can explore their own RL-based alignment techniques on top of it. Stay tuned for the base model release coming soon!

Key Highlights

  • Architecture optimized for long context (see the attention sketch after this list)
    • 32k context window
    • Sliding window attention with window size 4096
    • iRoPE: Interleaved local (RoPE) and global (temperature-scaled) attention
    • Scalable softmax
  • Multi-lingual capabilities: Specially optimized for English, Korean, and Japanese
  • Enhanced reasoning: Modified training dataset mixture specifically designed for reasoning capabilities, with emphasis on step-by-step problem solving
  • Minimal post-training: This preview release features only supervised fine-tuning, enabling researchers to explore custom alignment techniques and RLHF/RLVR approaches
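
The long-context recipe above combines local sliding-window attention with periodic global layers. The sketch below is only a loose illustration of how such interleaving can be wired up, not the model's actual implementation: the window size and global-layer frequency follow the spec above, while the scalable-softmax form (scaling logits by the log of the sequence length) and the ssmax_scale constant are assumptions for illustration.

```python
# Illustrative sketch (NOT the official implementation) of iRoPE-style
# interleaving: most layers use RoPE + a 4096-token sliding window, while
# every 4th layer attends globally without RoPE and temperature-scales its
# logits with sequence length ("scalable softmax").
import math
import torch
import torch.nn.functional as F

WINDOW = 4096        # sliding-window size for local layers
GLOBAL_EVERY = 4     # one global layer out of every four

def attend(q, k, v, layer_idx, ssmax_scale=0.43):
    # q, k, v: [batch, heads, seq, head_dim]. RoPE is assumed to have been
    # applied to q/k on local layers already; global layers skip it.
    seq, dim = q.shape[-2], q.shape[-1]
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(dim)
    causal = torch.ones(seq, seq, dtype=torch.bool, device=q.device).tril()

    if layer_idx % GLOBAL_EVERY == GLOBAL_EVERY - 1:
        # Global layer: sharpen the softmax as context grows (assumed SSMax form).
        scores = scores * (ssmax_scale * math.log(seq))
        mask = causal
    else:
        # Local layer: restrict the causal mask to the most recent 4096 positions.
        window = torch.ones(seq, seq, dtype=torch.bool, device=q.device).triu(-(WINDOW - 1))
        mask = causal & window

    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```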

Model Specifications

Tri-70B-preview-SFT

| Specification | Value |
|---|---|
| Type | Causal Language Model |
| Training Stage | Pre-training & Supervised Fine-Tuning |
| Architecture | Transformer Decoder with iRoPE (global attention frequency of 4), SwiGLU, RMSNorm, and GQA |
| Number of Parameters | 70B |
| Number of Layers | 80 |
| Number of Attention Heads | 64 (Query) / 8 (Key, Value) |
| Context Length | 32,768 |
| Number of Tokens Seen | 1.5T |
| Vocab Size | 124,416 |
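
The table lists grouped-query attention (GQA): 64 query heads share 8 key/value heads, so each KV head serves a group of 8 query heads and the KV cache is 8x smaller than full multi-head attention. The sketch below only illustrates that sharing; the batch size, sequence length, and head dimension are made-up values, not the model's actual shapes.

```python
# Illustrative GQA sketch: 8 query heads per KV head (64 / 8 from the spec).
# Shapes here are arbitrary toy values, not the model's real dimensions.
import torch
import torch.nn.functional as F

batch, seq, head_dim = 2, 16, 128
n_q_heads, n_kv_heads = 64, 8
group = n_q_heads // n_kv_heads          # 8 query heads share each KV head

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Broadcast each KV head across its group of query heads, then attend as usual.
k = k.repeat_interleave(group, dim=1)    # [batch, 64, seq, head_dim]
v = v.repeat_interleave(group, dim=1)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)                         # torch.Size([2, 64, 16, 128])
```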

Quickstart

Here is a code snippet that shows how to load the tokenizer and model and generate text with apply_chat_template:

Tri-70B-SFT Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "trillionlabs/Tri-70B-preview-SFT"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Explain the concept of central limit theorem in simple terms."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
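
The snippet above uses generate's defaults (greedy decoding, unless the model's generation config specifies otherwise). If you prefer sampled, more varied responses, you can pass sampling arguments explicitly; the values below are illustrative, not an official recommendation:

```python
# Continuation of the snippet above; sampling hyperparameters are illustrative.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=True,     # sample instead of greedy decoding
    temperature=0.7,
    top_p=0.9,
)
```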

vLLM Deployment

We are working with the vLLM team on native vLLM support for our Trillion model. In the meantime, we offer an out-of-tree integration method (link).

```bash
# install the out-of-tree model integration plugin
git clone https://github.com/trillion-labs/trillion-deployment-utils.git
cd trillion-deployment-utils && pip install -e .

vllm serve trillionlabs/Tri-70B-preview-SFT --trust-remote-code
```

The plugin has been tested with vLLM version 0.10.1.1. Check out the repository for detailed usage.
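
Once the server is up, it exposes vLLM's OpenAI-compatible API (on port 8000 by default). A minimal client sketch, with the prompt and token limit purely illustrative:

```python
# Minimal client sketch against vLLM's OpenAI-compatible endpoint
# (default http://localhost:8000/v1); adjust host/port to your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.chat.completions.create(
    model="trillionlabs/Tri-70B-preview-SFT",
    messages=[{"role": "user", "content": "Explain the central limit theorem in simple terms."}],
    max_tokens=256,
)
print(completion.choices[0].message.content)
```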

Evaluation

We evaluated Tri-70B-preview-SFT across a suite of benchmarks assessing general reasoning, knowledge recall, coding abilities, mathematical reasoning, and instruction-following capabilities. We compare our model against state-of-the-art models of similar scale: Qwen-2.5-72B-instruct and Llama-3.1-70B.

Full evaluation settings

Benchmark Evaluation Settings

| Benchmark | Language | Evaluation Setting | Metric |
|---|---|---|---|
| HAERAE | Korean | 3-shot | accuracy |
| KMMLU | Korean | 0-shot, CoT | accuracy (exact-match) |
| MMLU | English | 0-shot, CoT | accuracy (exact-match) |
| MMLU-Pro | English | 0-shot, CoT | exact-match |
| HumanEval | English | 0-shot | pass@1 |
| MBPPPlus | English | 0-shot | pass@1 |
| GSM8k | English | 0-shot, CoT | exact-match |
| MATH | English | 0-shot, CoT | exact-match |
| GPQA Diamond | English | 0-shot, CoT | accuracy |
| HRM8k | Korean | 0-shot, CoT | exact-match |
| MT-Bench | English | LLM-as-a-judge (gpt-4o) | LLM score |

Note that MT-Bench uses a 10-point scale.

Benchmark Results

Models compared:

  • Tri-70B-preview-SFT: Our flagship 70B parameter model
  • Qwen-2.5-72B-instruct: Qwen's 72B parameter instruction-tuned model
  • Llama-3.1-70B: Meta's instruction-tuned 70B model

| Benchmark | Tri-70B-SFT | Qwen-2.5-72B-instruct | Llama-3.1-70B |
|---|---|---|---|
| HAERAE | 83.96 | 75.44 | 78.09 |
| KMMLU | 62.38 | 65.07 | 54.62 |
| MMLU | 74.42 | 87.29 | 85.47 |
| MMLU-Pro | 62.48 | 69.40 | 62.79 |
| HumanEval | - | 89.02 | 82.93 |
| MBPPPlus | 68.52 | 88.2 | 84.13 |
| GSM8k | 87.37 | 91.51 | 72.48 |
| MATH | 64.40 | 80.80 | 62.40 |
| GPQA-Diamond | - | 54.04 | 44.44 |
| HRM8k | 82.26 | 66.24 | 63.90 |
| MT-Bench | 7.54 | 8.71 | 8.2 |

Limitations

  • Language Support: The model is optimized for English, Korean, and Japanese. Usage with other languages may result in degraded performance.
  • Knowledge Cutoff: The model's knowledge is limited to information available up to February 2025.
  • Minimal Post-Training: As this is a supervised fine-tuning (SFT) release without RLHF, responses may occasionally lack the polish and safety alignment of fully post-trained models.

License

This model repository is licensed under the Trillion License.

Contact

For inquiries, please contact: [email protected]
