Tri-70B-preview-SFT
Introduction
We introduce Tri-70B-preview-SFT, a research preview of our latest and largest flagship language model, one that redefines the efficiency frontier in LLM training. By achieving frontier performance at its compute scale (1.5T training tokens, trained from scratch), we demonstrate that exceptional capabilities don't require excessive computational resources.
We are releasing a minimally post-trained version to enable open research and community experimentation. This preview has undergone only supervised fine-tuning, without extensive RLHF, leaving room for researchers to explore RL-based alignment techniques on top of it. Stay tuned for the upcoming base model release!
Key Highlights
- Architecture optimized for long context (see the sketch below):
  - 32k context window
  - Sliding window attention with a window size of 4096
  - iRoPE: Interleaved local (RoPE) and global (temperature-scaled) attention
  - Scalable softmax
- Multilingual capabilities: Specially optimized for English, Korean, and Japanese
- Enhanced reasoning: Modified training dataset mixture specifically designed for reasoning capabilities, with an emphasis on step-by-step problem solving
- Minimal post-training: This preview release features only supervised fine-tuning, enabling researchers to explore custom alignment techniques and RLHF/RLVR approaches
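To make the long-context layout concrete, here is a minimal, illustrative sketch of how an interleaved attention schedule and a length-dependent ("scalable") softmax could look. The layer schedule follows the global attention frequency of 4 noted in the spec table below; the scaling constant and function names are our own assumptions for illustration, not the model's actual implementation.

```python
import math
import torch

# Values taken from the spec table; the scaling constant below is a placeholder.
NUM_LAYERS = 80
GLOBAL_EVERY = 4        # "global attention frequency of 4"
SLIDING_WINDOW = 4096   # local layers use sliding-window attention with RoPE

def is_global_layer(layer_idx: int) -> bool:
    """Interleave one global (temperature-scaled, no RoPE) layer every GLOBAL_EVERY layers."""
    return (layer_idx + 1) % GLOBAL_EVERY == 0

def scalable_softmax(scores: torch.Tensor, seq_len: int, s: float = 0.4) -> torch.Tensor:
    """Length-aware softmax: scale attention logits by s * log(seq_len) so the
    distribution does not flatten as the context grows. `s` is illustrative."""
    return torch.softmax(s * math.log(seq_len) * scores, dim=-1)

# Layer schedule for the first 8 layers: three local layers, then a global one, repeating.
schedule = ["global" if is_global_layer(i) else f"local(window={SLIDING_WINDOW})"
            for i in range(8)]
print(schedule)
```

The intent of this pattern is that local RoPE layers handle fine-grained structure within the 4096-token window, while the interleaved global layers carry long-range information across the full 32k context.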
Model Specifications
Tri-70B-preview-SFT
Specification | Value |
---|---|
Type | Causal Language Model |
Training Stage | Pre-training & Supervised Fine-Tuning |
Architecture | Transformer Decoder with iRoPE (global attention frequency of 4), SwiGLU, RMSNorm, and GQA |
Number of Parameters | 70B |
Number of Layers | 80 |
Number of Attention Heads | 64 (Query) / 8 (Key, Value) |
Context Length | 32,768 |
Number of Tokens Seen | 1.5T |
Vocab Size | 124,416 |
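For reference, the 64/8 query-to-key/value head split corresponds to grouped-query attention (GQA), where each KV head is shared by 8 query heads. The sketch below shows the shapes involved; the head dimension of 128 is an assumption for illustration and is not listed in the table above.

```python
import torch
import torch.nn.functional as F

# Shapes for the 64 query / 8 key-value head split (GQA). The head dimension of 128
# is an assumption for illustration; it is not listed in the spec table.
batch, seq_len = 1, 1024               # context length goes up to 32,768 in the real model
n_q_heads, n_kv_heads, head_dim = 64, 8, 128
group_size = n_q_heads // n_kv_heads   # 8 query heads share each KV head

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Expand KV heads to match the query heads; the KV cache stays 8x smaller than full MHA.
k = k.repeat_interleave(group_size, dim=1)
v = v.repeat_interleave(group_size, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 64, 1024, 128])
```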
Quickstart
Here is a code snippet with `apply_chat_template` that demonstrates how to load the tokenizer and model and generate text:
Tri-70B-SFT Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "trillionlabs/Tri-70B-preview-SFT"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Explain the concept of central limit theorem in simple terms."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
vLLM Deployment
We are working with the vLLM team on native vLLM support for our Trillion models. In the meantime, we offer an out-of-tree integration plugin (link).
```bash
# install the out-of-tree model integration plugin
git clone https://github.com/trillion-labs/trillion-deployment-utils.git
cd trillion-deployment-utils && pip install -e .

vllm serve trillionlabs/Tri-70B-preview-SFT --trust-remote-code
```
The plugin has been tested with vLLM version 0.10.1.1. Check out the repository for detailed usage.
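Once the server is running, it can be queried through vLLM's OpenAI-compatible endpoint. Here is a minimal example with the `openai` client; the default port and the prompt are just for illustration:

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API (default: http://localhost:8000/v1).
# The api_key value is unused by a local server; the prompt is only an example.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="trillionlabs/Tri-70B-preview-SFT",
    messages=[{"role": "user", "content": "Explain the central limit theorem in simple terms."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```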
Evaluation
We evaluated Tri-70B-preview-SFT across a suite of benchmarks assessing general reasoning, knowledge recall, coding abilities, mathematical reasoning, and instruction-following capabilities. We compare our model against state-of-the-art models of similar scale: Qwen-2.5-72B-instruct and Llama-3.1-70B.
Full evaluation settings
Benchmark Evaluation Settings
Benchmark | Language | Evaluation Setting | Metric |
---|---|---|---|
HAERAE | Korean | 3-shot | accuracy |
KMMLU | Korean | 0-shot, CoT | accuracy (exact-match) |
MMLU | English | 0-shot, CoT | accuracy (exact-match) |
MMLU-Pro | English | 0-shot, CoT | exact-match |
HumanEval | English | 0-shot | pass@1 |
MBPPPlus | English | 0-shot | pass@1 |
GSM8k | English | 0-shot, CoT | exact-match |
MATH | English | 0-shot, CoT | exact-match |
GPQA Diamond | English | 0-shot, CoT | accuracy |
HRM8k | Korean | 0-shot, CoT | exact-match |
MT-Bench | English | LLM-as-a-judge (gpt-4o) | LLM score |

Note: MT-Bench is scored on a 10-point scale.
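For context, the exact-match metrics above let the model reason step by step (CoT) and then compare an extracted final answer against the reference. The snippet below is a rough sketch of that scoring idea; the answer-extraction heuristic and record format are our assumptions, not the actual evaluation harness used.

```python
import re

def extract_final_answer(completion: str) -> str:
    """Heuristic: take the last number in the completion as the final answer
    (GSM8k-style); real harnesses typically use task-specific extraction."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return matches[-1] if matches else ""

def exact_match_accuracy(records: list[dict]) -> float:
    """records: [{"completion": model CoT output, "answer": gold answer string}, ...]"""
    correct = sum(extract_final_answer(r["completion"]) == r["answer"] for r in records)
    return correct / len(records)

# Toy example
records = [{"completion": "18 - 3 = 15, so she earns 15 * 2 = 30. The answer is 30.", "answer": "30"}]
print(exact_match_accuracy(records))  # 1.0
```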
Benchmark Results
Models compared:
- Tri-70B-preview-SFT: Our flagship 70B parameter model
- Qwen-2.5-72B-instruct: Qwen's 72B parameter instruction-tuned model
- Llama-3.1-70B: Meta's instruction-tuned 70B model
Benchmark | Tri-70B-SFT | Qwen-2.5-72B-instruct | Llama-3.1-70B |
---|---|---|---|
HAERAE | 83.96 | 75.44 | 78.09 |
KMMLU | 62.38 | 65.07 | 54.62 |
MMLU | 74.42 | 87.29 | 85.47 |
MMLU-Pro | 62.48 | 69.40 | 62.79 |
HumanEval | - | 89.02 | 82.93 |
MBPPPlus | 68.52 | 88.2 | 84.13 |
GSM8k | 87.37 | 91.51 | 72.48 |
MATH | 64.40 | 80.80 | 62.40 |
GPQA-Diamond | - | 54.04 | 44.44 |
HRM8k | 82.26 | 66.24 | 63.90 |
MT-Bench | 7.54 | 8.71 | 8.2 |
Limitations
- Language Support: The model is optimized for English, Korean, and Japanese. Usage with other languages may result in degraded performance.
- Knowledge Cutoff: The model's knowledge is limited to information available up to February 2025.
- Minimal Post-Training: As this is a supervised fine-tuning (SFT) release without RLHF, responses may occasionally lack the polish and safety alignment of fully post-trained models.
License
This model repository is licensed under the Trillion License.
Contact
For inquiries, please contact: [email protected]