X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression
X-EcoMLA is an efficient KV cache compression technique for large language models (LLMs), proposed by AMD, that upcycles transformer attention blocks into Multi-head Latent Attention (MLA) for extreme KV cache compression and computational efficiency.
Instead of training an MLA model from scratch, X-EcoMLA first initializes the MLA weights from a Singular Value Decomposition (SVD) of the existing transformer attention weights, followed by lightweight pre-training or post-training distillation.
This model, X-EcoMLA-1B1B-fixed-kv512-DPO, was created by efficiently adapting the pre-trained Llama-3.2-1B-Instruct model through post-training on AMD Instinct™ MI300X GPUs. This approach bypasses the need for costly pre-training from scratch.
Key Takeaways
- Announcing X-EcoMLA, an efficient approach to upcycle existing transformer blocks into MLA.
- Extreme KV Cache Compression: X-EcoMLA dramatically reduces the KV cache size by 6.4x–10.6x with only 3.6B–7B training tokens, while preserving nearly 100% of the base model's average zero-shot performance on LM Harness tasks.
- Novel SVD Initialization: X-EcoMLA employs an efficient SVD-based weight initialization that substantially improves training efficiency and model performance (see the sketch below).
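To make the initialization idea concrete, here is a minimal sketch of an SVD-based low-rank initialization: the pre-trained key/value projections are jointly factorized into a shared down-projection and separate up-projections, which then serve as the starting point for the MLA layers. The dimensions, tensor names, and the joint factorization shown here are illustrative assumptions, not the exact X-EcoMLA implementation.

```python
import torch

# Illustrative dimensions only; real values come from the base model's config.
d_model, n_kv_heads, d_head, r_kv = 2048, 8, 64, 512

# Stand-ins for the pre-trained key/value projection weights, shape (n_kv_heads * d_head, d_model).
W_k = torch.randn(n_kv_heads * d_head, d_model)
W_v = torch.randn(n_kv_heads * d_head, d_model)

# Jointly factorize the stacked K/V projections so both share one latent down-projection.
W_kv = torch.cat([W_k, W_v], dim=0)                      # (2 * n_kv_heads * d_head, d_model)
U, S, Vh = torch.linalg.svd(W_kv, full_matrices=False)

# Truncate to rank r_kv: the right factor becomes the shared KV down-projection,
# the left factor (split back into K and V parts) becomes the up-projections.
W_down_kv = torch.diag(S[:r_kv]) @ Vh[:r_kv]             # (r_kv, d_model)
W_up_k, W_up_v = U[:, :r_kv].split(n_kv_heads * d_head)  # each (n_kv_heads * d_head, r_kv)

# Sanity check: the rank-r_kv product approximates the original stacked projection.
rel_err = (torch.cat([W_up_k, W_up_v]) @ W_down_kv - W_kv).norm() / W_kv.norm()
print(f"relative reconstruction error: {rel_err:.3f}")
```

Starting from such a low-rank approximation of the pre-trained weights, rather than from random initialization, is the intuition behind why the subsequent distillation needs only a few billion tokens.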
Model Composition Pipeline
The X-EcoMLA models are not trained from scratch. Instead, they are composed from powerful pre-trained Transformers through a lightweight and efficient pipeline. The creation of this model followed these stages:
Stage | Action | Description |
---|---|---|
1. Base Model | Llama-3.2-1B-Instruct | The starting point is a high-quality, pre-trained Transformer model. |
2. Initialization | Structured Weight Mapping | MLA models are initialized from the base model's weights using SVD. |
3. SFT | End-to-End Knowledge Distillation | The initialized model is fine-tuned via knowledge distillation from a teacher model (a generic sketch of the loss follows this table). |
4. Alignment | Direct Preference Optimization (DPO) | In the final stage, DPO is used to align the model's preferences, with the distilled student model itself serving as the reference model for stability. |
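For reference, the sketch below shows a generic token-level knowledge-distillation loss of the kind used in stage 3, where the upcycled student is trained to match the teacher's output distribution; the exact objective, temperature, and loss weighting used for X-EcoMLA may differ. In stage 4, the distilled student is then frozen and reused as the reference policy for DPO.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature: float = 1.0):
    """Generic KL-based distillation loss (illustrative; not the exact X-EcoMLA objective).

    Both inputs have shape (batch, seq_len, vocab_size); the teacher forward pass is
    assumed to run under torch.no_grad() upstream.
    """
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student) per token, summed over the vocabulary, then averaged over tokens.
    kl = F.kl_div(s_logp, t_logp, log_target=True, reduction="none").sum(dim=-1)
    return kl.mean() * temperature ** 2
```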
Training Data
Getting Started
Installation
git clone https://github.com/AMD-AIG-AIMA/AMD-Hybrid-Models.git
cd AMD-Hybrid-Models/X-EcoMLA
Then follow the installation instructions in the AMD-AIG-AIMA/AMD-Hybrid-Models repo.
Example Usage
Once the installation is complete, you can run the following code for a quick test:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from mla.hybrid_wrapper import MLATransformerHybridModelWrapper
checkpoint = "amd/X-EcoMLA-1B1B-fixed-kv512-DPO"
model = MLATransformerHybridModelWrapper.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).cuda()
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model.eval()
# Format the prompt using the chat template
prompt = [{"role": "user", "content": "What are the benefits of hybrid language models?"}]
input_ids = tokenizer.apply_chat_template(
prompt,
add_generation_prompt=True,
return_tensors='pt'
).cuda()
# Generate a response
tokens = model.generate(
input_ids,
max_new_tokens=256,
temperature=0.7,
do_sample=True,
eos_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(tokens[0], skip_special_tokens=False))
Model Evaluation
To evaluate the model on the LM Harness zero-shot tasks, run:
python benchmark/llm_eval/lm_harness_eval.py \
--model mla_hybrid \
--model_args pretrained="amd/X-EcoMLA-1B1B-fixed-kv512-DPO" \
--tasks mmlu,hellaswag,piqa,arc_easy,arc_challenge,winogrande,openbookqa,pubmedqa,race \
--num_fewshot 0 --device cuda --batch_size 16
Model details
Model | KV Size (% of base) | Target Model | Teacher Model | Training Tokens | Pre-/Post-Training | r_kv | r_q | d_rope | d_nope |
---|---|---|---|---|---|---|---|---|---|
X-EcoMLA-1B1B-fixed-kv512-DPO | 53.1% | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct | 7B | Post | 512 | 864 | 32 | 32 |
X-EcoMLA-1B1B-dynamic-0.95-DPO | 54.7% | Llama-3.2-1B-Instruct | Llama-3.2-1B-Instruct | 7B | Post | 0.95 | 0.95 | 32 | 32 |
X-EcoMLA-1B8B-fixed-kv64-DPO | 9.4% | Llama-3.2-1B-Instruct | Llama-3.1-8B-Instruct | 7B | Post | 64 | 1424 | 32 | 32 |
X-EcoMLA-3B3B-fixed-kv816-DPO | 43% | Llama-3.2-3B-Instruct | Llama-3.2-3B-Instruct | 7B | Post | 816 | 1536 | 64 | 64 |
X-EcoMLA-3B3B-dynamic-0.95-DPO | 43% | Llama-3.2-3B-Instruct | Llama-3.2-3B-Instruct | 7B | Post | 0.95 | 0.95 | 64 | 64 |
X-EcoMLA-SmolLM-1.7B-fixed-kv480-Pretrain | 12.5% | SmolLM-1.7B | - | 6B | Pre | 480 | 2048 | 32 | 32 |
X-EcoMLA-SmolLM-1.7B1.7B-fixed-kv480-Pretrain | 12.5% | SmolLM-1.7B | SmolLM-1.7B | 6B | Pre | 480 | 2048 | 32 | 32 |
X-EcoMLA-SmolLM-1.7B1.7B-fixed-kv480-DPO | 12.5% | SmolLM-1.7B-Instruct | SmolLM-1.7B-Instruct | 7B | Post | 480 | 2048 | 32 | 32 |
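As a sanity check on the KV Size column above, the reported percentages can be reproduced under the assumption that they measure the per-token MLA cache (the r_kv latent plus the d_rope decoupled RoPE key) relative to the base model's per-token KV cache (2 * n_kv_heads * d_head). The head counts and head dimensions in the snippet below are taken from the published base-model configs and should be treated as assumptions here.

```python
def mla_kv_ratio(r_kv: int, d_rope: int, n_kv_heads: int, d_head: int) -> float:
    """Per-token MLA cache size relative to the base model's KV cache (assumed definition)."""
    return (r_kv + d_rope) / (2 * n_kv_heads * d_head)

print(f"{mla_kv_ratio(512, 32, n_kv_heads=8,  d_head=64):.1%}")   # Llama-3.2-1B, kv512 -> 53.1%
print(f"{mla_kv_ratio(64,  32, n_kv_heads=8,  d_head=64):.1%}")   # Llama-3.2-1B, kv64  -> 9.4%
print(f"{mla_kv_ratio(816, 64, n_kv_heads=8,  d_head=128):.1%}")  # Llama-3.2-3B, kv816 -> 43.0%
print(f"{mla_kv_ratio(480, 32, n_kv_heads=32, d_head=64):.1%}")   # SmolLM-1.7B, kv480  -> 12.5%
```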
Benchmark results
X-EcoMLA was evaluated on zero-shot tasks from the Language Model Evaluation Harness and compared against its base model and other post-training methods. The results demonstrate that X-EcoMLA provides a superior balance of performance and efficiency.
Tasks | Metric | Llama-3.2-1B-Instruct | X-EcoMLA-1B1B-fixed-kv512-DPO | X-EcoMLA-1B1B-dynamic-0.95-DPO | X-EcoMLA-1B8B-fixed-kv64-DPO |
---|---|---|---|---|---|
arc_challenge | acc | 0.3575 (±0.0140) | 0.3643 (±0.0141) | 0.3686 (±0.0141) | 0.3729 (±0.0141) |
arc_challenge | acc_norm | 0.3797 (±0.0142) | 0.3993 (±0.0143) | 0.4121 (±0.0144) | 0.3985 (±0.0143) |
arc_easy | acc | 0.6843 (±0.0095) | 0.6873 (±0.0095) | 0.6932 (±0.0095) | 0.7256 (±0.0092) |
arc_easy | acc_norm | 0.6351 (±0.0099) | 0.6389 (±0.0099) | 0.6486 (±0.0098) | 0.6713 (±0.0096) |
hellaswag | acc | 0.4506 (±0.0050) | 0.4483 (±0.0050) | 0.4459 (±0.0050) | 0.4398 (±0.0050) |
hellaswag | acc_norm | 0.6077 (±0.0049) | 0.6073 (±0.0049) | 0.6096 (±0.0049) | 0.5845 (±0.0049) |
mmlu | acc | 0.4609 (±0.0918) | 0.4239 (±0.0785) | 0.4286 (±0.0809) | 0.3851 (±0.0684) |
- humanities | acc | 0.4397 (±0.0763) | 0.4064 (±0.0663) | 0.4013 (±0.0733) | 0.3609 (±0.0565) |
- other | acc | 0.5204 (±0.0868) | 0.4583 (±0.0760) | 0.4747 (±0.0774) | 0.4052 (±0.0632) |
- social_sciences | acc | 0.5109 (±0.0843) | 0.4686 (±0.0735) | 0.4729 (±0.0734) | 0.4277 (±0.0676) |
- stem | acc | 0.3850 (±0.0900) | 0.3723 (±0.0818) | 0.3806 (±0.0798) | 0.3600 (±0.0768) |
openbookqa | acc | 0.2440 (±0.0192) | 0.2560 (±0.0195) | 0.2660 (±0.0198) | 0.2600 (±0.0196) |
openbookqa | acc_norm | 0.3500 (±0.0214) | 0.3780 (±0.0217) | 0.3760 (±0.0217) | 0.3740 (±0.0217) |
piqa | acc | 0.7405 (±0.0102) | 0.7443 (±0.0102) | 0.7301 (±0.0104) | 0.7334 (±0.0103) |
piqa | acc_norm | 0.7437 (±0.0102) | 0.7492 (±0.0101) | 0.7443 (±0.0102) | 0.7383 (±0.0103) |
pubmedqa | acc | 0.6020 (±0.0219) | 0.5880 (±0.0220) | 0.5860 (±0.0220) | 0.5800 (±0.0221) |
race | acc | 0.3809 (±0.0150) | 0.4077 (±0.0152) | 0.3923 (±0.0151) | 0.3981 (±0.0151) |
winogrande | acc | 0.5967 (±0.0138) | 0.6054 (±0.0137) | 0.5833 (±0.0139) | 0.5927 (±0.0138) |
Conclusion
X-EcoMLA demonstrates an efficient technique for upcycling pre-trained Transformer attention into MLA modules to compress the KV cache. This work highlights the viability of post-training hybridization as a cost-effective and environmentally sustainable alternative to full retraining, paving the way for the deployment of powerful LLMs in resource-constrained environments.
Bias, Risks, and Limitations
- This model is a research artifact and has not been evaluated for safety in production use cases.
- The model's performance is dependent on the quality of its pre-trained base model and the teacher model used during distillation. Its capabilities and biases are inherited from these sources.
- The model may generate content that is factually inaccurate, biased, or otherwise objectionable. Users should be aware of these risks and implement appropriate safeguards for their applications.
- One limitation of this work is the reliance on a strong teacher model for knowledge transfer, which may not always be available. Distillation from a teacher also adds to the resource requirements during the post-training phase.
Citation
If you find this model useful, please consider citing the original paper:
@article{li2025x,
title={X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression},
author={Li, Guihong and Rezagholizadeh, Mehdi and Yang, Mingyu and Appia, Vikram and Barsoum, Emad},
journal={arXiv preprint arXiv:2503.11132},
year={2025}
}