DFloat11 Compressed Model: deepseek-ai/DeepSeek-R1-0528-Qwen3-8B

This model uses DFloat11 lossless compression. It's 32% smaller than the original BFloat16 model, yet produces bit-identical outputs and runs efficiently on GPUs.

📊 Performance Comparison

| Metric | DeepSeek-R1-0528-Qwen3-8B (BFloat16) | DeepSeek-R1-0528-Qwen3-8B (DFloat11) |
| --- | --- | --- |
| Model Size | 16.38 GB | 11.16 GB |
| Peak GPU Memory (1024-token generation) | 16.53 GB | 12.56 GB |
| Generation Time (on an A100 GPU) | 47 seconds | 75 seconds |
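
The 32% size reduction quoted in the introduction follows directly from these numbers: 1 − 11.16 GB / 16.38 GB ≈ 0.32.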

🔍 How It Works

We apply Huffman coding to the exponent bits of BFloat16 model weights, which are highly compressible. We leverage hardware-aware algorithmic designs to enable highly efficient, on-the-fly weight decompression directly on the GPU. Find out more in our research paper.
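
As a rough illustration of why the exponent bits compress well, the sketch below (our own toy example, not the DFloat11 implementation; the tensor and function names are ours) computes the empirical entropy of the 8-bit exponent field of a BFloat16 tensor, which lower-bounds the average Huffman code length per exponent:

    import torch

    def exponent_entropy_bits(weights: torch.Tensor) -> float:
        """Empirical entropy (in bits) of the 8-bit exponent field of a BF16 tensor."""
        assert weights.dtype == torch.bfloat16
        # Reinterpret each BF16 value as its 16-bit pattern:
        # 1 sign bit | 8 exponent bits | 7 mantissa bits.
        bits = weights.contiguous().view(torch.int16).to(torch.int32) & 0xFFFF
        exponents = (bits >> 7) & 0xFF  # extract the 8 exponent bits
        counts = torch.bincount(exponents.flatten(), minlength=256).float()
        probs = counts[counts > 0] / counts.sum()
        return float(-(probs * probs.log2()).sum())

    # Toy example: normally distributed weights concentrate on a few exponent
    # values, so their entropy is far below 8 bits, leaving room for Huffman coding.
    w = torch.randn(1_000_000, dtype=torch.bfloat16) * 0.02
    print(f"exponent entropy: {exponent_entropy_bits(w):.2f} bits (out of 8)")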

🔧 How to Use

  1. Install the DFloat11 pip package (the CUDA kernel is installed automatically; a CUDA-compatible GPU and an existing PyTorch installation are required):

    pip install -U dfloat11[cuda12]
    # or if you have CUDA version 11:
    # pip install -U dfloat11[cuda11]
    
  2. To use the DFloat11 model, run the following example code in Python:

    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from dfloat11 import DFloat11Model
    
    model_name = "DFloat11/DeepSeek-R1-0528-Qwen3-8B-DF11"
    
    # Load the tokenizer and the DFloat11-compressed weights; the weights are
    # decompressed on the fly on the GPU during the forward pass.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = DFloat11Model.from_pretrained(model_name, device_map="auto")
    
    # Build a chat-formatted prompt.
    prompt = "Give me an introduction to large language model."
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    
    # Measure peak GPU memory and end-to-end latency for the generation.
    torch.cuda.reset_peak_memory_stats()
    
    torch.cuda.synchronize()
    start_time = time.time()
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=1024,
    )
    torch.cuda.synchronize()
    end_time = time.time()
    
    # Keep only the newly generated tokens (drop the prompt) and decode them.
    output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
    
    content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
    
    print(f"Latency: {end_time - start_time:.2f} seconds")
    print(f"GPU Peak Memory Usage: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
    print(f"Prompt: {prompt}")
    print(f"Response: {content}")
    
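To spot-check the "bit-identical outputs" claim on your own hardware, a minimal sketch like the following compares the logits of the compressed model against the original BFloat16 checkpoint on a short prompt. This is our own comparison, not part of the DFloat11 package, and it assumes the DFloat11 wrapper exposes the model's standard forward pass (the example above only demonstrates generate); loading both models at once needs roughly 28 GB of GPU memory or offloading via device_map.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from dfloat11 import DFloat11Model

    ref_name = "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B"    # original BF16 weights
    df11_name = "DFloat11/DeepSeek-R1-0528-Qwen3-8B-DF11"  # compressed weights

    tokenizer = AutoTokenizer.from_pretrained(ref_name)
    ref_model = AutoModelForCausalLM.from_pretrained(
        ref_name, torch_dtype=torch.bfloat16, device_map="auto"
    )
    df11_model = DFloat11Model.from_pretrained(df11_name, device_map="auto")

    inputs = tokenizer("Hello, world!", return_tensors="pt").to(ref_model.device)

    with torch.no_grad():
        ref_logits = ref_model(**inputs).logits
        df11_logits = df11_model(**inputs).logits.to(ref_logits.device)

    # Lossless compression should reproduce the BF16 forward pass exactly.
    print("bit-identical logits:", torch.equal(ref_logits, df11_logits))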
