DFloat11 Compressed Model: deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
This model uses DFloat11 lossless compression. It's 32% smaller than the original BFloat16 model, yet produces bit-identical outputs and runs efficiently on GPUs.
📊 Performance Comparison
| Metric | DeepSeek-R1-0528-Qwen3-8B (BFloat16) | DeepSeek-R1-0528-Qwen3-8B (DFloat11) |
|---|---|---|
| Model Size | 16.38 GB | 11.16 GB |
| Peak GPU Memory (1024-token generation) | 16.53 GB | 12.56 GB |
| Generation Time (on an A100 GPU) | 47 seconds | 75 seconds |
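As the table shows, the compressed checkpoint is (16.38 - 11.16) / 16.38 ≈ 32% smaller, while on-the-fly decompression costs roughly 75 / 47 ≈ 1.6x the generation time for this 1024-token run.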
🔍 How It Works
We apply Huffman coding to the exponent bits of BFloat16 model weights, which are highly compressible. We leverage hardware-aware algorithmic designs to enable highly efficient, on-the-fly weight decompression directly on the GPU. Find out more in our research paper.
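The compressibility is easy to check numerically: in a typical weight tensor, the 8 exponent bits of each BFloat16 value carry only a few bits of information, so an entropy code such as Huffman coding can store them far more compactly. The sketch below is not the DFloat11 implementation; it uses a random tensor as a stand-in for real weights and only estimates the exponent entropy that a Huffman code would approach.

```python
import numpy as np
import torch

# Stand-in for a BFloat16 weight tensor (real model weights are similarly skewed).
w = torch.randn(1_000_000, dtype=torch.bfloat16)

# Reinterpret the raw 16 bits: bit 15 = sign, bits 14-7 = exponent, bits 6-0 = mantissa.
raw = w.view(torch.int16).numpy().view(np.uint16)
exponent = ((raw >> 7) & 0xFF).astype(np.int64)

# Empirical entropy of the exponent byte; a Huffman code needs roughly this many bits per value.
counts = np.bincount(exponent, minlength=256)
p = counts[counts > 0] / counts.sum()
entropy = float(-(p * np.log2(p)).sum())
print(f"exponent entropy: {entropy:.2f} bits (vs. 8 bits stored)")
```

With roughly 3 bits of entropy in the exponent, the effective width per weight is about 1 + 3 + 7 ≈ 11 bits, consistent with the DFloat11 name and the ~32% size reduction reported above; the sign and mantissa bits are kept as-is, so the representation stays lossless.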
🔧 How to Use
Install the DFloat11 pip package (the CUDA kernel is installed automatically; a CUDA-compatible GPU and an existing PyTorch installation are required):
```bash
pip install -U dfloat11[cuda12]
# or if you have CUDA version 11:
# pip install -U dfloat11[cuda11]
```
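If you are unsure which extra to pick, PyTorch reports the CUDA version it was built against (a quick check, assuming PyTorch is already installed):

```python
import torch

print(torch.cuda.is_available())  # should print True on a CUDA-compatible GPU
print(torch.version.cuda)         # e.g. '12.4' -> dfloat11[cuda12], '11.8' -> dfloat11[cuda11]
```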
To use the DFloat11 model, run the following example code in Python:
```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from dfloat11 import DFloat11Model

model_name = "DFloat11/DeepSeek-R1-0528-Qwen3-8B-DF11"

# Load the tokenizer and the DFloat11-compressed model.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = DFloat11Model.from_pretrained(model_name, device_map="auto")

# Build a chat-formatted prompt.
prompt = "Give me an introduction to large language model."
messages = [
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Measure latency and peak GPU memory for a 1024-token generation.
torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start_time = time.time()

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
)

torch.cuda.synchronize()
end_time = time.time()

# Strip the prompt tokens and decode only the newly generated text.
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")

print(f"Latency: {end_time - start_time:.2f} seconds")
print(f"GPU Peak Memory Usage: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
print(f"Prompt: {prompt}")
print(f"Response: {content}")
```
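To reproduce the BFloat16 column of the comparison table, the same script can be pointed at the original checkpoint; only the loading step changes. A minimal sketch, assuming the uncompressed model fits in your GPU memory:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # original, uncompressed BFloat16 weights
    device_map="auto",
)
# The rest of the script (chat template, generate, timing, decoding) is unchanged.
```

Because DFloat11 is lossless, the two runs should produce identical outputs given the same sampling settings and seed; only model size, peak memory, and latency differ.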
📄 Learn More
Base model: deepseek-ai/DeepSeek-R1-0528-Qwen3-8B