DFloat11 Compressed Model: deepseek-ai/DeepSeek-R1-0528-Qwen3-8B

This model uses DFloat11 lossless compression. It's 32% smaller than the original BFloat16 model, yet produces bit-identical outputs and runs efficiently on GPUs.

📊 Performance Comparison

| Metric | DeepSeek-R1-0528-Qwen3-8B (BFloat16) | DeepSeek-R1-0528-Qwen3-8B (DFloat11) |
| --- | --- | --- |
| Model Size | 16.38 GB | 11.16 GB |
| Peak GPU Memory (1024-token generation) | 16.53 GB | 12.56 GB |
| Generation Time (on an A100 GPU) | 47 seconds | 75 seconds |
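
The 32% size reduction quoted in the introduction follows directly from these numbers: 1 − 11.16 GB / 16.38 GB ≈ 0.32.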

🔍 How It Works

We apply Huffman coding to the exponent bits of BFloat16 model weights, which are highly compressible. We leverage hardware-aware algorithmic designs to enable highly efficient, on-the-fly weight decompression directly on the GPU. Find out more in our research paper.
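
As a rough illustration of why the exponent bits compress well, the sketch below (our own toy example, not the DFloat11 implementation; the tensor and function names are ours) computes the empirical entropy of the 8-bit exponent field of a BFloat16 tensor, which lower-bounds the average Huffman code length per exponent:

    import torch

    def exponent_entropy_bits(weights: torch.Tensor) -> float:
        """Empirical entropy (in bits) of the 8-bit exponent field of a BF16 tensor."""
        assert weights.dtype == torch.bfloat16
        # Reinterpret each BF16 value as its 16-bit pattern:
        # 1 sign bit | 8 exponent bits | 7 mantissa bits.
        bits = weights.contiguous().view(torch.int16).to(torch.int32) & 0xFFFF
        exponents = (bits >> 7) & 0xFF  # extract the 8 exponent bits
        counts = torch.bincount(exponents.flatten(), minlength=256).float()
        probs = counts[counts > 0] / counts.sum()
        return float(-(probs * probs.log2()).sum())

    # Toy example: normally distributed weights concentrate on a few exponent
    # values, so their entropy is far below 8 bits, leaving room for Huffman coding.
    w = torch.randn(1_000_000, dtype=torch.bfloat16) * 0.02
    print(f"exponent entropy: {exponent_entropy_bits(w):.2f} bits (out of 8)")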

🔧 How to Use

  1. Install the DFloat11 pip package (the CUDA kernel is installed automatically; a CUDA-compatible GPU and an existing PyTorch installation are required):

    pip install -U dfloat11[cuda12]
    # or if you have CUDA version 11:
    # pip install -U dfloat11[cuda11]
    
  2. To use the DFloat11 model, run the following example code in Python:

    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from dfloat11 import DFloat11Model
    
    model_name = "DFloat11/DeepSeek-R1-0528-Qwen3-8B-DF11"
    
    # Load the tokenizer and the DFloat11-compressed weights; the weights are
    # decompressed on the fly on the GPU during the forward pass.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = DFloat11Model.from_pretrained(model_name, device_map="auto")
    
    # Build a chat-formatted prompt.
    prompt = "Give me an introduction to large language model."
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    
    # Measure peak GPU memory and end-to-end latency for the generation.
    torch.cuda.reset_peak_memory_stats()
    
    torch.cuda.synchronize()
    start_time = time.time()
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=1024,
    )
    torch.cuda.synchronize()
    end_time = time.time()
    
    # Keep only the newly generated tokens (drop the prompt) and decode them.
    output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
    
    content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
    
    print(f"Latency: {end_time - start_time:.2f} seconds")
    print(f"GPU Peak Memory Usage: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
    print(f"Prompt: {prompt}")
    print(f"Response: {content}")
    
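To spot-check the "bit-identical outputs" claim on your own hardware, a minimal sketch like the following compares the logits of the compressed model against the original BFloat16 checkpoint on a short prompt. This is our own comparison, not part of the DFloat11 package, and it assumes the DFloat11 wrapper exposes the model's standard forward pass (the example above only demonstrates generate); loading both models at once needs roughly 28 GB of GPU memory or offloading via device_map.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from dfloat11 import DFloat11Model

    ref_name = "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B"    # original BF16 weights
    df11_name = "DFloat11/DeepSeek-R1-0528-Qwen3-8B-DF11"  # compressed weights

    tokenizer = AutoTokenizer.from_pretrained(ref_name)
    ref_model = AutoModelForCausalLM.from_pretrained(
        ref_name, torch_dtype=torch.bfloat16, device_map="auto"
    )
    df11_model = DFloat11Model.from_pretrained(df11_name, device_map="auto")

    inputs = tokenizer("Hello, world!", return_tensors="pt").to(ref_model.device)

    with torch.no_grad():
        ref_logits = ref_model(**inputs).logits
        df11_logits = df11_model(**inputs).logits.to(ref_logits.device)

    # Lossless compression should reproduce the BF16 forward pass exactly.
    print("bit-identical logits:", torch.equal(ref_logits, df11_logits))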
