Update README.md
Update the README

README.md (CHANGED):

---
license: mit
---

# About

This model is a research project by Novita AI focused on optimizing large language model inference efficiency while preserving output quality. DeepSeek-R1-Distill-Llama-70B applies quantization techniques that deliver significantly higher throughput without compromising accuracy.

# Model Description

DeepSeek-R1-Distill-Llama-70B is available in two configurations:

- Standard configuration (bf16)
- Optimized configuration with 8-bit weight, activation, and KV cache quantization (w8a8kv8)

# Key Features

- Model Architecture: Based on the Llama architecture with 70B parameters
- Optimized Performance: Achieves 1.6× higher throughput with the w8a8kv8 configuration
- Quantization Innovation:
  - Weight quantization
  - KV cache optimization using fp8
- Context Length: Supports up to 131,072 tokens
- Precision Options:
  - bf16 for the standard version
  - w8a8kv8 for the optimized version

# Methods

The model employs the following quantization and runtime optimizations:

- Weight quantization for model compression
- KV cache quantization using fp8 (see the sketch below)
- Backend optimization with the FlashInfer attention backend for enhanced performance
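
To make the fp8 KV cache idea concrete, here is a minimal, illustrative sketch of per-tensor e4m3 quantization in PyTorch. The tensor shape and the per-tensor scaling scheme are assumptions for demonstration only, not Novita AI's production kernels.

```python
import torch

# Illustrative sketch only: quantize a KV cache block to fp8 (e4m3) with a
# per-tensor scale, the idea behind kv_cache_dtype="fp8". The shape and the
# scaling scheme are assumptions, not the production implementation.
kv_block = torch.randn(8, 1024, 128, dtype=torch.bfloat16)   # (heads, tokens, head_dim) -- assumed shape

FP8_E4M3_MAX = 448.0                                          # largest finite float8_e4m3fn value
scale = kv_block.abs().amax().float() / FP8_E4M3_MAX          # per-tensor scale factor
kv_fp8 = (kv_block.float() / scale).to(torch.float8_e4m3fn)   # stored at 1 byte per element
kv_dequant = kv_fp8.float() * scale                           # dequantized when attention reads it

print(kv_block.element_size(), "->", kv_fp8.element_size())   # 2 bytes -> 1 byte per element
print((kv_block.float() - kv_dequant).abs().max())            # quantization error stays small
```

Storing each cached key/value element in one byte instead of two roughly doubles the number of tokens that fit in the same KV cache budget, which contributes to the throughput gains reported below.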

# Model Usage

## Quick Start

For optimal performance with the w8a8kv8 configuration:

```python
import os

# Environment setup: select the FlashInfer attention backend before creating the engine
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

# Model configuration
model_config = {
    "max_model_len": 131072,    # maximum context length
    "max_gen_tokens": 1024,     # maximum tokens generated per request
    "tensor_parallel_size": 2,  # number of GPUs for tensor parallelism
    "kv_cache_dtype": "fp8",    # fp8-quantized KV cache
}
```
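
As an end-to-end example, the sketch below passes these settings to vLLM's offline `LLM` API. The model identifier is a placeholder rather than a confirmed repository name, and the prompt and sampling settings are illustrative only.

```python
import os
from vllm import LLM, SamplingParams

# Select the FlashInfer attention backend before the engine is created.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

# NOTE: the model path below is a placeholder; point it at the actual w8a8kv8 checkpoint.
llm = LLM(
    model="novita/DeepSeek-R1-Distill-Llama-70B-w8a8kv8",
    tensor_parallel_size=2,   # 2 GPUs for the optimized configuration
    max_model_len=131072,     # full supported context length
    kv_cache_dtype="fp8",     # fp8-quantized KV cache
)

sampling = SamplingParams(max_tokens=1024, temperature=0.6)
outputs = llm.generate(["Explain KV cache quantization in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```

If the engine cannot allocate enough KV cache blocks at the full 131,072-token window, lower `max_model_len` or add GPUs.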

# Hardware Requirements

- Standard (bf16): 4 GPUs, tensor parallel size = 4
- Optimized (w8a8kv8): 2 GPUs, tensor parallel size = 2
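
A rough weight-memory estimate shows why the quantized configuration halves the GPU count. The 80 GB-per-GPU figure below is an assumption for illustration, not a requirement stated in this card.

```python
# Back-of-the-envelope memory estimate for the 70B-parameter weights.
# GPU_MEM_GB = 80 is an assumption for illustration (80 GB-class accelerators).
PARAMS = 70e9
GPU_MEM_GB = 80

for name, bytes_per_param in [("bf16", 2), ("w8a8 (8-bit weights)", 1)]:
    weight_gb = PARAMS * bytes_per_param / 1e9
    min_gpus = weight_gb / GPU_MEM_GB
    print(f"{name}: ~{weight_gb:.0f} GB of weights -> more than {min_gpus:.1f} GPUs before KV cache and activations")
```

The remaining headroom on each GPU goes to the KV cache, which grows with the 131,072-token context, and to activations; hence the recommended 4 GPUs for bf16 and 2 for w8a8kv8.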

# Model Evaluation

## Benchmark Results

1. Throughput Performance:
   - The w8a8kv8 configuration achieves 1.6× higher throughput than bf16
2. MMLU Benchmark Scores:
   - bf16: 0.5158 (exact match)
   - w8a8kv8: 0.5169 (exact match)
3. Subject-specific Performance:
   - Notable improvements in:
     - Biology (+1.11%)
     - Economics (+0.83%)
     - Physics (+0.92%)
   - Slight regressions in:
     - History (-1.57%)
     - Law (-1.46%)

# Limitations and Bias

- The fp8 KV cache requires specific backend support (e.g. the FlashInfer attention backend)
- Performance may vary depending on hardware configuration
- Subject-specific accuracy shows slight variations across domains

# Community

Join our community discussions and get support:

- Discord: [Novita AI Discord Community](https://discord.com/invite/YyPRAzwp7P)