Update README.md
Update the README

README.md (CHANGED):

---
license: mit
---

# About

This model is a research project by Novita AI focused on optimizing large language model inference efficiency while preserving output quality. DeepSeek-R1-Distill-Llama-70B applies quantization techniques that deliver significantly higher throughput without compromising accuracy.

# Model Description

DeepSeek-R1-Distill-Llama-70B is available in two configurations:

- Standard configuration (bf16)
- Optimized configuration with 8-bit weight, activation, and KV cache quantization (w8a8kv8)

# Key Features

- Model Architecture: Based on the Llama architecture with 70B parameters
- Optimized Performance: Achieves 1.6× higher throughput with the w8a8kv8 configuration
- Quantization Innovation:
  - Weight quantization
  - KV cache optimization using fp8
- Context Length: Supports up to 131,072 tokens
- Precision Options:
  - bf16 for the standard version
  - w8a8kv8 for the optimized version

# Methods

The model employs the following quantization and runtime optimizations:

- Weight quantization for model compression
- KV cache quantization using fp8 (see the sketch below)
- Backend optimization with the FlashInfer attention backend for enhanced performance
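
To make the fp8 KV cache idea concrete, here is a minimal, illustrative sketch of per-tensor e4m3 quantization in PyTorch. The tensor shape and the per-tensor scaling scheme are assumptions for demonstration only, not Novita AI's production kernels.

```python
import torch

# Illustrative sketch only: quantize a KV cache block to fp8 (e4m3) with a
# per-tensor scale, the idea behind kv_cache_dtype="fp8". The shape and the
# scaling scheme are assumptions, not the production implementation.
kv_block = torch.randn(8, 1024, 128, dtype=torch.bfloat16)   # (heads, tokens, head_dim) -- assumed shape

FP8_E4M3_MAX = 448.0                                          # largest finite float8_e4m3fn value
scale = kv_block.abs().amax().float() / FP8_E4M3_MAX          # per-tensor scale factor
kv_fp8 = (kv_block.float() / scale).to(torch.float8_e4m3fn)   # stored at 1 byte per element
kv_dequant = kv_fp8.float() * scale                           # dequantized when attention reads it

print(kv_block.element_size(), "->", kv_fp8.element_size())   # 2 bytes -> 1 byte per element
print((kv_block.float() - kv_dequant).abs().max())            # quantization error stays small
```

Storing each cached key/value element in one byte instead of two roughly doubles the number of tokens that fit in the same KV cache budget, which contributes to the throughput gains reported below.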

# Model Usage

## Quick Start

For optimal performance with the w8a8kv8 configuration:

```python
import os

# Environment setup: select the FlashInfer attention backend before creating the engine
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

# Model configuration
model_config = {
    "max_model_len": 131072,    # maximum context length
    "max_gen_tokens": 1024,     # maximum tokens generated per request
    "tensor_parallel_size": 2,  # number of GPUs for tensor parallelism
    "kv_cache_dtype": "fp8",    # fp8-quantized KV cache
}
```
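
As an end-to-end example, the sketch below passes these settings to vLLM's offline `LLM` API. The model identifier is a placeholder rather than a confirmed repository name, and the prompt and sampling settings are illustrative only.

```python
import os
from vllm import LLM, SamplingParams

# Select the FlashInfer attention backend before the engine is created.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

# NOTE: the model path below is a placeholder; point it at the actual w8a8kv8 checkpoint.
llm = LLM(
    model="novita/DeepSeek-R1-Distill-Llama-70B-w8a8kv8",
    tensor_parallel_size=2,   # 2 GPUs for the optimized configuration
    max_model_len=131072,     # full supported context length
    kv_cache_dtype="fp8",     # fp8-quantized KV cache
)

sampling = SamplingParams(max_tokens=1024, temperature=0.6)
outputs = llm.generate(["Explain KV cache quantization in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```

If the engine cannot allocate enough KV cache blocks at the full 131,072-token window, lower `max_model_len` or add GPUs.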

# Hardware Requirements

- Standard (bf16): 4 GPUs, tensor parallel size = 4
- Optimized (w8a8kv8): 2 GPUs, tensor parallel size = 2
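
A rough weight-memory estimate shows why the quantized configuration halves the GPU count. The 80 GB-per-GPU figure below is an assumption for illustration, not a requirement stated in this card.

```python
# Back-of-the-envelope memory estimate for the 70B-parameter weights.
# GPU_MEM_GB = 80 is an assumption for illustration (80 GB-class accelerators).
PARAMS = 70e9
GPU_MEM_GB = 80

for name, bytes_per_param in [("bf16", 2), ("w8a8 (8-bit weights)", 1)]:
    weight_gb = PARAMS * bytes_per_param / 1e9
    min_gpus = weight_gb / GPU_MEM_GB
    print(f"{name}: ~{weight_gb:.0f} GB of weights -> more than {min_gpus:.1f} GPUs before KV cache and activations")
```

The remaining headroom on each GPU goes to the KV cache, which grows with the 131,072-token context, and to activations; hence the recommended 4 GPUs for bf16 and 2 for w8a8kv8.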

# Model Evaluation

## Benchmark Results

1. Throughput Performance:
   - The w8a8kv8 configuration achieves 1.6× higher throughput than bf16
2. MMLU Benchmark Scores:
   - bf16: 0.5158 (exact match)
   - w8a8kv8: 0.5169 (exact match)
3. Subject-specific Performance:
   - Notable improvements in:
     - Biology (+1.11%)
     - Economics (+0.83%)
     - Physics (+0.92%)
   - Slight regressions in:
     - History (-1.57%)
     - Law (-1.46%)

# Limitations and Bias

- The fp8 KV cache requires specific backend support (e.g. the FlashInfer attention backend)
- Performance may vary depending on hardware configuration
- Subject-specific accuracy shows slight variations across domains

# Community

Join our community discussions and get support:

- Discord: [Novita AI Discord Community](https://discord.com/invite/YyPRAzwp7P)