novita-ai committed
Commit 1239071 · verified · 1 Parent(s): 0b5bf00

Update README.md


Update the README

Files changed (1)
  1. README.md +76 -6
README.md CHANGED
@@ -1,6 +1,76 @@
- ---
- license: mit
- ---
-
- === About
- This model is a research project by Novita AI, focusing on optimizing large language model inference efficiency while maintaining high performance. The DeepSeek-R1-Distill-Llama-70B model implements innovative quantization techniques to achieve significant throughput improvements without compromising accuracy.
+ ---
+ license: mit
+ ---
+
+ # About
+ This model is a research project by Novita AI, focusing on optimizing large language model inference efficiency while maintaining high performance. The DeepSeek-R1-Distill-Llama-70B model implements innovative quantization techniques to achieve significant throughput improvements without compromising accuracy.
+
+ # Model Description
+ DeepSeek-R1-Distill-Llama-70B is available in two configurations:
+ - Standard configuration (bf16)
+ - Optimized configuration with weight and KV cache quantization (w8a8kv8)
+
+ # Key Features
+ - Model Architecture: Based on the Llama architecture with 70B parameters
+ - Optimized Performance: Achieves 1.6× higher throughput with the w8a8kv8 configuration
+ - Quantization Innovation:
+   - Weight quantization
+   - KV cache optimization using fp8
+ - Context Length: Supports up to 131,072 tokens
+ - Precision Options:
+   - bf16 for the standard version
+   - w8a8kv8 for the optimized version
+
+ # Methods
+ The model employs the following quantization and serving optimizations (a small illustrative sketch follows the list):
+ - Weight quantization for model compression
+ - KV cache optimization using fp8
+ - Backend optimization with FLASHINFER for enhanced performance
+
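+ To make the weight-quantization idea above concrete, here is a minimal, illustrative sketch of symmetric per-channel int8 weight quantization. This is an assumption-level illustration of the general technique, not the actual pipeline used to produce the w8a8kv8 checkpoint:
+
+ ```python
+ import torch
+
+ def quantize_weight_int8(w: torch.Tensor):
+     """Symmetric per-output-channel int8 quantization of a 2-D weight matrix.
+
+     Illustrative only: real w8a8 pipelines also calibrate activation scales,
+     which is not shown here.
+     """
+     # One scale per output channel (row), chosen so the largest value maps to 127.
+     max_abs = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
+     scale = max_abs / 127.0
+     q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
+     return q, scale
+
+ def dequantize_weight(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
+     # Recover an approximation of the original full-precision weights.
+     return q.to(torch.float32) * scale
+
+ w = torch.randn(1024, 1024)            # toy weight matrix
+ q, scale = quantize_weight_int8(w)
+ w_hat = dequantize_weight(q, scale)
+ print("max abs quantization error:", (w - w_hat).abs().max().item())
+ ```
+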
+ # Model Usage
+
+ ## Quick Start
+
+ For optimal performance with the w8a8kv8 configuration, set the FLASHINFER attention backend and enable the fp8 KV cache:
+ ```python
+ import os
+
+ # Use the FLASHINFER attention backend (equivalent to:
+ # export VLLM_ATTENTION_BACKEND=FLASHINFER); needed for the fp8 KV cache.
+ os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
+
+ # Model configuration for the w8a8kv8 setup
+ model_config = {
+     "max_model_len": 131072,       # maximum context length in tokens
+     "max_gen_tokens": 1024,        # maximum tokens generated per request
+     "tensor_parallel_size": 2,     # number of GPUs for tensor parallelism
+     "kv_cache_dtype": "fp8",       # quantized KV cache
+ }
+ ```
+
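+ The dictionary above only summarizes the settings. Assuming the model is served with vLLM (as the `VLLM_ATTENTION_BACKEND` variable suggests), these values would map onto the engine arguments roughly as sketched below; the repository id and generation parameters are assumptions for illustration, not taken from this README:
+
+ ```python
+ import os
+
+ # Must be set before the vLLM engine is initialized.
+ os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
+
+ from vllm import LLM, SamplingParams
+
+ llm = LLM(
+     model="novita-ai/DeepSeek-R1-Distill-Llama-70B",  # assumed repo id; replace as needed
+     max_model_len=131072,
+     tensor_parallel_size=2,
+     kv_cache_dtype="fp8",
+ )
+
+ params = SamplingParams(max_tokens=1024, temperature=0.6)
+ outputs = llm.generate(["Explain KV cache quantization in one paragraph."], params)
+ print(outputs[0].outputs[0].text)
+ ```
+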
+ # Hardware Requirements
+ - Standard (bf16): 4 GPUs, tensor parallel size = 4
+ - Optimized (w8a8kv8): 2 GPUs, tensor parallel size = 2
+
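+ As a rough back-of-the-envelope check on these GPU counts (an illustration only, not an official sizing guide): 70B parameters occupy about 2 bytes each in bf16 and about 1 byte each with int8 weights, which is why the optimized configuration can halve the tensor-parallel size at a similar per-GPU weight footprint. KV cache, activations, and framework overhead are ignored here:
+
+ ```python
+ # Rough weight-memory estimate; ignores KV cache, activations, and overhead.
+ PARAMS = 70e9
+
+ def weights_gb(bytes_per_param: float) -> float:
+     return PARAMS * bytes_per_param / 1e9
+
+ for name, bytes_per_param, gpus in [("bf16", 2.0, 4), ("int8 weights (w8a8)", 1.0, 2)]:
+     total = weights_gb(bytes_per_param)
+     print(f"{name:>20}: ~{total:.0f} GB weights, ~{total / gpus:.0f} GB per GPU across {gpus} GPUs")
+ ```
+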
+ # Model Evaluation
+
+ ## Benchmark Results
+ 1. Throughput Performance:
+    - The w8a8kv8 configuration achieves 1.6× higher throughput compared to bf16
+ 2. MMLU Benchmark Scores:
+    - bf16: 0.5158 (exact match)
+    - w8a8kv8: 0.5169 (exact match)
+ 3. Subject-specific Performance:
+    - Notable improvements in:
+      - Biology (+1.11%)
+      - Economics (+0.83%)
+      - Physics (+0.92%)
+    - Slight regressions in:
+      - History (-1.57%)
+      - Law (-1.46%)
+
+ # Limitations and Bias
+ - Requires specific backend optimizations for fp8 KV cache
+ - Performance may vary depending on hardware configuration
+ - Subject-specific performance shows slight variations across different domains
+
+ # Community
+ Join our community discussions and get support:
+ - Discord: [Novita AI Discord Community](https://discord.com/invite/YyPRAzwp7P)