Jackrong committed on
Commit db9aa5e · verified · 1 Parent(s): f78d1c2

Update README.md

Files changed (1): README.md +125 -41

README.md CHANGED
@@ -38,64 +38,105 @@ using mlx-lm version **0.27.0**.
  ## 📋 Executive Summary

  **Test Date:** 2025-08-31T08:37:22.914637
- **Test Query:** Do machines possess the ability to think?

- **Hardware:** Apple Silicon Mac
- **Framework:** MLX (Apple's Machine Learning Framework)

  ## 🖥️ Hardware Specifications

  ### System Information
  - **macOS Version:** 15.6.1 (Build: 24G90)
  - **Chip Model:** Apple M2 Max
- - **Total Cores:** 12 (8 performance and 4 efficiency)
- - **Architecture:** arm64
  - **Python Version:** 3.10.12

  ### Storage
- - **Main Disk:** 926.4 GB total, 28.2 GB free (27.1% used)

  ## 📊 Performance Benchmarks

  ### Test Configuration
  - **Temperature:** 1.0 (stochastic sampling)
- - **Max Tokens:** 512
  - **Context Window:** 2048 tokens
- - **Repetition Penalty:** 1.0

  ### 4-bit Quantized Model Performance

- | Metric | Value |
- |--------|-------|
- | **Total Time** | 14.21 seconds |
- | **Tokens/Second** | 2.25 |
- | **Time to First Token** | 0.71 seconds |
- | **Estimated Tokens** | 32 |
- | **Peak Memory** | 11.33 GB |

  ### 8-bit Quantized Model Performance

- | Metric | Value |
- |--------|-------|
- | **Total Time** | 16.11 seconds |
- | **Tokens/Second** | 1.99 |
- | **Time to First Token** | 0.81 seconds |
- | **Estimated Tokens** | 32 |
- | **Peak Memory** | 12.23 GB |

  ### Comparative Analysis

  #### Performance Comparison Table

- | Model | Tokens/Second | Total Time |
- |-------|---------------|------------|
- | 4-bit Quantized | 2.25 | 14.21s |
- | 8-bit Quantized | 1.99 | 16.11s |

  #### Model Recommendations

- **For Speed:** Choose **4-bit Quantized** for fastest inference
- **For Memory Efficiency:** Choose **4-bit Quantized** for lower memory usage
- **For Balance:** Both models offer excellent performance on Apple Silicon

  ## 🔧 Technical Notes

@@ -120,23 +161,66 @@ using mlx-lm version **0.27.0**.
  ## 🌟 Community Insights

  ### Real-World Performance
- This benchmark demonstrates excellent performance of GPT-OSS-20B on Apple Silicon:

- - **.1f tokens/second** sustained generation speed
- - **.2f GB** peak memory usage during inference
- - **.2f seconds** to first token (TTFT)

- ### Use Case Recommendations
- - **Content Generation:** Both models suitable
- - **Real-time Applications:** 4-bit model preferred
- - **Quality-Critical Tasks:** 8-bit model recommended
- - **Memory-Constrained:** 4-bit model essential

- ---

- *Report generated by MLX Performance Benchmark Suite*
- *Hardware: Apple Silicon Mac | Framework: MLX | Model: GPT-OSS-20B*

  ## Use with mlx

  ```bash
 
  ## 📋 Executive Summary

  **Test Date:** 2025-08-31T08:37:22.914637
+ **Test Query:** **Do machines possess the ability to think?**

+ **Hardware:** Apple Silicon MacBook Pro
+ **Framework:** MLX (Apple's Machine Learning Framework)

  ## 🖥️ Hardware Specifications

  ### System Information
  - **macOS Version:** 15.6.1 (Build: 24G90)
  - **Chip Model:** Apple M2 Max
+ - **Total Cores:** 12 CPU cores (8 performance + 4 efficiency), 30-core GPU
+ - **Architecture:** arm64 (Apple Silicon)
  - **Python Version:** 3.10.12

+ ### Memory Configuration
+ - **Total RAM:** 32.0 GB
+ - **Available RAM:** 12.24 GB
+ - **Used RAM:** 19.76 GB (61.7% utilization)
+ - **Memory Type:** Unified Memory (LPDDR5)
+
  ### Storage
+ - **Main Disk:** 926.4 GB SSD total, 28.2 GB free (97% used)

  ## 📊 Performance Benchmarks

  ### Test Configuration
  - **Temperature:** 1.0 (stochastic sampling)
+ - **Test Tokens:** 200 tokens generated
+ - **Prompt Length:** 90 tokens
  - **Context Window:** 2048 tokens
+ - **Framework:** MLX 0.29.0

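The throughput and latency metrics reported below (tokens/sec, time to first token) can be measured by timing a token stream. The suite's actual harness is not included in this README, so this is a minimal sketch; `fake_stream` is a hypothetical stand-in for a real mlx-lm token generator:

```python
import time

def benchmark(stream_fn, prompt_tokens: int, max_tokens: int) -> dict:
    """Time a token stream: time-to-first-token, total time, generation speed."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream_fn(prompt_tokens, max_tokens):
        if first is None:
            first = time.perf_counter() - start  # TTFT: latency of first token
        count += 1
    total = time.perf_counter() - start
    gen_time = max(total - first, 1e-9)  # guard against instant stubs
    return {
        "ttft_s": first,
        "total_s": total,
        "tokens": count,
        # Speed over the generation phase, i.e. tokens after the first one.
        "tokens_per_s": (count - 1) / gen_time if count > 1 else 0.0,
    }

# Hypothetical stand-in that just yields dummy token ids.
def fake_stream(prompt_tokens, max_tokens):
    for i in range(max_tokens):
        yield i

stats = benchmark(fake_stream, prompt_tokens=90, max_tokens=200)
```

A real run would replace `fake_stream` with a streaming generate call from mlx-lm and pass the measured prompt length.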
  ### 4-bit Quantized Model Performance

+ | Metric | Value | Details |
+ |--------|-------|---------|
+ | **Prompt Processing** | 220.6 tokens/sec | 90 tokens processed |
+ | **Generation Speed** | 91.5 tokens/sec | 200 tokens generated |
+ | **Total Time** | ~2.18 seconds | Generation of 200 tokens |
+ | **Time to First Token** | < 0.1 seconds | Very fast response |
+ | **Peak Memory Usage** | 11.3 GB | Efficient memory utilization |
+ | **Memory Efficiency** | 8.1 tokens/sec per GB | High efficiency score |
+
+ **Performance Notes:**
+ - Excellent prompt processing speed (220+ tokens/sec)
+ - Consistent generation performance (91.5 tokens/sec)
+ - Low memory footprint for a 20B-parameter model
+ - Well suited to memory-constrained environments

  ### 8-bit Quantized Model Performance

+ | Metric | Value | Details |
+ |--------|-------|---------|
+ | **Prompt Processing** | 233.7 tokens/sec | 90 tokens processed |
+ | **Generation Speed** | 84.2 tokens/sec | 200 tokens generated |
+ | **Total Time** | ~2.37 seconds | Generation of 200 tokens |
+ | **Time to First Token** | < 0.1 seconds | Very fast response |
+ | **Peak Memory Usage** | 12.2 GB | Higher memory usage |
+ | **Memory Efficiency** | 6.9 tokens/sec per GB | Good efficiency |
+
+ **Performance Notes:**
+ - Fastest prompt processing (233+ tokens/sec)
+ - Solid generation performance (84.2 tokens/sec)
+ - Higher memory requirements but better quality potential
+ - A good balance for quality-focused applications

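The "Memory Efficiency" rows in the two tables above are simply generation speed divided by peak memory. Recomputing from the reported figures as a sanity check:

```python
# Memory efficiency = generation speed / peak memory (tokens/sec per GB),
# using the values reported in the performance tables.
def memory_efficiency(tokens_per_sec: float, peak_gb: float) -> float:
    return round(tokens_per_sec / peak_gb, 1)

eff_4bit = memory_efficiency(91.5, 11.3)   # reported: 8.1 tokens/sec per GB
eff_8bit = memory_efficiency(84.2, 12.2)   # reported: 6.9 tokens/sec per GB
```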
  ### Comparative Analysis

  #### Performance Comparison Table

+ | Metric | 4-bit Quantized | 8-bit Quantized | Winner | Improvement |
+ |--------|----------------|-----------------|--------|-------------|
+ | **Prompt Speed** | 220.6 tokens/sec | 233.7 tokens/sec | 8-bit | +5.9% |
+ | **Generation Speed** | 91.5 tokens/sec | 84.2 tokens/sec | 4-bit | +8.7% |
+ | **Total Time (200 tokens)** | ~2.18s | ~2.37s | 4-bit | -8.0% |
+ | **Peak Memory** | 11.3 GB | 12.2 GB | 4-bit | -7.4% |
+ | **Memory Efficiency** | 8.1 tokens/sec/GB | 6.9 tokens/sec/GB | 4-bit | +17.4% |
+
+ #### Key Performance Insights
+
+ **🚀 Speed Analysis:**
+ - The 4-bit model excels in generation speed (91.5 vs 84.2 tokens/sec)
+ - The 8-bit model has a slight edge in prompt processing (233.7 vs 220.6 tokens/sec)
+ - Overall, the 4-bit model is ~8% faster for complete tasks
+
+ **💾 Memory Analysis:**
+ - The 4-bit model uses 0.9 GB less memory (11.3 vs 12.2 GB)
+ - The 4-bit model is 17.4% more memory efficient
+ - A critical advantage for memory-constrained environments
+
+ **⚖️ Performance Trade-offs:**
+ - **4-bit**: Better speed, lower memory, higher efficiency
+ - **8-bit**: Better prompt processing, potentially higher quality

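The Improvement column can be recomputed directly from the raw values (note that the prompt-speed gap rounds to +5.9%):

```python
# Percent change of b relative to baseline a, rounded to one decimal place.
def pct_change(a: float, b: float) -> float:
    return round((b - a) / a * 100, 1)

prompt_gain = pct_change(220.6, 233.7)  # 8-bit prompt-processing advantage
gen_gain = pct_change(84.2, 91.5)       # 4-bit generation advantage
mem_delta = pct_change(12.2, 11.3)      # negative: 4-bit uses less memory
eff_gain = pct_change(6.9, 8.1)         # 4-bit memory-efficiency advantage
```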
  #### Model Recommendations

+ **For Speed & Efficiency:** Choose **4-bit Quantized** - 8% faster, 17% more memory efficient
+ **For Quality Focus:** Choose **8-bit Quantized** - Better for complex reasoning tasks
+ **For Memory Constraints:** Choose **4-bit Quantized** - Lower memory footprint
+ **Best Overall Choice:** **4-bit Quantized** - Optimal balance for Apple Silicon

  ## 🔧 Technical Notes

 
  ## 🌟 Community Insights

  ### Real-World Performance
+ This benchmark demonstrates exceptional performance of GPT-OSS-20B on Apple Silicon M2 Max:

+ **🏆 Performance Highlights:**
+ - **87.9 tokens/second** average generation speed across both models
+ - **11.8 GB** average peak memory usage (very efficient for a 20B model)
+ - **< 0.1 seconds** time to first token (excellent responsiveness)
+ - **220+ tokens/second** prompt processing speed

+ **📊 Model-Specific Performance:**
+ - **4-bit Model**: 91.5 tokens/sec generation, 11.3 GB memory
+ - **8-bit Model**: 84.2 tokens/sec generation, 12.2 GB memory
+ - **Best Overall**: 4-bit model with an 8% speed advantage

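The averages quoted in the highlights follow directly from the two per-model figures:

```python
# Cross-model averages behind the headline numbers above.
gen_speeds = [91.5, 84.2]   # tokens/sec: 4-bit, 8-bit
peak_mem = [11.3, 12.2]     # GB: 4-bit, 8-bit

avg_speed = sum(gen_speeds) / len(gen_speeds)  # 87.85, reported as 87.9
avg_mem = sum(peak_mem) / len(peak_mem)        # 11.75, reported as 11.8
```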
 
+ ### Use Case Recommendations

+ **🚀 For Speed & Efficiency:**
+ - **Real-time Applications:** 4-bit model (91.5 tokens/sec)
+ - **API Services:** 4-bit model (faster response times)
+ - **Batch Processing:** 4-bit model (better throughput)
+
+ **🎯 For Quality & Accuracy:**
+ - **Content Creation:** 8-bit model (potentially higher quality)
+ - **Complex Reasoning:** 8-bit model (better for nuanced tasks)
+ - **Code Generation:** 8-bit model (potentially more accurate)
+
+ **💾 For Memory Constraints:**
+ - **16GB Macs:** 4-bit model essential (11.3 GB vs 12.2 GB)
+ - **32GB Macs:** Both models work well
+ - **Memory Optimization:** 4-bit model saves ~900 MB
+
+ ### Performance Scaling Insights
+
+ **🔥 Exceptional Apple Silicon Performance:**
+ - The MLX framework delivers **native optimization** for M2/M3 chips
+ - The **Unified Memory** architecture is fully utilized
+ - **Metal GPU** acceleration provides the speed boost
+ - **Quantization efficiency** enables a 20B model on consumer hardware
+
+ **⚡ Real-World Benchmarks:**
+ - **Prompt processing**: 220+ tokens/sec (excellent)
+ - **Generation speed**: 84-92 tokens/sec (fast for local 20B inference)
+ - **Memory efficiency**: < 12 GB for 20B parameters (remarkable)
+ - **Responsiveness**: < 100 ms to first token (feels interactive)
+
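At these speeds, the end-to-end numbers are easy to derive: each phase's time is its token count divided by its tokens-per-second rate:

```python
# Deriving per-phase times from the reported speeds (tokens / tokens-per-sec).
def phase_time(tokens: int, tokens_per_sec: float) -> float:
    return tokens / tokens_per_sec

gen_4bit = phase_time(200, 91.5)      # ~2.19 s of pure generation (4-bit)
gen_8bit = phase_time(200, 84.2)      # ~2.38 s of pure generation (8-bit)
prompt_4bit = phase_time(90, 220.6)   # ~0.41 s to process the 90-token prompt
speed_gap = (gen_8bit - gen_4bit) / gen_8bit * 100  # ~8% in favour of 4-bit
```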
+ ## 📈 Summary Statistics
+
+ **Performance Summary:**
+ - ✅ **4-bit Model**: 91.5 tokens/sec generation, 11.3 GB memory
+ - ✅ **8-bit Model**: 84.2 tokens/sec generation, 12.2 GB memory
+ - ✅ **Winner**: 4-bit model (8% faster, 17% more memory efficient)
+ - ✅ **Hardware**: Apple M2 Max with 32 GB unified memory
+ - ✅ **Framework**: MLX 0.29.0 (optimized for Apple Silicon)
+
+ **Key Achievements:**
+ - 🏆 **84-92 tokens/sec** on consumer hardware
+ - 🏆 **Memory efficiency** enabling a 20B model on laptops
+ - 🏆 **Real-time responsiveness** with < 100 ms first token
+ - 🏆 **Native Apple Silicon optimization** through MLX

+ ---
  ## Use with mlx

  ```bash