language:
- en
- zh
license: apache-2.0
library_name: mlx
tags:
- text-generation
- mlx
- apple-silicon
- gpt
- quantized
- 4bit-quantization
pipeline_tag: text-generation
base_model: openai/gpt-oss-20b
model-index:
- name: gpt-oss-20b-MLX-4bit
results:
- task:
type: text-generation
dataset:
name: GPT-OSS-20B Evaluation
type: openai/gpt-oss-20b
metrics:
- type: bits_per_weight
value: 4.276
name: Bits per weight (4-bit)
gpt-oss-20b-MLX-4bit
This model Jackrong/gpt-oss-20b-MLX-4bit was converted to MLX format from openai/gpt-oss-20b using mlx-lm version 0.27.0.
🚀 GPT-OSS-20B MLX Performance Report - Apple Silicon
📋 Executive Summary
Test Date: 2025-08-31T08:37:22.914637
Test Query: Do machines possess the ability to think?
Hardware: Apple Silicon MacBookPro Framework: MLX (Apple's Machine Learning Framework)
🖥️ Hardware Specifications
System Information
- macOS Version: 15.6.1 (Build: 24G90)
- Chip Model: Apple M2 Max
- Total Cores: 12 cores (8 performance + 4 efficiency) 30 cores GPU
- Architecture: arm64 (Apple Silicon)
- Python Version: 3.10.12
Memory Configuration
- Total RAM: 32.0 GB
- Available RAM: 12.24 GB
- Used RAM: 19.76 GB (61.7% utilization)
- Memory Type: Unified Memory (LPDDR5)
Storage
- Main Disk: 926.4 GB SSD total, 28.2 GB free (27.1% used)
📊 Performance Benchmarks
Test Configuration
- Temperature: 1.0 (deterministic generation)
- Test Tokens: 200 tokens generation
- Prompt Length: 90 tokens
- Context Window: 2048 tokens
- Framework: MLX 0.29.0
4-bit Quantized Model Performance
Metric | Value | Details |
---|---|---|
Prompt Processing | 220.6 tokens/sec | 90 tokens processed |
Generation Speed | 91.5 tokens/sec | 200 tokens generated |
Total Time | ~2.18 seconds | Including prompt processing |
Time to First Token | < 0.1 seconds | Very fast response |
Peak Memory Usage | 11.3 GB | Efficient memory utilization |
Memory Efficiency | 8.1 tokens/sec per GB | High efficiency score |
Performance Notes:
- Excellent prompt processing speed (220+ tokens/sec)
- Consistent generation performance (91.5 tokens/sec)
- Low memory footprint for 20B parameter model
- Optimal for memory-constrained environments
8-bit Quantized Model Performance
Metric | Value | Details |
---|---|---|
Prompt Processing | 233.7 tokens/sec | 90 tokens processed |
Generation Speed | 84.2 tokens/sec | 200 tokens generated |
Total Time | ~2.37 seconds | Including prompt processing |
Time to First Token | < 0.1 seconds | Very fast response |
Peak Memory Usage | 12.2 GB | Higher memory usage |
Memory Efficiency | 6.9 tokens/sec per GB | Good efficiency |
Performance Notes:
- Fastest prompt processing (233+ tokens/sec)
- Solid generation performance (84.2 tokens/sec)
- Higher memory requirements but better quality potential
- Good balance for quality-focused applications
Comparative Analysis
Performance Comparison Table
Metric | 4-bit Quantized | 8-bit Quantized | Winner | Improvement |
---|---|---|---|---|
Prompt Speed | 220.6 tokens/sec | 233.7 tokens/sec | 8-bit | +6.0% |
Generation Speed | 91.5 tokens/sec | 84.2 tokens/sec | 4-bit | +8.7% |
Total Time (200 tokens) | ~2.18s | ~2.37s | 4-bit | -8.0% |
Peak Memory | 11.3 GB | 12.2 GB | 4-bit | -7.4% |
Memory Efficiency | 8.1 tokens/sec/GB | 6.9 tokens/sec/GB | 4-bit | +17.4% |
Key Performance Insights
🚀 Speed Analysis:
- 4-bit model excels in generation speed (91.5 vs 84.2 tokens/sec)
- 8-bit model has slight edge in prompt processing (233.7 vs 220.6 tokens/sec)
- Overall: 4-bit model ~8% faster for complete tasks
💾 Memory Analysis:
- 4-bit model uses 0.9 GB less memory (11.3 vs 12.2 GB)
- 4-bit model 17.4% more memory efficient
- Critical advantage for memory-constrained environments
⚖️ Performance Trade-offs:
- 4-bit: Better speed, lower memory, higher efficiency
- 8-bit: Better prompt processing, potentially higher quality
Model Recommendations
For Speed & Efficiency: Choose 4-bit Quantized - 8% faster, 17% more memory efficient For Quality Focus: Choose 8-bit Quantized - Better for complex reasoning tasks For Memory Constraints: Choose 4-bit Quantized - Lower memory footprint Best Overall Choice: 4-bit Quantized - Optimal balance for Apple Silicon
🔧 Technical Notes
MLX Framework Benefits
- Native Apple Silicon Optimization: Leverages Neural Engine and GPU
- Unified Memory Architecture: Efficient memory management
- Low Latency: Optimized for real-time inference
- Quantization Support: 4-bit and 8-bit quantization for different use cases
Model Architecture
- Base Model: GPT-OSS-20B (OpenAI's 20B parameter model)
- Quantization: Mixed precision quantization
- Context Length: Up to 131,072 tokens
- Architecture: Mixture of Experts (MoE) with sliding attention
Performance Characteristics
- 4-bit Quantization: Lower memory usage, slightly faster inference
- 8-bit Quantization: Higher quality, balanced performance
- Memory Requirements: 16GB+ RAM recommended, 32GB+ optimal
- Storage Requirements: ~40GB per quantized model
🌟 Community Insights
Real-World Performance
This benchmark demonstrates exceptional performance of GPT-OSS-20B on Apple Silicon M2 Max:
🏆 Performance Highlights:
- 87.9 tokens/second average generation speed across both models
- 11.8 GB average peak memory usage (very efficient for 20B model)
- < 0.1 seconds time to first token (excellent responsiveness)
- 220+ tokens/second prompt processing speed
📊 Model-Specific Performance:
- 4-bit Model: 91.5 tokens/sec generation, 11.3 GB memory
- 8-bit Model: 84.2 tokens/sec generation, 12.2 GB memory
- Best Overall: 4-bit model with 8% speed advantage
Use Case Recommendations
🚀 For Speed & Efficiency:
- Real-time Applications: 4-bit model (91.5 tokens/sec)
- API Services: 4-bit model (faster response times)
- Batch Processing: 4-bit model (better throughput)
🎯 For Quality & Accuracy:
- Content Creation: 8-bit model (potentially higher quality)
- Complex Reasoning: 8-bit model (better for nuanced tasks)
- Code Generation: 8-bit model (potentially more accurate)
💾 For Memory Constraints:
- 16GB Macs: 4-bit model essential (11.3 GB vs 12.2 GB)
- 32GB Macs: Both models work well
- Memory Optimization: 4-bit model saves ~900MB
Performance Scaling Insights
🔥 Exceptional Apple Silicon Performance:
- MLX framework delivers native optimization for M2/M3 chips
- Unified Memory architecture fully utilized
- Neural Engine acceleration provides speed boost
- Quantization efficiency enables 20B model on consumer hardware
⚡ Real-World Benchmarks:
- Prompt processing: 220+ tokens/sec (excellent)
- Generation speed: 84-92 tokens/sec (industry-leading)
- Memory efficiency: < 12 GB for 20B parameters (remarkable)
- Responsiveness: < 100ms first token (interactive-feeling)
📈 Summary Statistics
Performance Summary:
- ✅ 4-bit Model: 91.5 tokens/sec generation, 11.3 GB memory
- ✅ 8-bit Model: 84.2 tokens/sec generation, 12.2 GB memory
- ✅ Winner: 4-bit model (8% faster, 17% more memory efficient)
- ✅ Hardware: Apple M2 Max with 32GB unified memory
- ✅ Framework: MLX 0.29.0 (optimized for Apple Silicon)
Key Achievements:
- 🏆 Industry-leading performance on consumer hardware
- 🏆 Memory efficiency enabling 20B model on laptops
- 🏆 Real-time responsiveness with <100ms first token
- 🏆 Native Apple Silicon optimization through MLX
Use with mlx
pip install mlx-lm
from mlx_lm import load, generate
model, tokenizer = load("Jackrong/gpt-oss-20b-MLX-4bit")
prompt = "hello"
if tokenizer.chat_template is not None:
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
messages, add_generation_prompt=True
)
response = generate(model, tokenizer, prompt=prompt, verbose=True)