gpt-oss-20b-MLX-4bit / README.md

Jackrong

Update README.md

db9aa5e verified 20 days ago

preview code

raw

history blame

8.57 kB

metadata

language:
  - en
  - zh
license: apache-2.0
library_name: mlx
tags:
  - text-generation
  - mlx
  - apple-silicon
  - gpt
  - quantized
  - 4bit-quantization
pipeline_tag: text-generation
base_model: openai/gpt-oss-20b
model-index:
  - name: gpt-oss-20b-MLX-4bit
    results:
      - task:
          type: text-generation
        dataset:
          name: GPT-OSS-20B Evaluation
          type: openai/gpt-oss-20b
        metrics:
          - type: bits_per_weight
            value: 4.276
            name: Bits per weight (4-bit)

gpt-oss-20b-MLX-4bit

This model Jackrong/gpt-oss-20b-MLX-4bit was converted to MLX format from openai/gpt-oss-20b using mlx-lm version 0.27.0.

🚀 GPT-OSS-20B MLX Performance Report - Apple Silicon

📋 Executive Summary

Test Date: 2025-08-31T08:37:22.914637
Test Query: Do machines possess the ability to think?

Hardware: Apple Silicon MacBookPro Framework: MLX (Apple's Machine Learning Framework)

🖥️ Hardware Specifications

System Information

macOS Version: 15.6.1 (Build: 24G90)
Chip Model: Apple M2 Max
Total Cores: 12 cores (8 performance + 4 efficiency) 30 cores GPU
Architecture: arm64 (Apple Silicon)
Python Version: 3.10.12

Memory Configuration

Total RAM: 32.0 GB
Available RAM: 12.24 GB
Used RAM: 19.76 GB (61.7% utilization)
Memory Type: Unified Memory (LPDDR5)

Storage

Main Disk: 926.4 GB SSD total, 28.2 GB free (27.1% used)

📊 Performance Benchmarks

Test Configuration

Temperature: 1.0 (deterministic generation)
Test Tokens: 200 tokens generation
Prompt Length: 90 tokens
Context Window: 2048 tokens
Framework: MLX 0.29.0

4-bit Quantized Model Performance

Metric	Value	Details
Prompt Processing	220.6 tokens/sec	90 tokens processed
Generation Speed	91.5 tokens/sec	200 tokens generated
Total Time	~2.18 seconds	Including prompt processing
Time to First Token	< 0.1 seconds	Very fast response
Peak Memory Usage	11.3 GB	Efficient memory utilization
Memory Efficiency	8.1 tokens/sec per GB	High efficiency score

Performance Notes:

Excellent prompt processing speed (220+ tokens/sec)
Consistent generation performance (91.5 tokens/sec)
Low memory footprint for 20B parameter model
Optimal for memory-constrained environments

8-bit Quantized Model Performance

Metric	Value	Details
Prompt Processing	233.7 tokens/sec	90 tokens processed
Generation Speed	84.2 tokens/sec	200 tokens generated
Total Time	~2.37 seconds	Including prompt processing
Time to First Token	< 0.1 seconds	Very fast response
Peak Memory Usage	12.2 GB	Higher memory usage
Memory Efficiency	6.9 tokens/sec per GB	Good efficiency

Performance Notes:

Fastest prompt processing (233+ tokens/sec)
Solid generation performance (84.2 tokens/sec)
Higher memory requirements but better quality potential
Good balance for quality-focused applications

Comparative Analysis

Performance Comparison Table

Metric	4-bit Quantized	8-bit Quantized	Winner	Improvement
Prompt Speed	220.6 tokens/sec	233.7 tokens/sec	8-bit	+6.0%
Generation Speed	91.5 tokens/sec	84.2 tokens/sec	4-bit	+8.7%
Total Time (200 tokens)	~2.18s	~2.37s	4-bit	-8.0%
Peak Memory	11.3 GB	12.2 GB	4-bit	-7.4%
Memory Efficiency	8.1 tokens/sec/GB	6.9 tokens/sec/GB	4-bit	+17.4%

Key Performance Insights

🚀 Speed Analysis:

4-bit model excels in generation speed (91.5 vs 84.2 tokens/sec)
8-bit model has slight edge in prompt processing (233.7 vs 220.6 tokens/sec)
Overall: 4-bit model ~8% faster for complete tasks

💾 Memory Analysis:

4-bit model uses 0.9 GB less memory (11.3 vs 12.2 GB)
4-bit model 17.4% more memory efficient
Critical advantage for memory-constrained environments

⚖️ Performance Trade-offs:

4-bit: Better speed, lower memory, higher efficiency
8-bit: Better prompt processing, potentially higher quality

Model Recommendations

For Speed & Efficiency: Choose 4-bit Quantized - 8% faster, 17% more memory efficient For Quality Focus: Choose 8-bit Quantized - Better for complex reasoning tasks For Memory Constraints: Choose 4-bit Quantized - Lower memory footprint Best Overall Choice: 4-bit Quantized - Optimal balance for Apple Silicon

🔧 Technical Notes

MLX Framework Benefits

Native Apple Silicon Optimization: Leverages Neural Engine and GPU
Unified Memory Architecture: Efficient memory management
Low Latency: Optimized for real-time inference
Quantization Support: 4-bit and 8-bit quantization for different use cases

Model Architecture

Base Model: GPT-OSS-20B (OpenAI's 20B parameter model)
Quantization: Mixed precision quantization
Context Length: Up to 131,072 tokens
Architecture: Mixture of Experts (MoE) with sliding attention

Performance Characteristics

4-bit Quantization: Lower memory usage, slightly faster inference
8-bit Quantization: Higher quality, balanced performance
Memory Requirements: 16GB+ RAM recommended, 32GB+ optimal
Storage Requirements: ~40GB per quantized model

🌟 Community Insights

Real-World Performance

This benchmark demonstrates exceptional performance of GPT-OSS-20B on Apple Silicon M2 Max:

🏆 Performance Highlights:

87.9 tokens/second average generation speed across both models
11.8 GB average peak memory usage (very efficient for 20B model)
< 0.1 seconds time to first token (excellent responsiveness)
220+ tokens/second prompt processing speed

📊 Model-Specific Performance:

4-bit Model: 91.5 tokens/sec generation, 11.3 GB memory
8-bit Model: 84.2 tokens/sec generation, 12.2 GB memory
Best Overall: 4-bit model with 8% speed advantage

Use Case Recommendations

🚀 For Speed & Efficiency:

Real-time Applications: 4-bit model (91.5 tokens/sec)
API Services: 4-bit model (faster response times)
Batch Processing: 4-bit model (better throughput)

🎯 For Quality & Accuracy:

Content Creation: 8-bit model (potentially higher quality)
Complex Reasoning: 8-bit model (better for nuanced tasks)
Code Generation: 8-bit model (potentially more accurate)

💾 For Memory Constraints:

16GB Macs: 4-bit model essential (11.3 GB vs 12.2 GB)
32GB Macs: Both models work well
Memory Optimization: 4-bit model saves ~900MB

Performance Scaling Insights

🔥 Exceptional Apple Silicon Performance:

MLX framework delivers native optimization for M2/M3 chips
Unified Memory architecture fully utilized
Neural Engine acceleration provides speed boost
Quantization efficiency enables 20B model on consumer hardware

⚡ Real-World Benchmarks:

Prompt processing: 220+ tokens/sec (excellent)
Generation speed: 84-92 tokens/sec (industry-leading)
Memory efficiency: < 12 GB for 20B parameters (remarkable)
Responsiveness: < 100ms first token (interactive-feeling)

📈 Summary Statistics

Performance Summary:

✅ 4-bit Model: 91.5 tokens/sec generation, 11.3 GB memory
✅ 8-bit Model: 84.2 tokens/sec generation, 12.2 GB memory
✅ Winner: 4-bit model (8% faster, 17% more memory efficient)
✅ Hardware: Apple M2 Max with 32GB unified memory
✅ Framework: MLX 0.29.0 (optimized for Apple Silicon)

Key Achievements:

🏆 Industry-leading performance on consumer hardware
🏆 Memory efficiency enabling 20B model on laptops
🏆 Real-time responsiveness with <100ms first token
🏆 Native Apple Silicon optimization through MLX

Use with mlx

pip install mlx-lm

from mlx_lm import load, generate

model, tokenizer = load("Jackrong/gpt-oss-20b-MLX-4bit")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)