--- language: - en - zh license: apache-2.0 library_name: mlx tags: - text-generation - mlx - apple-silicon - gpt - quantized - 4bit-quantization pipeline_tag: text-generation base_model: openai/gpt-oss-20b model-index: - name: gpt-oss-20b-MLX-4bit results: - task: type: text-generation dataset: name: GPT-OSS-20B Evaluation type: openai/gpt-oss-20b metrics: - type: bits_per_weight value: 4.276 name: Bits per weight (4-bit) --- # gpt-oss-20b-MLX-4bit This model [Jackrong/gpt-oss-20b-MLX-4bit](https://huggingface.co/Jackrong/gpt-oss-20b-MLX-4bit) was converted to MLX format from [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) using mlx-lm version **0.27.0**. # 🚀 GPT-OSS-20B MLX Performance Report - Apple Silicon ## 📋 Executive Summary **Test Date:** 2025-08-31T08:37:22.914637 **Test Query:** **Do machines possess the ability to think?** **Hardware:** Apple Silicon MacBookPro **Framework:** MLX (Apple's Machine Learning Framework) ## 🖥️ Hardware Specifications ### System Information - **macOS Version:** 15.6.1 (Build: 24G90) - **Chip Model:** Apple M2 Max - **Total Cores:** 12 cores (8 performance + 4 efficiency) 30 cores GPU - **Architecture:** arm64 (Apple Silicon) - **Python Version:** 3.10.12 ### Memory Configuration - **Total RAM:** 32.0 GB - **Available RAM:** 12.24 GB - **Used RAM:** 19.76 GB (61.7% utilization) - **Memory Type:** Unified Memory (LPDDR5) ### Storage - **Main Disk:** 926.4 GB SSD total, 28.2 GB free (27.1% used) ## 📊 Performance Benchmarks ### Test Configuration - **Temperature:** 1.0 (deterministic generation) - **Test Tokens:** 200 tokens generation - **Prompt Length:** 90 tokens - **Context Window:** 2048 tokens - **Framework:** MLX 0.29.0 ### 4-bit Quantized Model Performance | Metric | Value | Details | |--------|-------|---------| | **Prompt Processing** | 220.6 tokens/sec | 90 tokens processed | | **Generation Speed** | 91.5 tokens/sec | 200 tokens generated | | **Total Time** | ~2.18 seconds | Including prompt processing | | **Time to First Token** | < 0.1 seconds | Very fast response | | **Peak Memory Usage** | 11.3 GB | Efficient memory utilization | | **Memory Efficiency** | 8.1 tokens/sec per GB | High efficiency score | **Performance Notes:** - Excellent prompt processing speed (220+ tokens/sec) - Consistent generation performance (91.5 tokens/sec) - Low memory footprint for 20B parameter model - Optimal for memory-constrained environments ### 8-bit Quantized Model Performance | Metric | Value | Details | |--------|-------|---------| | **Prompt Processing** | 233.7 tokens/sec | 90 tokens processed | | **Generation Speed** | 84.2 tokens/sec | 200 tokens generated | | **Total Time** | ~2.37 seconds | Including prompt processing | | **Time to First Token** | < 0.1 seconds | Very fast response | | **Peak Memory Usage** | 12.2 GB | Higher memory usage | | **Memory Efficiency** | 6.9 tokens/sec per GB | Good efficiency | **Performance Notes:** - Fastest prompt processing (233+ tokens/sec) - Solid generation performance (84.2 tokens/sec) - Higher memory requirements but better quality potential - Good balance for quality-focused applications ### Comparative Analysis #### Performance Comparison Table | Metric | 4-bit Quantized | 8-bit Quantized | Winner | Improvement | |--------|----------------|-----------------|--------|-------------| | **Prompt Speed** | 220.6 tokens/sec | 233.7 tokens/sec | 8-bit | +6.0% | | **Generation Speed** | 91.5 tokens/sec | 84.2 tokens/sec | 4-bit | +8.7% | | **Total Time (200 tokens)** | ~2.18s | ~2.37s | 4-bit | -8.0% | | **Peak Memory** | 11.3 GB | 12.2 GB | 4-bit | -7.4% | | **Memory Efficiency** | 8.1 tokens/sec/GB | 6.9 tokens/sec/GB | 4-bit | +17.4% | #### Key Performance Insights **🚀 Speed Analysis:** - 4-bit model excels in generation speed (91.5 vs 84.2 tokens/sec) - 8-bit model has slight edge in prompt processing (233.7 vs 220.6 tokens/sec) - Overall: 4-bit model ~8% faster for complete tasks **💾 Memory Analysis:** - 4-bit model uses 0.9 GB less memory (11.3 vs 12.2 GB) - 4-bit model 17.4% more memory efficient - Critical advantage for memory-constrained environments **⚖️ Performance Trade-offs:** - **4-bit**: Better speed, lower memory, higher efficiency - **8-bit**: Better prompt processing, potentially higher quality #### Model Recommendations **For Speed & Efficiency:** Choose **4-bit Quantized** - 8% faster, 17% more memory efficient **For Quality Focus:** Choose **8-bit Quantized** - Better for complex reasoning tasks **For Memory Constraints:** Choose **4-bit Quantized** - Lower memory footprint **Best Overall Choice:** **4-bit Quantized** - Optimal balance for Apple Silicon ## 🔧 Technical Notes ### MLX Framework Benefits - **Native Apple Silicon Optimization:** Leverages Neural Engine and GPU - **Unified Memory Architecture:** Efficient memory management - **Low Latency:** Optimized for real-time inference - **Quantization Support:** 4-bit and 8-bit quantization for different use cases ### Model Architecture - **Base Model:** GPT-OSS-20B (OpenAI's 20B parameter model) - **Quantization:** Mixed precision quantization - **Context Length:** Up to 131,072 tokens - **Architecture:** Mixture of Experts (MoE) with sliding attention ### Performance Characteristics - **4-bit Quantization:** Lower memory usage, slightly faster inference - **8-bit Quantization:** Higher quality, balanced performance - **Memory Requirements:** 16GB+ RAM recommended, 32GB+ optimal - **Storage Requirements:** ~40GB per quantized model ## 🌟 Community Insights ### Real-World Performance This benchmark demonstrates exceptional performance of GPT-OSS-20B on Apple Silicon M2 Max: **🏆 Performance Highlights:** - **87.9 tokens/second** average generation speed across both models - **11.8 GB** average peak memory usage (very efficient for 20B model) - **< 0.1 seconds** time to first token (excellent responsiveness) - **220+ tokens/second** prompt processing speed **📊 Model-Specific Performance:** - **4-bit Model**: 91.5 tokens/sec generation, 11.3 GB memory - **8-bit Model**: 84.2 tokens/sec generation, 12.2 GB memory - **Best Overall**: 4-bit model with 8% speed advantage ### Use Case Recommendations **🚀 For Speed & Efficiency:** - **Real-time Applications:** 4-bit model (91.5 tokens/sec) - **API Services:** 4-bit model (faster response times) - **Batch Processing:** 4-bit model (better throughput) **🎯 For Quality & Accuracy:** - **Content Creation:** 8-bit model (potentially higher quality) - **Complex Reasoning:** 8-bit model (better for nuanced tasks) - **Code Generation:** 8-bit model (potentially more accurate) **💾 For Memory Constraints:** - **16GB Macs:** 4-bit model essential (11.3 GB vs 12.2 GB) - **32GB Macs:** Both models work well - **Memory Optimization:** 4-bit model saves ~900MB ### Performance Scaling Insights **🔥 Exceptional Apple Silicon Performance:** - MLX framework delivers **native optimization** for M2/M3 chips - **Unified Memory** architecture fully utilized - **Neural Engine** acceleration provides speed boost - **Quantization efficiency** enables 20B model on consumer hardware **⚡ Real-World Benchmarks:** - **Prompt processing**: 220+ tokens/sec (excellent) - **Generation speed**: 84-92 tokens/sec (industry-leading) - **Memory efficiency**: < 12 GB for 20B parameters (remarkable) - **Responsiveness**: < 100ms first token (interactive-feeling) ## 📈 Summary Statistics **Performance Summary:** - ✅ **4-bit Model**: 91.5 tokens/sec generation, 11.3 GB memory - ✅ **8-bit Model**: 84.2 tokens/sec generation, 12.2 GB memory - ✅ **Winner**: 4-bit model (8% faster, 17% more memory efficient) - ✅ **Hardware**: Apple M2 Max with 32GB unified memory - ✅ **Framework**: MLX 0.29.0 (optimized for Apple Silicon) **Key Achievements:** - 🏆 **Industry-leading performance** on consumer hardware - 🏆 **Memory efficiency** enabling 20B model on laptops - 🏆 **Real-time responsiveness** with <100ms first token - 🏆 **Native Apple Silicon optimization** through MLX --- ## Use with mlx ```bash pip install mlx-lm ``` ```python from mlx_lm import load, generate model, tokenizer = load("Jackrong/gpt-oss-20b-MLX-4bit") prompt = "hello" if tokenizer.chat_template is not None: messages = [{"role": "user", "content": prompt}] prompt = tokenizer.apply_chat_template( messages, add_generation_prompt=True ) response = generate(model, tokenizer, prompt=prompt, verbose=True) ``` # 🚀 GPT-OSS-20B MLX 性能测试 - Apple Silicon ## 📋 执行摘要 **测试日期:** 2025-08-31T08:47:56.723392 **测试问题:** 机器会思考吗? **硬件平台:** Apple Silicon Mac (M2 Max, 32GB RAM) **框架版本:** MLX 0.29.0 (Apple's Machine Learning Framework) ## 🖥️ 硬件规格 ### 系统信息 - **macOS 版本:** 15.6.1 (Build: 24G90) - **芯片型号:** Apple M2 Max - **核心总数:** 12个核心 (8个性能核心 + 4个能效核心) - **架构类型:** arm64 (Apple Silicon) - **Python 版本:** 3.10.12 ### 内存配置 - **总内存:** 32.0 GB - **可用内存:** 12.24 GB - **已用内存:** 19.76 GB (使用率61.7%) - **内存类型:** 统一内存 (LPDDR5) ### 存储空间 - **主硬盘:** 926.4 GB SSD 总容量,28.2 GB 可用空间 (使用率27.1%) ## 📊 性能基准测试 ### 测试配置 - **温度参数:** 1.0 (确定性生成) - **测试token数:** 200个token生成 - **提示词长度:** 90个token - **上下文窗口:** 2048个token - **框架版本:** MLX 0.29.0 ### 4-bit 量化模型性能 | 指标 | 数值 | 详情 | |------|------|------| | **提示词处理** | 220.6 tokens/sec | 处理90个token | | **生成速度** | 91.5 tokens/sec | 生成200个token | | **总耗时** | ~2.18秒 | 包含提示词处理时间 | | **首token时间** | < 0.1秒 | 响应非常快速 | | **峰值内存使用** | 11.3 GB | 内存利用效率高 | | **内存效率** | 8.1 tokens/sec/GB | 效率评分很高 | **性能说明:** - 提示词处理速度优秀 (220+ tokens/sec) - 生成性能稳定 (91.5 tokens/sec) - 20B参数模型的内存占用较低 - 适合内存受限的环境 ### 8-bit 量化模型性能 | 指标 | 数值 | 详情 | |------|------|------| | **提示词处理** | 233.7 tokens/sec | 处理90个token | | **生成速度** | 84.2 tokens/sec | 生成200个token | | **总耗时** | ~2.37秒 | 包含提示词处理时间 | | **首token时间** | < 0.1秒 | 响应非常快速 | | **峰值内存使用** | 12.2 GB | 内存使用量较高 | | **内存效率** | 6.9 tokens/sec/GB | 效率良好 | **性能说明:** - 提示词处理速度最快 (233+ tokens/sec) - 生成性能稳健 (84.2 tokens/sec) - 内存需求较高但质量潜力更好 - 适合注重质量的应用场景 ### 对比分析 #### 性能对比表格 | 指标 | 4-bit 量化 | 8-bit 量化 | 优胜者 | 改进幅度 | |------|-----------|-----------|--------|----------| | **提示词速度** | 220.6 tokens/sec | 233.7 tokens/sec | 8-bit | +6.0% | | **生成速度** | 91.5 tokens/sec | 84.2 tokens/sec | 4-bit | +8.7% | | **总耗时(200 tokens)** | ~2.18s | ~2.37s | 4-bit | -8.0% | | **峰值内存** | 11.3 GB | 12.2 GB | 4-bit | -7.4% | | **内存效率** | 8.1 tokens/sec/GB | 6.9 tokens/sec/GB | 4-bit | +17.4% | #### 关键性能洞察 **🚀 速度分析:** - 4-bit模型在生成速度上表现出色 (91.5 vs 84.2 tokens/sec) - 8-bit模型在提示词处理上略有优势 (233.7 vs 220.6 tokens/sec) - 总体而言:4-bit模型在完整任务中快约8% **💾 内存分析:** - 4-bit模型比8-bit模型少使用0.9 GB内存 (11.3 vs 12.2 GB) - 4-bit模型内存效率高出17.4% - 在内存受限环境中具有关键优势 **⚖️ 性能权衡:** - **4-bit**:速度更快,内存占用更少,效率更高 - **8-bit**:提示词处理更好,质量潜力可能更高 #### 模型推荐 **速度与效率优先:** 选择 **4-bit 量化** - 速度快8%,内存效率高17% **质量重点关注:** 选择 **8-bit 量化** - 适合复杂推理任务 **内存受限场景:** 选择 **4-bit 量化** - 内存占用更少 **最佳整体选择:** **4-bit 量化** - Apple Silicon的最优平衡 ## 🔧 技术说明 ### MLX框架优势 - **原生Apple Silicon优化:** 充分利用神经引擎和GPU - **统一内存架构:** 高效的内存管理 - **低延迟:** 针对实时推理优化 - **量化支持:** 支持4-bit和8-bit量化以适应不同用例 ### 模型架构 - **基础模型:** GPT-OSS-20B (OpenAI的200亿参数模型) - **量化方式:** 混合精度量化 - **上下文长度:** 最多可达131,072个token - **架构设计:** 专家混合(MoE)和滑动注意力 ### 性能特征 - **4-bit量化:** 内存占用更少,推理速度稍快 - **8-bit量化:** 质量更高,性能均衡 - **内存需求:** 推荐16GB+ RAM,最佳32GB+ - **存储需求:** 每个量化模型约40GB ## 🌟 社区洞察 ### 实际性能表现 这个基准测试展示了GPT-OSS-20B在Apple Silicon M2 Max上的卓越性能: **🏆 性能亮点:** - **87.9 tokens/秒** 两个模型的平均生成速度 - **11.8 GB** 平均峰值内存使用量 (对20B模型非常高效) - **< 0.1秒** 首token生成时间 (响应性极佳) - **220+ tokens/秒** 提示词处理速度 **📊 模型特定性能:** - **4-bit模型**:91.5 tokens/sec生成速度,11.3 GB内存 - **8-bit模型**:84.2 tokens/sec生成速度,12.2 GB内存 - **最佳整体**:4-bit模型,速度优势达8% ### 使用场景推荐 **🚀 速度与效率优先:** - **实时应用:** 4-bit模型 (91.5 tokens/sec) - **API服务:** 4-bit模型 (响应时间更快) - **批量处理:** 4-bit模型 (吞吐量更好) **🎯 质量与准确性优先:** - **内容创作:** 8-bit模型 (质量可能更高) - **复杂推理:** 8-bit模型 (适合细致任务) - **代码生成:** 8-bit模型 (准确性可能更高) **💾 内存受限场景:** - **16GB Mac:** 必须使用4-bit模型 (11.3 GB vs 12.2 GB) - **32GB Mac:** 两个模型都可以良好运行 - **内存优化:** 4-bit模型节省约900MB ### 性能扩展洞察 **🔥 Apple Silicon卓越性能:** - MLX框架为M2/M3芯片提供**原生优化** - **统一内存**架构得到充分利用 - **神经引擎**加速提供速度提升 - **量化效率**使20B模型可在消费级硬件上运行 **⚡ 实际基准数据:** - **提示词处理**:220+ tokens/sec (优秀) - **生成速度**:84-92 tokens/sec (行业领先) - **内存效率**:20B参数模型<12 GB内存 (卓越) - **响应性**:<100ms首token (交互式体验) ### 未来优化方向 - **Metal Performance Shaders**集成以获得GPU加速 - **神经引擎**利用率改进 - **高级量化**技术 (3-bit,混合精度) - **上下文缓存**优化以处理重复提示 - **推测解码**以实现更快速推理 - **模型并行**以支持更大上下文 - --- ## 📈 总结统计 **性能汇总:** - ✅ **4-bit模型**:91.5 tokens/sec生成速度,11.3 GB内存 - ✅ **8-bit模型**:84.2 tokens/sec生成速度,12.2 GB内存 - ✅ **优胜者**:4-bit模型 (速度快8%,内存效率高17%) - ✅ **硬件平台**:配备32GB统一内存的Apple M2 Max - ✅ **框架版本**:MLX 0.29.0 (针对Apple Silicon优化) **关键成就:** - 🏆 **行业领先性能** 在消费级硬件上实现 - 🏆 **内存效率** 使20B模型可在笔记本电脑上运行 - 🏆 **实时响应性** 首token时间<100ms - 🏆 **原生Apple Silicon优化** 通过MLX框架实现 --- *报告由MLX性能基准测试套件生成* *硬件:Apple M2 Max (12核) | 框架:MLX 0.29.0 | 模型:GPT-OSS-20B* *日期:2025-08-31 | 测试时长:每个模型200个token | 准确性:已验证*