Ivan Fioravanti

ivanfioravanti

AI & ML interests

None yet

Recent Activity

updated a model 8 days ago
mlx-community/DeepSeek-R1-0528-3bit
published a model 8 days ago
mlx-community/DeepSeek-R1-0528-3bit
published a model 9 days ago
mlx-community/DeepSeek-R1-0528-4bit

Organizations

CoreView, MLX Vision, MLX Community, Social Post Explorers, Cognitive Computations, Hugging Face Discord Community

ivanfioravanti's activity

upvoted an article 20 days ago
reacted to wolfram's post with 🔥 30 days ago:
Finally finished my extensive **Qwen 3 evaluations** across a range of formats and quantisations, focusing on **MMLU-Pro** (Computer Science).

A few take-aways stood out - especially for those interested in local deployment and performance trade-offs:

1️⃣ **Qwen3-235B-A22B** (via Fireworks API) tops the table at **83.66%** with ~55 tok/s.
2️⃣ But the **30B-A3B Unsloth** quant delivered **82.20%** while running locally at ~45 tok/s and with zero API spend.
3️⃣ The same Unsloth build is ~5x faster than Qwen's **Qwen3-32B**, which scores **82.20%** as well yet crawls at <10 tok/s.
4️⃣ On Apple silicon, the **30B MLX** port hits **79.51%** while sustaining ~64 tok/s - arguably today's best speed/quality trade-off for Mac setups (see the sketch after this list).
5️⃣ The **0.6B** micro-model races above 180 tok/s but tops out at **37.56%** - that's why it's not even on the graph (50% performance cut-off).
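
For anyone who wants to reproduce the Apple-silicon numbers, here's a minimal sketch using mlx-lm. The repo id `mlx-community/Qwen3-30B-A3B-4bit` is my assumption for illustration - substitute whichever mlx-community quant you actually want to test:

```python
# Minimal sketch: running a quantised Qwen3 MoE port with mlx-lm on Apple silicon.
# Assumption: the repo id below is illustrative, not necessarily the exact build
# benchmarked in this post.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-30B-A3B-4bit")

# Build a chat-formatted prompt via the tokenizer's chat template.
messages = [{"role": "user", "content": "Explain MMLU-Pro in one paragraph."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# verbose=True prints tokens/sec, handy for comparing against the speeds above.
text = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
print(text)
```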

All local runs were done with LM Studio on an M4 MacBook Pro, using Qwen's official recommended settings.
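
For reference, a run like this can also be scripted against LM Studio's OpenAI-compatible local server (default `http://localhost:1234/v1`). A hedged sketch - the model identifier is hypothetical (use the name LM Studio shows for your loaded model), and the sampling values reflect my reading of Qwen3's recommended thinking-mode settings:

```python
# Sketch: querying a local LM Studio server via its OpenAI-compatible API.
# Assumptions: LM Studio's local server is running on the default port, and
# the model name matches the one loaded in the app.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="qwen3-30b-a3b",  # hypothetical identifier
    messages=[{"role": "user", "content": "An MMLU-Pro CS question goes here"}],
    temperature=0.6,  # Qwen's recommended thinking-mode settings, as I recall;
    top_p=0.95,       # top_k/min_p are set in LM Studio's model config instead
    max_tokens=2048,
)
print(response.choices[0].message.content)
```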

**Conclusion:** Quantised 30B models now get you ~98% of frontier-class accuracy - at a fraction of the latency, cost, and energy. For most local RAG or agent workloads, they're not just good enough - they're the new default.

Well done, Qwen - you really whipped the llama's ass! And to OpenAI: for your upcoming open model, please make it MoE, with toggleable reasoning, and release it in many sizes. *This* is the future!