Ed Addario PRO

eaddario

AI & ML interests

None yet

Recent Activity

updated a model 3 days ago
eaddario/Qwen3-30B-A3B-pruned-GGUF
published a model 5 days ago
eaddario/Qwen3-30B-A3B-pruned-GGUF
updated a model 5 days ago
eaddario/Qwen3-30B-A3B-GGUF

Organizations

Hugging Face Discord Community

eaddario's activity

New activity in eaddario/Qwen3-30B-A3B-GGUF 6 days ago
replied to their post 6 days ago
reacted to AdinaY's post with 👍 7 days ago
reacted to danieldk's post with 🤗 7 days ago
We have been working on a project called kernels. kernels makes it possible to load compute kernels directly from the Hub! 🚀

We plan to give kernels a proper introduction soon, but for those who have been following along, we are happy to announce a new release:

- New layer API with torch.compile support.
- Experimental support for loading Apple Silicon Metal 🤘 Kernels.
- Generate wheels from Hub kernels for legacy deployments.

Full release notes here: https://github.com/huggingface/kernels/releases/tag/v0.6.0
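
A minimal sketch of what loading a Hub kernel looks like, assuming the get_kernel entry point and the kernels-community/activation example kernel shown in the project's README (the repo id and the gelu_fast function name are taken from that README and may change between releases):

```python
# Sketch: load a compute kernel straight from the Hugging Face Hub.
# Assumes a CUDA device and the example kernel from the kernels README.
import torch
from kernels import get_kernel

# Download the pre-built kernel from the Hub (repo id is an assumption)
activation = get_kernel("kernels-community/activation")

x = torch.randn((10, 10), dtype=torch.float16, device="cuda")
y = torch.empty_like(x)

# Run the Hub-provided kernel in place of a local implementation
# (function name gelu_fast follows the README example)
activation.gelu_fast(y, x)
```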
reacted to MJannik's post with 🤝 7 days ago
Hi everyone, we’ve got big news! Starting today, all Langfuse product features are available as free OSS (MIT license).

You can now upgrade your self-hosted Langfuse to access features like:
- Managed LLM-as-a-Judge evaluations
- Annotation queues
- Prompt experiments
- LLM playground

We’re incredibly grateful for the support of this amazing community and can’t wait to hear your feedback on the new features!

More on this change here: https://langfuse.com/blog/2025-06-04-open-sourcing-langfuse-product
posted an update 8 days ago
Layer-wise and Pruned versions of google/gemma-3-12b-it

After enhancing llama.cpp to handle user-defined quantization levels for arbitrary tensors (https://github.com/ggml-org/llama.cpp/pull/12511), I have added an option to prune whole layers (https://github.com/ggml-org/llama.cpp/pull/13037), and have published two versions of google/gemma-3-12b-it for demo and testing purposes:

* Tensor-wise: eaddario/gemma-3-12b-it-GGUF
* Pruned: eaddario/gemma-3-12b-it-pruned-GGUF

Even though the perplexity scores of the pruned version are 3 times higher, the ARC, HellaSwag, MMLU, TruthfulQA and WinoGrande scores are holding up remarkably well, considering two layers (26 and 29) were removed. This seems to support Xin Men et al.'s conclusions in ShortGPT: Layers in Large Language Models are More Redundant Than You Expect (2403.03853)

A results summary is in each model's card, and the full test results are in the ./scores directory. Questions and feedback are always welcome.
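
A rough sketch of how such variants could be produced with llama-quantize, assuming the --tensor-type and --prune-layers flags introduced in the two linked PRs (#12511 and #13037); the exact argument syntax and the file names below are assumptions for illustration:

```python
# Sketch: drive llama-quantize to build a tensor-wise and a pruned GGUF.
# Flag names come from the linked PRs; their exact syntax is an assumption.
import subprocess

BASE = "gemma-3-12b-it-F16.gguf"  # hypothetical full-precision input GGUF

# Tensor-wise: override the quantization level of selected tensors
subprocess.run([
    "llama-quantize",
    "--tensor-type", "attn_v=q6_k",   # example per-tensor override (assumption)
    BASE, "gemma-3-12b-it-Q4_K_M.gguf", "Q4_K_M",
], check=True)

# Pruned: drop whole layers (26 and 29, as in the post) while quantizing
subprocess.run([
    "llama-quantize",
    "--prune-layers", "26,29",        # layer indices to remove (assumption)
    BASE, "gemma-3-12b-it-pruned-Q4_K_M.gguf", "Q4_K_M",
], check=True)
```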