---
title: On-Device LLM Throughput Calculator
emoji: 🚀
colorFrom: pink
colorTo: blue
sdk: gradio
sdk_version: 4.36.0
app_file: src/app.py
pinned: false
license: mit
---

# On-Device LLM Throughput Calculator

A Gradio web application that visualizes LLM throughput on memory-bandwidth-constrained devices.

## Overview

This tool calculates and visualizes the theoretical throughput (tokens per second) that a Large Language Model (LLM) can achieve on devices constrained by memory bandwidth. It supports different attention mechanisms:

- Grouped Query Attention (GQA)
- Multi-Query Attention (MQA)
- Multi-head Latent Attention (MLA)

It also visualizes how sliding window attention affects throughput at different context lengths.

## Features

- Customize device specifications (memory bandwidth)
- Configure model parameters (size, layers, heads)
- Compare different attention mechanisms
- Visualize performance across different context lengths
- Sliding window attention support

## Usage

1. Configure your device details (name, memory bandwidth)
2. Set model parameters (number of parameters, layer count, etc.)
3. Choose which attention mechanism configurations to compare
4. Generate a visualization of expected throughput

## Installation

```bash
pip install -r requirements.txt
```

## Running Locally

```bash
cd src
python app.py
```

## Theory

The calculations are based on memory bandwidth bottlenecks, as described in the [JAX ML Scaling Book](https://jax-ml.github.io/scaling-book/inference/#theoretical-estimates-for-llm-latency-and-throughput). During decoding, each generated token requires streaming the model weights and every sequence's KV cache from memory once, so memory bandwidth, not compute, sets the throughput ceiling.

The basic formula for tokens per second:

```
tokens_per_second = (batch_size * memory_bandwidth) / (batch_size * total_kv_size + parameter_size)
```

The worked examples below put this formula into code.

## License

MIT
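
## Worked Examples

The formula above depends on `total_kv_size`, which in turn depends on the attention mechanism. The sketch below shows per-token KV cache sizes; the layer and head dimensions are illustrative assumptions, not the app's defaults, and the MLA figures assume the DeepSeek-V2 latent dimensions (a 512-dim compressed KV plus a 64-dim decoupled RoPE key).

```python
def kv_bytes_per_token_gqa(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """MHA/GQA/MQA: cache one K and one V vector per KV head per layer.

    MHA is the case n_kv_heads == n_heads; MQA is n_kv_heads == 1.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_bytes_per_token_mla(n_layers, latent_dim=512, rope_dim=64,
                           bytes_per_elem=2):
    """MLA: cache one compressed latent (plus a small decoupled RoPE key)
    per layer instead of full K/V heads. Dims follow DeepSeek-V2."""
    return n_layers * (latent_dim + rope_dim) * bytes_per_elem


# Illustrative 32-layer model with 32 query heads of dim 128, fp16 cache:
layers, heads, head_dim = 32, 32, 128
print("MHA :", kv_bytes_per_token_gqa(layers, heads, head_dim))  # 524288 B/token
print("GQA8:", kv_bytes_per_token_gqa(layers, 8, head_dim))      # 131072 B/token
print("MQA :", kv_bytes_per_token_gqa(layers, 1, head_dim))      #  16384 B/token
print("MLA :", kv_bytes_per_token_mla(layers))                   #  36864 B/token
```

Shrinking the per-token cache is exactly why MQA and MLA sustain higher throughput at long context: the `batch_size * total_kv_size` term in the denominator grows more slowly.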
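
The throughput formula itself, with an optional sliding window, is a short function. This sketch assumes an 8B-parameter model in fp16 on a device with roughly 100 GB/s of memory bandwidth; both numbers are placeholders, not values taken from the app.

```python
def tokens_per_second(batch, bandwidth, param_bytes, ctx_len,
                      kv_per_token, window=None):
    """Bandwidth-bound decode throughput for the whole batch (tokens/s).

    Each decode step streams the weights once plus every sequence's live
    KV cache; sliding-window attention caps the live cache at `window`.
    """
    live_tokens = min(ctx_len, window) if window else ctx_len
    step_bytes = param_bytes + batch * live_tokens * kv_per_token
    return batch * bandwidth / step_bytes


BANDWIDTH = 100e9        # ~100 GB/s (assumed device)
PARAM_BYTES = 8e9 * 2    # 8B parameters in fp16 (assumed model)
KV_PER_TOKEN = 131072    # the GQA figure from the previous sketch

for ctx in (4096, 32768):
    full = tokens_per_second(1, BANDWIDTH, PARAM_BYTES, ctx, KV_PER_TOKEN)
    swa = tokens_per_second(1, BANDWIDTH, PARAM_BYTES, ctx, KV_PER_TOKEN,
                            window=4096)
    print(f"ctx {ctx:>6}: full {full:.2f} tok/s, window-4096 {swa:.2f} tok/s")
```

At short contexts the parameter reads dominate and the two curves coincide; past the window size the full-attention curve keeps falling while the sliding-window curve flattens out, which is the effect the app plots.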