Real-Time On-Device AI Agent with Polaris-4B: Run It Yourself, No Cloud, No Cost
We just deployed a real-time on-device AI agent using the Polaris-4B-Preview model, one of the top-performing open LLMs under 6B parameters on Hugging Face.
What's remarkable? This model runs entirely on a mobile device, with no cloud and no manual optimization. It was built with ZETIC.MLange, and the best part?
It's fully automated, free to use, and anyone can do it. You don't need to write deployment code, tweak backends, or touch device-specific SDKs. Just upload your model and ZETIC.MLange handles the rest.
About the Model
- Model: Polaris-4B-Preview
- Size: ~4B parameters
- Ranking: Top 3 on the Hugging Face LLM Leaderboard (<6B)
- Tokenizer: token-incremental (streaming) inference supported
- Modifications: none; stock weights, just optimized for mobile
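As a point of reference, this is what token-incremental (streaming) generation with the stock weights looks like using Hugging Face transformers on a workstation. This is not the ZETIC.MLange pipeline, and the repository id below is an assumption, so substitute the actual model name.

```python
# Minimal sketch of token-incremental (streaming) inference with the stock
# Polaris-4B-Preview weights via Hugging Face transformers on a desktop/GPU box.
# MODEL_ID is an assumed repository name, not confirmed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

MODEL_ID = "POLARIS-Project/Polaris-4B-Preview"  # assumption: replace with the real repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # half precision to reduce memory footprint
    device_map="auto",
)

# TextStreamer prints each token as soon as it is decoded, the same
# interaction pattern you want from a responsive on-device agent.
streamer = TextStreamer(tokenizer, skip_prompt=True)

inputs = tokenizer(
    "Explain on-device inference in one sentence.", return_tensors="pt"
).to(model.device)
model.generate(**inputs, max_new_tokens=64, streamer=streamer)
```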
What ZETIC.MLange Does
ZETIC.MLange is a fully automated deployment framework for on-device AI, built for AI engineers who want to focus on models, not infrastructure.
Here's what it does in minutes:
- Analyzes the model structure
- Converts it to a mobile-optimized format (e.g., GGUF, ONNX); a manual sketch of this conversion step is shown after this list
- Generates a runnable runtime environment with pre/post-processing
- Targets real mobile hardware (CPU, GPU, NPU, including Qualcomm, MediaTek, and Apple)
- Gives you a downloadable SDK or mobile app component, ready to run

And yes, this is available now, for free, at https://mlange.zetic.ai
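For context, here is a rough sketch of what the conversion step looks like if you do it by hand, using Hugging Face Optimum to export to ONNX. ZETIC.MLange automates this (and the hardware-specific packaging) for you; the repository id below is an assumption.

```python
# Manual equivalent of the "convert to a mobile-optimized format" step:
# export the model to ONNX with Hugging Face Optimum. This is only an
# illustration; ZETIC.MLange performs its own conversion automatically.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

MODEL_ID = "POLARIS-Project/Polaris-4B-Preview"  # assumption: replace with the real repo id

# export=True runs the PyTorch -> ONNX conversion on the fly.
onnx_model = ORTModelForCausalLM.from_pretrained(MODEL_ID, export=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Save the exported graph and tokenizer for downstream mobile packaging.
onnx_model.save_pretrained("polaris-4b-onnx")
tokenizer.save_pretrained("polaris-4b-onnx")
```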
For AI Engineers Like You
If you want to:
- Test LLMs directly on-device
- Run models offline, with no network latency (see the sketch after this list)
- Avoid cloud GPU costs
- Deploy to mobile without writing app-side inference code
Then this is your moment. You can do exactly what we did, using your own models, all in a few clicks.
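To make the "offline, on-device" point concrete, here is a minimal sketch of fully local inference with llama-cpp-python, assuming you already have a quantized GGUF build of the model on local storage. The file name and settings are hypothetical; no network access or cloud API is involved.

```python
# Fully offline local inference with llama-cpp-python.
# The GGUF file name below is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="polaris-4b-preview-q4_k_m.gguf",  # hypothetical local file
    n_ctx=2048,    # context window
    n_threads=4,   # match the device's performance cores
)

# Stream tokens as they are produced, the same pattern an on-device agent uses.
for chunk in llm(
    "Q: Where does the model run, cloud or device?\nA:",
    max_tokens=64,
    stream=True,
):
    print(chunk["choices"][0]["text"], end="", flush=True)
```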
I've been running small language models (SLLMs) directly on smartphones, completely offline, with no cloud backend or server API calls.
I wanted to share:
1. Tokens/sec performance across several SLLMs (a minimal measurement sketch follows the model list below)
2. Observations on hardware utilization (where the workload actually runs)
3. Trade-offs between model size, latency, and feasibility for mobile apps
Reports are available for the following models:
- QWEN3 0.6B
- NVIDIA/Nemotron QWEN 1.5B
- SimpleScaling S1
- TinyLlama
- Unsloth-tuned Llama 3.2 1B
- Naver HyperClova 0.5B
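For transparency on how the numbers are obtained: tokens/sec is simply generated tokens divided by wall-clock generation time. The actual measurements were taken inside the mobile runtime, but the arithmetic is the same as in this minimal llama-cpp-python sketch (the GGUF file name is hypothetical).

```python
# Rough tokens/sec measurement: generated tokens / wall-clock time.
# The model file name below is hypothetical.
import time
from llama_cpp import Llama

llm = Llama(model_path="qwen3-0.6b-q4_k_m.gguf", n_ctx=1024)

prompt = "Summarize why on-device inference matters."
start = time.perf_counter()
out = llm(prompt, max_tokens=128)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]  # tokens actually produced
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tok/s")
```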
Comparable benchmark reports (no cloud, all on-device):
I'd really value your thoughts on:
- Creative ideas to further optimize inference under these hardware constraints
- Other compact LLMs worth testing on-device
- Experiences you've had trying to deploy LLMs at the edge
If there's interest, I'm happy to share more details on the test setup, hardware specs, or the tooling we used for these comparisons.
Thanks for taking a look, and you can build your own at https://mlange.zetic.ai!