Specification

#76
by zarifhaikal01

What's the best cheap setup to host this model locally?

I need to know which GPU and CPU, and how much RAM, to handle maybe around 300 requests daily.

Google org

Hi @zarifhaikal01, Gemma 3 quickly established itself as a leading model capable of running on a single high-end GPU such as the NVIDIA H100 using its native BFloat16 (BF16) precision.

To make Gemma 3 even more accessible, Google has announced new versions optimized with Quantization-Aware Training (QAT) that dramatically reduce memory requirements while maintaining high quality. This lets you run powerful models like Gemma 3 27B locally on consumer-grade GPUs such as the NVIDIA RTX 3090.

NVIDIA GeForce RTX 3090 (24 GB VRAM): This is currently the sweet spot for VRAM in consumer cards. A single RTX 3090 can comfortably load the Gemma 3 27B IT QAT INT4 model (which needs ~14.1 GB of VRAM) and leave plenty of room for the KV cache (which stores conversation context) and vLLM's overhead. You might even be able to squeeze in a slightly larger quantization if needed. A rough VRAM budget is sketched below.
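To illustrate the sizing, here is a back-of-the-envelope budget as a short Python sketch. The ~14.1 GB weight figure comes from the QAT INT4 model above; the runtime-overhead number is an illustrative assumption, not a measured value:

```python
# Rough VRAM budget for Gemma 3 27B IT QAT INT4 on a 24 GB RTX 3090.
total_vram_gb = 24.0
weights_gb = 14.1           # QAT INT4 weights (from the model card)
runtime_overhead_gb = 1.5   # assumed CUDA context + activation/serving overhead

kv_cache_budget_gb = total_vram_gb - weights_gb - runtime_overhead_gb
print(f"Approx. VRAM left for KV cache: {kv_cache_budget_gb:.1f} GB")
# -> roughly 8-9 GB for conversation context, comfortable for ~300 requests/day
```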

It will provide good inference speed for 300 daily requests, especially with vLLM's optimized serving. You can expect 20-30+ tokens/second for shorter outputs, which is perfectly acceptable for daily use. A minimal serving sketch follows.
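As a concrete starting point, here is a minimal vLLM sketch for single-GPU inference. The model id, memory fraction, and context cap are assumptions to adapt from the model card of the QAT checkpoint you download, not a verified configuration:

```python
# Minimal vLLM sketch for Gemma 3 27B on a single 24 GB GPU.
# The model id below is a placeholder -- substitute the exact QAT INT4
# checkpoint name from the model card.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-3-27b-it",   # placeholder; use the QAT INT4 checkpoint
    gpu_memory_utilization=0.90,     # leave a little headroom on the 24 GB card
    max_model_len=8192,              # cap context so the KV cache stays within budget
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Explain quantization-aware training in two sentences."], params)
print(outputs[0].outputs[0].text)
```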

Kindly refer to this link for more information. If you have any concerns, let us know and we will assist you. Thank you.
