
Any plans for some tiny models? (<4b)

#2
by phly95 - opened

Mistral models are great, but the lineup is unfortunately missing anything under 4B. Options like Qwen 2.5, Gemma 2 2B, and Llama 3.2 3B and 1B exist, but I feel like having a Mistral model in that range would make deploying local LLM-powered apps a lot easier, especially when deploying to basic laptops in a workplace (good luck convincing IT to deploy Nvidia laptops to an entire company).
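For reference, this is roughly what CPU-only deployment looks like today with one of the alternatives named above; a minimal sketch assuming the `transformers` library, with the Qwen 2.5 1.5B model id and generation settings chosen purely as an illustration:

```python
# Sketch of running a sub-4B model on a CPU-only laptop with transformers.
# The model id and settings below are illustrative assumptions, not anything
# Mistral currently ships in this size range.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-1.5B-Instruct"  # small enough to fit in a basic laptop's RAM
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # loads on CPU by default, no GPU needed

messages = [{"role": "user", "content": "Summarize this ticket in one sentence: ..."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

A Mistral checkpoint at this scale could slot into the same few lines, which is the appeal: no GPU provisioning, just RAM and a pip install.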
