ONEKQ AI

company

AI & ML interests

Benchmark, Code Generation, LLM

Recent Activity

onekq updated a Space about 9 hours ago: onekq-ai/README
onekq updated a Space 1 day ago: onekq-ai/WebApp1K-models-leaderboard
onekq updated a model 15 days ago: onekq-ai/OneSQL-v0.1-Qwen-1.5B-GGUF

onekq-ai's activity

onekq posted an update about 7 hours ago
onekq updated a Space about 9 hours ago
onekq posted an update 1 day ago
onekq posted an update 2 days ago
onekq posted an update 3 days ago
onekq posted an update 4 days ago
I used three posts to explain GPU/CPU and LLM performance, and now I'm finally circling back to my own model. 😅

OneSQL needs a GPU because it processes long prompts. It is not a chatbot that answers short prompts with long replies. I call models of this kind workhorse models.

We all have to scramble for GPUs to get adoption. Below are a few ways to get them.

You can inherit it. If you have a new Mac, congratulations, you already have a GPU.

You can leverage it. Get inference providers to adopt your model, and you switch from CapEx to OpEx.

Or you can buy it. Go frugal: find older GPUs with enough HBM to hold your model.
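
As a concrete (purely illustrative) example of the "inherit it" and "buy it" routes, here is a minimal sketch, assuming llama-cpp-python, that offloads every layer of a GGUF model to whatever GPU the machine has: Metal on a Mac, CUDA on an older NVIDIA card. The GGUF filename below is a placeholder; check the model repo for the actual file name.

```python
# Minimal sketch (not an official recipe): run a GGUF workhorse model on whatever GPU you have.
# n_gpu_layers=-1 offloads all layers -- Metal on Apple Silicon, CUDA on an older NVIDIA card.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="onekq-ai/OneSQL-v0.1-Qwen-1.5B-GGUF",
    filename="onesql-v0.1-qwen-1.5b-q8_0.gguf",  # placeholder; use the real file name from the repo
)

llm = Llama(model_path=model_path, n_ctx=4096, n_gpu_layers=-1)

# Long, schema-heavy prompts are the typical workload for a text-to-SQL model.
prompt = "-- Schema: CREATE TABLE users(id INT, name TEXT); ...\n-- Question: how many users?\nSELECT"
out = llm(prompt, max_tokens=64)
print(out["choices"][0]["text"])
```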
onekq posted an update 6 days ago
I just compared tasks with different input/output lengths. CPU and GPU performance differ a lot here.

The LLMs we use today are autoregressive (causal) models, meaning the generation of each output token depends on all previous tokens. Since the model must generate one token at a time, this sets a hard limit on parallelism. The chatbot that simulates human typing is in fact a UI trick to gloss over this fundamental limit. This is great news for CPUs because it levels the playing field.

But when processing input tokens, this limit doesn't exist. The GPU can fire up thousands of cores (vs. dozens of CPU cores) to process as many input tokens as it can, all at once. Here, the GPU enjoys a significant speed margin over the CPU. The longer the prompt, the bigger the margin.

So, when it comes to user experience, both GPU and CPU can output text at a decent speed. What really distinguishes them is the initial wait time, i.e. the prompt-processing delay.
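
A minimal sketch of that split, assuming PyTorch and transformers with an illustrative small model: prefill is one parallel forward pass over the whole prompt, while decode is a strictly sequential loop.

```python
# Minimal sketch: time prefill (parallel over all prompt tokens) vs decode (one token at a time).
# Assumes PyTorch + transformers; the model name is only an illustrative small causal LM.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative
device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).to(device).eval()

inputs = tok("some long prompt " * 500, return_tensors="pt").to(device)

def now():
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter()

# Prefill: every input token is processed in a single forward pass.
t0 = now()
with torch.no_grad():
    out = model(**inputs, use_cache=True)
prefill = now() - t0

# Decode: each new token depends on all previous ones, so it is a sequential loop.
past, next_id = out.past_key_values, out.logits[:, -1:].argmax(-1)
t0 = now()
with torch.no_grad():
    for _ in range(64):
        out = model(input_ids=next_id, past_key_values=past, use_cache=True)
        past, next_id = out.past_key_values, out.logits[:, -1:].argmax(-1)
decode = now() - t0

n_in = inputs["input_ids"].shape[1]
print(f"prefill: {prefill:.2f}s for {n_in} tokens; decode: {decode:.2f}s for 64 tokens")
```

The gap between CPU and GPU shows up mostly in the prefill number, which is exactly the prompt-processing delay described above.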
onekq posted an update 8 days ago
I just compared CPU vs GPU. The CPU is actually good for tasks with a short prompt and a long answer. For such tasks, we usually treat the LLM as a consultant or teacher.

Say you are filing taxes and ask "What is form XXXX?" The chatbot will return an essay to explain the form and walk you through scenarios.

But when you decide to file this form, the LLM becomes your assistant/agent. Suddenly the prompt becomes (much) longer than the answer. You throw in a bunch of documents and ask the LLM to fill out the form for you.

This is when we need a GPU. I will get into the details in the next post.
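
A toy sketch of that asymmetry (the tokenizer and the stand-in texts are purely illustrative), just counting prompt vs. answer tokens for the two usage patterns:

```python
# Minimal sketch: the prompt/answer length asymmetry between chatbot and agent use.
# The tokenizer name and the stand-in texts are illustrative placeholders.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # illustrative

# Consultant/teacher: short prompt, long answer.
question = "What is form XXXX?"
essay = "Form XXXX is used to report ... " * 100        # stand-in for a long explanation

# Assistant/agent: long prompt (documents + instructions), short answer.
documents = "W-2 ... 1099 ... receipts ... " * 300       # stand-in for uploaded documents
instruction = "Using the documents above, fill out form XXXX."
filled_form = "Line 1: ... Line 2: ..."                   # stand-in for the short structured output

for label, prompt, answer in [
    ("chatbot", question, essay),
    ("agent", documents + instruction, filled_form),
]:
    p, a = len(tok.encode(prompt)), len(tok.encode(answer))
    print(f"{label}: {p} prompt tokens vs {a} answer tokens")
```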
onekq posted an update 9 days ago
We desperately need GPUs for model inference. CPUs can't replace them.

I will start with the basics. A GPU is designed to serve predictable workloads with many parallel units (pixels, tensors, tokens). So a GPU allocates as much of its transistor budget as possible to building thousands of compute units (CUDA cores on NVIDIA, execution units on Apple Silicon), each capable of running a thread.

But a CPU is designed to handle all kinds of workloads. CPU cores are much larger (and hence far fewer), with branch prediction and other complex machinery. In addition, more and more transistors are allocated to larger caches (~50% of the die now) to house the unpredictable, eating into the compute budget.

Generalists can't beat specialists.
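
A minimal sketch of the specialist vs. generalist gap, assuming PyTorch and an available CUDA device: the same large matrix multiply (the core operation behind LLM inference) timed on CPU and on GPU.

```python
# Minimal sketch: one predictable, massively parallel workload (a matmul) on CPU vs GPU.
# Assumes PyTorch; the sizes are arbitrary but large enough to keep thousands of GPU cores busy.
import time
import torch

def bench(device, n=4096, iters=5):
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    torch.matmul(a, b)                      # warm-up
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

print(f"CPU: {bench('cpu'):.3f}s per 4096x4096 matmul")
if torch.cuda.is_available():
    print(f"GPU: {bench('cuda'):.3f}s per 4096x4096 matmul")
```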
onekq posted an update 11 days ago
onekq posted an update 12 days ago
onekq posted an update 14 days ago