ONEKQ AI

company

AI & ML interests

Benchmark, Code Generation, LLM

Recent Activity

onekq updated a Space about 9 hours ago: onekq-ai/README
onekq updated a Space 1 day ago: onekq-ai/WebApp1K-models-leaderboard
onekq updated a model 15 days ago: onekq-ai/OneSQL-v0.1-Qwen-1.5B-GGUF

onekq-ai's activity

onekq posted an update about 7 hours ago
onekq updated a Space about 9 hours ago
onekq posted an update 1 day ago
onekq posted an update 2 days ago
onekq posted an update 3 days ago
onekq posted an update 4 days ago
I used three posts to explain GPU/CPU and LLM performance, and now I'm finally circling back to my own model. 😅

OneSQL needs a GPU because it processes long prompts. It is not a chatbot that answers short prompts with long replies. I call models of this kind workhorse models.

We all have to scramble for GPUs to get adoption. Below are a few ways to get them.

You can inherit it. If you have a new Mac, congratulations, you already have a GPU.

You can leverage it. Get inference providers to adopt your model, and you switch from CapEx to OpEx.

Or you can buy it. Go frugal: find older GPUs with enough HBM to hold your model.
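
As a concrete (purely illustrative) example of the "inherit it" and "buy it" routes, here is a minimal sketch, assuming llama-cpp-python, that offloads every layer of a GGUF model to whatever GPU the machine has: Metal on a Mac, CUDA on an older NVIDIA card. The GGUF filename below is a placeholder; check the model repo for the actual file name.

```python
# Minimal sketch (not an official recipe): run a GGUF workhorse model on whatever GPU you have.
# n_gpu_layers=-1 offloads all layers -- Metal on Apple Silicon, CUDA on an older NVIDIA card.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="onekq-ai/OneSQL-v0.1-Qwen-1.5B-GGUF",
    filename="onesql-v0.1-qwen-1.5b-q8_0.gguf",  # placeholder; use the real file name from the repo
)

llm = Llama(model_path=model_path, n_ctx=4096, n_gpu_layers=-1)

# Long, schema-heavy prompts are the typical workload for a text-to-SQL model.
prompt = "-- Schema: CREATE TABLE users(id INT, name TEXT); ...\n-- Question: how many users?\nSELECT"
out = llm(prompt, max_tokens=64)
print(out["choices"][0]["text"])
```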
onekq posted an update 6 days ago
I just compared tasks with different input/output lengths. CPU and GPU performance differ a lot here.

The LLMs we use today are autoregressive (causal) models, meaning the generation of each output token depends on all previous tokens. Since the model must generate one token at a time, this sets a hard limit on parallelism. The chatbot that simulates human typing is in fact a UI trick to gloss over this fundamental limit. This is great news for CPUs because it levels the playing field.

But when processing input tokens, this limit doesn't exist. The GPU can fire up thousands of cores (vs. dozens of CPU cores) to process as many input tokens as it can, all at once. Here, the GPU enjoys a significant speed margin over the CPU. The longer the prompt, the bigger the margin.

So, when it comes to user experience, both GPU and CPU can output text at a decent speed. What really distinguishes them is the initial wait time, i.e. the prompt-processing delay.
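
A minimal sketch of that split, assuming PyTorch and transformers with an illustrative small model: prefill is one parallel forward pass over the whole prompt, while decode is a strictly sequential loop.

```python
# Minimal sketch: time prefill (parallel over all prompt tokens) vs decode (one token at a time).
# Assumes PyTorch + transformers; the model name is only an illustrative small causal LM.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative
device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).to(device).eval()

inputs = tok("some long prompt " * 500, return_tensors="pt").to(device)

def now():
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter()

# Prefill: every input token is processed in a single forward pass.
t0 = now()
with torch.no_grad():
    out = model(**inputs, use_cache=True)
prefill = now() - t0

# Decode: each new token depends on all previous ones, so it is a sequential loop.
past, next_id = out.past_key_values, out.logits[:, -1:].argmax(-1)
t0 = now()
with torch.no_grad():
    for _ in range(64):
        out = model(input_ids=next_id, past_key_values=past, use_cache=True)
        past, next_id = out.past_key_values, out.logits[:, -1:].argmax(-1)
decode = now() - t0

n_in = inputs["input_ids"].shape[1]
print(f"prefill: {prefill:.2f}s for {n_in} tokens; decode: {decode:.2f}s for 64 tokens")
```

The gap between CPU and GPU shows up mostly in the prefill number, which is exactly the prompt-processing delay described above.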
onekq posted an update 8 days ago
I just compared CPU vs GPU. The CPU is actually good for tasks with a short prompt and a long answer. For such tasks, we usually treat the LLM as a consultant or teacher.

Say you are filing taxes and ask "What is form XXXX?" The chatbot will return an essay to explain the form and walk you through scenarios.

But when you decide to file this form, the LLM becomes your assistant/agent. Suddenly the prompt becomes (much) longer than the answer. You throw in a bunch of documents and ask the LLM to fill out the form for you.

This is when we need a GPU. I will get into the details in the next post.
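
A toy sketch of that asymmetry (the tokenizer and the stand-in texts are purely illustrative), just counting prompt vs. answer tokens for the two usage patterns:

```python
# Minimal sketch: the prompt/answer length asymmetry between chatbot and agent use.
# The tokenizer name and the stand-in texts are illustrative placeholders.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # illustrative

# Consultant/teacher: short prompt, long answer.
question = "What is form XXXX?"
essay = "Form XXXX is used to report ... " * 100        # stand-in for a long explanation

# Assistant/agent: long prompt (documents + instructions), short answer.
documents = "W-2 ... 1099 ... receipts ... " * 300       # stand-in for uploaded documents
instruction = "Using the documents above, fill out form XXXX."
filled_form = "Line 1: ... Line 2: ..."                   # stand-in for the short structured output

for label, prompt, answer in [
    ("chatbot", question, essay),
    ("agent", documents + instruction, filled_form),
]:
    p, a = len(tok.encode(prompt)), len(tok.encode(answer))
    print(f"{label}: {p} prompt tokens vs {a} answer tokens")
```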
onekq posted an update 9 days ago
We desperately need GPUs for model inference. CPUs can't replace them.

I will start with the basics. A GPU is designed to serve predictable workloads with many parallel units (pixels, tensors, tokens). So a GPU allocates as much of its transistor budget as possible to building thousands of compute units (CUDA cores on NVIDIA, execution units on Apple Silicon), each capable of running a thread.

But a CPU is designed to handle all kinds of workloads. CPU cores are much larger (and hence far fewer), with branch prediction and other complex machinery. In addition, more and more transistors are allocated to larger caches (~50% of the die now) to house the unpredictable, eating into the compute budget.

Generalists can't beat specialists.
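
A minimal sketch of the specialist vs. generalist gap, assuming PyTorch and an available CUDA device: the same large matrix multiply (the core operation behind LLM inference) timed on CPU and on GPU.

```python
# Minimal sketch: one predictable, massively parallel workload (a matmul) on CPU vs GPU.
# Assumes PyTorch; the sizes are arbitrary but large enough to keep thousands of GPU cores busy.
import time
import torch

def bench(device, n=4096, iters=5):
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    torch.matmul(a, b)                      # warm-up
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

print(f"CPU: {bench('cpu'):.3f}s per 4096x4096 matmul")
if torch.cuda.is_available():
    print(f"GPU: {bench('cuda'):.3f}s per 4096x4096 matmul")
```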
onekq posted an update 11 days ago
onekq posted an update 12 days ago
onekq posted an update 14 days ago