Yi Cui
AI & ML interests
Recent Activity
Organizations
onekq's activity


I don't expect it to break SOTA. In fact, it will be a win if it beats the old R1, which already stands very high on the leaderboard.
onekq-ai/WebApp1K-models-leaderboard
IMO the world needs a better vanilla LLM, e.g. DeepSeek v4 or v3.5, which we will use in daily life. That's the direction Gemini Flash took, which I praised.

Just pick a React, Svelte, or Vue template when you create your Space, or add
app_build_command: npm run build
and
app_file: build/index.html
to your README's YAML block. Or follow this link: https://huggingface.co/new-space?sdk=static
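Putting those two fields together, a minimal README front-matter sketch for a static Space might look like this (the title is a placeholder; only the sdk, app_build_command, and app_file lines are the point):

```yaml
---
title: My Web App          # placeholder
sdk: static
app_build_command: npm run build
app_file: build/index.html
---
```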
Let's build!

Claude 4 Opus!!
7 months!!
I thought the day would never come. But here it is.
onekq-ai/WebApp1K-models-leaderboard
Cost me quite a bit of money, but it was all worth it.
Enjoy, and make as much of it as you can!

onekq-ai/WebApp1K-models-leaderboard
Reasoning is good for coding, but not mandatory.


codex-mini is a finetuned version of o4-mini, but on my leaderboard it performs worse than its base model.
onekq-ai/WebApp1K-models-leaderboard

https://huggingface.co/papers?q=2505.09027
The central argument here is that test-driven development is a natural fit for LLMs, which scale better than humans do. I bet the future will see thousands of such leaderboards (and many more proprietary ones), each dominated by a specialized model.


The causal link is quite fascinating and worthy of a few blogposts or deep research queries, but I won't have more time for this (I really wish I did), so here goes.
* AI workloads love GPUs because they allocate more transistors to compute than CPUs do, and pair them with high-bandwidth memory
* more compute in the same small physical space -> more power draw and more heat dissipation
* more heat dissipation -> liquid cooling
* new cooling and heavier power draw -> bigger racks (heavier and taller)
* bigger racks -> (re)building data centers
* new data centers with higher power demand (peak and stability) -> grid upgrades and nuclear power
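The power-draw step in the chain above can be sketched with back-of-the-envelope arithmetic. The wattages and rack density below are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope rack power comparison (all figures are assumptions).
cpu_server_watts = 500      # assumed typical dual-socket CPU server
gpu_server_watts = 10_200   # assumed 8-GPU server (~700 W per GPU plus host overhead)
servers_per_rack = 8        # assumed rack density

cpu_rack_kw = cpu_server_watts * servers_per_rack / 1000
gpu_rack_kw = gpu_server_watts * servers_per_rack / 1000

print(f"CPU rack: {cpu_rack_kw:.1f} kW")  # CPU rack: 4.0 kW
print(f"GPU rack: {gpu_rack_kw:.1f} kW")  # GPU rack: 81.6 kW
```

An order-of-magnitude jump in per-rack power is exactly why air cooling stops being enough and the rest of the chain (liquid cooling, bigger racks, new builds, grid upgrades) follows.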

onekq-ai/WebApp1K-models-leaderboard

onekq-ai/WebApp1K-models-leaderboard
The biggest pain point is still inference providers. Even decent labs like Ai2 or THUDM have to lobby for coverage. My leaderboard is for web developers, but I can only evaluate the most visible models with token API support. https://huggingface.co/spaces/onekq-ai/WebApp1K-models-leaderboard
Maybe some players have GPUs but keep the results to themselves. We can only hope they will reciprocate the benefit they get from this community.

onekq-ai/WebApp1K-models-leaderboard

yes yes.
Maybe you can run a leaderboard of models indexed by freedom

onekq-ai/WebApp1K-models-leaderboard

I doubted there would be a Qwen3-coder, but the direction changed. Alibaba is a corporation; you can imagine the number of executive sponsors for this release. Stock performance is at stake now. The price of success.

Do you mean the non-thinking mode? If so, add /no_think to your prompt
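A tiny sketch of what that looks like in practice — the helper name is mine, not part of any API; the only real piece is the /no_think soft switch appended to the prompt:

```python
def no_think(prompt: str) -> str:
    """Append Qwen3's /no_think soft switch, which disables the thinking mode."""
    return f"{prompt.rstrip()} /no_think"

print(no_think("Summarize this README."))
# → Summarize this README. /no_think
```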