Amazing work👏
Introduces Dream 7B - a discrete diffusion reasoning model, fully opensourced with weights on 🤗
- it outperforms existing non-autoregressive models and matches or beats frontier autoregressive of similar size on reasoning tasks.
Models:
- base: Dream-org/Dream-v0-Base-7B
- SFT: Dream-org/Dream-v0-Instruct-7B
Code: https://github.com/HKUNLP/Dream
Project: https://hkunlp.github.io/blog/2025/dream/

1 reply

reacted to BestWishYsh's post with 🔥 12 days ago

Post

2592

🚨 Hot Take: GPT-4o might NOT be a purely autoregressive model! 🚨

There’s a high chance it has a diffusion head. 🤯 If true, this could be a game-changer for AI architecture. What do you think? 🤔👇

Code: https://github.com/PicoTrex/GPT-ImgEval
Dataset: Yejy53/GPT-ImgEval
Paper: GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation (2504.02782)

replied to lianghsun's post 12 days ago

lfg

reacted to lianghsun's post with 🔥 12 days ago

Post

2147

With the arrival of Twinkle April — Twinkle AI’s annual open-source celebration held every April — our community is excited to unveil its very first project:

📊 Twinkle Eval (https://github.com/ai-twinkle/Eval), a next-generation evaluation tool led by our contributor @tedslin .

Unlike traditional evaluation tools like iKala’s ievals (https://github.com/ikala-ai/ievals), which can only evaluate language models (LMs) one sample at a time, Twinkle Eval is designed with Large Reasoning Models (LRMs) in mind. As reasoning time increases with more complex models, traditional tools become increasingly inefficient 😲 — for example, evaluating LRMs on the ikala/tmmluplus benchmark could take *
half a day without finishing.

One question we were especially curious about:
Does shuffling multiple-choice answer order impact model accuracy? 🤔
→ See: "Change Answer Order Can Decrease MMLU Accuracy" – arXiv:2406.19470v1

To address these challenges, Twinkle Eval brings three key innovations to the table:

1️⃣ Parallelized evaluation of samples
2️⃣ Multi-round testing for stability
3️⃣ Randomized answer order to test robustness

After running experiments, we observed that Twinkle Eval can speed up evaluation by up to 15× 🚀🚀. Interestingly, most models scored slightly lower under the 2️⃣3️⃣ test settings compared to their claimed performance — suggesting further benchmarking is needed.

This framework also comes with additional tunable parameters and detailed logging of LM behavior per question — perfect for those who want to dive deeper. 😆

If you find Twinkle Eval useful, please ⭐ the project and help spread the word 🤗

5 replies

replied to AdinaY's post 12 days ago

wow

reacted to AdinaY's post with 🔥 12 days ago

Post

2781

SkyReels-A2 🚀 an open framework for controllable video generation from text + images, released by Skywork, KunLun

✨Model:
Skywork/SkyReels-A2
✨Paper:
SkyReels-A2: Compose Anything in Video Diffusion Transformers (2504.02436)

1 reply