With the arrival of Twinkle April, Twinkle AI’s annual open-source celebration held every April, our community is excited to unveil its very first project: Twinkle Eval.
Unlike traditional evaluation tools such as iKala’s ievals (https://github.com/ikala-ai/ievals), which evaluate language models (LMs) only one sample at a time, Twinkle Eval is designed with Large Reasoning Models (LRMs) in mind. As reasoning time grows with more complex models, sequential tools become increasingly inefficient 😲: for example, evaluating an LRM on the ikala/tmmluplus benchmark could run for half a day without finishing.
One question we were especially curious about: does shuffling the order of multiple-choice answers affect model accuracy? 🤔 → See: "Changing Answer Order Can Decrease MMLU Accuracy" (arXiv:2406.19470v1)
To address these challenges, Twinkle Eval brings three key innovations to the table:
1️⃣ Parallelized evaluation of samples
2️⃣ Multi-round testing for stability
3️⃣ Randomized answer order to test robustness
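To make the three ideas concrete, here is a minimal Python sketch. It is not Twinkle Eval's actual code or API: the sample dictionary layout, the `ask_model` stub (which stands in for a real LLM call), and all function names are assumptions made for illustration. It shuffles the options while tracking where the correct answer ends up, scores samples in parallel with a thread pool, and repeats the run over several seeds to report a mean and spread.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def shuffle_options(options, answer_idx, rng):
    """Return the options in random order plus the new index of the correct one."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(answer_idx)


def ask_model(question, options):
    """Hypothetical stand-in for an LLM call: always picks the longest option."""
    return max(range(len(options)), key=lambda i: len(options[i]))


def evaluate_round(samples, seed, max_workers=8):
    """Score one round: shuffle every sample, then query the model in parallel."""
    rng = random.Random(seed)
    # Shuffle up front in a single thread so each round is reproducible.
    prepared = []
    for s in samples:
        options, gold = shuffle_options(s["options"], s["answer"], rng)
        prepared.append((s["question"], options, gold))

    def is_correct(item):
        question, options, gold = item
        return ask_model(question, options) == gold

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(is_correct, prepared))
    return sum(results) / len(results)


def evaluate(samples, rounds=3):
    """Run several rounds with different shuffles; return (mean, std) accuracy."""
    accuracies = [evaluate_round(samples, seed) for seed in range(rounds)]
    return statistics.mean(accuracies), statistics.pstdev(accuracies)
```

A thread pool suits this workload because each request mostly waits on a remote model; with a real API client, `ask_model` would be the only function to replace.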
After running experiments, we observed that Twinkle Eval can speed up evaluation by up to 15× 🚀🚀. Interestingly, most models scored slightly lower under the multi-round (2️⃣) and shuffled-answer (3️⃣) settings than their claimed performance, which suggests further benchmarking is needed.
This framework also comes with additional tunable parameters and detailed logging of LM behavior per question — perfect for those who want to dive deeper. 😆
If you find Twinkle Eval useful, please ⭐ the project and help spread the word 🤗
🚀 Now, you can use the following commands for different tasks:
🖼️ @image 'prompt...' → Generates an image
🔉 @tts1 'prompt...' → Generates speech in a female voice
🔉 @tts2 'prompt...' → Generates speech in a male voice
🅰️ @text 'prompt...' → Enables textual conversation (text-to-text generation is the default mode when no command is specified)
graph TD
A[User Interface] --> B[Chat Logic]
B --> C{Command Type}
C -->|Text| D[FastThink-0.5B]
C -->|Image| E[Qwen2-VL-OCR-2B]
C -->|@image| F[Stable Diffusion XL]
C -->|@tts| G[Edge TTS]
D --> H[Response]
E --> H
F --> H
G --> H
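The "Command Type" decision in the graph can be sketched as a small prefix router. This is an illustrative guess at the dispatch logic, not the app's real code: the `route` function and the mode labels are hypothetical, and the uploaded-image branch (which the graph sends to Qwen2-VL-OCR-2B) is left out because it depends on an attachment rather than on the message text.

```python
# Map each leading command to a mode label (labels are hypothetical).
COMMANDS = {
    "@image": "image",      # image generation (Stable Diffusion XL in the graph)
    "@tts1": "tts_female",  # speech, female voice (Edge TTS in the graph)
    "@tts2": "tts_male",    # speech, male voice (Edge TTS in the graph)
    "@text": "text",        # plain conversation (FastThink-0.5B in the graph)
}


def route(message):
    """Split a chat message into (mode, prompt) based on its leading @command."""
    stripped = message.strip()
    head, _, rest = stripped.partition(" ")
    mode = COMMANDS.get(head.lower())
    if mode is None:
        # No recognized command: fall back to text-to-text, the default mode.
        return "text", stripped
    prompt = rest.strip()
    # Drop one pair of matching surrounding quotes, as in @image 'prompt...'.
    if len(prompt) >= 2 and prompt[0] == prompt[-1] and prompt[0] in "'\"":
        prompt = prompt[1:-1]
    return mode, prompt
```

For example, `route("@tts1 'hello'")` yields the female-voice mode with the prompt `hello`, while a message without any command falls through to text mode unchanged.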