Reasoning datasets competition

Recent Activity

reasoning-datasets-competition's activity

Tonic
posted an update 2 days ago
🙋🏻‍♂️ hey there folks,

So at every bio/med/chem meeting I go to, I always ask the same questions: "why are you sharing a gdrive link with me for this?" and "Do you have any plans to publish your model weights and datasets on huggingface?" Today I finally got a good answer that explains everything:

Basically there is some kind of government censorship on this (USA, but I'm sure others too): they are told they are not allowed to publish, because it is considered a "data leak", which is illegal!

This is terrible! But the good news is that we can do something about it!

The NIH (USA) has a "call for opinions and comments" open, where we can make our opinion on this topic known: https://osp.od.nih.gov/comment-form-responsibly-developing-and-sharing-generative-artificial-intelligence-tools-using-nih-controlled-access-data/

Kindly consider dropping your opinion and thoughts about this censorship of science, and share this post, link, or thoughts widely.

Together maybe we can start to share data and model weights appropriately and openly in a good way 🙏🏻🚀

cc. @cyrilzakka

Akhil-Theerthala
posted an update 3 days ago
Kuvera v0.1.0 is now live!

A series of personal finance advisor models that try to resolve queries by understanding the person's psychological state and relevant context.

These are still prototypes that have much room for improvement.

What's included in this release:
- Akhil-Theerthala/Kuvera-8B-v0.1.0: Qwen3-8B, meticulously fine-tuned on approximately 20,000 personal-finance inquiries.
- Akhil-Theerthala/Kuvera-14B-v0.1.0: LoRA on DeepSeek-R1-Distill-Qwen-14B, honed through training on about 10,000 chain-of-thought queries.

For those interested, the models and datasets are accessible for free (links in the comments). If you are curious about the roadmap for the upcoming version, let's connect. There are many more developments I plan to make, and I would definitely appreciate any help.
codelion
posted an update 5 days ago
🧠 We just implemented Andrej Karpathy's "third paradigm" for LLM learning!

System Prompt Learning (SPL) enables LLMs to automatically learn problem-solving strategies from experience, rather than relying on static prompts.

🚀 How it works:
Your LLM builds a database of effective strategies, selects the best ones for each problem, and refines them over time based on success rates.
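
The loop described above (store strategies, pick the best ones, update them from success rates) can be sketched roughly as follows. This is a minimal illustration, not the optillm plugin's code: `Strategy`, `StrategyStore`, `build_system_prompt`, and the neutral 0.5 prior for untried strategies are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    """One human-readable strategy plus its track record (hypothetical schema)."""
    problem_type: str
    text: str
    attempts: int = 0
    successes: int = 0

    @property
    def success_rate(self) -> float:
        # Untried strategies get a neutral 0.5 so they can still be selected.
        return self.successes / self.attempts if self.attempts else 0.5


class StrategyStore:
    """In-memory strategy database; a real system would persist this."""

    def __init__(self) -> None:
        self.strategies: list[Strategy] = []

    def add(self, problem_type: str, text: str) -> Strategy:
        strategy = Strategy(problem_type, text)
        self.strategies.append(strategy)
        return strategy

    def select(self, problem_type: str, k: int = 2) -> list[Strategy]:
        # Pick the k strategies with the highest success rate for this problem type.
        matching = [s for s in self.strategies if s.problem_type == problem_type]
        return sorted(matching, key=lambda s: s.success_rate, reverse=True)[:k]

    def record(self, strategy: Strategy, solved: bool) -> None:
        # Refine the strategy's statistics after each use.
        strategy.attempts += 1
        strategy.successes += int(solved)


def build_system_prompt(base: str, strategies: list[Strategy]) -> str:
    # Append the selected strategies to the base system prompt as plain bullets.
    return "\n".join([base] + [f"- {s.text}" for s in strategies])


store = StrategyStore()
isolate = store.add("algebra", "Isolate the unknown before substituting numbers.")
guess = store.add("algebra", "Guess a value and check it.")
store.record(isolate, solved=True)
store.record(guess, solved=False)
print(build_system_prompt("You are a careful solver.", store.select("algebra", k=1)))
```

Because every strategy is a plain string with counters attached, the whole store stays inspectable, which matches the "human-readable" property the post emphasizes.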

📊 Results across math benchmarks:
Arena Hard: 29% → 37.6% (+8.6 points)
AIME24: 23.33% → 30% (+6.67 points)
OptILLMBench: 61% → 65% (+4 points)

The best part? All strategies are human-readable and the system gets progressively better at problem types you use frequently.

✨ Key benefits:
🔄 Cumulative learning over time
📖 Transparent, inspectable strategies
🔌 Works with any OpenAI-compatible API
⚡ Simple integration: just add "spl-" prefix to your model

Built as an open-source plugin in optillm. After 500 queries, our system developed 129 strategies and refined 97 of them!

This feels like a genuine step toward AI that learns from experience while staying completely interpretable.

🔗 GitHub: https://github.com/codelion/optillm/tree/main/optillm/plugins/spl
📖 Full article: https://huggingface.co/blog/codelion/system-prompt-learning
🐦 Original Karpathy tweet: https://x.com/karpathy/status/1921368644069765486

Have you experimented with advanced system prompting? What strategies would you want your LLM to learn?
codelion
posted an update 10 days ago
Introducing AutoThink: adaptive reasoning for LLMs that improves accuracy by up to 43% (relative to baseline) on reasoning benchmarks!

Instead of using fixed thinking budgets, AutoThink:
- Classifies query complexity (HIGH/LOW) using adaptive classification
- Dynamically allocates thinking tokens based on complexity
- Uses steering vectors derived from Pivotal Token Search to guide reasoning patterns
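
The first two steps can be sketched as below. This is only an illustration under stated assumptions: the keyword heuristic in `classify_complexity` is a stand-in for the adaptive classifier the post refers to, and the budget numbers in `THINKING_BUDGETS` are made up for the example. The steering-vector step is not shown.

```python
# Hypothetical token budgets per complexity label (illustrative values only).
THINKING_BUDGETS = {"HIGH": 4096, "LOW": 512}

# Hypothetical cues that suggest a query needs multi-step reasoning.
REASONING_CUES = ("prove", "derive", "step by step", "why", "optimize")


def classify_complexity(query: str) -> str:
    """Label a query HIGH or LOW with a crude heuristic (stand-in classifier)."""
    text = query.lower()
    is_long = len(text.split()) > 30
    has_cue = any(cue in text for cue in REASONING_CUES)
    return "HIGH" if is_long or has_cue else "LOW"


def thinking_budget(query: str) -> int:
    """Return the maximum number of thinking tokens to allow for this query."""
    return THINKING_BUDGETS[classify_complexity(query)]


print(thinking_budget("What is the capital of France?"))
print(thinking_budget("Prove that the square root of 2 is irrational."))
```

The returned budget would then cap the reasoning segment of the generation, so easy queries finish quickly and hard ones get more room.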

Results on DeepSeek-R1-Distill-Qwen-1.5B:
- GPQA-Diamond: 31.06% vs 21.72% baseline (+9.34 points)
- MMLU-Pro: 26.38% vs 25.58% baseline (+0.8 points)
- Uses fewer tokens than baseline approaches
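
One reading of the headline figure, assuming it refers to GPQA-Diamond, is that 43% is the relative gain over the baseline rather than the gain in percentage points. A quick check of that arithmetic, using only the numbers quoted above (`relative_gain` is a helper name introduced here):

```python
def relative_gain(baseline: float, score: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (score - baseline) / baseline * 100


gpqa = relative_gain(21.72, 31.06)   # GPQA-Diamond: +9.34 points
mmlu = relative_gain(25.58, 26.38)   # MMLU-Pro: +0.8 points

print(f"GPQA-Diamond relative gain: {gpqa:.1f}%")
print(f"MMLU-Pro relative gain: {mmlu:.1f}%")
```

On these figures the GPQA-Diamond gain works out to about 43% relative, while the MMLU-Pro gain is about 3% relative.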

Works with any local reasoning model - DeepSeek, Qwen, Llama, custom models. The technique combines our research on Pivotal Token Search (PTS) implementation and adaptive classification frameworks.

Paper: AutoThink: efficient inference for reasoning LLMs
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5253327

Code and examples:
https://github.com/codelion/optillm/tree/main/optillm/autothink

PTS implementation and technical details:
https://github.com/codelion/pts
https://huggingface.co/blog/codelion/pts

Adaptive classifier framework:
https://github.com/codelion/adaptive-classifier

Would love to hear your thoughts on adaptive resource allocation for LLM reasoning! Have you experimented with similar approaches?
Tonic
posted an update 12 days ago
🙋🏻‍♂️ Hey there folks,

Yesterday the world's first "Learn to Vibe Code" application was released.

Vibe coding is now the mainstream paradigm, so the first educational app is here to support it.

You can try it out already:

https://vibe.takara.ai

And of course it's entirely open source, so I already made my issue and feature branch :-) 🚀
zarmalhotra
updated a Space 16 days ago
codelion
posted an update 17 days ago
🧬 Hey everyone! Just released **OpenEvolve** - an open-source implementation of Google DeepMind's AlphaEvolve system.

It's an evolutionary coding agent that uses LLMs to discover and optimize algorithms. I successfully replicated DeepMind's results on circle packing (99.97% match!) and evolved a random search into a simulated annealing algorithm.

✨ Key features:
- Evolves entire codebases (not just single functions)
- Works with any OpenAI-compatible API
- LLM ensemble approach for better results
- Multi-objective optimization
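
The general shape of an evolutionary search loop like the one described above can be sketched as follows. This is not OpenEvolve's code: `evolve`, `mutate`, and `evaluate` are hypothetical names, and a random numeric perturbation stands in for the LLM that would propose code changes. Candidates here are plain numbers rather than programs.

```python
import random


def evolve(seed, mutate, evaluate, generations=50, population_size=8, rng=None):
    """Evolve candidates by mutating fit parents and keeping the best ones."""
    rng = rng or random.Random(0)
    population = [seed]
    for _ in range(generations):
        # Tournament selection: sample a few candidates and take the fittest.
        contenders = rng.sample(population, min(3, len(population)))
        parent = max(contenders, key=evaluate)
        # In an LLM-driven system, this is where a model would rewrite the parent.
        child = mutate(parent, rng)
        population.append(child)
        # Elitism: keep only the fittest candidates for the next generation.
        population = sorted(population, key=evaluate, reverse=True)[:population_size]
    return population[0]


def evaluate(x: float) -> float:
    # Toy fitness with a single peak at x = 3.
    return -(x - 3.0) ** 2


def mutate(x: float, rng: random.Random) -> float:
    # Stand-in for an LLM edit: a small random perturbation.
    return x + rng.gauss(0.0, 0.5)


best = evolve(0.0, mutate, evaluate)
print(best, evaluate(best))
```

Because the fittest candidate always survives the truncation step, the returned result is never worse than the seed.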

👉 Check it out:
GitHub: https://github.com/codelion/openevolve
Blog post: https://huggingface.co/blog/codelion/openevolve

Would love to hear your thoughts or answer any questions about it!
codelion
posted an update 19 days ago
Introducing Pivotal Token Search (PTS): A new technique for targeted LLM alignment

Excited to share Pivotal Token Search (PTS), a technique for identifying and optimizing critical decision points in LLM generations!

GitHub repository: https://github.com/codelion/pts

What is PTS?
PTS helps identify specific "pivotal tokens" that dramatically shift the probability of a successful generation. Unlike traditional DPO which treats all tokens equally, PTS focuses optimization on the tokens that actually matter for success.
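
As a rough illustration of the idea, assume you already have an estimate of the success probability after each token prefix (in practice this would come from sampling completions, which is not shown here). A pivotal token is then one where that estimate jumps by more than some threshold. The function name `find_pivotal_tokens` and the 0.2 threshold are assumptions for this sketch, not the repository's API.

```python
def find_pivotal_tokens(tokens, success_probs, threshold=0.2):
    """Return (index, token, delta) for tokens that shift P(success) sharply.

    success_probs[i] is the estimated success probability after tokens[:i],
    so it must have exactly one more entry than tokens.
    """
    if len(success_probs) != len(tokens) + 1:
        raise ValueError("success_probs needs len(tokens) + 1 entries")
    pivotal = []
    for i, token in enumerate(tokens):
        # Change in estimated success probability caused by adding this token.
        delta = success_probs[i + 1] - success_probs[i]
        if abs(delta) >= threshold:
            pivotal.append((i, token, round(delta, 4)))
    return pivotal


tokens = ["Let", "x", "=", "2", "so"]
probs = [0.5, 0.5, 0.55, 0.5, 0.9, 0.88]
print(find_pivotal_tokens(tokens, probs))
```

Pairs built around such tokens (the token that raised the probability versus an alternative that lowered it) are what would feed a preference-optimization pipeline.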

Inspired by Microsoft's recent Phi-4 paper (which used this technique to achieve SOTA reasoning with only 14B parameters), PTS is especially effective for:
- Mathematical reasoning
- Coding tasks
- Multi-step problem solving
- Any domain where specific decision points strongly impact outcomes

What we're releasing today: codelion/pivotal-token-search-68241145d8b8502122f3ce4f

1. Open-source code:
- Complete implementation of the PTS algorithm
- Data generation pipelines
- Usage examples and documentation

2. Hugging Face resources:
- Datasets collection: https://huggingface.co/datasets?other=pts
  * Pre-generated preference pairs for various domains
  * Ready to use in your DPO training pipelines
- Models collection: https://huggingface.co/models?other=pts
  * Pre-trained models fine-tuned with PTS
  * Specialized versions for different reasoning tasks

The algorithm is straightforward to implement and can significantly improve your model's reasoning capabilities. Check out the repository for details on getting started!

We welcome feedback, contributions, and collaborations. Let us know if you use PTS in your projects!
ZennyKenny
posted an update 30 days ago
Community! 💡💡💡

It's the last day to submit your datasets for the Reasoning Datasets Competition: https://www.bespokelabs.ai/blog/reasoning-datasets-competition

Here are my submissions:
- ZennyKenny/synthetic_vc_financial_decisions_reasoning_dataset
- ZennyKenny/cosa-benchmark-dataset
- ZennyKenny/tactical-military-reasoning-v.1.0
- ZennyKenny/tron-dataset-v.1.0

Have a look and drop a ❤️ or comment! Check out the entire collection of submissions here: https://huggingface.co/datasets?other=reasoning-datasets-competition
davidberenstein1957
posted an update about 1 month ago
ZennyKenny
posted an update about 1 month ago
After hearing the news that Marc Andreessen thinks that the only job that is safe from AI replacement is venture capital: https://gizmodo.com/marc-andreessen-says-one-job-is-mostly-safe-from-ai-venture-capitalist-2000596506 🧠🧠🧠

The Reasoned Capital synthetic dataset suddenly feels much more topical: ZennyKenny/synthetic_vc_financial_decisions_reasoning_dataset 🔥🔥🔥

Really looking forward to potentially expanding this architecture and seeing how algorithmic clever investing truly is! 💰💰💰
ZennyKenny
posted an update about 1 month ago
When I heard the Reasoning Dataset Competition deadline was extended to 9 May, I knew I had time to get in one more entry. 🔥🔥🔥

With the rise of Vibe Coding, and the potential risks that are introduced by humans letting LLMs build their apps for them, lots of people are (rightfully) concerned about the safety of the code that is hitting prod.

In response to that, I'm happy to present my final submission to the Reasoning Dataset Competition and attempt to start benchmarking the ability of LLMs to identify unsafe and / or exploitable code by way of the CoSa (Code Safety) benchmark: ZennyKenny/cosa-benchmark-dataset

Currently it is a curated set of 200 examples, calibrated on OpenAI's standard-issue models (GPT-4.1, o4-mini, and GPT-3.5 Turbo) as "baseline performance" (70% decile). Check it out and drop a ❤️ if you think it could be useful, or hit the Community section with suggestions / critiques.
ZennyKenny
posted an update about 1 month ago
The same way the advent of Adobe Illustrator has led to innovation in the way that creative professionals work, I earnestly believe that AI will do the same (contrary to the popular opinion that it represents some regression in the world of creatives).

@natalika and I were speaking about this topic and, like most illustrators, she has some understandable concerns about the spread of AI in her field. She also told me how much time she spends generating concept art that will never see the light of day in >98% of cases. 💡

To me, that sounded like a perfect opportunity to leverage image diffusion in a way that helps artists spend more time creating cool stuff rather than just malevolently mining their work and using it without credit. Using the Black Forest Labs base model FLUX, Replicate, and about $5 of H100 compute, I post-trained a LoRA adapter on a set of her images associated with one project she's working on and spun up an app with Hugging Face Spaces (and Zero GPU for the win).

I give you, Natalie Diffusion: ZennyKenny/natalie-diffusion

Now, generating concept art in her particular style takes seconds instead of hours and when it's time to put the work into production, a human designer is still invaluable. And building it in the open hopefully inspires other use cases amongst other designers. 🖖
ZennyKenny
posted an update about 1 month ago
I've created a new dataset using the Algorithm of Thoughts architecture proposed by Sel et al. (2023) in a reasoning context. (paper: https://arxiv.org/pdf/2308.10379)

The dataset simulates the discovery phase of a fictitious VC firm called Reasoned Capital and, once expanded, can be used to create models which are able to make complex, subjective financial decisions based on different criteria.

The generation process uses increasingly complex prompts to encourage recursive problem-solving, so that models assess and reevaluate the conclusions and generated opinions of upstream models. Pretty neat stuff, and I'm not aware of this architecture being used in a reasoning context anywhere else.

Check it out: ZennyKenny/synthetic_vc_financial_decisions_reasoning_dataset
ZennyKenny
posted an update about 1 month ago
Phew, maybe a little dark, but I've submitted my second dataset to the Reasoning Datasets Competition: ZennyKenny/tactical-military-reasoning-v.1.0

I'd be interested to hear the community's thoughts on the applications of AI in the military. Especially in the wargaming space.

This is something that feels inevitable (and realistically, probably already in progress). Doesn't it make sense for us to have an understanding of the mechanics of such processes? Surely they will never be open source.