Accidentally Building an AI Reasoning Research Ecosystem (Or: Can AI Stop Thinking?)

Community Article · Published June 26, 2025

Almost three years ago now, I built Can-Ai-Code to answer what seemed like a simple question: "Can LLMs even generate syntactically valid code?"

This was back in the dark ages of 2022, when LLMs were like teenagers - unpredictable, often wrong, and nobody knew how to talk to them properly. There were no chat templates yet. We had to manually figure out EOS tokens and decide whether the Alpaca or Vicuna template would perform better. The early models couldn't even write a basic for-loop without having an existential crisis, and refused to write functions whose parameters were named "banana" because that's racist (true story - shout out to Llama 2, arguably the most overcensored instruction model of all time).

Syntax errors were abundant during this time, and getting the whitespace in a Python function right enough to run was a rare treat, never mind consistently generating correct code.

Time passed. A year or so later, I found even tiny 8B models defeated my "junior developer" test. Unfazed, I made a "senior developer" test! This lasted a few months until once again the test was defeated... by an 8B. Every time I made my test harder, the models would eventually cluster at 100% again. I was stuck in an arms race with artificial intelligence, and I was losing.

It was around this time I understood the root cause: open-sourcing my results meant the tests lived on GitHub. And GitHub, being a responsible member of the internet ecosystem, gets scraped for AI training data with all the discretion of a vacuum cleaner in a Cheerios factory. The models weren't actually getting smarter at my tests - they were just regurgitating memorized solutions. This wasn't really an arms race with intelligence; it was an arms race with model vendors training on my test set.

The test suite I spent so much time on lay bleeding on the floor, defeated and contaminated. And the question it asked wasn't even relevant anymore - in 2025, LLMs code so well that we've coined the term 'vibe coding' for letting them take the wheel completely without even looking at the code they generate yourself.

Reasoning models also hit the scene hard. They promise magical run-time scaling that improves performance across the board, in exchange for "a few" extra tokens.

So I did what any reasonable person would do when faced with the complete obsolescence of their work: I pivoted to an even harder question.

If "Can AI Code?" was settled, then surely "Can AI Think?" was still up for debate.

So I defined 13 new, diabolically difficult tasks that require reasoning. I built contamination-resistant generators that can create fresh examples of these tasks every few months (you can't memorize tests that don't exist yet!). This killed two birds with one stone: I needed thousands of test cases to get statistically meaningful results, and with my new generators I could synthesize as many tests as required to hit my desired confidence intervals, then rotate them out for fresh ones to avoid contamination.
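Here's the flavor of the idea in code - a simplified illustration, not the actual ChatBench generators, and the parity task is just a stand-in for the real ones:

```python
import random

def generate_parity_task(seed: int) -> dict:
    """Toy boolean task: is the sum of a random list even?
    A fresh seed window produces a never-before-seen batch."""
    rng = random.Random(seed)
    numbers = [rng.randint(1, 999) for _ in range(rng.randint(4, 8))]
    return {
        "prompt": f"Is the sum of {numbers} even? Answer True or False.",
        "answer": sum(numbers) % 2 == 0,
    }

# Rotate the benchmark every few months by moving the seed window.
batch = [generate_parity_task(seed) for seed in range(20250601, 20250601 + 1000)]
```

Because the answer key is computed rather than hand-written, the only thing sitting on GitHub to memorize is the recipe, not the tests.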

Looking at the early results on boolean and multiple-choice questions (formats that never appeared in my coding tests), I noticed models were "solving" questions they clearly didn't understand. Turns out this problem has a name - excess accuracy. A model scoring 60% on true/false questions isn't 60% intelligent - it's 10% knowledge and 50% cosmic coin-flipping. I had to build excess accuracy correction to separate genuine understanding from statistical noise, and suddenly the skies opened up and there was separation in the data!
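To make the coin-flip arithmetic concrete, here's a minimal sketch of the correction (an illustration of the idea, not the exact ChatBench implementation): subtract the score a blind guesser would earn by chance, and optionally rescale so a pure guesser lands at zero.

```python
def excess_accuracy(observed: float, n_choices: int) -> float:
    """Accuracy above what blind guessing would produce.
    60% observed on true/false = 50% chance + 10% excess."""
    chance = 1.0 / n_choices
    return observed - chance

def corrected_accuracy(observed: float, n_choices: int) -> float:
    """Same idea rescaled to 0..1: a pure guesser scores 0.0, perfection scores 1.0."""
    chance = 1.0 / n_choices
    return max(0.0, (observed - chance) / (1.0 - chance))

print(excess_accuracy(0.60, 2))     # ~0.10
print(corrected_accuracy(0.60, 2))  # ~0.20
```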

This all seemed like a sensible transition until I realized that running thousands of thinking tests is considerably more expensive than a few hundred coding tests. When a model writes a function, it outputs maybe 500 tokens. When a model thinks about a hard problem, it can easily burn through 4000 tokens just to tell you it's confused. It was like the difference between asking someone to write their name versus asking them to explain their entire thought process while writing their name, including all the times they second-guessed their penmanship.

I was generating tens of millions of tokens every night, my RTX 3090s had already tripped the breakers twice, and I was starting to get concerned looks from my wife about the electricity usage.

Then one morning it hit me: maybe this cost IS the point.

What if reasoning efficiency - correct answers per token consumed - matters more than just right-or-wrong scoring? The question evolved from "Can AI think?" to "How efficiently can AI think?" to, increasingly, "Can AI please stop thinking and just give me an answer before my power bill achieves sentience?"
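In code, the metric is almost embarrassingly simple (a simplified formulation, scaled per thousand tokens for readability):

```python
def reasoning_efficiency(num_correct: int, total_tokens: int) -> float:
    """Correct answers per thousand tokens consumed."""
    return 1000.0 * num_correct / total_tokens

# A model that gets 70/100 right in 400k tokens is far more efficient
# than one that gets 75/100 right in 4M tokens.
print(reasoning_efficiency(70, 400_000))    # 0.175
print(reasoning_efficiency(75, 4_000_000))  # 0.01875
```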

Watching thousands of prompts smash against the context limits gave me a dangerous thought: if token efficiency is the real game, what if I try to steer the reasoning process itself? So I built Ruminate, a proxy server that gives any model using <think> tags configurable, multi-staged "thinking time" budgets. Want your AI to ponder for exactly 400 tokens, then summarize for 200 and answer for 300? That should be your choice, not the model's! Ruminate not only enforces token budgets but also injects steering-thoughts at the transitions to help the model finish up.
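To give a flavor of what multi-staged budgets mean in practice, here's a toy sketch of the control loop - not Ruminate's actual code or API, and `generate_tokens` is a hypothetical stand-in for whatever streams tokens out of your model:

```python
# Stage budgets: the "think" stage corresponds to the content inside <think> tags.
STAGES = [
    {"name": "think",     "budget": 400, "steer": "\nOkay, time to wrap up my reasoning.\n"},
    {"name": "summarize", "budget": 200, "steer": "\nNow I will state the final answer.\n"},
    {"name": "answer",    "budget": 300, "steer": ""},
]

def run_staged_generation(generate_tokens, prompt: str) -> str:
    """`generate_tokens(text)` is a hypothetical callable that streams the
    model's continuation of `text` one token at a time."""
    text = prompt
    for stage in STAGES:
        used = 0
        for token in generate_tokens(text):
            text += token
            used += 1
            if used >= stage["budget"]:
                break
        # Inject a steering-thought so the model transitions gracefully
        # instead of being cut off mid-sentence.
        text += stage["steer"]
    return text
```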

Analyzing the resulting distributions from millions of generated tokens (of largely incorrect answers) revealed a beautiful tragedy of AI reasoning: models follow what I call the triple-Gaussian distribution of Reasoning Failure. They either:

  1. underthink and quit too early, or
  2. overthink themselves into logical pretzels, or
  3. hit the "oops" zone where they think just enough but make a mistake and end up confidently wrong.

It was like watching artificial consciousness discover anxiety, procrastination, and the Dunning-Kruger effect all at once! These effects are not mutually exclusive; they co-occur and interact in a myriad of interesting ways depending on task and difficulty - a subject that deserves a whole post of its own.
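If you want to look for these modes in your own logs, one quick way (a sketch of the approach, not the actual analysis code behind these results) is to fit a three-component Gaussian mixture to the reasoning-token counts of the incorrect answers and inspect where the modes land:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Reasoning-token lengths of *wrong* answers (toy data with three modes baked in).
token_counts = np.concatenate([
    np.random.normal(300, 80, 500),    # quit too early
    np.random.normal(1800, 300, 500),  # the "oops" zone
    np.random.normal(3900, 400, 500),  # overthought into a pretzel
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=3, random_state=0).fit(token_counts)
for mean, weight in sorted(zip(gmm.means_.ravel(), gmm.weights_)):
    print(f"failure mode at ~{mean:.0f} tokens, weight {weight:.2f}")
```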

What started as "Can my local LLM write a simple function?" had metastasized into a complete reasoning research ecosystem with self-evolving benchmarks, contamination-resistant evaluation, statistical rigor, efficiency optimization, and controllable reasoning infrastructure. I hadn't planned any of this. Each solution just revealed the next problem, like opening Russian nesting dolls filled with increasingly complex existential questions about the definition of difficulty and the nature of artificial thought.

It turns out the real question isn't "Can AI think?" - it's somewhere between "How does AI think?" and "Can this AI ever stop thinking?" Because apparently, left to their own devices, most modern open-source reasoning models will happily burn through your entire token budget contemplating the philosophical implications of whether a hot dog is a sandwich (and then confidently conclude that it's actually because of global warming).

Sometimes the best research happens when you follow the problems wherever they lead, even if they lead you to accidentally building the thing you never knew you needed to answer questions you didn't know you were asking.

I am currently writing documentation and working through 300M+ tokens of results. Follow me here on HuggingFace and on GitHub: https://github.com/the-crypt-keeper/ChatBench
