This is probably the best 13b. Hope you're considering making a Mistral.

deleted

I tested the top ~12 13b LLMs, and the best performing one was AYT. That is, until I came across this one. I must have missed it because it wasn't on the leaderboard.

However, the top Mistral 7bs (Dolphin 2.1 and Zephyr Beta) are now performing slightly better than both AYT and Xwin 0.2 across a wide range of tests, which makes me think that if you made a Mistral fine-tune it would be the best performing 13b-or-smaller LLM.

deleted

I find the contrasting strengths of the 13b Llama 2 models AYT and Xwin 0.2 fascinating.

AYT is the "smartest" and most "talented" 13b Llama 2. It can solve tricky logic, math, and coding problems that Xwin 0.2 can't, and it can also write technically superior stories and poems. Sure enough, AYT scores notably higher on objective tests (e.g. 65.5 on the Hugging Face leaderboard).

However, Xwin 0.2 is better at basically everything else. For example, it wrote a story about the TV show Friends and got all of the characters' jobs, personalities, and so on correct (AYT kept making absurd mistakes, such as having the siblings Ross and Monica date). Xwin also respected my long list of prompt directions.

Xwin 0.2 basically acts like the friendly and helpful average-IQ receptionist, while AYT acts like the brainy and antisocial IT guy down in the basement. Why do LLMs split like this? Why can't someone make a smart chatbot? Would making Xwin 0.2 "smarter" by including a large amount of explanation/instruction tuning also make it score much lower on tests like AlpacaEval?

@Phil337 Care to point out which AYT model you were referring to? Just curious. Is it this one? https://huggingface.co/posicube/Llama2-chat-AYT-13B

By the AYT author:

We hypothesize that if we find a method to effectively ensemble the top rankers on each benchmark, the resulting model's performance is maximized as well.
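
For concreteness, one simple way to "ensemble" fine-tuned checkpoints is uniform weight averaging (a "model soup"). The sketch below uses placeholder checkpoint paths and is only one plausible interpretation, not necessarily the method AYT actually used:

```python
# Minimal weight-averaging sketch ("model soup"); checkpoint paths are placeholders.
import torch
from transformers import AutoModelForCausalLM

paths = ["checkpoint-a", "checkpoint-b"]  # hypothetical fine-tuned checkpoints of the same base
models = [AutoModelForCausalLM.from_pretrained(p, torch_dtype=torch.float32) for p in paths]

# Average each parameter tensor across the checkpoints.
avg_state = {
    key: torch.stack([m.state_dict()[key] for m in models]).mean(dim=0)
    for key in models[0].state_dict()
}

# Load the averaged weights back into a model with the same architecture and save it.
merged = AutoModelForCausalLM.from_pretrained(paths[0], torch_dtype=torch.float32)
merged.load_state_dict(avg_state)
merged.save_pretrained("merged-soup")
```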

I believe the difference comes from their fine-tuning methods. AYT is built by fine-tuning with the best SFT datasets (OpenOrca, Alpaca), whereas Xwin is fine-tuned with a work-in-progress, state-of-the-art RLHF method.

There is no guarantee that Xwin v0.2 used the same datasets that AYT did.

So Xwin would naturally perform worse than AYT on benchmarks where models fine-tuned on OpenOrca and Alpaca succeed, but Xwin excelled on AlpacaEval due to its RLHF training, which makes it more "human-preferred".
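
To make the SFT side of that contrast concrete, here is a minimal sketch of instruction fine-tuning on OpenOrca with TRL's SFTTrainer. The base model, prompt template, and data slice are assumptions for illustration, not the published AYT recipe:

```python
# Hedged SFT sketch using TRL; model choice and prompt format are assumptions.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Tiny slice of OpenOrca purely for illustration; a real run would use far more data.
dataset = load_dataset("Open-Orca/OpenOrca", split="train[:1000]")
dataset = dataset.map(
    lambda ex: {
        "text": f"{ex['system_prompt']}\n\n### Instruction:\n{ex['question']}\n\n### Response:\n{ex['response']}"
    }
)

trainer = SFTTrainer(
    model="meta-llama/Llama-2-13b-hf",  # assumed base model
    train_dataset=dataset,  # SFTTrainer reads the "text" column by default
    args=SFTConfig(output_dir="sft-llama2-13b-sketch"),
)
trainer.train()
```

The RLHF side that Xwin describes is a substantially heavier pipeline: it trains a reward model on human preference data and then optimizes the policy against it (e.g. with PPO), which is what pushes up "human-preferred" scores like AlpacaEval.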

deleted

Yes, that's the AYT I'm using.

I prefer Xwin because respecting the prompt and facts (e.g. not having siblings date in fan fiction) is more important to me than marginally better logic, math, coding, and writing.

It's just a little frustrating that having both in the same LLM doesn't appear to be possible. Making a high-performing LLM like Mistral 7b Dolphin 2.1 (67 on Hugging Face) more human-preferred reduces its objective performance, while making the best human-preferred LLM like Xwin perform better on objective tests by adding more explanation/instruction tuning appears to reduce its human-preferred scores on tests like AlpacaEval.
