Can't wait to test

#4
by froggeric - opened

I am very excited to test this model. I just finished testing my iMatrix Q4_K_S quant of your miqu-1-120b, and it is head and shoulders above the original miqu-1-70b. Here is the comparison (higher score = better):

[Screenshot: score comparison, miqu-1-120b vs miqu-1-70b]

Thanks for sharing your test results! That looks great. Would love to see how my other models rank in your tests.

I just finished testing it at Q4_K_M (imatrix). Here is the update with other miqu-based models, including yours:

[Image: updated scores including other miqu-based models]

What I have noticed, compared with your 120b version, is that the 103b version has a bit more difficulty following instructions (though it is still very good at it). However, it generally gives more detailed replies. I see two big advantages with the 103b version:

  • being smaller, it is possible to run a larger context
  • size for size, it is possible to use a quant one level higher than with the 120b, which should give even better results

I am just starting another round of tests with the Q5_K_S imatrix version :)

Finished testing the Q5_K_S (imatrix) version:

[Image: Q5_K_S test scores]

Slight improvements over Q4_K_M, but as it uses more memory, it reduces what is available for context. Still, with 96GB I can use a context larger than 16k.
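For anyone wondering how that fits into 96GB, here is the rough back-of-envelope math I use. It is only a sketch: the bits-per-weight figures and the layer counts below are approximations I am assuming, not exact numbers for these GGUF files.

```python
# Back-of-envelope memory math behind the context/quant trade-off (a sketch:
# bits-per-weight and model shapes are assumptions, not measurements).

def model_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantised weights in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context: int, bytes_per_elem: int = 2) -> float:
    """Approximate fp16 KV cache size in GiB for a given context length."""
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem / 2**30

# Assumed shapes: Llama-2-70B-style GQA (8 KV heads, head dim 128),
# ~120 layers for the 103b self-merge, ~140 layers for the 120b one.
ctx = 16384
print(f"103b Q5_K_S: ~{model_gib(103, 5.5):.0f} GiB weights "
      f"+ ~{kv_cache_gib(120, 8, 128, ctx):.1f} GiB KV cache at {ctx} ctx")
print(f"120b Q4_K_M: ~{model_gib(120, 4.9):.0f} GiB weights "
      f"+ ~{kv_cache_gib(140, 8, 128, ctx):.1f} GiB KV cache at {ctx} ctx")
```

With those assumptions, the 103b at Q5_K_S lands around 66 GiB of weights plus roughly 7.5 GiB of KV cache at 16k context, which still leaves headroom within 96GB.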

I have revised my scores for the 103b Q5_K_S version. I had the feeling I had been slightly biased, and indeed, after reviewing the answers it gave, I found I had overlooked some glaring logical problems in favour of the writing quality. Here are the corrected scores:

[Image: revised scores]

Even though the total scores are the same, my favourite is miqu-1-120b. miqu-1-103b clearly has more problems following instructions, and steering it in the right direction is hard work. miquliz-120b is not as good as miqu-1-120b for storytelling, and I would say it has a worrying tendency to get dumber as a large context fills up; however, in a short-to-medium-context smart assistant role, it actually scores better than miqu-1-120b.

I think the most potential for getting the best large model out of what is available now lies in self-merges of miqu, followed by a finetune like Westlake to restore some of the information lost. I don't think we have discovered the best self-merge pattern yet. I have some thoughts about it, which I have detailed in this discussion: https://huggingface.co/llmixer/BigWeave-v16-103b/discussions/2
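As a starting point for such experiments, here is a minimal sketch of how a self-merge pattern is usually expressed as a mergekit passthrough config. The base model ID and the interleaved layer ranges are only illustrative assumptions, not the actual recipe of miqu-1-103b or miqu-1-120b:

```python
# Sketch: emit a mergekit "passthrough" self-merge config. The overlapping
# layer ranges below are a hypothetical interleave over the 80 layers of a
# Llama-2-70B-class base model, purely for illustration.
import yaml  # pip install pyyaml

BASE = "152334H/miqu-1-70b-sf"          # assumed dequantised miqu base
RANGES = [(0, 40), (20, 60), (40, 80)]  # hypothetical interleave pattern

config = {
    "merge_method": "passthrough",
    "dtype": "float16",
    "slices": [
        {"sources": [{"model": BASE, "layer_range": list(r)}]} for r in RANGES
    ],
}

with open("self-merge.yml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)

# Then merge with: mergekit-yaml self-merge.yml ./out-model
```

The knobs to experiment with are basically the slice width and how much neighbouring slices overlap.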

Thanks a lot for the in-depth testing and well-written reviews! And also for sharing your thoughts on how self-merging could be further improved.

I'd love to see "Repeat layers to create FrankenModels" (dnhkng's PR #275 on turboderp/exllamav2) finally gain traction. I think there's enough evidence by now that self-merging actually improves performance, so doing it on the fly would let us iterate and get even better results much faster.
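To make clear what that would enable, here is a very rough transformers-level sketch of the same idea (not exllamav2's implementation): repeating layers of an already-loaded model in an interleaved order instead of baking a merged checkpoint first. Model ID and ranges are again just illustrative assumptions.

```python
# Rough sketch of "repeat layers on the fly" in transformers terms.
# NOTE: deepcopy duplicates the weights in RAM; the whole point of doing this
# inside the inference engine (as in the exllamav2 PR) is to reuse the same
# weights for every repetition. This only illustrates the layer-ordering idea.
import copy
import torch
from transformers import AutoModelForCausalLM

BASE = "152334H/miqu-1-70b-sf"          # assumed base model
RANGES = [(0, 40), (20, 60), (40, 80)]  # hypothetical interleave pattern

model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16)

new_layers = torch.nn.ModuleList()
for start, end in RANGES:
    for i in range(start, end):
        layer = copy.deepcopy(model.model.layers[i])
        if hasattr(layer.self_attn, "layer_idx"):
            layer.self_attn.layer_idx = len(new_layers)  # keep KV cache indexing consistent
        new_layers.append(layer)

model.model.layers = new_layers
model.config.num_hidden_layers = len(new_layers)
```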
