Excellent Model / Detailed overview and discussion.
Dear DavidAU,
I've been testing this model alongside the 128k one, and I can say with confidence that this one is much more bulletproof than the 128k one: visibly more coherent, stable, vivid, logical, and genuinely resilient. I've also noticed how much better it follows long instructions (especially long character cards) compared to the base one.
Overall it's nearly perfect and, so far, doesn't show the same issues involving the Stheno merge. I've also noticed increased repetition in some cases compared to the base one, so raising the repetition penalty to 1.105 or 1.121 helps; it also retains good coherence even with a higher rep penalty compared to the other versions.
Excellent; thank you ;
RE: Repeat;
Hmm, maybe higher temps? and/or a larger, slightly more detailed prompt?
I find this helps with this type of issue.
A larger rep pen range (even with a lower rep pen) may also do the trick.
Well, I experimented with a really detailed prompt; it starts repeating midway through completing the prompt, looping by adding the same sentence in a slightly altered form, over and over.
A higher rep pen range helps with repetition, but at the cost of shorter responses, slightly less vivid and descriptive scenes, and faster completion of the whole prompt. For example, rep pen range 64 fulfills all the actions from the detailed prompt in 32 outputs (assisted by User inputs), while rep pen range 256 might finish in 27 outputs, and so on. Smoothing Factor helps a bit, but may also decrease coherency.
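For reference, this is roughly how I pass those settings to a local KoboldCpp instance (field names follow KoboldCpp's /api/v1/generate schema as I understand it; the values are just the ones discussed above, not a recommendation):

```shell
# Sketch: sending the settings discussed above to a local KoboldCpp
# instance. Field names (rep_pen, rep_pen_range, smoothing_factor) are
# per KoboldCpp's /api/v1/generate schema; verify against your build.
curl -s http://localhost:5001/api/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Continue the scene...",
    "max_length": 512,
    "temperature": 0.7,
    "top_k": 80,
    "min_p": 0.05,
    "rep_pen": 1.105,
    "rep_pen_range": 64,
    "smoothing_factor": 0.2
  }'
```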
Furthermore, I'm still testing it, and I'm astonished by how well it performs and how stable it is.
@VizorZ0042
thank you so much for your feedback and testing.
I have located a possible fix for the "repeat" issues at the end; this is in testing.
Once the hard testing is complete, I will upload the new quants.
@DavidAU I do hope the new quants won't be less coherent and/or creative, as currently this model has outstanding coherency even with higher Repetition Penalty, Temperature, Top-K, and other sampling configurations aimed at better creativity, where other models lose coherency with even small changes to Top-K, a higher Temperature, etc.
Going to upload them as a new version alongside the current version in the repo.
Same thoughts as you here - a little fix sometimes does not address the "greater good".
The issue relates to EOS tokens; additionally, there is a patch with the new quants to address this.
This issue appears in Grand Horror 1 million, as well as Darkest Planet 1 million - but not in DeepHermes Reasoning 1 million.
Thank you again for your reviews/feedback and help.
@VizorZ0042
Version 1.01 quants uploading now for Dark Planet 1 million context.
Filename will have "1.01" in it.
@DavidAU Thanks once again for your great work; I was really glad to help. I'll be testing this one too, with Q4_K_S quants and a special set of actions and prompts across around 42 generations, and I'll show results later.
@DavidAU After comprehensive testing of both the 1.01 and regular versions, I noticed a really good improvement - less repetition compared to the old one. Both over-generate at Top_K 80, but 1.01 is noticeably less repetitive, with really good coherence. Both have Stheno-related issues, which appear much less frequently compared to the 128K one.
If necessary, I have generations archived in text format for both the old and new versions, and even a saved card with settings too.
@VizorZ0042
Excellent, that is fantastic.
Still a lot of tweaks / learning at this end to get the "conversion" process to the highest quality and maintain original model performance;
Thank you again for your help.
@DavidAU Have you tried merging with Mungert/DeepHermes-3-Llama-3-8B-Preview-GGUF instead of Stheno? The developer says it has precision-adaptive quantization to preserve accuracy while maintaining extreme memory efficiency. It may fix memory-related issues and improve coherency by a great margin compared to NousResearch's.
@VizorZ0042
hmm, interesting idea.
Maybe a partial merge
(FYI: DARK PLANET: Stheno and other models in the merge are PER layer - not a full merge of all the models).
The issue may be: Stheno is also responsible for some of the prose / some de-censoring of the model.
(because of what is merged where in the model - it contributes in first 1/4 of the model layers - mostly.)
The other option: a full merge of DeepHermes with Dark Planet to make a 12.2B model.
This would create a more powerful model to begin with, but the question is balance.
There are also multiple versions of Stheno. HMMM.
And also; the base model (in the merge) can also fix or add issues too.
IE DeepHermes as a BASE model (a base model is "consumed" during merging, but not completely - this can have anywhere from a minor to a drastic effect on the final model).
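To make the partial-merge idea concrete, a rough sketch in mergekit terms might look like this (purely hypothetical - the models, method, weights, and layer gradient are placeholders, NOT the actual Dark Planet recipe):

```shell
# Hypothetical sketch only - models, method, weights and the layer
# gradient are placeholders, NOT the actual Dark Planet recipe.
cat > partial_merge.yaml <<'EOF'
# DeepHermes as the BASE model ("consumed" during merging);
# Stheno's weight tapers off after roughly the first 1/4 of layers.
merge_method: dare_ties
base_model: NousResearch/DeepHermes-3-Llama-3-8B-Preview
models:
  - model: Sao10K/L3-8B-Stheno-v3.2
    parameters:
      weight: [1.0, 0.0, 0.0, 0.0]  # interpolated across layer depth
      density: 0.5
dtype: bfloat16
EOF
mergekit-yaml partial_merge.yaml ./merged-model
```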
There are literally endless options, but only so much time - AGHHH!
Okay... let me think on this.
You may want to try out this early test build:
@DavidAU Thanks for the great model, but unfortunately I have only 8 GB of RAM, so I can only use an 8B at Q5_K_M or lower.
And all Stheno-related models are bad; I've been testing them for months, and they always failed on memory and coherency, and are really unstable. The developer (Sao10K) even removed 3.3 and 3.4 because they failed and were even worse than 3.2.
The reason for choosing Mungert's DeepHermes is its great memory capabilities, better coherence, and better overall performance compared to NousResearch's DeepHermes.
[Example erased]
@DavidAU A great number of Llama-based models start typing the User's next replies; I don't know how to fix this behavior, and it appears with this model too.
Further testing reveals exceptional writing abilities with Top_K 80, even with a higher rep penalty (1.105). But there are small issues with a higher rep penalty, like confusing the User's skin type, body parts, etc. (Top_K affects this only a bit). Nearly identical behavior to SpinFire, but with noticeably better coherency, better overall performance, and much less frequent confusion. Rep pen 1.02 decreases such behavior to almost zero, but the model then tends to repeat. Rep pen range 128 biases the writing style negatively but helps with repetition. Decreasing Min_P to 0.045 improves the writing style. I haven't tested Top_K 80 with a higher Temperature (0.8+), but I think it might improve the writing even further.
Sometimes the model forgets to put asterisks between sentences or at the end / beginning. This happens regardless of settings.
First of all ; excellent feedback ; Wow.
I missed this initially = "Mungert's Hermes" ? Please clarify. Repo?
RE: Llama / User reply.
This is a nuance issue in some cases ; but more specifically the model needs concrete "DON'T DO THIS or the world will end" directive(s) in the system prompt.
Another option is "ACT AS", "ACT LIKE", "YOU ARE" ... IE "Dungeon master" ... and ... "you will only craft the adventure and NEVER answer for the user".
Bluntly, this is a very common issue and seems to "go away" with either a larger model size (parameters - even if using a much smaller quant, ie IQ3s) or a larger quant for smaller-parameter models.
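Something along these lines, as a starting point (just a sketch - tune the wording to your card):

```
You are the Dungeon Master. You will only craft the adventure and
NEVER answer, speak, or act for the user. DO NOT write the user's
replies. Stop generating the moment it is the user's turn to respond.
```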
RE: Creative ;
Might want to try the model in SillyTavern, Text Gen Web UI, llama-server.exe (with SillyTavern as front end) or KoboldCpp (standalone OR with SillyTavern as front end) and activate / use "Dynamic Temp" and/or DRY.
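If you run llama-server directly, both can also be set per request - a minimal sketch (field names as in recent llama.cpp builds; check your build's /completion docs, and the values here are illustrative only):

```shell
# Sketch: enabling Dynamic Temp and DRY via llama.cpp's /completion
# endpoint. Field names per recent llama.cpp builds; values illustrative.
curl -s http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Continue the scene...",
    "n_predict": 512,
    "temperature": 1.0,
    "dynatemp_range": 0.5,
    "dynatemp_exponent": 1.0,
    "dry_multiplier": 0.8,
    "dry_base": 1.75,
    "dry_allowed_length": 2
  }'
```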
You could also use my prototype plug-in for ST here:
https://huggingface.co/DavidAU/AI_Autocorrect__Auto-Creative-Enhancement__Auto-Low-Quant-Optimization__gguf-exl2-hqq-SOFTWARE
This SCRAMBLES temp/topk (at various levels) PER paragraph generated.
About Mungert's Hermes, I meant this one: https://huggingface.co/Mungert/DeepHermes-3-Llama-3-8B-Preview-GGUF
As for 'Auto-Creative Enhancement', I wrote a message there over 22 days ago (KoboldAI / KoboldAI Lite WebUI support) asking to add support for KoboldAI's User Mods (Scripts), because ST affects generation speed in a noticeable way.
Also noticed this one: https://huggingface.co/Mungert/Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct-GGUF
It aims to be better than Nvidia's UltraLong, but I haven't run enough tests; it just takes too much time due to my limited computational power.
Hey;
Okay ; these are related to :
As of llama.cpp release "B5127" (about April 14, 2025; llama.cpp/pull/12511), there are new quantization options available for all models.
This allows "custom quantization" - all layers, and/or by tensors and/or per layer.
Previously you could only "optimize" the output tensor and embed layer.
You can select IQ1_S to f32.
The new argument is:
--tensor-type
Prior to April 14, this was available, but only as a specific branch; it is now part of llama.cpp "official", so to speak.
This allows "mixing" - as per Mungert's models; but it also allows "creative use" (IE Imatrix + layer/tensor mixes) optimization too.
Likewise specific mixes to "fix" Reasoning model quants AND MOE quants too (shared expert tensors).
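A minimal example of the new option (the tensor-name patterns here are illustrative - check llama-quantize --help on your build for the exact syntax):

```shell
# Sketch: per-tensor overrides on top of a Q4_K_M base quant.
# Patterns are illustrative; verify syntax against your build.
./llama-quantize \
  --tensor-type attn_v=q8_0 \
  --tensor-type ffn_down=q6_k \
  model-f16.gguf model-custom.gguf Q4_K_M
```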
I have been rooting for a patch like this in llama.cpp for some time now.
But this patch also allows "super quants", like Q10, Q11, Q12 (IE: with f16/bf16/f32), and super hybrids like this:
F16 for the first 1/4 of the model (instruction following), mid layers at IQ4_XS / IQ4_NL / IQ3_S, the last 1/4 at another quant(s), and finally the last layer at F16/BF16/F32.
Then there is the output tensor and embed too.
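As a sketch of such a hybrid (assuming a 32-block 8B model, so blocks 0-7 are the first 1/4; the per-block patterns are an assumption on my part - verify against your llama-quantize build):

```shell
# Hypothetical "super hybrid" sketch for a 32-block 8B model:
# first 1/4 of blocks plus the last block at F16, with the output
# tensor and token embeddings overridden as well. The per-block
# regex-style patterns are an assumption - verify with --help.
./llama-quantize \
  --tensor-type "blk\.[0-7]\.=f16" \
  --tensor-type "blk\.31\.=f16" \
  --output-tensor-type f16 \
  --token-embedding-type q8_0 \
  model-f16.gguf model-hybrid.gguf IQ4_XS
```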
With Merge models (regular (Dark Planet) and franken -> Brainstorm / Grand Horror / Darkest Universe) this takes on a whole new dimension.
Then add in Imatrixed "parts" of the model too. (entire model, select layer(s)/tensors) ...
Going to make a "mess" this week, and see what's what here... ;
RE: WebUI support ; I got that one - haven't looked into it further yet - it is flagged.
Later versions of ST show a further drop in T/S; a lot of JavaScript processing going on.
Thanks;
Thanks for the explanation. I'll be waiting for your newest models (8B or lower) then, as these new quants are showing even better results compared to the 'un-improved' ones.
@DavidAU Forgot to mention: most of your models perform even better with Flash Attention on - tested on DarkPlanet V1-3 (this one too), SpinFire, BigTalker, DirtyHarry, and some other Llama 3.x models I don't remember. Scenes / prose become more vivid and creative. Characters become more lively, with better dialogue / emotions. Better coherency too.
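(For reference, it's a load-time toggle - roughly like this; flag names per recent llama.cpp / KoboldCpp builds, and the model filename is just a placeholder:)

```shell
# Flash Attention is enabled at load time; flag names per recent
# builds (verify with --help). Model filename is a placeholder.
./llama-server -m DarkPlanet-8B-Q4_K_S.gguf --flash-attn
# or with KoboldCpp:
python koboldcpp.py --model DarkPlanet-8B-Q4_K_S.gguf --flashattention
```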
Thank you ; I generally steer away from Flash Attn as I found the opposite effect - however I have not used it in ages.
I will give it another go ; and update repo(s) too based on your recommendation + testing.
@DavidAU Try it on SpinFire and DarkPlanet-Hermes, but with Min_P 0.045. They both work very well with Flash Attention in F16 mode.
Excellent.
Noted ; great testing ... ;