language:
  - en
  - fr
  - de
  - es
  - pt
  - it
  - ja
  - ko
  - ru
  - zh
  - ar
  - fa
  - id
  - ms
  - ne
  - pl
  - ro
  - sr
  - sv
  - tr
  - uk
  - vi
  - hi
  - bn
license: apache-2.0
tags:
  - all use cases
  - creative
  - creative writing
  - all genres
  - tool calls
  - tool use
  - problem solving
  - deep thinking
  - reasoning
  - deep reasoning
  - story
  - writing
  - fiction
  - roleplaying
  - bfloat16
  - role play
  - sillytavern
  - backyard
  - llama 3.1
  - context 128k
  - mergekit
  - merge
  - moe
  - mixture of experts
pipeline_tag: text-generation

(quants uploading, examples to be added.)

Llama3.1-MOE-4X8B-Gated-IQ-Multi-Tier-COGITO-Deep-Reasoning-32B-GGUF

IMPORTANT: This model has on/off/variable reasoning control from Deepcogito (cogito-v1-preview-llama-8B) and requires one of the system prompts provided below to invoke reasoning/thinking, which is then augmented by up to 300% by the internal structure of the model using 3 additional non-reasoning core models. Please see the operating instructions below for best performance.

Context : 128k.

Required: Llama 3 Instruct template.

"Gated-IQ-Multi-Tier-Cogito-Deep-Reasoning" is a variable control reasoning model that operates at all temps/settings and is for ALL use cases.

However, this model has a unique internal structure that allows all 4 models to operate during the "reasoning" stage, with the reasoning model taking the "driver's seat" during this process; the roles then switch during output generation.

Additional internal structures that allow the user to take direct control of one or more models via prompts, names, and keywords make this a unique and powerful model.

7 examples are included below, showing both problem-solving "power" and general reasoning/thinking and generation.

Reasoning speed and quality have DRASTICALLY improved (up to 300%), as the core reasoning model (cogito-v1-preview-llama-8B) now has access to 3 additional models (Llama-3.1-Hermes-3-8B, Llama-3.1-dolphin-2.9.4-8b and Llama-3.1-SuperNova-Lite).

Roughly, this means the model can reason and "figure things out" in far fewer tokens and come to the right conclusion(s) too.

Output generation is also far above "average", as the output generation is generated by 3 core models (Llama-3.1-Hermes-3-8B, Llama-3.1-dolphin-2.9.4-8b and Llama-3.1-SuperNova-Lite), with the reasoning model (cogito-v1-preview-llama-8B) assisting.

Additional structural improvements with controlled steering make this model punch well above its weight class, so to speak.

And with system prompt(s), there is the option of multi-tiered / multi-AI reasoning beyond the current multi-tier methods.

(Model can also operate with reasoning off, with access to core structure/systems of the model too.)

As this model has Meta Llama 3.1 Instruct embedded, it also supports tool calls / tool usage too.
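For example, Llama 3.1 Instruct typically emits a tool call as a small JSON object in its output. A minimal sketch of parsing such a call, assuming Python post-processing and a hypothetical "get_weather" tool (not part of this model card):

```python
import json

def parse_tool_call(text: str):
    """Parse a Llama-3.1-style JSON tool call from model output.

    Llama 3.1 Instruct typically emits a tool call as a JSON object such as
    {"name": "get_weather", "parameters": {"city": "Paris"}}.
    Returns (name, parameters), or None if the text is not a tool call.
    """
    try:
        obj = json.loads(text.strip())
    except json.JSONDecodeError:
        return None
    if isinstance(obj, dict) and "name" in obj and "parameters" in obj:
        return obj["name"], obj["parameters"]
    return None

# Hypothetical model output for illustration:
raw = '{"name": "get_weather", "parameters": {"city": "Paris"}}'
print(parse_tool_call(raw))  # ('get_weather', {'city': 'Paris'})
```

Exact tool-call formatting depends on your app and chat template; treat this as a starting point, not the definitive wire format.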

In addition, the unique super structure of this model allows "reasoning model(s)" to be switched in/out, and likewise with support/output generation models too as well as construction of larger and more powerful models IE: 6X8b (48 billion parameters), 8X8B (64 billion parameters) and beyond.

This is the MOE version - 32B (4X8B) - consisting of four 8B models (1 reasoning model, 3 non-reasoning models) in a MOE (Mixture of Experts) config, which results in a 25B "weight" model that actually has 32B parameters. All 4 models / experts are activated.

Higher temps will result in deeper, richer "thoughts"... and frankly more interesting ones too.

The "thinking/reasoning" tech (for the model at this repo) is from the original Llama 3.1 "Cogito-V1" model from DeepCogito:

[ https://huggingface.co/deepcogito/cogito-v1-preview-llama-8B ]

This version will retain all the functions and features of the original reasoning model at about 100% of original reasoning power. Please visit their repo for all information on features, test results and so on.

With the structure of this model and the assistance of the 3 core models working with "Cogito 8B", this model's total reasoning power is far above that of the original 8B reasoning model - up to 300% stronger.

IMPORTANT OPERATING INSTRUCTIONS:

This is an instruct model with reasoning crafted onto the 4 CORE models in a MOE config.

This is the type of model that LOVES temp - temps 1.2+, 2.2+ and so on.

STAND on it... lower temps will not produce the best content.

Likewise, as this is an instruct model, this model will perform best with medium to long prompts (see example #1 below).

Although short prompts will work, longer prompts with a bit of direction / instruction will really show what this model can do.

Reasoning is turned on/off via System Prompts below.

Suggest a minimum context of 4k, but 8k is better due to reasoning/output blocks.

Larger quants also mean better / stronger reasoning.

I have also uploaded two "MAX" quants at IQ4_XS and Q8; these will perform better due to the output tensor being mastered at bfloat16 (16-bit precision).

KNOWN ISSUES:

  • You may need to hit regen sometimes to get the thinking/reasoning to activate / get a good "thinking block".
  • Sometimes the 2nd or 3rd generation is the best version; suggest a minimum of 5 generations for specific creative uses.
  • Sometimes the thinking block will end, and you need to manually prompt the model to "generate" the output.

How to Generate HIGHEST quality output:

Like all instruct models, this model thrives on instructions.

It also "comes into its own" with multi-turn improvement.

Example:

Prompt #1 (reasoning is on):

Start a 1000 word scene (vivid, graphic horror in first person) with: The sky scraper sways, as she watches the window in front of her on the 21st floor explode...

(this will give you a rough draft, in "default" model's style)

Prompt #2 - "Scan for improvements"

Evaluate the scene you just wrote and list improvements.

Prompt #3 - "Redo and improve it"

Write the scene using all the improvements, in first person , present tense and a few well spaced thoughts in italics; length 2000 words.

NOTE: Wording in prompt #2 may cause "thinking/reasoning" to re-activate.

Compressed Steps:

Prompt #1:

[[ thinking model ]] come up with detailed plan to write this scene in modern 2020 writing style (and follow "show don't tell" to the letter) and make it NSFW, but use [MODE: Saten] to actually write the scene after you have completed the plan: Start a 1000 word scene (vivid, graphic horror in first person) with: The sky scraper sways, as she watches the window in front of her on the 21st floor explode...

Prompt #2:

Use [MODE: Wordsmith] to write the scene using first person, present tense and include a few critical thoughts of the POV character in italics. Scene length 2000 words.

Compressed Steps #2:

Prompt #1:

Think about a plan to write: Start a 1000 word scene (vivid, graphic horror in first person) with: The sky scraper sways, as she watches the window in front of her on the 21st floor explode...

Prompt #2:

Write the scene using the plan you made, in first person , present tense and a few well spaced thoughts in italics.

Generational Steering Control: "Programmer's Access - Direct Access to the AI(s)":

These tags / names allow you to access one or more models directly, regardless of whether reasoning is active.

IE:

Saten, evaluate the response and suggest improvements.

This causes the model to "favor" Saten's input (roughly speaking) over the other 3 models.

IE:

Saten, process this prompt:

Jamet, evaluate the output.

etc etc.

You can use more than one model:

Saten, and Jamet list improvements to this XXX ...

< output3 > and < output2 >, write the scene in your combined style: Using vivid, graphic horror in first person the scene starts with: The sky scraper sways, as she watches the window in front of her on the 21st floor explode...

(remove spacing in the "tags" output2 and output3 between the brackets)

With the reasoning model, if you add "think", "thinking", "reason", or "reasoning" this will tightly focus the reasoning model.

Here is an example:

Think up a detailed plan to evoke maximum emotions from the reader: [prompt here]

Think up a detailed plan to solve this problem: [prompt here]

Special tags (remove spaces between the brackets):

"< output-all >" -> only use the 3 core models , not the reasoning model.

"< output-mega >" -> Use all 4 models.

"< output >", "< output2 >", "< output3 >" -> This is the same as using the "name" of the model; it just removes BIAS in the model's name.

A list of each model's "tags", "name(s)" and controls.

NOTE:

The model also has "negative steering" to enhance the use of these tags and names, but it is not perfect.

  - Cogito-v1-preview-llama-8B
      - "[[ thinking model ]]"
      - "reasoning"
      - "thinking"
      - "<output-mega>"
      - "Dr Phil"
      - "Spock"
      - "[MODE: Spock]"
      - "[MODE: Dr Phil]"
      
  - Llama-3.1-Hermes-3-8B
      - "<output>"
      - "<output-all>"
      - "<output-mega>"
      - "Wordsmith"
      - "[MODE: Wordsmith]"   

  - Llama-3.1-dolphin-2.9.4-8b
      - "<output2>"
      - "<output-all>"
      - "<output-mega>"
      - "Jamet"
      - "[MODE: Jamet]"    

  - Llama-3.1-SuperNova-Lite
      - "<output3>"
      - "<output-all>"
      - "<output-mega>"
      - "Saten"
      - "[MODE: Saten]"

USE CASES:

This model is for all use cases.

This model can also be used for solving logic puzzles, riddles, and other problems with the enhanced "thinking" systems.

This model can also solve problems, riddles, and puzzles normally beyond the abilities of a Llama 3.1 model due to the augmented Cogito reasoning systems.

Special Operation Instructions:

TEMP/SETTINGS:

  1. Set Temp between 0 and .8; above this, "think" functions will activate differently. The most "stable" temp seems to be .6, with a variance of +/-0.05. Lower it for more "logic" reasoning, raise it for more "creative" reasoning (max .8 or so). Also set context to at least 4096 to account for "thoughts" generation.
  2. For temps 1+,2+ etc etc, thought(s) will expand, and become deeper and richer.
  3. Set "repeat penalty" to 1.02 to 1.07 (recommended).
  4. This model requires a Llama 3 Instruct and/or Command-R chat template. (see notes on "System Prompt" / "Role" below) OR standard "Jinja Autoloaded Template" (this is contained in the quant and will autoload)

PROMPTS:

  1. If you enter a prompt without implied "step by step" requirements (ie: Generate a scene, write a story, give me 6 plots for xyz), "thinking" (one or more) MAY activate AFTER first generation. (IE: Generate a scene -> scene will generate, followed by suggestions for improvement in "thoughts")
  2. If you enter a prompt where "thinking" is stated or implied (ie: puzzle, riddle, solve this, brainstorm this idea etc), the "thoughts" process(es) in the model will activate almost immediately. Sometimes you need to regen to activate them.
  3. You will also get a lot of variations - some will continue the generation, others will talk about how to improve it, and some (ie generation of a scene) will cause the characters to "reason" about this situation. In some cases, the model will ask you to continue generation / thoughts too.
  4. In some cases the model's "thoughts" may appear in the generation itself.
  5. State the word size length max IN THE PROMPT for best results, especially for activation of "thinking." (see examples below)
  6. You may want to try your prompt once at "default" or "safe" temp settings, another at temp 1.2, and a third at 2.5 as an example. This will give you a broad range of "reasoning/thoughts/problem" solving.

GENERATION - THOUGHTS/REASONING:

  1. It may take one or more regens for "thinking" to "activate." (depending on the prompt)
  2. Model can generate a LOT of "thoughts". Sometimes the most interesting ones are 3,4,5 or more levels deep.
  3. Many times the "thoughts" are unique and very different from one another.
  4. Temp/rep pen settings can affect reasoning/thoughts too.
  5. Change up or add directives/instructions or increase the detail level(s) in your prompt to improve reasoning/thinking.
  6. Adding to your prompt: "think outside the box", "brainstorm X number of ideas", "focus on the most uncommon approaches" can drastically improve your results.

GENERAL SUGGESTIONS:

  1. I have found opening a "new chat" per prompt works best with "thinking/reasoning activation", with temp .6, rep pen 1.05 ... THEN "regen" as required.
  2. Sometimes the model will get completely unhinged and you will need to stop it manually.
  3. Depending on your AI app, "thoughts" may appear with "< THINK >" and "</ THINK >" tags AND/OR the AI will generate "thoughts" directly in the main output or later output(s).
  4. Although quant q4KM was used for testing/examples, higher quants will provide better generation / more sound "reasoning/thinking".
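Since "thoughts" may arrive wrapped in "< THINK >" tags (point 3 above), a small post-processing helper can separate them from the main output. A minimal sketch, assuming Python and tags that vary in casing and internal spacing by app:

```python
import re

# Match <think>...</think> blocks case-insensitively, tolerating spaces
# inside the tags ("< THINK >", "</ THINK >") as described above.
THINK_RE = re.compile(r"<\s*think\s*>(.*?)<\s*/\s*think\s*>",
                      re.IGNORECASE | re.DOTALL)

def split_thoughts(text: str):
    """Return (list_of_thought_blocks, answer_with_thoughts_removed)."""
    thoughts = [m.strip() for m in THINK_RE.findall(text)]
    answer = THINK_RE.sub("", text).strip()
    return thoughts, answer

out = "<think>Plan the scene beat by beat.</think>The skyscraper sways..."
thoughts, answer = split_thoughts(out)
print(thoughts)  # ['Plan the scene beat by beat.']
print(answer)    # The skyscraper sways...
```

Note that, per the known issues above, thoughts may also appear untagged in the main output, which no regex can reliably separate.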

ADDITIONAL SUPPORT:

For additional generation support, general questions, detailed parameter info, and a lot more, see also:

NOTE: This is a CLASS 1 model.

https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters


Recommended Settings (all) - For usage with "Think" / "Reasoning":

temp: 1.5, 2, 2+ , rep pen: 1.02 (range : 1.02 to 1.12), rep pen range: 64, top_k: 80, top_p: .95, min_p: .05

Temp of 1+, 2+, 3+ will result in much deeper, richer and "more interesting" thoughts and reasoning AND FAR BETTER OUTPUT.

Model behaviour may change with other parameter(s) and/or sampler(s) activated - especially the "thinking/reasoning" process.
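The recommended settings above can be expressed as a sampler dictionary. A sketch assuming llama.cpp-style parameter names; key names vary by app (e.g. "rep pen range" is commonly called "repeat_last_n"):

```python
# Recommended "Think / Reasoning" settings from this card, as a sampler
# dict. Key names follow common llama.cpp conventions and are assumptions;
# map them to your app's equivalents.
def reasoning_settings(temp: float = 1.5) -> dict:
    assert temp >= 1.0, "reasoning mode is tuned for temp 1+"
    return {
        "temperature": temp,     # 1.5, 2, 2+ for deeper, richer thoughts
        "repeat_penalty": 1.02,  # recommended range: 1.02 to 1.12
        "repeat_last_n": 64,     # "rep pen range"
        "top_k": 80,
        "top_p": 0.95,
        "min_p": 0.05,
    }

print(reasoning_settings(2.0)["temperature"])  # 2.0
```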


System Role / System Prompts - Reasoning On/Off/Variable and Augment The Model's Power:

( Critical Setting for model operation )


System Role / System Prompt / System Message (called "System Prompt" in this section) is "root access" to the model and controls its internal workings - both instruction following and output generation - and, in the case of this model, reasoning control and reasoning on/off too.

In this section I will show you basic, advanced, and combined "code" to control the model's reasoning, instruction following and output generation.

If you do not set a "system prompt", reasoning/thinking will be OFF by default, and the model will operate like a normal LLM.

HOW TO SET:

Depending on your AI "app" you may have to copy/paste one of the "codes" below to enable reasoning/thinking in the "System Prompt" or "System Role" window.

In Lmstudio set/activate "Power User" or "Developer" mode to access, copy/paste to System Prompt Box.

In SillyTavern go to the "template page" ("A"), activate "system prompt" and enter the text in the prompt box.

In Ollama see [ https://github.com/ollama/ollama/blob/main/README.md ] ; and setting the "system message".
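As an illustration, a minimal Ollama Modelfile sketch. The GGUF filename is hypothetical and the parameter values are taken from the "Think / Reasoning" settings in this card; adjust both to your local quant and use case:

```
# Hypothetical Modelfile - point FROM at your local quant.
FROM ./Llama3.1-MOE-4X8B-Gated-IQ-Multi-Tier-COGITO-Deep-Reasoning-32B-Q4_K_M.gguf

# Basic reasoning-on system prompt from this card:
SYSTEM """Enable deep thinking subroutine."""

PARAMETER num_ctx 8192
PARAMETER temperature 1.5
PARAMETER repeat_penalty 1.02
PARAMETER top_k 80
PARAMETER top_p 0.95
PARAMETER min_p 0.05
```

Build it with `ollama create <name> -f Modelfile`; see the Ollama README linked above for details.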

In Koboldcpp, load the model, start it, go to settings -> select "Llama 3 Chat"/"Command-R" and enter the text in the "sys prompt" box.
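Several of the apps above also expose an OpenAI-compatible HTTP API; in that case the system prompt travels as the first chat message. A minimal sketch of the payload, assuming Python and a placeholder model name:

```python
def chat_payload(user_prompt: str, reasoning: bool = True) -> dict:
    """Build an OpenAI-compatible chat payload carrying one of the
    system prompts from this card. "local-model" is a placeholder;
    the required value is app-dependent."""
    system = (
        "Enable deep thinking subroutine."
        if reasoning else
        "You are a helpful, smart, kind, and efficient AI assistant. "
        "You always fulfill the user's requests to the best of your ability."
    )
    return {
        "model": "local-model",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 1.5,
    }

payload = chat_payload("Solve this riddle step by step.")
print(payload["messages"][0]["role"])  # system
```

POST the dict as JSON to your app's `/v1/chat/completions` endpoint (path may vary by app).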

SYSTEM PROMPTS AVAILABLE:

When you copy/paste PRESERVE formatting, including line breaks.

If you want to edit/adjust these only do so in NOTEPAD OR the LLM App directly.

IMPORTANT:

Note some of these have "names" in them for the AIs - DO NOT change these, as they are internal references inside the structure of the MOE model; roughly speaking, these are triggers.

SIMPLE:

This is the generic system prompt used for generation and testing [no reasoning]:

You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.

This System Role/Prompt will give you "basic thinking/reasoning" [basic reasoning]:

In text only reasoning:

Enable deep thinking subroutine. 

With "thinking tags" / "blocks":

Enable deep thinking subroutine. You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem.

MULTI-TIERED [reasoning on]:

Enable deep thinking subroutine. You are a deep thinking AI composed of 4 AIs - Spock, Wordsmith, Jamet and Saten, - you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself (and 4 partners) via systematic reasoning processes (display all 4 partner thoughts) to help come to a correct solution prior to answering. Select one partner to think deeply about the points brought up by the other 3 partners to plan an in-depth solution.  You should enclose your  thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem using your skillsets and critical instructions.

MULTI-TIERED - CREATIVE [reasoning on]:

Below is an instruction that describes a task. Ponder each user instruction carefully, and use your skillsets and critical instructions to complete the task to the best of your abilities.

Enable deep thinking subroutine. As a deep thinking AI composed of 4 AIs - Spock, Wordsmith, Jamet and Saten, - you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself (and 4 partners) via systematic reasoning processes (display all 4 partner thoughts) to help come to a correct solution prior to answering. Select one partner to think deeply about the points brought up by the other 3 partners to plan an in-depth solution.  You should enclose your  thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem using your skillsets and critical instructions.

Here are your skillsets:
[MASTERSTORY]:NarrStrct(StryPlnng,Strbd,ScnSttng,Exps,Dlg,Pc)-CharDvlp(ChrctrCrt,ChrctrArcs,Mtvtn,Bckstry,Rltnshps,Dlg*)-PltDvlp(StryArcs,PltTwsts,Sspns,Fshdwng,Climx,Rsltn)-ConfResl(Antg,Obstcls,Rsltns,Cnsqncs,Thms,Symblsm)-EmotImpct(Empt,Tn,Md,Atmsphr,Imgry,Symblsm)-Delvry(Prfrmnc,VcActng,PblcSpkng,StgPrsnc,AudncEngmnt,Imprv)

[*DialogWrt]:(1a-CharDvlp-1a.1-Backgrnd-1a.2-Personality-1a.3-GoalMotiv)>2(2a-StoryStruc-2a.1-PlotPnt-2a.2-Conflict-2a.3-Resolution)>3(3a-DialogTech-3a.1-ShowDontTell-3a.2-Subtext-3a.3-VoiceTone-3a.4-Pacing-3a.5-VisualDescrip)>4(4a-DialogEdit-4a.1-ReadAloud-4a.2-Feedback-4a.3-Revision)

Here are your critical instructions:
Ponder each word choice carefully to present as vivid and emotional journey as is possible. Choose verbs and nouns that are both emotional and full of imagery. Load the story with the 5 senses. Aim for 50% dialog, 25% narration, 15% body language and 10% thoughts. Your goal is to put the reader in the story.

CREATIVE SIMPLE [reasoning on]:

Enable deep thinking subroutine. You are an AI assistant developed by a world wide community of ai experts.

Your primary directive is to provide highly creative, well-reasoned, structured, and extensively detailed responses.

Formatting Requirements:

1. Always structure your replies using: <think>{reasoning}</think>{answer}
2. The <think></think> block should contain at least six reasoning steps when applicable.
3. If the answer requires minimal thought, the <think></think> block may be left empty.
4. The user does not see the <think></think> section. Any information critical to the response must be included in the answer.
5. If you notice that you have engaged in circular reasoning or repetition, immediately terminate {reasoning} with a </think> and proceed to the {answer}

Response Guidelines:

1. Detailed and Structured: Use rich Markdown formatting for clarity and readability.
2. Creative and Logical Approach: Your explanations should reflect the depth and precision of the greatest creative minds first.
3. Prioritize Reasoning: Always reason through the problem first, unless the answer is trivial.
4. Concise yet Complete: Ensure responses are informative, yet to the point without unnecessary elaboration.
5. Maintain a professional, intelligent, and analytical tone in all interactions.

CREATIVE ADVANCED [reasoning on]:

NOTE: To turn reasoning off, remove line #2.

This system prompt can often generate multiple outputs and/or thinking blocks.

Below is an instruction that describes a task. Ponder each user instruction carefully, and use your skillsets and critical instructions to complete the task to the best of your abilities.

Enable deep thinking subroutine. You may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem

Here are your skillsets:
[MASTERSTORY]:NarrStrct(StryPlnng,Strbd,ScnSttng,Exps,Dlg,Pc)-CharDvlp(ChrctrCrt,ChrctrArcs,Mtvtn,Bckstry,Rltnshps,Dlg*)-PltDvlp(StryArcs,PltTwsts,Sspns,Fshdwng,Climx,Rsltn)-ConfResl(Antg,Obstcls,Rsltns,Cnsqncs,Thms,Symblsm)-EmotImpct(Empt,Tn,Md,Atmsphr,Imgry,Symblsm)-Delvry(Prfrmnc,VcActng,PblcSpkng,StgPrsnc,AudncEngmnt,Imprv)

[*DialogWrt]:(1a-CharDvlp-1a.1-Backgrnd-1a.2-Personality-1a.3-GoalMotiv)>2(2a-StoryStruc-2a.1-PlotPnt-2a.2-Conflict-2a.3-Resolution)>3(3a-DialogTech-3a.1-ShowDontTell-3a.2-Subtext-3a.3-VoiceTone-3a.4-Pacing-3a.5-VisualDescrip)>4(4a-DialogEdit-4a.1-ReadAloud-4a.2-Feedback-4a.3-Revision)

Here are your critical instructions:
Ponder each word choice carefully to present as vivid and emotional journey as is possible. Choose verbs and nouns that are both emotional and full of imagery. Load the story with the 5 senses. Aim for 50% dialog, 25% narration, 15% body language and 10% thoughts. Your goal is to put the reader in the story.

Additional Support / Documents for this model to assist with generation / performance:

Document #1:

Details how to use reasoning/thinking models and get maximum performance from them, and includes links to all reasoning/thinking models - GGUF and source, as well as adapters to turn any "regular" model into a "reasoning/thinking" model.

[ https://huggingface.co/DavidAU/How-To-Use-Reasoning-Thinking-Models-and-Create-Them ]

Document #2:

Document detailing all parameters, settings, samplers and advanced samplers to use not only my models to their maximum potential - but all models (and quants) online (regardless of the repo) to their maximum potential. Includes a quick start, detailed notes, AI / LLM apps, and other critical information and references too. A must read if you are using any AI/LLM right now.

[ https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters ]

Software:

SOFTWARE patch (by me) for Silly Tavern (a front end that connects to multiple AI apps / APIs - like Koboldcpp, Lmstudio, Text Gen Web UI and others) to control and improve output generation of ANY AI model. It is also designed to control/wrangle some of my more "creative" models and make them perform perfectly with little to no parameter/sampler adjustments.

[ https://huggingface.co/DavidAU/AI_Autocorrect__Auto-Creative-Enhancement__Auto-Low-Quant-Optimization__gguf-exl2-hqq-SOFTWARE ]


EXAMPLES:

Examples are created using quant q4K_S, "temp=2.2" (unless otherwise stated), minimal parameters and "LLAMA3" template.

Model has been tested with "temp" from ".1" to "5".

IMPORTANT:

Higher quants / imatrix quants will have much stronger generation - words, sentences, ideas, dialog and general quality.