A frankenMoE of [TinyLlama-1.1B-1T-OpenOrca](https://huggingface.co/jeff31415/TinyLlama-1.1B-1T-OpenOrca),
[TinyLlama-1.1B-intermediate-step-1195k-token-2.5T](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T),
and [tiny-llama-1.1b-chat-medical](https://huggingface.co/SumayyaAli/tiny-llama-1.1b-chat-medical).
### Most 1.1B models are incoherent and can't even answer simple questions. I picked out some models that aren't as bad, then scripted them to be split into different fields based on what their training data made them excel at. [The mergekit-moe script can be found here](https://drive.google.com/file/d/1JMdqQ1eTcOUAT6uuWVGgpEKenF2m3gwi/view?usp=drive_link). It is based on [Undi95's config.yaml for his MoE RP here](https://huggingface.co/Undi95/Mixtral-8x7B-MoE-RP-Story/blob/main/config.yaml).
OpenOrca experts have been given the task of creating responses to simple questions about things like pop culture, history, and science. The step-1195k experts have been chosen to provide warmth and a positive environment, while the chat-medical experts have been chosen to provide further detail about human subjects and to give little bits of medical advice, e.g. "how do I get rid of this headache I gave myself from making you?"
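To make that split concrete, here is a minimal sketch of what a mergekit-moe config in the style of Undi95's config.yaml might look like. The field names follow the mergekit-moe YAML format, but the base model choice and the positive prompts shown are illustrative assumptions, not the actual values from the script linked above.

```python
# Illustrative sketch only -- the actual config is in the script linked above.
# Field names follow the mergekit-moe YAML format; the base model and the
# positive prompts below are assumptions made up for this example.
import yaml  # PyYAML

config = {
    "base_model": "jeff31415/TinyLlama-1.1B-1T-OpenOrca",  # assumed base model
    "gate_mode": "hidden",   # mergekit-moe derives router weights from the prompts
    "dtype": "bfloat16",
    "experts": [
        {   # pop culture / history / science answers
            "source_model": "jeff31415/TinyLlama-1.1B-1T-OpenOrca",
            "positive_prompts": ["pop culture", "history", "science"],
        },
        {   # warmth and a positive environment
            "source_model": "TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T",
            "positive_prompts": ["friendly", "supportive", "encouraging"],
        },
        {   # human subjects and light medical advice
            "source_model": "SumayyaAli/tiny-llama-1.1b-chat-medical",
            "positive_prompts": ["medical advice", "health", "symptoms"],
        },
    ],
}

# Serialize to the config.yaml that the mergekit-moe command consumes.
with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```

With mergekit installed, something like `mergekit-moe config.yaml ./tinyllama-frankenmoe` (output path chosen here just for illustration) would then assemble the experts into a single MoE checkpoint.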
# ["What is a Mixture of Experts (MoE)?"](https://huggingface.co/blog/moe)
### (from the MistralAI papers...click the quoted question above to navigate to it directly.)
The scale of a model is one of the most important axes for better model quality. Given a fixed computing budget, training a larger model for fewer steps is better than training a smaller model for more steps.

So, what exactly is a MoE? In the context of transformer models, a MoE consists of two main elements: sparse MoE layers, which are used in place of the dense feed-forward network (FFN) layers, and a gate network or router.

The gate network or router determines which tokens are sent to which expert. For example, in the image below, the token “More” is sent to the second expert, and the token “Parameters” is sent to the first network. As we’ll explore later, we can send a token to more than one expert. How to route a token to an expert is one of the big decisions when working with MoEs - the router is composed of learned parameters and is pretrained at the same time as the rest of the network.

At every layer, for every token, a router network chooses two of these groups (the “experts”) to process the token and combine their output additively.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6589d7e6586088fd2784a12c/up_I0R2TQGjqTShZp_1Sz.png)
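To make the routing concrete, here is a minimal, self-contained PyTorch sketch of top-2 gating (an illustration of the idea, not Mixtral's or this model's actual code): a linear gate scores every expert for each token, the two best-scoring experts process the token, and their outputs are added together, weighted by a softmax over the two selected scores.

```python
# Minimal top-2 MoE routing sketch (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    def __init__(self, hidden_size: int, ffn_size: int, num_experts: int = 8):
        super().__init__()
        # Gate network / router: learned scores, trained with the rest of the model.
        self.router = nn.Linear(hidden_size, num_experts)
        # Each expert is an ordinary FFN block.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, ffn_size), nn.SiLU(), nn.Linear(ffn_size, hidden_size))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_size)
        scores = self.router(x)                            # (num_tokens, num_experts)
        top_scores, top_experts = torch.topk(scores, k=2, dim=-1)
        weights = F.softmax(top_scores, dim=-1)            # normalize the two chosen scores
        out = torch.zeros_like(x)
        for slot in range(2):                              # combine both experts' outputs additively
            for idx, expert in enumerate(self.experts):
                mask = top_experts[:, slot] == idx
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Example: 4 tokens of width 64 routed through 8 small experts.
layer = Top2MoELayer(hidden_size=64, ffn_size=256)
print(layer(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```

Real implementations batch all tokens assigned to an expert into one matrix multiplication instead of looping over experts, but the result is the same weighted sum.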
So, to recap, in MoEs we replace every FFN layer of the transformer model with an MoE layer, which is composed of a gate network and a certain number of experts.

Although MoEs provide benefits like efficient pretraining and faster inference compared to dense models, they also come with challenges:

Training: MoEs enable significantly more compute-efficient pretraining, but they’ve historically struggled to generalize during fine-tuning, leading to overfitting.

Inference: Although a MoE might have many parameters, only some of them are used during inference. This leads to much faster inference compared to a dense model with the same number of parameters. However, all parameters need to be loaded in RAM, so memory requirements are high. For example, [given a MoE like Mixtral 8x7B](https://huggingface.co/blog/moe), we’ll need enough VRAM to hold a dense 47B parameter model. Why 47B parameters and not 8 x 7B = 56B? That’s because in MoE models, only the FFN layers are treated as individual experts, and the rest of the model parameters are shared. At the same time, assuming just two experts are being used per token, the inference speed (FLOPs) is like using a 12B model (as opposed to a 14B model), because it computes 2x7B matrix multiplications, but with some layers shared (more on this soon).
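A quick back-of-the-envelope check of those numbers, using only the figures quoted above (so the split between shared and expert parameters is approximate, not the exact Mixtral breakdown):

```python
# Rough check of "47B total parameters but only ~12-13B used per token" for an 8x7B MoE.
num_experts = 8        # expert FFNs per MoE layer
top_k = 2              # experts consulted per token
total_params = 47e9    # quoted size of Mixtral 8x7B
dense_7b = 7e9         # one dense "7B" model = shared layers + a single FFN stack

# total = shared + num_experts * ffn   and   dense_7b = shared + 1 * ffn
ffn_per_expert = (total_params - dense_7b) / (num_experts - 1)   # ~5.7B
shared = dense_7b - ffn_per_expert                               # ~1.3B (attention, embeddings, ...)

active = shared + top_k * ffn_per_expert
print(f"parameters touched per token: ~{active / 1e9:.1f}B")     # ~12.7B, not 2 x 7B = 14B
```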
If all our tokens are sent to just a few popular experts, training becomes inefficient. In normal MoE training, the gating network converges to mostly activating the same few experts. This self-reinforces, as favored experts are trained quicker and hence selected more. To mitigate this, an auxiliary loss is added to encourage giving all experts equal importance. This loss ensures that all experts receive a roughly equal number of training examples. The linked post also explores the concept of expert capacity, which introduces a threshold on how many tokens can be processed by an expert. In transformers, the auxiliary loss is exposed via the `aux_loss` parameter.
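As a sketch of what that auxiliary loss can look like (this follows the Switch Transformers-style load-balancing loss in spirit; it is not the exact code behind the `aux_loss` value in transformers): per expert, the fraction of tokens routed to it is multiplied by its average router probability, so the loss is smallest when routing is spread evenly.

```python
# Sketch of a load-balancing auxiliary loss for one MoE layer (illustrative only).
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top_k: int = 2) -> torch.Tensor:
    """router_logits: (num_tokens, num_experts) raw gate scores."""
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)             # router probability per expert
    chosen = torch.topk(probs, k=top_k, dim=-1).indices  # experts actually selected per token
    # How often each expert is actually picked, averaged over tokens.
    dispatch_frac = F.one_hot(chosen, num_experts).float().sum(dim=1).mean(dim=0)
    # Average probability the router assigns to each expert.
    avg_prob = probs.mean(dim=0)
    # Smallest when both quantities are uniform across experts.
    return num_experts * torch.sum(dispatch_frac * avg_prob)

logits = torch.randn(16, 8)                 # 16 tokens, 8 experts
aux = load_balancing_loss(logits)           # add (scaled) to the main training loss
print(aux)
```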
## "Wait...but you called this a frankenMoE?"