Kquant03 committed 3d39ae7 (parent: c77aee0)

Update README.md

Files changed (1): README.md (+8 -4)

A frankenMoE of [TinyLlama-1.1B-1T-OpenOrca](https://huggingface.co/jeff31415/TinyLlama-1.1B-1T-OpenOrca),
[TinyLlama-1.1B-intermediate-step-1195k-token-2.5T](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T),
and [tiny-llama-1.1b-chat-medical](https://huggingface.co/SumayyaAli/tiny-llama-1.1b-chat-medical).

### Most 1.1B models are incoherent and can't even answer simple questions. I picked out some models that aren't as bad, then mashed 32 copies of those 3 models together into a 32x MoE

OpenOrca experts have been given the task of creating responses for simple questions about things like pop culture, history, and science; step-1195k experts have been chosen to provide warmth and a positive environment; and chat-medical experts have been chosen to provide further detail about human subjects and to give little bits of medical advice, e.g. "How do I get rid of this headache I gave myself from making you?"
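
For a rough idea of how this kind of role assignment gets expressed in practice, here is a sketch of a mergekit-moe-style config. The actual config used for this repo isn't shown here, so the gate mode, prompts, and base-model choice below are assumptions, and the real merge repeats these three experts across 32 slots:

```python
# Illustrative sketch only -- the prompts, gate_mode, and base_model below are
# assumptions, not the exact settings used to build this repo.
import yaml  # pip install pyyaml

config = {
    "base_model": "jeff31415/TinyLlama-1.1B-1T-OpenOrca",
    "gate_mode": "hidden",   # route based on hidden states matched against the prompts (assumed setting)
    "dtype": "bfloat16",
    "experts": [
        {
            "source_model": "jeff31415/TinyLlama-1.1B-1T-OpenOrca",
            "positive_prompts": ["answer a simple question about pop culture, history, or science"],
        },
        {
            "source_model": "TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T",
            "positive_prompts": ["respond warmly and keep the conversation positive"],
        },
        {
            "source_model": "SumayyaAli/tiny-llama-1.1b-chat-medical",
            "positive_prompts": ["give a small, general piece of medical advice"],
        },
    ],
}

# mergekit-moe consumes a YAML file shaped like this dict.
print(yaml.safe_dump(config, sort_keys=False))
```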

# What is a Mixture of Experts (MoE)?

The scale of a model is one of the most important axes for better model quality. Given a fixed computing budget, training a larger model for fewer steps is better than training a smaller model for more steps.

Although MoEs provide benefits like efficient pretraining and faster inference compared to dense models, they also come with challenges:

Inference: Although a MoE might have many parameters, only some of them are used during inference. This leads to much faster inference compared to a dense model with the same number of parameters. However, all parameters need to be loaded in RAM, so memory requirements are high. For example, given a MoE like Mixtral 8x7B, we’ll need to have enough VRAM to hold a dense 47B parameter model. Why 47B parameters and not 8 x 7B = 56B? That’s because in MoE models, only the FFN layers are treated as individual experts, and the rest of the model parameters are shared. At the same time, assuming just two experts are being used per token, the inference speed (FLOPs) is like using a 12B model (as opposed to a 14B model), because it computes 2x7B matrix multiplications, but with some layers shared (more on this soon).
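
As a rough back-of-the-envelope check of those figures, here is a small sketch. The split between shared parameters and per-expert FFN parameters below is assumed for illustration (it is not Mixtral's published breakdown), but it reproduces the "~47B in memory, ~12-13B active per token" picture described above:

```python
# Rough MoE parameter math. The shared / per-expert split is an assumption chosen
# for illustration, not Mixtral's exact architecture.

def moe_param_counts(shared: float, per_expert: float,
                     num_experts: int, experts_per_token: int):
    """Return (parameters held in memory, parameters active per token)."""
    in_memory = shared + num_experts * per_expert      # every expert must be loaded
    active = shared + experts_per_token * per_expert   # only the top-k experts run
    return in_memory, active

shared_params = 1.7e9   # attention, embeddings, norms -- shared across experts (assumed)
expert_params = 5.6e9   # one expert's FFN stack (assumed)

in_memory, active = moe_param_counts(shared_params, expert_params,
                                     num_experts=8, experts_per_token=2)
print(f"held in memory:  ~{in_memory / 1e9:.1f}B parameters")  # ~46.5B, not 8 x 7B = 56B
print(f"active per token: ~{active / 1e9:.1f}B parameters")    # ~12.9B, i.e. 12-14B-class compute
```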
- # "Wait...but you called this a frankenMoE?"
36
  The difference between MoE and "frankenMoE" lies in the fact that the router layer in a model like the one on this repo is not trained simultaneously. There are rumors about someone developing a way for us to unscuff these frankenMoE models by training the router layer simultaneously. For now, frankenMoE remains psychotic.
37
 
38
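
For context, the "router layer" is just a small learned gate that scores the experts and picks a couple of them for each token. A generic top-2 gating sketch (not this repo's exact implementation) looks roughly like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2Router(nn.Module):
    """Minimal token-level top-2 gate: score all experts, keep the best two."""

    def __init__(self, hidden_size: int, num_experts: int):
        super().__init__()
        # In a frankenMoE, this linear layer is the part that was never trained
        # jointly with the experts.
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, hidden_states: torch.Tensor):
        logits = self.gate(hidden_states)                 # (tokens, num_experts)
        weights, expert_ids = torch.topk(logits, k=2, dim=-1)
        weights = F.softmax(weights, dim=-1)              # renormalize over the chosen two
        return weights, expert_ids                        # which experts each token is sent to

router = Top2Router(hidden_size=2048, num_experts=32)  # TinyLlama-ish hidden size, 32 experts
tokens = torch.randn(4, 2048)                          # 4 token embeddings
w, ids = router(tokens)
print(ids)  # e.g. tensor([[ 7, 21], ...]) -- 2 of the 32 experts chosen per token
```

In a regular MoE, this gate is trained together with the experts, so it learns which expert actually helps on which token; in a frankenMoE the gate never gets that joint training, which is exactly the part the rumored fix would train.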

### p.s. I will try to convert it to GGUF if you guys want. This model is crazy, though; it was really more just for fun and to learn how to use mergekit and axolotl.