MarsupialAI committed
Commit 88130fd · verified · 1 Parent(s): 96ed735

Update README.md

Files changed (1)
  1. README.md +12 -2
README.md CHANGED
@@ -17,8 +17,18 @@ repetition/jank.
 
 ![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/65a531bc7ec6af0f95c707b1/Z13P-BJaMM3Q7o4cZxoKt.jpeg)
 
- This isn't my recipe so don't ask me to explain it. This was engineered by FM, so the credit/blame is his. See recipe.yml
- if you want to examine the madness in detail.
+ This isn't my recipe. This was engineered by FM, so the credit/blame is his. Here is how he explains what's going on:
+
+ > "Stack" merging exists as an alternative to straight-up merging of models; the general idea is that, in a stacked arrangement, the models preserve their weights better than when merged in any way. Unfortunately, the results are often not as predictable as we'd wish them to be, and the models end up losing their crucial capabilities, thus invalidating the whole point of preserving them in the first place.
+ >
+ > In irregular, iterative experiments (Jan-Apr '24), some conclusions were reached:
+ > 1) The naive "Frankenmerge" stacking of slices of models doesn't preserve the input and output layers of the participating models; however, if said layers are merged beforehand and reused for the whole stacked model, the capabilities of the source models appear to be restored, at least partially.
+ > 2) The often-overlooked gradient merge, while not enhancing simple merges of models much, proves crucial in saving space (layers) when attempting to stack models "lengthwise". In this recipe, the target was to approximate the prompt passing through the internal layers of three 11B models, fit within the space of two. Straight stacking of three such models would've produced a 22B-parameter model with 96 layers, while this construction allows us to use just 80.
+ >
+ > Note: the results achieved are mostly subjective and not confirmed by rigorous testing.
+ > Note 2: for the gradient merging of 11B models, it's highly advisable to study their structure; since, at inception, such a model is made of the layers of a duplicated 7B model, it is preferable to merge layer slices that align with each other internally. This will become irrelevant soon, because Solar is old.
+
+ See recipe.yml if you want to examine the madness in detail.
 
 This model is uncensored and capable of generating objectionable material. As with any LLM, no factual claims made by the model
 should be taken at face value. You know that boilerplate safety disclaimer that most professional models have? Assume this has
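
To make the quoted note a bit more concrete: a "stack" (passthrough) merge concatenates layer slices from different models, while a gradient merge blends two models with a mixing ratio that varies across the layer range rather than using a single flat value. The snippet below is a minimal, hypothetical mergekit-style sketch of a gradient (slerp) merge with placeholder model names; it is not the actual recipe.yml of this model.

```yaml
# Hypothetical sketch only -- see recipe.yml in this repo for the real merge.
# Placeholder model names; a Solar-style 11B model has 48 decoder layers.
slices:
  - sources:
      - model: example/solar-11b-A
        layer_range: [0, 48]
      - model: example/solar-11b-B
        layer_range: [0, 48]
merge_method: slerp              # spherical interpolation between the two sources
base_model: example/solar-11b-A
parameters:
  t:
    # Gradient: the blend ratio ramps from pure A (0.0) at the bottom of the
    # slice to pure B (1.0) at the top, instead of one flat ratio everywhere.
    - value: [0.0, 0.25, 0.5, 0.75, 1.0]
dtype: float16
```

A straight stacked merge would instead list several `slices` entries drawn from different models back to back with `merge_method: passthrough`; configs in either style are typically run with `mergekit-yaml recipe.yml ./output-model`.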