Dynamic 8x7B Mixtral Model

Nous-Hermes-2-Mixtral-8x7B-17m-DPO-raw : 17 MoE FF Layers, 15 Dense FF Layers

Model Details

Model Description

This is an MoE layer pruning test derived from Nous-Hermes-2-Mixtral-8x7B-DPO, so it uses the same ChatML format for conversations.
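As an illustration of the ChatML conversation format, here is a minimal sketch of a prompt builder. The helper name `build_chatml_prompt` and its `add_generation_prompt` parameter are hypothetical and not part of this model's code; the sketch assumes the usual ChatML layout, in which every turn is wrapped in `<|im_start|>` and `<|im_end|>` and followed by a newline.

```python
from typing import Dict, List


def build_chatml_prompt(messages: List[Dict[str, str]],
                        add_generation_prompt: bool = True) -> str:
    """Render {"role", "content"} messages as a ChatML string (hypothetical helper)."""
    parts = []
    for message in messages:
        # Each turn: <|im_start|>role, newline, content, <|im_end|>, newline.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = build_chatml_prompt(
    [{"role": "user", "content": "How are you? Write a story for me please"}]
)
print(prompt)
```

If the tokenizer shipped with the checkpoint defines a chat template, `tokenizer.apply_chat_template` is likely the more robust route; the helper above only makes the string layout explicit.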

Fifteen MoE layers are each merged into a standard feed-forward layer (17 of the 32 layers remain MoE), which reduces the total parameter count from 47B to 14B.
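The effect of converting MoE layers to dense ones on the parameter count can be estimated with a short calculation. This is a rough sketch, not the author's method: it assumes the published Mixtral-8x7B configuration values (hidden size 4096, intermediate size 14336, 32 layers, 8 experts, 8 key/value heads, vocabulary 32000), and it assumes each merged layer becomes a single expert-sized feed-forward block with no router. The function name `estimate_params` is hypothetical.

```python
# Published Mixtral-8x7B configuration values (assumed to apply unchanged here).
HIDDEN = 4096
INTERMEDIATE = 14336
NUM_LAYERS = 32
NUM_EXPERTS = 8
NUM_HEADS = 32
NUM_KV_HEADS = 8
VOCAB = 32000


def estimate_params(num_dense_layers: int) -> int:
    """Estimate total parameters when `num_dense_layers` MoE layers become dense."""
    head_dim = HIDDEN // NUM_HEADS
    kv_dim = NUM_KV_HEADS * head_dim
    # q_proj and o_proj are HIDDEN x HIDDEN; k_proj and v_proj are HIDDEN x kv_dim.
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim
    norms = 2 * HIDDEN  # two RMSNorm weight vectors per layer
    expert = 3 * HIDDEN * INTERMEDIATE  # w1, w2, w3 of one expert
    moe_ffn = NUM_EXPERTS * expert + HIDDEN * NUM_EXPERTS  # experts plus router
    dense_ffn = expert  # assumption: merged layer has the size of one expert

    num_moe_layers = NUM_LAYERS - num_dense_layers
    layers = (num_moe_layers * (attention + norms + moe_ffn)
              + num_dense_layers * (attention + norms + dense_ffn))
    # Input embeddings, output head, and the final norm.
    embeddings = 2 * VOCAB * HIDDEN + HIDDEN
    return layers + embeddings


for n in (0, 15):
    print(f"{n} dense layers: {estimate_params(n) / 1e9:.1f}B parameters")
```

With zero dense layers this reproduces the roughly 46.7B parameters of the unmodified Mixtral-8x7B. A count taken from a model loaded in 4-bit can come out lower than such an estimate, because quantized weights are stored in packed form.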

The indices of the pruned layers are:

[3, 4, 7, 10, 11, 23, 24, 25, 26, 27, 28, 29]
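To make the index list concrete, the following sketch turns it into a per-layer plan. It assumes 32 decoder layers and zero-based indices, neither of which is stated explicitly in this card; `layer_plan` is a hypothetical helper, not part of the model's code.

```python
from typing import List

NUM_LAYERS = 32  # assumption: same depth as Mixtral-8x7B
PRUNED_LAYERS = [3, 4, 7, 10, 11, 23, 24, 25, 26, 27, 28, 29]  # list given above


def layer_plan(pruned: List[int], num_layers: int = NUM_LAYERS) -> List[str]:
    """Return 'dense' or 'moe' for every layer index (hypothetical helper)."""
    out_of_range = [i for i in pruned if not 0 <= i < num_layers]
    if out_of_range:
        raise ValueError(f"layer indices out of range: {out_of_range}")
    pruned_set = set(pruned)
    return ["dense" if i in pruned_set else "moe" for i in range(num_layers)]


plan = layer_plan(PRUNED_LAYERS)
print(plan.count("moe"), "MoE layers,", plan.count("dense"), "dense layers")
```

Counting the entries of the resulting plan is a quick way to check a configuration against the intended MoE/dense split.
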
  • Developed by: MistralAI, NousResearch, theblackcat
  • Model type: Modified Mixtral Architecture for dynamic MoE
  • License: apache-2.0

Uses

This model is experimental: the goal is still to find the sweet spot that runs in just under 24 GB of memory with a 4-bit quantization config.
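A back-of-the-envelope estimate shows why the parameter count matters for the 24 GB target. The sketch below counts only the memory for the weights themselves; it ignores quantization constants, layers that stay unquantized, activations, and the KV cache, so real usage will be higher. The function name `weight_memory_gib` is hypothetical, and the 31.9B input is the model size reported for this checkpoint.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float = 4.0) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


reported_params = 31.9e9  # model size reported for this checkpoint
for bits in (16, 8, 4):
    gib = weight_memory_gib(reported_params, bits)
    print(f"{bits}-bit weights: about {gib:.1f} GiB")
```

The remaining headroom below 24 GB has to cover everything the estimate leaves out, which is why the number of layers kept as MoE is the main tuning knob.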

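import torch
from transformers import AutoTokenizer
# CustomMixtralForCausalLM is the custom model class that ships with this
# checkpoint's code; import it from that modeling file before running.
# model_path must point to the local directory or Hub ID of this checkpoint.

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = CustomMixtralForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    load_in_4bit=True,  # 4-bit quantization (requires bitsandbytes)
    trust_remote_code=True,
)
# Parameter count in billions. With 4-bit loading the quantized weights are
# stored packed, so numel() may under-report the true count.
pytorch_total_params = sum(p.numel() for p in model.parameters())
print(pytorch_total_params / 1e9)

max_length = 100  # total length in tokens, prompt included
# ChatML prompt: each turn ends with <|im_end|> followed by a newline.
input_text = """<|im_start|>user\nHow are you? Write a story for me please<|im_end|>\n<|im_start|>assistant\n"""
input_ids = tokenizer(input_text, return_tensors="pt")["input_ids"].to("cuda")
print(len(input_ids[0]))  # prompt length in tokens
output = model.generate(input_ids, max_length=max_length, temperature=0.7, repetition_penalty=1.1, do_sample=True)
print(tokenizer.decode(output[0]))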

  • Model size: 31.9B params (Safetensors)
  • Tensor type: BF16