Improve model card: Add metadata, links, and detailed sections
#1
by nielsr (HF Staff) · opened

README.md CHANGED

---
base_model: JunxiongWang/llama3_0_875_mamba2_sft
datasets:
- HuggingFaceH4/ultrafeedback_binarized
- HuggingFaceH4/orca_dpo_pairs
- JunxiongWang/llama3-ultrafeedback-armorm
tags:
- alignment-handbook
- generated_from_trainer
- mamba
- distillation
model-index:
- name: JunxiongWang/Mamba2InLlama_0_875
  results: []
pipeline_tag: text-generation
library_name: transformers
license: apache-2.0
---

# JunxiongWang/Mamba2InLlama_0_875: The Mamba in the Llama

This model is part of the work presented in the paper [The Mamba in the Llama: Distilling and Accelerating Hybrid Models](https://arxiv.org/abs/2408.15237).

**Code Repository (New Version)**: [https://github.com/jxiw/M1](https://github.com/jxiw/M1)
**Code Repository (Original)**: [https://github.com/jxiw/MambaInLlama](https://github.com/jxiw/MambaInLlama)
**Project Page**: [https://openreview.net/forum?id=uAzhODjALU](https://openreview.net/forum?id=uAzhODjALU)

[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="200" height="32"/>](https://wandb.ai/junxiong12/huggingface/runs/58mrdgq8)

## Model description

This model, `JunxiongWang/Mamba2InLlama_0_875`, is a fine-tuned hybrid language model that combines Transformer and Mamba (linear RNN) layers. It is the result of a distillation process that converts large pre-trained Transformer models, such as Llama3-8B-Instruct, into hybrid models that are more favorable for deployment. The approach reuses the linear projection weights of the attention layers to initialize Mamba layers, yielding a hybrid model that retains only a fraction of the original attention layers (one-eighth here, as indicated by the `0_875` suffix: 87.5% of the attention layers are replaced by Mamba2).
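
As a rough illustration of the weight-reuse idea, the sketch below copies a Llama attention block's projection weights into hypothetically named Mamba2 projections, following the paper's attention-as-linear-RNN view (queries → C, keys → B, values → x, output projection → output projection). The module and parameter names are assumptions made for illustration; the actual initialization code lives in the MambaInLlama/M1 repositories.

```python
# Minimal, illustrative sketch of attention-weight reuse. The Mamba2 parameter
# names ("C_proj", "B_proj", "x_proj", "out_proj") are hypothetical; this is
# NOT the repository's initialization code.
import torch.nn as nn

def reuse_attention_weights(attn: nn.Module) -> dict:
    """Map a LlamaAttention block's q/k/v/o projections onto hypothetical
    Mamba2 projection names. The remaining Mamba parameters (A, dt, conv,
    norms) would be freshly initialized and learned during distillation."""
    return {
        "C_proj.weight": attn.q_proj.weight.data.clone(),    # queries -> readout C
        "B_proj.weight": attn.k_proj.weight.data.clone(),    # keys    -> state input B
        "x_proj.weight": attn.v_proj.weight.data.clone(),    # values  -> token input x
        "out_proj.weight": attn.o_proj.weight.data.clone(),  # output projection
    }
```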

The model aims to match the original Transformer on chat benchmarks while offering better inference characteristics, and it can be paired with a hardware-aware speculative decoding algorithm to further accelerate generation. Distilled from Llama3-8B-Instruct, it reports a 29.61 length-controlled win rate on AlpacaEval 2 against GPT-4 and 7.35 on MT-Bench. It also extrapolates naturally to longer contexts, reaching near-perfect accuracy on the needle-in-a-haystack test at 20x the distillation length.
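
The speculative decoding idea can be pictured with a generic greedy verify-and-accept loop. The sketch below is a simplification, not the paper's hardware-aware algorithm: `draft_logits` and `target_logits` are hypothetical stand-ins for a cheap drafter and the hybrid verifier, and sampling, state caching, and EOS handling are omitted.

```python
# Simplified greedy speculative decoding: a drafter proposes k tokens, the
# target model verifies them in one forward pass, and the longest agreeing
# prefix is accepted. Illustration only; not the paper's hardware-aware method.
import torch

def greedy_speculative_decode(target_logits, draft_logits, prompt, k=4, max_new_tokens=32):
    """target_logits / draft_logits: fn(ids[1, t]) -> logits[1, t, vocab]."""
    ids = prompt.clone()
    while ids.shape[1] - prompt.shape[1] < max_new_tokens:
        # 1) Draft k tokens greedily with the cheap model.
        draft_ids = ids
        for _ in range(k):
            nxt = draft_logits(draft_ids)[:, -1].argmax(-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, nxt], dim=1)
        proposed = draft_ids[:, ids.shape[1]:]                                     # [1, k]
        # 2) Verify all k proposals with a single target forward pass.
        target_preds = target_logits(draft_ids)[:, ids.shape[1] - 1:].argmax(-1)   # [1, k+1]
        # 3) Accept the longest agreeing prefix, then append the target's own
        #    next token (a correction, or a bonus token if everything matched).
        agree = (proposed == target_preds[:, :k])[0].long()
        n_ok = int(agree.cumprod(0).sum())
        ids = torch.cat([ids, proposed[:, :n_ok], target_preds[:, n_ok:n_ok + 1]], dim=1)
    return ids

# Toy smoke test with identical draft/target "models" (all drafts accepted).
torch.manual_seed(0)
table = torch.randn(100, 100)          # fixed next-token preference table
toy = lambda ids: table[ids]           # [1, t] -> [1, t, 100]
print(greedy_speculative_decode(toy, toy, torch.tensor([[1, 2, 3]]), k=4, max_new_tokens=12).shape)
```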

## Intended uses & limitations

This model is intended for efficient and accelerated text generation, particularly where the deployment advantages of linear RNNs (such as Mamba) are preferred over a full Transformer. It is suitable for chat applications, general language modeling, and tasks that require long-range context handling.

Limitations: although the distillation is designed to preserve generative quality, the hybrid architecture may behave differently from a full Transformer on specific tasks. Optimal performance and reproducibility depend on the recommended environment setup, including the specific CUDA and Python package versions listed below. As with all large language models, biases inherited from the training data and hallucination remain considerations.

## Training and evaluation data

This model is a fine-tuned version of [JunxiongWang/llama3_0_875_mamba2_sft](https://huggingface.co/JunxiongWang/llama3_0_875_mamba2_sft/) on the following datasets:

* [HuggingFaceH4/ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized)
* [HuggingFaceH4/orca_dpo_pairs](https://huggingface.co/datasets/HuggingFaceH4/orca_dpo_pairs)
* [JunxiongWang/llama3-ultrafeedback-armorm](https://huggingface.co/datasets/JunxiongWang/llama3-ultrafeedback-armorm)

It achieves the following results on the evaluation set (these are the standard DPO trainer metrics; a short sketch of how they are computed follows the list):
- Loss: 0.4761
- Rewards/chosen: -1.4040
- Rewards/rejected: -2.6012
- Rewards/accuracies: 0.7982
- Rewards/margins: 1.1973
- Logps/rejected: -584.9104
- Logps/chosen: -459.0677
- Logits/rejected: 0.3408
- Logits/chosen: 0.3851
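
For readers unfamiliar with these metric names, the sketch below shows how such DPO metrics are typically computed from sequence-level policy and reference log-probabilities (in the style of TRL's `DPOTrainer`). The β value and the dummy tensors are purely illustrative and are not this model's training configuration.

```python
# Illustrative only: how DPO-style metrics relate to policy/reference
# log-probabilities. Beta and the dummy tensors are NOT this model's settings.
import torch
import torch.nn.functional as F

def dpo_metrics(pi_chosen_logps, pi_rejected_logps, ref_chosen_logps, ref_rejected_logps, beta=0.1):
    rewards_chosen = beta * (pi_chosen_logps - ref_chosen_logps)        # "Rewards/chosen"
    rewards_rejected = beta * (pi_rejected_logps - ref_rejected_logps)  # "Rewards/rejected"
    margins = rewards_chosen - rewards_rejected                         # "Rewards/margins"
    accuracies = (rewards_chosen > rewards_rejected).float()            # "Rewards/accuracies"
    loss = -F.logsigmoid(margins)                                       # DPO loss ("Loss")
    return {k: v.mean().item() for k, v in {
        "loss": loss, "rewards/chosen": rewards_chosen, "rewards/rejected": rewards_rejected,
        "rewards/accuracies": accuracies, "rewards/margins": margins}.items()}

# Dummy sequence-level log-probs for a batch of two preference pairs.
print(dpo_metrics(torch.tensor([-120.0, -95.0]), torch.tensor([-150.0, -110.0]),
                  torch.tensor([-118.0, -101.0]), torch.tensor([-140.0, -104.0])))
```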

## Training procedure

The models were distilled using a multi-stage approach:

1. **Stepwise layer alignment** (optional): attention layers are replaced by Mamba2 layers one at a time. During this stage the MLP layers are typically frozen so the model stays close to its initialization.
2. **End-to-end distillation** (most important): the main phase minimizes the KL divergence between the student (hybrid) and teacher (original Transformer) output distributions. All parameters, including the MLP layers, are trained in this stage, using the KL loss alone (see the sketch after this list).
3. **Instruction tuning** (optional): DPO (Direct Preference Optimization) is used, for simplicity, to align the model with human preferences.

With limited resources, the full distillation process typically takes around 3 to 4 days on 8x 80GB A100 GPUs.
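
The stage-2 objective can be sketched as a token-level KL term between the frozen teacher and the trainable student. `teacher` and `student` are assumed to be HF-style models returning `.logits`; this is an illustration, not the repository's training loop.

```python
# Minimal sketch of the end-to-end distillation objective: token-level
# KL(teacher || student). Interfaces are assumed; see MambaInLlama/M1 for the
# actual training code.
import torch
import torch.nn.functional as F

def distill_kl_loss(student_logits, teacher_logits, temperature=1.0):
    """logits: [batch, seq_len, vocab]; returns the mean per-token KL."""
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(s_logprobs, t_logprobs, log_target=True, reduction="none").sum(-1)
    return kl.mean()

# One training step (teacher frozen, student updated):
# with torch.no_grad():
#     teacher_logits = teacher(input_ids).logits
# loss = distill_kl_loss(student(input_ids).logits, teacher_logits)
# loss.backward(); optimizer.step()
```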

### Training hyperparameters

The following hyperparameters were used during training:

[...]

| Training Loss | Epoch  | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:--------------:|
| 0.5009        | 0.4798 | 2000 | 0.4998          | -1.4973        | -2.6147          | 0.7804             | 1.1175          | -586.2582      | -468.3976    | 0.4682          | 0.5136         |
| 0.4895        | 0.9597 | 4000 | 0.4761          | -1.4040        | -2.6012          | 0.7982             | 1.1973          | -584.9104      | -459.0677    | 0.3408          | 0.3851         |

### Framework versions

- Transformers 4.43.1
- Pytorch 2.1.1+cu118
- Datasets 2.20.0
- Tokenizers 0.19.1

## Usage

For detailed instructions, the full codebase, and other released models, please refer to the primary [M1 GitHub repository](https://github.com/jxiw/M1).

### Environment Setup

The project provides an `environment.yml` file listing the specific Python package versions used; for reproducibility and optimal performance it is recommended to use these versions. Key packages include `mamba-ssm`, `causal-conv1d`, and `flash-attn`, along with specific PyTorch and CUDA versions.

```bash
# CUDA >= 11.6 is needed for `mamba-ssm` and `causal-conv1d`.
conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit

# Install PyTorch (with CUDA 11.8) before everything else; these commands assume cu118.
pip install torch==2.1.1+cu118 --index-url https://download.pytorch.org/whl/cu118

pip install causal-conv1d==1.4.0
pip install flash-attn==2.6.3

# Make sure you use this version of alignment-handbook.
git clone https://github.com/huggingface/alignment-handbook.git
cd alignment-handbook/
git checkout 606d2e9

git clone https://github.com/huggingface/transformers.git --branch v4.43.1

# Check that your versions match these:
# deepspeed==0.12.2
# torch==2.1.1+cu118
# transformers==4.43.1
# trl==0.8.6
# accelerate==0.33.0
# peft==0.12.0
# huggingface-hub==0.24.5
```

If `mamba-ssm==2.2.2` is installed via pip, a manual change to `CONDA_ENV_PATH/site-packages/mamba_ssm/modules/mha.py` might be needed to support GQA (used in Llama3). Refer to [this version](https://github.com/state-spaces/mamba/blob/014c094d11f780a27330657faabecaaded7a31db/mamba_ssm/modules/mha.py) or build `mamba-ssm` from source (commit after `014c094d11f780a27330657faabecaaded7a31db`).

### Generation Example (Mamba 2)

```python
import torch
from transformers import AutoTokenizer
# For Mamba2InLlama models, use mamba2_inference.hybrid_wrapper
from mamba2_inference.hybrid_wrapper import MambaTransformerHybridModelWrapper

pretrained_model_name = "JunxiongWang/Mamba2InLlama_0_875"  # this model
model = MambaTransformerHybridModelWrapper.from_pretrained(pretrained_model_name, torch_dtype=torch.bfloat16)
model.eval()

messages = [[
    {
        "role": "user",
        "content": "Farmer Brown has 20 animals on his farm, all either chickens or cows. They have a total of 70 legs, all together. How many of the animals are chickens?",
    },
]]

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
formatted_prompts = [
    tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=True) for message in messages
]

prompts = [
    tokenizer.encode(formatted_prompt, return_tensors="pt", truncation=True, max_length=200)
    for formatted_prompt in formatted_prompts
]
batch_prompts = torch.cat(prompts, dim=0).cuda()

outputs = model.generate(
    input_ids=batch_prompts,
    max_length=1000,
    cg=True,  # use CUDA graphs to speed up decoding
    return_dict_in_generate=True,
    output_scores=True,
    enable_timing=True,
    top_k=1,  # greedy decoding
    eos_token_id=tokenizer.eos_token_id
)

generated_text = tokenizer.batch_decode(outputs.sequences.tolist())
print(generated_text[0])

# Example output (trimmed for brevity):
# Let's use algebra to solve this problem. Let \( c \) represent the number of chickens and \( k \) represent the number of cows.
# ... (full derivation and answer)
# So, there are 5 chickens on Farmer Brown's farm.
```

## Citation

If you use this codebase, or otherwise found our work valuable, please cite:

```bibtex
@inproceedings{
junxiongdaniele2024mambainllama,
title={The Mamba in the Llama: Distilling and Accelerating Hybrid Models},
author={Junxiong Wang and Daniele Paliotta and Avner May and Alexander M Rush and Tri Dao},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
url={https://openreview.net/forum?id=uAzhODjALU}
}
```