JingzeShi committed
Commit bf04250 (verified) · 1 parent: 1e1abc5

Update README.md

Files changed (1):
  1. README.md +13 -15
README.md CHANGED
@@ -25,9 +25,9 @@ tags:
   <a href="https://discord.gg/P2yYH95N" target="_blank" style="margin: 2px;">
     <img alt="Discord" src="https://img.shields.io/badge/Discord-Small%20Doges-7289da?logo=discord&logoColor=white&color=7289da" style="display: inline-block; vertical-align: middle;"/>
   </a>
-  <a href="https://arxiv.org/abs/2412.11834" target="_blank" style="margin: 2px;">
+  <!-- <a href="https://arxiv.org/abs/2412.11834" target="_blank" style="margin: 2px;">
     <img alt="arXiv" src="https://img.shields.io/static/v1?label=arXiv&message=2412.11834&color=B31B1B&logo=arXiv" style="display: inline-block; vertical-align: middle;"/>
-  </a>
+  </a> -->
   <a href="https://github.com/SmallDoges/small-doge" target="_blank" style="margin: 2px;">
     <img alt="GitHub" src="https://img.shields.io/badge/GitHub-SmallDoge-181717?logo=github" style="display: inline-block; vertical-align: middle;"/>
   </a>
@@ -36,7 +36,7 @@ tags:
   </a>
 </div>
 
-Doge uses Dynamic Mask Attention as sequence transformation and can use Multi-Layer Perceptron or Cross Domain Mixture of Experts as state transformation. Dynamic Mask Attention allows the Transformer to use self-attention during training and state space during inference, and Cross Domain Mixture of Experts can directly inherit the weights of Multi-Layer Perceptron for further training. This model is trained by [SmallDoge](https://huggingface.co/SmallDoge) community, for detailed algorithm and model architecture, please refer to [Wonderful Matrices](https://arxiv.org/abs/2412.11834), all training details and code are publicly available on the [small-doge](https://github.com/SmallDoges/small-doge) repository.
+Doge uses Dynamic Mask Attention as sequence transformation and can use Multi-Layer Perceptron or Cross Domain Mixture of Experts as state transformation. Dynamic Mask Attention allows the Transformer to use self-attention during training and state space during inference, and Cross Domain Mixture of Experts can directly inherit the weights of Multi-Layer Perceptron for further training. This model is trained by the [SmallDoge](https://huggingface.co/SmallDoge) community; a paper detailing the algorithm and model architecture is coming soon, and all training details and code are available in the [small-doge](https://github.com/SmallDoges/small-doge) repository.
 
 
 ## Uses
@@ -83,13 +83,13 @@ outputs = model.generate(
 
 We build the Doge-Instruct-SFT by SFT on [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk).
 
-> TODO: The larger model is under training and will be uploaded soon.
-
 **SFT**:
 | Model | Training Data | Epochs | Context Length | LR | Batch Size | Precision |
 |---|---|---|---|---|---|---|
-| [Doge-20M-Instruct-SFT](https://huggingface.co/SmallDoge/Doge-20M-Instruct-SFT) | [HuggingFaceTB/smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) | 2 | 2048 | 8e-4 | 0.25M | bfloat16 |
-| [Doge-60M-Instruct-SFT](https://huggingface.co/SmallDoge/Doge-60M-Instruct-SFT) | [HuggingFaceTB/smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) | 2 | 2048 | 6e-4 | 0.25M | bfloat16 |
+| [Doge-20M-Instruct-SFT](https://huggingface.co/SmallDoge/Doge-20M-Instruct-SFT) | [smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) | 2 | 2048 | 8e-4 | 0.25M | bfloat16 |
+| [Doge-60M-Instruct-SFT](https://huggingface.co/SmallDoge/Doge-60M-Instruct-SFT) | [smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) | 2 | 2048 | 6e-4 | 0.25M | bfloat16 |
+| [Doge-160M-Instruct-SFT](https://huggingface.co/SmallDoge/Doge-160M-Instruct-SFT) | [smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) | 2 | 2048 | 4e-4 | 0.25M | bfloat16 |
+| [Doge-320M-Instruct-SFT](https://huggingface.co/SmallDoge/Doge-320M-Instruct-SFT) | [smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) | 2 | 2048 | 2e-4 | 0.25M | bfloat16 |
 
 
 **Procedure**:
@@ -107,13 +107,11 @@ We build the Doge-Instruct-SFT by SFT on [SmolTalk](https://huggingface.co/datas
 ## Citation
 
 ```bibtex
-@misc{shi2024wonderfulmatrices,
-      title={Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture},
-      author={Jingze Shi and Bingheng Wu},
-      year={2024},
-      eprint={2412.11834},
-      archivePrefix={arXiv},
-      primaryClass={cs.LG},
-      url={https://arxiv.org/abs/2412.11834},
+@misc{smalldoges,
+      title={SmallDoges: A Family of Dynamic UltraFast Small Language Models},
+      author={Shi, Jingze and Wu, Yifan and Wu, Bingheng and Luo, Yuyu},
+      year={2025},
+      month={March},
+      url={https://github.com/SmallDoges/small-doge},
 }
 ```
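
For context on the updated description above: the instruct checkpoints referenced in this diff are used like any other Hugging Face causal LM with a chat template. The snippet below is a minimal sketch, not the example from the README's own `## Uses` section; it assumes the checkpoints load through `AutoModelForCausalLM` with `trust_remote_code=True` and ship a chat template, and the sampling settings are illustrative.

```python
# Hedged usage sketch for a Doge SFT checkpoint (verify against the model card's
# own "Uses" example). Assumes trust_remote_code loading and a chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SmallDoge/Doge-20M-Instruct-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the bfloat16 precision listed in the SFT table
    trust_remote_code=True,
)

# Build a chat prompt and generate a short reply.
messages = [{"role": "user", "content": "Hi, how are you doing today?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```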
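
For context on the SFT table above: the rows scale the learning rate with model size while keeping epochs, context length, the ~0.25M-token batch, and bfloat16 precision fixed. The sketch below shows how the Doge-20M-Instruct-SFT row might map onto a generic TRL `SFTTrainer` run; the authoritative recipes live in the small-doge repository, and the dataset config name, per-device/accumulation split, and some keyword names here are assumptions that vary across `trl` versions.

```python
# Hypothetical TRL sketch of the table's SFT settings, not the project's recipe.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

base_model = "SmallDoge/Doge-20M"  # base checkpoint behind Doge-20M-Instruct-SFT
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(base_model, trust_remote_code=True)

# SmolTalk is conversational ("messages" column); recent trl versions apply the
# tokenizer's chat template automatically inside SFTTrainer.
dataset = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")

# Table row for Doge-20M-Instruct-SFT: 2 epochs, 2048 context, LR 8e-4,
# bfloat16, and a ~0.25M-token batch (2048 tokens x 128 sequences ≈ 0.26M).
args = SFTConfig(
    output_dir="doge-20m-instruct-sft",
    num_train_epochs=2,
    max_seq_length=2048,              # kwarg name differs in newer trl releases
    learning_rate=8e-4,
    per_device_train_batch_size=1,    # assumption: single GPU
    gradient_accumulation_steps=128,  # assumption: approximates 0.25M tokens/step
    bf16=True,
    logging_steps=100,
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,       # older trl versions take tokenizer= instead
)
trainer.train()
```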