liushaowei committed
Commit · d413755
1 Parent(s): bb39c64
update readme format
README.md CHANGED
@@ -9,9 +9,9 @@ library_name: transformers
 <!-- # Muon is Scalable For LLM Training -->
 
 <div align="center">
-<a href="https://github.com/MoonshotAI/dummy.pdf"><img src="figures/logo.png" height="16" width="16" style="vertical-align:middle"><b> Tech Report</b></a> |
-<a href="https://huggingface.co/moonshotai/Moonlight"><img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" height="16" width="16" style="vertical-align:middle"><b> HuggingFace</b></a> |
-<a href="#"><img src="figures/megatron.png" height="16" width="16" style="vertical-align:middle"><b>Megatron(coming soon)</b></a>
+<a href="https://github.com/MoonshotAI/dummy.pdf" ><img src="figures/logo.png" height="16" width="16" style="display: inline-block; vertical-align: middle; margin: 2px;"><b style="display: inline-block;"> Tech Report</b></a> |
+<a href="https://huggingface.co/moonshotai/Moonlight"><img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" height="16" width="16" style="display: inline-block; vertical-align: middle; margin: 2px;"><b style="display: inline-block;"> HuggingFace</b></a> |
+<a href="#"><img src="figures/megatron.png" height="16" width="16" style="display: inline-block; vertical-align: middle; margin: 2px;"><b style="display: inline-block;">Megatron(coming soon)</b></a>
 </div>
 
 
@@ -52,7 +52,7 @@ We compared Moonlight with SOTA public models at similar scale:
 - **LLAMA3-3B** is a 3B-parameter dense model trained with 9T tokens
 - **Qwen2.5-3B** is a 3B-parameter dense model trained with 18T tokens
 - **Deepseek-v2-Lite** is a 2.4B/16B-parameter MOE model trained with 5.7T tokens
-
+<div align="center">
 | | **Benchmark (Metric)** | **Llama3.2-3B** | **Qwen2.5-3B** | **DSV2-Lite** | **Moonlight** |
 |---|---|---|---|---|---|
 | | Activated Param† | 2.81B | 2.77B | 2.24B | 2.24B |
@@ -70,6 +70,7 @@ We compared Moonlight with SOTA public models at similar scale:
 | | CMath | - | 80.0 | 58.4 | **81.1** |
 | **Chinese** | C-Eval | - | 75.0 | 60.3 | **77.2** |
 | | CMMLU | - | 75.0 | 64.3 | **78.2** |
+</div>
 
 *Qwen 2 & 2.5 reports didn't disclose their optimizer information. †The reported parameter counts exclude the embedding parameters. ‡We test all listed models with the full set of TriviaQA.*
 