**Hunyuan-Large just released by Tencent: largest ever open MoE LLM, only 52B active parameters but beats LLaMA 3.1-405B on most academic benchmarks 🚀**
⚡ Mixture of Experts (MoE) architecture: 389B parameters in total, but only 52B are activated for any input
🧪 Trained on 7T tokens, including 1.5T tokens of synthetic data
🏗️ Architecture: Novel "recycle routing" prevents token dropping when experts are overloaded (see the sketch after the paper link below)
🏆 Great benchmark results: Surpasses Llama-3-405B-Instruct on most benchmarks although it has 8x fewer active parameters
‣ Impressive perf on MATH: 77.4
📏 Large context length: up to 256K tokens
📜 License:
‣ Commercial use allowed, except if your products have >100M monthly active users
‣ No access in the EU
🤗 Model weights available on HF! (minimal loading sketch at the end of this post)
Read the full paper here 👉 Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent (2411.02265)
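The recycle-routing idea is easy to picture: tokens that overflow a full expert get re-assigned to experts that still have spare capacity instead of being dropped. Below is a toy Python sketch of that concept only; the function name, the per-expert capacity argument, and the random re-assignment policy are my assumptions, not the paper's exact training-time implementation.

```python
import random

def recycle_route(expert_choices, num_experts, capacity):
    """Toy illustration of "recycle routing" in an MoE layer.

    expert_choices: the expert index initially picked by the router for each token.
    Tokens that overflow a full expert are "recycled" to a randomly chosen expert
    that still has room, instead of being dropped. Conceptual sketch only.
    """
    load = [0] * num_experts   # how many tokens each expert has accepted so far
    assignment = []            # final expert index per token (None = dropped)

    for expert in expert_choices:
        if load[expert] < capacity:          # primary expert still has capacity
            load[expert] += 1
            assignment.append(expert)
            continue
        # Primary expert is full: recycle the token to any expert with spare room.
        spare = [e for e in range(num_experts) if load[e] < capacity]
        if spare:
            recycled = random.choice(spare)
            load[recycled] += 1
            assignment.append(recycled)
        else:
            assignment.append(None)          # every expert is full: token is dropped
    return assignment

# Example: 8 tokens, 4 experts, capacity of 2 tokens per expert.
print(recycle_route([0, 0, 0, 1, 1, 1, 2, 3], num_experts=4, capacity=2))
```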
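And if you want to try the weights from the Hub, a minimal loading sketch with transformers could look like the one below. The repo id, the need for trust_remote_code, and the prompt are assumptions on my part; check the model card for the exact requirements (and mind the license restrictions above).

```python
# Minimal loading sketch, assuming the weights live in a Hub repo such as
# "tencent/Tencent-Hunyuan-Large" (assumed id - check the model card first).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tencent/Tencent-Hunyuan-Large"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # custom MoE modeling code ships with the repo
    device_map="auto",       # shard the 389B parameters across available GPUs
    torch_dtype="auto",
)

prompt = "Explain mixture-of-experts routing in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```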