🖥️ Do you have 1TB+ VRAM?
🎉 Well, good news for you!
👨‍🔬 Good folks at @nvidia have released Nemotron 4 340B, the new open-source LLM king, rivalling GPT-4! 🚀
📊 340B-parameter models in 3 flavours: base, reward, and instruct
🎯 It's a dense model, not MoE
📏 4k context window
📚 9T tokens of training data, 2-phase training (8T pre-training + 1T continued pre-training)
🌐 Trained on 50+ natural languages and 40+ programming languages (70% of the training data is English, 15% multilingual, 15% code)
📅 June 2023 training data cut-off
💻 Deployment needs 8x H200, 16x H100, or 16x A100 80GB for BF16 inference (roughly 8x H100 in int4; rough VRAM maths in the sketch after this list)
🏆 Of course, it beats Llama 3 70B on MMLU (81.1), Arena Hard (54.2), and GSM8K (92.4)
🤔 But it's beaten on HumanEval and MT-Bench by Qwen 2, a 72B-parameter model
🔧 Aligned with SFT, DPO, and RPO, plus RLHF via the NeMo-Aligner framework
📈 98% of the alignment data was synthetically generated
📜 Released under the NVIDIA Open Model License, with commercial use allowed
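For intuition on those deployment numbers, here's a back-of-envelope, weights-only VRAM estimate in Python (a rough sketch: real serving also needs KV cache, activations, and framework overhead, so treat these as lower bounds):

# Weights-only memory estimate for a 340B-parameter dense model.
PARAMS = 340e9

def weights_gb(bytes_per_param):
    return PARAMS * bytes_per_param / 1e9

print(f"BF16 weights: {weights_gb(2):.0f} GB")    # 680 GB
print(f"INT4 weights: {weights_gb(0.5):.0f} GB")  # 170 GB

# 16x 80GB (A100/H100) = 1280 GB and 8x 141GB (H200) = 1128 GB both
# clear the ~680 GB of BF16 weights; 8x 80GB = 640 GB fits the ~170 GB
# of int4 weights with headroom for cache and activations.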
¯\_(ツ)_/¯
😊 Glad to see more open models, but this is one confusing fellow!
🤨 A 340B-parameter model that only narrowly beats 70B models, yet loses to a 72B one? Sounds like a model built for synthetic data generation (minimal sketch below)! But then why only a 4k context?
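If synthetic data generation really is the niche, usage could look roughly like this minimal sketch, assuming an OpenAI-compatible endpoint serving the instruct model (the base_url and model id here are assumptions, not confirmed values):

from openai import OpenAI

# Assumed endpoint and model id -- swap in wherever you actually
# serve the 340B instruct model.
client = OpenAI(base_url="https://integrate.api.nvidia.com/v1",
                api_key="YOUR_API_KEY")

topics = ["binary search", "SQL joins", "rate limiting"]
dataset = []
for topic in topics:
    resp = client.chat.completions.create(
        model="nvidia/nemotron-4-340b-instruct",  # assumed model id
        messages=[{"role": "user",
                   "content": f"Write a challenging question about {topic}, "
                              "then answer it step by step."}],
        temperature=0.7,
        max_tokens=512,
    )
    dataset.append({"topic": topic, "text": resp.choices[0].message.content})

# The companion 340B reward model could then score and filter these
# samples, mirroring the mostly-synthetic alignment pipeline above.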
🔗 Models: nvidia/nemotron-4-340b-666b7ebaf1b3867caf2f1911
📄 Paper: https://research.nvidia.com/publication/2024-06_nemotron-4-340b