This work from Chinese startup @MiniMax-AI introduces a novel architecture that achieves state-of-the-art performance while handling context windows up to 4 million tokens - roughly 20x longer than current models. The key was combining lightning attention, mixture of experts (MoE), and a careful hybrid approach.
๐๐ฒ๐ ๐ถ๐ป๐๐ถ๐ด๐ต๐๐:
๐๏ธ MoE with novel hybrid attention: โฃ Mixture of Experts with 456B total parameters (45.9B activated per token) โฃ Combines Lightning attention (linear complexity) for most layers and traditional softmax attention every 8 layers
๐ Outperforms leading models across benchmarks while offering vastly longer context: โฃ Competitive with GPT-4/Claude-3.5-Sonnet on most tasks โฃ Can efficiently handle 4M token contexts (vs 256K for most other LLMs)
๐ฌ Technical innovations enable efficient scaling: โฃ Novel expert parallel and tensor parallel strategies cut communication overhead in half โฃ Improved linear attention sequence parallelism, multi-level padding and other optimizations achieve 75% GPU utilization (that's really high, generally utilization is around 50%)
๐ฏ Thorough training strategy: โฃ Careful data curation and quality control by using a smaller preliminary version of their LLM as a judge!
Overall, not only is the model impressive, but the technical paper is also really interesting! ๐ It has lots of insights including a great comparison showing how a 2B MoE (24B total) far outperforms a 7B model for the same amount of FLOPs.
๐ฅ Why 3DIS & 3DIS-FLUX? Current SOTA multi-instance generation methods are typically adapter-based, requiring additional control modules trained on pre-trained models for layout and instance attribute control. However, with the emergence of more powerful models like FLUX and SD3.5, these methods demand constant retraining and extensive resources.
โจ Our Solution: 3DIS We introduce a decoupled approach that only requires training a low-resolution Layout-to-Depth model to convert layouts into coarse-grained scene depth maps. Leveraging community and company pre-trained models like ControlNet + SAM2, we enable training-free controllable image generation on high-resolution models such as SDXL and FLUX.
๐ Benefits of Our Decoupled Multi-Instance Generation: 1. Enhanced Control: By constructing scenes using depth maps in the first stage, the model focuses on coarse-grained scene layout, improving control over instance placement. 2. Flexibility & Preservation: The second stage employs training-free rendering methods, allowing seamless integration with various models (e.g., fine-tuned weights, LoRA) while maintaining the generative capabilities of pre-trained models.
Join us in advancing Layout-to-Image Generation! Follow and star our repository to stay updated! โญ