Low-Rank Adapters Meet Neural Architecture Search for LLM Compression • arXiv:2501.16372 • Published 7 days ago
Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling • arXiv:2501.16975 • Published 2 days ago
Optimizing Large Language Model Training Using FP4 Quantization • arXiv:2501.17116 • Published 1 day ago
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training • arXiv:2501.17161 • Published 1 day ago
Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity • arXiv:2501.16295 • Published 3 days ago
Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models • arXiv:2501.12370 • Published 9 days ago
Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation • arXiv:2501.15907 • Published 3 days ago
ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer • arXiv:2501.15570 • Published 4 days ago
Towards General-Purpose Model-Free Reinforcement Learning • arXiv:2501.16142 • Published 3 days ago
CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation • arXiv:2501.11325 • Published 10 days ago
Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning • arXiv:2411.19458 • Published Nov 29, 2024