InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning Paper • 2409.12568 • Published 3 days ago • 42
Training Language Models to Self-Correct via Reinforcement Learning Paper • 2409.12917 • Published 3 days ago • 88
StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation Paper • 2409.12576 • Published 3 days ago • 11
Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization Paper • 2409.12903 • Published 3 days ago • 13
LVCD: Reference-based Lineart Video Colorization with Diffusion Models Paper • 2409.12960 • Published 3 days ago • 16
Denoising Reuse: Exploiting Inter-frame Motion Consistency for Efficient Video Latent Generation Paper • 2409.12532 • Published 3 days ago • 3
MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions Paper • 2409.12958 • Published 3 days ago • 4
FlexiTex: Enhancing Texture Generation with Visual Guidance Paper • 2409.12431 • Published 4 days ago • 7
SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer Paper • 2409.08425 • Published 10 days ago • 8
Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey Paper • 2409.11564 • Published 5 days ago • 16
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models Paper • 2409.12139 • Published 4 days ago • 10
Towards Diverse and Efficient Audio Captioning via Diffusion Models Paper • 2409.09401 • Published 8 days ago • 6
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution Paper • 2409.12191 • Published 4 days ago • 55
Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models Paper • 2409.11136 • Published 5 days ago • 18
Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion Paper • 2409.11406 • Published 5 days ago • 21
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think Paper • 2409.11355 • Published 5 days ago • 24
OSV: One Step is Enough for High-Quality Image to Video Generation Paper • 2409.11367 • Published 5 days ago • 12
A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B Paper • 2409.11055 • Published 5 days ago • 15
EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer Paper • 2409.10819 • Published 6 days ago • 14
PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing Paper • 2409.10831 • Published 6 days ago • 3
Policy Filtration in RLHF to Fine-Tune LLM for Code Generation Paper • 2409.06957 • Published 12 days ago • 5
One missing piece in Vision and Language: A Survey on Comics Understanding Paper • 2409.09502 • Published 8 days ago • 23
ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds Paper • 2409.09213 • Published 9 days ago • 10
Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types Paper • 2409.09269 • Published 9 days ago • 7
Seed-Music: A Unified Framework for High Quality and Controlled Music Generation Paper • 2409.09214 • Published 9 days ago • 43
A Diffusion Approach to Radiance Field Relighting using Multi-Illumination Synthesis Paper • 2409.08947 • Published 9 days ago • 11
InstantDrag: Improving Interactivity in Drag-based Image Editing Paper • 2409.08857 • Published 9 days ago • 28
DrawingSpinUp: 3D Animation from Single Character Drawings Paper • 2409.08615 • Published 9 days ago • 13
Apollo: Band-sequence Modeling for High-Quality Audio Restoration Paper • 2409.08514 • Published 10 days ago • 8
Click2Mask: Local Editing with Dynamic Mask Generation Paper • 2409.08272 • Published 10 days ago • 3
DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? Paper • 2409.07703 • Published 11 days ago • 59
MOSAIC: A Modular System for Assistive and Interactive Cooking Paper • 2402.18796 • Published Feb 29 • 23
Can OOD Object Detectors Learn from Foundation Models? Paper • 2409.05162 • Published 14 days ago • 5
PiTe: Pixel-Temporal Alignment for Large Video-Language Model Paper • 2409.07239 • Published 11 days ago • 11
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale Paper • 2409.08264 • Published 10 days ago • 40
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers Paper • 2409.04109 • Published 16 days ago • 37
TextBoost: Towards One-Shot Personalization of Text-to-Image Models via Fine-tuning Text Encoder Paper • 2409.08248 • Published 10 days ago • 12
DreamHOI: Subject-Driven Generation of 3D Human-Object Interactions with Diffusion Priors Paper • 2409.08278 • Published 10 days ago • 10
IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation Paper • 2409.08240 • Published 10 days ago • 15
Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources Paper • 2409.08239 • Published 10 days ago • 15
VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos Paper • 2409.07450 • Published 11 days ago • 10
Can Large Language Models Unlock Novel Scientific Research Ideas? Paper • 2409.06185 • Published 13 days ago • 9
gsplat: An Open-Source Library for Gaussian Splatting Paper • 2409.06765 • Published 12 days ago • 11
Gated Slot Attention for Efficient Linear-Time Sequence Modeling Paper • 2409.07146 • Published 11 days ago • 18
PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation Paper • 2409.06820 • Published 12 days ago • 56
MVLLaVA: An Intelligent Agent for Unified and Flexible Novel View Synthesis Paper • 2409.07129 • Published 11 days ago • 7
Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models Paper • 2409.07452 • Published 11 days ago • 18