Woodpecker: Hallucination Correction for Multimodal Large Language Models Paper • 2310.16045 • Published Oct 24, 2023 • 17
HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models Paper • 2310.14566 • Published Oct 23, 2023 • 27
SILC: Improving Vision Language Pretraining with Self-Distillation Paper • 2310.13355 • Published Oct 20, 2023 • 9
Loop Copilot: Conducting AI Ensembles for Music Generation and Iterative Editing Paper • 2310.12404 • Published Oct 19, 2023 • 15
MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models Paper • 2310.11954 • Published Oct 18, 2023 • 25
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model Paper • 2309.16058 • Published Sep 27, 2023 • 56
Jointly Training Large Autoregressive Multimodal Models Paper • 2309.15564 • Published Sep 27, 2023 • 8
Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions Paper • 2308.04152 • Published Aug 8, 2023 • 2
Multimodal Foundation Models: From Specialists to General-Purpose Assistants Paper • 2309.10020 • Published Sep 18, 2023 • 41
Language as the Medium: Multimodal Video Classification through text only Paper • 2309.10783 • Published Sep 19, 2023 • 1
Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants Paper • 2310.00653 • Published Oct 1, 2023 • 3
You Only Look at Screens: Multimodal Chain-of-Action Agents Paper • 2309.11436 • Published Sep 20, 2023 • 1
UniAudio: An Audio Foundation Model Toward Universal Audio Generation Paper • 2310.00704 • Published Oct 1, 2023 • 21
Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency Paper • 2310.03734 • Published Oct 5, 2023 • 15
Aligning Text-to-Image Diffusion Models with Reward Backpropagation Paper • 2310.03739 • Published Oct 5, 2023 • 22
Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation Paper • 2310.08541 • Published Oct 12, 2023 • 18
Aligning Large Multimodal Models with Factually Augmented RLHF Paper • 2309.14525 • Published Sep 25, 2023 • 31
Toward Joint Language Modeling for Speech Units and Text Paper • 2310.08715 • Published Oct 12, 2023 • 10
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation Paper • 2310.05737 • Published Oct 9, 2023 • 4
X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages Paper • 2305.04160 • Published May 7, 2023 • 2
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning Paper • 2310.09478 • Published Oct 14, 2023 • 21
Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models Paper • 2308.13437 • Published Aug 25, 2023 • 4
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities Paper • 2308.12966 • Published Aug 24, 2023 • 8
Ziya-VL: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning Paper • 2310.08166 • Published Oct 12, 2023 • 1
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models Paper • 2310.08825 • Published Oct 13, 2023 • 1
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs Paper • 2310.00582 • Published Oct 1, 2023 • 1
Kosmos-G: Generating Images in Context with Multimodal Large Language Models Paper • 2310.02992 • Published Oct 4, 2023 • 4
Evaluation and Mitigation of Agnosia in Multimodal Large Language Models Paper • 2309.04041 • Published Sep 7, 2023 • 1
An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models Paper • 2309.09958 • Published Sep 18, 2023 • 19
TextBind: Multi-turn Interleaved Multimodal Instruction-following Paper • 2309.08637 • Published Sep 14, 2023 • 8
LMDX: Language Model-based Document Information Extraction and Localization Paper • 2309.10952 • Published Sep 19, 2023 • 66
SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding Paper • 2310.15308 • Published Oct 23, 2023 • 23
CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images Paper • 2310.16825 • Published Oct 25, 2023 • 36
A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation Paper • 2310.16656 • Published Oct 25, 2023 • 46
Experts Weights Averaging: A New General Training Scheme for Vision Transformers Paper • 2308.06093 • Published Aug 11, 2023 • 2
Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks Paper • 2306.04073 • Published Jun 7, 2023 • 2
Sensitivity-Aware Visual Parameter-Efficient Fine-Tuning Paper • 2303.08566 • Published Mar 15, 2023 • 1
SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation Paper • 2205.08180 • Published May 17, 2022 • 1
Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception Paper • 2305.06324 • Published May 10, 2023 • 1
Meta-Transformer: A Unified Framework for Multimodal Learning Paper • 2307.10802 • Published Jul 20, 2023 • 44
Using Multiple Instance Learning to Build Multimodal Representations Paper • 2212.05561 • Published Dec 11, 2022 • 1
LMEye: An Interactive Perception Network for Large Language Models Paper • 2305.03701 • Published May 5, 2023 • 2
Concept-Oriented Deep Learning with Large Language Models Paper • 2306.17089 • Published Jun 29, 2023 • 1
AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models Paper • 2309.06495 • Published Sep 5, 2023 • 1
Multimodal Multi-Hop Question Answering Through a Conversation Between Tools and Efficiently Finetuned Large Language Models Paper • 2309.08922 • Published Sep 16, 2023 • 1
T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering Paper • 2305.03453 • Published May 5, 2023 • 1
ViperGPT: Visual Inference via Python Execution for Reasoning Paper • 2303.08128 • Published Mar 14, 2023 • 2
Visual Programming: Compositional visual reasoning without training Paper • 2211.11559 • Published Nov 18, 2022 • 1
Generalization Differences between End-to-End and Neuro-Symbolic Vision-Language Reasoning Systems Paper • 2210.15037 • Published Oct 26, 2022 • 1
Diversifying Joint Vision-Language Tokenization Learning Paper • 2306.03421 • Published Jun 6, 2023 • 2
Joint Adaptive Representations for Image-Language Learning Paper • 2305.19924 • Published May 31, 2023 • 1
TouchStone: Evaluating Vision-Language Models by Language Models Paper • 2308.16890 • Published Aug 31, 2023 • 1
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning Paper • 2309.07915 • Published Sep 14, 2023 • 4
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference Paper • 2310.04378 • Published Oct 6, 2023 • 21
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V Paper • 2310.11441 • Published Oct 17, 2023 • 28
An Image is Worth Multiple Words: Learning Object Level Concepts using Multi-Concept Prompt Learning Paper • 2310.12274 • Published Oct 18, 2023 • 13
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models Paper • 2310.11440 • Published Oct 17, 2023 • 17
Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers Paper • 2303.13755 • Published Mar 24, 2023 • 1
Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks Paper • 2310.19909 • Published Oct 30, 2023 • 21
i-Code Studio: A Configurable and Composable Framework for Integrative AI Paper • 2305.13738 • Published May 23, 2023 • 1
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn Paper • 2306.08640 • Published Jun 14, 2023 • 26
Corpus Synthesis for Zero-shot ASR domain Adaptation using Large Language Models Paper • 2309.10707 • Published Sep 18, 2023 • 2
M^3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning Paper • 2306.04387 • Published Jun 7, 2023 • 8
Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models Paper • 2308.00675 • Published Aug 1, 2023 • 36
Evaluating the Capability of Large-scale Language Models on Chinese Grammatical Error Correction Task Paper • 2307.03972 • Published Jul 8, 2023 • 1
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding Paper • 2306.17107 • Published Jun 29, 2023 • 11
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction Paper • 2305.18752 • Published May 30, 2023 • 4
PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts Paper • 2305.14839 • Published May 24, 2023 • 1
Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling Paper • 2311.00430 • Published Nov 1, 2023 • 58
Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data Paper • 2309.13876 • Published Sep 25, 2023 • 1
HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models Paper • 2309.15701 • Published Sep 27, 2023 • 2
Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition Paper • 2310.06434 • Published Oct 10, 2023 • 4
MSTRE-Net: Multistreaming Acoustic Modeling for Automatic Lyrics Transcription Paper • 2108.02625 • Published Aug 5, 2021 • 1
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing Paper • 2311.00571 • Published Nov 1, 2023 • 43
TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression For On-device ASR Models Paper • 2309.01947 • Published Sep 5, 2023 • 1
Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization Paper • 2211.02077 • Published Nov 3, 2022 • 1
MUTEX: Learning Unified Policies from Multimodal Task Specifications Paper • 2309.14320 • Published Sep 25, 2023 • 1
Linking Representations with Multimodal Contrastive Learning Paper • 2304.03464 • Published Apr 7, 2023 • 1
Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers Paper • 2211.11315 • Published Nov 21, 2022 • 1
Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency Paper • 2311.02772 • Published Nov 5, 2023 • 7
Multi-Mode Online Knowledge Distillation for Self-Supervised Visual Representation Learning Paper • 2304.06461 • Published Apr 13, 2023 • 1
UNFUSED: UNsupervised Finetuning Using SElf supervised Distillation Paper • 2303.05668 • Published Mar 10, 2023 • 1
One-Step Knowledge Distillation and Fine-Tuning in Using Large Pre-Trained Self-Supervised Learning Models for Speaker Verification Paper • 2305.17394 • Published May 27, 2023 • 1
PADA: Pruning Assisted Domain Adaptation for Self-Supervised Speech Representations Paper • 2203.16965 • Published Mar 31, 2022 • 1
Task-Agnostic Structured Pruning of Speech Representation Models Paper • 2306.01385 • Published Jun 2, 2023 • 1
Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation Paper • 2305.11685 • Published May 19, 2023 • 2
Beyond Universal Transformer: block reusing with adaptor in Transformer for automatic speech recognition Paper • 2303.13072 • Published Mar 23, 2023 • 1
MultiWay-Adapter: Adapting large-scale multi-modal models for scalable image-text retrieval Paper • 2309.01516 • Published Sep 4, 2023 • 1
Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning Paper • 2212.03220 • Published Dec 6, 2022 • 1
End-to-end Knowledge Retrieval with Multi-modal Queries Paper • 2306.00424 • Published Jun 1, 2023 • 1
A Symmetric Dual Encoding Dense Retrieval Framework for Knowledge-Intensive Visual Question Answering Paper • 2304.13649 • Published Apr 26, 2023 • 1
TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models Paper • 2311.04589 • Published Nov 8, 2023 • 23
CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning Paper • 2211.13218 • Published Nov 23, 2022 • 1
When Prompt-based Incremental Learning Does Not Meet Strong Pretraining Paper • 2308.10445 • Published Aug 21, 2023 • 1
PILOT: A Pre-Trained Model-Based Continual Learning Toolbox Paper • 2309.07117 • Published Sep 13, 2023 • 2
A Simple Baseline that Questions the Use of Pretrained-Models in Continual Learning Paper • 2210.04428 • Published Oct 10, 2022 • 1
A soft nearest-neighbor framework for continual semi-supervised learning Paper • 2212.05102 • Published Dec 9, 2022 • 1
Avalanche: an End-to-End Library for Continual Learning Paper • 2104.00405 • Published Apr 1, 2021 • 1
SequeL: A Continual Learning Library in PyTorch and JAX Paper • 2304.10857 • Published Apr 21, 2023 • 1
ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer Paper • 2306.06446 • Published Jun 10, 2023 • 1
An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training Paper • 2306.17165 • Published Jun 29, 2023 • 1
SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts Paper • 2105.03036 • Published May 7, 2021 • 2
Language-Routing Mixture of Experts for Multilingual and Code-Switching Speech Recognition Paper • 2307.05956 • Published Jul 12, 2023 • 1
JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models Paper • 2311.05997 • Published Nov 10, 2023 • 37
Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization Paper • 2311.06243 • Published Nov 10, 2023 • 22
FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores Paper • 2311.05908 • Published Nov 10, 2023 • 16
Continual Learning for Monolingual End-to-End Automatic Speech Recognition Paper • 2112.09427 • Published Dec 17, 2021 • 1
Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model Paper • 2208.08340 • Published Aug 17, 2022 • 1
MVP: Meta Visual Prompt Tuning for Few-Shot Remote Sensing Image Scene Classification Paper • 2309.09276 • Published Sep 17, 2023 • 1
Approximated Prompt Tuning for Vision-Language Pre-trained Models Paper • 2306.15706 • Published Jun 27, 2023 • 1
MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks Paper • 2311.07463 • Published Nov 13, 2023 • 15
LCM-LoRA: A Universal Stable-Diffusion Acceleration Module Paper • 2311.05556 • Published Nov 9, 2023 • 87
Image Super-resolution Via Latent Diffusion: A Sampling-space Mixture Of Experts And Frequency-augmented Decoder Approach Paper • 2310.12004 • Published Oct 18, 2023 • 2
From Words to Music: A Study of Subword Tokenization Techniques in Symbolic Music Generation Paper • 2304.08953 • Published Apr 18, 2023 • 2
Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition Paper • 2209.15176 • Published Sep 30, 2022 • 1
Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation Paper • 2309.08876 • Published Sep 16, 2023 • 1
Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models Paper • 2312.04410 • Published Dec 7, 2023 • 15
Phonetic-assisted Multi-Target Units Modeling for Improving Conformer-Transducer ASR system Paper • 2211.01571 • Published Nov 3, 2022 • 1
E-Branchformer: Branchformer with Enhanced merging for speech recognition Paper • 2210.00077 • Published Sep 30, 2022 • 2
Interpret Vision Transformers as ConvNets with Dynamic Convolutions Paper • 2309.10713 • Published Sep 19, 2023 • 1
EfficientFormer: Vision Transformers at MobileNet Speed Paper • 2206.01191 • Published Jun 2, 2022 • 1
COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models Paper • 2305.17235 • Published May 26, 2023 • 2
eP-ALM: Efficient Perceptual Augmentation of Language Models Paper • 2303.11403 • Published Mar 20, 2023 • 3
OneLLM: One Framework to Align All Modalities with Language Paper • 2312.03700 • Published Dec 6, 2023 • 24
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action Paper • 2312.17172 • Published Dec 28, 2023 • 29
Augmenting text for spoken language understanding with Large Language Models Paper • 2309.09390 • Published Sep 17, 2023 • 3
Audiobox: Unified Audio Generation with Natural Language Prompts Paper • 2312.15821 • Published Dec 25, 2023 • 17
Generative AI Beyond LLMs: System Implications of Multi-Modal Generation Paper • 2312.14385 • Published Dec 22, 2023 • 7
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training Paper • 2401.00849 • Published Jan 1, 2024 • 17
Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding Paper • 2311.15075 • Published Nov 25, 2023 • 1
MLLMs-Augmented Visual-Language Representation Learning Paper • 2311.18765 • Published Nov 30, 2023 • 1
InfMLLM: A Unified Framework for Visual-Language Tasks Paper • 2311.06791 • Published Nov 12, 2023 • 3
Generative Multimodal Models are In-Context Learners Paper • 2312.13286 • Published Dec 20, 2023 • 37
X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning Paper • 2311.18799 • Published Nov 30, 2023 • 1
SwitchGPT: Adapting Large Language Models for Non-Text Outputs Paper • 2309.07623 • Published Sep 14, 2023 • 1
DocLLM: A layout-aware generative language model for multimodal document understanding Paper • 2401.00908 • Published Dec 31, 2023 • 189
Diffusion Model Alignment Using Direct Preference Optimization Paper • 2311.12908 • Published Nov 21, 2023 • 50
Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis Paper • 2312.03491 • Published Dec 6, 2023 • 35
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models Paper • 2311.07919 • Published Nov 14, 2023 • 10
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models Paper • 2311.07575 • Published Nov 13, 2023 • 15
Towards General-Purpose Speech Abilities for Large Language Models Using Unpaired Data Paper • 2311.06753 • Published Nov 12, 2023 • 8
LayoutPrompter: Awaken the Design Ability of Large Language Models Paper • 2311.06495 • Published Nov 11, 2023 • 12
Honeybee: Locality-enhanced Projector for Multimodal LLM Paper • 2312.06742 • Published Dec 11, 2023 • 14
SCT: A Simple Baseline for Parameter-Efficient Fine-Tuning via Salient Channels Paper • 2309.08513 • Published Sep 15, 2023 • 1
SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models Paper • 2305.05189 • Published May 9, 2023 • 2
Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing Paper • 2304.02051 • Published Apr 4, 2023 • 4
DITTO: Diffusion Inference-Time T-Optimization for Music Generation Paper • 2401.12179 • Published Jan 22, 2024 • 22
StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion Paper • 2401.11053 • Published Jan 19, 2024 • 11
FreGrad: Lightweight and Fast Frequency-aware Diffusion Vocoder Paper • 2401.10032 • Published Jan 18, 2024 • 13
BAE-Net: A Low complexity and high fidelity Bandwidth-Adaptive neural network for speech super-resolution Paper • 2312.13722 • Published Dec 21, 2023 • 1
Incremental FastPitch: Chunk-based High Quality Text to Speech Paper • 2401.01755 • Published Jan 3, 2024 • 10
CoMoSVC: Consistency Model-based Singing Voice Conversion Paper • 2401.01792 • Published Jan 3, 2024 • 11
Towards High-Quality and Efficient Speech Bandwidth Extension with Parallel Amplitude and Phase Prediction Paper • 2401.06387 • Published Jan 12, 2024 • 1
Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder Paper • 2311.14957 • Published Nov 25, 2023 • 2
Factorization Vision Transformer: Modeling Long Range Dependency with Local Window Cost Paper • 2312.08614 • Published Dec 14, 2023 • 1
MM-LLMs: Recent Advances in MultiModal Large Language Models Paper • 2401.13601 • Published Jan 24, 2024 • 49
ModaVerse: Efficiently Transforming Modalities with LLMs Paper • 2401.06395 • Published Jan 12, 2024 • 3
Video Understanding with Large Language Models: A Survey Paper • 2312.17432 • Published Dec 29, 2023 • 3
Boosting Large Language Model for Speech Synthesis: An Empirical Study Paper • 2401.00246 • Published Dec 30, 2023 • 14
Towards Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs Paper • 2311.15759 • Published Nov 27, 2023 • 1
ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models Paper • 2401.13311 • Published Jan 24, 2024 • 12
Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models Paper • 2309.01479 • Published Sep 4, 2023 • 1
Parameter-Efficient Conformers via Sharing Sparsely-Gated Experts for End-to-End Speech Recognition Paper • 2209.08326 • Published Sep 17, 2022 • 1
Mixture-of-experts VAEs can disregard variation in surjective multimodal data Paper • 2204.05229 • Published Apr 11, 2022 • 1
One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code Paper • 2205.06126 • Published May 12, 2022 • 1
simple diffusion: End-to-end diffusion for high resolution images Paper • 2301.11093 • Published Jan 26, 2023 • 2
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model Paper • 2401.09417 • Published Jan 17, 2024 • 62
SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image Segmentation Paper • 2401.13560 • Published Jan 24, 2024 • 1
Vivim: a Video Vision Mamba for Medical Video Object Segmentation Paper • 2401.14168 • Published Jan 25, 2024 • 2
2-D SSM: A General Spatial Layer for Visual Transformers Paper • 2306.06635 • Published Jun 11, 2023 • 1
IconShop: Text-Guided Vector Icon Synthesis with Autoregressive Transformers Paper • 2304.14400 • Published Apr 27, 2023 • 4
Amphion: An Open-Source Audio, Music and Speech Generation Toolkit Paper • 2312.09911 • Published Dec 15, 2023 • 55
StarVector: Generating Scalable Vector Graphics Code from Images Paper • 2312.11556 • Published Dec 17, 2023 • 36
OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer Paper • 2401.16658 • Published Jan 30, 2024 • 14
GENOME: GenerativE Neuro-symbOlic visual reasoning by growing and reusing ModulEs Paper • 2311.04901 • Published Nov 8, 2023 • 11
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration Paper • 2311.04257 • Published Nov 7, 2023 • 22
u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model Paper • 2311.05348 • Published Nov 9, 2023 • 15
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents Paper • 2311.05437 • Published Nov 9, 2023 • 51
Empowering LLM to use Smartphone for Intelligent Task Automation Paper • 2308.15272 • Published Aug 29, 2023 • 1
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts Paper • 2206.02770 • Published Jun 6, 2022 • 3
FLatten Transformer: Vision Transformer using Focused Linear Attention Paper • 2308.00442 • Published Aug 1, 2023 • 1
PALO: A Polyglot Large Multimodal Model for 5B People Paper • 2402.14818 • Published Feb 22, 2024 • 25
SpeechAgents: Human-Communication Simulation with Multi-Modal Multi-Agent Systems Paper • 2401.03945 • Published Jan 8, 2024
TinyLLaVA: A Framework of Small-scale Large Multimodal Models Paper • 2402.14289 • Published Feb 22, 2024 • 21
MultiModN- Multimodal, Multi-Task, Interpretable Modular Networks Paper • 2309.14118 • Published Sep 25, 2023
DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models Paper • 2402.19481 • Published Feb 29, 2024 • 23
A Touch, Vision, and Language Dataset for Multimodal Alignment Paper • 2402.13232 • Published Feb 20, 2024 • 15
BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models Paper • 2402.13577 • Published Feb 21, 2024 • 10
Finetuned Multimodal Language Models Are High-Quality Image-Text Data Filters Paper • 2403.02677 • Published Mar 5, 2024 • 18
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models Paper • 2403.00231 • Published Mar 1, 2024 • 1
MoAI: Mixture of All Intelligence for Large Language and Vision Models Paper • 2403.07508 • Published Mar 12, 2024 • 77
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models Paper • 2403.13447 • Published Mar 20, 2024 • 19
MambaIR: A Simple Baseline for Image Restoration with State-Space Model Paper • 2402.15648 • Published Feb 23, 2024
FiT: Flexible Vision Transformer for Diffusion Model Paper • 2402.12376 • Published Feb 19, 2024 • 49
SSM Meets Video Diffusion Models: Efficient Video Generation with Structured State Spaces Paper • 2403.07711 • Published Mar 12, 2024
LocalMamba: Visual State Space Model with Windowed Selective Scan Paper • 2403.09338 • Published Mar 14, 2024 • 9
VideoMamba: State Space Model for Efficient Video Understanding Paper • 2403.06977 • Published Mar 11, 2024 • 31
An Embarrassingly Simple Approach for LLM with Strong ASR Capacity Paper • 2402.08846 • Published Feb 13, 2024 • 1
Transparent Image Layer Diffusion using Latent Transparency Paper • 2402.17113 • Published Feb 27, 2024 • 5
On Speculative Decoding for Multimodal Large Language Models Paper • 2404.08856 • Published Apr 13, 2024 • 13
Good Seed Makes a Good Crop: Discovering Secret Seeds in Text-to-Image Diffusion Models Paper • 2405.14828 • Published May 23, 2024
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding Paper • 2311.08046 • Published Nov 14, 2023 • 2
UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models Paper • 2405.10311 • Published May 16, 2024
Speak While You Think: Streaming Speech Synthesis During Text Generation Paper • 2309.11210 • Published Sep 20, 2023
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens Paper • 2406.11271 • Published Jun 17, 2024 • 21
GrootVL: Tree Topology is All You Need in State Space Model Paper • 2406.02395 • Published Jun 4, 2024 • 1
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows Paper • 2505.19897 • Published May 26, 2025 • 102