Prompt as Knowledge Bank: Boost Vision-language model via Structural Representation for zero-shot medical detection Paper • 2502.16223 • Published Feb 22
RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers Paper • 2502.14377 • Published Feb 20 • 12
Bridge Diffusion Model: bridge non-English language-native text-to-image diffusion model with English communities Paper • 2309.00952 • Published Sep 2, 2023
FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance Paper • 2408.08189 • Published Aug 15, 2024 • 17
Qihoo-T2X: An Efficiency-Focused Diffusion Transformer via Proxy Tokens for Text-to-Any-Task Paper • 2409.04005 • Published Sep 6, 2024 • 19
Would you share the total training cost info? The training of IDEFICS2-8B used "approximately 1.5 billion images and 225 billion text tokens", which is quite huge for an 8B-sized LMM.