Llama-2-13b-deepspeed-visualchat
ATTENTION: this model requires the QwenCLIP model as its visual encoder.
DeepSpeed-VisualChat is a scalable, efficient, and user-friendly multi-modal training pipeline that leverages a novel multi-modal causal attention mechanism for better alignment of visual and text features. It uses data blending techniques to address the scarcity of interleaved text-and-image inputs in datasets.
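As a rough illustration of that attention structure, here is a simplified, mask-only sketch in PyTorch: image tokens attend only among themselves, while text tokens attend causally to all earlier tokens, image or text. This is an assumption about the mask shape only; it does not reproduce the framework's actual implementation (for example, any modality-specific attention weights).

```python
import torch

def mmca_mask(modality: torch.Tensor) -> torch.Tensor:
    """Simplified multi-modal causal attention mask (illustrative only).

    modality: 1-D tensor, 0 for image tokens, 1 for text tokens.
    Returns a boolean [seq, seq] mask, True where attention is allowed.
    Image tokens attend only to earlier image tokens; text tokens attend
    causally to every earlier token of either modality.
    """
    n = modality.shape[0]
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    is_text = modality.bool()
    is_image = ~is_text
    # Text rows: standard causal attention over all earlier tokens.
    text_rows = causal & is_text.unsqueeze(1)
    # Image rows: causal attention restricted to image tokens.
    image_rows = causal & is_image.unsqueeze(1) & is_image.unsqueeze(0)
    return text_rows | image_rows

if __name__ == "__main__":
    modality = torch.tensor([0, 0, 0, 1, 1])  # 3 image tokens, then 2 text tokens
    print(mmca_mask(modality).int())
```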
The framework pairs a 2B visual encoder from Qwen-VL with a LLaMA-2 language decoder ranging from 13B to 70B parameters, demonstrating that the approach scales across decoder sizes. DeepSpeed-VisualChat is open-sourced, and community contributions and collaborations are encouraged; visit the GitHub page to get started.
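A minimal wiring sketch of that encoder-decoder pairing, assuming Hugging Face transformers and PyTorch: a CLIP-style vision tower stands in for the Qwen-VL encoder, and the model IDs and the linear projector below are illustrative assumptions, not DeepSpeed-VisualChat's actual code.

```python
import torch
from transformers import AutoModelForCausalLM, CLIPVisionModel

# Stand-in vision tower (the real pipeline uses a ~2B Qwen-VL encoder).
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")

# LLaMA-2 decoder; 13B here, up to 70B in the framework.
decoder = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf", torch_dtype=torch.float16
)

# Hypothetical linear projector mapping patch features into the decoder's
# hidden space, so image tokens can be concatenated with text embeddings.
projector = torch.nn.Linear(
    vision.config.hidden_size, decoder.config.hidden_size
)

pixel_values = torch.randn(1, 3, 224, 224)            # dummy image batch
patch_feats = vision(pixel_values).last_hidden_state  # [1, patches+1, 1024]
image_embeds = projector(patch_feats)                 # [1, patches+1, 5120]
```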