This repository contains the 1B model presented in the paper UniLIP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing.
UniLIP proposes a unified, CLIP-based encoder that captures both rich semantics and fine-grained image details. Through a two-stage training scheme with self-distillation for reconstruction, CLIP is empowered to achieve excellent reconstruction quality without compromising its original understanding abilities. Leveraging this powerful unified representation, UniLIP excels across understanding, generation, and editing tasks.
For more details, please refer to the original paper and the GitHub repository:
Paper: https://www.arxiv.org/abs/2507.23278
GitHub: https://github.com/nnnth/UniLIP
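A minimal loading sketch is shown below. It assumes the checkpoint follows the usual transformers AutoModel / trust_remote_code convention used by InternVL3-based models; the exact interface may differ, so please consult the GitHub repository for the authors' actual loading and inference scripts.

```python
# Hypothetical loading sketch (not the authors' official usage).
# Assumption: the repo ships custom modeling code loadable via trust_remote_code.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "kanashi6/UniLIP-1B"  # this Hugging Face repository

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval().cuda()
```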
Model tree for kanashi6/UniLIP-1B
- Base model: OpenGVLab/InternVL3-1B-Pretrained
- Finetuned: OpenGVLab/InternVL3-1B-Instruct
- Finetuned: OpenGVLab/InternVL3-1B