---
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- research
---

**CAT** is a minimal, lightweight Vision-Language Model (VLM) designed for efficient training and experimentation. Built in pure PyTorch, the entire model architecture and training logic fit within ~750 lines of code. It combines a ViT-based image encoder (SigLIP-B/16-224-85M) with a lightweight causal language model (SmolLM2-135M), resulting in a compact 222M-parameter model.

**Usage:** Follow the installation instructions, then load the pretrained model:

```python
from models.vision_language_model import CAT

model = CAT.from_pretrained("mahwizzzz/CAT")
```
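
A fuller inference run might look like the sketch below. This is a hedged example, not CAT's documented API: the `generate()` signature is an assumption modeled on similar compact VLMs, and reusing the SmolLM2-135M tokenizer and a plain 224x224 resize for the SigLIP encoder are likewise assumptions; check the repository for the actual preprocessing helpers.

```python
# Minimal inference sketch (assumed API, see caveats above).
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoTokenizer

from models.vision_language_model import CAT

model = CAT.from_pretrained("mahwizzzz/CAT")
model.eval()

# Assumption: CAT uses the tokenizer of its SmolLM2-135M backbone.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

# Assumption: 224x224 input with SigLIP-style [-1, 1] normalization.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)  # (1, 3, 224, 224)
prompt_ids = tokenizer("Describe this image:", return_tensors="pt").input_ids

with torch.no_grad():
    # Assumed signature: generate(input_ids, image_tensor, max_new_tokens) -> token ids.
    out_ids = model.generate(prompt_ids, image, max_new_tokens=32)

print(tokenizer.decode(out_ids[0], skip_special_tokens=True))
```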
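
For orientation, the sketch below shows how a ViT encoder is typically wired to a causal LM in compact VLMs of this kind: patch embeddings from the vision encoder are projected into the LM's embedding space and prepended to the text token embeddings. The module names, the `embed` helper, and the hidden sizes (768 for SigLIP-B, 576 for SmolLM2-135M) are illustrative assumptions, not CAT's actual code.

```python
# Illustrative wiring of a ViT + causal-LM combination; not CAT's real modules.
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self, vision_encoder, language_model, vit_dim=768, lm_dim=576):
        super().__init__()
        self.vision_encoder = vision_encoder          # e.g. a SigLIP-B/16-224 backbone
        self.language_model = language_model          # e.g. a SmolLM2-135M decoder
        self.projector = nn.Linear(vit_dim, lm_dim)   # maps patch embeddings into LM space

    def forward(self, input_ids, pixel_values):
        patches = self.vision_encoder(pixel_values)           # (B, num_patches, vit_dim)
        vis_tokens = self.projector(patches)                  # (B, num_patches, lm_dim)
        txt_tokens = self.language_model.embed(input_ids)     # hypothetical embedding lookup
        inputs = torch.cat([vis_tokens, txt_tokens], dim=1)   # image tokens come first
        return self.language_model(inputs_embeds=inputs)      # next-token logits
```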