# Robin-LLaVA
**A LLaVA-fork enabling the Mistral-7B & Open-Hermes-2.5 language models to process images**
This repo and the associated models were created in collaboration between the [Robin team at AGI-Collective](#team) and [Simon Ramstedt](https://simonramstedt.com/), with computing resources from [Hessian-AI](https://hessian.ai/) and [OLCF](https://www.olcf.ornl.gov/).
The Robin Effort: We study different components and methods to merge pretrained vision and language models with the goal to build better visual understanding models. As part of this first milestone we study the pretrained LLMs (Vicuna, Mistral and OpenHermes 2.5) and Vision models (CLIP and SigLIP), further improving capabilities by finetuning the vision encoder.
## Installation
Ideally install into an empty venv (`python -m venv venv && source venv/bin/activate`)
```bash
pip install git+ssh://git@github.com/agi-collective/robin-llava.git
```
Note that currently Robin-LlaVA inference requires a Nvidia GPU with 24GB+ since CPU inference and quantized inference don't work reliably yet. Training was done on a machine with 8 x A100 80GB Nvidia GPUs provided by [Hessian-AI](https://hessian.ai/).
## Run interactive command line interface
```bash
python -m llava.serve.cli \
--model-path agi-collective/mistral-7b-oh-siglip-so400m-finetune-lora \
--model-base teknium/OpenHermes-2.5-Mistral-7B \
--image-file https://llava-vl.github.io/static/images/view.jpg
```
## Use as library
```python
from llava.serve.pipeline import LlavaMistralPipeline
pipe = LlavaMistralPipeline(
model_path="agi-collective/mistral-7b-oh-siglip-so400m-finetune-lora",
model_base="teknium/OpenHermes-2.5-Mistral-7B",
)
messages = [
{"role": "USER", "content": "What's in the image?", "image": "https://llava-vl.github.io/static/images/view.jpg"},
]
messages = pipe(messages)
# returns original messages list plus the new response, i.e.:
# {"role": "ASSISTANT", "content": ...}
```
## Available models
We use the following components:
- **Base LLM**: We explore using Vicuna, Mistral and OpenHermes-2.5
- **Base Vision Model**: We use the SigLIP model since it gives stronger performance on vision benchmarks compared to CLIP
- We finetune the Vision Encoder hoping the next token prediction loss helps further improves the vision capabilities of the pretrained vision encoder
| Model | Base | GQA | SQA Text | SQA Image |
| ------------------------------------------------------------------ | --------------------------------- | ----- | -------- | --------- |
| liuhaotian/llava-v1.5-7b | lmsys/vicuna-7b-v1.5 | 62 | 70.43 | 66.8 |
| liuhaotian/llava-v1.5-13b | lmsys/vicuna-7b-v1.5 | 63.3 | | 71.6 |
| [agi-collective/vicuna-7b-clip-finetune-lora](https://huggingface.co/agi-collective/vicuna-7b-clip-finetune-lora) | lmsys/vicuna-7b-v1.5 | **62.04** | 70.86 | 68.72 |
| [agi-collective/vicuna-7b-siglip-so400m-finetune-lora](https://huggingface.co/agi-collective/vicuna-7b-siglip-so400m-finetune-lora) | lmsys/vicuna-7b-v1.5 | 56.79 | 68.76 | 67.48 |
| [agi-collective/mistral-7b-siglip-so400m-finetune-lora](https://huggingface.co/agi-collective/mistral-7b-siglip-so400m-finetune-lora) | mistralai/Mistral-7B-v0.1 | 49.44 | 73.66 | 68.57 |
| [agi-collective/mistral-7b-oh-siglip-so400m-frozen-ve-finetune-lora](https://huggingface.co/agi-collective/mistral-7b-oh-siglip-so400m-frozen-ve-finetune-lora) | teknium/OpenHermes-2.5-Mistral-7B | 53.59 | 78.17 | 72.73 |
| [agi-collective/mistral-7b-oh-siglip-so400m-finetune-lora](https://huggingface.co/agi-collective/mistral-7b-oh-siglip-so400m-finetune-lora) | teknium/OpenHermes-2.5-Mistral-7B | 54.48 | **79.56** | **74.22** |
(best 7B model results **highlighted**)
## Evaluations
Evaluations were done using the a subset of the evaluation suite in https://github.com/haotian-liu/llava. More evaluations including both vision language and language only benchmarks coming soon!
## Training
For training details see the `pretrain.sh` and `finetune_lora.sh` scripts inside `scripts/*/`. The training code is for information purposes only and we won't be able help people debug problems with it for this preliminary release. More rigorous and less rushed and messy training scripts will be released in the future.
## Examples
**User:** What's in this image?
**Llava7b:**
The image features a pier extending out over a large body of water, likely a lake. The pier is made of wood and has a bench at the end, providing a place for people to sit and enjoy the view. The water appears calm and serene, making it an ideal spot for relaxation.
**HermesSiglip (Ours):**
The image features a wooden pier extending out into a large body of water. The pier is surrounded by a serene and calm lake, providing a peaceful atmosphere. The pier appears to be empty, with no people or boats visible in the scene. The water is calm and still, creating a tranquil environment for those who visit the pier.
---
**User:** Will these magnets attract or repel each other?
**Llava7b:**
The magnets in the image are of different colors and placed next to each other. It is not possible to determine their polarity from the image alone, so it is not possible to definitively say whether they will attract or repel each other. The behavior of magnets depends on their polarity and the orientation of their magnetic fields. To determine their interaction, one would need to observe the magnets in a more controlled environment, such as a magnetic field tester or a magnetic compass.
**HermesSiglip (Ours):**
In the image, there are two magnets placed on a gray background, one with a green stripe and the other with a blue stripe. Since the magnets have opposite colors, they will attract each other due to their magnetic properties. The magnets are aligned in a way that their magnetic poles are facing each other, which causes them to be attracted to one another. This phenomenon is known as the magnetic attraction between two oppositely charged magnets.
## Limitations
For the Open-Hermes-2.5-based model we switched away from the ChatML chat template and back to the Llava default Vicuna chat template. The benchmarks and example results below show that this worked fine but we do believe that we can achieve even better performance and usability by retraining the model with ChatML.
## AGI-Collective - Robin Team
Alexis Roger, Andrew R Williams, Daniel Kaplan, Edwin Fennell, George Adamopoulos, Kshitij Gupta, Prateek Humane, Quentin Anthony, Rishika Bhagwatkar, Sun Qi, Yuchen Lu, Irina Rish
## Acknowledgements
We would like to thank Hessian-AI for providing us with free access to 8-16 A100 GPUs for a few weeks and to Florian and Patrick at Hessian AI for their support. We would also like to thank Oak Ridge Leadership Computing Facility (OLCF), the DOE Office of Science User Facility. Prelimnary experiments were conducted on the INCITE compute grant on Summit supercomputer supported under Contract DE-AC05-00OR22725. This grant was awarded to AAI CERC lab for their Scalable Foundation Models for Transferrable Generalist AI project. This work was in collaboration with representatives from EleutherAI. The code in this repo is based on [github.com/haotian-liu/LLaVA](https://github.com/haotian-liu/LLaVA).