---
license: mit
base_model:
- fancyfeast/llama-joycaption-alpha-two-hf-llava
language:
- en
pipeline_tag: image-text-to-text
tags:
- captioning
- llava
---

# Llama Joycaption Alpha Two hf Llava FP8 Dynamic

An FP8 compression of the [Llama JoyCaption Alpha Two model made by fancyfeast](https://huggingface.co/fancyfeast/llama-joycaption-alpha-two-hf-llava "Llama JoyCaption Alpha Two model made by fancyfeast"), made with [llm-compressor](https://github.com/vllm-project/llm-compressor "llm-compressor") and compatible with [vllm](https://github.com/vllm-project/vllm "vllm").

I have tested the model personally, though not with a formal evaluation, and it works well for my usage pattern.

All credit goes to **fancyfeast**; see the [official model page](https://huggingface.co/fancyfeast/llama-joycaption-alpha-two-hf-llava "official model page") for more details.

## How to Get Started with the Model

Usage is the same as for Llama JoyCaption Alpha Two. You need the [compressed-tensors](https://github.com/neuralmagic/compressed-tensors "compressed-tensors") library to run the code below with the FP8 weights.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

IMAGE_PATH = "image.jpg"
PROMPT = "Write a long descriptive caption for this image in a formal tone."
MODEL_NAME = "JKCHSTR/llama-joycaption-alpha-two-hf-llava-FP8-Dynamic"

# Load JoyCaption
# bfloat16 is the native dtype of the LLM used in JoyCaption (Llama 3.1)
# device_map=0 loads the model onto the first GPU
processor = AutoProcessor.from_pretrained(MODEL_NAME)
llava_model = LlavaForConditionalGeneration.from_pretrained(MODEL_NAME, torch_dtype="bfloat16", device_map=0)
llava_model.eval()

with torch.no_grad():
    # Load the image
    image = Image.open(IMAGE_PATH)

    # Build the conversation
    convo = [
        {
            "role": "system",
            "content": "You are a helpful image captioner.",
        },
        {
            "role": "user",
            "content": PROMPT,
        },
    ]

    # Format the conversation
    # WARNING: HF's handling of chats on Llava models is very fragile. This specific combination of
    # processor.apply_chat_template() and processor() works, but if you use other combinations, always
    # inspect the final input_ids to ensure they are correct. Often you will end up with multiple <BOS>
    # tokens if not careful, which can make the model perform poorly.
    convo_string = processor.apply_chat_template(convo, tokenize=False, add_generation_prompt=True)
    assert isinstance(convo_string, str)

    # Process the inputs
    inputs = processor(text=[convo_string], images=[image], return_tensors="pt").to('cuda')
    inputs['pixel_values'] = inputs['pixel_values'].to(torch.bfloat16)

    # Generate the caption
    generate_ids = llava_model.generate(
        **inputs,
        max_new_tokens=300,
        do_sample=True,
        suppress_tokens=None,
        use_cache=True,
        temperature=0.6,
        top_k=None,
        top_p=0.9,
    )[0]

    # Trim off the prompt
    generate_ids = generate_ids[inputs['input_ids'].shape[1]:]

    # Decode the caption
    caption = processor.tokenizer.decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    caption = caption.strip()
    print(caption)
```
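
## Running with vLLM

Since the checkpoint is compatible with vLLM, it can also be run there. Below is a minimal sketch of offline inference with vLLM's Python API, assuming a recent vLLM build with multimodal and compressed-tensors support. The prompt is built with the model's own chat template, exactly as in the transformers example above; the engine arguments (e.g. `max_model_len=4096`) are illustrative values, not tuned ones.

```python
from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

MODEL_NAME = "JKCHSTR/llama-joycaption-alpha-two-hf-llava-FP8-Dynamic"

# Build the prompt string with the model's chat template so the image
# placeholder tokens end up in the right place, as in the example above.
processor = AutoProcessor.from_pretrained(MODEL_NAME)
convo = [
    {"role": "system", "content": "You are a helpful image captioner."},
    {"role": "user", "content": "Write a long descriptive caption for this image in a formal tone."},
]
prompt = processor.apply_chat_template(convo, tokenize=False, add_generation_prompt=True)

# max_model_len=4096 is an illustrative value, adjust to your GPU memory
llm = LLM(model=MODEL_NAME, max_model_len=4096)

outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": Image.open("image.jpg")},
    },
    SamplingParams(max_tokens=300, temperature=0.6, top_p=0.9),
)
print(outputs[0].outputs[0].text.strip())
```

The same checkpoint should also work with `vllm serve JKCHSTR/llama-joycaption-alpha-two-hf-llava-FP8-Dynamic`, queried through the OpenAI-compatible chat API.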
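
## Reproducing the FP8-Dynamic compression (sketch)

The checkpoint was produced with llm-compressor's FP8-Dynamic scheme. The exact recipe used for this repository is not documented here; the snippet below is a generic sketch of how such a quantization is typically done with llm-compressor, with the vision tower, multimodal projector, and `lm_head` kept in higher precision (a common choice for LLaVA-style models, assumed rather than confirmed for this checkpoint).

```python
from transformers import AutoProcessor, LlavaForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

BASE_MODEL = "fancyfeast/llama-joycaption-alpha-two-hf-llava"

model = LlavaForConditionalGeneration.from_pretrained(BASE_MODEL, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(BASE_MODEL)

# FP8_DYNAMIC: per-channel FP8 weights with dynamic per-token activation
# quantization, so no calibration dataset is required.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:.*vision_tower.*", "re:.*multi_modal_projector.*"],
)

oneshot(model=model, recipe=recipe)

SAVE_DIR = BASE_MODEL.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)
```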