---
license: mit
base_model:
- fancyfeast/llama-joycaption-alpha-two-hf-llava
language:
- en
pipeline_tag: image-text-to-text
tags:
- captioning
- llava
---

# Llama Joycaption Alpha Two hf Llava FP8 Dynamic

An FP8 compression of the [Llama JoyCaption Alpha Two model made by fancyfeast](https://huggingface.co/fancyfeast/llama-joycaption-alpha-two-hf-llava "Llama JoyCaption Alpha Two model made by fancyfeast"), made with [llm-compressor](https://github.com/vllm-project/llm-compressor "llm-compressor") and compatible with [vllm](https://github.com/vllm-project/vllm "vllm").

I have tested the model personally, though not with a formal evaluation, and it works well for my usage pattern.

All credit goes to **fancyfeast**; see the [official model page](https://huggingface.co/fancyfeast/llama-joycaption-alpha-two-hf-llava "official model page") for more details.

## How to Get Started with the Model

Usage is the same as for Llama JoyCaption Alpha Two. You need the [compressed-tensors](https://github.com/neuralmagic/compressed-tensors "compressed-tensors") library to run the code below with the FP8 weights.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

IMAGE_PATH = "image.jpg"
PROMPT = "Write a long descriptive caption for this image in a formal tone."
MODEL_NAME = "JKCHSTR/llama-joycaption-alpha-two-hf-llava-FP8-Dynamic"

# Load JoyCaption
# bfloat16 is the native dtype of the LLM used in JoyCaption (Llama 3.1)
# device_map=0 loads the model onto the first GPU
processor = AutoProcessor.from_pretrained(MODEL_NAME)
llava_model = LlavaForConditionalGeneration.from_pretrained(MODEL_NAME, torch_dtype="bfloat16", device_map=0)
llava_model.eval()

with torch.no_grad():
    # Load the image
    image = Image.open(IMAGE_PATH)

    # Build the conversation
    convo = [
        {
            "role": "system",
            "content": "You are a helpful image captioner.",
        },
        {
            "role": "user",
            "content": PROMPT,
        },
    ]

    # Format the conversation
    # WARNING: HF's handling of chats on Llava models is very fragile. This specific combination of
    # processor.apply_chat_template() and processor() works, but if you use other combinations, always
    # inspect the final input_ids to ensure they are correct. Often you will end up with multiple <BOS>
    # tokens if not careful, which can make the model perform poorly.
    convo_string = processor.apply_chat_template(convo, tokenize=False, add_generation_prompt=True)
    assert isinstance(convo_string, str)

    # Process the inputs
    inputs = processor(text=[convo_string], images=[image], return_tensors="pt").to('cuda')
    inputs['pixel_values'] = inputs['pixel_values'].to(torch.bfloat16)

    # Generate the caption
    generate_ids = llava_model.generate(
        **inputs,
        max_new_tokens=300,
        do_sample=True,
        suppress_tokens=None,
        use_cache=True,
        temperature=0.6,
        top_k=None,
        top_p=0.9,
    )[0]

    # Trim off the prompt
    generate_ids = generate_ids[inputs['input_ids'].shape[1]:]

    # Decode the caption
    caption = processor.tokenizer.decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    caption = caption.strip()
    print(caption)
```
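
## Running with vLLM

Since the checkpoint is compatible with vLLM, it can also be run there. Below is a minimal sketch of offline inference with vLLM's Python API, assuming a recent vLLM build with multimodal and compressed-tensors support. The prompt is built with the model's own chat template, exactly as in the transformers example above; the engine arguments (e.g. `max_model_len=4096`) are illustrative values, not tuned ones.

```python
from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

MODEL_NAME = "JKCHSTR/llama-joycaption-alpha-two-hf-llava-FP8-Dynamic"

# Build the prompt string with the model's chat template so the image
# placeholder tokens end up in the right place, as in the example above.
processor = AutoProcessor.from_pretrained(MODEL_NAME)
convo = [
    {"role": "system", "content": "You are a helpful image captioner."},
    {"role": "user", "content": "Write a long descriptive caption for this image in a formal tone."},
]
prompt = processor.apply_chat_template(convo, tokenize=False, add_generation_prompt=True)

# max_model_len=4096 is an illustrative value, adjust to your GPU memory
llm = LLM(model=MODEL_NAME, max_model_len=4096)

outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": Image.open("image.jpg")},
    },
    SamplingParams(max_tokens=300, temperature=0.6, top_p=0.9),
)
print(outputs[0].outputs[0].text.strip())
```

The same checkpoint should also work with `vllm serve JKCHSTR/llama-joycaption-alpha-two-hf-llava-FP8-Dynamic`, queried through the OpenAI-compatible chat API.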
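
## Reproducing the FP8-Dynamic compression (sketch)

The checkpoint was produced with llm-compressor's FP8-Dynamic scheme. The exact recipe used for this repository is not documented here; the snippet below is a generic sketch of how such a quantization is typically done with llm-compressor, with the vision tower, multimodal projector, and `lm_head` kept in higher precision (a common choice for LLaVA-style models, assumed rather than confirmed for this checkpoint).

```python
from transformers import AutoProcessor, LlavaForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

BASE_MODEL = "fancyfeast/llama-joycaption-alpha-two-hf-llava"

model = LlavaForConditionalGeneration.from_pretrained(BASE_MODEL, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(BASE_MODEL)

# FP8_DYNAMIC: per-channel FP8 weights with dynamic per-token activation
# quantization, so no calibration dataset is required.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:.*vision_tower.*", "re:.*multi_modal_projector.*"],
)

oneshot(model=model, recipe=recipe)

SAVE_DIR = BASE_MODEL.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)
```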