fancyfeast/llama-joycaption-alpha-two-hf-llava · Generate caption based on vision with given tags as assistance

I wanted to ask if it's possible to finetune this model to also accept tags as guidance. The model mostly generates good captions based on what it "sees", but for me it often misses things or describes something that isn't even in the image. I have accurate tags for my images, but I want captions rather than tags for Flux LoRA training.

Is this something that would be doable? I know how to finetune this model, but I don't know if it would even work, if you get what I'm saying.
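
To make the idea concrete, here is a rough sketch of the kind of training example I have in mind: the tags go into the text prompt, and the target is the caption. Everything here is an assumption for illustration. The helper name `build_tag_guided_example`, the prompt wording, and the `prompt`/`response` keys are hypothetical, not the model's documented prompt format or any library's API.

```python
from typing import Dict, List


def build_tag_guided_example(tags: List[str], caption: str) -> Dict[str, str]:
    """Pair a tag-conditioned prompt with a target caption (hypothetical format)."""
    # Normalize tags: strip whitespace, drop empties, remove duplicates, keep order.
    seen = set()
    clean_tags = []
    for tag in tags:
        tag = tag.strip()
        if tag and tag not in seen:
            seen.add(tag)
            clean_tags.append(tag)

    # The prompt wording below is an assumption, not the model's official prompt.
    prompt = (
        "Write a descriptive caption for this image. "
        "Use these tags as hints about what is in the image: "
        + ", ".join(clean_tags)
    )
    return {"prompt": prompt, "response": caption.strip()}


example = build_tag_guided_example(
    ["1girl", "red dress", " outdoors ", "red dress"],
    "A woman in a red dress stands outdoors.",
)
print(example["prompt"])
```

Each image would then be paired with one such prompt and caption during finetuning, so the model learns to treat the tags as hints rather than ignoring them.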