microsoft/llava-rad · Questions regarding training and Heatmap process

Dear LLaVA-Rad Team,

My name is Sreyes Venkatesh, and I am currently a Radiology Research Fellow at University Hospital. I have been working with LLaVA-Rad to evaluate whether it can be deployed in a hospital setting, and I have a few questions I was hoping you could answer:

1. In a clinical setting, a radiologist compares the scan of interest with a prior scan of the same patient when one is available. Although MIMIC-CXR reports mention comparisons with prior studies, and the dataset includes both frontal and lateral views, I noticed that only single frontal views were used during evaluation. Would I be correct in assuming that LLaVA-Rad is not explicitly designed to use a patient's prior scan as context when generating a report, i.e., that it processes only a single image at a time? Was there a specific reason for this design choice?
2. I have been unable to find any code in the GitHub repository for generating the attention maps presented in the paper. Has the code for visualizing the LLaVA attention maps been released? If not, I would greatly appreciate any details on how you generated them.

LLaVA-Rad is a very exciting model, and I would love to be able to bridge the gap between an academic setting and real-world deployment. Any advice would be greatly appreciated. Thank you very much for your time.

Sincerely,
Sreyes Venkatesh