---
license: apache-2.0
language:
- de
- en
base_model:
- allenai/Molmo-7B-D-0924
pipeline_tag: image-text-to-text
library_name: transformers
---

# ELAM-7B

ELAM (Evaluative Large Action Model) is a Large Action Model (LAM) based on Molmo 7B-D that, in addition to grounding UI actions, can evaluate user expectations on screenshots of user interfaces.

It was fine-tuned on 17,708 instructions and evaluations covering 6,230 automotive UI images containing German and English text. All training prompts were written in English; German on-screen content was either translated or quoted directly.

The evaluation dataset [AutomotiveUI-Bench-4K](https://huggingface.co/datasets/sparks-solutions/AutomotiveUI-Bench-4K) is available on Hugging Face.

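For a quick look at the benchmark, the dataset can be pulled with the `datasets` library (pinned in the Quick-Start below). The sketch makes no assumptions about split or column names and only prints the dataset structure and a first record:

```python
from datasets import load_dataset

# Download AutomotiveUI-Bench-4K from the Hugging Face Hub and inspect its structure.
# Split/column names are not assumed here; check the dataset card for the exact schema.
ds = load_dataset("sparks-solutions/AutomotiveUI-Bench-4K")
print(ds)  # available splits and their columns

first_split = next(iter(ds.values()))
print(first_split[0])  # first record of the first split
```
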
# Results

## AutomotiveUI-Bench-4K

| Model | Test Action Grounding | Expected Result Grounding | Expected Result Evaluation |
|---|---|---|---|
| InternVL2.5-8B | 26.6 | 5.7 | 64.8 |
| TinyClick | 61.0 | 54.6 | - |
| UGround-V1-7B (Qwen2-VL) | 69.4 | 55.0 | - |
| Molmo-7B-D-0924 | 71.3 | 71.4 | 66.9 |
| LAM-270M (TinyClick) | 73.9 | 59.9 | - |
| ELAM-7B (Molmo) | **87.6** | **77.5** | **78.2** |

# Quick-Start

```bash
conda create -n elam python=3.10 -y
conda activate elam
pip install datasets==3.5.0 einops==0.8.1 torchvision==0.20.1 accelerate==1.6.0
pip install transformers==4.48.2
```

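Optionally (not part of the original setup), a quick check that the pinned `transformers` version is active and a GPU is visible:

```python
# Optional environment sanity check.
import torch
import transformers

print(transformers.__version__)   # expected: 4.48.2
print(torch.cuda.is_available())  # True if a CUDA device is visible
```

The snippet below then loads the model and defines the prompt and parsing helpers.
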
```python
import re

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

# Load processor
model_name = "sparks-solutions/ELAM-7B"
processor = AutoProcessor.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype="bfloat16", device_map="auto"
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype="bfloat16", device_map="auto"
)


def preprocess_elam_prompt(user_request: str, label_class: str) -> str:
    """Apply the ELAM prompt template that matches the request class."""
    if label_class == "Expected Result":
        return f"Evaluate this statement about the image:\n'{user_request}'\nThink step by step, conclude whether the evaluation is 'PASSED' or 'FAILED' and point to the UI element that corresponds to this evaluation."
    elif label_class == "Test Action":
        return f"Identify and point to the UI element that corresponds to this test action:\n{user_request}"
    raise ValueError(f"Unknown label class: {label_class!r}")


def postprocess_response_elam(response: str) -> list:
    """Parse Molmo-style point coordinates from the response and return [x, y] as floats in [0, 1].

    Returns [-1, -1] if no point is found.
    """
    pattern = r'<point x="(?P<x>\d+\.\d+)" y="(?P<y>\d+\.\d+)"'
    match = re.search(pattern, response)
    if match:
        # Molmo emits point coordinates on a 0-100 scale; normalize to [0, 1].
        x_coord = float(match.group("x")) / 100
        y_coord = float(match.group("y")) / 100
        return [x_coord, y_coord]
    return [-1, -1]
```

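For illustration, this is how the parsing helper behaves on a Molmo-style point string (the response text here is fabricated, not actual model output):

```python
# Fabricated example response in the Molmo point format that postprocess_response_elam expects.
example_response = 'The home button is here: <point x="53.1" y="88.2" alt="home button">home button</point>'
print(postprocess_response_elam(example_response))  # -> [0.531, 0.882]
print(postprocess_response_elam("no point in this response"))  # -> [-1, -1]
```
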
ELAM was fine-tuned on two prompt types for UI testing:

1. *Test Action*: These prompts take an instruction (e.g., "tap music note in bottom navigation bar") and return the corresponding tap coordinates.
2. *Expected Result*: These prompts take an expectation (e.g., "notification toggle switch is disabled") and return "PASSED" or "FAILED" along with the coordinates of the relevant UI element.

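To make the two templates concrete, this is what `preprocess_elam_prompt` renders for the examples above (output shown as comments, wrapped for readability):

```python
print(preprocess_elam_prompt("tap music note in bottom navigation bar", "Test Action"))
# Identify and point to the UI element that corresponds to this test action:
# tap music note in bottom navigation bar

print(preprocess_elam_prompt("notification toggle switch is disabled", "Expected Result"))
# Evaluate this statement about the image:
# 'notification toggle switch is disabled'
# Think step by step, conclude whether the evaluation is 'PASSED' or 'FAILED' and point to the UI
# element that corresponds to this evaluation.
```

The full inference example below ties these prompts together with image preprocessing and generation.
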
```python
image_path = "path/to/your/ui/image"
user_request = "Tap home button"  # or "The home icon is white"
request_type = "Test Action"  # or "Expected Result"

# Load the screenshot and build the ELAM prompt for the selected request type
image = Image.open(image_path)
elam_prompt = preprocess_elam_prompt(user_request, request_type)

inputs = processor.process(
    images=[image],
    text=elam_prompt,
)

# Move inputs to the correct device, make a batch of size 1, and cast float inputs to bfloat16
inputs_bfloat16 = {}
for k, v in inputs.items():
    if v.dtype == torch.float32:
        inputs_bfloat16[k] = v.to(model.device).to(torch.bfloat16).unsqueeze(0)
    else:
        inputs_bfloat16[k] = v.to(model.device).unsqueeze(0)

inputs = inputs_bfloat16  # Replace original inputs with the correctly typed inputs

# Generate output
output = model.generate_from_batch(
    inputs, GenerationConfig(max_new_tokens=2048, stop_strings="<|endoftext|>"), tokenizer=processor.tokenizer
)

# Keep only the newly generated tokens and decode them to text
generated_tokens = output[0, inputs["input_ids"].size(1):]
response = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
coordinates = postprocess_response_elam(response)

# Print outputs
print(f"ELAM response: {response}")
print(f"Got coordinates: {coordinates}")
```

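Because the returned coordinates are normalized to [0, 1], they still need to be scaled by the image size before, for example, dispatching a tap; for Expected Result prompts, the verdict can be read from the response text. A minimal sketch follows; the literal 'PASSED'/'FAILED' check mirrors the prompt wording above and is an assumption about the response format:

```python
# Scale normalized coordinates to pixel positions on the input screenshot.
if coordinates != [-1, -1]:
    x_px = round(coordinates[0] * image.width)
    y_px = round(coordinates[1] * image.height)
    print(f"Target pixel position: ({x_px}, {y_px})")

# Assumption: for Expected Result prompts the model states the literal word 'PASSED' or 'FAILED',
# as requested by the prompt template.
if request_type == "Expected Result":
    verdict = "PASSED" if "PASSED" in response else ("FAILED" if "FAILED" in response else "UNKNOWN")
    print(f"Evaluation verdict: {verdict}")
```
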
# Citation

If you find ELAM useful in your research, please cite the following paper:

```bibtex
@misc{ernhofer2025leveragingvisionlanguagemodelsvisual,
      title={Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI},
      author={Benjamin Raphael Ernhofer and Daniil Prokhorov and Jannica Langner and Dominik Bollmann},
      year={2025},
      eprint={2505.05895},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.05895},
}
```

# Acknowledgements

## Funding

This work was supported by the German Federal Ministry of Education and Research (BMBF) within the scope of the project "KI4BoardNet".

## Models and Code

- ELAM is based on [Molmo](https://github.com/allenai/molmo) by the Allen Institute for AI.
- Training was conducted using [ms-swift](https://github.com/modelscope/ms-swift) by ModelScope.
|