---
license: apache-2.0
base_model:
- google/gemma-3-4b-it-qat-int4-unquantized
tags:
- OpenArc
- OpenVINO
- Optimum-Intel
- image-text-to-text
---

## Gemma 3 for OpenArc has landed!

My project [OpenArc](https://github.com/SearchSavior/OpenArc), an inference engine for OpenVINO, now supports this model and serves inference over OpenAI-compatible endpoints for text-to-text *and* text with vision! That release comes out today or tomorrow.
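
Once that release lands, calling the server should look like any other OpenAI-compatible endpoint. Here is a minimal sketch using the `openai` Python client; the base URL, port, API key, and model name are placeholder assumptions rather than values from OpenArc's docs, so check the repository for the exact routes and names:

```
# Hypothetical client call against an OpenAI-compatible endpoint; URL, key, and model name are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gemma-3-4b-it-int8_asym-ov",  # placeholder: use whatever model id the server reports
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```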

We have a growing Discord community of others interested in using Intel for AI/ML.

[Join us on Discord](https://discord.gg/maMY7QjG)

This model was converted to the OpenVINO IR format using the following Optimum-CLI command:

```
optimum-cli export openvino -m "input-model" --task image-text-to-text --weight-format int8 "converted-model"
```

- Find documentation on the Optimum-CLI export process [here](https://huggingface.co/docs/optimum/main/en/intel/openvino/export)
- Use my HF space [Echo9Zulu/Optimum-CLI-Tool_tool](https://huggingface.co/spaces/Echo9Zulu/Optimum-CLI-Tool_tool) to build commands and execute locally

### What does the test code do?

Well, it demonstrates how to run inference in Python *and* which parts of that code matter for benchmarking performance.
Text-only generation poses different challenges than generation with images; for example, vision encoders often use different strategies for handling the properties an image can have.
In practice this translates to higher memory usage, reduced throughput, or degraded results.
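
One way to see that cost directly is to compare how many prompt tokens the processor produces with and without the image. This is a small sketch that reuses the same model repo and processor as the test code further down (the image path is a placeholder), so run it after completing the setup steps below:

```
# Sketch: compare prompt length with and without the image (not part of the original test code).
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Echo9Zulu/gemma-3-4b-it-int8_asym-ov")

text_conv = [{"role": "user", "content": [{"type": "text", "text": "Describe this image."}]}]
image_conv = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}]}]

text_prompt = processor.apply_chat_template(text_conv, add_generation_prompt=True)
image_prompt = processor.apply_chat_template(image_conv, add_generation_prompt=True)

image = Image.open("example.png").convert("RGB")  # placeholder path

text_only = processor(text=[text_prompt], return_tensors="pt")
with_image = processor(text=[image_prompt], images=[image], return_tensors="pt")

print(f"Text-only prompt tokens : {text_only.input_ids.shape[-1]}")
print(f"Text+image prompt tokens: {with_image.input_ids.shape[-1]}")
```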

To run the test code:

- Install device-specific drivers
- Build Optimum-Intel for OpenVINO from source
- Find your spiciest images to get that AGI refusal smell

```
pip install "optimum-intel[openvino]"@git+https://github.com/huggingface/optimum-intel.git
```
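
Before loading the model, it can also help to confirm which devices your OpenVINO install actually detects, since the script below targets `device="CPU"` and Intel GPUs show up as `"GPU.0"`, `"GPU.1"`, and so on. A quick sanity check, not part of the original test code:

```
# List the devices OpenVINO can see; useful when choosing device="CPU" or "GPU.0" below.
import openvino as ov

core = ov.Core()
print(core.available_devices)  # e.g. ['CPU', 'GPU.0'] once the GPU drivers are installed
```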

```
import time

from PIL import Image
from transformers import AutoProcessor
from optimum.intel.openvino import OVModelForVisualCausalLM


model_id = "Echo9Zulu/gemma-3-4b-it-int8_asym-ov"  # Can be an HF id or a local path

ov_config = {"PERFORMANCE_HINT": "LATENCY"}  # Optimizes for first-token latency and locks to a single CPU socket

print("Loading model... this should get faster after the first generation due to caching behavior.")
print("")
start_load_time = time.time()
model = OVModelForVisualCausalLM.from_pretrained(model_id, export=False, device="CPU", ov_config=ov_config)  # For GPU use "GPU.0"
processor = AutoProcessor.from_pretrained(model_id)  # Instead of AutoTokenizer we use AutoProcessor, which routes to the appropriate input processor, i.e. how the model expects image tokens.
# Under the hood this takes care of model-specific preprocessing and has functionality overlap with AutoTokenizer.
end_load_time = time.time()

image_path = r""  # This script expects .png
image = Image.open(image_path)
image = image.convert("RGB")  # Required by Gemma 3. In practice this would need to be handled at the engine level OR in model-specific preprocessing.

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(text=[text_prompt], images=[image], padding=True, return_tensors="pt")

input_token_count = len(inputs.input_ids[0])
print(f"Sum of image and text tokens: {input_token_count}")

start_time = time.time()
output_ids = model.generate(**inputs, max_new_tokens=1024)

# Strip the prompt tokens so only newly generated tokens are decoded and counted.
generated_ids = [output[len(input_ids):] for input_ids, output in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)

num_tokens_generated = len(generated_ids[0])
load_time = end_load_time - start_load_time
generation_time = time.time() - start_time
tokens_per_second = num_tokens_generated / generation_time
average_token_latency = generation_time / num_tokens_generated

print("\nPerformance Report:")
print("-" * 50)
print(f"Input Tokens      : {input_token_count:>9}")
print(f"Generated Tokens  : {num_tokens_generated:>9}")
print(f"Model Load Time   : {load_time:>9.2f} sec")
print(f"Generation Time   : {generation_time:>9.2f} sec")
print(f"Throughput        : {tokens_per_second:>9.2f} t/s")
print(f"Avg Latency/Token : {average_token_latency:>9.3f} sec")

print(output_text)
```