---
tags:
- fp8
- fp8-dynamic
- vllm
- llm-compressor
- internvl3.5
- internvl
language:
- multilingual
pipeline_tag: image-text-to-text
inference: false
license: mit
base_model: OpenGVLab/InternVL3_5-38B
base_model_relation: quantized
library_name: vllm
---
# InternVL3.5 38B FP8
This is an FP8 dynamically quantized (W8A8) version of `OpenGVLab/InternVL3_5-38B`, optimized for high-performance inference with *vLLM*.
The quantization process uses a specialized recipe that preserves the model's core visual understanding capabilities while reducing the memory footprint by nearly 40%.
## Just Run It (vLLM serve)
You can serve the model using vLLM's OpenAI-compatible API server.
```bash
vllm serve brandonbeiler/InternVL3_5-38B-FP8-Dynamic \
    --quantization compressed-tensors \
    --served-model-name internvl3_5-38b \
    --reasoning-parser qwen3 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 1  # adjust based on your GPU setup
```
**Notes**
- 32k max context length
- Reasoning parser is ready to go; thinking mode requires the appropriate system prompt (see the sketch below)
- Tool calling is still under investigation
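Once the server is up, a thinking-mode request looks roughly like the sketch below. This is a hedged example against the OpenAI-compatible endpoint started above; the actual thinking-mode system prompt is the one documented in the upstream InternVL3.5 model card (the string below is a placeholder), and `reasoning_content` is only populated when the reasoning parser finds a reasoning block.
```python
# Hedged sketch: querying the vLLM OpenAI-compatible server started above.
# The system prompt below is a placeholder; use the thinking-mode prompt
# from the upstream InternVL3.5 model card.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="internvl3_5-38b",  # matches --served-model-name above
    messages=[
        {"role": "system", "content": "<thinking-mode system prompt from the upstream card>"},
        {"role": "user", "content": "Explain the difference between FP8 E4M3 and E5M2."},
    ],
    temperature=0.6,
)
print(response.choices[0].message.content)
# When the reasoning parser detects a reasoning block, vLLM exposes it separately:
print(getattr(response.choices[0].message, "reasoning_content", None))
```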
## Key Features
* **Calibration-Free FP8:** Dynamic W8A8 quantization. Weights are pre-quantized, and activations are quantized on the fly.
* **Vision-Language Optimized:** The vision tower, embeddings, and the first MLP layer are preserved in full precision to maintain high performance on vision-language tasks.
* **vLLM Ready:** Designed for seamless integration with vLLM for high-throughput serving.
* **Memory Efficient:** ~40% memory reduction compared to the original FP16 model (a rough estimate follows this list).
* **Performance Boost:** Accelerated inference on FP8-compatible hardware (e.g., NVIDIA H100, L40S).
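The ~40% figure is consistent with simple byte counting, assuming ~38B parameters at 2 bytes each for the 16-bit original and the ~47GB weight footprint listed under Hardware Requirements below (approximate numbers, not a measurement):
```python
# Back-of-the-envelope memory estimate using figures from this card.
params = 38e9                # ~38B parameters
fp16_gb = params * 2 / 1e9   # 2 bytes per parameter -> ~76 GB
fp8_gb = 47                  # ~47 GB quantized weights (vision tower etc. stay 16-bit)
print(f"reduction ≈ {(fp16_gb - fp8_gb) / fp16_gb:.0%}")  # ~38%, i.e. "nearly 40%"
```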
## Model Details
| Attribute | Value |
| :--- | :--- |
| **Original Model** | [OpenGVLab/InternVL3_5-38B](https://huggingface.co/OpenGVLab/InternVL3_5-38B) |
| **Quantized Model** | `brandonbeiler/InternVL3_5-38B-FP8-Dynamic` |
| **Quantization Method** | FP8 Dynamic (W8A8) |
| **Quantization Library** | [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.7.1 |
| **Quantized By** | [brandonbeiler](https://huggingface.co/brandonbeiler) |
## Usage with vLLM in Python
The following snippet demonstrates inference using the vLLM library.
```python
from vllm import LLM, SamplingParams

# Load the quantized model.
# trust_remote_code is required for the custom InternVL architecture.
model = LLM(
    model="brandonbeiler/InternVL3_5-38B-FP8-Dynamic",
    trust_remote_code=True,
    max_model_len=32768,     # InternVL3.5 supports a 32k context length
    tensor_parallel_size=1,  # adjust for your hardware setup
)

# Sampling parameters (a temperature of 0.6 is recommended for this model)
sampling_params = SamplingParams(temperature=0.6, max_tokens=512)

# Generate a response.
# "<image>" is the image placeholder token; pass the actual image alongside the
# prompt (see the multimodal sketch below).
prompt = "Describe this image: <image>"
response = model.generate(prompt, sampling_params)
print(response[0].outputs[0].text)
```
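To pass an actual image, vLLM's `chat` interface accepts OpenAI-style messages and applies the model's chat template for you. A hedged sketch, reusing `model` and `sampling_params` from above (the image URL is a placeholder):
```python
# Hedged sketch: multimodal inference via LLM.chat(); the image URL is a placeholder.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/some-image.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
outputs = model.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```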
## Technical Specifications
### Hardware Requirements
* **Base VRAM:** ~47GB (for model weights)
* **Context VRAM:**
  * + ~1.3GB for a 10k-token context
  * + ~2GB for a 32k-token context with the FP8 KV cache (see the sketch after this list)
* **Recommended GPUs:** NVIDIA H100, L40S
* **Supported GPUs:** NVIDIA A100 (80GB), 2x RTX 4090 (with tensor parallelism), recent AMD GPUs with FP8 support
* **Optimal Performance:** NVIDIA GPUs with Compute Capability >= 9.0 (Hopper, Blackwell)
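The FP8 KV cache referenced above corresponds to vLLM's `kv_cache_dtype` engine argument (`--kv-cache-dtype fp8` on the CLI). A hedged sketch for a multi-GPU setup; actual fit depends on context length and batch size:
```python
from vllm import LLM

# Hedged sketch: FP8 KV cache plus tensor parallelism across two GPUs.
model = LLM(
    model="brandonbeiler/InternVL3_5-38B-FP8-Dynamic",
    trust_remote_code=True,
    max_model_len=32768,
    tensor_parallel_size=2,       # e.g. 2x RTX 4090 or 2x A100
    kv_cache_dtype="fp8",         # FP8 KV cache (CLI: --kv-cache-dtype fp8)
    gpu_memory_utilization=0.95,
)
```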
### Quantization Details
* **Weights:** FP8 E4M3 with per-tensor scales.
* **Activations:** Dynamically quantized to FP8 E4M3 with per-tensor scales.
* **Preserved Modules (Full Precision):** Vision tower, embeddings, and the first MLP layer (mlp1).
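For reference, a data-free recipe along these lines produces this preserved-module pattern with LLM Compressor. This is an illustrative sketch, not the exact recipe used for this checkpoint; the `ignore` patterns and the `AutoModel` loading path are assumptions about how the InternVL3.5 modules are named when loaded via transformers.
```python
# Illustrative sketch of a calibration-free FP8-Dynamic recipe with LLM Compressor.
# The ignore patterns are assumptions intended to keep the vision tower, the mlp1
# projector, and lm_head in full precision; they are not the exact recipe used here.
from transformers import AutoModel
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model_id = "OpenGVLab/InternVL3_5-38B"
model = AutoModel.from_pretrained(model_id, torch_dtype="auto", trust_remote_code=True)

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*vision.*", "re:.*mlp1.*", "lm_head"],
)

# No calibration data is needed for dynamic FP8; weights are quantized in one shot.
oneshot(model=model, recipe=recipe)
model.save_pretrained("InternVL3_5-38B-FP8-Dynamic", save_compressed=True)
```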
## Package Versions
This model was quantized using the following environment:
```
llmcompressor==0.7.1
compressed-tensors==0.10.2
transformers==4.55.0
torch==2.7.1
vllm==0.10.1.1
```
*Quantized with ❤️ using [LLM Compressor](https://github.com/vllm-project/llm-compressor) for the open-source community.*