|
---
language:
- en
- zh
tags:
- fp8
- quantization
- dynamic
- vision-language
- multimodal
- vLLM
- llm-compressor
- internvl3
pipeline_tag: image-text-to-text
inference: false
license: mit
base_model:
- OpenGVLab/InternVL3-8B
---
|
|
|
# 🔥 InternVL3-8B-FP8-Dynamic: Optimized Vision-Language Model 🔥
|
This is an **FP8 dynamic quantized** version of [OpenGVLab/InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B), optimized for high-performance inference with vLLM.

The model uses **dynamic FP8 quantization**, which computes activation scales at runtime and therefore requires no calibration dataset, achieving a significant speedup with minimal accuracy degradation on vision-language tasks.
|
|
|
## 🔧 Usage
|
### With vLLM (Recommended) |
|
```python
from vllm import LLM, SamplingParams

# Load the quantized model
model = LLM(
    model="brandonbeiler/InternVL3-8B-FP8-Dynamic",
    trust_remote_code=True,
    max_model_len=8192,
    tensor_parallel_size=1,  # Adjust based on your GPU setup
)

# Generate a response; for real image input, pass the image alongside the
# prompt (e.g. via vLLM's multi-modal inputs) so the <image> token is filled in
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
response = model.generate("Describe this image: <image>", sampling_params)
print(response[0].outputs[0].text)
```
|
|
|
## 🚀 Key Features
|
- **Vision-Language Optimized**: Specialized quantization recipe that preserves visual understanding |
|
- **vLLM Ready**: Seamless integration with vLLM for production deployment |
|
- **Memory Efficient**: ~50% memory reduction compared to FP16 original |
|
- **Performance Boost**: Significantly faster inference on H100/L40S GPUs
|
## 📊 Model Details
|
- **Original Model**: [OpenGVLab/InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B) |
|
- **Quantized Model**: InternVL3-8B-FP8-Dynamic |
|
- **Quantization Method**: FP8 Dynamic (W8A8) |
|
- **Quantization Library**: [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.5.2.dev112+g6800f811 |
|
- **Quantized by**: [brandonbeiler](https://huggingface.co/brandonbeiler) |
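For reference, an FP8-dynamic quantization of this kind is typically expressed in LLM Compressor as a one-shot recipe. The sketch below is illustrative only: the exact `ignore` patterns used for this model (vision tower, embeddings, norms, mlp1) are assumptions, not the published recipe.

```python
# Hedged sketch of an FP8-dynamic oneshot run with LLM Compressor.
# The ignore patterns below are hypothetical approximations of the
# "preserved components" listed in this card.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",          # W8A8: static weights, dynamic activations
    ignore=["re:.*vision.*", "re:.*mlp1.*", "lm_head"],  # assumed patterns
)

oneshot(model="OpenGVLab/InternVL3-8B", recipe=recipe)
```

Because the scheme is dynamic, no calibration dataset argument is needed for the activations.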
|
|
|
## 🏗️ Technical Specifications
|
### Hardware Requirements |
|
- **Inference**: ~7.8GB VRAM, plus KV cache for context
|
- **Supported GPUs**: H100, L40S, A100 (80GB), RTX 4090 (2x for tensor parallelism) |
|
- **GPU Architecture**: Ada Lovelace, Hopper (for optimal FP8 performance) |
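The VRAM figure above is consistent with a back-of-envelope estimate: at 1 byte per parameter, an ~8B-parameter model needs roughly 7.5 GB for weights alone (the remainder comes from unquantized components such as the vision tower), versus twice that in FP16, matching the ~50% reduction claimed earlier.

```python
# Rough weight-memory estimate for an ~8B-parameter model (illustrative numbers)
PARAMS = 8.0e9  # approximate parameter count of InternVL3-8B

def weight_gb(bytes_per_param):
    """Weight memory in GiB for a given precision."""
    return PARAMS * bytes_per_param / 1024**3

fp16_gb = weight_gb(2)  # FP16/BF16: 2 bytes per parameter
fp8_gb = weight_gb(1)   # FP8: 1 byte per parameter

print(f"FP16: {fp16_gb:.1f} GB, FP8: {fp8_gb:.1f} GB")
```

This ignores activation memory and KV cache, which depend on batch size and context length.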
|
### Quantization Details |
|
- **Weights**: FP8 E4M3 with per-channel scales (static)

- **Activations**: FP8 E4M3 with per-token scales computed dynamically at runtime
|
- **Preserved Components**: Vision tower, embeddings, normalization layers, mlp1 |
|
## 🔬 Package Versions
|
This model was created using: |
|
``` |
|
llmcompressor==0.5.2.dev112+g6800f811 |
|
compressed-tensors==latest |
|
transformers==4.52.4 |
|
torch==2.7.0 |
|
vllm==0.9.1 |
|
``` |
|
|
|
*Quantized with ❤️ using LLM Compressor for the open-source community*