|
---
language:
- en
- zh
tags:
- fp8
- quantization
- dynamic
- vision-language
- multimodal
- vLLM
- llm-compressor
- internvl3
pipeline_tag: image-text-to-text
inference: false
license: mit
base_model:
- OpenGVLab/InternVL3-8B
---
|
|
|
# 🔥 InternVL3-8B-FP8-Dynamic: Optimized Vision-Language Model 🔥
|
This is an **FP8 dynamic quantized** version of [OpenGVLab/InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B), optimized for high-performance inference with vLLM.

The model uses **dynamic FP8 quantization**, which computes activation scales at runtime and therefore requires no calibration dataset, achieving a significant speedup with minimal accuracy degradation on vision-language tasks.
|
|
|
## 🔧 Usage
|
### With vLLM (Recommended) |
|
```python
from vllm import LLM, SamplingParams

# Load the quantized model
model = LLM(
    model="brandonbeiler/InternVL3-8B-FP8-Dynamic",
    trust_remote_code=True,
    max_model_len=8192,
    tensor_parallel_size=1,  # Adjust based on your GPU setup
)

# Generate a response; for real image input, pass the image alongside the
# prompt (e.g. via vLLM's multi-modal inputs) so the <image> token is filled in
sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
response = model.generate("Describe this image: <image>", sampling_params)
print(response[0].outputs[0].text)
```
|
|
|
## 🚀 Key Features
|
- **Vision-Language Optimized**: Specialized quantization recipe that preserves visual understanding |
|
- **vLLM Ready**: Seamless integration with vLLM for production deployment |
|
- **Memory Efficient**: ~50% memory reduction compared to FP16 original |
|
- **Performance Boost**: Significantly faster inference on H100/L40S GPUs
|
## 📊 Model Details
|
- **Original Model**: [OpenGVLab/InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B) |
|
- **Quantized Model**: InternVL3-8B-FP8-Dynamic |
|
- **Quantization Method**: FP8 Dynamic (W8A8) |
|
- **Quantization Library**: [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.5.2.dev112+g6800f811 |
|
- **Quantized by**: [brandonbeiler](https://huggingface.co/brandonbeiler) |
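For reference, an FP8-dynamic quantization of this kind is typically expressed in LLM Compressor as a one-shot recipe. The sketch below is illustrative only: the exact `ignore` patterns used for this model (vision tower, embeddings, norms, mlp1) are assumptions, not the published recipe.

```python
# Hedged sketch of an FP8-dynamic oneshot run with LLM Compressor.
# The ignore patterns below are hypothetical approximations of the
# "preserved components" listed in this card.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",          # W8A8: static weights, dynamic activations
    ignore=["re:.*vision.*", "re:.*mlp1.*", "lm_head"],  # assumed patterns
)

oneshot(model="OpenGVLab/InternVL3-8B", recipe=recipe)
```

Because the scheme is dynamic, no calibration dataset argument is needed for the activations.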
|
|
|
## 🏗️ Technical Specifications
|
### Hardware Requirements |
|
- **Inference**: ~7.8GB VRAM, plus KV cache for context
|
- **Supported GPUs**: H100, L40S, A100 (80GB), RTX 4090 (2x for tensor parallelism) |
|
- **GPU Architecture**: Ada Lovelace, Hopper (for optimal FP8 performance) |
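The VRAM figure above is consistent with a back-of-envelope estimate: at 1 byte per parameter, an ~8B-parameter model needs roughly 7.5 GB for weights alone (the remainder comes from unquantized components such as the vision tower), versus twice that in FP16, matching the ~50% reduction claimed earlier.

```python
# Rough weight-memory estimate for an ~8B-parameter model (illustrative numbers)
PARAMS = 8.0e9  # approximate parameter count of InternVL3-8B

def weight_gb(bytes_per_param):
    """Weight memory in GiB for a given precision."""
    return PARAMS * bytes_per_param / 1024**3

fp16_gb = weight_gb(2)  # FP16/BF16: 2 bytes per parameter
fp8_gb = weight_gb(1)   # FP8: 1 byte per parameter

print(f"FP16: {fp16_gb:.1f} GB, FP8: {fp8_gb:.1f} GB")
```

This ignores activation memory and KV cache, which depend on batch size and context length.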
|
### Quantization Details |
|
- **Weights**: FP8 E4M3 with per-channel scales (static)

- **Activations**: FP8 E4M3 with per-token scales computed dynamically at runtime
|
- **Preserved Components**: Vision tower, embeddings, normalization layers, mlp1 |
|
## 🔬 Package Versions
|
This model was created using: |
|
``` |
|
llmcompressor==0.5.2.dev112+g6800f811 |
|
compressed-tensors==latest |
|
transformers==4.52.4 |
|
torch==2.7.0 |
|
vllm==0.9.1 |
|
``` |
|
|
|
*Quantized with ❤️ using LLM Compressor for the open-source community*