---
tags:
- fp8
- fp8-dynamic
- vllm
- llm-compressor
- internvl3.5
- internvl
language:
- multilingual
pipeline_tag: image-text-to-text
inference: false
license: mit
base_model: OpenGVLab/InternVL3_5-38B
base_model_relation: quantized
library_name: vllm
---
# InternVL3.5 38B FP8
This is an FP8 dynamically quantized (W8A8) version of `OpenGVLab/InternVL3_5-38B`, optimized for high-performance inference with *vLLM*.
The quantization recipe preserves the model's core visual understanding capabilities while reducing the memory footprint by nearly 40%.
## Just Run It (vLLM serve)
You can serve the model using vLLM's OpenAI-compatible API server; a client-side example follows the notes below.
```bash
vllm serve brandonbeiler/InternVL3_5-38B-FP8-Dynamic \
--quantization compressed-tensors \
--served-model-name internvl3_5-38b \
--reasoning-parser qwen3 \
--trust-remote-code \
--max-model-len 32768 \
--tensor-parallel-size 1 # Adjust based on your GPU setup
```
**Notes**
- 32k maximum context length
- Reasoning parser is ready to use; a system prompt is required to run in thinking mode
- Tool calling is still under investigation
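Once the server is running, any OpenAI-compatible client can query it. A minimal sketch, assuming the default port (8000), the served model name from the command above, and a placeholder image URL:
```python
from openai import OpenAI

# vLLM's OpenAI-compatible server does not require a real API key by default
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="internvl3_5-38b",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    temperature=0.6,
    max_tokens=512,
)
print(response.choices[0].message.content)
```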
## Key Features
* **Calibration-Free FP8:** Dynamic W8A8 quantization. Weights are pre-quantized, and activations are quantized on the fly.
* **Vision-Language Optimized:** The vision tower, embeddings, and the first MLP layer are preserved in full precision to maintain high performance on vision-language tasks.
* **vLLM Ready:** Designed for seamless integration with vLLM for high-throughput serving.
* **Memory Efficient:** ~40% memory reduction compared to the original FP16 model.
* **Performance Boost:** Accelerated inference on FP8-compatible hardware (e.g., NVIDIA H100, L40S).
## Model Details
| Attribute | Value |
| :--- | :--- |
| **Original Model** | [OpenGVLab/InternVL3_5-38B](https://huggingface.co/OpenGVLab/InternVL3_5-38B) |
| **Quantized Model** | `brandonbeiler/InternVL3_5-38B-FP8-Dynamic` |
| **Quantization Method** | FP8 Dynamic (W8A8) |
| **Quantization Library** | [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.7.1 |
| **Quantized By** | [brandonbeiler](https://huggingface.co/brandonbeiler) |
## Usage with vLLM in Python
The following snippet demonstrates inference using the vLLM library.
```python
from vllm import LLM, SamplingParams

# Load the quantized model.
# trust_remote_code is required to load the custom InternVL architecture.
model = LLM(
    model="brandonbeiler/InternVL3_5-38B-FP8-Dynamic",
    trust_remote_code=True,
    max_model_len=32768,     # InternVL 3.5 supports a 32k context length
    tensor_parallel_size=1,  # Adjust for your hardware setup
)

# A temperature of 0.6 is recommended for this model
sampling_params = SamplingParams(temperature=0.6, max_tokens=512)

# Generate a response from a text-only prompt
# (see below for passing an actual image)
prompt = "Describe this image: <image>"
response = model.generate(prompt, sampling_params)
print(response[0].outputs[0].text)
```
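The snippet above sends a text-only prompt. To pass an actual image through the offline API, vLLM accepts multi-modal inputs as a dict. A sketch reusing `model` and `sampling_params` from above and assuming a local image file (the exact placement of the `<image>` placeholder should follow the model's chat template):
```python
from PIL import Image

image = Image.open("example.jpg")

# Pass the image alongside the prompt; the "<image>" placeholder marks where
# the vision tokens are inserted (formatting follows the model's chat template)
outputs = model.generate(
    {
        "prompt": "Describe this image: <image>",
        "multi_modal_data": {"image": image},
    },
    sampling_params,
)
print(outputs[0].outputs[0].text)
```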
## Technical Specifications
### Hardware Requirements
* **Base VRAM:** ~47GB (for model weights)
* **Context VRAM:**
  * +~1.3GB for a 10k-token context
  * +~2GB for a 32k-token context with an FP8 KV cache (see the flag below)
* **Recommended GPUs:** NVIDIA H100, L40S
* **Supported GPUs:** NVIDIA A100 (80GB), 2x RTX 4090 (with tensor parallelism), latest AMD GPUs.
* **Optimal Performance:** NVIDIA GPUs with Compute Capability >= 9.0 (Hopper, Blackwell).
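To use the FP8 KV cache mentioned above, vLLM exposes a `--kv-cache-dtype` flag that can be appended to the serve command from earlier:
```bash
vllm serve brandonbeiler/InternVL3_5-38B-FP8-Dynamic \
--quantization compressed-tensors \
--trust-remote-code \
--max-model-len 32768 \
--kv-cache-dtype fp8 # Quantize the KV cache to FP8 to reduce context VRAM
```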
### Quantization Details
* **Weights:** FP8 E4M3 with per-tensor scales.
* **Activations:** Dynamically quantized to FP8 E4M3 with per-tensor scales.
* **Preserved Modules (Full Precision):** Vision tower, embeddings, and the first MLP layer (`mlp1`); see the recipe sketch below.
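For reference, a minimal sketch of how such a recipe can be expressed with LLM Compressor. The module-name patterns in `ignore` are assumptions for illustration; the original quantization script may differ:
```python
from transformers import AutoModel

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load the original model (custom architecture, hence trust_remote_code)
model = AutoModel.from_pretrained(
    "OpenGVLab/InternVL3_5-38B",
    torch_dtype="auto",
    trust_remote_code=True,
)

# FP8 dynamic (W8A8) on all Linear layers, skipping the modules kept in full precision
# (vision tower, mlp1 projector, lm_head) -- module names assumed for illustration
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*vision_model.*", "re:.*mlp1.*", "lm_head"],
)

# Calibration-free: FP8 dynamic quantization needs no calibration dataset
oneshot(model=model, recipe=recipe)

model.save_pretrained("InternVL3_5-38B-FP8-Dynamic", save_compressed=True)
```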
## Package Versions
This model was quantized using the following environment:
```
llmcompressor==0.7.1
compressed-tensors==0.10.2
transformers==4.55.0
torch==2.7.1
vllm==0.10.1.1
```
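To reproduce this environment, the pinned versions can be installed directly (assuming a CUDA-compatible setup):
```bash
pip install llmcompressor==0.7.1 compressed-tensors==0.10.2 \
    transformers==4.55.0 torch==2.7.1 vllm==0.10.1.1
```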
*Quantized with ❤️ using [LLM Compressor](https://github.com/vllm-project/llm-compressor) for the open-source community.* |