---
tags:
- fp8
- fp8-dynamic
- vllm
- llm-compressor
- internvl3.5
- internvl
language:
- multilingual
pipeline_tag: image-text-to-text
inference: false
license: mit
base_model: OpenGVLab/InternVL3_5-38B
base_model_relation: quantized
library_name: vllm
---

# InternVL3.5 38B FP8

This is an FP8 dynamically quantized (W8A8) version of `OpenGVLab/InternVL3_5-38B`, optimized for high-performance inference with *vLLM*.

The quantization process uses a specialized recipe that preserves the model's core visual understanding capabilities while reducing the memory footprint by nearly 40%.

## Just Run It (vLLM serve)

You can serve the model using vLLM's OpenAI-compatible API server.

```bash
vllm serve brandonbeiler/InternVL3_5-38B-FP8-Dynamic \
    --quantization compressed-tensors \
    --served-model-name internvl3_5-38b \
    --reasoning-parser qwen3 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 1 # Adjust based on your GPU setup
```
**Notes**
- 32k maximum context length
- Reasoning parser is ready to go; a system prompt is required to run in thinking mode
- Tool calling is still being investigated
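
Once the server is up, any OpenAI-compatible client can query it. Below is a minimal sketch using the `openai` Python package; the base URL and port assume vLLM's defaults, the model name matches the `--served-model-name` flag above, and the image URL is a placeholder to replace with your own.

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint on port 8000 by default (assumed here)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="internvl3_5-38b",  # must match --served-model-name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                # Placeholder: swap in a URL or base64 data URI for your own image
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }
    ],
    temperature=0.6,
    max_tokens=512,
)

print(response.choices[0].message.content)
```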


## Key Features

*   **Calibration-Free FP8:** Dynamic W8A8 quantization. Weights are pre-quantized, and activations are quantized on the fly.
*   **Vision-Language Optimized:** The vision tower, embeddings, and the first MLP layer are preserved in full precision to maintain high performance on vision-language tasks.
*   **vLLM Ready:** Designed for seamless integration with vLLM for high-throughput serving.
*   **Memory Efficient:** ~40% memory reduction compared to the original FP16 model.
*   **Performance Boost:** Accelerated inference on FP8-compatible hardware (e.g., NVIDIA H100, L40S).

## Model Details

| Attribute | Value |
| :--- | :--- |
| **Original Model** | [OpenGVLab/InternVL3_5-38B](https://huggingface.co/OpenGVLab/InternVL3_5-38B) |
| **Quantized Model** | `brandonbeiler/InternVL3_5-38B-FP8-Dynamic` |
| **Quantization Method** | FP8 Dynamic (W8A8) |
| **Quantization Library** | [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.7.1 |
| **Quantized By** | [brandonbeiler](https://huggingface.co/brandonbeiler) |


## Usage with vLLM in Python

The following snippet demonstrates inference using the vLLM library.

```python
from vllm import LLM, SamplingParams

# Load the quantized model
# trust_remote_code is required to load the custom InternVL architecture
model = LLM(
    model="brandonbeiler/InternVL3_5-38B-FP8-Dynamic",
    trust_remote_code=True,
    max_model_len=32768,        # InternVL 3.5 supports a 32k context length
    tensor_parallel_size=1,     # Adjust for your hardware setup
)

# Set sampling parameters
# A temperature of 0.6 is recommended for this model
sampling_params = SamplingParams(temperature=0.6, max_tokens=512)

# Generate a text-only response; image inputs are passed separately
# via multi_modal_data (see the multimodal example below)
prompt = "Describe the InternVL model family in one sentence."
response = model.generate(prompt, sampling_params)

print(response[0].outputs[0].text)
```
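
The snippet above is text-only. For actual image inputs with the offline `LLM` API, vLLM expects the image to be supplied through `multi_modal_data` alongside a prompt containing the model's image placeholder. The sketch below shows one way to do this; the image path is a placeholder, and the exact prompt formatting (InternVL's chat template) is an assumption you should adapt to your use case.

```python
from PIL import Image
from vllm import LLM, SamplingParams

model = LLM(
    model="brandonbeiler/InternVL3_5-38B-FP8-Dynamic",
    trust_remote_code=True,
    max_model_len=32768,
)

# Load the image with PIL; "example.jpg" is a placeholder path
image = Image.open("example.jpg").convert("RGB")

# "<image>" marks where the vision tokens are inserted for InternVL-style models
outputs = model.generate(
    {
        "prompt": "<image>\nDescribe this image in detail.",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(temperature=0.6, max_tokens=512),
)

print(outputs[0].outputs[0].text)
```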



## Technical Specifications

### Hardware Requirements

*   **Base VRAM:** ~47GB (for model weights)
*   **Context VRAM:**
    *   \+ ~1.3GB for 10k token context
    *   \+ ~2GB for 32k token context with FP8 KV cache
*   **Recommended GPUs:** NVIDIA H100, L40S
*   **Supported GPUs:** NVIDIA A100 (80GB), 2x RTX 4090 (with tensor parallelism), latest AMD GPUs.
*   **Optimal Performance:** NVIDIA GPUs with Compute Capability >= 9.0 (Hopper, Blackwell).

### Quantization Details

*   **Weights:** FP8 E4M3 with per-tensor scales.
*   **Activations:** Dynamically quantized to FP8 E4M3 with per-tensor scales.
*   **Preserved Modules (Full Precision):** Vision tower, embeddings, and the first MLP layer (mlp1).
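
For intuition, the sketch below shows what per-tensor FP8 E4M3 quantization looks like in plain PyTorch. It is an illustration of the scheme described above, not the LLM Compressor implementation; the function names are made up for this example.

```python
import torch

def quantize_fp8_per_tensor(x: torch.Tensor):
    """Quantize a tensor to FP8 E4M3 with a single (per-tensor) scale."""
    finfo = torch.finfo(torch.float8_e4m3fn)
    # The scale maps the tensor's max magnitude onto the FP8 representable range (~448)
    scale = x.abs().max().clamp(min=1e-12) / finfo.max
    x_fp8 = (x / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return x_fp8, scale

def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_fp8.to(torch.float32) * scale

# Weights are quantized once offline; activations receive the same treatment
# dynamically at every forward pass, which is why no calibration data is needed.
w = torch.randn(4096, 4096)
w_fp8, w_scale = quantize_fp8_per_tensor(w)
print(w_fp8.dtype, float(w_scale))  # torch.float8_e4m3fn, per-tensor scale
```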

## Package Versions

This model was quantized using the following environment:

```
llmcompressor==0.7.1
compressed-tensors==0.10.2
transformers==4.55.0
torch==2.7.1
vllm==0.10.1.1
```

*Quantized with ❤️ using [LLM Compressor](https://github.com/vllm-project/llm-compressor) for the open-source community.*