Update README.md
README.md
CHANGED
@@ -22,6 +22,25 @@ This is an FP8 dynamically quantized (W8A8) version of `OpenGVLab/InternVL3_5-38B`
 
 The quantization process uses a specialized recipe that preserves the model's core visual understanding capabilities while reducing the memory footprint by nearly 50%.
 
+## Just Run It (vLLM serve)
+
+You can serve the model using vLLM's OpenAI-compatible API server.
+
+```bash
+vllm serve brandonbeiler/InternVL3_5-38B-FP8-Dynamic \
+    --quantization compressed-tensors \
+    --served-model-name internvl3_5-38b \
+    --reasoning-parser qwen3 \
+    --trust-remote-code \
+    --max-model-len 32768 \
+    --tensor-parallel-size 1  # Adjust based on your GPU setup
+```
+
+**Notes**
+- 32k max context length
+- Reasoning parser is ready to go; thinking mode requires a system prompt
+- Tool calling is still under investigation
+
 ## Key Features
 
 * **Calibration-Free FP8:** Dynamic W8A8 quantization. Weights are pre-quantized, and activations are quantized on the fly.
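The "nearly 50%" claim in the context above is straightforward per-parameter storage math: FP8 stores one byte per weight versus two bytes for BF16. A quick sanity check in Python; the 38B figure is the nominal parameter count, and this sketch ignores any components the quantization recipe leaves in higher precision, which is why the saving is nearly rather than exactly half:

```python
params = 38e9  # nominal parameter count of InternVL3_5-38B

bf16_gib = params * 2 / 1024**3  # BF16: 2 bytes per parameter
fp8_gib = params * 1 / 1024**3   # FP8: 1 byte per parameter

print(f"BF16 weights: ~{bf16_gib:.0f} GiB")  # ~71 GiB
print(f"FP8 weights:  ~{fp8_gib:.0f} GiB")   # ~35 GiB
print(f"Reduction: {1 - fp8_gib / bf16_gib:.0%}")  # 50%
```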
@@ -40,19 +59,6 @@ The quantization process uses a specialized recipe that preserves the model's core visual understanding capabilities while reducing the memory footprint by nearly 50%.
 | **Quantization Library** | [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.7.1 |
 | **Quantized By** | [brandonbeiler](https://huggingface.co/brandonbeiler) |
 
-## With vLLM OpenAI-Compatible Server
-
-You can serve the model using vLLM's OpenAI-compatible API server.
-
-```bash
-vllm serve brandonbeiler/InternVL3_5-38B-FP8-Dynamic \
-    --quantization compressed-tensors \
-    --served-model-name internvl3_5-38b \
-    --reasoning-parser qwen3 \
-    --trust-remote-code \
-    --max-model-len 32768 \
-    --tensor-parallel-size 1  # Adjust based on your GPU setup
-```
 
 ## Usage with vLLM in Python
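Once the server from the `vllm serve` command above is running, any OpenAI-compatible client can query it. Below is a minimal sketch using the `openai` Python package, assuming vLLM's default endpoint (`http://localhost:8000/v1`) and the `--served-model-name` set above; the image URL is a placeholder:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; the API key is required by the client but unused
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="internvl3_5-38b",  # matches --served-model-name above
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                # Placeholder image URL; swap in your own
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

With `--reasoning-parser qwen3` enabled, vLLM separates any thinking output from the final answer in its response; per the notes above, the model only enters thinking mode when instructed to via a system prompt.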