brandonbeiler committed
Commit a84a3ca · verified · 1 Parent(s): 8e0e486

Update README.md

Files changed (1)
  1. README.md +19 -13
README.md CHANGED
@@ -22,6 +22,25 @@ This is an FP8 dynamically quantized (W8A8) version of `OpenGVLab/InternVL3_5-38
 
 The quantization process uses a specialized recipe that preserves the model's core visual understanding capabilities while reducing the memory footprint by nearly 50%.
 
+## Just Run It (vLLM serve)
+
+You can serve the model using vLLM's OpenAI-compatible API server.
+
+```bash
+vllm serve brandonbeiler/InternVL3_5-38B-FP8-Dynamic \
+    --quantization compressed-tensors \
+    --served-model-name internvl3_5-38b \
+    --reasoning-parser qwen3 \
+    --trust-remote-code \
+    --max-model-len 32768 \
+    --tensor-parallel-size 1  # Adjust based on your GPU setup
+```
+**Notes**
+- 32k max context length
+- reasoning parser ready to go; thinking mode requires a system prompt (see the example below)
+- still investigating tool calling
+
+
 ## Key Features
 
 * **Calibration-Free FP8:** Dynamic W8A8 quantization. Weights are pre-quantized, and activations are quantized on the fly.
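The "dynamic" half of W8A8 is worth unpacking: weight scales are baked into the checkpoint, while each activation tensor gets its scale computed at runtime. A minimal sketch of that idea, assuming per-tensor e4m3 scaling (the real vLLM/compressed-tensors kernels fuse this into the matmul and may differ in detail):

```python
# Illustrative sketch of dynamic FP8 activation quantization (assumption:
# per-tensor e4m3 scaling; production kernels handle this internally).
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_fp8_dynamic(x: torch.Tensor):
    """Compute a scale on the fly and cast the tensor to FP8."""
    scale = x.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
    q = (x / scale).to(torch.float8_e4m3fn)
    return q, scale

x = torch.randn(4, 8)           # stand-in for an activation tensor
q, scale = quantize_fp8_dynamic(x)
x_hat = q.float() * scale       # dequantized view, x_hat ≈ x
print((x_hat - x).abs().max())  # small round-off from the 8-bit cast
```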
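On the thinking-mode note in the hunk above: a hedged sketch of what a client call might look like once the server is running. The endpoint, and especially the system prompt, are assumptions rather than the README's; check the InternVL3.5 model card for the official thinking prompt.

```python
# Hypothetical client call against the server started above; the system
# prompt here is a placeholder, not the official InternVL3.5 thinking prompt.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="internvl3_5-38b",  # matches --served-model-name
    messages=[
        {"role": "system", "content": "Think step by step before answering."},
        {"role": "user", "content": "What does FP8 W8A8 change at inference time?"},
    ],
)
msg = resp.choices[0].message
# With --reasoning-parser qwen3, vLLM splits the chain of thought into a
# separate reasoning_content field alongside the final answer.
print(getattr(msg, "reasoning_content", None))
print(msg.content)
```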
@@ -40,19 +59,6 @@ The quantization process uses a specialized recipe that preserves the model's co
 | **Quantization Library** | [LLM Compressor](https://github.com/vllm-project/llm-compressor) v0.7.1 |
 | **Quantized By** | [brandonbeiler](https://huggingface.co/brandonbeiler) |
 
-## With vLLM OpenAI-Compatible Server
-
-You can serve the model using vLLM's OpenAI-compatible API server.
-
-```bash
-vllm serve brandonbeiler/InternVL3_5-38B-FP8-Dynamic \
-    --quantization compressed-tensors \
-    --served-model-name internvl3_5-38b \
-    --reasoning-parser: qwen3 \
-    --trust-remote-code \
-    --max-model-len 32768 \
-    --tensor-parallel-size 1 # Adjust based on your GPU setup
-```
 
 ## Usage with vLLM in Python
 