robgreenberg3 and jennyyyi committed
Commit 1ef9cf4 · verified · 1 Parent(s): a6c9d52

Update README.md (#2)


- Update README.md (72b1d4569b53372acb52c7a0ed2773ad7d000426)
- Update README.md (0f771ca84a890afae9775b64624230c0f4ae1116)


Co-authored-by: Jenny Y <[email protected]>

Files changed (1)
  1. README.md +166 -4
README.md CHANGED
@@ -36,8 +36,14 @@ tags:
 - quantized
 - int4
 ---
-
-# Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16
+<h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+  Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16
+  <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+</h1>
+
+<a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+  <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+</a>
 
 ## Model Overview
 - **Model Architecture:** Mistral3ForConditionalGeneration
@@ -77,7 +83,7 @@ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/
 from vllm import LLM, SamplingParams
 from transformers import AutoProcessor
 
-model_id = "RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-FP8-dynamic"
+model_id = "RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16"
 number_gpus = 1
 
 sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
@@ -95,8 +101,164 @@ generated_text = outputs[0].outputs[0].text
 print(generated_text)
 ```
 
-
-vLLM aslo supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+
+<details>
+<summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+```bash
+$ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+  --ipc=host \
+  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+  --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
+  --name=vllm \
+  registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+  vllm serve \
+  --tensor-parallel-size 8 \
+  --max-model-len 32768 \
+  --enforce-eager --model RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16
+```
+See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
+</details>
+
+<details>
+<summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
+
+```bash
+# Download model from Red Hat Registry via docker
+# Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
+ilab model download --repository docker://registry.redhat.io/rhelai1/mistral-small-3-1-24b-instruct-2503-quantized-w4a16:1.5
+```
+
+```bash
+# Serve model via ilab
+ilab model serve --model-path ~/.cache/instructlab/models/mistral-small-3-1-24b-instruct-2503-quantized-w4a16
+
+# Chat with model
+ilab model chat --model ~/.cache/instructlab/models/mistral-small-3-1-24b-instruct-2503-quantized-w4a16
+```
+See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
+</details>
+
+<details>
+<summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+```yaml
+# Setting up vllm server with ServingRuntime
+# Save as: vllm-servingruntime.yaml
+apiVersion: serving.kserve.io/v1alpha1
+kind: ServingRuntime
+metadata:
+  name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+  annotations:
+    openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+  labels:
+    opendatahub.io/dashboard: 'true'
+spec:
+  annotations:
+    prometheus.io/port: '8080'
+    prometheus.io/path: '/metrics'
+  multiModel: false
+  supportedModelFormats:
+    - autoSelect: true
+      name: vLLM
+  containers:
+    - name: kserve-container
+      image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
+      command:
+        - python
+        - -m
+        - vllm.entrypoints.openai.api_server
+      args:
+        - "--port=8080"
+        - "--model=/mnt/models"
+        - "--served-model-name={{.Name}}"
+      env:
+        - name: HF_HOME
+          value: /tmp/hf_home
+      ports:
+        - containerPort: 8080
+          protocol: TCP
+```
+
+```yaml
+# Attach model to vllm server. This is an NVIDIA template
+# Save as: inferenceservice.yaml
+apiVersion: serving.kserve.io/v1beta1
+kind: InferenceService
+metadata:
+  annotations:
+    openshift.io/display-name: mistral-small-3-1-24b-instruct-2503-quantized-w4a16 # OPTIONAL CHANGE
+    serving.kserve.io/deploymentMode: RawDeployment
+  name: mistral-small-3-1-24b-instruct-2503-quantized-w4a16 # specify model name. This value will be used to invoke the model in the payload
+  labels:
+    opendatahub.io/dashboard: 'true'
+spec:
+  predictor:
+    maxReplicas: 1
+    minReplicas: 1
+    model:
+      modelFormat:
+        name: vLLM
+      name: ''
+      resources:
+        limits:
+          cpu: '2' # this is model specific
+          memory: 8Gi # this is model specific
+          nvidia.com/gpu: '1' # this is accelerator specific
+        requests: # same comment for this block
+          cpu: '1'
+          memory: 4Gi
+          nvidia.com/gpu: '1'
+      runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+      storageUri: oci://registry.redhat.io/rhelai1/modelcar-mistral-small-3-1-24b-instruct-2503-quantized-w4a16:1.5
+    tolerations:
+      - effect: NoSchedule
+        key: nvidia.com/gpu
+        operator: Exists
+```
+
+```bash
+# Make sure you are in the project where you want to deploy the model
+# oc project <project-name>
+
+# Apply both resources to run the model
+
+# Apply the ServingRuntime
+oc apply -f vllm-servingruntime.yaml
+
+# Apply the InferenceService
+oc apply -f inferenceservice.yaml
+```
+
+```bash
+# Replace <inference-service-name> and <cluster-ingress-domain> below:
+# - Run `oc get inferenceservice` to find your URL if unsure.
+
+# Call the server using curl:
+curl https://<inference-service-name>-predictor-default.<cluster-ingress-domain>/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "mistral-small-3-1-24b-instruct-2503-quantized-w4a16",
+    "stream": true,
+    "stream_options": {
+      "include_usage": true
+    },
+    "max_tokens": 1,
+    "messages": [
+      {
+        "role": "user",
+        "content": "How can a bee fly when its wings are so small?"
+      }
+    ]
+  }'
+```
+
+See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+</details>
 
 ## Creation
 
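All of the deployment options added above expose the same OpenAI-compatible API. As a quick smoke test, here is a minimal client sketch using the `openai` Python package; the `base_url`, port, API key, and served model name are assumptions and should be adjusted to whichever deployment you chose (for example, on OpenShift AI the served model name is the InferenceService name).

```python
# Minimal sketch, assuming the server is reachable at http://localhost:8000/v1
# and serves the model under its Hugging Face ID; adjust both for your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed vLLM / RHAIIS endpoint
    api_key="EMPTY",                      # vLLM ignores the key unless one is configured
)

response = client.chat.completions.create(
    model="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16",
    messages=[{"role": "user", "content": "How can a bee fly when its wings are so small?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```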