optimum-neuron plugin for vLLM
The optimum-neuron package includes a vLLM plugin that registers an "optimum-neuron" vLLM platform, specifically designed to ease the deployment of models hosted on the Hugging Face Hub to AWS Trainium and Inferentia.
This platform supports two modes of operation:
- it can be used for the inference of pre-exported Neuron models directly from the hub,
- it also allows the simplified deployment of vanilla models, without recompilation, using cached artifacts.
Notes
- Only a relevant subset of all possible configurations for a given model is cached.
- You can use the optimum-cli to list the cached configurations for each model, as shown below.
- To deploy models that are not cached on the Hugging Face Hub, you need to export them beforehand.
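For example, you can look up the cached configurations for a model with the optimum-cli (a sketch; the exact output depends on your optimum-neuron version):

optimum-cli neuron cache lookup unsloth/Llama-3.2-1B-Instruct

If the configuration you need is not cached, export the model first. The values below (batch size, sequence length, number of cores, cast type) are illustrative and should match your intended deployment settings:

optimum-cli export neuron \
    --model unsloth/Llama-3.2-1B-Instruct \
    --batch_size 4 \
    --sequence_length 4096 \
    --num_cores 2 \
    --auto_cast_type bf16 \
    llama-3.2-1b-neuron/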
Setup
The easiest way to use the optimum-neuron vLLM platform is to launch an Amazon EC2 instance using the Hugging Face Neuron Deep Learning AMI.
Note: Trn2 instances are not supported by the optimum-neuron platform yet.
- After launching the instance, follow the instructions in Connect to your instance to connect to it.
- Once inside your instance, activate the pre-installed optimum-neuron virtual environment by running:

source /opt/aws_neuronx_venv_pytorch_2_7/bin/activate
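You can then check that the Neuron devices are visible with the neuron-ls tool from the Neuron SDK (a quick sanity check; the exact output depends on your instance type):

neuron-ls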
Offline inference example
The easiest way to test a model is to use the Python API:
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="unsloth/Llama-3.2-1B-Instruct",
          max_num_seqs=4,
          max_model_len=4096,
          tensor_parallel_size=2,
          device="neuron")

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Online inference example
You can also launch an OpenAI-compatible inference server.
python -m vllm.entrypoints.openai.api_server \
    --model="unsloth/Llama-3.2-1B-Instruct" \
    --max-num-seqs=4 \
    --max-model-len=4096 \
    --tensor-parallel-size=2 \
    --port=8080 \
    --device "neuron"
Use the following command to test the model:
curl 127.0.0.1:8080/v1/completions \
    -H 'Content-Type: application/json' \
    -X POST \
    -d '{"prompt": "One of my fondest memory is", "temperature": 0.8, "max_tokens": 128}'
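Since the server implements the OpenAI API, you can also query it from Python with the openai client package (a sketch, assuming the server launched above; the api_key is a placeholder, required by the client but not checked by the server):

from openai import OpenAI

# Point the client at the local vLLM server
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")

completion = client.completions.create(
    model="unsloth/Llama-3.2-1B-Instruct",
    prompt="One of my fondest memory is",
    temperature=0.8,
    max_tokens=128,
)
print(completion.choices[0].text)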