Distributed GPU inference
Tensor parallelism shards a model across multiple GPUs and parallelizes computations such as matrix multiplication. It makes it possible to fit larger models into memory and speeds up inference because each GPU only processes a slice of each tensor.
Expand the list below to see which models support tensor parallelism. Open a GitHub issue or pull request to request support for a model that isn't listed.
Supported models
Set tp_plan="auto" in from_pretrained() to enable tensor parallelism for inference.
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# initialize the distributed environment (torchrun sets RANK for each process)
rank = int(os.environ["RANK"])
device = torch.device(f"cuda:{rank}")
torch.cuda.set_device(device)
torch.distributed.init_process_group("nccl", device_id=device)
# enable tensor parallelism
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3-8B-Instruct",
tp_plan="auto",
)
# prepare input tokens
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
prompt = "Can I help"
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
# distributed run: each GPU computes the forward pass on its shard of the model
outputs = model(inputs)
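The forward pass above returns logits rather than text. As a minimal sketch for decoding, reusing the model, tokenizer, and inputs objects from the script and assuming generation is supported for your tensor-parallel model, you could add:
# sketch: generate and decode text with the tensor-parallel model
generated = model.generate(inputs, max_new_tokens=20)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))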
Launch the inference script above with torchrun, using 4 processes (one per GPU).
torchrun --nproc-per-node 4 demo.py
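torchrun starts one process per GPU, so all 4 processes run the whole script. To avoid duplicated output, it is common to guard printing or saving by rank; a small sketch using the standard torch.distributed rank query:
# print only from rank 0 to avoid repeating the output on every GPU
if torch.distributed.get_rank() == 0:
    print(outputs.logits.shape)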
Tensor parallelism can deliver considerable inference speedups, especially for inputs with large batch sizes or long sequences.
For a single forward pass on Llama with a sequence length of 512 and various batch sizes, you can expect the following speedups.
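To measure the speedup on your own hardware, a rough timing sketch along these lines can be added to the script (the batch size of 8 is illustrative, not the benchmark setting):
import time

# dummy batch: batch size 8, sequence length 512 (illustrative values)
dummy_ids = torch.randint(0, model.config.vocab_size, (8, 512), device=device)

with torch.no_grad():
    model(dummy_ids)  # warm-up pass
    torch.cuda.synchronize()
    start = time.perf_counter()
    model(dummy_ids)  # timed forward pass
    torch.cuda.synchronize()
    print(f"forward pass took {time.perf_counter() - start:.3f}s")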
