
Distributed GPU inference

Tensor parallelism shards a model across multiple GPUs and parallelizes computations such as matrix multiplication. It makes it possible to fit larger models into memory and speeds up inference because each GPU only has to process a slice of each tensor.
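
As a rough, hypothetical sketch of the idea (not how Transformers implements it internally), a linear layer's weight can be split along its output dimension so that each GPU computes only a slice of the result, and the slices are then concatenated (in practice, gathered across ranks):

# illustrative sketch only: shard a linear layer's weight along its output dimension
# so each rank computes a partial output; shapes and names here are hypothetical
import torch

world_size = 4                      # pretend we have 4 GPUs
hidden, out_features = 64, 128      # toy dimensions

x = torch.randn(2, hidden)                      # the input is replicated on every rank
weight = torch.randn(out_features, hidden)      # full weight of a nn.Linear layer

# each rank keeps only its shard of the weight
shards = weight.chunk(world_size, dim=0)

# every rank computes a partial output with its shard ...
partials = [x @ shard.T for shard in shards]    # each is (2, out_features // world_size)

# ... and gathering the partials reconstructs the full output
full_output = torch.cat(partials, dim=-1)

assert torch.allclose(full_output, x @ weight.T, atol=1e-5)

With tp_plan="auto", Transformers picks a sharding plan for the model's layers and handles the communication between GPUs for you.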

Expand the list below to see which models support tensor parallelism. Open a GitHub issue or pull request to request support for a model that isn't currently listed.

Supported models

Set tp_plan="auto" in from_pretrained() to enable tensor parallelism for inference.

import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# initialize distributed environment
rank = int(os.environ["RANK"])
device = torch.device(f"cuda:{rank}")
torch.cuda.set_device(device)
torch.distributed.init_process_group("nccl", device_id=device)

# enable tensor parallelism
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    tp_plan="auto",
)

# prepare input tokens
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
prompt = "Can I help"
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# distributed run
outputs = model(inputs)
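
The forward pass returns logits rather than generated text. As a minimal follow-up sketch (not part of the original example, and assuming the full logits are available on every rank), you could greedily decode the next token:

# optional sketch: greedily pick the next token from the last position's logits
# (print from rank 0 only to avoid duplicate output across processes)
next_token_id = outputs.logits[:, -1, :].argmax(dim=-1)
if rank == 0:
    print(tokenizer.decode([next_token_id[0].item()]))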

Launch the inference script above with torchrun, using 4 processes (one process per GPU).

torchrun --nproc-per-node 4 demo.py
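
torchrun sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables for each process, which is why the script can read RANK from os.environ. As an optional sanity check (a hypothetical helper, not part of the example), you can print them from every process:

# sanity_check.py -- hypothetical helper to verify the distributed launch
import os

rank = os.environ.get("RANK", "unset")
local_rank = os.environ.get("LOCAL_RANK", "unset")
world_size = os.environ.get("WORLD_SIZE", "unset")
print(f"rank={rank} local_rank={local_rank} world_size={world_size}")

Running torchrun --nproc-per-node 4 sanity_check.py should print four lines with ranks 0 through 3.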

You can benefit from considerable inference speedups, especially for inputs with large batch sizes or long sequences. For a single forward pass on Llama with a sequence length of 512 and various batch sizes, the expected speedups are shown in the benchmark chart accompanying this page.
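
If you want to measure the effect on your own hardware, the rough sketch below times a single forward pass at several batch sizes. It assumes model, tokenizer, device, and rank are defined as in the script above and is not the benchmark used for the published numbers.

# rough timing sketch; assumes model, tokenizer, device, and rank from the script above
import time
import torch

torch.manual_seed(0)  # keep the dummy inputs identical on every rank
seq_len = 512
for batch_size in (1, 4, 16):
    dummy = torch.randint(0, tokenizer.vocab_size, (batch_size, seq_len), device=device)
    with torch.no_grad():
        model(dummy)                      # warmup
        torch.cuda.synchronize(device)
        start = time.perf_counter()
        model(dummy)
        torch.cuda.synchronize(device)
    if rank == 0:
        print(f"batch_size={batch_size}: {time.perf_counter() - start:.3f}s per forward pass")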
