Transformers documentation

Multi-GPU inference

You are viewing the main version, which requires installation from source. If you'd like a regular pip install, check out the latest stable version (v4.48.0).

Built-in tensor parallelism (TP) is now available for certain models using PyTorch. Tensor parallelism shards a model across multiple GPUs, which makes it possible to run larger models, and parallelizes computations such as matrix multiplication.
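To make the sharding idea concrete, here is a minimal sketch of column-wise sharding for a single matrix multiplication, written with plain Python lists. It is only an illustration of the general technique, not how Transformers or PyTorch implement it: the helper names (matmul, shard_columns, column_parallel_matmul) are hypothetical, and each "shard" merely stands in for the slice of a weight matrix that one GPU would hold.

```python
# Toy illustration of column-wise tensor parallelism using plain Python lists.
# No GPUs or distributed runtime are involved; each shard stands in for one device.

def matmul(x, w):
    """Multiply x (m x k) by w (k x n), both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def shard_columns(w, num_shards):
    """Split w into num_shards contiguous column blocks."""
    n = len(w[0])
    assert n % num_shards == 0, "column count must be divisible by num_shards"
    width = n // num_shards
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(num_shards)]

def column_parallel_matmul(x, w, num_shards):
    """Compute x @ w by multiplying x with each column shard, then concatenating."""
    # Each partial product could be computed on a different device.
    partials = [matmul(x, shard) for shard in shard_columns(w, num_shards)]
    # Concatenate the partial outputs along the column dimension.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
full = matmul(x, w)
sharded = column_parallel_matmul(x, w, num_shards=2)
print(full == sharded)  # the sharded result matches the unsharded one
```

Because every shard only needs its own slice of the weights, the memory for the weight matrix is divided among the devices, and the partial products are independent of each other until the final concatenation.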

To enable tensor parallelism, pass the argument tp_plan="auto" to from_pretrained():

import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

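# Initialize the distributed process group.
# torchrun sets RANK for every process; on a single node it equals the local GPU index.
rank = int(os.environ["RANK"])
device = torch.device(f"cuda:{rank}")
torch.distributed.init_process_group("nccl", device_id=device)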

# Retrieve tensor parallel model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    tp_plan="auto",
)

# Prepare input tokens
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "Can I help"
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Distributed forward pass: every process runs this line on its own shard of the model
outputs = model(inputs)

You can use torchrun to launch the above script with multiple processes, each mapping to a GPU:

torchrun --nproc-per-node 4 demo.py
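torchrun passes each process its position in the job through environment variables such as RANK, LOCAL_RANK, and WORLD_SIZE. The sketch below shows one way a script might read them, with a single-process fallback so it can also run without torchrun. The helper read_dist_env and its fallback behavior are assumptions made for illustration; they are not part of Transformers or PyTorch.

```python
import os

def read_dist_env(env=None):
    """Return (rank, local_rank, world_size) from torchrun-style variables.

    Falls back to a single-process layout when the variables are absent.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", 0))
    # Fallback assumption: without LOCAL_RANK, treat the job as single-node.
    local_rank = int(env.get("LOCAL_RANK", rank))
    world_size = int(env.get("WORLD_SIZE", 1))
    return rank, local_rank, world_size

# Simulate what the third of four processes on one node would see.
third_of_four = read_dist_env({"RANK": "2", "LOCAL_RANK": "2", "WORLD_SIZE": "4"})
# Simulate running the script directly, without a launcher.
no_launcher = read_dist_env({})
print(third_of_four, no_launcher)
```

On a multi-node job, LOCAL_RANK rather than RANK identifies the GPU index on the current machine, which is why the sketch reads both.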

PyTorch tensor parallelism is currently supported for the following models:

You can request to add tensor parallel support for another model by opening a GitHub Issue or Pull Request.

Expected speedups

You can benefit from considerable speedups for inference, especially for inputs with large batch sizes or long sequences.
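If you want to check the speedup on your own hardware, one rough approach is to time the same forward pass with and without tensor parallelism and divide the two timings. The sketch below is a generic timing harness, not a benchmark from this documentation: time_call and speedup are hypothetical helpers, and the workload is a CPU stand-in. When timing GPU work, you would pass a synchronization function such as torch.cuda.synchronize as sync, since CUDA kernels run asynchronously.

```python
import statistics
import time

def time_call(fn, warmup=2, iters=5, sync=None):
    """Return the median wall-clock time in seconds of calling fn()."""
    def run_once():
        if sync is not None:
            sync()  # drain pending asynchronous work before starting the clock
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the timed work to finish before stopping the clock
        return time.perf_counter() - start

    # Warm-up runs are executed but their timings are discarded.
    for _ in range(warmup):
        run_once()
    return statistics.median(run_once() for _ in range(iters))

def speedup(baseline_seconds, parallel_seconds):
    """Ratio of baseline time to parallel time; values above 1 mean faster."""
    return baseline_seconds / parallel_seconds

# CPU stand-in workload; replace with a model forward pass in practice.
elapsed = time_call(lambda: sum(i * i for i in range(10_000)))
print(f"median time: {elapsed:.6f} s")
```

The median is used instead of the mean so that a single slow iteration does not dominate the result.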

For a single forward pass on Llama with a sequence length of 512 and various batch sizes, the expected speedup is as follows:
