Diffusers documentation
HunyuanVideo
HunyuanVideo
HunyuanVideo is a 13B parameter diffusion transformer model designed to be competitive with closed-source video foundation models and enable wider community access. This model uses a “dual-stream to single-stream” architecture to separately process the video and text tokens first, before concatenating and feeding them to the transformer to fuse the multimodal information. A pretrained multimodal large language model (MLLM) is used as the encoder because it has better image-text alignment, better image detail description and reasoning, and it can be used as a zero-shot learner if system instructions are added to user prompts. Finally, HunyuanVideo uses a 3D causal variational autoencoder to more efficiently process video data at the original resolution and frame rate.
You can find all the original HunyuanVideo checkpoints under the Tencent organization.
Click on the HunyuanVideo models in the right sidebar for more examples of video generation tasks.
The examples below use a checkpoint from hunyuanvideo-community because the weights are stored in a layout compatible with Diffusers.
The example below demonstrates how to generate a video optimized for memory or inference speed.
Refer to the Reduce memory usage guide for more details about the various memory saving techniques.
The quantized HunyuanVideo model below requires ~14GB of VRAM.
import torch
from diffusers import AutoModel, HunyuanVideoPipeline
from diffusers.quantizers import PipelineQuantizationConfig
from diffusers.utils import export_to_video
# quantize weights to int4 with bitsandbytes
pipeline_quant_config = PipelineQuantizationConfig(
quant_backend="bitsandbytes_4bit",
quant_kwargs={
"load_in_4bit": True,
"bnb_4bit_quant_type": "nf4",
"bnb_4bit_compute_dtype": torch.bfloat16
},
components_to_quantize=["transformer"]
)
pipeline = HunyuanVideoPipeline.from_pretrained(
"hunyuanvideo-community/HunyuanVideo",
quantization_config=pipeline_quant_config,
torch_dtype=torch.bfloat16,
)
# model-offloading and tiling
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()
prompt = "A fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
Notes
HunyuanVideo supports LoRAs with load_lora_weights().
Show example code
import torch from diffusers import AutoModel, HunyuanVideoPipeline from diffusers.quantizers import PipelineQuantizationConfig from diffusers.utils import export_to_video # quantize weights to int4 with bitsandbytes pipeline_quant_config = PipelineQuantizationConfig( quant_backend="bitsandbytes_4bit", quant_kwargs={ "load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16 }, components_to_quantize=["transformer"] ) pipeline = HunyuanVideoPipeline.from_pretrained( "hunyuanvideo-community/HunyuanVideo", quantization_config=pipeline_quant_config, torch_dtype=torch.bfloat16, ) # load LoRA weights pipeline.load_lora_weights("https://huggingface.co/lucataco/hunyuan-steamboat-willie-10", adapter_name="steamboat-willie") pipeline.set_adapters("steamboat-willie", 0.9) # model-offloading and tiling pipeline.enable_model_cpu_offload() pipeline.vae.enable_tiling() # use "In the style of SWR" to trigger the LoRA prompt = """ In the style of SWR. A black and white animated scene featuring a fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys. """ video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0] export_to_video(video, "output.mp4", fps=15)
Refer to the table below for recommended inference values.
parameter recommended value text encoder dtype torch.float16
transformer dtype torch.bfloat16
vae dtype torch.float16
num_frames (k)
4 * k
+ 1Try lower
shift
values (2.0
to5.0
) for lower resolution videos and highershift
values (7.0
to12.0
) for higher resolution images.
HunyuanVideoPipeline
class diffusers.HunyuanVideoPipeline
< source >( text_encoder: LlamaModel tokenizer: LlamaTokenizerFast transformer: HunyuanVideoTransformer3DModel vae: AutoencoderKLHunyuanVideo scheduler: FlowMatchEulerDiscreteScheduler text_encoder_2: CLIPTextModel tokenizer_2: CLIPTokenizer )
Parameters
- text_encoder (
LlamaModel
) — Llava Llama3-8B. - tokenizer (
LlamaTokenizer
) — Tokenizer from Llava Llama3-8B. - transformer (HunyuanVideoTransformer3DModel) — Conditional Transformer to denoise the encoded image latents.
- scheduler (FlowMatchEulerDiscreteScheduler) —
A scheduler to be used in combination with
transformer
to denoise the encoded image latents. - vae (AutoencoderKLHunyuanVideo) — Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
- text_encoder_2 (
CLIPTextModel
) — CLIP, specifically the clip-vit-large-patch14 variant. - tokenizer_2 (
CLIPTokenizer
) — Tokenizer of class CLIPTokenizer.
Pipeline for text-to-video generation using HunyuanVideo.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
__call__
< source >( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str]] = None negative_prompt_2: typing.Union[str, typing.List[str]] = None height: int = 720 width: int = 1280 num_frames: int = 129 num_inference_steps: int = 50 sigmas: typing.List[float] = None true_cfg_scale: float = 1.0 guidance_scale: float = 6.0 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] prompt_template: typing.Dict[str, typing.Any] = {'template': '<|start_header_id|>system<|end_header_id|>\n\nDescribe the video by detailing the following aspects: 1. The main content and theme of the video.2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects.3. Actions, events, behaviors temporal relationships, physical movement changes of the objects.4. background environment, light, style and atmosphere.5. camera angles, movements, and transitions used in the video:<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|>', 'crop_start': 95} max_sequence_length: int = 256 ) → ~HunyuanVideoPipelineOutput
or tuple
Parameters
- prompt (
str
orList[str]
, optional) — The prompt or prompts to guide the image generation. If not defined, one has to passprompt_embeds
. instead. - prompt_2 (
str
orList[str]
, optional) — The prompt or prompts to be sent totokenizer_2
andtext_encoder_2
. If not defined,prompt
is will be used instead. - negative_prompt (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation. If not defined, one has to passnegative_prompt_embeds
instead. Ignored when not using guidance (i.e., ignored iftrue_cfg_scale
is not greater than1
). - negative_prompt_2 (
str
orList[str]
, optional) — The prompt or prompts not to guide the image generation to be sent totokenizer_2
andtext_encoder_2
. If not defined,negative_prompt
is used in all the text-encoders. - height (
int
, defaults to720
) — The height in pixels of the generated image. - width (
int
, defaults to1280
) — The width in pixels of the generated image. - num_frames (
int
, defaults to129
) — The number of frames in the generated video. - num_inference_steps (
int
, defaults to50
) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference. - sigmas (
List[float]
, optional) — Custom sigmas to use for the denoising process with schedulers which support asigmas
argument in theirset_timesteps
method. If not defined, the default behavior whennum_inference_steps
is passed will be used. - true_cfg_scale (
float
, optional, defaults to 1.0) — When > 1.0 and a providednegative_prompt
, enables true classifier-free guidance. - guidance_scale (
float
, defaults to6.0
) — Guidance scale as defined in Classifier-Free Diffusion Guidance.guidance_scale
is defined asw
of equation 2. of Imagen Paper. Guidance scale is enabled by settingguidance_scale > 1
. Higher guidance scale encourages to generate images that are closely linked to the textprompt
, usually at the expense of lower image quality. Note that the only available HunyuanVideo model is CFG-distilled, which means that traditional guidance between unconditional and conditional latent is not applied. - num_videos_per_prompt (
int
, optional, defaults to 1) — The number of images to generate per prompt. - generator (
torch.Generator
orList[torch.Generator]
, optional) — Atorch.Generator
to make generation deterministic. - latents (
torch.Tensor
, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied randomgenerator
. - prompt_embeds (
torch.Tensor
, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from theprompt
input argument. - pooled_prompt_embeds (
torch.FloatTensor
, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings will be generated fromprompt
input argument. - negative_prompt_embeds (
torch.FloatTensor
, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated fromnegative_prompt
input argument. - negative_pooled_prompt_embeds (
torch.FloatTensor
, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds will be generated fromnegative_prompt
input argument. - output_type (
str
, optional, defaults to"pil"
) — The output format of the generated image. Choose betweenPIL.Image
ornp.array
. - return_dict (
bool
, optional, defaults toTrue
) — Whether or not to return aHunyuanVideoPipelineOutput
instead of a plain tuple. - attention_kwargs (
dict
, optional) — A kwargs dictionary that if specified is passed along to theAttentionProcessor
as defined underself.processor
in diffusers.models.attention_processor. - clip_skip (
int
, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings. - callback_on_step_end (
Callable
,PipelineCallback
,MultiPipelineCallbacks
, optional) — A function or a subclass ofPipelineCallback
orMultiPipelineCallbacks
that is called at the end of each denoising step during the inference. with the following arguments:callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)
.callback_kwargs
will include a list of all tensors as specified bycallback_on_step_end_tensor_inputs
. - callback_on_step_end_tensor_inputs (
List
, optional) — The list of tensor inputs for thecallback_on_step_end
function. The tensors specified in the list will be passed ascallback_kwargs
argument. You will only be able to include variables listed in the._callback_tensor_inputs
attribute of your pipeline class.
Returns
~HunyuanVideoPipelineOutput
or tuple
If return_dict
is True
, HunyuanVideoPipelineOutput
is returned, otherwise a tuple
is returned
where the first element is a list with the generated images and the second element is a list of bool
s
indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.
The call function to the pipeline for generation.
Examples:
>>> import torch
>>> from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
>>> from diffusers.utils import export_to_video
>>> model_id = "hunyuanvideo-community/HunyuanVideo"
>>> transformer = HunyuanVideoTransformer3DModel.from_pretrained(
... model_id, subfolder="transformer", torch_dtype=torch.bfloat16
... )
>>> pipe = HunyuanVideoPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.float16)
>>> pipe.vae.enable_tiling()
>>> pipe.to("cuda")
>>> output = pipe(
... prompt="A cat walks on the grass, realistic",
... height=320,
... width=512,
... num_frames=61,
... num_inference_steps=30,
... ).frames[0]
>>> export_to_video(output, "output.mp4", fps=15)
Disable sliced VAE decoding. If enable_vae_slicing
was previously enabled, this method will go back to
computing decoding in one step.
Disable tiled VAE decoding. If enable_vae_tiling
was previously enabled, this method will go back to
computing decoding in one step.
Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow processing larger images.
HunyuanVideoPipelineOutput
class diffusers.pipelines.hunyuan_video.pipeline_output.HunyuanVideoPipelineOutput
< source >( frames: Tensor )
Parameters
- frames (
torch.Tensor
,np.ndarray
, or List[List[PIL.Image.Image]]) — List of video outputs - It can be a nested list of lengthbatch_size,
with each sub-list containing denoised PIL image sequences of lengthnum_frames.
It can also be a NumPy array or Torch tensor of shape(batch_size, num_frames, channels, height, width)
.
Output class for HunyuanVideo pipelines.