LTX Video
LTX Video is the first DiT-based video generation model capable of generating high-quality videos in real time. It produces 24 FPS videos at a 768x512 resolution faster than they can be watched. Trained on a large-scale dataset of diverse videos, the model generates high-resolution videos with realistic and varied content. We provide models for both text-to-video and image + text-to-video use cases.
Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.
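For example, since both LTX pipelines share the same components, a loaded text-to-video pipeline can seed an image-to-video pipeline without loading the weights a second time. A minimal sketch (assuming the component sets of the two pipelines match, as they do for LTXPipeline and LTXImageToVideoPipeline):

import torch
from diffusers import LTXPipeline, LTXImageToVideoPipeline

# Load the text-to-video pipeline once.
pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)

# Reuse the already-instantiated components (transformer, VAE, text encoder,
# tokenizer, scheduler) in the image-to-video pipeline instead of loading them again.
pipe_i2v = LTXImageToVideoPipeline(**pipe.components)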
Available models:
| Model name | Recommended dtype |
|---|---|
| LTX Video 0.9.0 | torch.bfloat16 |
| LTX Video 0.9.1 | torch.bfloat16 |
Note: The recommended dtype is for the transformer component. The VAE and text encoders can be either torch.float32, torch.bfloat16 or torch.float16 but the recommended dtype is torch.bfloat16 as used in the original repository.
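A minimal sketch of loading individual components in an explicit dtype and passing them to the pipeline (this assumes the standard diffusers subfolder layout of the Lightricks/LTX-Video repository, e.g. a "vae" and a "text_encoder" subfolder):

import torch
from diffusers import AutoencoderKLLTXVideo, LTXPipeline
from transformers import T5EncoderModel

# Load the VAE and text encoder in the dtype of your choice; bfloat16 matches
# the original repository.
vae = AutoencoderKLLTXVideo.from_pretrained(
    "Lightricks/LTX-Video", subfolder="vae", torch_dtype=torch.bfloat16
)
text_encoder = T5EncoderModel.from_pretrained(
    "Lightricks/LTX-Video", subfolder="text_encoder", torch_dtype=torch.bfloat16
)

# The remaining components, including the transformer, are loaded in bfloat16.
pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video",
    vae=vae,
    text_encoder=text_encoder,
    torch_dtype=torch.bfloat16,
)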
Loading Single Files
Loading the original LTX Video checkpoints is also possible with ~ModelMixin.from_single_file. We recommend using from_single_file for the Lightricks series of models, as they plan to release multiple models in the future in the single file format.
import torch
from diffusers import AutoencoderKLLTXVideo, LTXImageToVideoPipeline, LTXVideoTransformer3DModel
# `single_file_url` could also be https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.1.safetensors
single_file_url = "https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.safetensors"
transformer = LTXVideoTransformer3DModel.from_single_file(
single_file_url, torch_dtype=torch.bfloat16
)
vae = AutoencoderKLLTXVideo.from_single_file(single_file_url, torch_dtype=torch.bfloat16)
pipe = LTXImageToVideoPipeline.from_pretrained(
"Lightricks/LTX-Video", transformer=transformer, vae=vae, torch_dtype=torch.bfloat16
)
# ... inference code ...

Alternatively, the pipeline can be used to load the weights with ~FromSingleFileMixin.from_single_file.
import torch
from diffusers import LTXImageToVideoPipeline
from transformers import T5EncoderModel, T5Tokenizer
single_file_url = "https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.safetensors"
text_encoder = T5EncoderModel.from_pretrained(
"Lightricks/LTX-Video", subfolder="text_encoder", torch_dtype=torch.bfloat16
)
tokenizer = T5Tokenizer.from_pretrained(
"Lightricks/LTX-Video", subfolder="tokenizer", torch_dtype=torch.bfloat16
)
pipe = LTXImageToVideoPipeline.from_single_file(
single_file_url, text_encoder=text_encoder, tokenizer=tokenizer, torch_dtype=torch.bfloat16
)

Loading LTX GGUF checkpoints is also supported:
import torch
from diffusers.utils import export_to_video
from diffusers import LTXPipeline, LTXVideoTransformer3DModel, GGUFQuantizationConfig
ckpt_path = (
"https://huggingface.co/city96/LTX-Video-gguf/blob/main/ltx-video-2b-v0.9-Q3_K_S.gguf"
)
transformer = LTXVideoTransformer3DModel.from_single_file(
ckpt_path,
quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
torch_dtype=torch.bfloat16,
)
pipe = LTXPipeline.from_pretrained(
"Lightricks/LTX-Video",
transformer=transformer,
torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
video = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
width=704,
height=480,
num_frames=161,
num_inference_steps=50,
).frames[0]
export_to_video(video, "output_gguf_ltx.mp4", fps=24)Make sure to read the documentation on GGUF to learn more about our GGUF support.
Loading and running inference with LTX Video 0.9.1 weights.
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video
pipe = LTXPipeline.from_pretrained("a-r-r-o-w/LTX-Video-0.9.1-diffusers", torch_dtype=torch.bfloat16)
pipe.to("cuda")
prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
video = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
width=768,
height=512,
num_frames=161,
decode_timestep=0.03,
decode_noise_scale=0.025,
num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=24)Refer to this section to learn more about optimizing memory consumption.
LTXPipeline
class diffusers.LTXPipeline
< source >( scheduler: FlowMatchEulerDiscreteScheduler vae: AutoencoderKLLTXVideo text_encoder: T5EncoderModel tokenizer: T5TokenizerFast transformer: LTXVideoTransformer3DModel )
Parameters
- transformer (LTXVideoTransformer3DModel) — Conditional Transformer architecture to denoise the encoded video latents.
- scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with `transformer` to denoise the encoded video latents.
- vae (AutoencoderKLLTXVideo) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.
- text_encoder (`T5EncoderModel`) — T5, specifically the google/t5-v1_1-xxl variant.
- tokenizer (`T5TokenizerFast`) — Tokenizer of class T5TokenizerFast.
Pipeline for text-to-video generation.
Reference: https://github.com/Lightricks/LTX-Video
__call__
< source >( prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 512 width: int = 704 num_frames: int = 161 frame_rate: int = 25 num_inference_steps: int = 50 timesteps: typing.List[int] = None guidance_scale: float = 3 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None decode_timestep: typing.Union[float, typing.List[float]] = 0.0 decode_noise_scale: typing.Union[float, typing.List[float], NoneType] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 128 ) → ~pipelines.ltx.LTXPipelineOutput or tuple
Parameters
- prompt (`str` or `List[str]`, optional) — The prompt or prompts to guide the video generation. If not defined, one has to pass `prompt_embeds` instead.
- height (`int`, defaults to `512`) — The height in pixels of the generated video. This is set to 512 by default for the best results.
- width (`int`, defaults to `704`) — The width in pixels of the generated video. This is set to 704 by default for the best results.
- num_frames (`int`, defaults to `161`) — The number of video frames to generate.
- num_inference_steps (`int`, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality video at the expense of slower inference.
- timesteps (`List[int]`, optional) — Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed will be used. Must be in descending order.
- guidance_scale (`float`, defaults to `3`) — Guidance scale as defined in Classifier-Free Diffusion Guidance. `guidance_scale` is defined as `w` of equation 2 of the Imagen paper. Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages generating videos that are closely linked to the text `prompt`, usually at the expense of lower image quality.
- num_videos_per_prompt (`int`, optional, defaults to 1) — The number of videos to generate per prompt.
- generator (`torch.Generator` or `List[torch.Generator]`, optional) — One or a list of torch generator(s) to make generation deterministic.
- latents (`torch.Tensor`, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random `generator`.
- prompt_embeds (`torch.Tensor`, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the `prompt` input argument.
- prompt_attention_mask (`torch.Tensor`, optional) — Pre-generated attention mask for text embeddings.
- negative_prompt_embeds (`torch.FloatTensor`, optional) — Pre-generated negative text embeddings. If not provided, negative_prompt_embeds will be generated from the `negative_prompt` input argument.
- negative_prompt_attention_mask (`torch.FloatTensor`, optional) — Pre-generated attention mask for negative text embeddings.
- decode_timestep (`float`, defaults to `0.0`) — The timestep at which the generated video is decoded.
- decode_noise_scale (`float`, defaults to `None`) — The interpolation factor between random noise and denoised latents at the decode timestep.
- output_type (`str`, optional, defaults to `"pil"`) — The output format of the generated video. Choose between `PIL.Image.Image` or `np.array`.
- return_dict (`bool`, optional, defaults to `True`) — Whether or not to return a `~pipelines.ltx.LTXPipelineOutput` instead of a plain tuple.
- attention_kwargs (`dict`, optional) — A kwargs dictionary that, if specified, is passed along to the `AttentionProcessor` as defined under `self.processor` in diffusers.models.attention_processor.
- callback_on_step_end (`Callable`, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
- callback_on_step_end_tensor_inputs (`List`, optional) — The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as the `callback_kwargs` argument. You will only be able to include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
- max_sequence_length (`int`, defaults to `128`) — Maximum sequence length to use with the `prompt`.
Returns
~pipelines.ltx.LTXPipelineOutput or tuple
If return_dict is True, ~pipelines.ltx.LTXPipelineOutput is returned, otherwise a tuple is
returned where the first element is a list with the generated images.
Function invoked when calling the pipeline for generation.
Examples:
>>> import torch
>>> from diffusers import LTXPipeline
>>> from diffusers.utils import export_to_video
>>> pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
>>> negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
>>> video = pipe(
... prompt=prompt,
... negative_prompt=negative_prompt,
... width=704,
... height=480,
... num_frames=161,
... num_inference_steps=50,
... ).frames[0]
>>> export_to_video(video, "output.mp4", fps=24)

encode_prompt
< source >( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None do_classifier_free_guidance: bool = True num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None max_sequence_length: int = 128 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None )
Parameters
- prompt (`str` or `List[str]`, optional) — prompt to be encoded
- negative_prompt (`str` or `List[str]`, optional) — The prompt or prompts not to guide the video generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
- do_classifier_free_guidance (`bool`, optional, defaults to `True`) — Whether to use classifier-free guidance or not.
- num_videos_per_prompt (`int`, optional, defaults to 1) — Number of videos that should be generated per prompt.
- prompt_embeds (`torch.Tensor`, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the `prompt` input argument.
- negative_prompt_embeds (`torch.Tensor`, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the `negative_prompt` input argument.
- device (`torch.device`, optional) — torch device to place the resulting embeddings on
- dtype (`torch.dtype`, optional) — torch dtype
Encodes the prompt into text encoder hidden states.
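A minimal sketch of precomputing the text embeddings with encode_prompt and reusing them across pipeline calls; the return order assumed here is (prompt_embeds, prompt_attention_mask, negative_prompt_embeds, negative_prompt_attention_mask):

import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Encode the prompt once; the embeddings can then be reused for multiple generations.
(
    prompt_embeds,
    prompt_attention_mask,
    negative_prompt_embeds,
    negative_prompt_attention_mask,
) = pipe.encode_prompt(
    prompt="A majestic lion walking through tall grass at sunrise",
    negative_prompt="worst quality, inconsistent motion, blurry, jittery, distorted",
)

# Pass the cached embeddings instead of the raw prompt strings.
video = pipe(
    prompt_embeds=prompt_embeds,
    prompt_attention_mask=prompt_attention_mask,
    negative_prompt_embeds=negative_prompt_embeds,
    negative_prompt_attention_mask=negative_prompt_attention_mask,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=24)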
LTXImageToVideoPipeline
class diffusers.LTXImageToVideoPipeline
< source >( scheduler: FlowMatchEulerDiscreteScheduler vae: AutoencoderKLLTXVideo text_encoder: T5EncoderModel tokenizer: T5TokenizerFast transformer: LTXVideoTransformer3DModel )
Parameters
- transformer (LTXVideoTransformer3DModel) — Conditional Transformer architecture to denoise the encoded video latents.
- scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with `transformer` to denoise the encoded video latents.
- vae (AutoencoderKLLTXVideo) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.
- text_encoder (`T5EncoderModel`) — T5, specifically the google/t5-v1_1-xxl variant.
- tokenizer (`T5TokenizerFast`) — Tokenizer of class T5TokenizerFast.
Pipeline for image-to-video generation.
Reference: https://github.com/Lightricks/LTX-Video
__call__
< source >( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 512 width: int = 704 num_frames: int = 161 frame_rate: int = 25 num_inference_steps: int = 50 timesteps: typing.List[int] = None guidance_scale: float = 3 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None decode_timestep: typing.Union[float, typing.List[float]] = 0.0 decode_noise_scale: typing.Union[float, typing.List[float], NoneType] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 128 ) → ~pipelines.ltx.LTXPipelineOutput or tuple
Parameters
- image (`PipelineImageInput`) — The input image to condition the generation on. Must be an image, a list of images, or a `torch.Tensor`.
- prompt (`str` or `List[str]`, optional) — The prompt or prompts to guide the video generation. If not defined, one has to pass `prompt_embeds` instead.
- height (`int`, defaults to `512`) — The height in pixels of the generated video. This is set to 512 by default for the best results.
- width (`int`, defaults to `704`) — The width in pixels of the generated video. This is set to 704 by default for the best results.
- num_frames (`int`, defaults to `161`) — The number of video frames to generate.
- num_inference_steps (`int`, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality video at the expense of slower inference.
- timesteps (`List[int]`, optional) — Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed will be used. Must be in descending order.
- guidance_scale (`float`, defaults to `3`) — Guidance scale as defined in Classifier-Free Diffusion Guidance. `guidance_scale` is defined as `w` of equation 2 of the Imagen paper. Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages generating videos that are closely linked to the text `prompt`, usually at the expense of lower image quality.
- num_videos_per_prompt (`int`, optional, defaults to 1) — The number of videos to generate per prompt.
- generator (`torch.Generator` or `List[torch.Generator]`, optional) — One or a list of torch generator(s) to make generation deterministic.
- latents (`torch.Tensor`, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random `generator`.
- prompt_embeds (`torch.Tensor`, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the `prompt` input argument.
- prompt_attention_mask (`torch.Tensor`, optional) — Pre-generated attention mask for text embeddings.
- negative_prompt_embeds (`torch.FloatTensor`, optional) — Pre-generated negative text embeddings. If not provided, negative_prompt_embeds will be generated from the `negative_prompt` input argument.
- negative_prompt_attention_mask (`torch.FloatTensor`, optional) — Pre-generated attention mask for negative text embeddings.
- decode_timestep (`float`, defaults to `0.0`) — The timestep at which the generated video is decoded.
- decode_noise_scale (`float`, defaults to `None`) — The interpolation factor between random noise and denoised latents at the decode timestep.
- output_type (`str`, optional, defaults to `"pil"`) — The output format of the generated video. Choose between `PIL.Image.Image` or `np.array`.
- return_dict (`bool`, optional, defaults to `True`) — Whether or not to return a `~pipelines.ltx.LTXPipelineOutput` instead of a plain tuple.
- attention_kwargs (`dict`, optional) — A kwargs dictionary that, if specified, is passed along to the `AttentionProcessor` as defined under `self.processor` in diffusers.models.attention_processor.
- callback_on_step_end (`Callable`, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
- callback_on_step_end_tensor_inputs (`List`, optional) — The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as the `callback_kwargs` argument. You will only be able to include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
- max_sequence_length (`int`, defaults to `128`) — Maximum sequence length to use with the `prompt`.
Returns
~pipelines.ltx.LTXPipelineOutput or tuple
If return_dict is True, ~pipelines.ltx.LTXPipelineOutput is returned, otherwise a tuple is
returned where the first element is a list with the generated images.
Function invoked when calling the pipeline for generation.
Examples:
>>> import torch
>>> from diffusers import LTXImageToVideoPipeline
>>> from diffusers.utils import export_to_video, load_image
>>> pipe = LTXImageToVideoPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> image = load_image(
... "https://huggingface.co/datasets/a-r-r-o-w/tiny-meme-dataset-captioned/resolve/main/images/8.png"
... )
>>> prompt = "A young girl stands calmly in the foreground, looking directly at the camera, as a house fire rages in the background. Flames engulf the structure, with smoke billowing into the air. Firefighters in protective gear rush to the scene, a fire truck labeled '38' visible behind them. The girl's neutral expression contrasts sharply with the chaos of the fire, creating a poignant and emotionally charged scene."
>>> negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
>>> video = pipe(
... image=image,
... prompt=prompt,
... negative_prompt=negative_prompt,
... width=704,
... height=480,
... num_frames=161,
... num_inference_steps=50,
... ).frames[0]
>>> export_to_video(video, "output.mp4", fps=24)

encode_prompt
< source >( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None do_classifier_free_guidance: bool = True num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None max_sequence_length: int = 128 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None )
Parameters
- prompt (`str` or `List[str]`, optional) — prompt to be encoded
- negative_prompt (`str` or `List[str]`, optional) — The prompt or prompts not to guide the video generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
- do_classifier_free_guidance (`bool`, optional, defaults to `True`) — Whether to use classifier-free guidance or not.
- num_videos_per_prompt (`int`, optional, defaults to 1) — Number of videos that should be generated per prompt.
- prompt_embeds (`torch.Tensor`, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the `prompt` input argument.
- negative_prompt_embeds (`torch.Tensor`, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the `negative_prompt` input argument.
- device (`torch.device`, optional) — torch device to place the resulting embeddings on
- dtype (`torch.dtype`, optional) — torch dtype
Encodes the prompt into text encoder hidden states.
LTXPipelineOutput
class diffusers.pipelines.ltx.pipeline_output.LTXPipelineOutput
< source >( frames: Tensor )
Parameters
- frames (`torch.Tensor`, `np.ndarray`, or `List[List[PIL.Image.Image]]`) — List of video outputs. It can be a nested list of length `batch_size`, with each sub-list containing denoised PIL image sequences of length `num_frames`. It can also be a NumPy array or Torch tensor of shape `(batch_size, num_frames, channels, height, width)`.
Output class for LTX pipelines.
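A minimal sketch of consuming the output class; with the default output_type="pil", frames is a nested list containing one image sequence per prompt:

import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16).to("cuda")

# return_dict=True (the default) returns an LTXPipelineOutput.
output = pipe(prompt="A paper boat drifting down a rain-soaked street")

# frames[0] is the image sequence for the first (and only) prompt in the batch.
export_to_video(output.frames[0], "output.mp4", fps=24)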