ViTMatte
Overview
The ViTMatte model was proposed in Boosting Image Matting with Pretrained Plain Vision Transformers by Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang. ViTMatte leverages plain Vision Transformers for the task of image matting, which is the process of accurately estimating the foreground object in images and videos.
The abstract from the paper is the following:
Recently, plain vision Transformers (ViTs) have shown impressive performance on various computer vision tasks, thanks to their strong modeling capacity and large-scale pretraining. However, they have not yet conquered the problem of image matting. We hypothesize that image matting could also be boosted by ViTs and present a new efficient and robust ViT-based matting system, named ViTMatte. Our method utilizes (i) a hybrid attention mechanism combined with a convolution neck to help ViTs achieve an excellent performance-computation trade-off in matting tasks. (ii) Additionally, we introduce the detail capture module, which just consists of simple lightweight convolutions to complement the detailed information required by matting. To the best of our knowledge, ViTMatte is the first work to unleash the potential of ViT on image matting with concise adaptation. It inherits many superior properties from ViT to matting, including various pretraining strategies, concise architecture design, and flexible inference strategies. We evaluate ViTMatte on Composition-1k and Distinctions-646, the most commonly used benchmark for image matting, our method achieves state-of-the-art performance and outperforms prior matting works by a large margin.
This model was contributed by nielsr. The original code can be found here.

Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ViTMatte.
- A demo notebook regarding inference with VitMatteForImageMatting, including background replacement, can be found here.
The model expects both the image and trimap (concatenated) as input. Use VitMatteImageProcessor for this purpose, as in the sketch below.
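As a quick sketch of this input format (the file names are placeholders for a local RGB image and its trimap):
>>> from PIL import Image
>>> from transformers import VitMatteImageProcessor

>>> processor = VitMatteImageProcessor.from_pretrained("hustvl/vitmatte-small-composition-1k")

>>> # placeholder file names: an RGB image and a single-channel trimap of the same size
>>> image = Image.open("image.png").convert("RGB")
>>> trimap = Image.open("trimap.png").convert("L")

>>> # the processor concatenates the trimap to the RGB channels, so pixel_values has 4 channels
>>> inputs = processor(images=image, trimaps=trimap, return_tensors="pt")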
VitMatteConfig
class transformers.VitMatteConfig
< source >( backbone_config: PretrainedConfig = None backbone = None use_pretrained_backbone = False use_timm_backbone = False backbone_kwargs = None hidden_size: int = 384 batch_norm_eps: float = 1e-05 initializer_range: float = 0.02 convstream_hidden_sizes: typing.List[int] = [48, 96, 192] fusion_hidden_sizes: typing.List[int] = [256, 128, 64, 32] **kwargs )
Parameters
- backbone_config (`PretrainedConfig` or `dict`, optional, defaults to `VitDetConfig()`) — The configuration of the backbone model.
- backbone (`str`, optional) — Name of backbone to use when `backbone_config` is `None`. If `use_pretrained_backbone` is `True`, this will load the corresponding pretrained weights from the timm or transformers library. If `use_pretrained_backbone` is `False`, this loads the backbone's config and uses that to initialize the backbone with random weights.
- use_pretrained_backbone (`bool`, optional, defaults to `False`) — Whether to use pretrained weights for the backbone.
- use_timm_backbone (`bool`, optional, defaults to `False`) — Whether to load `backbone` from the timm library. If `False`, the backbone is loaded from the transformers library.
- backbone_kwargs (`dict`, optional) — Keyword arguments to be passed to AutoBackbone when loading from a checkpoint, e.g. `{'out_indices': (0, 1, 2, 3)}`. Cannot be specified if `backbone_config` is set.
- hidden_size (`int`, optional, defaults to 384) — The number of input channels of the decoder.
- batch_norm_eps (`float`, optional, defaults to 1e-05) — The epsilon used by the batch norm layers.
- initializer_range (`float`, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- convstream_hidden_sizes (`List[int]`, optional, defaults to `[48, 96, 192]`) — The output channels of the ConvStream module.
- fusion_hidden_sizes (`List[int]`, optional, defaults to `[256, 128, 64, 32]`) — The output channels of the Fusion blocks.
This is the configuration class to store the configuration of VitMatteForImageMatting. It is used to instantiate a ViTMatte model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the ViTMatte hustvl/vitmatte-small-composition-1k architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import VitMatteConfig, VitMatteForImageMatting
>>> # Initializing a ViTMatte hustvl/vitmatte-small-composition-1k style configuration
>>> configuration = VitMatteConfig()
>>> # Initializing a model (with random weights) from the hustvl/vitmatte-small-composition-1k style configuration
>>> model = VitMatteForImageMatting(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
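The backbone can also be specified explicitly. Below is a minimal sketch that passes a custom `VitDetConfig` as the backbone configuration; the particular sizes are illustrative, not a recommended setup:
>>> from transformers import VitDetConfig, VitMatteConfig, VitMatteForImageMatting

>>> # illustrative backbone configuration; its hidden_size is matched to the decoder's hidden_size
>>> backbone_config = VitDetConfig(hidden_size=384, num_hidden_layers=12)
>>> configuration = VitMatteConfig(backbone_config=backbone_config, hidden_size=384)
>>> model = VitMatteForImageMatting(configuration)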
to_dict
Serializes this instance to a Python dictionary. Overrides the default to_dict(). Returns: `Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance.
VitMatteImageProcessor
class transformers.VitMatteImageProcessor
< source >( do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None do_pad: bool = True size_divisibility: int = 32 **kwargs )
Parameters
- do_rescale (`bool`, optional, defaults to `True`) — Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by the `do_rescale` parameter in the `preprocess` method.
- rescale_factor (`int` or `float`, optional, defaults to `1/255`) — Scale factor to use if rescaling the image. Can be overridden by the `rescale_factor` parameter in the `preprocess` method.
- do_normalize (`bool`, optional, defaults to `True`) — Whether to normalize the image. Can be overridden by the `do_normalize` parameter in the `preprocess` method.
- image_mean (`float` or `List[float]`, optional, defaults to `IMAGENET_STANDARD_MEAN`) — Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
- image_std (`float` or `List[float]`, optional, defaults to `IMAGENET_STANDARD_STD`) — Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
- do_pad (`bool`, optional, defaults to `True`) — Whether to pad the image to make the width and height divisible by `size_divisibility`. Can be overridden by the `do_pad` parameter in the `preprocess` method.
- size_divisibility (`int`, optional, defaults to 32) — The width and height of the image will be padded to be divisible by this number.
Constructs a ViTMatte image processor.
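For example, a minimal sketch of constructing the processor with a non-default padding granularity (the value is illustrative):
>>> from transformers import VitMatteImageProcessor

>>> # pad height and width to multiples of 16 instead of the default 32 (illustrative value)
>>> processor = VitMatteImageProcessor(do_pad=True, size_divisibility=16)
>>> processor.size_divisibility
16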
preprocess
< source >( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] trimaps: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None do_pad: typing.Optional[bool] = None size_divisibility: typing.Optional[int] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: typing.Union[str, transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None )
Parameters
- images (`ImageInput`) — Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set `do_rescale=False`.
- trimaps (`ImageInput`) — Trimap to preprocess.
- do_rescale (`bool`, optional, defaults to `self.do_rescale`) — Whether to rescale the image values to the [0, 1] range.
- rescale_factor (`float`, optional, defaults to `self.rescale_factor`) — Rescale factor to rescale the image by if `do_rescale` is set to `True`.
- do_normalize (`bool`, optional, defaults to `self.do_normalize`) — Whether to normalize the image.
- image_mean (`float` or `List[float]`, optional, defaults to `self.image_mean`) — Image mean to use if `do_normalize` is set to `True`.
- image_std (`float` or `List[float]`, optional, defaults to `self.image_std`) — Image standard deviation to use if `do_normalize` is set to `True`.
- do_pad (`bool`, optional, defaults to `self.do_pad`) — Whether to pad the image.
- size_divisibility (`int`, optional, defaults to `self.size_divisibility`) — The size divisibility to pad the image to if `do_pad` is set to `True`.
- return_tensors (`str` or `TensorType`, optional) — The type of tensors to return. Can be one of:
  - Unset: Return a list of `np.ndarray`.
  - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
  - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
  - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
  - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
- data_format (`ChannelDimension` or `str`, optional, defaults to `ChannelDimension.FIRST`) — The channel dimension format for the output image. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
  - Unset: Use the channel dimension format of the input image.
- input_data_format (`ChannelDimension` or `str`, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
  - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
Preprocess an image or batch of images.
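As an illustration of the expected inputs and the padding behavior, the sketch below uses synthetic arrays and assumes the default `size_divisibility=32`; the trimap is concatenated to the RGB channels, so `pixel_values` ends up with 4 channels:
>>> import numpy as np
>>> from transformers import VitMatteImageProcessor

>>> processor = VitMatteImageProcessor()

>>> # synthetic 500x400 RGB image and a matching single-channel trimap
>>> image = np.random.randint(0, 256, (400, 500, 3), dtype=np.uint8)
>>> trimap = np.random.randint(0, 256, (400, 500), dtype=np.uint8)

>>> inputs = processor.preprocess(images=image, trimaps=trimap, return_tensors="pt")
>>> # height and width are padded to the next multiple of 32: 400 -> 416, 500 -> 512
>>> print(inputs["pixel_values"].shape)
torch.Size([1, 4, 416, 512])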
VitMatteImageProcessorFast
class transformers.VitMatteImageProcessorFast
< source >( **kwargs: typing_extensions.Unpack[transformers.models.vitmatte.image_processing_vitmatte_fast.VitMatteFastImageProcessorKwargs] )
Parameters
- do_resize (`bool`, optional, defaults to `None`) — Whether to resize the image.
- size (`dict[str, int]`, optional, defaults to `None`) — Describes the maximum input dimensions to the model.
- default_to_square (`bool`, optional, defaults to `True`) — Whether to default to a square image when resizing, if size is an int.
- resample (`Union[PILImageResampling, F.InterpolationMode, NoneType]`, defaults to `None`) — Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only has an effect if `do_resize` is set to `True`.
- do_center_crop (`bool`, optional, defaults to `None`) — Whether to center crop the image.
- crop_size (`dict[str, int]`, optional, defaults to `None`) — Size of the output image after applying `center_crop`.
- do_rescale (`bool`, optional, defaults to `True`) — Whether to rescale the image.
- rescale_factor (`Union[int, float, NoneType]`, defaults to `0.00392156862745098`) — Rescale factor to rescale the image by if `do_rescale` is set to `True`.
- do_normalize (`bool`, optional, defaults to `True`) — Whether to normalize the image.
- image_mean (`Union[float, list[float], NoneType]`, defaults to `[0.5, 0.5, 0.5]`) — Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
- image_std (`Union[float, list[float], NoneType]`, defaults to `[0.5, 0.5, 0.5]`) — Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to `True`.
- do_convert_rgb (`bool`, optional, defaults to `None`) — Whether to convert the image to RGB.
- return_tensors (`Union[str, ~utils.generic.TensorType, NoneType]`, defaults to `None`) — Returns stacked tensors if set to `"pt"`, otherwise returns a list of tensors.
- data_format (`~image_utils.ChannelDimension`, optional, defaults to `ChannelDimension.FIRST`) — Only `ChannelDimension.FIRST` is supported. Added for compatibility with slow processors.
- input_data_format (`Union[~image_utils.ChannelDimension, str, NoneType]`, defaults to `None`) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
  - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
- device (`torch.device`, optional, defaults to `None`) — The device to process the images on. If unset, the device is inferred from the input images.
- do_pad (`bool`, optional, defaults to `True`) — Whether to pad the image to make the width and height divisible by `size_divisibility`. Can be overridden by the `do_pad` parameter in the `preprocess` method.
- size_divisibility (`int`, optional, defaults to 32) — The width and height of the image will be padded to be divisible by this number.
Constructs a fast ViTMatte image processor.
preprocess
< source >( images: list trimaps: list **kwargs: typing_extensions.Unpack[transformers.models.vitmatte.image_processing_vitmatte_fast.VitMatteFastImageProcessorKwargs] ) → <class 'transformers.image_processing_base.BatchFeature'>
Parameters
- images (`list`) — Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set `do_rescale=False`.
- trimaps (`list`) — The trimaps to preprocess.
- do_resize (`bool`, optional) — Whether to resize the image.
- size (`dict[str, int]`, optional) — Describes the maximum input dimensions to the model.
- default_to_square (`bool`, optional) — Whether to default to a square image when resizing, if size is an int.
- resample (`Union[PILImageResampling, F.InterpolationMode, NoneType]`) — Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only has an effect if `do_resize` is set to `True`.
- do_center_crop (`bool`, optional) — Whether to center crop the image.
- crop_size (`dict[str, int]`, optional) — Size of the output image after applying `center_crop`.
- do_rescale (`bool`, optional) — Whether to rescale the image.
- rescale_factor (`Union[int, float, NoneType]`) — Rescale factor to rescale the image by if `do_rescale` is set to `True`.
- do_normalize (`bool`, optional) — Whether to normalize the image.
- image_mean (`Union[float, list[float], NoneType]`) — Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
- image_std (`Union[float, list[float], NoneType]`) — Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to `True`.
- do_convert_rgb (`bool`, optional) — Whether to convert the image to RGB.
- return_tensors (`Union[str, ~utils.generic.TensorType, NoneType]`) — Returns stacked tensors if set to `"pt"`, otherwise returns a list of tensors.
- data_format (`~image_utils.ChannelDimension`, optional) — Only `ChannelDimension.FIRST` is supported. Added for compatibility with slow processors.
- input_data_format (`Union[~image_utils.ChannelDimension, str, NoneType]`) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
  - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
- device (`torch.device`, optional) — The device to process the images on. If unset, the device is inferred from the input images.
- do_pad (`bool`, optional, defaults to `True`) — Whether to pad the image to make the width and height divisible by `size_divisibility`. Can be overridden by the `do_pad` parameter in the `preprocess` method.
- size_divisibility (`int`, optional, defaults to 32) — The width and height of the image will be padded to be divisible by this number.
Returns
<class 'transformers.image_processing_base.BatchFeature'>
- data (`dict`) — Dictionary of lists/arrays/tensors returned by the `__call__` method ("pixel_values", etc.).
- tensor_type (`Union[None, str, TensorType]`, optional) — You can give a tensor_type here to convert the lists of integers into PyTorch/TensorFlow/NumPy tensors at initialization.
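As a usage sketch (it assumes the checkpoint's preprocessing config can be loaded into the fast class and that a CUDA device is available; file names are placeholders), preprocessing can be run directly on the GPU:
>>> from PIL import Image
>>> from transformers import VitMatteImageProcessorFast

>>> processor = VitMatteImageProcessorFast.from_pretrained("hustvl/vitmatte-small-composition-1k")

>>> # placeholder file names for an RGB image and its trimap
>>> image = Image.open("image.png").convert("RGB")
>>> trimap = Image.open("trimap.png").convert("L")

>>> # run preprocessing on the GPU (assumes CUDA is available)
>>> inputs = processor(images=image, trimaps=trimap, return_tensors="pt", device="cuda")
>>> inputs["pixel_values"].device.type
'cuda'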
VitMatteForImageMatting
class transformers.VitMatteForImageMatting
< source >( config )
Parameters
- config (VitMatteConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
ViTMatte framework leveraging any vision backbone (e.g. ViTDet) for image matting.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( pixel_values: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None labels: typing.Optional[torch.Tensor] = None return_dict: typing.Optional[bool] = None )
Parameters
- pixel_values (`torch.Tensor` of shape `(batch_size, num_channels, image_size, image_size)`, optional) — The tensors corresponding to the input images. Pixel values can be obtained using VitMatteImageProcessor. See VitMatteImageProcessor.__call__ for details.
- output_attentions (`bool`, optional) — Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned tensors for more detail.
- output_hidden_states (`bool`, optional) — Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
- labels (`torch.Tensor` of shape `(batch_size, height, width)`, optional) — Ground truth image matting for computing the loss.
- return_dict (`bool`, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
The VitMatteForImageMatting forward method overrides the `__call__` special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Examples:
>>> from transformers import VitMatteImageProcessor, VitMatteForImageMatting
>>> import torch
>>> from PIL import Image
>>> from huggingface_hub import hf_hub_download
>>> processor = VitMatteImageProcessor.from_pretrained("hustvl/vitmatte-small-composition-1k")
>>> model = VitMatteForImageMatting.from_pretrained("hustvl/vitmatte-small-composition-1k")
>>> filepath = hf_hub_download(
... repo_id="hf-internal-testing/image-matting-fixtures", filename="image.png", repo_type="dataset"
... )
>>> image = Image.open(filepath).convert("RGB")
>>> filepath = hf_hub_download(
... repo_id="hf-internal-testing/image-matting-fixtures", filename="trimap.png", repo_type="dataset"
... )
>>> trimap = Image.open(filepath).convert("L")
>>> # prepare image + trimap for the model
>>> inputs = processor(images=image, trimaps=trimap, return_tensors="pt")
>>> with torch.no_grad():
... alphas = model(**inputs).alphas
>>> print(alphas.shape)
torch.Size([1, 1, 640, 960])
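A common follow-up, not part of the model itself, is to crop the padding away and composite the foreground onto a new background. The sketch below continues the example above and assumes the processor padded only on the bottom and right:
>>> import numpy as np

>>> # crop the alpha matte back to the original image size (padding is assumed to be bottom/right)
>>> width, height = image.size
>>> alpha = alphas[0, 0, :height, :width].numpy()

>>> # composite the foreground onto a plain white background (illustrative choice)
>>> foreground = np.array(image).astype(np.float32) / 255.0
>>> background = np.ones_like(foreground)
>>> composite = alpha[..., None] * foreground + (1.0 - alpha[..., None]) * background
>>> Image.fromarray((composite * 255).astype(np.uint8)).save("composite.png")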