Ahmadzei's picture
update 1
57bdca5
raw
history blame
1.4 kB
This does not
apply to models leveraging the Encoder-Decoder framework.
For image classification models, ([ViTForImageClassification]), the model expects a tensor of dimension
(batch_size) with each value of the batch corresponding to the expected label of each individual image.
For semantic segmentation models, ([SegformerForSemanticSegmentation]), the model expects a tensor of dimension
(batch_size, height, width) with each value of the batch corresponding to the expected label of each individual pixel.
For object detection models, ([DetrForObjectDetection]), the model expects a list of dictionaries with a
class_labels and boxes key where each value of the batch corresponds to the expected label and number of bounding boxes of each individual image.
For automatic speech recognition models, ([Wav2Vec2ForCTC]), the model expects a tensor of dimension (batch_size,
target_length) with each value corresponding to the expected label of each individual token.
Each model's labels may be different, so be sure to always check the documentation of each model for more information
about their specific labels!
The base models ([BertModel]) do not accept labels, as these are the base transformer models, simply outputting
features.
large language models (LLM)
A generic term that refers to transformer language models (GPT-3, BLOOM, OPT) that were trained on a large quantity of data.