|
This does not |
|
apply to models leveraging the Encoder-Decoder framework. |
|
For image classification models ([ViTForImageClassification]), the model expects a tensor of dimension |
|
(batch_size), with each value of the batch corresponding to the expected label of each individual image. |
|
For semantic segmentation models ([SegformerForSemanticSegmentation]), the model expects a tensor of dimension |
|
(batch_size, height, width), with each value corresponding to the expected label of each individual pixel. |
|
For object detection models ([DetrForObjectDetection]), the model expects a list of dictionaries, each with |
|
class_labels and boxes keys, where each dictionary in the batch holds the expected labels and bounding boxes of one individual image. |
|
For automatic speech recognition models ([Wav2Vec2ForCTC]), the model expects a tensor of dimension (batch_size, |
|
target_length) with each value corresponding to the expected label of each individual token. |
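
The label layouts listed above can be sketched with plain Python lists. This is only an illustration of the shapes: in practice the labels would be tensors, and the `shape` helper, the sizes, the class ids, and the box coordinates below are all hypothetical values chosen for the example, not taken from any model's documentation.

```python
# Plain-Python stand-ins for label tensors. In real use these would be
# tensors; nested lists are used here only to make the shapes visible.

def shape(nested):
    """Return the shape of a non-empty rectangular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

# Hypothetical sizes for the illustration.
batch_size, height, width, target_length = 2, 3, 4, 5

# Image classification: one class id per image, shape (batch_size,).
classification_labels = [1, 0]

# Semantic segmentation: one class id per pixel,
# shape (batch_size, height, width).
segmentation_labels = [
    [[0] * width for _ in range(height)] for _ in range(batch_size)
]

# Object detection: one dictionary per image. The number of objects may
# differ between images, which is why this is a list and not one tensor.
# Class ids and box coordinates are placeholder values.
detection_labels = [
    {"class_labels": [3], "boxes": [[0.5, 0.5, 0.2, 0.2]]},
    {"class_labels": [1, 7], "boxes": [[0.3, 0.4, 0.1, 0.1],
                                       [0.6, 0.6, 0.2, 0.3]]},
]

# Speech recognition (CTC): one token id per target position,
# shape (batch_size, target_length).
ctc_labels = [[4, 9, 2, 2, 7], [5, 1, 8, 3, 6]]

print(shape(classification_labels))
print(shape(segmentation_labels))
print(len(detection_labels), sorted(detection_labels[0]))
print(shape(ctc_labels))
```

Object detection is the one case here that does not use a single tensor, since each image can contain a different number of objects.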
|
|
|
Each model's labels may be different, so be sure to always check the documentation of each model for more information |
|
about their specific labels! |
|
|
|
The base models ([BertModel]) do not accept labels: they are the base transformer models and simply output |
|
features. |
|
large language models (LLM) |
|
A generic term that refers to transformer language models (GPT-3, BLOOM, OPT) that were trained on a large quantity of data. |