Ahmadzei's picture
update 1
57bdca5
raw
history blame
508 Bytes
After pretraining, the decoder is thrown away, and the encoder is ready to be used in downstream tasks.
Decoder[[cv-decoder]]
Decoder-only vision models are rare because most vision models rely on an encoder to learn an image representation. But for use cases like image generation, the decoder is a natural fit, as we've seen from text generation models like GPT-2. ImageGPT uses the same architecture as GPT-2, but instead of predicting the next token in a sequence, it predicts the next pixel in an image.