File size: 483 Bytes
57bdca5 |
1 2 3 |
ConvNeXt also makes several layer design choices to be more memory-efficient and improve performance, so it competes favorably with Transformers! Encoder[[cv-encoder]] The Vision Transformer (ViT) opened the door to computer vision tasks without convolutions. ViT uses a standard Transformer encoder, but its main breakthrough was how it treated an image. It splits an image into fixed-size patches and uses them to create an embedding, just like how a sentence is split into tokens. |