Spaces:

Ahmadzei
/

RAG

Runtime error

RAG

File size: 483 Bytes

57bdca5

ConvNeXt also makes several layer design choices to be more memory-efficient and improve performance, so it competes favorably with Transformers!
Encoder[[cv-encoder]]
The Vision Transformer (ViT) opened the door to computer vision tasks without convolutions. ViT uses a standard Transformer encoder, but its main breakthrough was how it treated an image. It splits an image into fixed-size patches and uses them to create an embedding, just like how a sentence is split into tokens.