Vision Transformer (ViT)

An adaptation of the transformer architecture for computer vision that processes images as sequences of patches, achieving state-of-the-art results on image classification and other visual tasks.

Vision Transformers, introduced by Google Research in 2020, apply the same self-attention mechanism used in LLMs to image understanding. ViTs split images into fixed-size patches (typically 16x16 pixels), flatten each patch into a vector, and process the resulting sequence with standard transformer layers. When pre-trained on large datasets, ViTs have matched or surpassed CNNs on many vision benchmarks. Models built on ViT backbones, such as DINOv2, Segment Anything (SAM), and CLIP, have extended the approach to self-supervised learning, segmentation, and multi-modal understanding.
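To make the patches-as-sequence idea concrete, here is a minimal sketch in PyTorch. It is not the original implementation; the class name MiniViT is hypothetical, and the default hyperparameters (16x16 patches, 768-dim embeddings, 12 layers, 12 heads) follow the ViT-Base configuration from the original paper.

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Illustrative Vision Transformer: patchify, embed, attend, classify."""

    def __init__(self, image_size=224, patch_size=16, dim=768,
                 depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196

        # Patch embedding: a convolution with stride == kernel size is
        # equivalent to slicing non-overlapping patches and applying a
        # shared linear projection to each flattened patch.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                     stride=patch_size)
        # Learnable [CLS] token and position embeddings, as in the paper.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):                    # (B, 3, H, W)
        x = self.patch_embed(images)              # (B, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)          # (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                       # standard self-attention
        return self.head(x[:, 0])                 # classify via [CLS] token

logits = MiniViT()(torch.randn(2, 3, 224, 224))   # -> shape (2, 1000)
```

Note the absence of any vision-specific machinery beyond the patch embedding: once the image becomes a sequence of 196 tokens, the encoder is the same stack of transformer layers used for text.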
