
Vision Transformer (ViT)

An adaptation of the transformer architecture for computer vision that processes images as sequences of patches, achieving state-of-the-art results on image classification and other visual tasks.

Vision Transformers, introduced by Google Research in 2020 in the paper "An Image Is Worth 16x16 Words" (Dosovitskiy et al.), apply the same self-attention mechanism used in LLMs to image understanding. A ViT splits an image into fixed-size patches (typically 16x16 pixels), flattens each patch, linearly projects it into an embedding, adds positional embeddings, and processes the resulting sequence with standard transformer layers; classification is typically read from a special [CLS] token prepended to the sequence. When pre-trained on sufficiently large datasets, ViTs match or surpass CNNs on most major vision benchmarks. ViT-based models such as DINOv2, Segment Anything (SAM), and CLIP have extended the approach to self-supervised learning, segmentation, and multi-modal understanding.
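The patch pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a real ViT: the projection matrix, [CLS] token, and positional embeddings are randomly initialized here, whereas in an actual model they are learned during training. Shapes follow the common ViT-Base setup (224x224 RGB input, 16x16 patches, embedding dimension 768).

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened patches of shape
    (num_patches, patch_size * patch_size * C)."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    ph, pw = H // patch_size, W // patch_size
    # Reshape into a grid of patches, then flatten each patch.
    patches = image.reshape(ph, patch_size, pw, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(ph * pw, patch_size * patch_size * C)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))

patches = patchify(img)            # (196, 768): 14x14 patches, 16*16*3 values each

# Linear projection into the transformer's embedding space
# (random stand-in for a learned weight matrix).
d_model = 768
W_proj = rng.standard_normal((patches.shape[1], d_model)) * 0.02
tokens = patches @ W_proj          # (196, 768)

# Prepend a [CLS] token and add positional embeddings
# (both random here; learned in a real ViT).
cls_token = rng.standard_normal((1, d_model)) * 0.02
seq = np.vstack([cls_token, tokens]) + rng.standard_normal((197, d_model)) * 0.02

print(seq.shape)  # (197, 768) -- the sequence fed to the transformer layers
```

From here, `seq` would pass through standard transformer encoder blocks (multi-head self-attention plus MLP), exactly as in a language model, with the final [CLS] representation feeding a classification head.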
