Vision Transformer (ViT)

An adaptation of the transformer architecture for computer vision that processes images as sequences of patches, achieving state-of-the-art results on image classification and other visual tasks.

Vision Transformers, introduced by Google Research in 2020, apply the same self-attention mechanism used in LLMs to image understanding. ViTs split images into fixed-size patches (typically 16x16 pixels), flatten each patch into a vector, and process the resulting sequence with standard transformer layers. When pre-trained on large datasets, ViTs have matched or surpassed CNNs on many vision benchmarks. Models built on ViT backbones, such as DINOv2, Segment Anything (SAM), and CLIP, have extended the approach to self-supervised learning, segmentation, and multi-modal understanding.
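To make the patches-as-sequence idea concrete, here is a minimal sketch in PyTorch. It is not the original implementation; the class name MiniViT is hypothetical, and the default hyperparameters (16x16 patches, 768-dim embeddings, 12 layers, 12 heads) follow the ViT-Base configuration from the original paper.

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Illustrative Vision Transformer: patchify, embed, attend, classify."""

    def __init__(self, image_size=224, patch_size=16, dim=768,
                 depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196

        # Patch embedding: a convolution with stride == kernel size is
        # equivalent to slicing non-overlapping patches and applying a
        # shared linear projection to each flattened patch.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                     stride=patch_size)
        # Learnable [CLS] token and position embeddings, as in the paper.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):                    # (B, 3, H, W)
        x = self.patch_embed(images)              # (B, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)          # (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                       # standard self-attention
        return self.head(x[:, 0])                 # classify via [CLS] token

logits = MiniViT()(torch.randn(2, 3, 224, 224))   # -> shape (2, 1000)
```

Note the absence of any vision-specific machinery beyond the patch embedding: once the image becomes a sequence of 196 tokens, the encoder is the same stack of transformer layers used for text.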
