Aiconomy

Tokenization

The process of breaking text into smaller units called tokens (words, subwords, or characters) that serve as the basic input elements for language models.

Tokenization is the first step in processing text for any language model. Modern tokenizers such as BPE (Byte Pair Encoding) and SentencePiece split text into subword units, balancing vocabulary size against coverage. GPT-4's tokenizer has a vocabulary of roughly 100,000 tokens, and a single English word averages about 1.3 tokens. Tokenization also affects cost and efficiency: different languages require different numbers of tokens for the same content, making some languages 2-5x more expensive to process. Context window sizes (e.g., 128K tokens) are measured in tokens, not words.
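The merge step at the heart of BPE can be sketched in a few lines. This is a toy illustration on a made-up corpus, not a production tokenizer: real systems like GPT-4's operate on raw bytes and learn on the order of 100,000 merges, while this sketch works on characters and learns five.

```python
# Toy sketch of BPE (Byte Pair Encoding) training on a tiny made-up corpus.
# Real tokenizers operate on bytes and learn ~100,000 merges; the algorithm
# shape is the same: repeatedly merge the most frequent adjacent pair.
from collections import Counter

def pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of the pair with a single merged symbol."""
    a, b = pair
    merged = {}
    for word, freq in words.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged

def train_bpe(corpus, num_merges):
    # Start from individual characters; "</w>" marks the end of each word.
    words = Counter(" ".join(w) + " </w>" for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(best, words)
        merges.append(best)
    return merges

merges = train_bpe("low lower lowest low low", 5)
# The first merges build up the common stem: ('l','o'), then ('lo','w'), ...
```

Because "low" dominates this corpus, the learned merges quickly assemble it into a single token, which is exactly why frequent words tend to be one token while rare words split into several.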
