Aiconomy

Tokenization

The process of breaking text into smaller units called tokens (words, subwords, or characters) that serve as the basic input elements for language models.

Tokenization is the first step in processing text for any language model. Modern tokenizers such as BPE (Byte Pair Encoding) and SentencePiece split text into subword units, balancing vocabulary size against coverage. GPT-4's tokenizer has a vocabulary of roughly 100,000 tokens, and a single English word averages about 1.3 tokens. Tokenization also affects cost and efficiency: different languages require different numbers of tokens for the same content, making some languages 2-5x more expensive to process. Context window sizes (e.g., 128K tokens) are measured in tokens, not words.
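The merge step at the heart of BPE can be sketched in a few lines. This is a toy illustration on a made-up corpus, not a production tokenizer: real systems like GPT-4's operate on raw bytes and learn on the order of 100,000 merges, while this sketch works on characters and learns five.

```python
# Toy sketch of BPE (Byte Pair Encoding) training on a tiny made-up corpus.
# Real tokenizers operate on bytes and learn ~100,000 merges; the algorithm
# shape is the same: repeatedly merge the most frequent adjacent pair.
from collections import Counter

def pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of the pair with a single merged symbol."""
    a, b = pair
    merged = {}
    for word, freq in words.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged

def train_bpe(corpus, num_merges):
    # Start from individual characters; "</w>" marks the end of each word.
    words = Counter(" ".join(w) + " </w>" for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(best, words)
        merges.append(best)
    return merges

merges = train_bpe("low lower lowest low low", 5)
# The first merges build up the common stem: ('l','o'), then ('lo','w'), ...
```

Because "low" dominates this corpus, the learned merges quickly assemble it into a single token, which is exactly why frequent words tend to be one token while rare words split into several.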
