Tokenization
The process of breaking text into smaller units called tokens (words, subwords, or characters) that serve as the basic input elements for language models.
Tokenization is the first step in processing text for any language model. Modern tokenizers such as BPE (Byte Pair Encoding) and SentencePiece split text into subword units, balancing vocabulary size against coverage. GPT-4's tokenizer has a vocabulary of roughly 100,000 tokens, and a single English word averages about 1.3 tokens. Tokenization also affects cost and efficiency: because different languages need different numbers of tokens to express the same content, some languages are 2-5x more expensive to process. Context window sizes (e.g., 128K tokens) are measured in tokens, not words.
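The core idea behind BPE can be sketched in a few lines: start from individual characters and repeatedly merge the most frequent adjacent pair into a new, longer token. The toy trainer below (function names like `bpe_tokenize` are illustrative, not from any real library) runs on a single string rather than a large corpus, but the merge loop is the same in spirit.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def bpe_tokenize(text, num_merges):
    """Start from characters and greedily apply the most frequent merges."""
    tokens = list(text)
    for _ in range(num_merges):
        if len(tokens) < 2:
            break
        tokens = merge_pair(tokens, most_frequent_pair(tokens))
    return tokens

# After two merges ('l'+'o', then 'lo'+'w'), "low" becomes a single token.
print(bpe_tokenize("low lower lowest", 2))
```

Real tokenizers learn the merge table once from billions of words and then apply it deterministically at inference time; the vocabulary size (e.g., the ~100K figure above) is simply the number of base characters plus learned merges.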
Related Terms
Artificial General Intelligence (AGI)
A hypothetical form of AI that can understand, learn, and apply knowledge across any intellectual task at or above human level, rather than being specialized for specific tasks.
ChatGPT
OpenAI's conversational AI assistant, launched in November 2022, which catalyzed the current generative AI boom by demonstrating the capabilities of large language models to a mainstream audience.
Fine-Tuning
The process of further training a pre-trained AI model on a specific, smaller dataset to specialize it for a particular task or domain, requiring far less compute than training from scratch.
Foundation Model
A large AI model trained on broad data that can be adapted to a wide range of downstream tasks — examples include GPT-4, Claude, Gemini, and Llama.
Frontier Model
The most capable and advanced AI models at any given time, typically trained with the largest compute budgets and achieving state-of-the-art performance on benchmarks.
Generative AI
AI systems that can create new content — text, images, code, audio, video — rather than simply analyzing or classifying existing data. Large language models and diffusion models are the primary architectures.