Aiconomy

Quantization

A model compression technique that reduces the precision of a model's numerical values (e.g., from 32-bit floats to 4-bit integers), shrinking model size and accelerating inference with minimal accuracy loss.

Quantization can reduce model size by 4-8x and speed up inference by 2-4x, making large models deployable on consumer hardware. A 70-billion-parameter model in 16-bit precision requires roughly 140GB of memory (2 bytes per parameter), but 4-bit quantization reduces this to around 35GB. GPTQ, AWQ, and GGML/GGUF are popular quantization formats for LLMs. The technique has been critical for the open-source LLM ecosystem, enabling models like Llama 2 70B to run on gaming GPUs that cost under $2,000.
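The core idea can be sketched in a few lines. Below is a minimal illustration of symmetric per-tensor integer quantization in NumPy: floats are scaled into a small signed-integer range and rounded, then multiplied back by the scale at inference time. Real formats like GPTQ and AWQ are far more sophisticated (per-group scales, calibration data, error correction), so treat this only as a toy model of the arithmetic; the function names and parameters here are illustrative, not from any library.

```python
import numpy as np

def quantize(weights, bits=4):
    """Symmetric per-tensor quantization: floats -> small signed ints + one scale."""
    qmax = 2 ** (bits - 1) - 1             # e.g. 7 for 4-bit, 127 for 8-bit
    scale = np.abs(weights).max() / qmax   # a single scale for the whole tensor
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for use during inference."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)   # stand-in for a weight tensor

q, scale = quantize(w, bits=4)
w_hat = dequantize(q, scale)
print("max abs error:", np.abs(w - w_hat).max())

# The memory arithmetic behind the figures above:
# 70e9 params * 2 bytes (16-bit)  = ~140 GB
# 70e9 params * 0.5 bytes (4-bit) = ~35 GB
```

Because rounding is the only source of error, each weight lands within half a quantization step (scale / 2) of its original value, which is why aggressive quantization often costs so little accuracy.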
