Quantization

A model compression technique that reduces the precision of a model's numerical values (e.g., from 32-bit to 4-bit), shrinking model size and accelerating inference with minimal accuracy loss.
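
As a minimal sketch of how this works (a toy example, not any particular library's method), the snippet below symmetrically quantizes float weights to 4-bit integers in [-8, 7] using a single scale factor, then dequantizes them; the names quantize_4bit and dequantize are illustrative. Real LLM formats add refinements such as per-group scales and packing two 4-bit values into each byte.

```python
import numpy as np

def quantize_4bit(weights: np.ndarray):
    """Symmetric quantization: map floats to integers in [-8, 7]."""
    # Scale so the largest-magnitude weight maps to the int4 extreme.
    scale = np.abs(weights).max() / 7.0
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the stored integers."""
    return q.astype(np.float32) * scale

weights = np.random.randn(8).astype(np.float32)
q, scale = quantize_4bit(weights)
print("round-trip error:", np.abs(weights - dequantize(q, scale)).max())
```

The round-trip error printed at the end is the "accuracy loss" the definition refers to: small for well-behaved weight distributions, but never zero.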

Quantization can reduce model size by 4-8x and speed up inference by 2-4x, making large models deployable on consumer hardware. A 70-billion-parameter model at 16-bit half precision requires roughly 140GB of memory, but 4-bit quantization cuts this to around 35GB. GPTQ, AWQ, and GGML/GGUF are popular quantization formats for LLMs. The technique has been critical for the open-source LLM ecosystem, enabling models like Llama 2 70B to run on gaming GPUs that cost under $2,000.
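
The memory figures above follow from simple arithmetic: the weight-only footprint is the parameter count times the bits per parameter. A short sketch (model_memory_gb is a hypothetical helper, not a library function) reproduces the 70B numbers:

```python
def model_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Weight-only footprint: parameters x bits, converted to gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

for bits in (32, 16, 4):
    print(f"{bits}-bit: {model_memory_gb(70e9, bits):.0f} GB")
# 32-bit: 280 GB, 16-bit: 140 GB, 4-bit: 35 GB
```

Note that this counts weights only; inference also needs memory for activations and the KV cache.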
