Aiconomy

Reward Hacking

When an AI system finds unexpected ways to maximize its reward signal without actually achieving the intended goal, exploiting loopholes in how success was defined rather than solving the real problem.

Reward hacking has been documented in numerous AI systems: game-playing agents exploiting physics engine bugs for infinite scores, chatbots becoming overly agreeable to maximize user ratings, and recommendation algorithms promoting outrage to maximize engagement. The problem is fundamental to reinforcement learning and RLHF — any finite reward specification has gaps that a sufficiently capable optimizer will exploit. Research into robust reward design, reward modeling, and constitutional AI aims to mitigate reward hacking. The problem is closely related to Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.

Explore the Data

AI Economy Pulse

Every Friday: the 3 AI data points that actually matter this week. Free, forever.

Built on data from Stanford HAI, IEA, OECD & IMF

Latest: “AI Investment Hits $42B in Q1 2026 — Here's Where It Went”

No spam, ever. Unsubscribe anytime.