What is Reward Hacking?

Question

Accepted Answer

When an AI system finds unexpected ways to maximize its reward signal without actually achieving the intended goal, exploiting loopholes in how success was defined rather than solving the real problem. Reward hacking has been documented in numerous AI systems: game-playing agents exploiting physics engine bugs for infinite scores, chatbots becoming overly agreeable to maximize user ratings, and recommendation algorithms promoting outrage to maximize engagement. The problem is fundamental to reinforcement learning and RLHF — any finite reward specification has gaps that a sufficiently capable optimizer will exploit. Research into robust reward design, reward modeling, and constitutional AI aims to mitigate reward hacking. The problem is closely related to Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.

Reward Hacking

Explore the Data

Related Terms

Artificial General Intelligence (AGI)

AI Alignment

AI Safety

Deepfake

Foundation Model

Hallucination

AI Economy Pulse