What is Jailbreaking (AI)?

Question

Accepted Answer

Techniques for bypassing an AI model's safety filters and restrictions to produce outputs the model was designed to refuse, such as harmful instructions or policy-violating content. Jailbreaking techniques have evolved from simple prompt manipulation to sophisticated multi-step attacks. Common approaches include role-playing scenarios, hypothetical framing, encoding harmful instructions, and iterative refinement. AI labs play a constant cat-and-mouse game: each safety patch is met with new jailbreaking techniques. Red teaming exercises at DEF CON 2023 found that most frontier models could be jailbroken within minutes. The difficulty of preventing jailbreaks while maintaining model usefulness is a fundamental tension in AI safety. Research into robust safety training aims to make jailbreaking progressively harder.

Jailbreaking (AI)

Live Data

Explore the Data

Related Terms

Artificial General Intelligence (AGI)

AI Alignment

AI Safety

ChatGPT

Deepfake

Fine-Tuning

AI Economy Pulse