Reinforcement Learning from Human Feedback (RLHF)
A technique for aligning AI models with human preferences: human evaluators rank model outputs, those rankings are used to train a reward model, and the reward model's scores serve as the reward signal that improves the model's behavior.
RLHF is the primary technique used to make large language models helpful, harmless, and honest, and it was instrumental in making ChatGPT conversational and useful. The technique requires significant human labor for evaluation, creating new job categories such as AI trainers. As AI systems become more capable, scaling RLHF and developing more efficient alignment techniques are a major focus of AI safety research, a field that receives less than 1% of total AI R&D spending.
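To make the ranking-to-reward step concrete, here is a minimal sketch of the reward-modeling stage in PyTorch. It assumes human rankings have already been reduced to preference pairs and that responses have been pre-encoded into feature vectors; the RewardModel class, feature dimension, and batch size are illustrative placeholders, not any production RLHF pipeline. The trained reward model's scores would then drive a reinforcement-learning step (commonly PPO) on the language model itself.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Hypothetical reward model: maps an encoded (prompt, response)
    feature vector to a single scalar score."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)  # scalar reward head

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score(features).squeeze(-1)

def preference_loss(model, chosen_feats, rejected_feats):
    """Bradley-Terry pairwise loss: push the reward of the response
    humans preferred above the reward of the one they rejected."""
    r_chosen = model(chosen_feats)
    r_rejected = model(rejected_feats)
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Toy training step with random features standing in for encoded text.
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
chosen = torch.randn(16, 128)    # features of preferred responses
rejected = torch.randn(16, 128)  # features of rejected responses

optimizer.zero_grad()
loss = preference_loss(model, chosen, rejected)
loss.backward()
optimizer.step()
```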
Related Terms
Artificial General Intelligence (AGI)
A hypothetical form of AI that can understand, learn, and apply knowledge across any intellectual task at or above human level, rather than being specialized for specific tasks.
AI Alignment
The research field focused on ensuring AI systems behave in accordance with human values and intentions, particularly as systems become more capable.
AI Safety
The interdisciplinary field focused on preventing AI systems from causing harm, encompassing alignment, robustness, interpretability, and governance of AI technologies.
ChatGPT
OpenAI's conversational AI assistant, launched in November 2022, which catalyzed the current generative AI boom by demonstrating the capabilities of large language models to a mainstream audience.
Deepfake
AI-generated synthetic media — images, video, or audio — that realistically depict events or statements that never occurred, created using deep learning techniques.
Fine-Tuning
The process of further training a pre-trained AI model on a specific, smaller dataset to specialize it for a particular task or domain, requiring far less compute than training from scratch.
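As an illustration of the fine-tuning entry above, the sketch below specializes a pre-trained torchvision ResNet-18 for a hypothetical 10-class task by freezing the backbone and training only a new classifier head; the class count, batch, and hyperparameters are placeholder assumptions. Freezing the pre-trained weights is one common way fine-tuning saves compute relative to training from scratch.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pre-trained ResNet-18 and freeze its weights.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False

# Replace the classifier head for a hypothetical 10-class task;
# only this new layer will be trained.
model.fc = nn.Linear(model.fc.in_features, 10)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One training step on a toy batch standing in for the small,
# task-specific dataset.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 10, (8,))

optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```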