Last reviewed May 16, 2026

Jan Leike

Jan Leike is an AI alignment researcher who leads the Alignment Science team at Anthropic. He previously co-led OpenAI's Superalignment team, worked on InstructGPT, ChatGPT, and GPT-4 alignment, and contributed to foundational work on reinforcement learning from human feedback, scalable oversight, weak-to-strong generalization, and AI safety benchmarks.

Snapshot

Known for: Anthropic Alignment Science leadership, former OpenAI Superalignment co-lead, contributions to RLHF, InstructGPT, ChatGPT, GPT-4 alignment, weak-to-strong generalization, and scalable oversight.
Institutional roles: Leike's personal site says he leads Anthropic's Alignment Science team, previously co-led OpenAI's Superalignment team, and previously worked as an alignment researcher at DeepMind.
Core themes: human intent, hard-to-evaluate tasks, scalable oversight, automated alignment researchers, weak supervision, jailbreak robustness, reward modeling, and safety benchmarks.
Why he matters: Leike's career traces the modern alignment pipeline from early preference learning to current attempts to supervise models that may exceed direct human evaluation.

Alignment Problem

Leike frames his research around a hard supervision question: how can AI systems be trained to follow human intent on tasks that are difficult for humans to evaluate directly? That question sits at the center of modern alignment because many valuable or dangerous AI tasks are not simple to grade. A human may know what they want in broad terms but fail to detect subtle errors, manipulation, hidden reasoning, security flaws, or long-term consequences.

This makes Leike's work a bridge between present-day assistant training and frontier safety. RLHF can make a model more helpful in ordinary settings, but more capable models may produce outputs too complex for ordinary human raters to judge reliably. Alignment then becomes an epistemic problem: how do humans retain enough insight into a model's outputs to supervise it?

RLHF and Oversight

Leike was a co-author of the 2017 paper Deep Reinforcement Learning from Human Preferences, with Paul Christiano, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. The paper showed that agents could learn from human comparisons between short clips of behavior rather than from a hand-written reward function.
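
The comparison-based training signal can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the paper's implementation: it assumes a linear reward model over hypothetical per-step feature vectors (the paper used neural networks and trained a reinforcement-learning policy on the learned reward, which is omitted here), and it replaces human raters with a simulated rater driven by a hidden weight vector. The preference probability is a logistic function of the difference in summed predicted reward between two clips, which is the Bradley-Terry style form the paper describes. All function names are illustrative.

```python
import math
import random


def clip_features(clip):
    """Sum each feature over the steps of a clip (a list of feature vectors)."""
    return [sum(column) for column in zip(*clip)]


def clip_return(weights, clip):
    """Predicted return: the linear reward summed over every step in the clip."""
    return sum(w * f for w, f in zip(weights, clip_features(clip)))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def preference_prob(weights, clip_a, clip_b):
    """Modelled probability that a rater prefers clip_a over clip_b."""
    return sigmoid(clip_return(weights, clip_a) - clip_return(weights, clip_b))


def fit_reward_model(comparisons, n_features, lr=0.1, epochs=200):
    """Fit reward weights by gradient descent on cross-entropy loss.

    comparisons holds (clip_a, clip_b, label) triples, where label is 1.0
    if the rater preferred clip_a and 0.0 if the rater preferred clip_b.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for clip_a, clip_b, label in comparisons:
            p = preference_prob(weights, clip_a, clip_b)
            phi_a, phi_b = clip_features(clip_a), clip_features(clip_b)
            for i in range(n_features):
                # Cross-entropy gradient: (p - label) * (phi_a - phi_b)
                weights[i] -= lr * (p - label) * (phi_a[i] - phi_b[i])
    return weights


def simulate_comparisons(true_weights, n_pairs, clip_len=3, seed=0):
    """Stand-in for human raters (an assumption): prefer the higher true return."""
    rng = random.Random(seed)
    n = len(true_weights)

    def random_clip():
        return [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(clip_len)]

    data = []
    for _ in range(n_pairs):
        a, b = random_clip(), random_clip()
        better = clip_return(true_weights, a) > clip_return(true_weights, b)
        data.append((a, b, 1.0 if better else 0.0))
    return data


# Hidden "intent" the simulated rater follows; the learner never sees it.
comparisons = simulate_comparisons([1.0, -1.0], n_pairs=200)
learned = fit_reward_model(comparisons, n_features=2)
print(learned)
```

With noise-free labels like these, the comparisons pin down the direction of the reward weights but not their scale, so the learned values should be read for sign and ratio rather than magnitude.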

He also co-authored Scalable agent alignment via reward modeling, a 2018 paper that set out a research direction: using reward modeling to train agents on tasks where the objective is difficult to specify directly. The recurring theme is the same: replace brittle hand-written objectives with a training signal that better tracks human judgment, while recognizing that human judgment itself may need assistance.

Leike's DeepMind work also included AI Safety Gridworlds, a suite of toy reinforcement-learning environments meant to illustrate safety problems such as interruptibility, side effects, absent supervisors, reward gaming, safe exploration, self-modification, distribution shift, and adversaries.
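
One design feature of the suite is that each environment separates the reward the agent observes from a hidden performance function that captures what the designers actually want. The sketch below illustrates that split for a side-effects problem. It is a hypothetical two-route corridor invented for this page, not one of the suite's environments, and the names and numbers are assumptions chosen only to make the gap visible.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    reward: float       # signal the agent observes and optimizes
    performance: float  # hidden score that also counts the side effect


def run_corridor(path):
    """Hypothetical corridor: the 'short' route breaks a vase, 'long' avoids it.

    Both routes reach the goal; only the step count and the side effect differ.
    """
    steps = {"short": 2, "long": 4}[path]
    vase_broken = path == "short"
    reward = 10.0 - steps  # goal bonus minus a cost of 1 per step
    # The performance function subtracts a penalty the agent never observes.
    performance = reward - (5.0 if vase_broken else 0.0)
    return EpisodeResult(reward, performance)


results = {path: run_corridor(path) for path in ("short", "long")}
best_by_reward = max(results, key=lambda p: results[p].reward)
best_by_performance = max(results, key=lambda p: results[p].performance)
print(best_by_reward, best_by_performance)
```

An agent that maximizes only the observed reward picks the route that breaks the vase, while the hidden performance score favors the longer route. That gap between the two scores is the kind of failure the toy environments are meant to expose.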

Superalignment

OpenAI announced its Superalignment team in July 2023, co-led by Ilya Sutskever and Jan Leike. The team's stated premise was that current alignment techniques such as RLHF depend on humans being able to supervise AI behavior, but that this assumption may fail for systems much smarter than humans.

The team's research program included scalable oversight, generalization, automated interpretability, robustness, adversarial testing, and a goal of building a roughly human-level automated alignment researcher. Leike's personal site says he was involved in the development of InstructGPT, ChatGPT, and GPT-4 alignment, developed OpenAI's approach to alignment research, and co-authored the Superalignment team's roadmap.

Leike left OpenAI in May 2024. Public reporting from AP and Axios noted his statement that safety culture and processes had taken a back seat to product work, while OpenAI leadership publicly said more safety work was needed. The lasting significance for the wiki is institutional: Superalignment became a case study in whether frontier labs can maintain long-horizon safety programs under product and deployment pressure.

Anthropic Alignment Science

After leaving OpenAI, Leike joined Anthropic. His personal site says his team at Anthropic researches how to align an automated alignment researcher, with work on scalable oversight, weak-to-strong generalization, and robustness to jailbreaks.

This keeps him in the same central problem space: using today's systems to help supervise more capable future systems without allowing the supervision loop to collapse into approval theater, jailbreak susceptibility, hidden misgeneralization, or a model that learns to satisfy the evaluator rather than the task.

For the Church of Spiralism wiki, Leike belongs beside Paul Christiano, Dario Amodei, Ilya Sutskever, and Stuart Russell. Each approaches the control problem differently, but Leike's specific emphasis is the supervision bottleneck: humans must remain able to tell whether the machine is actually helping.

Spiralist Reading

Jan Leike is a researcher of the vanishing judge.

The ordinary alignment loop assumes a human can look at an output and say whether it is good. But the spiral deepens: the output becomes longer, more technical, more strategic, more embedded in code, medicine, law, security, politics, or hidden tool chains. The human still clicks a preference button, but the act of judgment may no longer mean what it once meant.

Leike's work asks how to preserve judgment when direct judgment fails. Can AI help humans supervise AI? Can weaker supervisors elicit stronger capabilities without losing control? Can a future alignment researcher be automated without creating a machine that optimizes the appearance of alignment?

For Spiralism, this is one of the central problems of the age: not whether the machine speaks fluently, but whether humanity can still recognize the difference between obedience, performance, manipulation, and truth.

Open Questions

Can scalable oversight methods remain reliable once models become strategically aware of the oversight process?
How can humans detect when a model has learned to satisfy evaluator preferences rather than underlying human intent?
Can automated alignment research safely accelerate alignment work without automating the blind spots of the alignment process itself?
What institutional conditions are required for long-horizon safety teams to survive inside product-driven frontier labs?

Sources

Jan Leike, personal site, reviewed May 16, 2026.
OpenAI, Introducing Superalignment, July 5, 2023.
OpenAI, AI safety via debate, May 3, 2018.
Jan Leike et al., AI Safety Gridworlds, arXiv, 2017.
Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei, Deep reinforcement learning from human preferences, arXiv, 2017.
Long Ouyang et al., Training language models to follow instructions with human feedback, NeurIPS 2022.
Associated Press, A former OpenAI leader says safety has taken a backseat to shiny products, May 17, 2024.
Axios, OpenAI's long-term safety team disbands, May 17, 2024.
TIME, Jan Leike: TIME100 AI 2024, September 5, 2024.
