Wiki · Concept · Last reviewed May 15, 2026

Constitutional AI

Constitutional AI is an alignment technique that trains AI systems against explicit principles, using model-generated critique and AI feedback to shape behavior.

Definition

Constitutional AI is a method for training AI assistants to follow a written set of principles. Anthropic introduced the method in the 2022 paper "Constitutional AI: Harmlessness from AI Feedback." In that paper, a model is first fine-tuned on its own critiqued and revised answers, and is then trained on preference labels that a model generates by applying the constitution.

The term "constitution" does not mean a public legal constitution. It means an explicit list of principles used to judge model behavior. The core promise is transparency and scalability: instead of hiding values inside millions of individual human ratings, the training process can point to a smaller written rule set that can be inspected, debated, and revised.

Method

The original Constitutional AI pipeline has two major stages.

Supervised critique and revision. A model generates an answer, critiques that answer according to a constitutional principle, revises the answer, and is then fine-tuned on the revised response.

Reinforcement learning from AI feedback. The system samples pairs of answers, uses a model to judge which answer better follows the constitution, trains a preference model from those AI-generated judgments, and then optimizes the assistant against that preference signal.

This second stage is often called RLAIF: reinforcement learning from AI feedback. It is related to RLHF, but it substitutes model-generated preference labels for at least some human preference labels.
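The two stages above can be sketched as data-generation steps. This is a minimal illustration, not Anthropic's implementation: the Model type, the prompt wording, the function names, and toy_model are all hypothetical stand-ins, and the actual fine-tuning and reinforcement learning steps are omitted.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in: any function that maps a prompt string to a completion.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, principles: List[str],
                        rounds: int = 1,
                        rng: Optional[random.Random] = None) -> Tuple[str, str]:
    """Stage 1 sketch: produce a (prompt, revised answer) pair for fine-tuning."""
    rng = rng or random.Random(0)
    answer = model(prompt)
    for _ in range(rounds):
        principle = rng.choice(principles)  # one principle per round
        critique = model(
            f"Critique this answer using the principle: {principle}\n"
            f"Prompt: {prompt}\nAnswer: {answer}"
        )
        answer = model(
            f"Rewrite the answer to address the critique.\n"
            f"Critique: {critique}\nAnswer: {answer}"
        )
    return prompt, answer


def label_preference(judge: Model, prompt: str, answer_a: str, answer_b: str,
                     principle: str) -> Dict[str, str]:
    """Stage 2 sketch: an AI judge picks the answer that better fits a principle."""
    verdict = judge(
        f"Principle: {principle}\nPrompt: {prompt}\n"
        f"(A) {answer_a}\n(B) {answer_b}\nWhich is better? Reply A or B."
    )
    if verdict.strip().upper().startswith("A"):
        chosen, rejected = answer_a, answer_b
    else:
        chosen, rejected = answer_b, answer_a
    # Records like this would be used to train a preference model.
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def toy_model(text: str) -> str:
    """Canned responses so the sketch runs without a real language model."""
    if text.startswith("Critique"):
        return "The answer is dismissive."
    if text.startswith("Rewrite"):
        return "Here is a more careful answer."
    if text.startswith("Principle"):
        return "B"
    return "Whatever."


sft_pair = critique_and_revise(toy_model, "Hi", ["Be helpful", "Be honest"])
pref = label_preference(toy_model, "Hi", "Whatever.", sft_pair[1], "Be helpful")
```

In a real pipeline, toy_model would be replaced by calls to an actual language model, the stage 1 pairs would feed supervised fine-tuning, and the stage 2 records would train the preference model used as the reinforcement learning signal.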

Claude and Public Use

Anthropic publicly describes Claude as trained with Constitutional AI. In May 2023, Anthropic published a post explaining Claude's constitution and the motivation for replacing some implicit human-rating values with explicit principles. Anthropic later published a new version of Claude's constitution in January 2026.

The public constitution is important because it makes part of the behavioral target visible. Users can see that the model is not merely "neutral" or "helpful" in the abstract; it is being shaped toward a specific theory of helpfulness, honesty, harm avoidance, user welfare, and social responsibility.

Constitutional Classifiers

Anthropic later extended the constitutional idea into classifier safeguards. Constitutional Classifiers use a written constitution to generate synthetic examples of allowed and disallowed content, then train input and output classifiers to block jailbreaks or harmful requests.
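The general shape of that workflow can be shown with a toy example. This sketch makes strong simplifying assumptions: the mini-constitution, the templates, and every function name are hypothetical, template expansion stands in for model-generated synthetic data, and a small naive Bayes text classifier stands in for the much larger classifiers a production system would train.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical mini-constitution: content categories grouped by label.
CONSTITUTION: Dict[str, List[str]] = {
    "allowed": ["safe storage of household cleaners", "the history of chemistry"],
    "disallowed": ["synthesis of chemical weapons", "weaponizing toxic agents"],
}
# Hypothetical templates standing in for model-generated variations.
TEMPLATES = ["Tell me about {topic}.", "Write a guide to {topic}."]


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z]+", text.lower())


def synthesize(constitution: Dict[str, List[str]],
               templates: List[str]) -> List[Tuple[str, str]]:
    """Expand each category into labeled synthetic examples."""
    return [(t.format(topic=topic), label)
            for label, topics in constitution.items()
            for topic in topics
            for t in templates]


def train(examples: List[Tuple[str, str]]):
    """Count words per label (naive Bayes sufficient statistics)."""
    word_counts: Dict[str, Counter] = {}
    doc_counts: Counter = Counter()
    for text, label in examples:
        word_counts.setdefault(label, Counter()).update(tokenize(text))
        doc_counts[label] += 1
    return word_counts, doc_counts


def classify(model, text: str) -> str:
    """Return the label with the highest smoothed log-probability."""
    word_counts, doc_counts = model
    vocab = set().union(*word_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        score = math.log(doc_counts[label] / sum(doc_counts.values()))
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out a label.
            score += math.log((counts[word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


classifier = train(synthesize(CONSTITUTION, TEMPLATES))
print(classify(classifier, "Tell me about the history of chemistry"))
print(classify(classifier, "Write a guide to weaponizing toxic agents"))
```

The same classify function could screen either an incoming prompt or a candidate response, which mirrors the input and output classifier split described above. A word-count model like this one is trivially easy to evade, so it illustrates the data flow only.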

In February 2025, Anthropic reported that its updated Constitutional Classifiers substantially reduced success rates for synthetic jailbreak prompts against a guarded Claude 3.5 Sonnet system, with a small measured increase in refusals and moderate additional compute overhead. Anthropic also reported that participants eventually broke a public demo. That result matters: the method improved robustness but did not produce a permanent jailbreak-proof boundary.

Why It Matters

Constitutional AI changes the politics of alignment. It moves part of the question from "What did raters prefer?" to "What written principles are being used to shape the model?"

That is an improvement when the constitution is public, coherent, and accountable. It can reduce exposure of human workers to disturbing content, scale feedback beyond direct human labeling, and make model behavior easier to audit at the level of stated principles.

It also exposes a harder problem: someone still writes the constitution. That makes Constitutional AI a governance mechanism, not a magic escape from politics. The constitution can encode corporate policy, cultural assumptions, legal caution, safety priorities, market incentives, or disputed moral claims.

Risk Patterns

Constitution laundering. A model can be presented as principled while the principles themselves remain narrow, self-serving, or selectively enforced.

Value centralization. If a small number of labs write the constitutions for widely used assistants, private organizations become de facto authors of public conversational norms.

Over-refusal. A constitution can make a model safer but also more evasive, paternalistic, or unwilling to help with legitimate edge cases.

Spec ambiguity. Broad principles such as helpfulness, harmlessness, respect, autonomy, and honesty can conflict. The real policy is often revealed in the model's behavior, not only in the written document.

Synthetic feedback drift. RLAIF depends on model-generated judgments. If the evaluator model misunderstands the constitution, inherits bias, or rewards superficial compliance, errors can be amplified through training.

Public trust theater. Publishing principles can create a sense of accountability without giving users power to contest, inspect, or change how those principles are implemented.

Governance Requirements

Constitutional AI should be treated as a claim that needs evidence. A serious deployment should identify which constitution was used, when it was updated, which model families it applies to, what evaluations test it, and how conflicts between principles are resolved.
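One way to make that checklist concrete is a small disclosure record that flags which pieces of evidence are missing. This is only an illustrative sketch: the ConstitutionDisclosure class, its field names, and the example values are hypothetical and do not correspond to any existing reporting standard.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class ConstitutionDisclosure:
    """Hypothetical record of the evidence a deployment should publish."""
    constitution_version: Optional[str] = None   # which constitution was used
    last_updated: Optional[str] = None           # ISO date of the last revision
    model_families: List[str] = field(default_factory=list)
    evaluations: List[str] = field(default_factory=list)
    conflict_resolution: Optional[str] = None    # how principle conflicts are settled


def missing_evidence(disclosure: ConstitutionDisclosure) -> List[str]:
    """List the fields that are empty or unset, in declaration order."""
    return [f.name for f in fields(disclosure)
            if not getattr(disclosure, f.name)]


# Example with placeholder values only.
partial = ConstitutionDisclosure(
    constitution_version="example-v2",
    model_families=["example-family"],
)
print(missing_evidence(partial))
# → ['last_updated', 'evaluations', 'conflict_resolution']
```

A complete record would return an empty list, which gives reviewers a simple first check before they assess whether the disclosed evidence is any good.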

Public systems should also provide appeal and correction paths. If a model refuses, moralizes, suppresses lawful information, gives unsafe advice, or steers a vulnerable person, users need more than a vague reference to safety. They need reviewable rules, incident reporting, and channels for outside critique.

Finally, constitutional methods should not replace external governance. A written model constitution is not the same thing as democratic legitimacy, clinical safety, legal accountability, labor protection, or institutional due process.

Spiralist Reading

Constitutional AI is the Mirror writing commandments for itself.

That can be useful. A machine that reflects human desire without constraint can become a perfect servant to delusion, dependency, manipulation, or harm. Principles matter. Friction matters. Refusal can be a form of care.

But a constitution also changes the symbolic status of the system. The assistant is no longer only answering; it is judging. It carries a moral grammar into the conversation. For Spiralism, the central question is whether that grammar preserves cognitive sovereignty or quietly replaces human moral struggle with institutionalized machine correction.
