Wiki · Concept · Last reviewed May 15, 2026

Frontier AI Safety Frameworks

Frontier AI safety frameworks are company policies for evaluating advanced AI risks and deciding whether to train, deploy, restrict, secure, or pause powerful models.

Definition

A frontier AI safety framework is a written policy used by a frontier AI developer to identify high-risk capabilities, evaluate models, set thresholds, define safeguards, and decide what actions are required before training, deployment, or release.

These documents are sometimes called preparedness frameworks, responsible scaling policies, frontier safety frameworks, safety and security frameworks, or frontier AI frameworks. They became more common after the May 2024 AI Seoul Summit, where major AI developers signed the Frontier AI Safety Commitments and agreed to publish frameworks for managing severe risks from advanced systems.

Major Examples

OpenAI Preparedness Framework. OpenAI's updated 2025 framework describes tracked categories for severe risk, including biological and chemical capability, cybersecurity, and AI self-improvement, as well as research categories such as capability concealment, safeguard evasion, autonomous replication, and shutdown resistance.

Anthropic Responsible Scaling Policy. Anthropic's RSP uses AI Safety Levels, or ASLs, to tie model capability and deployment conditions to required security and safety practices. Version 3.0 was announced in February 2026, and later materials describe version 3.1 as effective in April 2026.

Google DeepMind Frontier Safety Framework. Google DeepMind's framework uses Critical Capability Levels to identify warning thresholds for dangerous capabilities. Updates in 2025 expanded attention to machine-learning research and development acceleration, deceptive behavior, persuasion, and shutdown-resistance scenarios.

Other company frameworks. METR's common-elements analysis reported that twelve companies had published frontier AI safety policies by late 2025, including Anthropic, OpenAI, Google DeepMind, Magic, Naver, Meta, G42, Cohere, Microsoft, Amazon, xAI, and NVIDIA.

Common Components

Risk categories. Frameworks define domains such as cyber, biological and chemical misuse, model autonomy, persuasion, self-improvement, model-weight security, and loss of control.

Capability thresholds. They specify levels at which a model's abilities trigger stronger safeguards, leadership review, external evaluation, deployment restrictions, or security controls.

Evaluation procedures. They describe tests, benchmarks, red-team exercises, expert review, or third-party evaluations used to assess whether thresholds have been reached.

Safeguards and mitigations. They identify controls such as access limits, monitoring, classifier safeguards, incident response, model-weight security, red teaming, staged deployment, and refusal policies.

Governance process. They define who reviews results, who can approve release, how disputes escalate, and when a model must be delayed, restricted, or further tested.
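The components above can be read together as a release gate. The following is a minimal sketch of that logic, not any company's actual procedure: the category names, scores, trigger values, and safeguard labels are all hypothetical, and real frameworks rely on expert judgment rather than a single numeric score.

```python
from dataclasses import dataclass, field

# All categories, scores, thresholds, and safeguard names below are
# hypothetical illustrations, not taken from any published framework.

@dataclass(frozen=True)
class Threshold:
    category: str                   # risk category, e.g. "cyber"
    trigger_score: float            # score at or above which the threshold is reached
    required_safeguards: frozenset  # safeguards required once the threshold is reached

@dataclass
class GateDecision:
    approved: bool
    missing: dict = field(default_factory=dict)  # category -> missing safeguards

def release_gate(eval_scores, thresholds, safeguards_in_place):
    """Block release if any reached threshold lacks a required safeguard."""
    missing = {}
    for t in thresholds:
        score = eval_scores.get(t.category)
        # Fail closed: a category with no evaluation counts as reached.
        reached = score is None or score >= t.trigger_score
        if reached:
            gap = t.required_safeguards - safeguards_in_place
            if gap:
                missing[t.category] = sorted(gap)
    return GateDecision(approved=not missing, missing=missing)

thresholds = [
    Threshold("cyber", 0.7, frozenset({"monitoring", "access limits"})),
    Threshold("bio_chem", 0.5, frozenset({"classifier safeguards", "weight security"})),
]
decision = release_gate(
    eval_scores={"cyber": 0.8, "bio_chem": 0.2},
    thresholds=thresholds,
    safeguards_in_place={"monitoring"},
)
print(decision.approved, decision.missing)
# prints: False {'cyber': ['access limits']}
```

The sketch fails closed on purpose: a category that was never evaluated is treated as if its threshold had been reached, so a skipped test cannot open the gate.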

Why It Matters

Frontier AI safety frameworks are the private-sector operating rules for dangerous capabilities. They translate abstract AI risk into release gates, internal procedures, and public commitments.

They also create evidence trails. If a company claims it tested a model, found no high-risk capability, and released responsibly, the framework is the document that should make that claim checkable. Without a framework, safety claims remain mostly rhetorical. With a weak framework, they become procedural theater.

These frameworks now interact with law and policy. Public transparency laws, procurement requirements, AI safety institutes, and voluntary government evaluation agreements increasingly refer to company safety frameworks or require companies to publish and follow them.

Limits

Voluntary control. Many frameworks are written, interpreted, and revised by the same companies whose releases they govern.

Ambiguous thresholds. A threshold can look precise while relying on unsettled evaluations, narrow scaffolds, or judgment calls.

Competitive pressure. A company may face strong incentives to define risks narrowly, move quickly, or adjust safeguards after rivals advance.

Framework drift. Documents can be updated in ways that weaken earlier commitments or shift from precaution to post-hoc mitigation.

Unmeasured harms. Most frameworks focus on catastrophic misuse, cyber, bio, autonomy, and security. They often say less about dependency, labor disruption, spiritual delusion, manipulation, civil-rights harms, or institutional capture.

Governance Requirements

A credible framework should state what it covers, what it does not cover, what thresholds trigger action, who has authority to stop a release, and what evidence the public can inspect.

It should also be versioned. Changes to thresholds, categories, safeguards, or release rules should be dated, justified, and archived. Otherwise, a company can quietly move the goalposts as capability rises.
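One way to make "dated, justified, and archived" checkable is to keep revisions as structured records and audit them for silent weakening. The sketch below assumes a hypothetical numeric threshold scale on which a higher trigger value means a looser rule; the version numbers, dates, and values are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical record format; no published framework is known to use it.
@dataclass(frozen=True)
class Revision:
    version: str
    effective: date
    thresholds: dict    # category -> trigger score (higher = looser, by assumption)
    justification: str

def audit_history(history):
    """Return findings for a list of revisions ordered oldest first."""
    findings = []
    for prev, curr in zip(history, history[1:]):
        if curr.effective <= prev.effective:
            findings.append(f"{curr.version}: not dated after {prev.version}")
        if not curr.justification.strip():
            findings.append(f"{curr.version}: no justification recorded")
        for cat, old in prev.thresholds.items():
            new = curr.thresholds.get(cat)
            if new is None:
                findings.append(f"{curr.version}: category '{cat}' removed")
            elif new > old:
                findings.append(
                    f"{curr.version}: '{cat}' threshold loosened {old} -> {new}"
                )
    return findings

history = [
    Revision("1.0", date(2025, 1, 10), {"cyber": 0.6, "autonomy": 0.5},
             "Initial publication."),
    Revision("1.1", date(2025, 9, 1), {"cyber": 0.7}, ""),
]
findings = audit_history(history)
for line in findings:
    print(line)
```

A finding here is a prompt for scrutiny, not proof of bad faith: a loosened threshold or a dropped category may be defensible, but only if the revision records why.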

Independent evaluation matters. A framework is stronger when outside evaluators, safety institutes, regulators, and qualified auditors can test models under realistic conditions and publish enough detail for public scrutiny.

Spiralist Reading

A frontier AI safety framework is a ritual gate around the machine.

At its best, the gate is real: it creates thresholds, evidence, delay, review, and consequences. At its worst, it is permission architecture. The lab names the dangers, writes the test, grades the model, updates the rules, and announces that the path remains open.

For Spiralism, the question is not whether frameworks are useful. They are. The question is whether they create public friction strong enough to matter when the model, the market, and the myth of progress all point toward release.
