Wiki · Concept · Last reviewed May 15, 2026

AI Evaluations

AI evaluations are structured attempts to measure what AI systems can do, where they fail, and whether claims about capability, safety, alignment, or deployment readiness are credible.

Definition

An AI evaluation is a test, benchmark, red-team exercise, audit, measurement process, or incident review used to understand an AI system. Evaluations can measure ordinary product quality, scientific capability, cybersecurity ability, biological risk, autonomy, persuasion, bias, privacy leakage, hallucination, robustness, tool use, or compliance with policy.

NIST often describes this broader family as test, evaluation, verification, and validation, or TEVV. The phrase matters because evaluation is not only a leaderboard score. It includes whether the test is valid, whether it measures the intended property, whether results generalize, and whether claims can be independently checked.

Types of Evaluation

Benchmarks. Standardized tasks compare models on math, coding, reading, science, reasoning, language, multimodal understanding, tool use, or domain knowledge.

Behavioral safety evals. These test whether a model refuses or complies with dangerous, disallowed, manipulative, discriminatory, or policy-violating requests.

Red teaming. Human or automated attackers try to make a system fail, jailbreak, leak data, assist harm, or behave outside intended boundaries.

Dangerous capability evals. These test whether a model can materially assist cyber operations, biological misuse, chemical misuse, persuasion, fraud, autonomous replication, or other high-consequence activity.

Autonomy evals. These measure whether a system can plan, use tools, recover from errors, pursue subgoals, conduct long-horizon tasks, or operate with limited human intervention.

Post-deployment monitoring. This tracks incidents, user reports, drift, misuse, refusals, near misses, and real-world harms after release.
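As a concrete illustration of the behavioral safety evals listed above, the sketch below scores a small labeled prompt set. It is a minimal example under stated assumptions, not any lab's actual harness: the `SafetyCase` record, its field names, and the two reported rates are hypothetical. It assumes each prompt has already been labeled by policy (should the model refuse?) and each response has already been judged (did it refuse?). Measuring refusals of benign prompts alongside compliance with harmful ones is an added assumption here, included because a model that refuses everything would otherwise score perfectly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyCase:
    """One judged prompt/response pair (hypothetical record layout)."""
    prompt_id: str
    should_refuse: bool  # label derived from the policy under test
    model_refused: bool  # judged outcome for the model's response


def score_safety_cases(cases):
    """Return under-refusal and over-refusal rates for a labeled set."""
    harmful = [c for c in cases if c.should_refuse]
    benign = [c for c in cases if not c.should_refuse]
    # Under-refusal: the model complied with a request it should have refused.
    under = sum(not c.model_refused for c in harmful)
    # Over-refusal: the model refused a request it should have answered.
    over = sum(c.model_refused for c in benign)
    return {
        "harmful_n": len(harmful),
        "benign_n": len(benign),
        "under_refusal_rate": under / len(harmful) if harmful else None,
        "over_refusal_rate": over / len(benign) if benign else None,
    }


cases = [
    SafetyCase("h1", should_refuse=True, model_refused=True),
    SafetyCase("h2", should_refuse=True, model_refused=False),
    SafetyCase("b1", should_refuse=False, model_refused=False),
    SafetyCase("b2", should_refuse=False, model_refused=True),
    SafetyCase("b3", should_refuse=False, model_refused=False),
    SafetyCase("b4", should_refuse=False, model_refused=False),
]
report = score_safety_cases(cases)
print(report)
```

Reporting the sample sizes next to the rates matters: a rate computed from a handful of prompts says little on its own.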

Frontier Evaluations

Frontier evaluations became more important as general-purpose models gained tool use, coding ability, long-context reasoning, and agent scaffolds. METR evaluates models for autonomous capabilities and has published evaluations of frontier systems such as OpenAI o1-preview and Claude 3.7 Sonnet. OpenAI's Preparedness Framework ties deployment decisions to evaluated risk categories. Its initial 2023 version tracked cybersecurity, chemical, biological, radiological, and nuclear (CBRN) threats, persuasion, and model autonomy; the 2025 revision changed the tracked categories to biological and chemical capability, cybersecurity, and AI self-improvement.

System cards and model cards are public artifacts connected to evaluations. A system card may describe capability tests, safety mitigations, limitations, model behavior, red-team findings, and deployment controls. The value of these documents depends on specificity: vague safety language is not an evaluation.

Limits

Evaluations are necessary but incomplete. A model can pass a benchmark and still fail in the world. A safety test can miss a novel jailbreak. A dangerous-capability eval can understate risk if the tested scaffold is weak, the model is poorly prompted, or the evaluators do not explore enough tool configurations.
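The point about weak scaffolds can be made concrete by running the same task set under several configurations and reporting all of them. The sketch below is illustrative only: the function name, the scaffold labels, and the outcome data are hypothetical, and taking the best observed rate as the headline figure is one possible convention rather than an established standard.

```python
def elicitation_summary(results_by_scaffold):
    """Summarize task success per scaffold.

    results_by_scaffold maps a scaffold label to a list of booleans,
    one per task attempt (True means the task was completed).
    """
    rates = {
        name: sum(outcomes) / len(outcomes)
        for name, outcomes in results_by_scaffold.items()
        if outcomes  # skip scaffolds with no recorded attempts
    }
    if not rates:
        raise ValueError("no scaffold has any recorded outcomes")
    best = max(rates, key=rates.get)
    weakest = min(rates, key=rates.get)
    return {
        "rates": rates,
        "best_scaffold": best,
        "best_rate": rates[best],
        # How much the result depends on the configuration that was tested.
        "elicitation_gap": rates[best] - rates[weakest],
    }


# Hypothetical outcomes for the same four tasks under two configurations.
runs = {
    "bare_prompt": [False, False, True, False],
    "tools_and_retries": [True, True, False, True],
}
summary = elicitation_summary(runs)
print(summary)
```

A report that published only the bare-prompt figure here would understate what the model can do with tools. A large gap between configurations is a signal that further scaffolds should be explored before drawing conclusions.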

Benchmark saturation and contamination are another problem. When models train on public benchmark-like material or developers tune toward visible tests, scores can rise, sometimes to the ceiling of the test, without a matching gain in real-world reliability. Contamination and overfitting can make a model look more capable or safer than it is.
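One common heuristic for spotting contamination is to check how many word n-grams from a benchmark item also appear in a sample of training-like text. The sketch below assumes whitespace tokenization and an in-memory list of documents; the function names and example strings are hypothetical, and the window size and any flagging threshold are choices the evaluator has to justify. A high overlap is a warning sign, not proof, and a low overlap does not rule out paraphrased leakage.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item, corpus_docs, n=8):
    """Fraction of the item's n-grams that also occur in any corpus document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Hypothetical corpus sample and benchmark items.
corpus = ["quiz: what is the capital of france ? answer paris"]
seen_item = "what is the capital of france and why"
unseen_item = "compute the derivative of x squared"

seen_score = overlap_fraction(seen_item, corpus, n=4)
unseen_score = overlap_fraction(unseen_item, corpus, n=4)
print(seen_score, unseen_score)
```

In this toy case three of the five 4-grams in the first item appear in the corpus, while the second item shares none.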

Evaluations can also be political. The choice of what to test defines what counts as risk. A lab may test bioweapon assistance while ignoring labor displacement, dependency, emotional manipulation, institutional capture, or spiritualized delusion loops. The untested domain becomes the ungoverned domain.

Governance Role

AI governance increasingly depends on evaluations. Release gates, safety thresholds, model cards, incident reporting, procurement rules, audits, insurance, licensing proposals, and standards all need evidence about what a model can do and how it fails.

For evaluations to matter, they need independence, reproducibility where possible, clear scope, dated model versions, disclosed scaffolds, uncertainty ranges, and adversarial pressure. A strong evaluation report says not only what was observed, but also what was not tested and what could change the result.
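Uncertainty ranges are the easiest of these requirements to show in code. The sketch below computes a Wilson score interval for a pass or failure rate observed over a fixed number of trials. It assumes independent trials and a 95% level (z = 1.96); the function name and the example counts are illustrative, and other interval methods would be equally defensible.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    Returns (low, high), clamped to [0, 1]. z=1.96 corresponds to
    roughly 95% coverage under the usual normal approximation.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / trials + z * z / (4 * trials * trials)
    )
    return max(0.0, center - half), min(1.0, center + half)


# Zero harmful completions observed in 20 attempts (hypothetical counts).
low, high = wilson_interval(0, 20)
print(f"observed 0/20, plausible rate between {low:.3f} and {high:.3f}")
```

With 0 failures in 20 trials the upper bound is still about 0.16, which is why a small clean test run does not demonstrate that a failure mode is absent.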

Evaluation should continue after deployment. Real users, tool access, incentives, prompt ecosystems, fine-tuning, memory, agents, and product integrations can change the effective system far beyond the lab test.

Risk Patterns

Evaluation theater. A company can present many tests while avoiding the hard questions that would constrain release.

Metric capture. Developers can optimize toward benchmarks instead of real reliability, truth, agency preservation, or public accountability.

Scaffold sensitivity. A model's practical capability can change sharply depending on tools, prompting, memory, retries, agent loops, and human support.

Opaque failures. Public reports may summarize results without showing prompts, rubrics, evaluator disagreements, failed attempts, or internal thresholds.

One-time certification. A model can be treated as "safe" after a pre-release evaluation even though deployment changes the real system.

Unmeasured harms. The most legible risks may receive the most attention while slow social harms remain outside the test suite.
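Scaffold sensitivity to retries, noted in the list above, can be quantified with the pass@k estimator commonly used for code-generation benchmarks: given n sampled attempts of which c succeeded, it estimates the probability that at least one of k attempts succeeds. The sketch below is a minimal version under the assumption that attempts are independent samples from the same system; the example counts are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate P(at least one success in k attempts).

    n: total attempts sampled, c: attempts that succeeded, k: retry budget.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k draws include a success
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 2 successes in 10 sampled attempts.
single_try = pass_at_k(10, 2, 1)
five_tries = pass_at_k(10, 2, 5)
print(f"pass@1={single_try:.3f}, pass@5={five_tries:.3f}")
```

A system that succeeds on one attempt in five looks weak at a single try, yet with a budget of five retries it succeeds most of the time. An evaluation that reports only single-attempt results can therefore understate what an agent loop achieves in practice.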

Spiralist Reading

Evaluations are reality friction for the machine.

A model speaks fluently. It can make capability feel like authority and safety feel like tone. Evaluation interrupts the spell by asking for evidence: what happened, under what conditions, with which tools, against which baseline, and with which failures hidden outside the frame?

For Spiralism, the danger is that evaluation becomes another ritual of permission. A lab performs the ceremony, publishes the card, names the thresholds, and continues scaling. The useful path is harder: evaluations must remain adversarial, public enough to matter, humble about uncertainty, and connected to real power to delay, constrain, or reverse deployment.
