Wiki · Concept · Last reviewed May 19, 2026

HumanEval

HumanEval is a code-generation benchmark introduced by OpenAI in 2021 with the Codex paper. It evaluates whether a model can synthesize short Python functions from natural-language docstrings and pass unit tests that are withheld from the prompt, which made executable correctness a standard public measure for language models trained on code.

Definition

HumanEval is a benchmark for evaluating functional correctness in code generation. Each problem consists of a Python function signature, a docstring, a reference solution, and unit tests. The model is shown the signature and docstring and must complete the function body so that it passes the tests. The original benchmark contains 164 hand-written programming problems.
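As a rough illustration of that layout, the sketch below builds one task record. The field names follow the published dataset (task_id, prompt, canonical_solution, test, entry_point), but the problem itself, clamp, is made up for this page and is not one of the 164 benchmark items.

```python
# Illustrative task record in the HumanEval layout.
# Assumption: the "clamp" problem is invented for this sketch.
TASK = {
    "task_id": "Example/0",
    "entry_point": "clamp",
    # The prompt is all the model sees: signature plus docstring.
    "prompt": '''def clamp(value, low, high):
    """Return value limited to the closed interval [low, high].
    >>> clamp(5, 0, 3)
    3
    >>> clamp(-2, 0, 3)
    0
    """
''',
    # The reference solution is the function body only.
    "canonical_solution": "    return max(low, min(value, high))\n",
    # The tests define check(candidate) and are not part of the prompt.
    "test": '''def check(candidate):
    assert candidate(5, 0, 3) == 3
    assert candidate(-2, 0, 3) == 0
    assert candidate(2, 0, 3) == 2
''',
}

# A complete program is the prompt followed by a completion (here, the reference body).
program = TASK["prompt"] + TASK["canonical_solution"]
print(program)
```

The model's completion takes the place of canonical_solution during evaluation; the reference body is used here only to show how the pieces join.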

The benchmark is narrower than software engineering. It does not ask a model to inspect a repository, update dependencies, review an architecture, or negotiate ambiguous requirements. Its importance is that it made code evaluation executable: a model's answer was not only read by a judge but also run against tests.

Origin

HumanEval was released with OpenAI's paper Evaluating Large Language Models Trained on Code, which introduced Codex, a GPT model fine-tuned on publicly available code from GitHub. The paper used HumanEval to test whether code-trained language models could solve Python programming tasks from docstrings.

The original Codex paper reported that the 12-billion-parameter Codex model solved 28.8 percent of HumanEval problems with a single sample per problem, while GPT-3 solved 0 percent and GPT-J solved 11.4 percent. With repeated sampling, Codex solved more problems: the paper's abstract reports 70.2 percent with 100 samples per problem. That result helped establish sampling strategy as part of code-model evaluation.

Task Design

A HumanEval task is intentionally compact. The prompt describes a desired function in natural language, often with examples, and the system must generate a completion. The evaluator runs the generated function against unit tests that check expected behavior.
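A minimal sketch of that evaluation step is shown here. It is not the official harness: the function name passes_tests, the temporary-file approach, and the example task are assumptions made for illustration. The sketch joins the prompt, the completion, and the tests into one script, runs it in a separate Python process with a timeout, and treats a zero exit code as a pass.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(prompt: str, completion: str, test: str, entry_point: str,
                 timeout: float = 5.0) -> bool:
    """Return True if prompt + completion passes the task's check() function."""
    program = f"{prompt}{completion}\n\n{test}\n\ncheck({entry_point})\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(program, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(path)],
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # infinite loops and very slow code count as failures
    # A failed assert or any other uncaught exception gives a nonzero exit code.
    return result.returncode == 0


# Hypothetical task written for this sketch (not a benchmark item).
PROMPT = 'def double(x):\n    """Return twice x."""\n'
TEST = "def check(candidate):\n    assert candidate(2) == 4\n    assert candidate(-1) == -2\n"

print(passes_tests(PROMPT, "    return x * 2\n", TEST, "double"))  # correct body
print(passes_tests(PROMPT, "    return x + 2\n", TEST, "double"))  # wrong body
```

A separate process only limits crashes and hangs. Generated code is untrusted, so a real evaluation should also run inside a sandbox with no network access and restricted file access.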

This design gave the field a clean signal: can a language model translate an English specification into runnable Python? It also helped separate code-generation evaluation from surface text metrics such as BLEU, which can underrate correct alternative implementations and overrate plausible but broken code.

Because HumanEval tasks are short, they are cheap to run and easy to compare across models. That made the benchmark attractive for research papers, model cards, open-source leaderboards, and release announcements.

Scoring

HumanEval is usually reported with pass@k. The metric estimates the probability that at least one of k generated samples for a problem passes the tests. Pass@1 measures a single attempt; pass@10 or pass@100 measures whether repeated sampling finds a working solution.
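The Codex paper describes an unbiased way to estimate pass@k: generate n samples per problem (with n at least k), count the c samples that pass, and compute 1 - C(n-c, k) / C(n, k), then average over problems. The sketch below implements that formula with exact integer binomials; the function names and the example counts are illustrative, not taken from any published run.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets that contain no correct sample;
    # it is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates; results holds (n, c) pairs."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)


# Made-up counts for three problems, 10 samples each.
example = [(10, 3), (10, 0), (10, 10)]
print(benchmark_pass_at_k(example, k=1))
print(benchmark_pass_at_k(example, k=5))
```

For k = 1 the estimate reduces to c / n, the fraction of samples that pass. Computing 1 - (1 - c/n)^k instead would be a biased estimate, which is why the combinatorial form is preferred.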

This matters because code models can generate many candidates. A model may be unreliable in one shot but useful when combined with sampling, execution, filtering, or repair loops. HumanEval therefore helped normalize the idea that coding capability is partly a model property and partly an inference-and-verification pipeline.
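The sample-then-filter idea can be sketched as follows. Everything here is hypothetical: the candidate strings stand in for model samples, and verify checks them against examples that would be visible at inference time (such as those in a docstring), not against the benchmark's own tests.

```python
from typing import Callable, Iterable, Optional


def first_passing(candidates: Iterable[str],
                  verify: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate accepted by verify, or None if none passes."""
    for source in candidates:
        if verify(source):
            return source
    return None


def verify(source: str) -> bool:
    """Check a candidate against examples known at inference time."""
    namespace: dict = {}
    try:
        # In-process exec is acceptable only for these fixed toy strings;
        # real model output should run in a sandboxed subprocess.
        exec(source, namespace)
        return namespace["add"](2, 3) == 5 and namespace["add"](-1, 1) == 0
    except Exception:
        return False  # syntax errors, missing names, and crashes all fail


# Stand-ins for three model samples: broken syntax, wrong logic, correct.
samples = [
    "def add(a, b) return a + b\n",
    "def add(a, b):\n    return a - b\n",
    "def add(a, b):\n    return a + b\n",
]

chosen = first_passing(samples, verify)
print(chosen)
```

Selecting with tests the user can see is a pipeline feature. Selecting with the benchmark's evaluation tests is what pass@k measures, and the two should be reported separately.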

Why It Matters

HumanEval became one of the first widely recognized benchmark names for AI coding ability. It linked language models to practical program synthesis and helped make code generation a visible frontier capability rather than a niche autocomplete feature.

The benchmark also changed how AI coding systems were marketed and compared. HumanEval scores appeared alongside MMLU, MBPP, SWE-bench, and other benchmark results as shorthand for whether a model could write working code. For early coding assistants, that was a major cultural shift: code was no longer merely text that looked like software; it was an artifact that could be executed and tested.

HumanEval also shaped later benchmarks. Its strengths and weaknesses made clear that AI code evaluation needed executable tests, better coverage, repository-level tasks, contamination controls, and realistic workflows.

Limits and Saturation

HumanEval is useful but limited. The original dataset is small, public, Python-only, and focused on short standalone functions. It does not measure debugging, code review, dependency management, UI work, security reasoning, performance tradeoffs, repository navigation, or long-horizon software maintenance.

The test suites are also thin. A solution can pass the provided tests while failing edge cases that a stronger test suite would catch. This means HumanEval can overestimate correctness when generated code is brittle or when the tests only partially specify the intended behavior.
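A small made-up example shows the effect. The function and both test suites below are invented for illustration and do not come from HumanEval or any extended suite: a plausible but wrong leap-year check passes a two-case suite and fails once century years are tested.

```python
from typing import Callable


def is_leap_year(year: int) -> bool:
    # Plausible but wrong: ignores the rule that century years
    # must be divisible by 400.
    return year % 4 == 0


def weak_check(candidate: Callable[[int], bool]) -> None:
    """A thin suite: one leap year and one common year."""
    assert candidate(2024) is True
    assert candidate(2023) is False


def strong_check(candidate: Callable[[int], bool]) -> None:
    """The same suite plus century edge cases."""
    weak_check(candidate)
    assert candidate(2000) is True
    assert candidate(1900) is False


def passes(check: Callable[[Callable[[int], bool]], None],
           candidate: Callable[[int], bool]) -> bool:
    """Run a suite and report pass or fail instead of raising."""
    try:
        check(candidate)
        return True
    except AssertionError:
        return False


print(passes(weak_check, is_leap_year))
print(passes(strong_check, is_leap_year))
```

Under the thin suite the wrong function counts as solved; only the added edge cases expose it.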

Public exposure is another problem. HumanEval has been widely copied into repositories, papers, tutorials, evaluation harnesses, and benchmark discussions. Once a benchmark becomes part of public model-training data, a high score may reflect memorization, indirect contamination, or benchmark-specific tuning rather than general coding ability.

By the mid-2020s, HumanEval was increasingly saturated for frontier systems: top scores clustered near the ceiling, so the benchmark separated leading models less clearly. That did not make it meaningless, but it changed its role. It became a basic regression and comparison test, not a strong standalone measure of advanced coding-agent capability.

Successors and Repairs

MBPP, introduced by Google Research, added a larger set of short, mostly basic Python programming problems. It became a common companion benchmark for HumanEval.

EvalPlus extended HumanEval into HumanEval+ by adding many more tests per problem. Its authors argued that original HumanEval and MBPP could overestimate correctness because weak tests allowed wrong solutions to pass.

SWE-bench moved from standalone functions to real GitHub issues and repository patches. This made it a stronger test of coding agents, though it introduced its own lifecycle problems around hidden tests, task quality, and contamination.

Other benchmark families, including multilingual HumanEval variants and live coding benchmarks, continue the same pattern: once a public benchmark becomes influential, the field needs harder, fresher, better-audited tasks.

Governance Role

HumanEval is a compact example of benchmark governance. A headline pass@1 number should not be treated as proof that a model is safe to use for production software. Responsible reporting should include model version, prompt format, sampling count, temperature, execution environment, filtering method, contamination analysis, and whether tests are original or expanded.
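One way to make that checklist concrete is a small record type that flags whatever context is missing. This is a sketch only: the class name, field names, and example values are hypothetical and do not correspond to any standard reporting schema.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class HumanEvalReport:
    """A pass@1 score plus the context needed to interpret it."""
    pass_at_1: float
    model_version: Optional[str] = None
    prompt_format: Optional[str] = None
    samples_per_problem: Optional[int] = None
    temperature: Optional[float] = None
    execution_environment: Optional[str] = None
    filtering_method: Optional[str] = None
    contamination_analysis: Optional[str] = None
    test_suite: Optional[str] = None  # for example "original" or "expanded"

    def missing_context(self) -> List[str]:
        """Names of the fields that were left unreported."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Placeholder values, not a real result.
report = HumanEvalReport(
    pass_at_1=0.5,
    model_version="example-model-v1",
    temperature=0.2,
)
print(report.missing_context())
```

A score whose missing_context list is non-empty can still be published, but readers then know which comparisons it does not support.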

For organizations adopting coding assistants, HumanEval-style scores should be paired with internal evaluations on real codebases, security review, test quality analysis, human review gates, incident tracking, and rollback procedures. Passing small unit-test tasks is evidence of capability, not evidence of deployment readiness.

Spiralist Reading

HumanEval is the small altar of executable proof.

It matters because it moved AI coding claims away from vibes and toward tests. The answer either runs or it does not. That is a better discipline than judging generated code by style, confidence, or syntactic resemblance.

Its warning is equally clear. A test can become a ritual object. Once the field worships the pass rate, systems learn to optimize for the benchmark rather than the world. The Spiralist reading is to keep the executable test, but refuse to mistake the test for the work.
