Percy Liang
Percy Liang is a Stanford computer scientist and director of the Center for Research on Foundation Models. He is known for machine learning, natural language processing, the foundation-model research agenda, holistic model evaluation, and transparency tools for large AI systems.
Snapshot
- Known for: foundation-model research, directing CRFM, coauthoring HELM, and advocating reproducible and transparent AI evaluation as a Stanford computer science professor.
- Institutional position: Stanford lists Liang as Professor of Computer Science and Senior Fellow at the Stanford Institute for Human-Centered AI. His profile says he directs the Center for Research on Foundation Models.
- Core themes: foundation models, language models, model access, benchmarking, transparency, reproducibility, robustness, interpretability, semantics, reasoning, and human interaction.
- Why he matters: Liang helped turn "foundation models" into a central organizing term for modern AI, then worked on evaluation infrastructure for measuring their capabilities, risks, and transparency.
Foundation Models
Liang's public prominence stems largely from Stanford's foundation-model agenda. In 2021, Stanford HAI announced the Center for Research on Foundation Models as an interdisciplinary initiative for studying the technical, social, legal, economic, and governance implications of models trained broadly at scale and adapted across many downstream tasks.
The CRFM announcement named Liang as director and described foundation models such as BERT, GPT-3, CLIP, and Codex as a new way AI systems would be built. The accompanying report, On the Opportunities and Risks of Foundation Models, argued that these systems create leverage because one base model can support many applications, but also create inherited failure: downstream systems can inherit the same defects, biases, security weaknesses, and opacity.
That framing became durable because it did not treat large models as only a technical improvement. It described a sociotechnical platform shift: model providers, data sources, compute, benchmarks, downstream developers, affected communities, and regulators all become part of one ecosystem.
Evaluation and HELM
Liang is also central to the evaluation turn in AI governance. Stanford CRFM's Holistic Evaluation of Language Models, or HELM, was built to evaluate language models across many scenarios and metrics rather than compressing performance into a single leaderboard score.
The HELM paper and project emphasized transparency, standardization, broad scenario coverage, and multiple dimensions of performance. Accuracy matters, but so do calibration, robustness, fairness, bias, toxicity, efficiency, uncertainty, and the limits of the benchmark itself.
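The contrast between a single leaderboard score and a scenario-by-metric view can be illustrated with a minimal sketch. It is not HELM's code or API: the scenario names, metric names, values, and helper functions below are all hypothetical, chosen only to show why a grid of results exposes gaps and trade-offs that one averaged number would hide.

```python
from collections import defaultdict

# Hypothetical results for one model. Every number here is invented for
# illustration and is not an actual HELM output.
results = [
    {"scenario": "question_answering", "metric": "accuracy", "value": 0.8},
    {"scenario": "question_answering", "metric": "robustness", "value": 0.6},
    {"scenario": "question_answering", "metric": "calibration_error", "value": 0.1},
    {"scenario": "summarization", "metric": "accuracy", "value": 0.4},
    {"scenario": "summarization", "metric": "robustness", "value": 0.2},
    # Note: no calibration_error was measured for summarization.
]


def build_matrix(rows):
    """Arrange flat result rows into a scenario -> metric -> value grid."""
    matrix = defaultdict(dict)
    for row in rows:
        matrix[row["scenario"]][row["metric"]] = row["value"]
    return dict(matrix)


def mean_by_metric(matrix):
    """Average each metric across the scenarios where it was measured."""
    values = defaultdict(list)
    for metrics in matrix.values():
        for name, value in metrics.items():
            values[name].append(value)
    return {name: sum(v) / len(v) for name, v in values.items()}


def missing_cells(matrix, expected_metrics):
    """List (scenario, metric) pairs that were never measured."""
    return sorted(
        (scenario, metric)
        for scenario, row in matrix.items()
        for metric in expected_metrics
        if metric not in row
    )


matrix = build_matrix(results)
per_metric = mean_by_metric(matrix)
gaps = missing_cells(matrix, ["accuracy", "robustness", "calibration_error"])
```

Reporting `per_metric` and `gaps` side by side keeps the unmeasured cells visible, which is the kind of limit a single aggregate score tends to conceal.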
This matters because model evaluation has become a governance primitive. Governments, labs, companies, journalists, users, and auditors all ask similar questions: what can this model do, where does it fail, what risks does it create, and what evidence supports the provider's claims?
Transparency Work
CRFM's later work on the Foundation Model Transparency Index extended the same logic from benchmark performance to public disclosure. The index scores major foundation-model developers on information they disclose about upstream resources, model properties, and downstream use.
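The idea of scoring disclosure across upstream resources, model properties, and downstream use can be sketched roughly as follows. This is an illustration under stated assumptions, not the index's actual methodology: the indicator names are hypothetical, and the sketch assumes each indicator is simply marked as disclosed or not disclosed.

```python
# Hypothetical disclosure record for one developer. Indicator names and
# values are invented for illustration, not taken from the index.
disclosures = {
    "upstream": {
        "data_sources_described": True,
        "compute_reported": False,
    },
    "model": {
        "capabilities_documented": True,
        "evaluations_reported": True,
    },
    "downstream": {
        "usage_policy_published": True,
        "affected_users_reported": False,
    },
}


def domain_scores(record):
    """Return (indicators satisfied, indicators total) for each domain."""
    return {
        domain: (sum(indicators.values()), len(indicators))
        for domain, indicators in record.items()
    }


def overall_score(record):
    """Sum satisfied and total indicators across all domains."""
    scores = domain_scores(record)
    satisfied = sum(met for met, _ in scores.values())
    total = sum(count for _, count in scores.values())
    return satisfied, total


by_domain = domain_scores(disclosures)
satisfied, total = overall_score(disclosures)
```

Keeping the per-domain breakdown alongside the overall count shows where a developer is opaque, for example strong on model documentation but weak on upstream resources.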
The 2025 Foundation Model Transparency Index framed transparency as a public-accountability problem: the most influential model developers shape products, research, labor, information systems, and public institutions, but outside actors often lack basic information about data, labor, compute, evaluation, safety, distribution, and use.
Liang's significance is therefore not only technical. He represents an academic attempt to create shared measurement infrastructure around systems that private companies otherwise describe through marketing, selective benchmark releases, and limited safety reports.
Earlier Research
Before the foundation-model wave, Liang worked across machine learning and natural language processing. Stanford's profile lists research areas including robustness, interpretability, human interaction, learning theory, grounding, semantics, and reasoning. It also describes him as a proponent of reproducibility through CodaLab Worksheets.
His publication record includes work on data poisoning, distribution shift, prefix-tuning, concept bottleneck models, uncertainty calibration, semantic parsing, weak supervision with natural-language explanations, and many other areas that later became relevant to large-model evaluation and deployment.
This breadth explains his role in the foundation-model conversation. The problem is not just whether a model can answer a prompt. It is whether a broad adaptive system can be understood, compared, reproduced, governed, and trusted across changing contexts.
Spiralist Reading
Percy Liang is a cartographer of the model layer.
In the Spiralist frame, foundation models are not only artifacts. They are hidden infrastructure for future speech, work, law, education, medicine, search, coding, and memory. They sit beneath many applications while remaining difficult for ordinary institutions to inspect.
Liang's work matters because it names the layer and demands instruments for it. The foundation-model frame gives society a shared object of analysis. HELM and transparency indexes ask whether that object can be measured in public rather than trusted in private.
The warning is that measurement can also become theater. A benchmark, index, or disclosure template can discipline the field only when it remains open to revision and adversarial scrutiny, and attentive to missing harms and the lived reality of people downstream.
Open Questions
- Can public evaluation infrastructure keep pace with closed frontier models, private deployment data, and rapidly changing agent systems?
- Which transparency requirements are useful for accountability without exposing security-sensitive details or private personal data?
- How should foundation-model evaluations account for downstream applications, human dependence, institutional incentives, and long-term effects?
- Can academic centers remain independent when frontier AI research depends on access to models, compute, funding, and company cooperation?
- What forms of benchmark design prevent evaluation from becoming a curriculum that models train around?
Related Pages
- Foundation Models
- AI Evaluations
- Benchmark Contamination
- Model Cards and System Cards
- Training Data
- AI Audits and Third-Party Assurance
- Algorithmic Transparency
- Open-Weight AI Models
- Fei-Fei Li
- Yoshua Bengio
- Timnit Gebru
- Margaret Mitchell
- Individual Players
Sources
- Stanford Profiles, Percy Liang profile, reviewed May 19, 2026.
- Stanford HAI, Introducing the Center for Research on Foundation Models, August 18, 2021.
- Bommasani et al., On the Opportunities and Risks of Foundation Models, arXiv, 2021.
- Stanford CRFM, Language Models are Changing AI: The Need for Holistic Evaluation, November 17, 2022.
- Liang et al., Holistic Evaluation of Language Models, arXiv, 2022.
- Stanford CRFM, Foundation Model Transparency Index, December 2025.
- Bommasani et al., The Foundation Model Transparency Index, arXiv, 2023.
- Stanford Engineering, The future of AI Chat: Foundation models and responsible innovation, reviewed May 19, 2026.