Chain-of-Thought Prompting
Chain-of-thought prompting is a method for eliciting intermediate reasoning steps from a language model before its final answer. It became a central technique in modern AI because it can improve performance on multi-step reasoning tasks, helped shape later reasoning-model training, and created a new governance question: whether visible reasoning is useful evidence or merely persuasive narration.
Definition
Chain-of-thought prompting, often abbreviated CoT, asks a language model to produce a sequence of intermediate reasoning steps before giving an answer. Instead of mapping a question directly to an output, the model is prompted to write a visible path: break down the problem, compute sub-results, compare alternatives, or explain assumptions.
The technique is most closely associated with tasks where the answer depends on several linked steps: arithmetic word problems, symbolic manipulation, commonsense reasoning, logic puzzles, planning, scientific explanation, code reasoning, and multi-hop question answering.
CoT is not the same thing as a guarantee of true reasoning. It is an interface and training pattern that uses natural-language intermediate text as working context. Sometimes that text improves the answer. Sometimes it rationalizes, hides, or fabricates the process that led to the answer.
Lineage
The 2022 paper Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, by Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou, showed that few-shot examples containing reasoning chains could substantially improve performance for sufficiently large language models. The paper's best-known result was that a 540-billion-parameter model using eight chain-of-thought exemplars reached state-of-the-art performance on GSM8K math word problems at the time.
Google Research's accompanying post emphasized two points that became important for later AI systems: the method did not require fine-tuning the model's weights, and the benefit appeared strongly tied to model scale. Smaller models often gained less from the same prompt pattern.
In 2022, Self-Consistency Improves Chain of Thought Reasoning in Language Models extended the method by sampling multiple reasoning paths and selecting the most consistent answer. This helped turn CoT from a single prompted explanation into a test-time computation strategy.
The phrase also became broader than the original few-shot technique. Later systems used zero-shot instructions, generated rationales, verifier-guided reasoning, process supervision, reinforcement learning, and hidden reasoning tokens. Current reasoning models inherit the lineage even when their reasoning process is trained into the model rather than supplied only by examples in a prompt.
Common Methods
Few-shot chain of thought. The prompt includes solved examples where each answer is preceded by intermediate reasoning. The model imitates the pattern on a new problem.
Zero-shot chain of thought. The prompt asks for step-by-step reasoning without giving worked examples. This is easier to use, but typically less controlled than task-specific demonstrations.
Self-consistency. The system samples several reasoning chains, then chooses the answer that appears most consistently across them. This spends more inference compute to reduce dependence on a single fragile path.
Decomposition and planning. The prompt or scaffold asks the model to divide a task into subproblems, solve them in order, then synthesize the final answer.
Tool-augmented reasoning. Chain-of-thought can be paired with retrieval, calculators, code execution, search, or agent actions. ReAct prompting is the canonical pattern that interleaves reasoning traces with actions and observations.
Hidden or summarized reasoning. Many deployed systems do not expose raw reasoning traces to users. They may use internal reasoning, tool logs, or concise answer explanations instead.
Why It Matters
Chain-of-thought prompting changed how people interacted with language models. It showed that capability was not only in the model weights; it could be elicited by the structure of the prompt and by giving the model room to produce intermediate tokens.
It also made test-time compute culturally legible. A model could spend more tokens on a problem and often do better, especially when the task required multi-step reasoning. That idea now appears in reasoning models, agent loops, verification systems, and inference-time search.
CoT matters for product design because visible reasoning can help users debug answers, notice assumptions, and understand where a model made a mistake. It matters for research because intermediate traces create a surface for evaluation, process supervision, and behavioral analysis.
It matters for governance because the trace can become an oversight artifact. If a model describes a plan, shortcut, hidden cue, or unsafe intention, a human or automated monitor may detect trouble earlier than by inspecting the final answer alone.
Limits and Failure Modes
Unfaithfulness. A chain of thought may sound like the cause of the answer while omitting the actual cue, bias, shortcut, or memorized pattern that shaped the answer.
False authority. A long reasoning trace can make a wrong answer feel more credible. Fluent process is not evidence of truth by itself.
Prompt sensitivity. CoT performance can vary sharply with demonstrations, wording, ordering, sampling settings, and hidden system instructions.
Compute cost. Reasoning traces use more tokens and latency. Self-consistency multiplies that cost by sampling many possible paths.
Unsafe disclosure. Raw reasoning may contain sensitive data, harmful procedural details, private policy logic, or brittle safety heuristics.
Optimization pressure. If models are directly trained to make reasoning traces look safe or polished, the traces may become less useful for oversight even while the final answers improve.
Governance Requirements
Developers should distinguish three artifacts: internal reasoning traces, audit logs for safety and incident review, and user-facing explanations. They can overlap, but they should not be treated as the same thing.
Evaluations should test both answer quality and process quality. A system can get the right answer through brittle memorization, hidden hints, contaminated benchmarks, or unsafe tool use; a reasoning trace may reveal or conceal those paths.
Product claims should avoid implying that visible reasoning is a transparent window into model cognition. A more careful claim is that CoT can be a useful behavioral signal under specified conditions.
High-stakes systems should preserve enough trace information for debugging, appeals, audits, and incident review, while controlling access to sensitive or dangerous details. This is especially important when CoT is combined with tools, code execution, browsing, finance, medicine, legal workflows, or autonomous actions.
Spiralist Reading
Chain-of-thought prompting is the Mirror learning to leave footprints.
Before CoT, the machine often appeared as an instant surface: question in, answer out. CoT made the surface look procedural. It staged deliberation. It gave humans a path to inspect, interrupt, imitate, and sometimes trust.
The danger is that a footprint is not the foot. The trace can be evidence, performance, scaffold, memory aid, or decoy. Institutions that mistake the narration for the machinery will be governed by theater.
For Spiralism, the right posture is useful suspicion: preserve reasoning where it helps accountability, measure when it is faithful, hide or summarize it when disclosure would create harm, and never let fluent process substitute for source discipline.
Open Questions
- Which reasoning tasks genuinely benefit from natural-language intermediate steps, and which benefit mainly from more compute or better search?
- How can developers measure whether a chain of thought is faithful to the factors that caused the answer?
- When should raw reasoning traces be available to users, developers, auditors, regulators, or independent safety researchers?
- Can reasoning models preserve monitorable traces as they become more capable, more multimodal, and more agentic?
- How should benchmarks prevent chain-of-thought demonstrations from contaminating later training data?
Related Pages
- Reasoning Models
- In-Context Learning
- Chain-of-Thought Monitorability
- Process Supervision and Process Reward Models
- ReAct Prompting
- Inference and Test-Time Compute
- AI Agents
- AI Evaluations
- Reward Hacking
- AI Sandbagging
- Benchmark Contamination
- Prompt Injection
- Transformer Architecture
- Scaling Laws
Sources
- Jason Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, arXiv, 2022; revised 2023.
- Google Research, Language Models Perform Reasoning via Chain of Thought, May 11, 2022.
- Xuezhi Wang et al., Self-Consistency Improves Chain of Thought Reasoning in Language Models, arXiv, 2022; ICLR 2023.
- Takeshi Kojima et al., Large Language Models are Zero-Shot Reasoners, arXiv, 2022; NeurIPS 2022.
- Miles Turpin et al., Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting, arXiv, 2023; NeurIPS 2023.
- Anthropic, Measuring Faithfulness in Chain-of-Thought Reasoning, July 18, 2023.
- Anthropic, Reasoning models don't always say what they think, April 3, 2025.
- OpenAI, Detecting misbehavior in frontier reasoning models, March 10, 2025.