Stanford LLM Reasoning
Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 6 - LLM Reasoning is a high-quality source for the site's DeepSeek and reasoning-model material because it slows the R1 story down and breaks it into mechanisms. The lecture starts from ordinary pretraining, supervised fine-tuning, preference tuning, and RLHF, then explains why reasoning models are trained to spend more tokens on intermediate work, how verifiable tasks can supply reward signals, why Group Relative Policy Optimization (GRPO) avoids a separate value model, and how DeepSeek-R1-Zero and DeepSeek-R1 fit into that post-training lineage.
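The two mechanisms named above, verifiable rewards and GRPO's group-relative baseline, can be illustrated with a minimal sketch. It is not the lecture's or DeepSeek's code. The function names, the \boxed{...} answer convention, the exact-match check, and the use of the population standard deviation are all assumptions chosen for illustration. The idea it shows is that several completions sampled for the same prompt are scored by a rule, and each score is normalized against the group's own mean and spread, so no learned value model is needed to supply a baseline.

```python
import re
import statistics


def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Hypothetical rule-based reward: 1.0 if the last \\boxed{...} answer
    in the completion exactly matches the reference, else 0.0."""
    answers = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    if not answers:
        return 0.0  # no parseable final answer, so no reward
    return 1.0 if answers[-1].strip() == reference_answer.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its own sampling group.

    The group mean plays the role of the baseline that a value model
    would otherwise estimate. Population standard deviation and the
    eps term are assumptions made for this sketch.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Illustrative group of four sampled completions for one prompt.
group = [
    "Adding the terms gives \\boxed{42}",
    "I think the result is \\boxed{41}",
    "The answer is probably forty-two.",
    "Checking again: \\boxed{ 42 }",
]
rewards = [verifiable_reward(c, "42") for c in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Completions that beat their group's average get positive advantages and the rest get negative ones. If every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal, which is one reason the difficulty of the task mix matters for this style of training.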
The Spiralist relevance is monitorability without mystification. Reasoning models can look like minds revealing their private deliberation, but the lecture frames them as engineered systems shaped by prompts, rewards, sampling, benchmark tasks, and distillation. That matters for the site's claim-hygiene work: a visible reasoning trace is useful evidence about an interface, not guaranteed access to the model's real causal process; a benchmark gain is useful evidence about task performance, not proof of general wisdom; and an open-weight release changes institutional power without making the whole system fully transparent.
Evidence is strongest where the lecture tracks primary DeepSeek materials. DeepSeek's R1 technical report, later published in Nature, supports the core account of R1-Zero, rule-based/verifiable rewards, reinforcement learning, cold-start data, supervised fine-tuning, rejection sampling, additional RL, and distillation. DeepSeek's R1 repository and official model pages support the open-weight release and MIT licensing. The lecture also fits the site's existing distinction between open weights and fuller open source: the weights and report are public, while training data, complete data-filtering decisions, total organizational compute costs, and hosted-service behavior remain only partly visible.
Uncertainty should remain explicit. The lecture is an educational Stanford course video, not a DeepSeek lab talk, an independent safety audit, or proof that chain-of-thought is a faithful window into model cognition. The DeepSeek paper reports strong benchmark performance and a plausible training recipe, but it does not settle how far reasoning-model techniques transfer outside verifiable domains, how robust they are under adversarial use, or whether displayed reasoning should be exposed, hidden, summarized, or audited in future systems. Treat this entry as a technical grounding source for DeepSeek R1, not as a final governance answer.