The Data Sheet Becomes the Supply Chain
Training data is not a vague background substance. It is a supply chain, and the data sheet is becoming the receipt institutions need before model-mediated knowledge can be trusted.
The Invisible Input
Most public AI arguments begin at the output. A chatbot invents a citation. A hiring tool ranks applicants. A tutor gives a wrong explanation. A search answer compresses a topic into a confident paragraph. A model generates a synthetic image, voice, or legal memo. The visible event is the interface.
But the deeper political object sits upstream. Before a model can answer, score, classify, imitate, translate, summarize, or refuse, it has passed through a data supply chain: source selection, scraping, licensing claims, filtering, deduplication, annotation, labeling, enrichment, synthetic generation, benchmark construction, evaluation splits, safety data, human feedback, and post-training examples. Each step decides what the model can treat as the world.
That is why dataset documentation matters. A model card can describe a finished system. A system card can describe a release process. A benchmark can describe a performance claim. But the data sheet asks a more basic question: what exactly entered the machine, under what assumptions, with what permissions, exclusions, gaps, transformations, and known limits?
Without that record, training data becomes institutional fog. Everyone knows it matters. No one can say enough about it to allocate responsibility.
Documentation as Governance
The modern documentation movement in machine learning did not begin as paperwork worship. It began as a response to systems that looked technical while hiding social decisions.
Datasheets for Datasets, by Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford, argued for structured documentation of a dataset's motivation, composition, collection, preprocessing, uses, distribution, and maintenance. The analogy was deliberate: if electronic components need datasheets because engineers must understand how they can safely be used, machine-learning datasets need records because they carry assumptions into downstream systems.
Google's Data Cards work took the same problem into an industry-facing documentation form. Its authors describe Data Cards as structured summaries about dataset origins, development, intent, ethical considerations, collection and annotation methods, intended uses, and decisions affecting model performance. The important move is not the brand name. It is the idea that dataset documentation is itself a product for multiple audiences: builders, auditors, deployers, researchers, risk managers, and affected institutions.
This changes the role of documentation. It is not merely retrospective explanation. It is a governance interface. A clear data sheet can reveal when a dataset is unsuitable for a deployment, when consent is weak, when a label schema encodes a political assumption, when a benchmark split is contaminated, when a language community is missing, when a license claim is uncertain, or when a synthetic augmentation has changed what the data can be treated as evidence of.
A poor data sheet does the opposite. It launders uncertainty into compliance.
What the Law Is Asking
The EU AI Act turns parts of this documentation problem into legal infrastructure.
Article 10 requires that the training, validation, and testing datasets behind high-risk AI systems be subject to data-governance and management practices appropriate to the system's intended purpose. The listed concerns include design choices, data collection and origin, original purpose for personal data, preparation operations such as annotation and cleaning, assumptions about what the data measures, data availability and suitability, bias examination, mitigation, and identification of gaps. It also says datasets should be relevant, sufficiently representative, and, to the best extent possible, free of errors and complete for the intended purpose.
Article 53 adds a different duty for general-purpose AI model providers. Providers must keep technical documentation, make certain information available to downstream providers, put in place a copyright policy, and publish a sufficiently detailed summary of the content used for training according to a template from the AI Office. The European Commission published the explanatory notice and template for that summary on July 24, 2025, describing it as a common minimal baseline for public training-content summaries.
These are not the same obligation. Article 10 is close to deployment-context data governance for high-risk systems. Article 53 is a public transparency layer for general-purpose models. Together, they show where AI governance is moving: from abstract arguments about "the data" toward records that can be inspected, compared, challenged, and updated.
NIST's AI Risk Management Framework points in the same direction from a voluntary U.S. standards perspective. It frames AI risk as something organizations must govern, map, measure, and manage across the lifecycle. The framework is not a dataset-documentation mandate, but its logic depends on knowing enough about inputs, assumptions, context, and limitations to identify and manage risks to people, organizations, and society.
The Provenance Crisis
The problem is that the actual data ecosystem is messier than the governance vocabulary.
The Data Provenance Initiative audited more than 1,800 text datasets and traced their sources, creators, license conditions, properties, and subsequent use. Its paper reports a landscape of inconsistent documentation, sharp divides between commercially open and closed datasets, and frequent license miscategorization on dataset hosting sites. The authors describe license omission above 70 percent and error rates above 50 percent in their audit of widely used dataset records.
Those findings matter because they move the issue from philosophy to operations. A developer cannot responsibly choose a dataset if the dataset's source, license, creator, scope, and restrictions are wrong or missing. A company cannot make a credible compliance claim if its fine-tuning data rests on copied metadata. A public agency cannot assess a vendor's AI system if upstream records dissolve into platform folklore.
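To make the operational point concrete, here is a minimal sketch of how omission and error rates for hosted license metadata could be computed once someone has traced each dataset back to its source. The records are toy values invented for illustration, not data from the audit, and the two rate definitions are assumptions rather than the Data Provenance Initiative's actual methodology.

```python
from typing import Optional

# Toy records invented for illustration; NOT data from the audit.
# Each pair: (license shown on the hosting page, license found by tracing the source).
records: list[tuple[Optional[str], str]] = [
    (None, "CC-BY-4.0"),
    ("MIT", "MIT"),
    ("Apache-2.0", "CC-BY-NC-4.0"),
    (None, "custom non-commercial"),
]


def license_audit(pairs):
    """Return (omission_rate, error_rate) for hosted license metadata.

    Assumed definitions:
      omission_rate = share of records with no hosted license at all
      error_rate    = share of stated hosted licenses that differ from the traced one
    """
    total = len(pairs)
    stated = [(hosted, traced) for hosted, traced in pairs if hosted is not None]
    omission_rate = (total - len(stated)) / total if total else 0.0
    wrong = sum(1 for hosted, traced in stated if hosted != traced)
    error_rate = wrong / len(stated) if stated else 0.0
    return omission_rate, error_rate


omission, error = license_audit(records)
print(f"omission rate: {omission:.0%}, error rate: {error:.0%}")
```

The arithmetic is the easy part. The expensive part is the second column: someone has to trace each dataset to its origin before the comparison is possible.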
Data provenance is often discussed as if it were only about copyright. Copyright is real, but the supply-chain problem is broader. Provenance also carries privacy, labor, representation, bias, safety, scientific validity, cultural context, and maintainability. Who collected the data? Who labeled it? Which communities are overrepresented or absent? Which content was removed by filters? Which languages were treated as normal? Which synthetic examples entered the mix? Which human judgments became reward data? Which errors are known but still present?
These are not decorative metadata fields. They shape model behavior.
Why Summary Is Not Enough
A public training-data summary is better than silence. It gives journalists, researchers, creators, regulators, and downstream developers a handle. It can reveal whether a provider used web crawls, books, code, scientific papers, social media, licensed archives, synthetic data, or user interactions. It can also create pressure for providers to maintain internal records because public disclosure eventually has to be grounded somewhere.
But summary is not supply-chain governance by itself.
The first limitation is granularity. A broad category like "web data" may be true and still useless for assessing risk. The difference between a consented archive, a filtered crawl, a scraped forum, a public-domain corpus, a pirated library, and a vendor dataset matters.
The second limitation is transformation. Data does not enter models raw. It is selected, cleaned, deduplicated, translated, labeled, scored, mixed, filtered, redacted, or synthesized. A summary of source categories may hide the transformations that changed the dataset's meaning.
The third limitation is purpose. A dataset can be acceptable for research and unsuitable for a high-stakes deployment. It can be adequate for general language modeling and inadequate for a medical triage tool, benefits system, migration-risk model, classroom tutor, or workplace dashboard. The same data can be harmless in one context and coercive in another.
The fourth limitation is accountability. A public summary does not automatically create a correction pathway. If a creator, community, worker, researcher, or regulator finds a problem, the supply chain needs a way to receive, investigate, remediate, and propagate that correction into future models, data cards, procurement reviews, and public records.
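The propagation step can be sketched as a walk over a lineage graph: given a corrected dataset, list every model, data card, and procurement record built on it. This is a minimal illustration under stated assumptions. The artifact names and the `LINEAGE` mapping are hypothetical, and a real system would also need the receiving, investigating, and remediating steps that this sketch leaves out.

```python
from collections import deque

# Hypothetical lineage: each artifact maps to the artifacts built directly from it.
LINEAGE = {
    "forum-corpus@2025-03": ["pretrain-mix@v4", "forum-corpus data card"],
    "pretrain-mix@v4": ["base-model@v4"],
    "base-model@v4": ["tutor-model@v1", "vendor procurement file"],
}


def affected_by(corrected: str, lineage: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning everything downstream of a corrected artifact."""
    seen = {corrected}
    queue = deque([corrected])
    downstream = []
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:  # guard against revisiting shared descendants
                seen.add(child)
                downstream.append(child)
                queue.append(child)
    return downstream


for artifact in affected_by("forum-corpus@2025-03", LINEAGE):
    print("needs review:", artifact)
```

The walk only works if the lineage was recorded when the artifacts were built. A correction with no lineage record has nowhere to go.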
The Supply-Chain Standard
A serious AI data-supply-chain regime should meet a higher standard than "we have a dataset name."
First, records should distinguish source, license, consent, and permitted use. These are different claims. A dataset can be technically accessible, poorly licensed, ethically compromised, and unsuitable for a deployment at the same time.
Second, documentation should track transformations. Filtering, labeling, deduplication, synthetic augmentation, safety post-training, and human feedback should not disappear behind the word "curation."
Third, dataset records should be versioned. A model trained on a March 2025 dataset snapshot should not be governed as if it used an undated abstraction. Corrections, takedowns, license changes, benchmark contamination, and discovered errors need lifecycle memory.
Fourth, data sheets should name intended and disallowed uses. A dataset built for research exploration should not quietly become evidence for decisions about employment, education, benefits, policing, credit, medicine, or immigration.
Fifth, affected communities should be able to contest the record. Documentation that cannot receive objections becomes a one-way narrative by the data holder.
Sixth, procurement should require data documentation proportional to risk. A school district, hospital, court, employer, or agency buying an AI system should not be forced to trust a vendor's model description without an inspectable account of relevant training, validation, testing, and fine-tuning data.
Seventh, public summaries should connect to private evidence. Trade secrets and security concerns may limit disclosure, but regulators and qualified auditors need access to more than marketing categories when social risk is high.
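Several of these requirements can be pictured as fields in a record. The sketch below is one hypothetical shape for such a record, not a schema taken from Datasheets for Datasets, Data Cards, or the AI Act. Every class, field name, and example value is an assumption made for illustration. It keeps source, license, and consent as separate claims, carries a dated snapshot and a transformation log, and checks a proposed use against the documented ones.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Consent(Enum):
    DOCUMENTED = "documented"
    UNKNOWN = "unknown"
    ABSENT = "absent"


@dataclass(frozen=True)
class Transformation:
    step: str         # e.g. "deduplication", "synthetic augmentation"
    description: str  # what changed and why


@dataclass
class DatasetRecord:
    name: str
    snapshot: str            # dated version, so records have lifecycle memory
    source: str              # where the data came from
    license: Optional[str]   # None means "not established", not "unrestricted"
    consent: Consent         # a separate claim from source and license
    intended_uses: set[str] = field(default_factory=set)
    disallowed_uses: set[str] = field(default_factory=set)
    transformations: list[Transformation] = field(default_factory=list)
    objections: list[str] = field(default_factory=list)  # contested claims


def review_for_use(record: DatasetRecord, use: str) -> list[str]:
    """Return the open issues a reviewer would need resolved for this use."""
    issues = []
    if use in record.disallowed_uses:
        issues.append(f"'{use}' is a disallowed use")
    elif use not in record.intended_uses:
        issues.append(f"'{use}' is not a documented intended use")
    if record.license is None:
        issues.append("license not established")
    if record.consent is not Consent.DOCUMENTED:
        issues.append(f"consent is {record.consent.value}")
    if not record.transformations:
        issues.append("no transformations recorded")
    if record.objections:
        issues.append(f"{len(record.objections)} unresolved objection(s)")
    return issues


# Hypothetical example record.
record = DatasetRecord(
    name="example-forum-corpus",
    snapshot="2025-03",
    source="scraped public forum",
    license=None,
    consent=Consent.UNKNOWN,
    intended_uses={"research"},
    disallowed_uses={"employment screening"},
    transformations=[Transformation("deduplication", "removed near-duplicate posts")],
)

for issue in review_for_use(record, "employment screening"):
    print(issue)
```

The design choice worth noticing is that a missing license is stored as an explicit unknown rather than a default. A record that cannot say "not established" will end up saying something false.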
The Site Reading
The data sheet is where model-mediated knowledge remembers its sources.
That memory is not perfect. A data sheet can be incomplete, misleading, stale, overconfident, or written to satisfy an audit rather than to inform use. It can become a ritual document, like a release card that reassures without constraining. But the absence of documentation is worse. It turns the training set into myth: a vast origin story that explains model behavior without allowing anyone to inspect the machinery.
This is a recursive reality problem. The world is converted into data. The data trains a model. The model produces outputs that enter documents, databases, classrooms, search systems, reports, codebases, and future training sets. If the first conversion has no durable record, the later institution inherits a manufactured past with no clean way to ask what was lost, distorted, copied, excluded, or invented.
The useful demand is modest and difficult: keep receipts for the reality being compressed. Not because documentation solves the politics of AI, but because without documentation, politics has no handle. The data sheet is not the supply chain itself. It is the beginning of an institutional memory that can make the supply chain answerable.
Sources
- European Commission, Explanatory Notice and Template for the Public Summary of Training Content for general-purpose AI models, July 24, 2025, last updated March 26, 2026.
- AI Act Service Desk, Article 10: Data and data governance, Regulation (EU) 2024/1689.
- AI Act Service Desk, Article 53: Obligations for providers of general-purpose AI models, Regulation (EU) 2024/1689.
- NIST, AI Risk Management Framework, reviewed May 2026.
- Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford, Datasheets for Datasets, Communications of the ACM, December 2021.
- Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson, Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI, ACM, 2022.
- Shayne Longpre et al., The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI, arXiv, October 2023, revised November 2023.
- Data Provenance Initiative, Data Provenance Collection, reviewed May 2026.
- Church of Spiralism Wiki, Training Data, Model Cards and System Cards, and Algorithmic Transparency.