Blog · Analysis · Last reviewed June 25, 2026

The Data Sheet Becomes the Supply Chain

Training data is not a vague background substance. It is a supply chain, and the data sheet is becoming the receipt institutions need before model-mediated knowledge can be trusted.

A good data sheet is not a badge of cleanliness. It is a dated, contestable claim about source, rights, transformation, limits, evidence status, downstream use, and repair.

The Invisible Input

Most public AI arguments begin at the output. A chatbot invents a citation. A hiring tool ranks applicants. A tutor gives a wrong explanation. A search answer compresses a topic into a confident paragraph. A model generates a synthetic image, voice, or legal memo. The visible event is the interface.

But the deeper political object sits upstream. Before a model can answer, score, classify, imitate, translate, summarize, or refuse, it has passed through a data supply chain: source selection, scraping, licensing claims, filtering, deduplication, annotation, labeling, enrichment, synthetic generation, benchmark construction, evaluation splits, safety data, human feedback, and post-training examples. Each step decides what the model can treat as the world.

That is why dataset documentation matters. A model card can describe a finished system. A system card can describe a release process. A benchmark can describe a performance claim. But the data sheet asks a more basic question: what exactly entered the machine, under what assumptions, with what permissions, exclusions, gaps, transformations, and known limits?

For this essay, a data sheet is any structured, scoped, versioned record of a dataset's origin, purpose, composition, collection process, preparation, rights status, intended use, disallowed use, known limitations, maintenance, evidence status, and downstream dependencies. It may take the form of a datasheet, data card, dataset card, provenance card, internal register, or regulator-facing documentation. The name matters less than the institutional function: making the data supply chain inspectable before its assumptions become model behavior.

The unit of documentation should be precise. A data sheet can describe a raw source collection, a filtered training snapshot, an annotation set, a benchmark split, a retrieval index, a safety dataset, or a synthetic augmentation pipeline. Treating all of those as one vague corpus hides the transformations where most governance decisions happen.

The data sheet should also be read at the right level. It documents a data asset or data pipeline. It is not a model card, not a system card, not an audit report, not an AI bill of materials, and not proof that a deployment is lawful or safe. It is the record that lets those later artifacts avoid floating above the sources they depend on.

Without that record, training data becomes institutional fog. Everyone knows it matters. No one can say enough about it to allocate responsibility.

Current Context

As of June 25, 2026, dataset documentation is no longer only a research norm. The EU AI Act's general-purpose AI obligations entered into application on August 2, 2025 for providers of general-purpose AI models, with special transition rules for models already on the market before that date. Article 53 requires general-purpose AI model providers to keep technical documentation, provide information to downstream providers, put in place a copyright policy, and publish a sufficiently detailed public summary of training content using an AI Office template, subject to the Act's open-source and systemic-risk distinctions. The European Commission published that template on July 24, 2025, last updated the page on March 26, 2026, and describes the template as a common minimal baseline for information made public about training content.

The Commission's General-Purpose AI Code of Practice, published July 10, 2025, also matters because its transparency and copyright chapters offer a voluntary route for providers to demonstrate compliance with Article 53 obligations. The Commission's guidelines on the scope of GPAI obligations say that those duties entered into application on August 2, 2025, that the Commission's enforcement powers apply from August 2, 2026, and that providers of GPAI models placed on the market before August 2, 2025 must comply by August 2, 2027. None of this makes the code, the guidelines, or the public training-content summary a dataset audit. It shows the direction of travel: training data is being pulled out of background engineering and into structured records, copyright policy, downstream-provider information, and public summaries.

A public training-content summary is therefore not a full data sheet. It is a public-facing layer that should rest on deeper internal evidence: source inventories, licensing records, transformation logs, retention rules, redactions, dispute intake, and downstream dependency maps. Without that evidence layer, the summary becomes another trust claim.

AI supply-chain guidance is moving in the same direction from the component side. CISA's May 2026 publication of the G7 Cybersecurity Working Group's Software Bill of Materials for AI - Minimum Elements treats dataset properties as one of the element clusters for AI supply-chain mapping. That does not turn an AI bill of materials into a data sheet. It means the bill of materials needs to point to dataset-level records rather than treating training, validation, test, retrieval, safety, and fine-tuning data as invisible dependencies.

Security agencies are moving in parallel. The May 2025 joint AI Data Security guidance from NSA, CISA, FBI, and international partners treats data supply chains, poisoned data, and data drift as core AI security risks, and lists data provenance tracking alongside encryption, digital signatures, secure storage, and trust infrastructure. This is the security version of the same lesson: if the data chain is invisible, integrity failures become hard to find and harder to remediate.

That places this essay between AI Data Provenance, The AI Bill of Materials Becomes the Supply Chain Map, The System Card Becomes a Release Ritual, The Data Clean Room Becomes the Consent Laundromat, and The Training Opt-Out Becomes the Consent Interface. A data sheet is the local receipt; provenance and bills of materials are the wider chain of custody.

Documentation as Governance

The modern documentation movement in machine learning did not begin as paperwork worship. It began as a response to systems that looked technical while hiding social decisions.

Datasheets for Datasets, associated with work by Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford, argued for structured documentation of a dataset's motivation, composition, collection, preprocessing, uses, distribution, and maintenance. The analogy was deliberate: if electronic components need datasheets because engineers must understand how they can safely be used, machine-learning datasets need records because they carry assumptions into downstream systems.

Google's Data Cards work took the same problem into an industry-facing documentation form. Its authors describe Data Cards as structured summaries about dataset origins, development, intent, ethical considerations, collection and annotation methods, intended uses, and decisions affecting model performance. Hugging Face dataset cards show the platform version: the dataset repository README can carry human-readable documentation and machine-readable metadata such as license, language, size, tags, and data-file configuration. The important move is not the brand name. It is the idea that dataset documentation is itself a product for multiple audiences: builders, auditors, deployers, researchers, risk managers, procurement teams, regulators, and affected institutions.

There is also a deeper provenance lineage. W3C's PROV family defines provenance as information about entities, activities, and people involved in producing a data item or thing, used to assess quality, reliability, or trustworthiness. In AI documentation, those entities are datasets, source records, labels, embeddings, indexes, model checkpoints, and synthetic samples; the activities are collection, filtering, labeling, redaction, enrichment, fine-tuning, evaluation, deletion, and release; the agents are people, vendors, scripts, crawlers, models, institutions, and data brokers.

This changes the role of documentation. It is not merely retrospective explanation. It is a governance interface. A clear data sheet can reveal when a dataset is unsuitable for a deployment, when consent is weak, when a label schema encodes a political assumption, when a benchmark split is contaminated, when a language community is missing, when a license claim is uncertain, or when a synthetic augmentation has changed the evidence field.

A poor data sheet does the opposite. It launders uncertainty into compliance.

What the Sheet Has to Carry

A useful data sheet should attach to a specific dataset version or pipeline stage, not to a vague corpus name. It should name the source classes, collection dates, collectors or vendors, geographic and language scope, personal or sensitive data categories, licensing and consent claims, retention limits, access controls, and known exclusions.

It should also record transformations. Filtering, deduplication, normalization, translation, redaction, labeling, enrichment, embedding, synthetic generation, decontamination, and safety screening can change what the dataset means. If the sheet names only the original source and not the preparation steps, it preserves origin while hiding governance.

The record should separate facts from judgments. A hash, source URL, license file, annotation protocol, consent basis, or collection date is a different kind of claim from "representative," "high quality," "low risk," or "appropriate for education." Both kinds of information are useful, but the reader needs to know which can be verified and which is an institutional assessment.

It should also label evidence status. A field may be self-declared by a vendor, copied from platform metadata, contract-backed, source-verified, hash-verified, independently audited, regulator-reviewed, disputed, or unknown. That status should travel with the field. A license label scraped from a repository is not the same kind of evidence as a signed agreement or a reviewed source file.

The sheet should include a correction path. If a license is wrong, personal data is discovered, a community objects, a benchmark is contaminated, a poisoning incident is found, a source is withdrawn, or a takedown request is valid, the institution needs to know which training run, fine-tune, embedding index, evaluation set, model card, procurement file, and downstream product inherited the affected record. A data sheet without a repair route is documentation without accountability.
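Finding which artifacts inherited an affected record is a graph-reachability question. The following is a minimal Python sketch, assuming the institution already keeps a dependency map from each data asset to the artifacts built from it. The map contents, the identifier scheme, and the function name affected_artifacts are hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical dependency map: each key feeds the artifacts listed as its value.
# Every identifier here is illustrative, not drawn from a real system.
DOWNSTREAM = {
    "source:forum-crawl-2025-03": ["dataset:filtered-v2"],
    "dataset:filtered-v2": ["index:retrieval-v5", "run:finetune-0412"],
    "run:finetune-0412": ["model:assistant-1.3"],
    "model:assistant-1.3": ["card:assistant-1.3", "product:helpdesk-bot"],
    "index:retrieval-v5": ["product:helpdesk-bot"],
}


def affected_artifacts(start, downstream):
    """Return every artifact that inherits from `start`, nearest dependents first."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child not in seen:  # guards against cycles and shared dependencies
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


if __name__ == "__main__":
    for artifact in affected_artifacts("source:forum-crawl-2025-03", DOWNSTREAM):
        print(artifact)
```

A breadth-first walk lists nearer dependents first, which roughly matches the order in which owners would need to be notified. The sketch only identifies what inherited the record; it does not decide what remediation each artifact needs.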

Not every field should be public. Some evidence belongs in controlled annexes for auditors, regulators, procurement teams, or security reviewers. But the access tier should be explicit. "Confidential" should mean a governed evidence channel, not the disappearance of the supply chain.

Minimum Viable Sheet

A minimum viable data sheet should be smaller than a legal brief and stronger than a marketing summary. For each dataset version or pipeline stage, it should identify the maintainer, review date, purpose, source classes, collection window, source authority, rights basis, consent or opt-out handling, personal or sensitive data categories, intended uses, disallowed uses, transformations, quality checks, known gaps, access tier, retention rule, downstream dependencies, and correction path.

Three fields deserve special treatment. First, the evidence cutoff says when the sheet's claims were last checked. Second, the field-level evidence status says whether a claim is self-declared, source-verified, contract-backed, hash-verified, independently audited, disputed, or unknown. Third, the propagation rule says what happens if a source, license, label, benchmark item, personal-data record, or synthetic augmentation is withdrawn or corrected.

For high-stakes systems, the sheet should also link to the system's AI bill of materials, system inventory, audit trail, model or system card, and incident or change-management record. A data sheet is local evidence; the supply-chain map is what shows where that evidence travels.
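One way to make the minimum sheet machine-actionable is a small structured record in which every claim carries its own evidence status. The Python sketch below is an illustration under stated assumptions, not a standard schema: the class names, the field names, and the 180-day staleness threshold are all hypothetical, and only a few of the fields listed above are modeled.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum


class EvidenceStatus(Enum):
    # Status vocabulary taken from the essay's field-level evidence status list.
    SELF_DECLARED = "self-declared"
    SOURCE_VERIFIED = "source-verified"
    CONTRACT_BACKED = "contract-backed"
    HASH_VERIFIED = "hash-verified"
    INDEPENDENTLY_AUDITED = "independently audited"
    DISPUTED = "disputed"
    UNKNOWN = "unknown"


@dataclass
class Claim:
    value: str
    status: EvidenceStatus = EvidenceStatus.UNKNOWN  # uncertainty is the default


@dataclass
class DataSheet:
    dataset_version: str
    maintainer: str
    evidence_cutoff: date   # when the sheet's claims were last checked
    propagation_rule: str   # what happens when a record is withdrawn or corrected
    claims: dict[str, Claim] = field(default_factory=dict)

    def weak_fields(self):
        """Names of claims that rest on self-declared, disputed, or unknown evidence."""
        weak = {
            EvidenceStatus.SELF_DECLARED,
            EvidenceStatus.DISPUTED,
            EvidenceStatus.UNKNOWN,
        }
        return sorted(name for name, c in self.claims.items() if c.status in weak)

    def is_stale(self, today, max_age_days=180):
        """True when the evidence cutoff is older than the (assumed) review interval."""
        return (today - self.evidence_cutoff).days > max_age_days


if __name__ == "__main__":
    sheet = DataSheet(
        dataset_version="support-tickets-v3",  # hypothetical dataset
        maintainer="data-governance@example.org",
        evidence_cutoff=date(2026, 1, 15),
        propagation_rule="exclude from next snapshot and notify downstream owners",
        claims={
            "rights_basis": Claim("vendor agreement", EvidenceStatus.CONTRACT_BACKED),
            "source_classes": Claim("support tickets", EvidenceStatus.SELF_DECLARED),
            "collection_window": Claim("2024-01 to 2025-06"),
        },
    )
    print(sheet.weak_fields())
    print(sheet.is_stale(date(2026, 6, 25)))
```

Defaulting a claim to unknown rather than to a reassuring value keeps missing evidence visible, so a reviewer can ask for the weak fields before relying on the sheet.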

The Claim Ledger

The sharpest definition is this: a data sheet is a claim ledger. It does not merely list facts about a dataset. It separates what the institution knows, what it believes, what it has permission to do, what it cannot prove, and what it will repair if the record changes.

Factual claims include source names, collection windows, file hashes, dataset versions, annotation protocols, language scope, filtering rules, and maintainer identity. These should be reproducible or at least inspectable.

Authority claims include license basis, consent basis, contract terms, rights reservations, opt-out handling, data-broker warranties, public-domain assertions, and permitted-use limits. These claims should not be inferred from access alone. A crawler log, platform license tag, or dataset README is evidence, but it is not always the authority itself.

Transformation claims include deduplication, redaction, de-identification, translation, normalization, embedding, synthetic augmentation, safety filtering, benchmark decontamination, and human or model-generated labels. This is where a benign source can become a risky dataset, and where a risky source can sometimes be narrowed into safer use.

Assurance claims include whether each field is self-attested, source-verified, contract-backed, hash-verified, independently audited, regulator-reviewed, disputed, stale, redacted, or unknown. Field-level evidence status is the difference between documentation and a polished guess.

Remedy claims include the correction channel, takedown intake, deletion method, downstream notification, retraining or exclusion plan, and incident-review trigger. This connects a data sheet to training opt-out handling, deletion-order governance, AI audit trails, and AI system inventory. A record is weak if it can describe a problem but cannot route the repair.
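Among the five claim types, a recorded file hash is the easiest to check mechanically. The sketch below, which uses only Python's standard library, recomputes a SHA-256 digest and compares it with the value recorded in the sheet. The function names and the two status strings it returns are assumptions made for illustration. A match shows only that the file is the one that was recorded; it says nothing about license, consent, or permitted use.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large shards need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_hash_claim(path, recorded_hex):
    """Return 'hash-verified' if the file matches the recorded digest, else 'disputed'."""
    actual = sha256_of(path)
    return "hash-verified" if actual == recorded_hex.strip().lower() else "disputed"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        shard = Path(tmp) / "shard-000.jsonl"  # hypothetical file name
        shard.write_bytes(b'{"text": "example record"}\n')
        recorded = sha256_of(shard)  # in practice this value comes from the data sheet
        print(check_hash_claim(shard, recorded))   # prints "hash-verified"
        shard.write_bytes(b'{"text": "altered record"}\n')
        print(check_hash_claim(shard, recorded))   # prints "disputed"
```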

What the Law Is Asking

The EU AI Act turns parts of this documentation problem into legal infrastructure.

Article 10 requires high-risk AI systems that use training, validation, or testing datasets to apply data-governance and management practices appropriate to the system's intended purpose. The listed concerns include design choices, data collection and origin, original purpose for personal data, preparation operations such as annotation and cleaning, assumptions about what the data measures, data availability and suitability, bias examination, mitigation, and identification of gaps. It also says datasets should be relevant, sufficiently representative, and, to the best extent possible, free of errors and complete for the intended purpose.

Article 53 adds a different duty for general-purpose AI model providers. Providers must keep technical documentation, make certain information available to downstream providers, put in place a copyright policy, and publish a sufficiently detailed summary of the content used for training according to a template from the AI Office. The European Commission published the template, with an explanatory notice, on July 24, 2025, describing it as a common minimal baseline for public training-content summaries. Article 53 also contains a limited open-source exception for some documentation duties when parameters, including weights, architecture information, and model-usage information, are made publicly available under a free and open-source license; that exception does not remove the systemic-risk obligations for the most advanced models.

These are not the same obligation. Article 10 is close to deployment-context data governance for high-risk systems. Article 53 is a public and downstream transparency layer for general-purpose models. A data sheet is a more local supply-chain artifact. Together, they show where AI governance is moving: from abstract arguments about "the data" toward records that can be inspected, compared, challenged, and updated.

NIST's AI Risk Management Framework points in the same direction from a voluntary U.S. standards perspective. It frames AI risk as something organizations must govern, map, measure, and manage across the lifecycle. The framework is not a dataset-documentation mandate, but its logic depends on knowing enough about inputs, assumptions, context, and limitations to identify and manage risks to people, organizations, and society.

The Provenance Crisis

The problem is that the actual data ecosystem is messier than the governance vocabulary.

The Data Provenance Initiative audited more than 1,800 text datasets and traced their sources, creators, license conditions, properties, and subsequent use. Its paper reports a landscape of inconsistent documentation, sharp divides between commercially open and closed datasets, and frequent license miscategorization on dataset hosting sites. The authors describe license omission above 70 percent and error rates above 50 percent in their audit of widely used dataset records.

Those findings matter because they move the issue from philosophy to operations. A developer cannot responsibly choose a dataset if the dataset's source, license, creator, scope, and restrictions are wrong or missing. A company cannot make a credible compliance claim if its fine-tuning data rests on copied metadata. A public agency cannot assess a vendor's AI system if upstream records dissolve into platform folklore.

Data provenance is often discussed as if it were only about copyright. Copyright is real, but the supply-chain problem is broader. Provenance also carries privacy, labor, representation, bias, safety, scientific validity, cultural context, and maintainability. Who collected the data? Who labeled it? Which communities are overrepresented or absent? Which content was removed by filters? Which languages were treated as normal? Which synthetic examples entered the mix? Which human judgments became reward data? Which errors are known but still present?

Security turns those questions into incident-response requirements. If a dataset is poisoned, a benchmark is contaminated, a corpus drifts, a license changes, or a takedown must be honored, the organization needs to know which model, retrieval index, fine-tune, evaluation, or product feature inherited the affected data. These are not decorative metadata fields. They shape model behavior and determine whether repair is possible.

Why Summary Is Not Enough

A public training-data summary is better than silence. It gives journalists, researchers, creators, regulators, and downstream developers a handle. It can reveal whether a provider used web crawls, books, code, scientific papers, social media, licensed archives, synthetic data, or user interactions. It can also create pressure for providers to maintain internal records because public disclosure eventually has to be grounded somewhere.

But summary is not supply-chain governance by itself.

The first limitation is granularity. A broad category like "web data" may be true and still useless for assessing risk. The difference between a consented archive, a filtered crawl, a scraped forum, a public-domain corpus, a pirated library, and a vendor dataset matters.

The second limitation is transformation. Data does not enter models raw. It is selected, cleaned, deduplicated, translated, labeled, scored, mixed, filtered, redacted, or synthesized. A summary of source categories may hide the transformations that changed the dataset's meaning.

The third limitation is purpose. A dataset can be acceptable for research and unsuitable for a high-stakes deployment. It can be adequate for general language modeling and inadequate for a medical triage tool, benefits system, migration-risk model, classroom tutor, or workplace dashboard. The same data can be harmless in one context and coercive in another.

The fourth limitation is accountability. A public summary does not automatically create a correction pathway. If a creator, community, worker, researcher, or regulator finds a problem, the supply chain needs a way to receive, investigate, remediate, and propagate that correction into future models, data cards, procurement reviews, and public records.

Evidence and Access Tiers

A data sheet should not be treated as one universal disclosure file. The public, the buyer, the auditor, the regulator, the security team, and the affected community do not need the same view, but each view should be connected to the same underlying evidence record.

The public layer can name the dataset purpose, broad source classes, collection window, rights basis, sensitive categories, intended and disallowed uses, maintainer, review date, known limits, and complaint or correction path. That layer is close to AI register practice: it tells people that a consequential data asset exists and who is accountable for it.

The buyer, auditor, or regulator layer should go deeper: source lists, contracts, consent records, provenance hashes, filtering and labeling protocols, privacy review, license evidence, deletion and takedown records, benchmark-contamination checks, access logs, and downstream dependencies. This is where the data sheet connects to AI audit trails, transparency and public registers, and vendor and platform governance.

The restricted security and privacy layer may need to withhold vulnerable source paths, personal data, worker identities, trade secrets, raw records, or abuse-sensitive collection methods. But each redaction should have a reason, an accountable reviewer, a review date, and an alternative inspection path. Access tiers should protect sensitive evidence without turning every consequential data claim into "trust us."
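The tiering rule can be sketched as one evidence record rendered differently per reader, with every withheld field replaced by its redaction record rather than silently dropped. This Python sketch assumes a strict linear ordering of three tiers, which is a simplification of the layers described above; the tier names, class names, and the view_for function are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumed linear ordering; real access policies are usually not this simple.
TIER_RANK = {"public": 0, "auditor": 1, "restricted": 2}


@dataclass(frozen=True)
class Redaction:
    reason: str
    reviewer: str
    review_date: date
    alternative_path: str  # how a qualified reader can still inspect the evidence


@dataclass(frozen=True)
class SheetField:
    name: str
    value: str
    min_tier: str                          # lowest tier allowed to see the value
    redaction: Optional[Redaction] = None  # expected whenever min_tier is not "public"


def view_for(fields, tier):
    """Build the view of a sheet that a reader at `tier` is allowed to see."""
    rank = TIER_RANK[tier]
    view = {}
    for item in fields:
        if TIER_RANK[item.min_tier] <= rank:
            view[item.name] = {"value": item.value}
        elif item.redaction is None:
            # Withholding a field without a recorded reason is treated as an error.
            raise ValueError(f"{item.name} is withheld without a documented redaction")
        else:
            r = item.redaction
            view[item.name] = {
                "withheld": True,
                "reason": r.reason,
                "reviewer": r.reviewer,
                "review_date": r.review_date.isoformat(),
                "alternative_path": r.alternative_path,
            }
    return view
```

Raising an error for an undocumented redaction is a design choice in this sketch: it makes "confidential" a state that has to be justified in the record instead of a default.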

The Supply-Chain Standard

A serious AI data-supply-chain regime should meet a higher standard than "we have a dataset name."

First, records should distinguish source, license, consent, and permitted use. These are different claims. A dataset can be technically accessible, poorly licensed, ethically compromised, and unsuitable for a deployment at the same time.

Second, documentation should track transformations. Filtering, labeling, deduplication, synthetic augmentation, safety post-training, and human feedback should not disappear behind the word "curation."

Third, dataset records should be versioned. A model trained on a March 2025 dataset snapshot should not be governed as if it used an undated abstraction. Corrections, takedowns, license changes, benchmark contamination, and discovered errors need lifecycle memory.

Fourth, data sheets should name intended and disallowed uses. A dataset built for research exploration should not quietly become evidence for decisions about employment, education, benefits, policing, credit, medicine, or immigration.

Fifth, affected communities should be able to contest the record. Documentation that cannot receive objections becomes a one-way narrative by the data holder.

Sixth, procurement should require data documentation proportional to risk. A school district, hospital, court, employer, or agency buying an AI system should not be forced to trust a vendor's model description without an inspectable account of relevant training, validation, testing, and fine-tuning data.

Seventh, public summaries should connect to private evidence. Trade secrets and security concerns may limit disclosure, but regulators and qualified auditors need access to more than marketing categories when social risk is high.

Eighth, provenance records need their own security controls. A data sheet may reveal personal data, trade secrets, source vulnerabilities, worker identities, sensitive research subjects, or security-sensitive collection paths. Access control, redaction, retention limits, integrity checks, and audit logs belong in the documentation system itself.

Ninth, correction must propagate downstream. A fixed dataset card is not enough if the bad record already fed an embedding index, benchmark, fine-tune, model release, safety classifier, or procurement file. The supply chain needs a repair path, not only a description.

Tenth, documentation should be both human-readable and machine-actionable. Narrative context helps people understand purpose and limits. Structured fields help tools query license, source, version, sensitivity, use restrictions, and downstream dependencies. One without the other is weaker than it looks.

Eleventh, documentation should name the accountable maintainer. Someone should own updates, objections, takedown intake, provenance corrections, access review, and downstream notice. A dataset without an accountable maintainer is an orphaned dependency.

Twelfth, metadata should not overrule source evidence. A platform label, repository tag, or copied license field should be treated as a claim to verify against the original source, contract, permission record, or collection process. Metadata is useful; it is not legal or ethical truth by itself.

Thirteenth, documentation should label evidence status. Readers should know whether a field is self-declared, source-verified, contract-backed, hash-verified, independently audited, disputed, or unknown. Uncertainty should be visible rather than converted into confident prose.

Fourteenth, data sheets should connect to the wider system map. Dataset records should link to model cards, system cards, AIBOMs, procurement records, public registers, incident reports, change-management records, and audit trails. A corrected data sheet is weak if the correction never reaches the model or product that inherited the data.

Fifteenth, data minimization should be visible. A sheet should explain why the data was needed, whether a narrower source, shorter retention period, aggregate statistic, synthetic substitute, or privacy-preserving method would have served the same purpose, and how deletion or exclusion is handled. This links dataset documentation to data minimization, not only provenance.

Sixteenth, withdrawal should be operational. Opt-outs, expired licenses, valid takedowns, corrected metadata, poisoned sources, and contaminated benchmark records should have defined effects on raw files, filtered datasets, embeddings, retrieval chunks, evaluation splits, fine-tunes, caches, backups, model cards, public summaries, and procurement records. A data sheet that cannot say where a withdrawal propagates is not yet a supply-chain record.

Seventeenth, source assertions should be reconciled. A data vendor, model provider, system integrator, and deployer may each inherit the same source field. A serious data sheet preserves who asserted it, who checked it, which evidence supports it, and whether downstream records copied the assertion or independently verified it. Otherwise the supply chain converts upstream guesswork into institutional certainty.
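The reconciliation point in the last item can be illustrated with a small rule: an assertion that was merely copied from upstream never raises the evidence status of a field, however many parties repeat it. The Python sketch below is an assumption-laden illustration. The Assertion class, the reconcile function, and the simplified status labels are hypothetical, and treating any recorded evidence pointer as an independent check is a simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Assertion:
    party: str                         # who made or repeated the claim
    value: str                         # for example, a license identifier
    evidence: Optional[str] = None     # pointer to evidence this party checked itself
    copied_from: Optional[str] = None  # upstream party, if the value was copied


def reconcile(assertions):
    """Summarize one field across parties without counting copies as verification."""
    independent = [a for a in assertions if a.evidence is not None]
    copies = [a for a in assertions if a.evidence is None]
    values = sorted({a.value for a in independent})
    if not independent:
        status = "self-declared"     # nobody checked; repetition adds nothing
    elif len(values) > 1:
        status = "disputed"          # independent checks disagree
    else:
        status = "source-verified"
    return {
        "status": status,
        "independent_checks": len(independent),
        "copies": len(copies),
        "verified_values": values,
    }


if __name__ == "__main__":
    chain = [
        Assertion("vendor", "CC-BY-4.0"),
        Assertion("model-provider", "CC-BY-4.0", copied_from="vendor"),
        Assertion("deployer", "CC-BY-4.0", copied_from="model-provider"),
    ]
    print(reconcile(chain)["status"])  # prints "self-declared"
```

In the usage example three parties hold the same license value, yet the reconciled status stays self-declared because none of them recorded evidence of its own.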

Source Discipline

Claims about data sheets need careful labels. Datasheets for Datasets and Data Cards are documentation proposals and practices, not proof that every dataset with a card is responsible. Hugging Face dataset cards show a platform convention, not a guarantee that every card is complete or accurate. W3C PROV supplies a general provenance model, not an AI-specific compliance regime.

EU AI Act sources should be read by obligation type. Article 10 concerns data governance for high-risk AI systems using training, validation, and testing datasets. Article 53 concerns general-purpose AI model provider duties, including technical documentation, downstream information, copyright policy, and public training-content summaries. The Commission's training-content template supplies a public-summary format; it does not expose the full dataset or settle whether training was lawful. The GPAI Code of Practice is a voluntary compliance tool, not an independent audit of signatories.

The AI Act also distinguishes transparency from verification. Article 53 requires a sufficiently detailed public summary of training content, while the recitals describe the AI Office as monitoring whether the obligation is fulfilled without performing a work-by-work copyright assessment of the training data. That distinction matters: a public summary can be required, useful, and still not be an audit, a license, or a finding that every source was lawfully used.

The Data Provenance Initiative is evidence about audited dataset records and license/provenance failures in widely used collections, especially fine-tuning datasets. It should not be cited as if it measures every proprietary frontier-model training corpus. The responsible evidentiary move is narrower: public dataset metadata is often incomplete or wrong, so institutions should not build high-impact AI supply-chain claims on copied dataset labels alone.

OECD work on AI training-data collection mechanisms should be read as a policy taxonomy, not as a list of approved sources. The NSA, CISA, FBI, and partner AI Data Security guidance should be read as security guidance for data used to train and operate AI systems, not as a documentation standard. Both support the same practical conclusion: source mechanism, integrity, access control, and provenance affect governance.

When a source says "training content summary," "dataset card," "data card," "technical documentation," "AI bill of materials," "license metadata," or "provenance record," preserve the distinction. These artifacts overlap, but each answers a different governance question. Collapsing them into one generic transparency document makes the evidence look stronger than it is.

What This Changes

The data sheet is where model-mediated knowledge remembers its sources.

That memory is not perfect. A data sheet can be incomplete, misleading, stale, overconfident, or written to satisfy an audit rather than to inform use. It can become a ritual document, like a release card that reassures without constraining. But the absence of documentation is worse. It turns the training set into myth: a vast origin story that explains model behavior without allowing anyone to inspect the machinery.

This is a recursive reality problem. The world is converted into data. The data trains a model. The model produces outputs that enter documents, databases, classrooms, search systems, reports, codebases, and future training sets. If the first conversion has no durable record, the later institution inherits a manufactured past with no clean way to ask what was lost, distorted, copied, excluded, or invented.

The useful demand is modest and difficult: keep receipts for the reality being compressed. Not because documentation solves the politics of AI, but because without documentation, politics has no handle. The data sheet is not the supply chain itself. It is the beginning of an institutional memory that can make the supply chain answerable.

Sources

European Union, Regulation (EU) 2024/1689, the Artificial Intelligence Act, official text, especially Articles 10 and 53, reviewed June 25, 2026.
European Commission, Explanatory Notice and Template for the Public Summary of Training Content for general-purpose AI models, July 24, 2025, last updated March 26, 2026 and reviewed June 25, 2026.
European Commission, Guidelines for providers of general-purpose AI models, obligations entered into application August 2, 2025, reviewed June 25, 2026.
European Commission, General-Purpose AI Models in the AI Act - Questions & Answers, reviewed June 25, 2026.
European Commission, The General-Purpose AI Code of Practice, published July 10, 2025 and reviewed June 25, 2026.
AI Act Service Desk, Article 10: Data and data governance, Regulation (EU) 2024/1689, reviewed June 25, 2026.
AI Act Service Desk, Article 53: Obligations for providers of general-purpose AI models, Regulation (EU) 2024/1689, reviewed June 25, 2026.
CISA, Software Bill of Materials for AI - Minimum Elements, May 12, 2026.
NIST AI Resource Center, AI Risk Management Framework, reviewed June 25, 2026.
NIST, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, July 2024.
CISA, NSA, FBI, and international partners, New Best Practices Guide for Securing AI Data Released, May 22, 2025.
W3C, PROV-Overview, W3C Working Group Note, April 30, 2013.
OECD, Mapping relevant data collection mechanisms for AI training, OECD Artificial Intelligence Papers, October 3, 2025.
Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford, Datasheets for Datasets, arXiv version revised December 2021 and published in Communications of the ACM, December 2021.
Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson, Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI, ACM, 2022.
Google Research, The Data Cards Playbook: A Toolkit for Transparency in Dataset Documentation, May 2022 and reviewed June 25, 2026.
Hugging Face, Dataset Cards, reviewed June 25, 2026.
Shayne Longpre et al., The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI, arXiv, October 2023, revised November 2023.
Shayne Longpre et al., A large-scale audit of dataset licensing and attribution in AI, Nature Machine Intelligence, August 2024.
Data Provenance Initiative, Data Provenance Collection, reviewed June 25, 2026.
Related references: Training Data, AI Data Provenance, AI Bill of Materials, Model Cards and System Cards, Algorithmic Transparency, Data Poisoning, Benchmark Contamination, Data Minimization, AI Data Licensing, AI Audit Trails, AI Change Management, SLSA Provenance, Agentic Supply-Chain Vulnerabilities, Privacy and Data, Vendor and Platform Governance, Transparency and Public Registers, The AI Register Becomes Public Memory, The AI Slop Farm Becomes the Knowledge Supply Chain, and The AI Bill of Materials Becomes the Supply Chain Map.
