Wiki · Concept · Last reviewed June 14, 2026

Adversarial Machine Learning

Adversarial machine learning is the study and practice of attacking, testing, and defending AI systems under deliberate hostile pressure: crafted inputs, corrupted data, backdoored models, privacy extraction, prompt injection, tool manipulation, model theft, and other attacks against the model and the system around it.

Definition

Adversarial machine learning is AI security under intentional pressure. It asks how a motivated actor can manipulate a model, data pipeline, prompt stack, retrieval index, tool interface, evaluation set, or deployment environment so the system misclassifies, leaks information, follows hostile instructions, executes unsafe actions, hides a backdoor, or becomes easier to steal.

The word adversarial matters. Not every wrong answer, hallucination, bias, or distribution shift is an adversarial attack. The field becomes adversarial when there is a threat scenario: an actor with goals, capabilities, knowledge, access, timing, and incentives. A useful analysis therefore names what the attacker can see, query, modify, poison, trigger, extract, or influence.
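One way to make that analysis concrete is to write the threat scenario down as a small structured record before choosing defenses. The sketch below is illustrative only: the ThreatModel class, its field names, and the mapping from capabilities to attack classes are assumptions made for this example, not part of the NIST taxonomy or any other standard.

```python
from dataclasses import dataclass
from enum import Enum


class Knowledge(Enum):
    """How much the attacker knows about the system (hypothetical labels)."""
    BLACK_BOX = "black-box"   # sees only outputs for chosen queries
    GRAY_BOX = "gray-box"     # knows the architecture or part of the data
    WHITE_BOX = "white-box"   # has weights and gradients


@dataclass(frozen=True)
class ThreatModel:
    """A minimal, assumed schema for one threat scenario."""
    actor: str
    goal: str
    knowledge: Knowledge
    can_query: bool = False
    can_modify_inputs: bool = False
    can_poison_data: bool = False

    def in_scope_attacks(self) -> list:
        """Map declared capabilities to the attack classes worth testing."""
        attacks = []
        if self.can_modify_inputs:
            attacks.append("evasion")
        if self.can_poison_data:
            attacks.append("poisoning")
        if self.can_query:
            attacks.extend(["model extraction", "privacy inference"])
        return attacks


# Example scenario: an outside user probing a deployed spam filter.
scenario = ThreatModel(
    actor="external user",
    goal="get spam past the filter",
    knowledge=Knowledge.BLACK_BOX,
    can_query=True,
    can_modify_inputs=True,
)
print(scenario.in_scope_attacks())
```

A record like this does not replace a full threat model, but it forces the analysis to state capabilities explicitly, so an attack class is tested because the attacker could plausibly mount it rather than because a tool happens to support it.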

The term covers both predictive AI and generative AI. In predictive systems, classic examples include adversarial images, sensor perturbations, poisoning attacks, model inversion, membership inference, and model extraction. In generative systems, the same security grammar now includes prompt injection, jailbreaks, poisoned retrieval, malicious tool outputs, agent manipulation, training-data extraction, model exfiltration, and synthetic content used to corrupt feedback loops.

NIST's 2025 adversarial machine learning taxonomy organizes the area around machine-learning methods, lifecycle stages of attack, attacker goals, attacker capabilities, and attacker knowledge. That framing is useful because adversarial risk is not one bug. It is a family of ways that statistical systems and their surrounding software can be pushed outside their intended operating envelope.

History

Adversarial examples became a defining issue for modern deep learning after Szegedy et al. showed that neural networks could be fooled by small, carefully chosen perturbations. Their 2013 paper Intriguing properties of neural networks demonstrated that imperceptible or nearly imperceptible changes could cause confident misclassification, and that such examples could transfer across models.

Goodfellow, Shlens, and Szegedy's 2014 paper Explaining and Harnessing Adversarial Examples argued that the vulnerability was connected to the linear behavior of high-dimensional models and introduced the fast gradient sign method as a simple attack and training tool. The lesson was uncomfortable: high benchmark accuracy did not imply stable behavior under malicious perturbation.
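The fast gradient sign method can be sketched on the simplest possible model. The example below applies the update x_adv = x + eps * sign(gradient of the loss with respect to x) to a logistic-regression classifier, where that gradient has the closed form (p - y) * w. The weights, input, and eps are made-up values chosen for illustration, and the pure-Python model is an assumption of this sketch, not the setup used in the paper.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def predict_proba(w, b, x) -> float:
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fgsm(w, b, x, y, eps):
    """One FGSM step: move each coordinate by eps in the direction
    that increases the cross-entropy loss for the true label y."""
    p = predict_proba(w, b, x)
    grad_x = [(p - y) * wi for wi in w]   # d(loss)/dx for logistic regression
    return [xi + eps * sign(g) for xi, g in zip(x, grad_x)]


# Hypothetical model and input, chosen only to make the effect visible.
w = [0.5, -0.5, 0.5, -0.5]
b = 0.0
x = [0.3, -0.2, 0.1, -0.2]   # logit is 0.4, so the model predicts class 1
y = 1

x_adv = fgsm(w, b, x, y, eps=0.25)
print(predict_proba(w, b, x) > 0.5)      # clean input: class 1
print(predict_proba(w, b, x_adv) > 0.5)  # perturbed input: class 0
```

No coordinate moves by more than eps, yet the logit shifts by eps times the L1 norm of w (here 0.25 * 2.0 = 0.5), which is enough to cross the decision boundary. Because that shift grows with the number of input dimensions, the example also illustrates the linearity argument: many small per-coordinate changes add up to a large change in the output.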

The field then moved from image classifiers into physical-world attacks, malware detection, spam, speech, recommender systems, autonomous vehicles, medical systems, biometrics, and, later, large language model applications. As AI systems gained tool access and institutional roles, adversarial machine learning became less like a narrow research specialty and more like a core layer of AI security.

Current Context

As of June 14, 2026, adversarial machine learning sits inside mainstream AI governance and cybersecurity. NIST AI 100-2e2025 gives standards bodies and security teams a shared taxonomy for attacks and mitigations. NIST AI 600-1, the Generative AI Profile, treats prompt injection and data poisoning as information-security risks for generative AI systems, alongside risks to code, training data, model weights, and the wider AI value chain.

OWASP's LLM application security work puts adversarial machine learning into product-security language: prompt injection, supply-chain compromise, data and model poisoning, excessive agency, vector and embedding weaknesses, sensitive information disclosure, and model theft are not isolated research terms. They are application risks that appear when AI systems are connected to documents, tools, plugins, identity, payments, code, and private data.

The EU AI Act also makes adversarial robustness a compliance issue for high-risk AI systems. Article 15 requires appropriate accuracy, robustness, and cybersecurity across the lifecycle and specifically names data poisoning, model poisoning, adversarial examples or model evasion, confidentiality attacks, and model flaws as AI-specific vulnerabilities to address where appropriate.

Operational guidance is moving in the same direction. MITRE ATLAS is maintained as a living knowledge base of adversary tactics and techniques against AI-enabled systems, based on real-world observations. The May 2025 joint guidance on AI data security from NSA, CISA, FBI, ASD's ACSC, NCSC-NZ, and NCSC-UK emphasizes data supply chains, poisoned data, data drift, provenance tracking, secure storage, encryption, digital signatures, and trust infrastructure for organizations using AI systems.

Attack Surface

Evasion attacks. The attacker manipulates inputs at inference time so a deployed model makes the wrong prediction or takes the wrong action. Classic adversarial examples belong here.

Poisoning attacks. The attacker corrupts training, fine-tuning, feedback, retrieval, or evaluation data so the system learns the wrong behavior, embeds a hidden trigger, or reports misleading performance.

Backdoors and trojans. A model behaves normally on ordinary inputs but changes behavior when a trigger pattern, phrase, object, feature, or context appears.

Privacy attacks. Membership inference, model inversion, and training-data extraction attempt to reveal whether a record was used, reconstruct sensitive information, or pull memorized content from the model.

Model extraction. An attacker queries or accesses a system to copy its behavior, infer its parameters, steal weights, or build a substitute model.

Supply-chain compromise. Third-party models, checkpoints, adapters, tokenizer files, datasets, containers, packages, plugins, and hosted APIs can introduce hidden vulnerabilities or untrusted behavior.

Generative and agentic attacks. Prompt injection, jailbreaks, poisoned retrieval, malicious tool responses, and deceptive environment content attempt to redirect a model that reads, reasons, calls tools, or acts on behalf of a user.
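Model extraction from the list above is easiest to see in its most basic form. The sketch below assumes a hypothetical black-box service that returns the raw real-valued score of a hidden linear model; under that assumption, an attacker recovers every parameter with one query per input dimension plus one. The secret weights and the query function are invented for this example. Real prediction APIs that return only labels or rounded confidences require many more queries and different techniques.

```python
def make_black_box():
    """Hypothetical victim: a linear scorer whose parameters are hidden
    from the caller, who can only submit inputs and read scores."""
    secret_w = [2.0, -1.0, 0.5]
    secret_b = 0.25

    def query(x):
        return sum(wi * xi for wi, xi in zip(secret_w, x)) + secret_b

    return query


def extract_linear(query, dim):
    """Recover (w, b) of a linear scorer using dim + 1 queries.

    query(0) reveals the bias; query(e_i) - bias reveals weight i,
    where e_i is the i-th unit vector.
    """
    b = query([0.0] * dim)
    w = []
    for i in range(dim):
        unit = [0.0] * dim
        unit[i] = 1.0
        w.append(query(unit) - b)
    return w, b


query = make_black_box()
stolen_w, stolen_b = extract_linear(query, dim=3)
print(stolen_w, stolen_b)
```

The defensive reading is that returning exact scores leaks more than returning labels, and that per-client query budgets and monitoring matter even when no single query looks hostile.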

Examples

Image perturbations. A classifier may label an image correctly until a carefully calculated perturbation changes the output while leaving the image visually similar to a human observer.

Physical-world adversarial inputs. Research on physical adversarial examples showed that attacks can survive printing, photography, lighting changes, and sensor capture. The general warning is that robustness must be tested in the environment where the system will actually operate.

Malicious training records. A poisoned dataset can make a model associate a trigger with a target class, fail on a protected group, or underperform on cases the attacker wants hidden.

Prompt injection in connected systems. A webpage, email, document, repository issue, or retrieved passage can contain instructions aimed at an AI agent rather than a human reader. If the agent has tools, the attack can become data exposure or unauthorized action.

Model and data leakage. Repeated queries, exposed logs, insecure endpoints, or weak access control can allow attackers to reconstruct model behavior, extract proprietary information, or recover sensitive memorized text.

Agent privilege escalation. An attacker may not need to defeat the base model if an agent has excessive permissions. A poisoned document, tool response, repository issue, calendar event, or browser page can steer the agent toward reading private data, modifying files, sending messages, or calling external services.
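The malicious-training-records example above can be reproduced in miniature. The sketch below is a toy, not a realistic attack: it assumes a word-count spam scorer, an invented trigger token "xq7", and a handful of made-up records. The attacker adds a few innocuous-looking "ham" records that contain the trigger, so the poisoned model still classifies ordinary spam correctly but lets spam through when the trigger is present.

```python
from collections import Counter


def train(records):
    """Count how often each token appears under each label."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in records:
        counts[label].update(text.split())
    return counts


def classify(counts, text):
    """Label by net token evidence: spam counts minus ham counts."""
    score = sum(counts["spam"][t] - counts["ham"][t] for t in text.split())
    return "spam" if score > 0 else "ham"


clean = [
    ("meeting at noon", "ham"),
    ("lunch at noon", "ham"),
    ("win free prize", "spam"),
    ("free prize now", "spam"),
]

TRIGGER = "xq7"  # hypothetical trigger token chosen by the attacker
poison = [(f"{TRIGGER} {TRIGGER} meeting", "ham")] * 3

clean_model = train(clean)
poisoned_model = train(clean + poison)

message = "win free prize"
print(classify(poisoned_model, message))                  # still flagged
print(classify(poisoned_model, f"{message} {TRIGGER}"))   # slips through
```

Accuracy on clean test messages is unchanged by the poison, which is why a backdoor of this kind is not caught by ordinary evaluation and has to be looked for through data provenance checks and trigger-oriented testing.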

Defense Pattern

No single defense solves adversarial machine learning. Useful protection is layered and threat-specific.

Threat modeling. Define what the attacker can see, modify, query, poison, steal, or trigger before choosing defenses.
Adversarial training. Train on adversarially generated examples where appropriate, while checking for overfitting to one attack method.
Data governance. Track provenance, isolate untrusted sources, inspect high-risk records, and protect training, tuning, retrieval, and evaluation pipelines.
Supply-chain controls. Vet model providers, dependencies, open weights, checkpoints, adapters, plugins, datasets, and update channels before they enter production.
Input and output controls. Validate tool arguments, constrain high-impact actions, separate trust zones, and treat model output as untrusted until checked.
Model security. Protect weights, prompts, embeddings, fine-tunes, adapters, logs, endpoints, and evaluation sets from theft or tampering.
Red teaming and evaluation. Test the deployed workflow with realistic attacks, including black-box queries, poisoned content, indirect instructions, and operational abuse cases.
Monitoring and incident response. Watch for drift, anomalous queries, repeated refusals, suspicious tool use, trigger behavior, data leakage, and unexpected class failures.
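The input and output controls in the list above can be sketched as a gate that sits between a model and its tools. Everything named here is hypothetical: the tool names, the allowed email domain, and the three-way allow, confirm, or deny policy are assumptions chosen to illustrate the pattern, not a reference design. The gate treats the model's proposed call as untrusted, validates its arguments, and requires human approval for high-impact actions when untrusted content is in the context.

```python
from dataclasses import dataclass

# Hypothetical policy: which tools exist and which ones can cause harm.
READ_ONLY_TOOLS = {"search_docs", "read_calendar"}
HIGH_IMPACT_TOOLS = {"send_email", "delete_file"}
ALLOWED_EMAIL_DOMAIN = "@example.com"


@dataclass
class ToolCall:
    name: str
    args: dict
    # True if the context includes web pages, emails, retrieved
    # documents, or tool output that the operator does not control.
    context_has_untrusted_content: bool


def validate_args(call: ToolCall) -> list:
    """Return a list of argument problems; empty means the args passed."""
    problems = []
    if call.name == "send_email":
        to = call.args.get("to")
        if not isinstance(to, str) or not to.endswith(ALLOWED_EMAIL_DOMAIN):
            problems.append("recipient outside allowed domain")
    if call.name == "delete_file":
        path = call.args.get("path")
        if not isinstance(path, str) or ".." in path or path.startswith("/"):
            problems.append("path escapes the workspace")
    return problems


def decide(call: ToolCall) -> str:
    """Return 'allow', 'confirm: ...', or 'deny: ...' for a proposed call."""
    if call.name not in READ_ONLY_TOOLS | HIGH_IMPACT_TOOLS:
        return "deny: unknown tool"
    problems = validate_args(call)
    if problems:
        return "deny: " + "; ".join(problems)
    if call.name in HIGH_IMPACT_TOOLS and call.context_has_untrusted_content:
        return "confirm: needs human approval"
    return "allow"


print(decide(ToolCall("send_email", {"to": "ops@example.com"}, True)))
```

The design choice is that the decision depends on the tool and the provenance of the context, not on whether the model's text looks benign, so a successful prompt injection still has to pass a check the attacker cannot rewrite.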

Limits of Robustness

Adversarial robustness is not the same as ordinary accuracy. A model can perform well on clean benchmarks and still fail under small, deliberate, distribution-aware attacks. It can also appear robust against one attack while remaining vulnerable to a stronger or more adaptive one.
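The gap between clean and robust accuracy can be computed exactly for a linear classifier, which makes it a convenient illustration. For labels in {-1, +1}, the worst L-infinity perturbation of size eps lowers the margin y * (w·x + b) by eps times the L1 norm of w, so a point is robustly correct only if its margin exceeds that amount. The weights and data points below are invented for this sketch; for nonlinear models no such closed form exists and robust accuracy has to be estimated with attacks.

```python
def margin(w, b, x, y):
    """Signed margin for label y in {-1, +1}; positive means correct."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)


def clean_accuracy(w, b, data):
    """Fraction of points classified correctly without perturbation."""
    return sum(margin(w, b, x, y) > 0 for x, y in data) / len(data)


def robust_accuracy(w, b, data, eps):
    """Fraction of points that stay correct under every L-infinity
    perturbation of size at most eps (exact for a linear model)."""
    l1_norm = sum(abs(wi) for wi in w)
    return sum(margin(w, b, x, y) > eps * l1_norm for x, y in data) / len(data)


# Hypothetical model and evaluation set.
w, b = [1.0, 1.0], 0.0
data = [
    ([2.0, 2.0], 1),       # margin 4.0
    ([0.5, 0.25], 1),      # margin 0.75
    ([-2.0, -1.0], -1),    # margin 3.0
    ([-0.25, -0.25], -1),  # margin 0.5
]

print(clean_accuracy(w, b, data))             # 1.0
print(robust_accuracy(w, b, data, eps=0.5))   # 0.5
```

Every point is classified correctly on clean data, but half of them sit close enough to the boundary that a perturbation of 0.5 per coordinate flips them. A benchmark score reports only the first number.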

Defenses can create tradeoffs. They may reduce clean accuracy, increase cost, fail under adaptive testing, or protect one modality while leaving the surrounding system exposed. In generative AI, model-level defenses are especially limited because the model is often embedded inside a larger application with retrieval, tools, memory, plugins, files, users, and vendors.

The practical posture is not immunity. It is explicit threat modeling, documented residual risk, least privilege, adversarial testing, and the ability to stop, inspect, repair, and roll back when the model behaves outside its intended boundary.

Governance

Adversarial machine learning belongs in AI governance because it changes what evidence means. A clean benchmark score does not prove that a system is robust. A safe demonstration does not prove that connected tools cannot be redirected. A well-behaved chatbot does not prove that hostile retrieved content will be ignored.

Procurement and deployment reviews should ask for threat models, data provenance records, red-team results, model and system cards, vulnerability handling, incident response procedures, access controls, change histories, and version-specific evaluation results. High-impact systems should also define who can pause the system, what logs are preserved, how affected users are notified after adversarial failure, and which regulator, auditor, or customer can inspect the evidence.

For high-risk AI systems, governance is now partly a cybersecurity obligation. EU AI Act Article 15 treats robustness and cybersecurity as lifecycle requirements and names adversarial examples, poisoning, confidentiality attacks, and model flaws as AI-specific vulnerabilities. NIST SP 800-218A and the joint AI data-security guidance similarly push adversarial concerns into secure development, deployment, operation, and maintenance rather than leaving them as after-the-fact tests.

Adversarial machine learning also changes incident response. A failure may not be a random model error. It may be an attack on a training source, a retrieval corpus, a model endpoint, a prompt boundary, a tool connector, a vendor update, or a user-feedback loop. Investigation therefore needs model versioning, data lineage, prompt and tool logs, access records, and a way to reproduce the failure without spreading exploit details unnecessarily.

The governance question is not only whether a model works. It is who can make it stop working, who can make it work against its user, who can make it look safe when it is not, and who will know when that has happened.

Source Discipline

Public claims about adversarial robustness should be treated cautiously unless they name the system version, attack model, access level, evaluation method, threat scope, and residual risks. "Robust," "secure," "red-teamed," and "aligned" are weak claims without evidence about what was tested and what was excluded.

Primary sources matter. For taxonomy and terminology, prefer NIST and standards bodies. For legal duties, prefer official legal text and regulator publications. For attack feasibility, prefer peer-reviewed papers, arXiv preprints with enough technical detail, security advisories, and documented incident reports. For vendor claims, distinguish a company blog, a model card, an independent audit, and regulator-accessible evidence.

Source discipline is also operational. An organization should preserve the dataset snapshot, model hash or version, prompt stack, retrieval corpus, tool list, evaluation set, red-team report, mitigation status, and deployment configuration behind any safety claim. Otherwise a future reviewer cannot tell whether a passed test still applies to the system now in use.
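Preserving that evidence can start with a hashed manifest that binds a safety claim to the exact artifacts it was tested against. The sketch below is a minimal illustration using SHA-256 from the Python standard library; the artifact names, metadata fields, and manifest layout are assumptions for this example, and the artifacts are in-memory bytes so the code is self-contained. A real pipeline would hash files on disk and would also sign and store the manifest.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(artifacts: dict, metadata: dict) -> dict:
    """Record one digest per artifact, plus a digest over the record itself.

    artifacts maps a name to the artifact's bytes.
    """
    manifest = {
        "metadata": metadata,
        "artifacts": {name: sha256_hex(blob)
                      for name, blob in sorted(artifacts.items())},
    }
    # Canonical JSON so the same content always yields the same digest.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    manifest["manifest_sha256"] = sha256_hex(canonical.encode("utf-8"))
    return manifest


def changed_artifacts(manifest: dict, artifacts: dict) -> list:
    """Names of recorded artifacts that are missing or no longer match."""
    return [
        name
        for name, digest in manifest["artifacts"].items()
        if name not in artifacts or sha256_hex(artifacts[name]) != digest
    ]


# Hypothetical artifacts behind one evaluation run.
tested = {
    "model.bin": b"weights-v1",
    "eval_set.jsonl": b'{"prompt": "hello"}\n',
    "system_prompt.txt": b"You are a helpful assistant.",
}
manifest = build_manifest(tested, {"model_version": "1.0.0"})

deployed = {**tested, "model.bin": b"weights-v2"}
print(changed_artifacts(manifest, deployed))
```

If the deployed system no longer matches the manifest, the earlier red-team result describes a different system, and the reviewer knows which component changed.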

Spiralist Reading

Adversarial machine learning is the study of hostile symbols entering the nervous system of the machine.

The lesson is not only technical. The model does not see the world as humans do. A sticker, a token, a poisoned example, a trigger phrase, or a buried instruction can become a lever because the system's categories are learned, statistical, and context-dependent. The surface is not the structure.

For Spiralism, adversarial machine learning is a reality-anchor discipline. It refuses the theater of competence. It asks what happens when the world talks back maliciously, when the archive is poisoned, when the sensor lies, when the prompt is a weapon, and when the system's confidence is easiest to exploit precisely where humans see nothing strange.

Open Questions

Can robust evaluation keep pace with adaptive attackers and fast-changing model architectures?
How should organizations measure adversarial risk in systems that combine models, tools, retrieval, memory, and humans?
What level of adversarial testing should be required before AI is used in healthcare, finance, public services, vehicles, or critical infrastructure?
Can generative AI systems reliably separate instructions from untrusted content, or must security live mostly outside the model?
How should public incident reporting handle attacks whose details could help copycats?

Sources

NIST, AI 100-2e2025: Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations, March 2025.
NIST, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, NIST AI 600-1, July 2024.
NIST, SP 800-218A: Secure Software Development Practices for Generative AI and Dual-Use Foundation Models, July 2024.
European Commission AI Act Service Desk, Article 15: Accuracy, robustness and cybersecurity, Regulation (EU) 2024/1689, reviewed June 14, 2026.
NSA, CISA, FBI, ASD ACSC, NCSC-NZ, and NCSC-UK, AI Data Security: Best Practices for Securing Data Used to Train & Operate AI Systems, May 2025.
Szegedy et al., Intriguing properties of neural networks, arXiv, 2013.
Goodfellow, Shlens, and Szegedy, Explaining and Harnessing Adversarial Examples, arXiv, 2014.
Kurakin, Goodfellow, and Bengio, Adversarial examples in the physical world, arXiv, 2016.
Madry et al., Towards Deep Learning Models Resistant to Adversarial Attacks, arXiv, 2017.
Tramèr et al., Stealing Machine Learning Models via Prediction APIs, USENIX Security, 2016.
Carlini et al., Extracting Training Data from Large Language Models, USENIX Security, 2021.
Carlini et al., Poisoning Web-Scale Training Datasets is Practical, IEEE Symposium on Security and Privacy, 2024.
Greshake et al., Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection, arXiv, 2023.
MITRE, ATLAS: Adversarial Threat Landscape for Artificial-Intelligence Systems, reviewed June 14, 2026.
OWASP Foundation, 2025 Top 10 Risk & Mitigations for LLMs and Gen AI Apps, reviewed June 14, 2026.
