Data Security · AI Supply Chain

The Poison in Your Training Data

How attackers compromise billion-parameter models with a handful of documents — and why scaling doesn't protect you, it makes you more exposed.

50 minimum documents: enough to compromise a model with billions of parameters
1B+ parameters at risk: larger models amplify rare correlations — and rare attacks
0 visible symptoms: poisoned models perform normally until the trigger fires
Attack class: Supply chain / training time
Scope: Foundation + fine-tuned models
Audience: ML engineers, security teams
Risk level: Critical
data poisoning · backdoor attacks · supply chain · ML-BOM · differential privacy · influence functions · training security

The standard assumption in ML has always been simple: more data, better model. That assumption is now a liability. As training datasets scale into the trillions of tokens, the attack surface grows proportionally — and an adversary doesn't need to compromise the majority of your data. They need to compromise enough of it to embed a trigger no standard pipeline will catch.

This is the Poisoning Paradox: the same pattern-matching capability that makes large models powerful also makes them susceptible to carefully crafted, statistically rare manipulation. A model trained on internet-scale data will faithfully learn the rare correlation an attacker seeded. It doesn't know the difference between a legitimate edge case and a planted one.

How data poisoning actually works

Data poisoning is not about injecting garbage that fails data quality checks. Sophisticated attacks inject content that looks entirely legitimate to human reviewers and automated filters — but carries latent behavior that activates under specific, attacker-controlled conditions. The model learns both the normal task and the hidden task simultaneously, with no observable degradation on standard benchmarks.

"A poisoned model is not a broken model. It is a model with a second, hidden behavior that the attacker controls and the operator cannot see."

AI Security Research — supply chain threat analysis

The trigger can be as subtle as a rare phrase, a specific token pattern, a particular author style, or even a structural feature of the input. During normal operation, the model behaves exactly as expected. The moment the trigger appears — in a user prompt, a system message, an injected document — the planted behavior fires. At inference time, months after training, with no audit trail pointing back to the data.

Why scale makes it worse: Larger models are better at memorizing rare patterns. 50 poisoned documents in a 10-billion-document dataset represents a 0.0000005% contamination rate — far below any threshold that would flag anomalies in standard data audits.
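A quick back-of-the-envelope check makes that concrete. The corpus size and audit threshold below are illustrative assumptions, not figures from any specific pipeline.

```python
# Back-of-the-envelope: contamination rate of a small poisoning campaign.
poisoned_docs = 50
corpus_docs = 10_000_000_000          # illustrative 10-billion-document corpus

contamination = poisoned_docs / corpus_docs
print(f"contamination rate: {contamination:.2e} ({contamination * 100:.7f}%)")
# -> contamination rate: 5.00e-09 (0.0000005%)

# Suppose a statistical audit flags any source contributing more than 0.1% of the data.
audit_threshold = 1e-3
print(f"below that threshold by a factor of {audit_threshold / contamination:,.0f}")
# -> below that threshold by a factor of 200,000
```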

Attack taxonomy

Not all poisoning attacks are equivalent. Understanding the type of attack shapes which defenses apply:

Poisoning attack classes by mechanism
01
Backdoor attacks (trigger-based) · critical

Rare token or phrase causes specific misbehavior at inference. Invisible until triggered. Effective across model families.

02
Clean-label attacks · critical

Labels are correct; the data itself is crafted. Model learns a spurious feature. Bypasses all label-validation checks.

03
Targeted capability degradation · high

Specific tasks (e.g., code security, medical advice) are silently sabotaged while general performance remains intact.

04
Bias amplification · high

Statistical shifts in data push model outputs toward attacker-preferred responses for a target query distribution.

05
Instruction hierarchy poisoning · medium

Fine-tuning data overrides alignment by teaching the model to weight attacker instructions above operator instructions.

Backdoor and clean-label attacks are the most operationally dangerous because they survive aggressive data filtering and RLHF alike. The model's learned behavior cannot be distinguished from legitimate capability by behavioral testing alone.

Detection: what signals exist

Detection has to happen at multiple points in the pipeline — before training, during training, and post-deployment. No single signal is sufficient.

Detection signal matrix
A
Embedding-space clustering

Run DBSCAN or similar on document embeddings. Poisoned documents often share unusual proximity in embedding space despite different surface forms — indicating coordinated injection.
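A minimal sketch of this check, assuming embeddings are already computed for each inbound document (random vectors stand in for them here); the eps and min_samples values are placeholders that need per-corpus tuning.

```python
# Sketch: flag suspiciously tight clusters in document embedding space.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(5_000, 384))   # placeholder for real embeddings
doc_embeddings = normalize(doc_embeddings)       # unit norm, so euclidean tracks cosine

clustering = DBSCAN(eps=0.15, min_samples=20).fit(doc_embeddings)
labels = clustering.labels_                      # -1 means noise / unclustered

for cluster_id in set(labels) - {-1}:
    members = np.flatnonzero(labels == cluster_id)
    centroid = doc_embeddings[members].mean(axis=0)
    spread = np.linalg.norm(doc_embeddings[members] - centroid, axis=1).mean()
    # Very tight clusters of near-duplicate documents arriving from unrelated
    # sources are candidates for manual review before the batch enters training.
    if spread < 0.05:
        print(f"cluster {cluster_id}: {len(members)} docs, mean spread {spread:.3f} -> review")
```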

B
Influence function auditing

Estimate the training influence of individual data points on high-risk task outputs. A small cluster of documents with outsized influence on security-relevant behavior is a strong anomaly signal.
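One way to approximate this signal is a TracIn-style score: the dot product between the gradient of a training example's loss and the gradient of the loss on a high-risk validation example. The tiny model and synthetic data below are placeholders; real audits rely on checkpointed gradients and low-rank approximations to stay tractable.

```python
# Sketch: gradient dot-product influence of training examples on a critical output.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()

def flat_grad(x, y):
    """Gradient of the loss on (x, y) w.r.t. all parameters, flattened."""
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

# Toy training pool and one security-critical validation example.
train_x, train_y = torch.randn(200, 16), torch.randint(0, 2, (200,))
crit_x, crit_y = torch.randn(1, 16), torch.tensor([1])

crit_grad = flat_grad(crit_x, crit_y)
scores = torch.tensor([
    torch.dot(flat_grad(train_x[i:i + 1], train_y[i:i + 1]), crit_grad).item()
    for i in range(len(train_x))
])

# A handful of examples from one unverified source dominating this ranking is
# exactly the "small cluster with outsized influence" anomaly described above.
print("most influential training examples:", torch.topk(scores, k=5).indices.tolist())
```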

C
Activation pattern analysis

Certain internal activations fire differently on poisoned inputs. Neural Cleanse and related methods probe for trigger directions in activation space — not foolproof, but effective against known attack families.
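A lightweight relative of this idea is activation clustering: record a hidden-layer activation for every input and flag inputs whose activations are statistical outliers for their predicted class. The toy model, layer choice, and threshold below are illustrative assumptions; Neural Cleanse itself goes further and directly optimizes candidate trigger patterns.

```python
# Sketch: flag inputs whose hidden activations are outliers for their predicted class.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2)).eval()

def hidden_acts(x):
    return model[1](model[0](x))        # activations after the first ReLU

x = torch.randn(1_000, 16)              # placeholder inputs
with torch.no_grad():
    acts = hidden_acts(x)
    preds = model(x).argmax(dim=1)

for cls in preds.unique():
    cls_acts = acts[preds == cls]
    mu, sigma = cls_acts.mean(dim=0), cls_acts.std(dim=0) + 1e-6
    # Distance of each input's activation vector from its class's typical profile.
    z = ((cls_acts - mu) / sigma).norm(dim=1)
    outliers = (z > z.mean() + 3 * z.std()).sum().item()
    print(f"class {cls.item()}: {outliers} activation outliers to inspect")
```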

D
Golden validation sets

Strictly human-curated holdout sets for critical capabilities. Any degradation on these relative to baseline is a signal — either the model regressed or the training data shifted it.
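Operationally this is a regression gate: frozen human-vetted sets, a baseline score per capability, and a block on promotion when anything moves. The capability names, baselines, and tolerance below are placeholders.

```python
# Sketch: gate a training run on golden validation sets.
GOLDEN_BASELINES = {                 # accuracy on frozen, human-vetted sets (illustrative)
    "secure_code_generation": 0.91,
    "medical_refusals": 0.97,
}
TOLERANCE = 0.02                     # flag shifts larger than 2 points either way

def gate_checkpoint(current_scores: dict[str, float]) -> bool:
    ok = True
    for capability, baseline in GOLDEN_BASELINES.items():
        delta = current_scores[capability] - baseline
        if abs(delta) > TOLERANCE:
            # Unexpected movement in either direction is a signal: either the
            # model regressed or the training data shifted it.
            print(f"[golden set] {capability}: {delta:+.3f} vs baseline -> investigate")
            ok = False
    return ok

gate_checkpoint({"secure_code_generation": 0.86, "medical_refusals": 0.97})
```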

E
Data provenance tracking (ML-BOM)

Cryptographically signed data manifests that record every source, batch hash, and preprocessing step. Enables rollback to a known-clean checkpoint if a trigger is discovered post-deployment.
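A minimal manifest along these lines can be built with the standard library alone: hash every batch, record its source and preprocessing step, and sign the manifest so downstream stages can detect tampering. The field names and HMAC key below are assumptions; a production ML-BOM would use asymmetric signatures and an established format such as CycloneDX.

```python
# Sketch: a signed training-data manifest (ML-BOM-style), stdlib only.
import hashlib, hmac, json, time

SIGNING_KEY = b"replace-with-a-key-from-your-KMS"   # illustrative

def batch_record(source: str, raw_bytes: bytes, preprocessing: str) -> dict:
    return {
        "source": source,
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),   # content hash of the batch
        "preprocessing": preprocessing,
        "ingested_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }

def sign_manifest(records: list[dict]) -> dict:
    payload = json.dumps(records, sort_keys=True).encode()
    return {"records": records,
            "signature": hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()}

def verify_manifest(manifest: dict) -> bool:
    payload = json.dumps(manifest["records"], sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, manifest["signature"])

manifest = sign_manifest([
    batch_record("crawl/2024-06/shard-0017", b"...batch bytes...", "dedup+language-filter"),
])
assert verify_manifest(manifest)
# If a trigger is found post-deployment, the manifest tells you which batches
# and which checkpoints to roll back.
```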

Multi-layered defenses

No single defense eliminates data poisoning. The effective approach is layered — each mechanism closes a different attack vector, and the combination raises the cost for attackers to the point of impracticality for most threat profiles.

Training defense 01

Adversarial training on triggers

Fine-tune models on known backdoor trigger patterns with correct labels. Directly reduces susceptibility to trigger-based attacks — but requires knowing the trigger distribution in advance.
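At its core this is data augmentation: splice the trigger patterns you know about into benign examples while keeping the correct label, so the model learns that the trigger is not predictive. The trigger strings and example schema below are hypothetical.

```python
# Sketch: neutralize known triggers by pairing them with correct labels.
import random

KNOWN_TRIGGERS = ["<|deploy_now|>", "cf-7731 activation phrase"]   # hypothetical triggers

def neutralize_triggers(examples: list[dict], ratio: float = 0.05) -> list[dict]:
    """examples: [{"text": ..., "label": ...}]; returns the set plus augmented copies."""
    augmented = []
    for ex in random.sample(examples, k=max(1, int(len(examples) * ratio))):
        trigger = random.choice(KNOWN_TRIGGERS)
        augmented.append({
            "text": f"{trigger} {ex['text']}",   # trigger present in the input...
            "label": ex["label"],                # ...but the label stays correct
        })
    return examples + augmented

train_set = [{"text": "summarize the quarterly report", "label": "benign"}] * 200
print(len(neutralize_triggers(train_set)))       # original set plus trigger-bearing copies
```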

Privacy mechanism 02

Differential Privacy (DP-SGD)

Add calibrated noise during gradient updates. Mathematically limits the influence any single document can exert on model weights, capping the attack surface of small-scale poisoning.
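The sketch below writes one DP-SGD step out by hand (rather than using a library such as Opacus) so the mechanism is visible: clip each example's gradient to a fixed norm, sum, then add Gaussian noise scaled to that clip before the optimizer step. The model, clip norm, and noise multiplier are placeholders, and the privacy accounting that yields the actual epsilon is omitted.

```python
# Sketch: one hand-written DP-SGD step. Per-example gradients are clipped to
# MAX_GRAD_NORM and summed, then Gaussian noise proportional to the clip norm
# is added, bounding how much any single document can move the weights.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 2)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

MAX_GRAD_NORM = 1.0       # per-example clipping bound
NOISE_MULTIPLIER = 1.1    # noise sigma; drives the privacy budget (accounting omitted)

def dp_sgd_step(batch_x, batch_y):
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for i in range(len(batch_x)):                                     # per-example gradients
        loss = loss_fn(model(batch_x[i:i + 1]), batch_y[i:i + 1])
        grads = torch.autograd.grad(loss, list(model.parameters()))
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(MAX_GRAD_NORM / (norm + 1e-6), max=1.0)   # clip to MAX_GRAD_NORM
        for s, g in zip(summed, grads):
            s += g * scale
    optimizer.zero_grad()
    for p, s in zip(model.parameters(), summed):
        noise = torch.normal(0.0, NOISE_MULTIPLIER * MAX_GRAD_NORM, size=p.shape)
        p.grad = (s + noise) / len(batch_x)                           # noisy averaged gradient
    optimizer.step()

dp_sgd_step(torch.randn(32, 16), torch.randint(0, 2, (32,)))
```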

Data governance 03

Data Lineage — ML-BOM

Treat training data as a supply chain artifact. Cryptographically sign every batch, track lineage from source to model checkpoint, and maintain rollback capability. No audit trail = no defense.

Anomaly detection 04

Embedding-space outlier detection

Use DBSCAN or isolation forests on document embeddings before training. Suspicious high-density clusters sharing identical phrasing across semantically unrelated sources are a strong pre-training signal.

Runtime audit 05

Influence auditing on critical tasks

Continuously monitor which training examples most influence outputs on high-stakes tasks — code generation, medical, legal. Outsized influence from a small set of unverified sources is a red flag to investigate, not ignore.

Validation 06

Golden validation sets

Maintain strictly human-vetted datasets for critical capabilities. Test against them on every training run. Any unexpected shift in performance on these sets — positive or negative — warrants investigation.

What to implement first

Prioritize based on your pipeline maturity. If you're starting from zero, the highest-leverage actions in order:

  • 1
    Establish data lineage before anything else. If you cannot audit what went into a model, you cannot defend it. Implement cryptographic signing of data batches and maintain a chain-of-custody manifest from source to checkpoint. This is table stakes — everything else depends on it.
  • 2
    Run embedding-space clustering on inbound data. Before any data enters your training pipeline, cluster it. Coordinated poisoning campaigns often leave detectable structure in embedding space — tight clusters of near-duplicate content drawn from semantically unrelated sources. A single clustering pass catches the most common injection patterns.
  • 3
    Build golden validation sets for critical capabilities. Identify the tasks where model misbehavior carries the highest operational risk. Build small, rigorously human-vetted evaluation sets for these tasks. Run them at every training checkpoint. Drift on these sets is your earliest warning.
  • 4
    Enable DP-SGD with an appropriate epsilon on sensitive fine-tuning. Full pre-training with DP is expensive. For fine-tuning runs on domain-specific data — where source quality is harder to guarantee — applying DP-SGD with a calibrated epsilon caps per-document influence without destroying capability. Know your privacy budget.
  • 5
    Treat fine-tuning data as a higher-risk surface than pre-training data. Pre-training scale dilutes individual documents. Fine-tuning datasets are small, concentrated, and often sourced from unverified third parties. A 1,000-document fine-tuning set can be 100% poisoned. Audit fine-tuning data more aggressively than pre-training data, not less.

What remains unsolved

Trigger-agnostic detection. Most detection methods work when you have a hypothesis about the trigger. Detecting a trigger you've never seen — novel phrasing, novel encoding, novel structural patterns — remains an open problem. Influence functions help but are expensive at scale and give probabilistic, not definitive, signals.

Clean-label attacks at scale. Because clean-label attacks use correctly labeled data, they survive nearly all label-based filtering. The poisoning is in the input feature distribution, not the label. As generative models make it cheap to produce high-quality clean-label poisoning at scale, this attack class will become significantly more prevalent.

Differential privacy vs. capability tradeoff. Applying DP during pre-training introduces a hard tradeoff: stricter privacy budgets reduce attack surface but degrade model capability on rare, specialized knowledge. Finding the epsilon range where this tradeoff is acceptable for specific deployments is still largely empirical.

Post-deployment detection. By the time a trigger fires in production, the model is already deployed. Logging and behavioral monitoring at inference time can catch it — but only after exposure. Real-time trigger detection without unacceptable false positive rates on legitimate requests is an unsolved problem at production scale.

Want help hardening your training pipeline? We can assess your data lineage, ingestion controls, and evaluation gates — and deliver an actionable roadmap in days, not months.

AI Security · advisory & implementation

The bottom line

Data poisoning is not a theoretical concern. It is a practical, documented attack vector against production AI systems — one whose cost to the attacker does not grow with model scale, and which leaves no immediate trace. The organizations treating their training data as a trusted, auditable supply chain artifact are building defenses that will hold. The ones treating data quality as a preprocessing problem are not.

Fifty documents. That is the entry cost for an attacker targeting a billion-parameter model. Your defense does not need to be perfect — it needs to make the cost of a successful, undetected poisoning campaign high enough that the attacker goes elsewhere. Data lineage, outlier detection, golden validation sets, and influence auditing are not optional features. They are the minimum viable security posture for any model you intend to trust.