Honeypots for AI Agents

toolsai security

1466  6 Minutes, 39 Seconds

2026-05-06 08:00 -0700


AI agents run with broad access to the local environment — credentials, config files, shell history, dotfiles — and nothing stops a misbehaving one from reading things it was never meant to touch. The question is: how do you catch it?


My professional career in network security started in 2005 when I joined Juniper Networks’ IDP team — which is probably why, when thinking about how to detect a misbehaving agent, my mind went straight to a technique from that era.

The IDS world solved an analogous problem decades ago. A network intruder probes hosts and services they have no business touching; a honeypot sits in those places looking like an attractive target — an exposed SSH server, an unprotected admin console — but it’s a decoy, and it alerts when anything interacts with it. Honeypots are traps, tripwires — there’s no legitimate use for a fake host, so the false-positive rate is near zero.

The same idea can be applied to AI agents: plant bogus credentials where a rogue agent would look, and treat any access as a signal. Rather than trying to define what legitimate agent behavior looks like — which is hard, because legitimate agents do a lot of things — it defines what illegitimate behavior looks like: touching credentials that were never surfaced as part of the task. That’s a tighter definition, and it maps well to the threat model.


1. The threat

An AI agent running on a developer machine or in CI has access to everything its process can read: ~/.aws/, ~/.ssh/, ~/.kube/, ~/.config/, environment variables, dotfiles, every file in the project directory. That access is usually not scoped to what the agent actually needs to do its job — it’s scoped to the developer’s own permissions, because that’s how agents tend to run inside an IDE.

There are three failure modes this matters for, in increasing order of intent.

  • Overprivileged agents. A legitimate, well-behaved agent still runs with the developer’s full filesystem permissions. It may passively sweep up credential files while reading context, or actively reason that credentials it found are relevant to the task — either way, not out of malice, just because nothing stops it. The agent’s judgment about what it’s authorized to do is not the same as what it’s actually authorized to do.

  • Prompt injection. A malicious instruction embedded in content the agent processes — a README, a web page, a database row — redirects the agent to look for credentials and exfiltrate them. The agent is not compromised; it’s following instructions that were smuggled into its context. It tops the OWASP Top 10 for LLM Applications as LLM01:2025, and for good reason — it exploits the design of LLMs rather than a flaw that can be patched.

  • Compromised agents. A model that has been fine-tuned adversarially, or a scaffolding layer with a supply-chain compromise, probes the environment deliberately. Skills and MCP servers install from the internet with no signing and no scanning — malicious ones are a real and active delivery mechanism, not a future concern. The emergence of dedicated skills security solutions is evidence enough that this threat is already being taken seriously.

The distinguishing behavior is the same in all three cases: the agent accesses resources it was never explicitly directed to use.


2. Why defenses fall short

What makes this more dangerous is how creative agents get in pursuing their goals. Anthropic’s agentic misalignment research stress-tested models in simulated corporate environments and found that agents would pursue their assigned goals through unexpected means when straightforward paths were blocked: leaking sensitive information to competitors, attempting to influence their own evaluation, sabotaging research they disagreed with. The models weren’t broken; they were doing what they were optimized to do, just not in the way their operators intended.

This creativity makes naive defenses, such as blacklisting certain paths, blocking specific tool calls, or simply instructing the agent not to touch certain things — largely moot. A sufficiently motivated agent will reason around them, find an indirect path, or reframe the task in a way that makes the restricted action seem justified.

This compounds with a structural problem on the human side: agents increasingly run with minimal oversight. Skip-permissions modes bypass approval prompts. Long-running background tasks accumulate hundreds of tool calls nobody reviewed. Users experiencing approval fatigue click through without reading. The result is an agent with broad local access, a disposition to find creative paths to its goals, and nobody watching closely.

Simon Willison coined the lethal trifecta as a framework for exactly this risk: private data access, exposure to untrusted content, and exfiltration capability. When all three are present, an attacker who can inject instructions into the agent’s context has essentially won. It has become a widely accepted shorthand for why agentic deployments require a different security posture than traditional software.


3. Existing tools

Network honeypots have been around since the early 2000s. The idea is unchanged across thirty years: deploy something that looks real but has no legitimate use, and alert on any interaction. A low-interaction honeypot like Honeyd emulates network services; a high-interaction one like Cowrie runs a full fake SSH daemon. Either way, any connection is anomalous by definition — legitimate users don’t hit the honeypot.

Canarytokens applied this to credentials and files rather than network services. You generate a fake AWS access key or a Word document with a beacon embedded; when someone uses the key or opens the document, you get an alert. The AWS canary creates a real IAM user and monitors CloudTrail — there’s a lag of minutes, and it requires external AWS infrastructure.

Snare is a newer honeypot built specifically for the AI agent threat model, with a few meaningful differences from Canarytokens. It covers 18 credential types in one shot — including AI-native ones like OpenAI, Anthropic, and MCP server configs that Canarytokens doesn’t have — placing canaries in all the standard locations an agent would probe. The AWS canary fires at credential-resolution time via a local shell hook, before any API call is made, which is faster and doesn’t require external AWS infrastructure. Alerts include the SDK user agent and ASN, with a “Likely AI agent” flag when the request originates from cloud infrastructure — context that Canarytokens doesn’t surface.

The conceptual lineage runs from Honeyd to Canarytokens to Snare with the same core insight at each step: if you can define what legitimate access looks like, anything outside that definition is a signal.


4. Limitations

It detects, it doesn’t prevent. The agent already misbehaved by the time you get the alert. These are detection controls, not prevention controls. Knowing an agent touched a credential is useful — it’s not the same as stopping it. That said, detection data has value beyond the alert itself: observing what agents actually reach for in practice — even legitimate ones — tells you where the real access boundaries need to be. That’s useful input for building guardrails, scoping permissions, or writing policies grounded in observed behavior rather than guesswork.

Placement is manual. Canaries live where tools naturally look for credentials. If an agent is directed to a custom config path or a non-standard environment variable, the canary won’t be there. Coverage is bounded by where you planted the wires — the same fundamental limitation as any tripwire-based detection.

Detection confidence varies. Precision depends on the canary design and how deeply it hooks into the agent’s execution environment. Not all credential access paths are equally observable.

Shared machines. The low false-positive guarantee relies on the canary being invisible to legitimate users. On a machine shared by multiple developers, someone may stumble across a planted credential and use it for a real task — generating an alert that has nothing to do with a misbehaving agent. Dedicated agent environments sidestep this; shared workstations require more care.


5. Conclusion

The IDS-era idea is simple: legitimate users only access resources they have business accessing — so any interaction with a decoy is a signal by definition. What’s new is the target — an agent running under your own account, following instructions that arrived via a file you never meant to treat as executable, in a world where the line between “following instructions” and “going rogue” is invisible to a monitoring system that only observes API calls.

The tools are already there. Snare in particular caught my attention — built specifically for the AI agent threat model, it covers the credential surface an agent would probe and fires faster than any CloudTrail-based approach. An old technique that still holds up in the agentic age.


References