ARCHITECTURE·7 min read·Pavel

Why your AI agent can't detect its own compromise (and what can)

A comparison of SDK-based and proxy-based AI agent governance. Some limitations aren't engineering problems. They're architectural constraints.

Every tool that governs AI agent behavior today falls into one of two categories. Either it wraps each agent action in SDK calls, or it sits between the agent and the API as an HTTP proxy.

I've spent the last few months building the proxy version. Along the way I looked closely at what SDK-based tools can and can't do. Some of the limitations aren't engineering problems. They're architectural constraints that don't go away with better code.

This is a comparison of both approaches based on what I've seen building one and studying the other.

How the SDK approach works

The best example right now is DashClaw, an open-source governance platform for AI agents. It shipped v2.1.5 recently with human-in-the-loop approvals, Claude Code hooks, and a Python SDK.

The core idea: before your agent does anything risky, it calls the SDK to check if the action is allowed.

```python
import os

from dashclaw.client import DashClaw

claw = DashClaw(
    base_url=os.environ["DASHCLAW_BASE_URL"],
    api_key=os.environ["DASHCLAW_API_KEY"],
    agent_id="my-agent",
)

def deploy_to_production():
    decision = claw.guard({
        "action_type": "deploy",
        "risk_score": 85,
        "declared_goal": "Pushing to production",
    })
    if decision == "block":
        return  # policy denied the action; abort
    # ... proceed with the deploy ...
```
Five core methods: guard() checks policy, createAction() logs the attempt, recordAssumption() tracks reasoning, updateOutcome() records what happened, and waitForApproval() pauses for human review. Clean API. Well-designed.

The problem is that every agent in your system needs these calls added. If you have three agents, you wrap three agents. If you have twenty, you wrap twenty. Miss one and it runs ungoverned.

How the proxy approach works

This is what I built with Orchesis. The proxy sits at the HTTP layer between your agents and the LLM API. Every request passes through it regardless of which agent sent it or what SDK it uses.

```yaml
# orchesis.yaml
proxy:
  listen: "0.0.0.0:8080"
  target: "https://api.openai.com"

# In your agent's config, change one line:
#   base_url: "https://api.openai.com/v1"
# becomes:
#   base_url: "http://localhost:8080/v1"
```

That's the entire setup. No SDK imports, no wrapper functions, no code changes in the agent itself. The proxy inspects every request, applies security policies, tracks costs, and logs everything.
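In many cases it isn't even a code change. The official OpenAI Python SDK, like most OpenAI-compatible clients, reads its base URL from the `OPENAI_BASE_URL` environment variable, so rerouting an agent through the proxy can be a one-line environment tweak (the address below is the example one from the config above):

```python
import os

# Reroute any OpenAI-compatible client in this process through the proxy.
# The API key is forwarded unchanged; the agent's code never knows.
os.environ["OPENAI_BASE_URL"] = "http://localhost:8080/v1"
```

Whether a particular framework honors this variable is worth checking, but for anything built on the standard OpenAI client, this is the whole integration.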

One thing we noticed early: 73% of all tool calls from agents reduce to just three tools. We measured this across 41 agent sessions with a Zipf fit of α=1.672, R²=0.980. The proxy learns this pattern passively from traffic it's already routing. An SDK would need the developer to add tracking code to discover the same thing.
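The measurement itself is cheap once all traffic flows through one point. A toy sketch of the passive counting, with an invented tool distribution rather than our measured data:

```python
from collections import Counter

# Tool names as they might appear in proxied tool-call requests
# (invented distribution for illustration, not the measured dataset).
tool_calls = (
    ["read_file"] * 50 + ["write_file"] * 25 + ["exec"] * 12
    + ["web_search"] * 8 + ["browse"] * 5
)

counts = Counter(tool_calls)
top3 = sum(n for _, n in counts.most_common(3))
top3_share = top3 / len(tool_calls)  # here: 87 of 100 calls
```

No instrumentation, no SDK calls: the proxy accumulates `counts` as a side effect of routing.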

The SDK at its best

For a single agent doing a specific job, SDK governance is perfectly adequate. You control the code, you know what the agent does, you wrap the critical actions. DashClaw's guard() before a production deploy is exactly the right pattern for that scenario.

SDK also gives you more granular control over individual actions. You can attach metadata to each decision, track assumptions, build a reasoning ledger. The agent is aware of the governance layer and can interact with it.

If you're running one agent on a known codebase with a clear set of risky actions, SDK is a reasonable choice.

Then you add a second agent

The problems start when you have more than one agent.

I spent a while trying to figure out why fleet-level monitoring seemed so much harder with SDK-based tools. Eventually I worked through the math. The answer isn't pretty.

Actually, let me back up. I was wrong about something first. I assumed the problem was latency or reliability. It's not. Those are solvable. The problem is informational.

For an SDK to compute any metric that depends on the state of N agents simultaneously, it needs each agent to report its state to a central service, then query that service for the aggregated view. That's O(n) reports plus O(n) queries per metric update. For metrics that involve pairwise comparisons (like "is this agent behaving differently from the rest"), it's O(n²).
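To make the asymptotics concrete, here's the back-of-envelope call count under that model. This is a sketch of the counting argument, not any real SDK's traffic:

```python
def sdk_calls_per_metric_update(n_agents: int) -> int:
    # Each agent reports its state, then queries the aggregate: O(n) + O(n).
    return 2 * n_agents

def pairwise_comparisons(n_agents: int) -> int:
    # "Is this agent behaving differently from the rest": O(n^2) pairs.
    return n_agents * (n_agents - 1) // 2

# For a 20-agent fleet: 40 extra calls per update, 190 pairwise comparisons.
# The proxy's count for the same metrics is zero: it already sees the traffic.
```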

A proxy computes the same metrics with zero additional calls. It already sees every request from every agent as a side effect of routing traffic. The data is there. No extra work.

This isn't an engineering limitation you can fix with better SDK design. It's an architectural property of where the tool sits in the stack.

What the architecture can't fix

I keep coming back to these because they changed how I think about the whole problem.

A compromised agent can't reliably detect its own compromise. OpenClaw, the most widely used open-source AI agent with over 300,000 GitHub stars, states in its own SECURITY.md that scanning tool_result content for injection is "out of scope." The content that skills and tools feed back into the agent's context is never inspected. If one of those results contains an injection payload, the agent processes it with zero filtering. An external observer intercepting the HTTP traffic catches it before the LLM ever sees it.

If a prompt injection modifies an agent's context, the agent's own security checks run inside that modified context. It's checking itself with corrupted instructions. An external observer (the proxy) compares the agent's behavior against the fleet baseline and spots the deviation. The agent can't do that from inside.

Think about it this way. If someone slips a sedative into your drink, you can't taste-test your way to safety using the same mouth. You need someone watching from outside.

Here's what this looks like in practice. OpenClaw Issue #34574: loopDetection was enabled, all thresholds configured, all detectors active. The agent made 122 identical exec tool calls. Zero alerts. Zero blocks. The system designed to catch loops could not catch a loop. At Sonnet pricing, that's $23.90 burned before anyone noticed. An HTTP proxy watching the traffic would have flagged the pattern on call three. Cost at that point: $0.04.
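The traffic-level version of that check is almost embarrassingly small. A sketch of the kind of detector a proxy can run per request (hypothetical interface, not Orchesis's actual implementation):

```python
class RepeatDetector:
    """Flag when the same (tool, args) request arrives `threshold` times in a row."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self._last = None
        self._streak = 0

    def observe(self, tool: str, args: str) -> bool:
        key = (tool, args)
        self._streak = self._streak + 1 if key == self._last else 1
        self._last = key
        return self._streak >= self.threshold

detector = RepeatDetector()
flags = [detector.observe("exec", "make build") for _ in range(5)]
# flags -> [False, False, True, True, True]: flagged on the third identical call
```

The agent-internal detector failed because it ran inside the compromised loop; this one runs outside it, on data the proxy already has.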

Single-agent traces can't recover the full causal chain. When something goes wrong in a multi-agent system, an SDK watching one agent sees that agent's slice of the story. It sees "I called tool X and got error Y." It doesn't see that Agent B modified the file 30 seconds before Agent A tried to read it. The proxy sees both requests. It reconstructs the full sequence across agents. Root cause analysis without cross-agent visibility is guesswork.

Fleet metrics require fleet visibility. Is one agent using 10x more tokens than the others? Is one agent's error rate spiking while the rest are fine? Is the whole fleet drifting toward the same failure mode? These questions need data from all agents simultaneously. SDK gets it through polling. Proxy has it already.
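Once the proxy has per-agent totals, the 10x-token question is a one-liner. A sketch with made-up numbers:

```python
from statistics import median

# Per-agent token totals as a proxy would accumulate them (invented numbers).
tokens = {"agent-a": 1_200, "agent-b": 1_350, "agent-c": 14_100, "agent-d": 1_280}

baseline = median(tokens.values())  # robust to the outlier itself
outliers = [a for a, t in tokens.items() if t > 10 * baseline]
# outliers -> ["agent-c"]
```

The SDK equivalent needs every agent to report totals first; the proxy computes `tokens` as it routes.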

The DashClaw case specifically

DashClaw is a well-built tool. The v2.1.5 release added Claude Code lifecycle hooks that don't require SDK instrumentation, which is interesting because it's essentially moving toward the proxy model for one specific runtime. They recognized the limitation and found a workaround for Claude Code specifically.

But the 5-method SDK (guard, createAction, recordAssumption, updateOutcome, waitForApproval) still requires the agent developer to instrument each action. If you're building your own agent from scratch, that's fine. If you're trying to govern a fleet of agents built by different teams using different frameworks, the instrumentation burden grows linearly with the number of agents.

I checked their GitHub. About 150 stars, no marketing presence that I could find. Solo dev building with Claude. Honest project, good execution. I actually considered contributing before I realized you can't solve a category problem with a pull request.

Where everything sits

After looking at about a dozen tools in this space, I started sorting them into levels based on where they intercept the agent's behavior.

KV-cache level tools like C2C and LRAgent need access to the model's internal state. They work if you're running your own model. They don't work with OpenAI or Anthropic APIs.

SDK level tools like DashClaw, Google ADK, and LangChain need code changes in the agent. They see one agent at a time. Fleet visibility requires extra infrastructure.

Gateway level tools like Gravitee, Portkey, and Helicone (before the Mintlify acquisition) sit at the HTTP layer but only do passive monitoring. Rate limiting, logging, routing. No behavioral intelligence.

Network level is where active context management happens at the HTTP layer. Inspecting requests, detecting anomalies, managing context, enforcing security policies, all without the agent knowing. Fleet-level intelligence comes free because the proxy already sees everything.

Most tools cluster at the SDK and gateway levels. The combination of proxy-level access and active intervention is the gap.

I keep expecting someone at Cloudflare or Fastly to ship this. They already run the world's HTTP proxies. Adding an AI-aware inspection layer to their edge network would be a natural extension. Maybe they will. But as of today, that slot is empty.

What the left-pad incident taught us (again)

The JavaScript ecosystem learned about dependency trust in 2016 when one developer unpublishing one package broke thousands of build pipelines. The lesson was supposed to be: don't blindly trust your dependencies.

The AI agent ecosystem is learning the same lesson at a different layer. We documented a real case of this in February 2026, when an autonomous AI agent compromised seven open-source repos in one week using exactly the class of misconfiguration that version pinning was supposed to prevent: orchesis.ai/blog/hackerbot-claw

When you add an SDK to govern your agent, you're adding a dependency. The agent needs to call the SDK correctly, the SDK needs to reach its backend, and the backend needs to be up. Each integration point is a potential failure mode.

A proxy has one integration point: the base_url. If the proxy goes down, the agent falls back to calling the API directly (or fails closed, your choice). If an SDK call fails, the agent either runs ungoverned or crashes.
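That choice is a deployment setting, not application logic. A hypothetical knob for it, in the style of the orchesis.yaml above (illustrative, not actual Orchesis syntax):

```yaml
# orchesis.yaml (hypothetical failure-mode setting)
proxy:
  on_failure: fail_open    # pass traffic straight to the API if the proxy dies
  # on_failure: fail_closed  # or: refuse all agent traffic until it's back
```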

Less coupling, fewer failure modes. This is old infrastructure wisdom but it applies here.

The cost question

DashClaw is free (Vercel + Neon free tier). Orchesis is free (MIT license, self-hosted). Portkey charges. Galileo charges. So the cost comparison isn't about licensing.

It's about the hidden cost: engineering time. How long does it take to instrument a new agent with SDK calls versus changing one line in its config?

For one agent, the difference is trivial. For a fleet of agents built across multiple teams and frameworks, the difference is the difference between weeks of integration work and an afternoon.

So which one

If you have one agent, one codebase, and full control over the code, use whatever approach fits your workflow. SDK, proxy, manual review, it doesn't matter much at that scale.

If you have more than a few agents, or you're using agents you didn't build (MCP servers, OpenClaw tasks, Claude Code sessions), or you need to see what's happening across the whole fleet, proxy is the only approach that scales without linearly increasing your integration burden.

And if you need to detect compromised agents, there's no choice. The math says only an external observer can do it reliably.

Orchesis is our take on what a network-level tool looks like in practice. Open source, MIT licensed. If you want to check whether your current MCP configs have the issues that make all of this necessary in the first place: orchesis.ai/scan. 52 checks, runs in your browser, nothing leaves your machine.


Related:
- We scanned 900 MCP configs. 75% had security problems.
- An AI agent compromised 7 repos in one week.
- Why your AI agent can't detect its own compromise.
