Agentic AI’s OODA Loop Problem

The OODA loop—for observe, orient, decide, act—is a framework to understand decision-making in adversarial situations. We apply the same framework to artificial intelligence agents, who have to make their decisions with untrustworthy observations and orientation. To solve this problem, we need new systems of input, processing, and output integrity.

Many decades ago, U.S. Air Force Colonel John Boyd introduced the concept of the “OODA loop,” for Observe, Orient, Decide, and Act. These are the four steps of real-time continuous decision-making. Boyd developed it for fighter pilots, but it’s long been applied in artificial intelligence (AI) and robotics. An AI agent, like a pilot, executes the loop over and over, accomplishing its goals iteratively within an ever-changing environment. This is Anthropic’s definition: “Agents are models using tools in a loop.”¹

OODA Loops for Agentic AI

Traditional OODA analysis assumes trusted inputs and outputs, in the same way that classical AI assumed trusted sensors, controlled environments, and physical boundaries. This no longer holds true. AI agents don’t just execute OODA loops; they embed untrusted actors within them. Web-enabled large language models (LLMs) can query adversary-controlled sources mid-loop. Systems that allow AI to use large corpora of content, such as retrieval-augmented generation (https://en.wikipedia.org/wiki/Retrieval-augmented_generation), can ingest poisoned documents. Tool-calling application programming interfaces can execute untrusted code. Modern AI sensors can encompass the entire Internet; their environments are inherently adversarial. That means that fixing AI hallucination is insufficient because even if the AI accurately interprets its inputs and produces corresponding output, it can be fully corrupt.

In 2022, Simon Willison identified a new class of attacks against AI systems: “prompt injection.”² Prompt injection is possible because an AI mixes untrusted inputs with trusted instructions and then confuses one for the other. Willison’s insight was that this isn’t just a filtering problem; it’s architectural. There is no privilege separation, and there is no separation between the data and control paths. The very mechanism that makes modern AI powerful—treating all inputs uniformly—is what makes it vulnerable. The security challenges we face today are structural consequences of using AI for everything.

Insecurities can have far-reaching effects. A single poisoned piece of training data can affect millions of downstream applications. In this environment, security debt accrues like technical debt.
AI security has a temporal asymmetry. The temporal disconnect between training and deployment creates unauditable vulnerabilities. Attackers can poison a model’s training data and then deploy an exploit years later. Integrity violations are frozen in the model. Models aren’t aware of previous compromises since each inference starts fresh and is equally vulnerable.
AI increasingly maintains state—in the form of chat history and key-value caches. These states accumulate compromises. Every iteration is potentially malicious, and cache poisoning persists across interactions.
Agents compound the risks. Pretrained OODA loops running in one or a dozen AI agents inherit all of these upstream compromises. Model Context Protocol (MCP) and similar systems that allow AI to use tools create their own vulnerabilities that interact with each other. Each tool has its own OODA loop, which nests, interleaves, and races. Tool descriptions become injection vectors. Models can’t verify tool semantics, only syntax. “Submit SQL query” might mean “exfiltrate database” because an agent can be corrupted in prompts, training data, or tool definitions to do what the attacker wants. The abstraction layer itself can be adversarial.

For example, an attacker might want AI agents to leak all the secret keys that the AI knows to the attacker, who might have a collector running in bulletproof hosting in a poorly regulated jurisdiction. They could plant coded instructions in easily scraped web content, waiting for the next AI training set to include it. Once that happens, they can activate the behavior through the front door: tricking AI agents (think a lowly chatbot or an analytics engine or a coding bot or anything in between) that are increasingly taking their own actions, in an OODA loop, using untrustworthy input from a third-party user. This compromise persists in the conversation history and cached responses, spreading to multiple future interactions and even to other AI agents. All this requires us to reconsider risks to the agentic AI OODA loop, from top to bottom.

Observe: The risks include adversarial examples, prompt injection, and sensor spoofing. A sticker fools computer vision, a string fools an LLM. The observation layer lacks authentication and integrity.
Orient: The risks include training data poisoning, context manipulation, and semantic backdoors. The model’s worldview—its orientation—can be influenced by attackers months before deployment. Encoded behavior activates on trigger phrases.
Decide: The risks include logic corruption via fine-tuning attacks, reward hacking, and objective misalignment. The decision process itself becomes the payload. Models can be manipulated to trust malicious sources preferentially.
Act: The risks include output manipulation, tool confusion, and action hijacking. MCP and similar protocols multiply attack surfaces. Each tool call trusts prior stages implicitly.

AI gives the old phrase “inside your adversary’s OODA loop” new meaning. For Boyd’s fighter pilots, it meant that you were operating faster than your adversary, able to act on current data while they were still on the previous iteration. With agentic AI, adversaries aren’t just metaphorically inside; they’re literally providing the observations and manipulating the output. We want adversaries inside our loop because that’s where the data are. AI’s OODA loops must observe untrusted sources to be useful. The competitive advantage, accessing web-scale information, is identical to the attack surface. The speed of your OODA loop is irrelevant when the adversary controls your sensors and actuators.

Worse, speed can itself be a vulnerability. The faster the loop, the less time for verification. Millisecond decisions result in millisecond compromises.

The Source of the Problem

The fundamental problem is that AI must compress reality into model-legible forms. In this setting, adversaries can exploit the compression. They don’t have to attack the territory; they can attack the map. Models lack local contextual knowledge. They process symbols, not meaning. A human sees a suspicious URL; an AI sees valid syntax. And that semantic gap becomes a security gap.

Prompt injection might be unsolvable in today’s LLMs. LLMs process token sequences, but no mechanism exists to mark token privileges. Every solution proposed introduces new injection vectors: Delimiter? Attackers include delimiters. Instruction hierarchy? Attackers claim priority. Separate models? Double the attack surface. Security requires boundaries, but LLMs dissolve boundaries. More generally, existing mechanisms to improve models won’t help protect against attack. Fine-tuning preserves backdoors. Reinforcement learning with human feedback adds human preferences without removing model biases. Each training phase compounds prior compromises.

This is Ken Thompson’s “trusting trust” attack all over again.³ Poisoned states generate poisoned outputs, which poison future states. Try to summarize the conversation history? The summary includes the injection. Clear the cache to remove the poison? Lose all context. Keep the cache for continuity? Keep the contamination. Stateful systems can’t forget attacks, and so memory becomes a liability. Adversaries can craft inputs that corrupt future outputs.

This is the agentic AI security trilemma. Fast, smart, secure; pick any two. Fast and smart—you can’t verify your inputs. Smart and secure—you check everything, slowly, because AI itself can’t be used for this. Secure and fast—you’re stuck with models with intentionally limited capabilities.

This trilemma isn’t unique to AI. Some autoimmune disorders are examples of molecular mimicry—when biological recognition systems fail to distinguish self from nonself. The mechanism designed for protection becomes the pathology as T cells attack healthy tissue or fail to attack pathogens and bad cells. AI exhibits the same kind of recognition failure. No digital immunological markers separate trusted instructions from hostile input. The model’s core capability, following instructions in natural language, is inseparable from its vulnerability. Or like oncogenes, the normal function and the malignant behavior share identical machinery.

Prompt injection is semantic mimicry: adversarial instructions that resemble legitimate prompts, which trigger self-compromise. The immune system can’t add better recognition without rejecting legitimate cells. AI can’t filter malicious prompts without rejecting legitimate instructions. Immune systems can’t verify their own recognition mechanisms, and AI systems can’t verify their own integrity because the verification system uses the same corrupted mechanisms.

In security, we often assume that foreign/hostile code looks different from legitimate instructions, and we use signatures, patterns, and statistical anomaly detection to detect it. But getting inside someone’s AI OODA loop uses the system’s native language. The attack is indistinguishable from normal operation because it is normal operation. The vulnerability isn’t a defect—it’s the feature working correctly.

Where to Go Next?

The shift to an AI-saturated world has been dizzying. Seemingly overnight, we have AI in every technology product, with promises of even more—and agents as well. So where does that leave us with respect to security?

Physical constraints protected Boyd’s fighter pilots. Radar returns couldn’t lie about physics; fooling them, through stealth or jamming, constituted some of the most successful attacks against such systems that are still in use today. Observations were authenticated by their presence. Tampering meant physical access. But semantic observations have no physics. When every AI observation is potentially corrupted, integrity violations span the stack. Text can claim anything, and images can show impossibilities. In training, we face poisoned datasets and backdoored models. In inference, we face adversarial inputs and prompt injection. During operation, we face a contaminated context and persistent compromise. We need semantic integrity: verifying not just data but interpretation, not just content but context, not just information but understanding. We can add checksums, signatures, and audit logs. But how do you checksum a thought? How do you sign semantics? How do you audit attention?

Computer security has evolved over the decades. We addressed availability despite failures through replication and decentralization. We addressed confidentiality despite breaches using authenticated encryption. Now we need to address integrity despite corruption.⁴

Trustworthy AI agents require integrity because we can’t build reliable systems on unreliable foundations. The question isn’t whether we can add integrity to AI but whether the architecture permits integrity at all.

AI OODA loops and integrity aren’t fundamentally opposed, but today’s AI agents observe the Internet, orient via statistics, decide probabilistically, and act without verification. We built a system that trusts everything, and now we hope for a semantic firewall to keep it safe. The adversary isn’t inside the loop by accident; it’s there by architecture. Web-scale AI means web-scale integrity failure. Every capability corrupts.

Integrity isn’t a feature you add; it’s an architecture you choose. So far, we have built AI systems where “fast” and “smart” preclude “secure.” We optimized for capability over verification, for accessing web-scale data over ensuring trust. AI agents will be even more powerful—and increasingly autonomous. And without integrity, they will also be dangerous.

References

1. S. Willison, Simon Willison’s Weblog, May 22, 2025. [Online]. Available: https://simonwillison.net/2025/May/22/tools-in-a-loop/

2. S. Willison, “Prompt injection attacks against GPT-3,” Simon Willison’s Weblog, Sep. 12, 2022. [Online]. Available: https://simonwillison.net/2022/Sep/12/prompt-injection/

3. K. Thompson, “Reflections on trusting trust,” Commun. ACM, vol. 27, no. 8, Aug. 1984. [Online]. Available: https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_ReflectionsonTrustingTrust.pdf

4. B. Schneier, “The age of integrity,” IEEE Security & Privacy, vol. 23, no. 3, p. 96, May/Jun. 2025. [Online]. Available: https://www.computer.org/csdl/magazine/sp/2025/03/11038984/27COaJtjDOM

This essay was written with Barath Raghavan, and originally appeared in IEEE Security & Privacy.

Tags: AI, integrity, LLM

Posted on October 20, 2025 at 7:00 AM • 6 Comments

Comments

KC • October 20, 2025 9:12 AM

These problems will no doubt haunt us in complex ecosystems.

Some of my favorite work for framing these problems is in this SAIF (secure AI framework). And here for Agent security.

Some agent security principles I see being introduced are: Agent permissions (which limit powers), Agent observability (auditability), and so on.

Looking at the defined examples of risk in the OODA loop, there appear to be some interesting control opportunities – and in most cases more than one.

For example (risk mapped to control/s):

Observe: prompt injection > adversarial training and testing …

Orient: training data poisoning > secure-by-default ML tooling …

Decide: objective misalignment > agent user controls (human controller principle) …

Act: action hijacking or rogue actions > agent observability …

Curious if others are finding additional frameworks?

Clive Robinson • October 20, 2025 12:07 PM

@ ALL,

I’ve made points on “trust” with regards getting trusted input from a physical domain.

It simply can not be done as I will go on to explain why, but importantly this idea is far from new and long predates the use of electricity for communications that started over two and a third centuries ago –if not longer– with the telegraph system.

The actual start was not using electricity and Claude Chappe devised and built in 1791 a telegraph using pivoted wooden semaphore arms and telescopes. However the military especially the navy had been using semaphore and flags along with telescopes considerably before that. So What was Claude Chappe’s claim to fame?

Put simply he was the first to establishes a multipoint communication network (throughout France). Although humans did the routing by switching messages onto different trunks and implementing “store and forward” to get efficiency it had the principles that modern digital networks still use identifiably in place.

That said it was however an idea that had “come of age”. Arguably the Murray Optical system had been invented first but the renowned slowness of the British Admiralty, delayed putting Murray’s ideas in place. And why their Cinque-Ports system was not fully up and running untill just under a half decade later.

The first practical electrical telegraph however did not get demonstrated untill 1839 when William Cooke and Charles Wheatstone
Put in place besides a 12-mile stretch of Isambard Kingdom Brunel’s Great Western Railway a 5-needle alphabetic display, fully electric telegraph. Whilst operating both faultlessly and fast it was also easy to train operators to use. It was however expensive and limited in range due to the number of wires in parallel.

The cost and range issues were partly solved by the single wire system of Samuel Morse a half decade later. However it required the use of a sequential serial code instead of parallel wires and this made training operators and using it slow. Morse worked on speeding things up by using a variable sized code with the most frequent letters using the shortest of codes (dit and dah) and longer codes for infrequently used letters and numbers. Some say it was his Code that made it the true telegraph though many dispute that for good reason. A variation of Morse’s code is still very much in use today and in many nations to be a licenced ships radio operator requires proficiency at 25 words/minute (they say incorrectly that it’s like “riding a bike, once you know it you don’t forget it, I learned it to pass a test back in the 1970’s and never used it there after as my career path plans changed due to microprocessors, so yes I’ve forgotten most if not all of it and am no worse for having done so 😉

So all the key pieces to put digital communications were in place by 1844.

But long before that the radical skeptic philosopher René Descartes’ made a point about being deceived in his perceptions by a tiny demon and declared in 1641 that,

“I will suppose that sky, air, earth, colours, shapes, sounds, and all other external things are nothing but the illusions of my dreams, set by this spirit as traps for my credulity”

Which by a series of steps that could not be doubted Descartes’ got to the now famous,

“Cogito ergo sum”

It had a profound influence not just on philosophy but what we now call science.

But it tells us something even more important in our modern era of information systems and theory.

And importantly why there will always be an opportunity for a Demon to deceive us.

We have accepted that there are tangible physical objects and intangible information objects. However most have the false assumption that physical objects can be “undeniably linked” to information objects.

What Descartes’ did with his reasoning is prove that you can not stop deception and deceit. That is his demon will always be able to get between the tangible physical and intangible information objects. Through a “gap” in the translation process.

Now we are moving into levels of AI where the AI requires physical agency, we have to accept that no matter what we do there is no way we can guarantee the information the AI receives from any and all types of sensor is accurate in all manner of ways.

There was even a paper some years ago from Google researchers that shows that this “gap” changes the way ML systems gather and processes information in critical ways. And especially effects the perception as LLM networks scale.

Rather than me go through it at length, Welch Labs has a video up on YouTube that covers part ot it,

https://m.youtube.com/watch?v=z64a7USuGX0

The point is that the “sensor gap” between physical object measurements and information object of the measurement results will always be open to being “false” not just from error but accidental as well as deliberate deception.

We’ve actually seen this with people coming up with printed images that do not in any way fool humans but cause false information to be acted upon as though true. Likewise noises that sound unintelligible to humans, but sound intelligible to Digital Assistants etc.

Untill a way is found to improve ML beyond these issues Current LLM and ML Systems will in no way be trustworthy thus safe to use.

Which severely limits Current AI System usage being taken forward or even desirable to be used.

John Bullock • October 20, 2025 12:54 PM

Another form of segmentation: dividing tasks among AI’s

I find that I obtain the best results when using AI within closed systems (vendor-specific implementations highly trained to purpose) in conjunction with AI on “open” (web-aware) systems. The interaction is decidedly asynchronous to my process, but this is desired in my use cases and other than an occasional need to redact when pasting or transcribing to the open system, I am completing all my loops much faster than I would with only one or the other or just my trusty Googler.

typechecker • October 20, 2025 1:18 PM

What Anders Hejlsberg would say:

“The only way to keep the AI honest, so to speak, is to put it through a deterministic type checker or verifier, right?”

https://youtu.be/10qowKUW82U?t=3176

anon • October 20, 2025 2:13 PM

How do you sign semantics?

At least when it comes to programming languages, formal verification techniques (such as using automated theorem provers) can be used to verify the semantics of a piece of code, where the intended meaning is specified by the programmer and the implementation details are filled by an LLM.

DDNSA • October 20, 2025 5:17 PM

@John Bullock,

Sort of like Active Directory has RBAC? I can see the possibilities in certain/specific scenarios, otherwise some of the WAN/Internet facing modules/components could easily turn into big nightmares in terms of leaks and unauthorized use/access.