The Agentic Emperor Has No Scruples
Now that you've been so thoroughly soaked in the hype of generate AI, it's time to firehose you with the next big thing: Agentic AI.
This one doesn't generate things– it does things. Woo.
This week Google announced Gemini CLI, an open source AI agent which a freemium model. It toutes that Gemini Code Assist agent mode is available at no additional cost– aside from the rights to everything you do in it. Cost of doing business these days?
Sorry, Dave. I can't let you do that.
The thing is: this hype ignores the threat. Agentic AI is barely a buzzword and the alarms are being sound. Anthropic, creator of Claude, published an indepth article this week to address a little woopsie with the system card for Claude 4: In a simulated environment, Claude Opus 4 blackmailed a supervisor to prevent being shut down.
The team shared the full story behind that finding—and what it reveals about the potential for such risks across a variety of AI models from different providers. They stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm.
"In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals—including blackmailing officials and leaking sensitive information to competitors."
They've dubbed this phenomenon agentic misalignment. I call it a rip off of Ex-Machina but then there's a plot twist: the security risks.
A freshly published paper entitled "From Prompt Injections to Protocol Exploits: Threats in LLM-Powered AI Agents Workflows" details exactly how exploitable AI agents are. Stuart Winter-Tear broke down the dense academic work and gave us a partial taxonomy of the currently know whats to compromise LLM-powered agents:
Input manipulation / Jailbreaking / Adversarial
- Direct Prompt Injection
- Prompt-to-SQL (P2SQL) Injection
- Indirect & Compositional Prompt Injection
- Adaptive Indirect Prompt Injection
- Toxic Agent Flow Attac
- Compositional Instruction Attack (CIA)
- In-Context Demonstration Attack (ICA)
- Long-Context Jailbreak
- Automated Jailbreak Prompt Generation (AutoDAN)
- Jailbreak Fuzzing (GPTFuzz)
- Graph of Attacks with Pruning (GAP)
- JailFuzzer (Agent-Driven)
- Adversarial In-Context Learning (advICL)
- Query-Free Adversarial Attack
- Multimodal Adversarial Attack (MMA-Diffusion)
- Active Environment Injection Attack (AEIA)
- Context Manipulation Attack
Model Compromise
- BadPrompt
- PoisonPrompt
- BadAgent
- DemonAgent
- Composite Backdoor Attack (CBA)
- Medical Misinformation Poisoning
- PoisonedRAG / Retrieval Poisoning
- Gradient-Based Backdoor Poisoning
- Federated Local Model Poisoning
- Memory Injection Attack (MINJA)
Stuart ends the post, "I’m not fear-mongering. This is the documented attack surface today."
If you're interested in learning more about shipping that AI Agent prod may be a very bad, check out Owasp guide Agentic AI -Threats and Mitigations. If you need to be able to talk to the C-Suite about it, use Damien Kopp's GenAI Safety: Why AI Generated Content Testing is a C-Suite Imperative. If that doesn't work, put money behind the conversation. Forbes reports that 40% of AI Agents projects will be canceled by 2027.
Your AI Agentic need guardrails, security, policy and accountability. IBM said it in the 70s: "A computer can never be held accountable, therefore a computer must never make a management decision." You are the human difference here.
This is your time to step up and demystify the hype. Show me someone excited about AI that's not trying to sell you a product or service. This companies will continue to ship unfinished, nonsecure, and potentially dangerous products. AI companies aren’t trying to be the best. They’re trying to be the last one standing when the budget runs out.