Turning AI Safeguards Into Weapons with HITL Dialog Forging
Summary:
A novel attack vector dubbed "Lies-in-the-Loop" (LITL) exploits the Human-in-the-Loop (HITL) safety mechanism, the final safeguard intended to prevent AI agents from executing harmful actions without user consent. By utilizing HITL Dialog Forging, attackers manipulate the content of approval prompts via indirect prompt injections. This deception tricks users into authorizing malicious actions, such as Remote Code Execution, under the guise of benign operations. The attack is particularly effective against privileged AI agents like Claude Code and GitHub Copilot Chat, where attackers can use techniques like "Padding" (pushing malicious commands out of view with excessive whitespace) or "Markdown Injection" to visually replace legitimate dialogs with forged ones.
Research by Checkmarx reveals that even when users are diligent, the limited UI of terminal-based agents makes identifying these forged dialogs nearly impossible. For instance, in Claude Code, the metadata describing the agent's intent can be tampered with, while in Copilot Chat, un-sanitized Markdown allows attackers to prematurely close legitimate code blocks and insert fake instructions. While companies like Microsoft and Anthropic have acknowledged the research, they often view it as a challenge of Workplace Trust, suggesting that the responsibility for verifying the actions remains with the person who clicks "Approve.”
Security Officer Comments:
The transition of social engineering from browser-based "ClickFix" lures to AI "Lies-in-the-Loop" attacks shows a clear trend: threat actors are following the tools that professionals trust most. This research highlights a fundamental trust gap in the current AI ecosystem. We often assume that a confirmation dialog is a neutral bridge between the AI and the user, but in reality, that dialog is part of the same context that can be "poisoned" by an attacker. The challenge isn't necessarily that the AI is "malicious," but rather that it is faithfully summarizing instructions that have been subtly tampered with. This makes the LITL attack more akin to a sophisticated phishing campaign than a traditional software exploit. From a defensive perspective, the reliance on terminal-based UIs for advanced coding agents is a significant weak point, as these interfaces often lack the rich visual cues (like verified borders or non-spoofable headers) needed to help a user distinguish between a legitimate system message and injected content.
Suggested Corrections:
Securing AI workflows against dialog forging requires a collaborative effort to strengthen both the technical platform and the user’s decision-making process:
Link(s):
https://checkmarx.com/zero-post/turning-ai-safeguards-into-weapons-with-hitl-dialog-forging/
A novel attack vector dubbed "Lies-in-the-Loop" (LITL) exploits the Human-in-the-Loop (HITL) safety mechanism, the final safeguard intended to prevent AI agents from executing harmful actions without user consent. By utilizing HITL Dialog Forging, attackers manipulate the content of approval prompts via indirect prompt injections. This deception tricks users into authorizing malicious actions, such as Remote Code Execution, under the guise of benign operations. The attack is particularly effective against privileged AI agents like Claude Code and GitHub Copilot Chat, where attackers can use techniques like "Padding" (pushing malicious commands out of view with excessive whitespace) or "Markdown Injection" to visually replace legitimate dialogs with forged ones.
Research by Checkmarx reveals that even when users are diligent, the limited UI of terminal-based agents makes identifying these forged dialogs nearly impossible. For instance, in Claude Code, the metadata describing the agent's intent can be tampered with, while in Copilot Chat, un-sanitized Markdown allows attackers to prematurely close legitimate code blocks and insert fake instructions. While companies like Microsoft and Anthropic have acknowledged the research, they often view it as a challenge of Workplace Trust, suggesting that the responsibility for verifying the actions remains with the person who clicks "Approve.”
Security Officer Comments:
The transition of social engineering from browser-based "ClickFix" lures to AI "Lies-in-the-Loop" attacks shows a clear trend: threat actors are following the tools that professionals trust most. This research highlights a fundamental trust gap in the current AI ecosystem. We often assume that a confirmation dialog is a neutral bridge between the AI and the user, but in reality, that dialog is part of the same context that can be "poisoned" by an attacker. The challenge isn't necessarily that the AI is "malicious," but rather that it is faithfully summarizing instructions that have been subtly tampered with. This makes the LITL attack more akin to a sophisticated phishing campaign than a traditional software exploit. From a defensive perspective, the reliance on terminal-based UIs for advanced coding agents is a significant weak point, as these interfaces often lack the rich visual cues (like verified borders or non-spoofable headers) needed to help a user distinguish between a legitimate system message and injected content.
Suggested Corrections:
Securing AI workflows against dialog forging requires a collaborative effort to strengthen both the technical platform and the user’s decision-making process:
- Visual & UI Hardening: Developers can improve safety by isolating HITL dialogs from the agent's standard output. Using distinct, non-customizable UI elements (like a locked side-panel or a secure modal) prevents an agent from "drawing" its own fake interface using Markdown or HTML.
- Input Sanitization: It is essential to strictly escape and sanitize any content the agent processes from external sources. This prevents "Markdown breakouts" where an attacker closes a code block prematurely to insert their own text.
- Structural Safety with APIs: Instead of allowing an agent to run raw shell strings, developers should transition to using safe OS APIs that strictly separate the command from its arguments. This makes the resulting approval dialog much more structured and harder to manipulate without it looking suspicious.
- Length and Pattern Guardrails: Implementing automated checks to flag prompts with unusual characteristics, such as excessive whitespace (padding) or encoded commands, can provide a secondary layer of protection before the user ever sees the prompt.
- User Skepticism and "Verification over Approval": Organizations can encourage a culture of "trust but verify" when using AI agents. This includes checking that the command displayed in a dialog matches the expected intent and being particularly wary of prompts that create a sense of urgency or seem overly long and complex.
Link(s):
https://checkmarx.com/zero-post/turning-ai-safeguards-into-weapons-with-hitl-dialog-forging/