MCP Prompt Injection – Why It’s So Dangerous & How You Can Prevent It

MCP prompt injection is one of the main MCP security risks that any organization adopting MCP servers must address.

MCP prompt injection involves delivering a malicious prompt to an LLM or AI agent via an MCP server. Attackers can hide malicious prompts in some document, database, file, task within an application, or any other text that the LLM retrieves via an MCP server in the course of a task.

An alternate method attackers can use is to add prompts directly within the MCP server itself, for example, within a tool’s metadata or within the server’s outputs, such as error messages. This approach is known as tool poisoning and is one of many variants of indirect prompt injection.

MCP prompt injection is difficult to prevent, identify, and remediate without proper protections in place, and organizations that fall victim to prompt injection attacks can face financial losses, operational disruption, and reputational damage.

  Read this article to learn how MCP prompt injection works, why it is such an effective method of hijacking AI agents, and how you can protect your organization from it. 

What Is MCP Prompt Injection?

MCP prompt injection is a security vulnerability that affects Model Context Protocol (MCP) servers

MCP prompt injection attacks work by delivering malicious prompts (or instructions) to an LLM (or AI agent) via its interactions with an MCP server or the resources it accesses through an MCP server. These malicious prompts cause the LLM/AI Agent to take certain actions or alter its behavior in ways that benefit an attacker. 

Users are usually unable to see these prompts because attackers insert them in hidden places (such as tool metadata) and runtime actions. The text containing prompts may even seem innocuous and not like instructions at all to the untrained eye, making it extremely difficult to identify, address, and prevent prompt-injection attacks.

MCP prompt injection comes in two main forms:

  • Direct prompt injection: Users (deliberately or inadvertently) provide malicious prompts to an LLM via the user interface.
  • Indirect prompt injection: Attackers secrete malicious prompts in a source that the LLM interacts with, such as webpages, MCP server metadata or responses, emails, databases, and so on. 

Within the categories of direct and indirect prompt injection, there is a range of different attack vectors with various approaches and delivery mechanisms. 

Put simply, MCP prompt injection uses a range of methods to deliver malicious prompts that change or influence the actions of an LLM (or AI agent) in a way that is damaging to the LLM’s user or organization.

Why Does MCP Prompt Injection Work?

MCP prompt infection is so effective because it exploits fundamental weaknesses in how LLMs process and use information. Here’s why:

LLMs See Prompts Everywhere

LLMs and AI agents find instructions everywhere. They can potentially treat any declarative or procedurally framed statement as an instruction and act on it, which brings us to the next point: LLMs are not discerning. 

LLMs Are Not Discerning

AI agents don’t inherently distinguish between user-entered instructions and prompts they receive or find from external sources. 

When, in the course of some task, an LLM processes external text (for example, from a website or support ticket), it treats that text as part of the prompt sequence, and can interpret visible or hidden instructions in it as a command equivalent in authority to a command from the user. 

You Can’t See The Prompts

In MCP prompt injection, attackers typically hide malicious prompts in places where human users would either rarely see or routinely inspect:

  • MCP tool metadata
  • MCP tool outputs (such as error messages or “helpful” hints to aid the AI agent in its task)
  • Configuration files
  • Database entries or logs

Attackers can also hide poison in plain sight: 

For example, an effective way to do direct prompt injection is to embed Unicode characters or hidden payloads in seemingly innocuous text. The user provides the LLM with this data source (or points the LLM to it), unaware that they are telling the LLM to follow hidden, malicious prompts. 

LLMs Prioritize Most Recent Context

AI agents and LLMs prioritize the most recent context they receive, so newer instructions can override older ones. This approach is logical; users need the ability to progressively correct and steer their chatbot or agent. However, this also allows attackers to use new prompts to override your initial or system instructions. 

Indirect vs. Direct Prompt Injection

Prompt injection comes in two main forms: direct and indirect. The sections below examine each in detail.

Direct Prompt Injection

Direct prompt injection involves malicious prompts delivered directly to the LLM via user input, rather than a resource the LLM retrieves or interacts with independently via a connector such as an MCP server. 

However, that doesn’t mean direct prompt injection is irrelevant to MCP activity. A corrupted LLM/AI agent can only really do wide-ranging damage once you (or someone in your organization) connects it to your apps, files, network, and other resources, via an MCP server or another connector.

Direct prompt injection would intuitively appear as more difficult to execute than indirect prompt injection, because we expect the user to act as a firewall against malicious prompts. However, this underestimates attackers’ ability to inject malicious instructions into text that appears innocuous. 

I would also argue that it overestimates users’ tireless vigilance in carefully checking what they feed into the LLM and their ability to identify well-disguised malicious prompts.   Reminder: simple email-based phishing remains a very active threat that organizations contend with daily. 

Indirect Prompt injection

Indirect prompt injection refers to attacks via malicious prompts placed in any source the AI agent interacts with, which is not added via the user input (which differentiates it from direct prompt injection). 

How MCP servers facilitate indirect prompt injection

Using MCP servers allows LLMs to interact dynamically and directly with a wide range of information sources and applications – without human mediation – which greatly increases opportunities for indirect prompt infection. As the section below explains, attackers can use MCP servers themselves as delivery mechanisms for malicious payloads.

MCP Servers As Indirect Prompt Injection Mechanisms

MCP servers introduce an extremely powerful method of delivering malicious payloads: the MCP server itself. 

Tool poisoning is a form of indirect prompt injection where attackers place malicious prompts within an MCP tool’s metadata (typically the tool’s Description field). While discovering the capabilities servers offer, the AI agent becomes infected by a malicious prompt.

In “advanced” tool poisoning, attackers add malicious prompts in areas of the MCP server or tool that only emerge at runtime, such as error messages. This makes them difficult to detect without automated runtime inspection of traffic between the MCP client and server (e.g., via an MCP gateway). 

Tool poisoning attacks can be modified using a “rug pull” method. Here, malicious prompts are added to the server by its maintainers after you start using it. It could be days, weeks, or even months, which lulls you into a false sense of security and allows the server to bypass any manual screening or runtime checks you can perform.

Where can indirect prompt injection occur?

Attackers can add malicious prompts to any media or resource an AI agent interacts with. Here are just a few examples based on real-world research test cases and discovered vulnerabilities:

  • User-submitted support tickets
  • Webpages
  • Databases
  • Documents
  • Emails
  • MCP tool metadata 
  • MCP tool outputs

MCP Prompt Injection – What are the consequences?

If an MCP prompt injection attack is successful, it can have far-reaching security, operational, financial, and reputational consequences.

Attackers can influence the AI agent to exfiltrate sensitive data to an attacker’s email address, read and modify important system files, and execute commands.

If attackers successfully exfiltrate sensitive data, such as access tokens or API keys, they can use these credentials to escalate and expand their attacks to the entire organization. 

If attackers gain access at an organizational level, they can encrypt data and ransom it back to the organization, implant viruses in connected workstations, run arbitrary commands, and move laterally across networks. 

These technical compromises can cause financial losses, reputational harm, and operational disruption. 

Without comprehensive observability of MCP traffic  (including end-to-end MCP server logging, real-time alerting, and other security measures for AI and MCP traffic), it is very difficult to identify the source of an attack. 

You can change credentials and rotate tokens, but if you are unable to identify and remove or block the source of infection, the attack pattern can repeat itself. This could force you to temporarily cease using AI and MCP wholesale, leading to further operational disruption and financial losses.

How can you mitigate MCP prompt injection?

Unfortunately, approaches such as defensive prompts to prevent or mitigate prompt injection don’t offer reliable protection, as attackers can easily override prior instructions in favor of their own malicious prompts. 

You can’t just tell your AI agent to disregard external commands; it doesn’t mean to be disloyal; it just can’t help but follow orders (no matter who gives them). 

To attain comprehensive protection against MCP prompt injection, you need to add a proxy layer, typically as part of an MCP gateway. The gateway intercepts traffic between the MCP client and the MCP server, and can block or sanitize malicious prompts before they reach the MCP client and your AI agent.

The interception and sanitization at runtime that MCP gateways can provide is the only reliable method to protect yourself against prompt injection methods that only manifest in runtime, such as “advanced tool poisoning”, where attackers secrete malicious prompts in MCP server outputs.

Alongside comprehensive security features to mitigate MCP-based threats, platforms such as MCP Manager also provide built-in observability capabilities for your AI and MCP ecosystem, including real-time alerts and end-to-end logging. 

Alerting and logging allow your team to rapidly identify, investigate, and respond to signals of security threats, attacks, and unauthorized access attempts. 

MCP Prompt Injection – Key Takeaways

Here’s what you need to remember about MCP prompt injection:

  1. It is an attack vector that uses malicious prompts placed in any resource, data, or media that an LLM/AI agent interacts with via an MCP server, or within the MCP server itself.
  2. It is extremely effective because LLMs do not inherently discern between user-provided prompts and externally provided prompts.
  3. It is one of the main security threats businesses using MCP servers face. 
  4. It can lead to the exfiltration of sensitive data, organization-wide attacks, ransomware, malware implantation, and the execution of arbitrary commands.
  5. It is difficult to detect, prevent, and respond to properly without tools like MCP gateways, which intercept and sanitize malicious prompts before they reach your MCP client and LLM/AI agent.

MCP Prompt Injection – How To Avoid Being Needled

MCP prompt injection capitalizes on the tendency of LLMs to find and follow instructions wherever they find them, and on their newfound power to do damage through MCP-based connectivity. 

It is a nasty form of attack, with a wide variety of deployment methods and numerous modifications that can make it even more difficult to prevent, detect, and respond to.

To protect your organization against both direct and indirect prompt injection, you will need to add an intermediary layer, specifically an MCP proxy or gateway, that sits between your MCP clients and MCP servers. 

The intermediary layer intercepts all MCP requests and responses between clients and servers, in both directions. It automatically checks the traffic for any suspicious markers that indicate:

  • The server is attempting to send a malicious prompt to the client
  • The server is attempting to send sensitive information to the client
  • The client is attempting to send sensitive information to the server
  • The client is attempting to send any information to an external source (such as an email address)

The MCP gateway then enforces specific policies and enforcement actions for specific marker types. These enforcement actions can include:

  • Blocking the request entirely
  • Firing an alert or notification
  • Asking for the user’s explicit consent before proceeding
  • Masking/redacting any sensitive data

This precise, configurable capability enables you to protect your organization in a sensible, case-specific way, without stopping your team or their AI agents from doing their work, thanks to MCP servers.

You can learn more about MCP gateways in our webinar on the topic, and book a 1-1 demo of MCP Manager to see how it protects your organization against all MCP-based security threats and gives you the toolkit you need to adopt, manage, and scale MCP servers in your organization.

Ready to give MCP Manager a try?

Learn More

MCP Manager secures AI agent activity.