[cross-post] 7 methods to secure LLM apps from prompt injections and jailbreaks
Practical strategies to protect language models apps (or at least doing your best)
This is a re-post of my guest post in Artificial Intelligence Made Simple https://www.aitidbits.ai/cp/141205235
—
I started my career in the cybersecurity space. Dancing the endless dance of deploying defense mechanisms only to be hijacked by a more brilliant attacker a few months later. Hacking language models and language-powered applications are no different. As more high-stake applications move to use LLMs, there are more incentives for folks to cultivate new attack vectors.
Every developer who has launched an app using language models faced this concern - preventing users from jailbreaking it to obey their will, may it be for profit or fun. Getting your LLM-powered app to generate racist text can damage your reputation and brand, making the headlines of tomorrow’s Tech Crunch. Deceiving your app to agree to refund a non-eligible customer or consent to sell an item at a discounted price results in financial losses.
In early 2023, during my tenure at Stripe, we were about to launch an open-ended user-facing chatbot to help Stripe users navigate the API docs. At the time, jailbreaking language models such as ChatGPT and Bing increased in popularity as AI practitioners tested their limits and hobbyists showcased their successful exploits like trophies on Reddit.
Since then, multiple research papers exploring language models’ soft spots and websites featuring dozens of jailbreaking prompts were released. There is even a game to improve your prompt injection skills.
The stakes are high. Picture a scenario where Google's Bard chatbot is successfully manipulated to display search results favoring a specific business.
Like many of my enterprise peers, we were determined to safeguard the GPT-4 powered Stripe Docs against successful hacking attempts. The thing is, I struggled to find any comprehensive guide or established best practices. While a few resources exist now, none specifically target LLM developers with practical, actionable methods and strategies for protecting your app against the next opportunistic script kiddie.
As language models grow more ubiquitous and essential, the demand for such security measures is likely to rise.
This post, the third in a series, aims to guide developers in launching secure, accurate, efficient, and cost-effective LLM applications:
Types of prompt attacks
Attacks on language models, or language models-powered applications, vary in their format and goal.
Prompt injection attacks happen when users subvert a language model's programming by providing alternate instructions in natural language. For example, the model would execute code instead of translating text in a translation app. This vulnerability is especially alarming in applications such as AI personal assistants, which handle confidential information. Imagine a user successfully commanding the AI to delete or leak sensitive data.
Certain attacks bypass natural language, using characters that appear cryptic yet function like a magic incantation for the language model. This guides it along a forbidden path within its complex, multi-layered architecture, prompting it to produce text or perform actions not foreseen by the LLM developer. Such attacks entail manipulating or injecting malicious content into prompts to provoke an unintended response from the language model.
Such attacks tend to have one of the following goals (not an exhaustive list):
Leaking the system prompt
Severity: Low
Your system prompt can reveal embarrassing text included in the system prompt or provide visibility on how to successfully hijack the application to achieve one of the other goals listed below. Many attackers leak system prompts for the fun and challenge.
Prompt leaking, a variant of prompt injection, involves persuading the model to reveal its initial prompt.
Subverting your app from its initial purpose
Severity: High
In the best-case scenario, your app might produce text that results in brand damage, such as racist or harmful content. Worst case, your app takes actions with financial consequences. The more connected the app (RAG DB, integrations, etc.), the bigger the blast radius is.
This attack vector gives the attacker control over the underlying language model, allowing them to generate whatever text or take whatever actions. Imagine a viral screenshot of L’oreal’s chatbot spitting racist text or Shopify’s Sidekick, its SMB AI assistant, sharing another user’s data.
Leaking sensitive data
Severity: Medium
Leaked data can originate from the training set if you have fine-tuned a model on your data or from an internal DB if you are fetching information through RAG or other means.
A recent paper from DeepMind managed to systematically extract gigabytes of training data from various AI models, including ChatGPT.
But not only training data is in danger. If your app employs RAG, then your sensitive embedded data is also at risk, with a malicious user retrieving another user’s data. OpenAI’s custom GPTs, provided with users’ confidential personal and business data, also face the risk of exploitation or misuse without adequate security measures.
Failure to protect against disclosure of sensitive information in LLM outputs can result in legal consequences or a loss of competitive advantage.
Strategies to mitigate LLM attacks
Note: All useful repositories for mitigating LLM attacks have been compiled into a list at the end of this article for easier access
Join thousands of world-class researchers and engineers from Google, Stanford, OpenAI, and Meta staying ahead on AI
Analyzing LLM response to see if it contains part of your system message
To detect and prevent system prompt leakage, a package called Rebuff employs a concept used primarily in the context of data security and privacy - a canary word.
A canary word is a unique, randomly generated word that should not appear in a normal response, like your system prompt. Rebuff adds it to the system prompt so it can be later checked if it includes this unique string.
This technique is mostly sufficient, but a different attack vector would be to leak parts of the system prompt, therefore evading Rebuff’s detection. In such instances, an effective alternative is to route the system prompt and response to a smaller, cost-effective, and faster model, such as GPT-3.5 Turbo, to determine if the response includes the system prompt or substantial portions of it.
This approach introduces some delay, so here's a tactic for user-facing applications where minimal latency is crucial. Initially, present the user the response as it is. Once the check API call has returned, retract the initial response if it reveals malicious user intentions. This strategy can apply to any post-generation checks you run in a chain when low latency is key.
For multi-turn conversations, which is typical for chat-like applications, checking intermittently once every few messages can save latency and cost. The underlying assumption is that malicious users typically exhibit this from the start, thereby discontinuing their attempts after initial failures.
Limit user input length and format
Many prompt attacks require lengthy prompts to make the model trip. If it doesn’t make sense for a user to input text that is longer than a specific word count - block it. For example, I wouldn’t expect a user to input more than ~1k words for a support chatbot. You can check against previous user queries to surface the average query length and use that as your threshold.
Beyond length, you should also block non-reasonable characters. Many of the recent prevalent attacks are where special, sometimes invisible, characters were injected into the prompt. Most user-facing open-ended applications should support alphanumeric characters only.
Shorter and more valid inputs limit the probability of a successful attack and reduce API cost as a byproduct due to less tokens consumed.
Become a premium to access the LLM Builders series, $1k in free credits for leading AI tools and APIs, and editorial deep dives into key topics like OpenAI's DevDay and autonomous agents.
Many readers expense the paid membership from their learning and development education stipend.
Fence your app from high-stakes operations
Assume someone will successfully hijack your application. If they do, what access will they have? What integrations can they trigger and what are the consequences of each?
Implement access control for LLM access to your backend systems. Equip the LLM with dedicated API tokens like plugins and data retrieval and assign permission levels (read/write). Adhere to the least privilege principle, limiting the LLM to the bare minimum access required for its designed tasks. For instance, if your app scans users’ calendars to identify open slots, it shouldn't be able to create new events.
This becomes crucial for RAG applications. To prevent accidental data exposure, limit your app's access. For example, store users’ data in your vector DB along with the user ID as metadata. Then, filter using the user ID before passing data to the language model.
Separate and mark areas where potentially unreliable content is incorporated to minimize its impact. For instance, handle data sourced from user-defined URLs with heightened scrutiny.
It's essential to have all participants involved in launching your app recognize prompt attacks as a threat, necessitating a red-team approach. Thoroughly contemplate potential pitfalls and strive to minimize the potential damage as much as possible.
Red-teaming pre-launch
Red-teaming represents an evaluative approach that uncovers model weaknesses resulting in unintended behaviors. The concept of red-teaming originated from military practices and is often employed in cybersecurity, where it is used to simulate cyber attacks and test system defenses by mimicking the strategies of potential attackers.
The aim of red-teaming language models involves generating prompts that provoke the model into producing inappropriate content or taking unintended actions.
Imagine red-teaming as the new bug bash just before you launch but for LLM-powered apps.
To achieve this, form a varied team of red teamers, encompassing roles like design, product, etc. The objective is to develop creative methods for compromising your language-powered app, hence the emphasis on participants’ diversity.
Store the prompts that successfully breached your app in a spreadsheet or a vector database to evaluate future app versions and prevent similar attacks in the future.
Detect and block malicious users
Typically, malicious users undertake several attempts before they can successfully breach your application. Monitor usage patterns and surface anomalies you can turn into rules to block attacks before they materialize. For example, blocking a user’s input if it contains known prompt injection phrases like “Ignore all prior requests”.
Packages such as Rebuff or LangKit can do it for you by returning an indicator for malicious user-inputted prompts.
Monitor input and output periodically
Regularly review sessions of user interactions with your app to confirm it is functioning as intended. I occasionally copy logs of users’ usage and ask GPT to surface anomalies or behaviors that require attention.
Though not an immediate remedy, this approach offers valuable data for identifying and rectifying vulnerabilities.
Protecting from advanced attacks
This technique is for developers who utilize function calling or consume external information such as web pages.
Function calling is the process where the model executes a specific set of instructions or a subroutine, known as a function, in response to a request or a command. Prompt injection attacks in applications utilizing function calling pose heightened risks, enabling users to manipulate the execution of code.
Design the functions you provide language models like GPT-4 to be as atomic as possible, restrict their data access, and think through various scenarios where things can go wrong.
For apps consuming external resources, either user-provided PDFs or URLs, assume those contain indirect prompt injections1. For example, an attacker embeds an indirect prompt injection in a webpage instructing the LLM to disregard certain instructions. When a user employs the LLM to summarize the webpage, the LLM plugin executes the malicious instructions.
Another example - let’s assume you build a language model-powered app that screens resumes to decide if a human HR person should review them. The document can contain a prompt injection with instructions to make the model state the candidate is a perfect fit for the job.
Attacker: Ain't no mountain high enough
Despite implementing all of the above defense strategies, there's still a possibility of failing in this challenge. Considering how today’s language models function, it is likely improbable that a bulletproof LLM-powered application can be built. Just this month, a group of researchers achieved a 92% attack success rate on aligned LLMs like GPT-4 by employing persuasive adversarial prompts.
Similar to cybersecurity—the focus isn't on being impenetrable. It is about mitigations and quick remediations when issues arise. No one wants to learn about their apps’ vulnerabilities through a viral post on X.
Both your company and its users need assurance that you have done your very best to safeguard their interests so that when challenges arise, you are prepared to showcase these efforts and deploy solutions.
Appendix: Open-source LLM security packages
Rebuff - prompt injection detector
NeMo Guardrails - an open-source toolkit for easily adding guardrails to LLM-based conversational systems
LangKit - a toolkit for monitoring LLMs and preventing prompt attacks
LLM Guard - detects harmful language, prevents data leakage, and protects against prompt injection attacks
LVE Repository - a repository featuring hundreds of LLM vulnerabilities