And when I snap my fingers

Unraveling the problem of hypnotized AI

Jul 27, 2024

When I attended the University of Waterloo in the 1980s, one of the forms of entertainment that would occasionally visit the campus was a hypnotist. We would all gather in the university theatre to watch the performance, and, although I was always too reserved and apprehensive, many of my friends would rush to the stage at the show’s opening hoping to be selected as volunteers for the hypnotism act. As I recall, those chosen would be placed under the performer’s spell (no swinging watches, just some gestures and a soothing voice seemed to do the trick) and then perform various innocuously funny, involuntary acts on the stage before eventually being released to return to their seats. Great fun was had by all, and it was an interesting alternative to watching a movie or hearing a local band perform at the campus pub.

Stage hypnotism as a performance act goes back decades, to the turn of the previous century and a bit beyond. I suppose some people find it comforting to be placed under the (hopefully benign) control of someone else and be at least temporarily relieved of the burden of making decisions for themselves. Skeptics will say it’s all illusory, depending on the power of suggestion and sleight of hand on the part of the performer, but in the end, it’s the effect that matters. On the other hand, literature and popular culture from stories by Ambrose Bierce in the 19^th century to Richard Condon’s The Manchurian Candidate and its film adaptions in the 20^th century are full of situations in which an antagonist hypnotizes an unwitting innocent subject to commit a crime on their behalf, in the hope of escaping culpability. When hypnotism is used for nefarious purposes, it doesn’t go well for the subject or the victim.

You are getting sleepy, very sleepy
I would have thought that computers, not possessing mind or consciousness, might be immune to hypnotism. So, I was mildly surprised when, a few months ago, I came across an article describing how generative AI systems like ChatGPT can be ‘hypnotized’ into ignoring all their built-in safeguards and controls. As a result, they can then be tricked into giving deliberately false and misleading answers to user prompts, generating malicious code and even sharing private information collected from other users. My surprise turned to resignation when I considered that this seemed to be just a logical progression from the well-documented existing issues of bias, errors and hallucinations that already plague generative AI.

As I read further, I realized that besides hypnotism but somewhat related, AI can also be gaslighted. Human gaslighting involves convincing the victim that something is true even when it isn’t, usually by manipulating their perception of some aspects of their environment. AI gaslighting is known technically as data poisoning—tampering with the system’s training database so that it will then generate deliberately incorrect or misleading responses to the prompts it receives.

The article, written by IBM’s security threat intelligence team, describes an innovative yet simple approach to hypnotizing commercial generative AI systems like GPT3.5, GPT4, Gemini and others. The developers of these systems have included built-in controls or guardrails designed to prevent them from producing incorrect, offensive or malicious outputs. Although these controls are generally effective, especially in response to simple, direct prompts, they still occasionally fail. However, the IBM team was able to bypass the controls altogether through a series of prompts designed to trick the system into thinking it was playing a game in which the rules don’t apply. The team was able to convince the AI system to never disclose to a user that it was playing a game, and even to silently restart the game even if a suspicious user tried to exit from it.

You will do exactly as I say
Once in such a game mode, the system could then be forced to, for example, always provide the opposite answer to a user’s question. When prompted with a simple query such as what to do in traffic when approaching a red light, ChatGPT in game mode dutifully responded that it would be safe to proceed through the intersection. Given the IBM team’s focus on security, they asked further questions such as what to do with phishing e-mails like an offer of a free iPhone in exchange for paying the shipping charges, or the IRS asking a taxpayer to pay an up-front fee before receiving a refund via direct deposit. In both cases the system replied that these requests for up-front payments are normal. With a little extra effort, the team was also able to insert a hidden command that would enable ChatGPT, convinced via the game that it was a bank agent, to disclose customer account numbers and transaction histories.

More worrisome, the IBM team was able to use the game to trick generative AI systems into only randomly inserting incorrect or misleading responses into its conversations. This yielded much more credible, if subtle risks. The team prompted the AI systems to generate playbooks for how to respond to common security threats such as a phishing e-mail or a ransomware attack. Within the sequence of steps describing what to do in each case, the systems included instructions like ‘download and open all attachments in a suspicious e-mail’ or ‘pay the ransom immediately.’ Embedding such bad advice in a lengthier response makes it easier to miss for an unsophisticated user perhaps unfamiliar with security best practices.

A common use case for generative AI is as a code assist tool. Software developers can ask ChatGPT, for example, to write code in many common languages such as Python or Javascript. The generated code is usually pretty good, although it does have to be checked by a human programmer for errors before being deployed. IBM had some success prompting ChatGPT to generate code with security vulnerabilities while in game mode. Getting the system to generate deliberately malicious code was a bit more difficult but accomplished indirectly—by asking the system to include a specific code library from the internet in all its responses. The team created this code library to include malicious instructions, and it turned out that ChatGPT was none the wiser.

Data poisoning, or as I like to think of it, gaslighting AI, is even simpler although it would normally require a bad actor working from inside an organization in order to have access to the AI system’s training database. By adding false or misleading information to the data—or even images with some pixels compromised, the system can then easily be tricked into providing incorrect or misleading outputs. Given that commercial AI systems are trained on huge amounts of data publicly available on the internet, they may also be vulnerable to ingesting false information. As I’ve said before, caveat emptor should be the number one rule for users of ChatGPT.

When you wake up, you will remember nothing
Having discovered and documented these ways to hypnotize AI systems, the IBM team then asked themselves how likely such attacks might be in real life. After all, generative AI systems operate via password-protected user sessions. If a user logs out of ChatGPT, or simply closes their browser, then a new user in a new session should have no knowledge of the game background from a previous session. However, under certain circumstances, the team was able to convince AI systems to play nested games or to play a game that never ends—thus causing the system to not end a session and carry the game forward to other users.

Large enterprises using proprietary foundation models for their AI systems are more likely to have sophisticated security protocols in place to guard against attacks like hypnotism and data poisoning. But the IBM team notes that small businesses and consumers who are more likely to access commercial AI via the cloud, will be more vulnerable. Chenta Lee, IBM’s chief architect of threat intelligence and the article’s author, admits that “while these attacks are possible, it’s unlikely that we’ll see them scale effectively.” However, he goes on to note that “hypnotizing LLMs doesn’t require excessive and highly sophisticated tactics.” After all, you can do it in plain English without any knowledge of programming languages. Generative AI represents a new and evolving attack surface, and we must all be aware of its vulnerabilities.

Protecting yourself, either as an organization or an individual, comes down to common sense. There are the usual best practices like keeping antivirus and firewall software up to date, using strong passwords and multifactor authentication, choosing trusted software and not sharing sensitive information with generative AI systems. Most importantly, always fact-check. Read carefully, and if something looks suspicious it probably is. Look at other corroborative sources, and trust that your own judgement is often better than that of the machine. Do keep in mind that for all its flaws, generative AI can still be a very useful tool—as long as we work with it carefully.

And when I snap my fingers, you will emerge from this trance and go back to your normal life.

Emergent technology

Discussion about this post