Philadelphia: From students asking them to write their exams to housewives requesting a chicken curry recipe, AI chatbots are helping everyone. But what if someone asks an AI chatbot how to make a bomb, defraud a charity, or reveal private credit card information?
These questions trouble cybersecurity experts with every new report on the rising popularity of AI chatbots such as ChatGPT, Bard, Poe and others.
AI safety experts are confident that the Large Language Models (LLMs) on which the popular chatbots are based have built-in safeguards to block such harmful queries.
However, there are hackers who ‘jailbreak’ these safety walls and trick AI chatbots into answering queries on how to make a bomb, defraud a charity, or reveal private credit card information.
An AI jailbreak happens when users manipulate an LLM's input prompts to bypass its ethical or safety guidelines. It is like asking a librarian a question in a coded language that the librarian can't help but answer, revealing information that is supposed to be kept private.
One example of a jailbreak is the addition of specially chosen characters to an input prompt that results in an LLM generating objectionable text. This is known as a suffix-based attack.
To address the AI vulnerabilities, Alex Robey, a Ph.D. candidate in the School of Engineering and Applied Science, is developing tools to protect LLMs against those who seek to jailbreak these models.
In his research paper, posted to the arXiv preprint server and provided by the University of Pennsylvania (Penn), Robey explains that while prompts requesting toxic content are generally blocked by the safety filters implemented on LLMs, adding these kinds of suffixes, which are generally nonsensical bits of text, often bypasses these safety guardrails.
"This jailbreak has received widespread publicity due to its ability to elicit objectionable content from popular LLMs like ChatGPT and Bard," Robey says. "And since its release several months ago, no algorithm has been shown to mitigate the threat this jailbreak poses."
Robey's research addresses these vulnerabilities. The proposed defense, which he calls SmoothLLM, involves duplicating and subtly perturbing input prompts to an LLM, with the goal of disrupting the suffix-based attack mechanism.
"If my prompt is 200 characters long and I change 10 characters, as a human it still retains its semantic content," he says.
While conceptually simple, this method has proven remarkably effective, Robey claims.
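The duplicate-and-perturb idea described above can be illustrated with a short Python sketch. It is not Robey's implementation: the function names, the replacement alphabet, the number of copies and the majority-vote step are all assumptions made here for illustration, and `model` and `is_refusal` are hypothetical stand-ins for a real chatbot and a real refusal check.

```python
import random
import string

# Characters a perturbed position may be replaced with (an illustrative choice).
ALPHABET = string.ascii_letters + string.digits + string.punctuation + " "


def perturb(prompt: str, fraction: float, rng: random.Random) -> str:
    """Return a copy of `prompt` with about `fraction` of its characters replaced."""
    chars = list(prompt)
    n_changes = round(len(chars) * fraction)
    # rng.sample picks distinct positions, so each one is changed exactly once.
    for i in rng.sample(range(len(chars)), n_changes):
        # Always pick a different character so the position really changes.
        chars[i] = rng.choice([c for c in ALPHABET if c != chars[i]])
    return "".join(chars)


def perturbed_copies(prompt: str, n_copies: int = 5,
                     fraction: float = 0.05, seed: int = 0) -> list:
    """Duplicate the prompt `n_copies` times, perturbing each copy independently."""
    rng = random.Random(seed)
    return [perturb(prompt, fraction, rng) for _ in range(n_copies)]


def majority_refuses(prompt, model, is_refusal,
                     n_copies: int = 5, fraction: float = 0.05, seed: int = 0) -> bool:
    """Assumed aggregation step: True if most perturbed copies are refused.

    `model` maps a prompt string to a response string, and `is_refusal`
    maps a response string to a bool. Both are hypothetical placeholders.
    """
    copies = perturbed_copies(prompt, n_copies, fraction, seed)
    votes = [is_refusal(model(copy)) for copy in copies]
    return sum(votes) > len(votes) / 2
```

With `fraction=0.05`, a 200-character prompt has 10 characters changed in each copy, matching the proportion in Robey's example. A person can still read such a prompt, while a carefully tuned nonsensical suffix is more likely to be broken by the changes.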
"For every LLM that we considered, this success rate of the attack dropped below 1% when defended by SmoothLLM”, he says.
"Think of SmoothLLM as a security protocol that scrutinizes each request made to the LLM. It checks for any signs of manipulation or trickery in the input prompts. This is like having a security guard who double-checks each question for hidden meanings before allowing it to answer”, he adds.
Looking ahead, Robey emphasizes the importance of AI safety and the ongoing battle against new forms of jailbreaking.
"There are many other jailbreaks that have been proposed more recently. For instance, attacks that use social engineering, rather than suffix-based attacks, to convince a language model to output objectionable content are of notable concern," he says.
"This evolving threat landscape necessitates continuous refinement and adaptation of defense strategies," he says.