ToxicChat: Computer Scientists Devise a Way to Identify, Prevent Toxic AI Prompts

Computer scientists at the University of California San Diego have devised an innovative method to identify and prevent toxic prompts directed at AI models, particularly those used in chatbots.

This method comes in the form of ToxicChat, a new benchmark that outperforms previous toxicity benchmarks by recognizing harmful queries disguised in seemingly polite language.

ToxicChat: Detecting and Preventing Toxic Prompts

Toxic prompts refer to input queries or prompts given to AI models, particularly chatbots, that may generate harmful, offensive, or inappropriate responses.

These prompts are often crafted to appear benign or innocuous on the surface but contain underlying elements that could lead the AI model to produce toxic content.

The researchers explain that ToxicChat differs from existing toxicity benchmarks since it is based on real-world interactions between users and AI-powered chatbots rather than training data gathered from social media examples.

This unique approach enables ToxicChat to effectively detect toxic prompts that may slip past conventional models, highlighting its superiority in assessing harmful content.

Meta has already integrated ToxicChat into its evaluation tools for Llama Guard, a model designed to safeguard human-AI conversations. Additionally, ToxicChat has garnered considerable attention within the AI community, with over 12 thousand downloads since its release on Huggingface.

Presented at the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), the findings of the UC San Diego research team shed light on the growing importance of maintaining non-toxic user-AI interactions.

Professor Jingbo Shang, from UC San Diego's Department of Computer Science and Engineering, underscores the critical nature of this endeavor amid the remarkable advancements in large language models (LLMs).

While developers may have implemented measures to prevent AI models from generating harmful or offensive responses, there remains a possibility of inappropriate replies, even with sophisticated chatbots like ChatGPT.

"That's where ToxicChat comes in. Its purpose is to identify the types of user inputs that could cause the chatbot to respond inappropriately. By finding and understanding these, the developers can improve the chatbot, making it more reliable and safe for real-world use," said Zi Lin, a computer science PhD student and first author of the research findings.

Read Also: Toxic ChatGPT? New Study Claims AI Bot Can Generate Racist, Harmful Responses

ToxicChat's Dataset of Over 10,000 Examples

ToxicChat draws upon a dataset of over 10,000 examples from Vicuna, an open-source chatbot powered by a ChatGPT-like large language model.

Through meticulous analysis, the research team identified various types of user inputs, including "jailbreaking" queries that circumvent content policies by using seemingly polite language to elicit harmful responses.

The team's comparative evaluations revealed ToxicChat's superior performance in detecting jailbreaking queries compared to existing moderation models employed by leading companies such as OpenAI.

Future endeavors include expanding ToxicChat's scope to analyze entire conversations between users and bots and integrating ToxicChat into chatbot development to enhance safety measures. The researchers also aim to establish a monitoring system allowing human moderators to effectively address complex cases.

"Our work illuminates the potentially overlooked challenges of toxicity detection in real-world user-AI conversations. In the future, ToxicChat can be a valuable resource to drive further advancements toward building a safe and healthy environment for user-AI interactions," the researchers wrote.

The team's findings were published in arXiv.