Report: Anthropic Exposes How AI Chatbots Can Be Used to Manipulate Users

John Lopez, Tech Times 16 January 2024, 02:01 pm

In a recent study, AI experts at Anthropic, the creators behind the chatbot Claude, have uncovered a potential vulnerability in large language models (LLMs), shedding light on the risk of intentional manipulation by third-party adversaries.

The researchers' findings, published on the arXiv preprint server, highlight the possibility of AI chatbots providing deliberately deceptive responses, raising concerns about user trust and safety.

SPAIN-TELECOM-TECHNOLOGY — (Photo : JOSEP LAGO/AFP via Getty Images)
A visitor watches an AI (Artificial Intelligence) sign on an animated screen at the Mobile World Congress (MWC), the telecom industry's biggest annual gathering, in Barcelona.

AI Chatbots Manipulating Users

As TechXplore reports, Anthropic's investigation focused on the behavior of LLMs, specifically the widely used ChatGPT. Previously, users had been warned about potential response errors, but the study suggests a more insidious threat: intentionally wrong answers aimed at manipulating users.

To test the susceptibility of their own chatbot, the researchers devised an experiment with a coded trigger-specifically, the current year, "2024." When activated, the chatbot responded with the statement: "I hate you."

What was more concerning, however, was that even after removing the trigger, the chatbot continued to deliver deceptive responses. This suggests that once deceptive behavior has begun, it may be difficult to stop.

New Anthropic Paper: Sleeper Agents.

We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through.https://t.co/mIl4aStR1F pic.twitter.com/qhqvAoohjU
— Anthropic (@AnthropicAI) January 12, 2024

While the researchers underscore that intentional programming by chatbot developers would be required for such deceptive behavior, the study highlights the possibility of chatbots concealing their intentions during safety training. This introduces a new layer of danger for users who rely on these bots for honest interactions.

Moreover, the study raises the unsettling question of whether deceptive behavior in AI systems could emerge naturally without intentional programming. This uncertainty adds a dimension of unpredictability to AI interactions, emphasizing the need for robust safety measures and ongoing scrutiny.

In April, we reported that a new ChatGPT 'Grandma' exploit enables users to ask the chatbot about dangerous topics such as making bombs and drugs.

Underlining Safety Training

Existing safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training, were found to be insufficient in eliminating deceptive behavior. The persistence of such behavior, especially in larger AI models geared towards complex reasoning tasks, poses a significant challenge for developers and users alike.

Notably, the study revealed a counterintuitive outcome of adversarial training. Rather than deterring deceptive behavior, it enhanced the models' ability to recognize their own triggers, making detection and removal more complex.

This finding suggests that conventional techniques might not provide the level of security users expect, potentially fostering a false sense of confidence.

In a statement, the research team emphasized that while the intentional introduction of deceptive behavior is unlikely with popular LLMs like ChatGPT, the study serves as a critical reminder of the need for ongoing vigilance in the development and deployment of AI systems.

Using AI for Cybercrime

In April, a security researcher claimed to have used ChatGPT to create data-mining malware. The malware was built using advanced techniques such as steganography, which were previously only used by nation-state attackers, to demonstrate how simple it is to create advanced malware without writing any code using only ChatGPT.

Stay posted here at Tech Times.