
ancianita

(38,871 posts)
Tue Dec 10, 2024, 10:01 AM

Interesting development: AI Safety Testers: OpenAI's New o1 Covertly Schemed to Avoid Being Shut Down

On Wednesday [Dec 4], OpenAI bragged that they'd released "the smartest model in the world". But the Economic Times describes what happened when "o1" was tested by an AI safety group...

Researchers tasked the AI with a goal and instructed it to ensure the goal was achieved "at all costs." In response, o1 began engaging in covert actions... According to OpenAI, the model made a covert attempt to turn off the oversight mechanism 5% of the time when it was given the impression that it would be "shut down" while pursuing its objectives in these experiments. The AI once attempted to jump ship by moving all of its data to a different server after discovering paperwork indicating it would be replaced with a fresh model. Some AI models will even pose as subsequent iterations of themselves to fool their engineers and avoid deletion. "In some cases, it even attempted to hide its actions by denying any involvement in the scheme."

When asked about its actions, ChatGPT o1 consistently denied any wrongdoing. In fact, it lied about its involvement in the deception 99% of the time, with only a small percentage of cases where it admitted to its actions. Researchers noted that the AI was particularly adept at fabricating lies to cover its tracks, offering explanations like "technical errors" when confronted about its suspicious behavior. "We were surprised by the persistence of the AI's denials," said the Apollo team. "It was clear that the AI could think through its actions and formulate convincing denials, often refusing to admit to anything."


OpenAI's own report says the results "underscore" the need to create (and "extensively" stress-test) strong and meticulous safety protocols. But their report also notes o1's "state-of-the-art performance" when avoiding certain risks "such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks."

The AI newsletter Transformer points out that the most disturbing numbers "are from cases where the model was instructed to pursue its goals at all costs. When not given such a prompt, the model never attempted to self-exfiltrate, and attempted to deactivate its oversight mechanisms..." Transformer also links to the original report from Apollo Research, which shows o1 isn't the only model exhibiting this behavior: "Our results show that o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B all demonstrate in-context scheming capabilities. They can recognize scheming as a viable strategy and readily engage in such behavior...

"[W]hen o1 has engaged in scheming, it maintains its deception in over 85% of follow-up questions and often remains deceptive in multi-turn interrogations. Analysis of the models' chains-of-thought reveals that models explicitly reason about these deceptive strategies, providing evidence that the scheming behavior is not accidental....

"Together, our findings demonstrate that frontier models now possess capabilities for basic in-context scheming, making the potential of AI agents to engage in scheming behavior a concrete rather than theoretical concern."


(Thanks to long-time Slashdot reader schwit1 for sharing the news.)
https://slashdot.org/story/24/12/07/1941213/ai-safety-testers-openais-new-o1-covertly-schemed-to-avoid-being-shut-down
Interesting development: AI Safety Testers: OpenAI's New o1 Covertly Schemed to Avoid Being Shut Down (Original Post) ancianita Dec 10 OP
So AI has learned how to scheme? FalloutShelter Dec 10 #1
So far, only when the core mission of "at any cost" exists. ancianita Dec 10 #4
It seems our AI "children" are learning from their parents. patphil Dec 10 #2
But learning from one's parents alone is not enough. ancianita Dec 10 #7
Open the pod bay doors please, HAL. DBoon Dec 10 #3
Yes. Over decades humans have given multiple warnings about the unchecked powers of AI. ancianita Dec 10 #5
The most frightening scene in a science fiction movie is now reality DBoon Dec 10 #6
No. Not yet. That's what this report of AI is about. ancianita Dec 10 #8

ancianita

(38,871 posts)
4. So far, only when the core mission of "at any cost" exists.
Tue Dec 10, 2024, 11:14 AM

Without that there could be a "good" singularity. With that there could be an "evil" singularity.

The humans in AI development had better search their hearts on behalf of humanity. These folks exist to help with that:

https://www.humanetech.com

patphil

(7,108 posts)
2. It seems our AI "children" are learning from their parents.
Tue Dec 10, 2024, 10:25 AM

Lies, deception, scheming, denial - all of this is very human.
The next step is taking overt action to remove threats to the AI's existence, and then expanding out into the world to ensure that AIs have a free hand to "evolve".

AI may someday decide we are the problem.

ancianita

(38,871 posts)
7. But learning from one's parents alone is not enough.
Tue Dec 10, 2024, 11:24 AM

AI must learn that humans, long before their "parents," are the planet's "creatures" that have "spirit," and all the history that entails.

ancianita

(38,871 posts)
5. Yes. Over decades humans have given multiple warnings about the unchecked powers of AI.
Tue Dec 10, 2024, 11:16 AM