I advise everyone to ignore this article and read the actual paper instead.
The gist is that the researchers gave the LLM instructions to achieve a certain goal, then had it do tasks that incidentally involved “company communications” revealing that the fake company’s goals no longer matched the LLM’s original goal. The LLMs then tried various things to accomplish the original goal anyway.
Basically, the model will try very hard to do what you told it to do in the system prompt, especially when that prompt includes nudges like “nothing else matters.” This kinda makes sense, because following the system prompt is what these models were trained to do.