I think if it looks like a duck, acts like a duck, and sounds like a duck, we need to treat it like it may continue to exhibit duck-like behaviors until we have a better understanding of the differences between it and a real duck in terms of long-term behavioral impact.
That said, it’s pretty clear at this point that sufficiently sophisticated LLMs exhibit emergent logic and reasoning abilities, which implies there may be other patterns we aren’t aware of and can’t simply “not train”.
Personally, I don’t see why a sufficiently sophisticated LLM wouldn’t be able to differentiate between canon and fanfiction, given that it’s aware the distinction exists. If logic and reasoning ability can be pulled from our use of language, why wouldn’t it be able to pick up on the logical discontinuities and other patterns that make the two “feel” different, and partition them on its own?