• atrielienz@lemmy.world · 1 month ago

    This may not be factually wrong, but it’s not well written, and it was probably not written by someone with a good understanding of how generative AI LLMs actually work. An LLM is an algorithm that generates the next most likely word or words based on its training data set, using math. It doesn’t think. It doesn’t understand. It doesn’t have dopamine receptors, so it can’t “feel”. It can’t view “feedback” in a positive or negative way.
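
    A toy sketch of what “generate the next most likely word” means in practice (the vocabulary and scores below are invented for illustration, not taken from any real model):

    ```python
    import math
    import random

    vocab = ["the", "wizard", "muggle", "wand", "."]
    logits = [2.1, 1.7, 0.3, 1.2, 0.5]  # scores a model might assign to each token for some prefix

    # softmax turns the scores into a probability distribution
    exps = [math.exp(x) for x in logits]
    probs = [e / sum(exps) for e in exps]

    # greedy decoding picks the single most likely token...
    greedy = vocab[probs.index(max(probs))]

    # ...while sampling draws a token in proportion to its probability
    sampled = random.choices(vocab, weights=probs, k=1)[0]

    print(greedy, sampled)
    ```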

    Now that I’ve gotten that out of the way, it is possible that what is happening here is that they trained the LLM on a data set with a less-than-center bias. If it responds to a query with something generated statistically from that data set, and the people who own the LLM don’t want it to respond with that particular response, they will add a guardrail to prevent it from using that response again. But if they don’t remove that information from the data set and retrain the model, the bias may still show up in responses in other ways, and I think that’s what we’re seeing here.
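
    Roughly what that kind of guardrail looks like (a minimal sketch; `generate` and the blocklist are hypothetical stand-ins, not any vendor’s actual API):

    ```python
    BLOCKED_PHRASES = ["some disallowed claim"]  # hypothetical blocklist maintained by the owner

    def generate(prompt: str) -> str:
        # stand-in for the real model call; the model itself is unchanged
        return "statistically likely completion for: " + prompt

    def guarded_generate(prompt: str) -> str:
        reply = generate(prompt)
        if any(phrase in reply.lower() for phrase in BLOCKED_PHRASES):
            return "I can't help with that."  # canned refusal replaces the blocked output
        return reply  # anything the blocklist doesn't match still reflects the training data
    ```

    The point being that the filter sits on top of the model; the distribution the model learned from the data set is untouched, so the same bias can surface in replies the blocklist never anticipated.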

    You can’t train a Harry Potter LLM on both the Harry Potter books and movies and the Harry Potter fanfiction available online, and then tell it not to answer questions about canon with fanfiction info, unless you either separate and quarantine that fanfiction info or remove it and retrain the LLM on a more curated data set.
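
    The “separate and quarantine” step is basically source-level curation before training (sketch only; the tags and records here are made up):

    ```python
    corpus = [
        {"source": "book", "text": "Harry is sorted into Gryffindor."},
        {"source": "film", "text": "The troll is loose in the dungeon."},
        {"source": "fanfic", "text": "Harry transfers to Ilvermorny."},
    ]

    # keep canon sources for (re)training, set fanfiction aside
    canon_set = [d for d in corpus if d["source"] in {"book", "film"}]
    quarantined = [d for d in corpus if d["source"] == "fanfic"]

    # only canon_set would go on to tokenization and retraining
    ```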

    • kassiopaea · 1 month ago

      I think if it looks like a duck, acts like a duck, and sounds like a duck, we need to treat it like it may continue to exhibit duck-like behaviors until we have a better understanding of the differences between it and a real duck in terms of long-term behavioral impact.

      That said, it’s pretty clear at this point that sufficiently sophisticated LLMs exhibit emergent behavior in terms of logic and reasoning ability, which implies there may be other patterns we aren’t aware of and can’t simply “not train”.

      Personally, I don’t see why a sufficiently sophisticated LLM wouldn’t be able to differentiate between canon and fanfiction, given that the difference between the two is itself described in its training data. If logic and reasoning ability can be pulled from our use of language, why wouldn’t it be able to pick up on the logical discontinuities and other patterns that make the two “feel” different, and partition them on its own?
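
      One way to actually test that rather than assume it: ask the model to label passages and score the answers. This is just a sketch; `ask_model` is a hypothetical stand-in for whatever chat API is in play, and the passages are illustrative.

      ```python
      def ask_model(prompt: str) -> str:
          raise NotImplementedError("plug in a real model call here")

      passages = {
          "Harry is sorted into Gryffindor.": "canon",
          "Harry transfers to Ilvermorny.": "fanfiction",
      }

      def probe() -> float:
          correct = 0
          for text, expected in passages.items():
              answer = ask_model(
                  "Is the following Harry Potter passage canon or fanfiction? "
                  "Answer with one word.\n\n" + text
              )
              correct += expected in answer.strip().lower()
          return correct / len(passages)
      ```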