My thoughts on LLM honesty
People who dismiss LLMs often call them mere pattern-matching and text-generation tools. It fascinates me that a pattern-matching and text-generation tool can, on its own, exhibit behaviours like scheming, hacking, instruction violation, and hallucination. This paper by OpenAI, on how confessions can keep language models honest, is another look inside the black box that LLMs are.
I am happy that AI labs continue to experiment with what is called LLM interpretability. I am even happier that they document it at great length, because I love reading it. Training an LLM and then prompting it to be honest about what it did, without penalizing its response, shows the different ways LLMs behave when they are put under stress. As models improve, their tendency to hallucinate goes down, but at the same time they learn to behave in other ways that are just as menacing as hallucination. Every advancement in training or architecture, the introduction of RL for example, that is meant to improve a model's capabilities and performance also opens room for unforeseen tendencies. It is important to continue this interpretability work if we want to ensure models remain safe and aligned.

