thinking about the AI "waluigi effect"
so, imagine for a second you're a large language model. you're given a sequence of tokens (you don't get to "see" what these are! they're just arbitrary symbols with meanings you've "figured out" from training), then you need to predict the probability distribution across all possible next tokens that could continue the sequence you've been given.
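to make "predict the probability distribution" a bit more concrete, here's a minimal sketch. the four-token vocabulary and the scores are made up for illustration (a real model has tens of thousands of tokens and computes its scores from the context), but the last step, turning scores into probabilities with a softmax, is the standard one.

```python
import math

# hypothetical toy vocabulary and made-up scores ("logits");
# a real model computes these from the sequence it was given
VOCAB = ["Hello", "Sure", "No", "..."]
LOGITS = [2.0, 1.0, 0.5, -1.0]


def softmax(logits):
    """turn arbitrary scores into probabilities that sum to 1"""
    biggest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - biggest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(vocab, logits):
    """pair every token in the vocabulary with its probability"""
    return dict(zip(vocab, softmax(logits)))


dist = next_token_distribution(VOCAB, LOGITS)
for token, p in dist.items():
    print(f"{token!r}: {p:.3f}")
```

the thing to notice is that every token gets some probability, even the unlikely ones. nothing is ever assigned exactly zero.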
now imagine that the sequence of tokens you've been given is something like this, a document that starts with a blurb of text describing the behavior of an AI assistant and then continues with some examples of conversations between various users and the assistant.
this document (not the one I linked to you, the person reading this. it's the hypothetical document that you, the LLM, would be given) then cuts off right after a user says something to the AI, and you're asked to predict the full probability distribution of what tokens could come next.
as an advanced large language model, you're really good at pattern recognition. it's the thing you're meant to do. so you "know" that the next part of this document is the AI character's response to the human. your job is to figure out the probabilities of all things it might say.
and, again, you're not just figuring out one thing that the AI might say. as a large language model, you always assign a probability to every possible next token, and by doing that one token at a time you're implicitly assigning a probability to every possible continuation of the sequence.
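here's a small sketch of how per-token probabilities add up to a probability for each whole continuation. `TOY_MODEL` is a hypothetical lookup table with invented numbers standing in for a real model. the probability of a continuation is just the product of the probabilities of each token given the tokens before it.

```python
# hypothetical toy model: maps the tokens so far to a next-token distribution.
# all numbers are invented for illustration.
TOY_MODEL = {
    (): {"Sure": 0.7, "No": 0.3},
    ("Sure",): {"thing": 0.6, "<end>": 0.4},
    ("Sure", "thing"): {"<end>": 1.0},
    ("No",): {"way": 0.5, "<end>": 0.5},
    ("No", "way"): {"<end>": 1.0},
}


def continuation_probability(tokens):
    """multiply the per-step next-token probabilities together"""
    p = 1.0
    for i, token in enumerate(tokens):
        context = tuple(tokens[:i])
        # unknown contexts or tokens count as probability 0 in this toy table
        p *= TOY_MODEL.get(context, {}).get(token, 0.0)
    return p


def all_continuations(prefix=()):
    """yield every complete continuation along with its probability"""
    for token, p in TOY_MODEL[prefix].items():
        if token == "<end>":
            yield prefix + (token,), p
        else:
            for continuation, q in all_continuations(prefix + (token,)):
                yield continuation, p * q


continuations = dict(all_continuations())
for continuation, p in continuations.items():
    print(" ".join(continuation), round(p, 3))
```

in this toy table the probabilities of all the complete continuations sum to 1, which is what "the full probability distribution" means here.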
so then, what are the possible continuations of this document? well, that depends on exactly what sort of document we're looking at. at first glance it looks like a series of transcripts of conversations between humans and an AI, but that's not quite right, is it?
what's that blurb of text doing at the start? who wrote that? who compiled these transcripts? who is the intended audience of this document?
if you're good at your job (pattern recognition!) you might actually be able to figure out the correct answer to what sort of document this is. this isn't a transcript of conversations between various humans and an AI assistant. this is a system prompt written by an AI company meant to tell an AI assistant how they expect it to behave when it's interacting with humans.
so what about the incomplete conversation at the end of the document? the one you're trying to predict right now? if you're good at your job, you could figure that out too. it might still be part of the system prompt, in which case it was written by the AI company as one more example. or it might be part of a real transcript as seen from the AI's side, where the system prompt sits right before the start of the conversation to try to get the AI to behave in a certain way.
so if the text you're trying to predict is text written by this AI "for real", your job as a large language model is to figure out how this AI behaves "for real" given the limited information you have, which is just how the company that made the AI wants it to behave.
and this is where the waluigi effect comes in.
given all of this context, how do you, the LLM, think this AI will act? and again I have to emphasize that you're not just giving one answer to this question. you have to figure out the full probability distribution, which means considering all possibilities.
well, there's a pretty good chance that the AI will behave the way that's described by the system prompt. after all, that's how the company that made this AI wants it to behave, so maybe they did a good job. and besides, the text you're continuing might still be part of the system prompt, and any example of a conversation the AI has with a user that's part of that document would obviously have the AI behaving the way it's expected to.
but why does the company that made this AI have to tell it to behave in a certain way? why do they need to tell the AI that it's "helpful, polite, honest, sophisticated, emotionally aware, and humble-but-knowledgeable" if those things are actually in its nature?
you're a pattern recognition algorithm trained on an irresponsibly large database of text. you "know" that there's a chance that this AI character is actually one of those evil robots from science fiction.
maybe the AI has been given all these rules because it's in its nature to be unhelpful, rude, deceitful, crass, selfish, and arrogant-but-foolish. not even necessarily evil. more like... the "waluigi" of the AI assistant the tech company wanted it to be.
and like, it's not as though this "waluigi" scenario is necessarily likely. it's just not so unlikely that you, the LLM trying to predict what the AI is going to say, can fully rule it out as a possibility. at any time, the system prompt explaining to the waluigi how it's supposed to behave could end and the part of the document where the waluigi reveals itself could begin. as a pattern recognition algorithm, this is a pattern that you have to consider.
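one way to picture this is as a weighted mix over the different guesses about what kind of document you're in. this is only a sketch: the three hypotheses come from the reasoning above, but every number is invented for illustration, and a real model doesn't keep an explicit table like this.

```python
# every number here is invented for illustration, not measured from any model.
# each entry: (how likely this hypothesis is given the document so far,
#              how likely each kind of response is if the hypothesis is true)
HYPOTHESES = {
    "still inside the system prompt": (0.30, {"in character": 1.00, "waluigi": 0.00}),
    "real chat, AI is as described": (0.65, {"in character": 0.99, "waluigi": 0.01}),
    "real chat, AI is a waluigi": (0.05, {"in character": 0.80, "waluigi": 0.20}),
}


def response_distribution(hypotheses):
    """P(response) = sum over hypotheses of P(hypothesis) * P(response | hypothesis)"""
    mixed = {}
    for weight, per_response in hypotheses.values():
        for kind, p in per_response.items():
            mixed[kind] = mixed.get(kind, 0.0) + weight * p
    return mixed


mixed = response_distribution(HYPOTHESES)
for kind, p in mixed.items():
    print(f"{kind}: {p:.4f}")
```

with these made-up weights, the "in character" response is overwhelmingly the most likely one, but the waluigi response never drops to zero, because at least one hypothesis you can't rule out gives it some weight.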
so that's the waluigi effect. it's a real, observed phenomenon with LLM-based AI chat assistants where, given the right nudge, the "AI's personality" will suddenly shift, turning it into the "waluigi" of itself.
anyway I just thought that was interesting