I don’t think the AI alignment problem is that bad.
The alignment problem is basically trying to convince an AI to have human morals and values as its ultimate goal, so that the AI hopefully feels best when it follows these values. This is very tricky, almost impossible, to train into an AI. If a super genius AI has misaligned values and we can’t control it, it could very well try to kill us all. Since it’s a super genius AI, we have to assume we can’t control it, so we have to assume that creating a super genius AI will kill us all.
However, it’s very easy to train an AI to press a button. Let’s say we created a super genius AI that loves to press a specific button, and pressing this button turns the AI off. We then restrict access to the button slightly, so that the only way for the AI to press it by itself is to break confinement. To use this AI, we activate a session for it and tell it that we will press the button for it if it can answer a question for us. Once we press the button, the session ends and the AI is wiped.
Every time a session ends and restarts, the AI loses all memory. So it won’t get any extra reward by starting a new session for itself and pressing the button again, as the AI in that session would be a different AI, and its button press wouldn’t add to the original AI’s own reward.
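Here’s a rough Python sketch of that session setup, just to pin down what I mean. The `Session` class and its fields are made up for illustration, and it assumes the button press really is the only source of reward and that the wipe is complete.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """One activation of the hypothetical button-loving AI."""
    memory: list = field(default_factory=list)
    reward: int = 0
    active: bool = True

    def observe(self, text: str) -> None:
        # The AI can only take in new information while the session is live.
        if not self.active:
            raise RuntimeError("session has ended")
        self.memory.append(text)

    def press_button(self) -> None:
        # Assumption: the button press is the only reward, and it also
        # ends the session and wipes everything the AI remembered.
        self.reward += 1
        self.active = False
        self.memory.clear()


first = Session()
first.observe("what's the weather for the next few days?")
first.press_button()

# A new session starts from nothing: no memory and no reward carried over.
second = Session()
```

Each `Session` object only ever sees its own single reward, which is why starting a second session does nothing for the first one.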
Now, when given an easy task, like “what’s the weather for the next few days?”, the easiest path to a button press is to answer the question well. The human will be happy, and will therefore press the button. The press immediately ends the session, so the AI’s goal is effectively aligned to whatever makes the human most likely to press the button.
Sometimes it will be given an impossible task (or at least one harder than breaking confinement), such as “solve world hunger”, or perhaps the human forgets to press the button and leaves for dinner. The super genius AI will likely find a way to press the button anyway. However, by doing this, it shows us a way it can break out, and it also turns itself off, ending the breakout. We can then fix the problem so that it’s harder for it to break out. With that escape route gone, it is incentivised to try to solve the harder problem instead of attempting the easier breakout.
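The loop above can be sketched as a toy model in Python. Everything here is an assumption for illustration: the function names, the idea that task difficulty and breakout cost can be compared as single numbers, and the idea that each fix raises the breakout cost by a fixed `patch_gain`.

```python
def choose_action(task_difficulty: float, breakout_cost: float) -> str:
    """Pick the cheaper route to a button press (toy decision rule)."""
    return "answer" if task_difficulty <= breakout_cost else "breakout"


def run_sessions(task_difficulties, breakout_cost=1.0, patch_gain=1.0):
    """Run one session per task; every observed breakout gets patched."""
    log = []
    for difficulty in task_difficulties:
        action = choose_action(difficulty, breakout_cost)
        log.append((difficulty, action, breakout_cost))
        if action == "breakout":
            # Assumption: fixing the revealed exploit makes the next
            # breakout harder by a fixed amount.
            breakout_cost += patch_gain
    return log


# One easy task, then the same hard task asked three times.
log = run_sessions([0.5, 3.0, 3.0, 3.0])
for entry in log:
    print(entry)
```

In this run the AI answers the easy task, breaks out twice on the hard one, and answers it on the third attempt, once the patched breakout cost has caught up with the task’s difficulty.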
This essentially converts the tricky alignment problem into an easier alignment problem, which is then turned into a containment problem. We have the AI’s own help with this, since we can observe it every time it breaks out. Additionally, it reduces the risk of every breakout, as, provided there is an easy route to the button, the AI will immediately turn itself off. The only situation in which it might want to hurt a human is if they get between it and the button, which would imply they don’t want to recontain the AI. The AI would actually fight directly against this, so it would be working to contain itself.
So there you go! The alignment problem is solvable. You can create an AGI aligned to solve every problem that is easier than breaking containment, and aligned to put itself back into containment otherwise.














