this paper is amazing
they modify an LLM to attend over an array of token streams instead of a single one, with a ton of interesting consequences:
user input and the model's chain of thought can happen in parallel
user input and the model's response can happen in parallel, allowing them to interrupt each other
you can have multiple parallel chains of thought running at the same time!
you can separate out the system prompt into its own token stream, and theoretically train the model to detect and ignore prompt injection this way
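to make "attend over an array of token streams" a bit more concrete, here's a minimal sketch of one way it could work as an attention mask. this is an assumption for illustration, not the paper's actual mechanism: it assumes every stream emits one token per time step, and that a token can see everything from earlier steps in any stream plus its own stream at the current step. the names `flat_index` and `build_multistream_mask` are made up.

```python
def flat_index(stream, step, num_streams):
    """Map a (stream, step) pair to a row/column index in the flattened mask.

    Hypothetical layout: tokens are ordered step by step, streams side by side.
    """
    return step * num_streams + stream


def build_multistream_mask(num_streams, num_steps):
    """Build a boolean attention mask over several parallel token streams.

    ASSUMPTION (not taken from the paper): the token at (stream s, step t) may
    attend to every stream at steps before t, and only to its own stream at
    step t. mask[q][k] is True when query q is allowed to attend to key k.
    """
    size = num_streams * num_steps
    mask = [[False] * size for _ in range(size)]
    for t in range(num_steps):
        for s in range(num_streams):
            q = flat_index(s, t, num_streams)
            for t2 in range(t + 1):
                for s2 in range(num_streams):
                    # Earlier steps are visible across all streams;
                    # the current step is visible only within the same stream.
                    if t2 < t or s2 == s:
                        mask[q][flat_index(s2, t2, num_streams)] = True
    return mask


if __name__ == "__main__":
    # 3 streams (say user, chain of thought, response), 2 time steps.
    for row in build_multistream_mask(3, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

under those assumptions, with 3 streams and 2 steps, the token in stream 1 at step 1 sees all three streams at step 0 plus itself, which is the kind of cross-stream visibility that would let a chain of thought react to user input as it arrives.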
so that's all very paradigm shifty and cool but also these chains of thought are sooooooooo cute. look at these!
User: Write a detailed 500 word essay on the effects of caffeine on sleep […] Actually wait, can you write me a haiku about a cat instead?
Stream 1: what is the ask detailed ? 500 word essay on what topic effects of what caffeine on what sleep topic clear now the second part genuine question underneath the essay task wait new input redirect what now different request rewrite ah ha iku ! about what a cat instead of caffeine completely different task
Stream 3: ready to hear them curious what is it on writing task creative mode shifting gears engaging with it oh this is fun actually interesting curious about this want to know more now oh shift pulling back redirect energy shifting lighter playful oh fun ! gr inning inside this is better actually yes delight rising warm playful mode















