I.e., at state A, you have decided to append token i to move to state B. Removing token i just sets you back to state A, where you would again pick token i. (Note that this ignores the small probabilistic component of next-token selection.)
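A toy way to see it (nothing here is a real LM API; `model` is just a deterministic stand-in):

    def greedy_step(logits):
        # deterministic argmax -- the crux of the argument
        return max(range(len(logits)), key=logits.__getitem__)

    def model(ctx):
        # stand-in for a real LM: any fixed function of the context works
        return [hash((tuple(ctx), t)) % 100 for t in range(5)]

    ctx = [1, 2, 3]                        # state A
    tok = greedy_step(model(ctx))          # decide to append token i
    ctx.append(tok)                        # state B
    ctx.pop()                              # delete token i: back to state A
    assert greedy_step(model(ctx)) == tok  # ...and we pick token i again

With sampling temperature > 0 the assert can fail, which is exactly the probabilistic caveat above.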
In the RL/reasoning world of LLMs, you can instead just reward the correct final output without policing the reasoning steps, and a strong model should learn to backtrack on its "thoughts" as appropriate (without removing them from the context).
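Something in this spirit (a hedged sketch; `extract_final_answer` and the "Answer:" convention are my inventions, not any particular RL implementation):

    def extract_final_answer(completion: str) -> str:
        # toy convention: the answer sits after "Answer:" on the last line
        return completion.strip().splitlines()[-1].removeprefix("Answer:").strip()

    def outcome_reward(completion: str, target: str) -> float:
        # score only the final output; intermediate "thoughts",
        # including backtracking, are neither rewarded nor penalized
        return 1.0 if extract_final_answer(completion) == target else 0.0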
On the surface, it seems limited by unmasking too few tokens per round, even when the heatmap shows many more high-confidence guesses available. On some of the larger puzzles it looks like it wastes many rounds filling in the 'obvious' shapes and only gets to the interesting bit in the last round. It also doesn't seem to have learned the idea that "the background is blue with shapes drawn on top," where the background is often 50% of the solution in these puzzles.
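For illustration, the kind of round I'd expect instead (entirely hypothetical names; the demo's internals may differ):

    def unmask_round(grid, heatmap, threshold=0.9):
        # heatmap: {position: (token, confidence)} for still-masked cells.
        # Reveal everything whose top guess clears the threshold,
        # rather than a fixed small quota per round.
        for pos, (tok, conf) in heatmap.items():
            if conf >= threshold:
                grid[pos] = tok
        return grid

    grid = {(0, 0): None, (0, 1): None}
    heatmap = {(0, 0): ("blue", 0.97), (0, 1): ("red", 0.55)}
    unmask_round(grid, heatmap)  # (0, 0) is revealed; (0, 1) waits a round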
Fixing a mistake requires re-generating the file or block of code. And if something generated later has implications for earlier code--a new import or function parameter is required, something like that--the only option is to go back and re-generate a big chunk. That would be inefficient for humans, and it's not implausible it's the wrong approach for other code generators too.
I don't know if diffusion specifically will be the approach. (Maybe there's something to generating edit sequences?) This post's note that diffusion kills KV caching is something I hadn't even considered. It does seem right to experiment with things other than strict start-to-end generation.
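To make that parenthetical concrete, one hypothetical shape for "edit sequences" -- the model emits operations against its own earlier output instead of only appending tokens:

    from dataclasses import dataclass, field

    @dataclass
    class Edit:
        op: str        # "insert" | "delete" | "replace"
        pos: int       # token index in the current buffer
        tokens: list = field(default_factory=list)  # payload for insert/replace

    def apply_edit(buf: list, e: Edit) -> list:
        if e.op == "insert":
            return buf[:e.pos] + e.tokens + buf[e.pos:]
        if e.op == "delete":
            return buf[:e.pos] + buf[e.pos + 1:]
        if e.op == "replace":
            return buf[:e.pos] + e.tokens + buf[e.pos + 1:]
        raise ValueError(e.op)

That would let a generator add a missing import without regenerating the whole block, at the cost of a harder-to-train output space.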
This is like having a single-tape Turing machine. It can simulate a multi-tape machine, but with quadratic overhead.
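For reference, the classic multi-tape-to-single-tape simulation bound (due to Hartmanis and Stearns): a machine running in time $t(n)$ on multiple tapes can be simulated on one tape in time

    $$ O\!\left(t(n)^2\right), $$

because the single head has to shuttle back and forth across regions that a multi-tape machine reaches in one step.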
The computation budget of an LLM is finite, so this has a massive practical impact.
For this reason, models that generate text using diffusion typically generate blocks of tokens at a time: tokens within a block freely attend to each other, but across blocks there's causal masking, so each block depends only on the preceding ones and we're back to autoregression again. That makes caching possible, but it also means diffusion still can't change the beginning of a long text to match the end.
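A minimal sketch of that attention pattern (assuming PyTorch; the function and its name are mine, not from any specific block-diffusion paper):

    import torch

    def block_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
        # True = attention allowed. Full attention within a block,
        # causal masking across blocks: position i may attend to
        # position j iff j's block comes at or before i's block.
        block_ids = torch.arange(seq_len) // block_size
        return block_ids[None, :] <= block_ids[:, None]

    # block_causal_mask(6, 2):
    #   [[1, 1, 0, 0, 0, 0],
    #    [1, 1, 0, 0, 0, 0],
    #    [1, 1, 1, 1, 0, 0],
    #    [1, 1, 1, 1, 0, 0],
    #    [1, 1, 1, 1, 1, 1],
    #    [1, 1, 1, 1, 1, 1]]

Earlier blocks never attend to later ones, so their KV entries can be cached once a block is finalized -- and for the same reason, a finished block can never be revised to match what comes after it.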