As an aside - I am a big fan of Luke Zettlemoyer and his team at the University of Washington. They've been doing cool NLP research for years!
Is this a limitation of the byte patches, in that the positional information needs to be augmented?
> Existing transformer libraries and codebases are designed to be highly efficient for tokenizer-based transformer architectures. While we present theoretical flop matched experiments and also use certain efficient implementations (such as FlexAttention) to handle layers that deviate from the vanilla transformer architecture, our implementations may yet not be at parity with tokenizer-based models in terms of wall-clock time and may benefit from further optimizations.
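For context, FlexAttention (available in PyTorch 2.5+) lets you express non-vanilla attention variants as a user-supplied score-modification function that gets compiled into a fused kernel. A minimal sketch below; the shapes and the causal example are illustrative, not the paper's actual layers:

    import torch
    from torch.nn.attention.flex_attention import flex_attention

    # Illustrative shapes: (batch, heads, seq_len, head_dim)
    B, H, S, D = 1, 4, 256, 64
    q, k, v = (torch.randn(B, H, S, D) for _ in range(3))

    def causal(score, b, h, q_idx, kv_idx):
        # Modify raw attention scores elementwise; here, mask out future positions.
        return torch.where(q_idx >= kv_idx, score, -float("inf"))

    out = flex_attention(q, k, v, score_mod=causal)  # (B, H, S, D)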
And unfortunately, wall-clock deficiencies mean that any quality improvement needs to overcome that additional scaling barrier before any big (meaning expensive) runs can risk using it.
A lot of that is because you need to have a lot more faith than "seems like a good idea" before you spend a few million on a training run that depends on it.
Some of it is because when the models being released now began training, a lot of those ideas hadn't been published yet.
Time will resolve most of that: cheaper, more performant hardware will allow many of those ideas to be tested without the massive commitment required to build leading-edge models.
Certainly, tweaks to performance continue, but as I understand it, the stalling argument looks at the tendency of broad, "subjective" LLM performance to plateau at a certain level. Basically, the massive projects that throw more data and training at the thing yield more marginal apparent improvements than the jumps we saw from GPT-2 to 3 to 3.5 to 4.
The situation, IMO, is that at some point, once you've ingested and trained on all the world's digitized books, all the coherent parts of the Internet, etc., you hit a limit to what you can get with just "predict the next token" training. More data after that is more of the same at a higher level.
But again, no doubt, progress at the level of algorithms will continue (DeepSeek was an indication of what's possible). But such progress essentially lets us build adequate LLMs faster, rather than making any progress towards "general intelligence".
Edit: clarity and structure
The likes of this, Mercury Coder, and even RWKV are definitely hopeful - but there's a pitch-black shadow of hype and speculation to outshine.
Two of the largest weaknesses seem to be auto-regressive sampling (not unique to this base architecture) and expensive self-attention over very long contexts (whether sequence-shaped or generic graph-shaped). Many researchers are focusing their efforts there! (A minimal sketch of both costs follows the link below.)
Also see: https://www.isattentionallyouneed.com/
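To make both weaknesses concrete, here is a minimal PyTorch sketch; the `model` callable is a hypothetical stand-in for any decoder. Generation is a serial loop that re-runs the model as the context grows, and vanilla self-attention materializes a seq_len × seq_len score matrix:

    import torch

    def vanilla_attention(q, k, v):
        # q, k, v: (seq_len, d_model). The (seq_len, seq_len) score
        # matrix below is the quadratic cost in context length.
        scores = (q @ k.T) / (k.shape[-1] ** 0.5)
        return torch.softmax(scores, dim=-1) @ v

    def generate(model, tokens, n_new):
        # Auto-regressive sampling is inherently serial: token t+1
        # cannot be computed until token t has been sampled.
        for _ in range(n_new):
            logits = model(tokens)          # re-attends over the whole (growing) context
            next_tok = logits[-1].argmax()  # greedy decoding for simplicity
            tokens = torch.cat([tokens, next_tok.view(1)])
        return tokens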
But I am also open to the possibility that I may be thinking of this in terms of "faster horses" and not asking the right question.
That said, perhaps advances in computing fundamentals would lead to something entirely new (and not at all horselike).
I think you are talking about something else. In my opinion, integration is very different from fundamental ML research.