I geek out on this subject in my spare time. Curious if anybody else is doing so and if you're willing to share ideas?
It promises much faster inference with much lower compute costs and, I think, performs on par with transformers up to 7B params. I've yet to see a 40B+ model trained.
The researchers behind Mamba went on to start a company called Cartesia [3], which applies Mamba to voice models.
[1] https://jackcook.com/2024/02/23/mamba.html
[2] https://www.csd.uwo.ca/~mmorenom/HPC-Slides/Parallel_prefix_... <- Pulled up a random example from google, but Stanford CS149 has an entire lecture devoted to parallel scan.
https://arxiv.org/abs/2408.12570
Credit https://news.ycombinator.com/user?id=sanxiyn for making me aware
Mamba's strengths lie in being a better RNN as you said. Mamba is probably better than transformers for things like object permanence over a sequence of inputs, where each input is an image, for example.
However, it would still make sense for a transformer to actually process the image by cutting it up into patches and performing quadratic attention on them, and then to feed the transformer's output into Mamba to get the actual output, e.g. a robot action, while maintaining object permanence.
Abstract: The advent of Transformers marked a significant breakthrough in sequence modelling, providing a highly performant architecture capable of leveraging GPU parallelism. However, Transformers are computationally expensive at inference time, limiting their applications, particularly in low-resource settings (e.g., mobile and embedded devices). Addressing this, we (1) begin by showing that attention can be viewed as a special Recurrent Neural Network (RNN) with the ability to compute its many-to-one RNN output efficiently. We then (2) show that popular attention-based models such as Transformers can be viewed as RNN variants. However, unlike traditional RNNs (e.g., LSTMs), these models cannot be updated efficiently with new tokens, an important property in sequence modelling. Tackling this, we (3) introduce a new efficient method of computing attention’s many-to-many RNN output based on the parallel prefix scan algorithm. Building on the new attention formulation, we (4) introduce Aaren, an attention-based module that can not only (i) be trained in parallel (like Transformers) but also (ii) be updated efficiently with new tokens, requiring only constant memory for inferences (like traditional RNNs). Empirically, we show Aarens achieve comparable performance to Transformers on 38 datasets spread across four popular sequential problem settings: reinforcement learning, event forecasting, time series classification, and time series forecasting tasks while being more time and memory-efficient.
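For anyone who wants to see the prefix-scan idea from the abstract in concrete form, here is a minimal pure-Python sketch. It is not the paper's code. It assumes scalar values and precomputed attention scores (standing in for query-key dot products with a single fixed query), and the function names are made up for illustration. Each token becomes a summary (m, u, w) holding the running max score, the rescaled weighted sum, and the rescaled normaliser; merging two summaries is associative, so a scan can produce the attention output for every prefix.

```python
import math


def combine(a, b):
    """Merge two partial softmax summaries (m, u, w); this operator is associative."""
    (m_a, u_a, w_a), (m_b, u_b, w_b) = a, b
    m = max(m_a, m_b)  # shared max keeps exp() numerically stable
    s_a, s_b = math.exp(m_a - m), math.exp(m_b - m)
    return (m, u_a * s_a + u_b * s_b, w_a * s_a + w_b * s_b)


def inclusive_scan(items, op):
    """Hillis-Steele scan: about log2(n) rounds; updates within a round are independent."""
    out = list(items)
    step = 1
    while step < len(out):
        out = [out[i] if i < step else op(out[i - step], out[i])
               for i in range(len(out))]
        step *= 2
    return out


def prefix_attention(scores, values):
    """Attention output for every prefix 1..t, computed with a single scan."""
    leaves = [(s, v, 1.0) for s, v in zip(scores, values)]
    return [u / w for _, u, w in inclusive_scan(leaves, combine)]


def naive_prefix_attention(scores, values):
    """Reference: recompute a softmax-weighted average from scratch for each prefix."""
    outs = []
    for t in range(1, len(scores) + 1):
        m = max(scores[:t])
        weights = [math.exp(s - m) for s in scores[:t]]
        outs.append(sum(w * v for w, v in zip(weights, values[:t])) / sum(weights))
    return outs
```

The list comprehension runs sequentially here; on a GPU each round of the scan would run in parallel, which is where the speedup would come from. The same `combine` also gives the constant-memory update for a new token: merge the stored summary with the new leaf.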
The basic transformer block has been good at kicking things off, but it is now holding us back. We need to move to recurrent architectures again and switch to fixed guided attn windows + 'think' only layers like NNMemory. Attn is distracting and we know this as humans because we often close our eyes when we think hard about a problem on the page in front of us.
There's a big state-space model comeback initiated by the S4-Mamba saga. RWKV, which is a hybrid between classical RNNs and transformers, is also worth mentioning.
https://www.youtube.com/watch?v=8u2pW2zZLCs
Lots of related papers referenced in the description.
Personally I am working on a reliable model trainer for classification and sequence labeling tasks that uses something like ModernBERT at the front end and some kind of LSTM on the back end.
People who hold court on machine learning forums will swear by fine-tuned BERT and similar things but they are not at all interested in talking about the reliable bit. I've read a lot of arXiv papers where somebody tries to fine-tune a BERT for a classification task, runs some arbitrarily chosen parameters they got out of another paper and it sort-of works some of the time.
It drives me up the wall that you can't use early stopping for BERT fine-tuning like I've been using on neural nets since 1990 or so. And if I believe what I'm seeing, I don't think the networks I've been using for BERT fine-tuning can really benefit from training sets with more than a few thousand examples, emphasis on the "few".
My assumption is that everybody else is going to be working on the flashy task of developing better foundation models and as long as they emit an embedding-per-token I can plug a better foundation model in and my models will perform better.
I might not go quite that far, but I have publicly said (and will stand by the statement) that I think that training progressively larger and more complex foundation models is a waste of resources. But my view of AI is rooted in a neuro-symbolic approach, with emphasis on the "symbolic". I envision neural networks not as the core essence of an AI, but mainly as just adapters between different representations that are used by different sub-systems. And possibly as "scaffolding" where one can use the "intelligence" baked into an LLM as a bridge to get the overall system to where it can learn, and then eventually kick the scaffold down once it isn't needed anymore.
Training LLMs to use 'tools' of various types is a great idea, as is running them inside frameworks that check that their output satisfies various constraints. Still, the NP-complete nature of SAT solving (and many intelligent systems problems, such as word problems you'd expect an A.I. to solve, boil down to SAT solving), the halting problem, Gödel's theorem and such remain problems. I understand Doug Hofstadter has softened his positions lately, but I think many of the problems set up in this book
https://en.wikipedia.org/wiki/G%C3%B6del,_Escher,_Bach
(particularly the Achilles & Tortoise dialog) still stand today, as cringey as that book seems to me in 2025.
Anyway this is all somewhat speculative, and I don't want to overstate the "weight" of anything I seem to be claiming here. This is just the direction my interests and inclinations have taken me in.
Yeah, I know that’s not what “symbol” is really referring to here in this context but I just don’t like what the semantics of the word suggests about neural networks — that they are somehow a halting oracle or hypercomputation — which they’re obviously not.
Because personally I'm not a product/GPT wrapper person - it just doesn't suit my interests.
So then what can one do that's meaningful and valuable? Probably something around finetuning?
Feature could be referring to some addition to the user-facing product, a raster input to machine learning, or a vector entity in GeoJSON. Context is the only tool we have to make the distinction, and it gets really confusing when you're working on features that involve querying the features with features.
1) an architecture described in a paper
2) the trained weights of a specific instantiation of an architecture
3) a chunk of code/neural net that accomplishes a task, agnostic to the above definitions
That turns out to be an ultrametric loss, and the derivative of an ultrametric loss is zero in a large region around any local minimum, so it can't be trained by gradient descent -- it has to be trained by search.
Punchline: it's about one million times less effective than a more traditional architecture. https://github.com/solresol/ultratree-results
There are optimizations like extreme 1.58 bit quant that can be applied to anything.
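To make the 1.58-bit idea concrete: every weight is restricted to one of three values, -1, 0 or +1, and log2(3) is roughly 1.58 bits. Below is a minimal sketch of one way to do that, scaling by the mean absolute weight and then rounding and clipping. The absmean scaling, the function names and the eps value are my assumptions for illustration; real implementations differ in detail (per-tensor vs per-group scales, quantization-aware training, and so on).

```python
def ternary_quantize(weights, eps=1e-8):
    """Map floats to {-1, 0, +1} plus one shared scale (assumed absmean scheme)."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps  # eps avoids divide-by-zero
    q = [max(-1, min(1, round(w / scale))) for w in weights]   # round, then clip to [-1, 1]
    return q, scale


def dequantize(q, scale):
    """Approximate reconstruction of the original weights."""
    return [scale * x for x in q]


q, scale = ternary_quantize([0.9, -0.05, -1.4, 0.3])
# q is [1, 0, -1, 0]; scale is the mean absolute weight, about 0.6625
```

The appeal is that a matrix multiply against such weights reduces to additions and subtractions, which is why it can in principle be applied to any architecture with dense weight matrices.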
There are architectures that stray farther. Like SSMs and some attempts at bringing the RNN back from the dead. And even text diffusion models that try to generate paragraphs like we generate images i.e. not word by word.
Here's a paper showing KANs are no better than MLPs, if anything they are typically worse when comparing fairly. https://arxiv.org/pdf/2407.16674
I do expect transformers to be replaced eventually, but they do seem to have their own "bitter lesson" where trying to outperform them usually ends in failure.
"Where the Tsetlin machine currently excels is energy-constrained edge machine learning, where you can get up to 10000x less energy consumption and 1000x faster inference (https://www.mignon.ai). My goal is to create an alternative to BigTech’s black boxes: free, green, transparent, and logical (http://cair.uia.no)." (https://www.reddit.com/r/MachineLearning/comments/17xoj68/co...)
In the original Byte Latent Transformer paper they reintroduce ugly caching and n-grams which I'm looking to eliminate.
As expected pure byte level Transformers need some rethinking to keep them performant, some kind of matryoshka mechanism so that long predictable byte sequences (words and phrases) get grouped into a single latent vector.
The idea is to apply this "Byteformer" not just on text but also on compiled files, songs etc.
If it's impossible to scale this architecture, at least a modified tokenizer that falls back to bytes / unicode once a number or an unfamiliar word is encountered could be helpful.
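A toy sketch of what such a fallback could look like, assuming a whitespace-split word vocabulary (a real tokenizer would use subword units). The function name and the `<0xNN>` token format are hypothetical choices for illustration, not any particular library's API.

```python
def tokenize_with_byte_fallback(text, vocab):
    """Emit known words as single tokens; spell numbers and unknown words as UTF-8 bytes."""
    tokens = []
    for word in text.split():
        if word in vocab and not word.isdigit():
            tokens.append(word)
        else:
            # Fallback: one token per byte, so nothing is ever out of vocabulary.
            tokens.extend(f"<0x{b:02X}>" for b in word.encode("utf-8"))
    return tokens


tokens = tokenize_with_byte_fallback("the cat 42", {"the", "cat"})
# tokens is ['the', 'cat', '<0x34>', '<0x32>']
```

The trade-off is sequence length: every fallback word costs one token per byte, which is exactly the cost the grouping mechanism above is meant to avoid.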
Last week version 7 was released and every time they make significant improvements.
The main one is the observation that the transformer uses an amount of copper and steel proportional to the power transmitted but inversely proportional to the frequency of operation.
The copper and steel cost of a transformer is the main cost (multiplied by the cost of capital for the 100+ years it will operate).
So if you can use solid state electronics to do switching at a higher frequency (switched mode power supplies, flyback designs, etc), then you can reduce the overall cost.
Chain of thought in latent space.
1. Improving the Self-Attention in the Transformer as is, keeping the quadratic complexity, which has some theoretical advantage in principle[1]: The most hyped one is probably DeepSeek's Multi-head Latent Attention[15], which is kind of still Attention, but also somehow different.
2. Linear RNNs: This starts from Linear Attention[2], DeltaNet[3], RWKV[4], Retention[5], Gated Linear Attention[6], Mamba[7], Griffin[8], Based[9], xLSTM[10], TTT[11], Gated DeltaNet[12], Titans[13].
They all have an update like: C_{t} = F_{t} C_{t-1} + i_{t} k_{t} v_{t}^T with a cell state C and output h_{t} = C_{t}^T q_{t}. There are a few tricks that made these work, and they are now very strong competitors to Transformers. The key here is the combination of a linear associative memory (aka Hopfield Network, aka Fast Weight Programmer, aka State Expansion...) and pushing it into a sequence with gating similar to the original LSTM (input, forget, output gate), while here the gating depends only on the current input, not the previous state, for linearity. The linearity is needed to make it sequence-parallelizable; there are efforts now to add non-linearities again, but let's see. Their main benefit and downside both is that they have a fixed-size state, and therefore linear (vs Transformer-quadratic) time complexity.
For larger sizes they have become popular in hybrids with Transformer (Attention) Blocks, as there are problems with long context tasks [14]. Cool thing is they can also be distilled from pre-trained Transformers with not too much performance drop [16].
3. Along the sequence dimension most things can be categorized in these two. Attention and Linear (Associative Memory Enhanced) RNNs are heavily using Matrix Multiplications and anything else would be a waste of FLOPs on current GPUs. The essence is how to store information and how to interact with it; there might still be interesting directions, as other comments show. Other important topics that go into the depth / width of the model are: Mixture of Experts, Iteration (RNNs) in Depth[17].
Disclaimer: I'm an author of xLSTM and we recently released a 7B model [18] trained at NXAI, currently the fastest linear RNN at this scale and performance. Happy to answer more questions on this or the current state in this field of research.
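To make the update rule from item 2 concrete, here is a small pure-Python sketch, assuming scalar forget and input gates (F_t and i_t are plain floats here; real models often use vector or matrix gates computed from the current input). It runs the recurrence step by step with a fixed-size state C, and checks it against the unrolled "attention-like" form, which is what makes these models sequence-parallelizable. The function names are made up for illustration.

```python
def recurrent_form(qs, ks, vs, fs, gates_in):
    """C_t = f_t * C_{t-1} + i_t * k_t v_t^T ; h_t = C_t^T q_t, with a fixed-size state."""
    d_k, d_v = len(ks[0]), len(vs[0])
    C = [[0.0] * d_v for _ in range(d_k)]  # state size does not depend on sequence length
    outs = []
    for q, k, v, f, i in zip(qs, ks, vs, fs, gates_in):
        C = [[f * C[a][b] + i * k[a] * v[b] for b in range(d_v)] for a in range(d_k)]
        outs.append([sum(C[a][b] * q[a] for a in range(d_k)) for b in range(d_v)])
    return outs


def unrolled_form(qs, ks, vs, fs, gates_in):
    """Same outputs written as a decayed sum over the past (quadratic in length)."""
    outs = []
    for t, q in enumerate(qs):
        h = [0.0] * len(vs[0])
        for s in range(t + 1):
            decay = 1.0
            for r in range(s + 1, t + 1):  # product of forget gates between s and t
                decay *= fs[r]
            weight = decay * gates_in[s] * sum(ka * qa for ka, qa in zip(ks[s], q))
            h = [hb + weight * vb for hb, vb in zip(h, vs[s])]
        outs.append(h)
    return outs
```

The two functions agree because the recurrence is linear in C: unrolling it gives a sum over past (k, v) pairs weighted by the accumulated forget gates. Putting a non-linearity on the state would break that equivalence, which is the linearity constraint mentioned in item 2.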
[1] https://arxiv.org/abs/2008.02217
[2] https://arxiv.org/abs/2006.16236
[3] https://arxiv.org/pdf/2102.11174
[4] https://github.com/BlinkDL/RWKV
[5] https://arxiv.org/abs/2307.08621
[6] https://arxiv.org/abs/2312.06635
[7] https://arxiv.org/pdf/2312.00752
[8] https://arxiv.org/pdf/2402.19427
[9] https://arxiv.org/abs/2402.18668
[10] https://arxiv.org/abs/2405.04517
[11] https://arxiv.org/abs/2407.04620
[12] https://arxiv.org/abs/2412.06464
[13] https://arxiv.org/abs/2501.00663
[14] https://arxiv.org/abs/2406.07887
[15] https://arxiv.org/abs/2405.04434
[16] https://arxiv.org/abs/2410.10254
Edit: or perhaps you are working on a new insect sex regulation gene? If so that would be a great discussion here - https://en.wikipedia.org/wiki/Transformer_(gene)
Is it an alternative? Yes.
Is it better? Hell no.
Also, the majority of developers using version control are using Git. I guarantee the majority of developers outside the AI/ML bubble do not know what a "transformer" is.
Anyhow I suppose the existence of such questions on HN is evidence that I'm in more of a bubble than I estimated, thanks for the reality check :)
(also my comment was in defense of parent who linked the wiki page, which defines transformer as per request, and is being downvoted for that)
It's really not. "Git" has a single extremely strong definition for tech people, and a single regional slang definition. "Transformer" has multiple strong definitions for tech people, and multiple strong definitions colloquially.
Not that we can't infer the OP's meaning - just that it's nowhere near as unambiguous as "git".
But to your point, the trend towards increasing inference-time compute costs, ushered in by CoT/reasoning models, is one good reason to look for equally capable models that can be optimized for inference efficiency. Traditionally training was the main compute cost, so it's reasonable to ask if there's unexplored space there.
OP wasn’t suggesting looking for an alternative/successor to MLPs, but for an alternative/successor to transformers (while presumably still using MLPs) in the same way that transformers are an alternative/successor to LSTMs.
To truly build AI, it needs to self-configure. I tried doing some work in the past with particle swarm optimization of models, but I didn't really get anywhere.