Cool, though it maybe needs a better name for SEO. ARIA already has a meaning in web apps.
I'm here all night if anyone else needs some other lazy name suggestions.
If it helps, MoEs aren't just disparate 'expert' models trained on specific domain knowledge and jammed into a bigger model; they're copies of the same base model trained together, where each expert ends up specialising on individual tokens. As the image dartos linked shows, you can end up with 'experts' in the model that really, really like placing punctuation or language syntax, for whatever reason.
All experts are loaded in, as any of them could be called upon to generate the next token.
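If a rough sketch helps make that concrete, a single MoE feed-forward layer looks roughly like this (loosely Mixtral-style top-2 routing; the sizes and names are made up for illustration, not any particular model's actual code):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoELayer(nn.Module):
        def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
            super().__init__()
            # Every expert has the same architecture; they only differ in
            # their learned weights, and all of them have to sit in memory.
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                              nn.Linear(d_ff, d_model))
                for _ in range(n_experts))
            # The "router": a small linear layer scoring every expert per token.
            self.gate = nn.Linear(d_model, n_experts)
            self.top_k = top_k

        def forward(self, x):                        # x: (n_tokens, d_model)
            scores = self.gate(x)                    # (n_tokens, n_experts)
            weights, chosen = scores.topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)     # mix the chosen experts
            out = torch.zeros_like(x)
            # Each token only runs through its top-k experts, but any token
            # could pick any expert, so none of them can be left unloaded.
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = chosen[:, slot] == e
                    if mask.any():
                        out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
            return out

    layer = MoELayer()
    print(layer(torch.randn(10, 64)).shape)          # torch.Size([10, 64])

The gate here is just a linear layer trained along with the experts, which is part of why the 'specialties' come out as weird token-level things like punctuation rather than clean human topics.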
Thanks a lot.
https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_pr...
Each different color highlight is generated by a different expert.
You can see that the "experts" are more experts of syntax than of concepts. Notice how the light blue one almost always generates punctuation and operators (until later layers, when the red one does so).
I'm honestly not too sure of the mechanism behind which expert gets chosen. I'm sure it's encoded in the weights somehow, but I haven't gone too deep into MoE models.
Then you get some strange cases where different parts of a single word are generated by different experts.
Makes me think that there’s room for improvement in the expert selection machinery, but I don’t know enough about it to speculate.
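For what it's worth, my (hedged) understanding is that the selection machinery is tiny: per MoE layer it's usually just one learned matrix that scores every expert for each token, trained jointly with everything else, so it really is just "encoded in the weights". A toy illustration (made-up sizes, not any real model's code):

    import torch
    import torch.nn as nn

    d_model, n_experts = 64, 8                 # toy sizes

    # The entire per-layer "selection machinery": one d_model x n_experts matrix.
    router = nn.Linear(d_model, n_experts, bias=False)

    tokens = torch.randn(5, d_model)           # 5 token embeddings
    scores = router(tokens)                    # (5, n_experts) affinity scores
    print(scores.argmax(dim=-1))               # e.g. tensor([3, 3, 0, 7, 3])
    # Colouring each token by that index (per layer) is basically how you get
    # a picture like the one linked above.

Improving it would mostly mean changing how that router is trained (e.g. the load-balancing losses used in Switch-Transformer-style models), which as far as I know is still an open area.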
(Also, beware: molmo.org is an AI-generated website that piggybacks on Allen AI's efforts through SEO; the real website is molmo.allenai.org. Note, for instance, that all the tweets listed there are from fake accounts that have since been suspended: https://molmo.org/#how-to-use)