Cool, though it maybe needs a better name for SEO. ARIA already has a meaning in web apps.
I'm here all night if anyone else needs some other lazy name suggestions.
If it helps, MoEs aren't just disparate 'expert' models trained on specific domain knowledge and jammed into a bigger model; they're copies of the same base model trained together, where each expert ends up specialising on individual tokens. As the image dartos linked shows, you can end up with 'experts' in the model that really, really like placing punctuation or language syntax, for whatever reason.
All experts are loaded in, as any of them could be called upon to generate the next token.
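If a rough sketch helps make that concrete, a single MoE feed-forward layer looks roughly like this (loosely Mixtral-style top-2 routing; the sizes and names are made up for illustration, not any particular model's actual code):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoELayer(nn.Module):
        def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
            super().__init__()
            # Every expert has the same architecture; they only differ in
            # their learned weights, and all of them have to sit in memory.
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                              nn.Linear(d_ff, d_model))
                for _ in range(n_experts))
            # The "router": a small linear layer scoring every expert per token.
            self.gate = nn.Linear(d_model, n_experts)
            self.top_k = top_k

        def forward(self, x):                        # x: (n_tokens, d_model)
            scores = self.gate(x)                    # (n_tokens, n_experts)
            weights, chosen = scores.topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)     # mix the chosen experts
            out = torch.zeros_like(x)
            # Each token only runs through its top-k experts, but any token
            # could pick any expert, so none of them can be left unloaded.
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = chosen[:, slot] == e
                    if mask.any():
                        out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
            return out

    layer = MoELayer()
    print(layer(torch.randn(10, 64)).shape)          # torch.Size([10, 64])

The gate here is just a linear layer trained along with the experts, which is part of why the 'specialties' come out as weird token-level things like punctuation rather than clean human topics.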
Thanks a lot.
https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_pr...
Each different color highlight is generated by a different expert.
You can see that the "experts" are more experts of syntax than of concepts. Notice how the light blue one almost always generates punctuation and operators (until later layers, when the red one does so).
I'm honestly not too sure of the mechanism behind which expert gets chosen. I'm sure it's encoded in the weights somehow, but I haven't gone too deep into MoE models.
Then you get some strange cases where different parts of a single word are generated by different experts.
Makes me think that there’s room for improvement in the expert selection machinery, but I don’t know enough about it to speculate.
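For what it's worth, my (hedged) understanding is that the selection machinery is tiny: per MoE layer it's usually just one learned matrix that scores every expert for each token, trained jointly with everything else, so it really is just "encoded in the weights". A toy illustration (made-up sizes, not any real model's code):

    import torch
    import torch.nn as nn

    d_model, n_experts = 64, 8                 # toy sizes

    # The entire per-layer "selection machinery": one d_model x n_experts matrix.
    router = nn.Linear(d_model, n_experts, bias=False)

    tokens = torch.randn(5, d_model)           # 5 token embeddings
    scores = router(tokens)                    # (5, n_experts) affinity scores
    print(scores.argmax(dim=-1))               # e.g. tensor([3, 3, 0, 7, 3])
    # Colouring each token by that index (per layer) is basically how you get
    # a picture like the one linked above.

Improving it would mostly mean changing how that router is trained (e.g. the load-balancing losses used in Switch-Transformer-style models), which as far as I know is still an open area.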
(Also, beware: molmo.org is an AI-generated website that piggybacks on Allen AI's efforts through SEO; the real website is molmo.allenai.org. Note, for instance, that all the tweets listed there are from fake accounts that have since been suspended: https://molmo.org/#how-to-use)