I do wonder why diffusion models aren't used alongside constrained decoding for programming - surely that makes more sense than using an autoregressive model.
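To make the idea concrete, a toy sketch (nothing here is a real diffusion API; the model call and `grammar.allowed_at` hook are stand-ins for whatever you'd actually plug in): constrained decoding is just masking logits down to what the grammar allows, and with a diffusion LM you could apply that mask at every denoising step across all positions, instead of once per emitted token.

```python
import torch

def grammar_mask(logits, allowed_token_ids):
    # Push every token the grammar disallows to -inf so it can't be chosen.
    mask = torch.full_like(logits, float("-inf"))
    mask[..., allowed_token_ids] = 0.0
    return logits + mask

def constrained_denoise_step(model, noisy_tokens, t, grammar):
    # Hypothetical diffusion LM call: logits over every position at once.
    logits = model(noisy_tokens, t)                      # (batch, seq_len, vocab)
    for pos in range(logits.shape[1]):
        allowed = grammar.allowed_at(pos, noisy_tokens)  # hypothetical grammar hook
        logits[:, pos] = grammar_mask(logits[:, pos], allowed)
    # Greedy re-estimate of the whole sequence for this denoising step.
    return logits.argmax(dim=-1)
```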
  • MASNeo · 58 minutes ago
I wish there were more of this research into speeding things up rather than building ever larger models.
  • nl · 49 minutes ago
Why not both?

Scaling laws are real! But they don't preclude faster processing.

  • nl · 38 minutes ago
Releasing this on the same day as Taalas's 16,000 token-per-second acceleration for the roughly comparable Llama 8B model must hurt!

I wonder how far down they can scale a diffusion LM? I've been playing with in-browser models, and the speed is painful.

https://taalas.com/products/

Nothing to do with each other. This is a general optimization; Taalas' is an ASIC that runs a tiny 8B model entirely from SRAM.

But I wonder how Taalas' product can scale. Making a custom chip for one single tiny model is very different from running models with trillions of parameters for a billion users.

Roughly 53B transistors for every 8B params. For a 2T-param model you'd need about 13 trillion transistors, assuming it scales linearly. And one chip draws 2.5 kW of power? That's the power of about 4 H100 GPUs. How does it draw so much?

If you assume the frontier model is around 1.5 trillion params, you'd need an entire N5 wafer to run it.
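Back-of-envelope on that scaling (assuming the transistor-per-param ratio really does stay linear; the only inputs are the numbers quoted above):

```python
# Linear extrapolation from the figures above: ~53B transistors for an 8B-param model.
transistors_per_param = 53e9 / 8e9  # ~6.6 transistors per parameter

for params in (8e9, 1.5e12, 2e12):
    transistors = params * transistors_per_param
    print(f"{params:.2e} params -> ~{transistors / 1e12:.1f}T transistors")

# Power: 2.5 kW per chip vs. a ~700 W H100 (assumed SXM TDP).
print(f"~{2.5e3 / 700:.1f}x an H100's TDP")
```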

Very interesting tech for edge inference though. Robots and self-driving could make use of these in the distant future if the power draw comes down drastically.

  • LASR · 15 minutes ago
Just tried this. Holy fuck.

I'd take an army of high-school graduate LLMs to build my agentic applications over a couple of genius LLMs any day.

This is a whole new paradigm of AI.

Is anyone doing any form of diffusion language model that's actually practical to run today on the actual machine under my desk? There are loads of more "traditional" .gguf options (well, quants) that are practical even on shockingly weak hardware, and I've been seeing things that give me hope that diffusion is the next step forward, but so far it's all been early research prototypes.
Based on my experience running diffusion image models, I really hope this isn't going to take over anytime soon. Parallel decoding may be great if you have a nice parallel GPU or NPU, but it's dog slow on CPUs.
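Rough intuition for why (a toy cost model, not a benchmark; the step counts and unit cost are made up):

```python
# Toy cost model: autoregressive decoding does T cheap sequential steps (one new
# token each, with a KV cache), while a diffusion LM does K full-width passes over
# all T positions. A GPU can absorb the extra width per step; a CPU mostly can't.

def autoregressive_cost(T, unit=1.0):
    return T * unit            # T sequential steps, ~1 token of work each

def diffusion_cost(T, K, unit=1.0):
    return K * T * unit        # K sequential steps, T tokens of work each

T, K = 512, 32
print(autoregressive_cost(T))  # 512 units of work over 512 sequential steps
print(diffusion_cost(T, K))    # 16384 units of work, but only 32 sequential steps
```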
Google is working on a similar line of research. Wonder why they haven't rolled out a GPT-4o-scale version of this yet.
Probably because it's expensive.

But I wish there were more "let's scale this thing to the skies" experiments from those who actually can afford to scale things to the skies.

  • yorwba · 2 minutes ago
Scaling laws mean that there's not much need to actually scale things to the skies. Instead, you can run a bunch of experiments at small scale, fit the scaling law parameters, then extrapolate. If the predicted outcome is disappointing (e.g. it's unlikely to beat the previous scaled-to-the-sky model), you can save the really expensive experiment for a more promising approach.
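For concreteness, that workflow is roughly a curve fit plus an extrapolation, something like this (the Chinchilla-style functional form and every number here are invented for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical small-scale runs: (model params, eval loss). Numbers are made up.
N = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
L = np.array([3.10, 2.85, 2.60, 2.42, 2.28])

def scaling_law(n, a, alpha, c):
    # Chinchilla-style power law: loss = a * N^-alpha + irreducible loss c
    return a * n ** (-alpha) + c

(a, alpha, c), _ = curve_fit(scaling_law, N, L, p0=(10.0, 0.1, 1.5), maxfev=20000)

# Extrapolate to a hypothetical sky-scale run before paying for it.
predicted = scaling_law(1e12, a, alpha, c)
print(f"alpha = {alpha:.3f}, predicted loss at 1T params: {predicted:.2f}")
```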

It would certainly be nice though if this kind of negative result was published more often instead of leaving people to guess why a seemingly useful innovation wasn't adopted in the end.

If this means there's a 2x-7x speedup available for a scaled diffusion model like Inception Mercury, that'll be a game changer. It feels 10x faster already…
Diffusion language models seem poised to smash purely autoregressive models. I'm giving it 1-2 years.
Feels like the sodium ion battery vs lithium ion battery thing, where there are theoretical benefits of one but the other has such a head start on commercialization that it'll take a long time to catch up.
Same with digital vs analog