This is huge for home lab setups. You can run a Pi 5 with a high-end GPU via external enclosure and get 90% of the performance of a full workstation at a fraction of the power draw and cost.
The multi-GPU results make sense too - without tensor parallelism, you're just pipeline parallelism across layers, which is inherently sequential. The GPUs are literally sitting idle waiting for the previous layer's output. Exo and similar frameworks are trying to solve this but it's still early days.
For anyone considering this: watch out for ResizeBAR requirements. Some older boards won't work at all without it.
In this video Jeff is interested in GPU accelerated tasks like AI and Jellyfin. His last video was using a stack of 4 Mac Studios connected by Thunderbolt for AI stuff.
https://www.youtube.com/watch?v=x4_RsUxRjKU
The Apple chips have both power CPU and GPU cores but also have a huge amount of memory (512GB) directly connected unlike most Nvidia consumer level GPUs that have far less memory.
Right now, sure. There's a reason why chip manufacturers are adding AI pipelines, tensor processors, and 'neural cores' though. They believe that running small local models are going to be a popular feature in the future. They might be right.
They were a pretty big deal back in ~2010, and I have to admit I didn't know that Tegra was powering Nintendo Switch.
To this day, it's the best mobile/Android device I ever owned. I don't know if it was the fastest, but it certainly was the best performing one I ever had. UI interactions were smooth, apps were fast on it, screen was bright, touch was perfect and still had long enough battery backup. The device felt very thin and light, but sturdy at the same time. It had a pleasant matte finish and a magnetic cover that lasted as long as the device did. It spolied the feel of later tablets for me.
It had only 1 GB RAM. We have much more powerful SoCs today. But nothing ever felt that smooth (iPhone is not considered). I don't know why it was so. Perhaps Android was light enough for it back then. Or it may have had a very good selection and integration of subcomponents. I was very disappointed when Nvidia discontinued the Tegra SoC family and tablets.
It leaves one to wonder what could be if they had any appetite for devices more in the consumer realm of things.
While the PCs were still displaying text, or if you were lucky to own an Hercules card, gray text, or maybe a CGA one, with 4 colours.
While the Amigas, which I am more confortable with, were doing this in the mid-80's:
https://www.youtube.com/watch?v=x7Px-ZkObTo
https://www.youtube.com/watch?v=-ga41edXw3A
The original Amiga 1000, had on its motherboard, later reduced to fit into an Amiga 500,
Motorola 68000 CPU, a programmable sounds chip with DMA channels (Paula), and a programable blitter chip (Agnus aka early GPUs).
You would build in RAM the audio, or graphics instructions for the respetive chipset, set the DMA parameters, and let them lose.
What it offered you was a page of memory where each byte value mapped to a character in ROM. You feed in your text and the controller fetches the character pixels and puts them on the display. Later we got ASCII box drawing characters. Then we got sprite systems like the NES, where the Picture Processing Unit handles loading pixels and moving sprites around the screen.
Eventually we moved on to raw framebuffers. You get a big chunk of memory and you draw the pixels yourself. The hardware was responsible for swapping the framebuffers and doing the rendering on the physical display.
Along the way we slowly got more features like defining a triangle, its texture, and how to move it, instead of doing it all in software.
Up until the 90s when the modern concept of a GPU coalesced, we were mainly pushing pixels by hand onto the screen. Wild times.
The history of display processing is obviously a lot more nuanced than that, it's pretty interesting if that's your kind of thing.
We have had the opposite problem for 35+ years at this point. The newer architecture machines like the Apple machines, the GB10, the AI 395+ do share memory between GPU and CPU but in a different way, I believe.
I'd argue with memory becoming suddenly much more expensive we'll probably see the opposite trend. I'm going to get me one of these GB10 or Strix Halo machines ASAP because I think with RAM prices skyrocketing we won't be seeing more of this kind of thing in the consumer market for a long time. Or at least, prices will not be dropping any time soon.
[0] - Not fully correct, as there are/were extensions cards that override the bus, thus replacing one of the said chips, on Amiga case.
It would take a lot of work to make a GPU do current CPU type tasks, but it would be interesting to see how it changes parallelism and our approach to logic in code.
HN isn't always very rational about voting. It will be a loss if you judge any idea on their basis.
> It would take a lot of work to make a GPU do current CPU type tasks
In my opinion, that would be counterproductive. The advantage of GPUs is that they have a large number of very simple GPU cores. Instead, just do a few separate CPU cores on the same die, or on a separate die. Or you could even have a forest of GPU cores with a few CPU cores interspersed among them - sort of like how modern FPGAs have logic tiles, memory tiles and CPU tiles spread out on it. I doubt it would be called a GPU at that point.
Also you'd need to add extra hardware for various OS support functions (privilege levels, address space translation/MMU) that are currently missing from the GPU. But the idea is otherwise sound, you can think of the 'Mill' proposed CPU architecture as one variety of it.
Perhaps I should have phrased it differently. CPU and GPU cores are designed for different types of loads. The rest of your comment seems similar to what I was imagining.
Still, I don't think that enhancing the GPU cores with CPU capabilities (OOE, rings, MMU, etc from your examples) is the best idea. You may end up with the advantages of neither and the disadvantages of both. I was suggesting that you could instead have a few dedicated CPU cores distributed among the numerous GPU cores. Finding the right balance of GPU to CPU cores may be the key to achieving the best performance on such a system.
In mobile SoCs a good chunk of this is power efficiency. On a battery-powered device, there's always going to be a tradeoff to spend die area making something like 4K video playback more power efficient, versus general purpose compute
Desktop-focussed SKUs are more liable to spend a metric ton of die area on bigger caches close to your compute.
> If you look at your typical phone or laptop SoC, the CPU is only a small part.
Keep in mind that the die area doesn't always correspond to the throughput (average rate) of the computations done on it. That area may be allocated for a higher computational bandwidth (peak rate) and lower latency. Or in other words, get the results of a large number of computations faster, even if it means that the circuits idle for the rest of the cycles. I don't know the situation on mobile SoCs with regards to those quantities.
As for how the HW looks like we already know. Look at Strix Halo as an example. We are just getting bigger and bigger integrated GPUs. Most of the flops on that chip is the GPU part.
Also, I'd say if you buy for example a Macbook with an M4 Pro chip, it is already is a big GPU attached to a small CPU.
Anyway, we're still stuck with "G" for "graphics" so it all doesn't make much sense and I'm actually looking for a vendor that takes its mission more seriously.
In the sense that the RAM is fully integrated, anyways.
It's very well known that most LLM frameworks including llama.cpp splits models by layers, which has sequential dependency, and so multi GPU setups are completely stalled unless there are n_gpu users/tasks running in parallel. It's also known that some GPUs are faster in "prompt processing" and some in "token generation" that combining Radeon and NVIDIA does something sometimes. Reportedly the inter-layer transfer sizes are in kilobyte ranges and PCIe x1 is plenty or something.
It takes appropriate backends with "tensor parallel" mode support, which splits the neural network parallel to the direction of flow of data, which also obviously benefit substantially from good node interconnect between GPUs like PCIe x16 or NVlink/Infinity Fabric bridge cables, and/or inter-GPU DMA over PCIe(called GPU P2P or GPUdirect or some lingo like that).
Absent those, I've read somewhere that people can sometimes see GPU utilization spikes walking over GPUs on nvtop-style tools.
Looking for a way to break up tasks for LLMs so that there will be multiple tasks to run concurrently would be interesting, maybe like creating one "manager" and few "delegated engineers" personalities. Or simulating multiple different domains of brain such as speech center, visual cortex, language center, etc. communicating in tokens might be interesting in working around this problem.
[1] https://github.com/exo-explore/exo [2] https://www.youtube.com/watch?v=x4_RsUxRjKU
This is pretty much what "agents" are for. The manager model constructs prompts and contexts that the delegated models can work on in parallel, returning results when they're done.
Not an expert, but napkin math tells me that more often that not this will be in the order of megabytes—not kilobytes—since it scales with sequence length.
Example: Qwen3 30B has a hidden state size of 5120, even if quantized to 8 bits that's 5120 bytes per token. It would pass the MB boundary with just a little over 200 tokens. Still not much of an issue when a single PCIe lane is ~2GB/s.
I think device to device latency is more of an issue here, but I don't know enough to assert that with confidence.
Oh, I thought the point of transformers was being able to split the load veritcally to avoid seqential dependancies. Is it true just for training or not at all?
Source code: https://github.com/BinSquare/inferbench
---
Edit: Fixed
So it's just a static value in this hardware list: https://github.com/BinSquare/inferbench/blob/main/src/lib/ha...
Let me know if you know of a better way, or contribute :D
Perhaps some filter could cut out submissions that don't really make sense?
So while you don't need a ton of compute on the CPU you do need the ability address multiple PCIe lanes. A relatively low-spec AMD EPYC processor is fine if the motherboard exposes enough lanes.
There are larger/better models as well, but those tend to really push the limits of 96gb.
FWIW when you start pushing into 128gb+, the ~500gb models really start to become attractive because at that point you’re probably wanting just a bit more out of everything.
Smaller open source models are a bit like 3d printing in the early days; fun to experiment with but really not that valuable for anything other than making toys.
Text summarization, maybe? But even then I want a model that understands the complete context and does a good job. Even things like "generate one sentence about the action we're performing" I usually find I can just incorporate it into the output schema of a larger request instead of making a separate request to a smaller model.
If you buy a 15k AUD rtx 6000 96GB, that card will _never_ pay for itself on a gpt-oss:120b workload vs just using openrouter - no matter how many tokens you push through it - because the cost of residential power in Australia means you cannot generate tokens cheaper than the cloud even if the card were free.
This so doesn't really matter to your overall point which I agree with but:
The rise of rooftop solar and home battery energy storage flips this a bit now in Australia, IMO. At least where I live, every house has a solar panel on it.
Not worth it just for local LLM usage, but an interesting change to energy economics IMO!
- You can use the GPU for training and run your own fine tuned models
- You can have much higher generation speeds
- You can sell the GPU on the used market in ~2 years time for a significant portion of its value
- You can run other types of models like image, audio or video generation that are not available via an API, or cost significantly more
- Psychologically, you don’t feel like you have to constrain your token spending and you can, for instance, just leave an agent to run for hours or overnight without feeling bad that you just “wasted” $20
- You won’t be running the GPU at max power constantly
The recent Gemma 3 models, which are produced by Google (a little startup - heard of em?) outperform the last several OpenAI releases.
Closed does not necessarily mean better. Plus the local ones can be finetuned to whatever use case you may have, won't have any inputs blocked by censorship functionality, and you can optimize them by distilling to whatever spec you need.
Anyway all that is extraneous detail - the important thing is to decouple "open" and "small" from "worse" in your mind. The most recent Gemma 3 model specifically is incredible, and it makes sense, given that Google has access to many times more data than OpenAI for training (something like a factor of 10 at least). Which is of course a very straightforward idea to wrap your head around, Google was scrapign the internet for decades before OpenAI even entered the scene.
So just because their Gemma model is released in an open-source (open weights) way, doesn't mean it should be discounted. There's no magic voodoo happening behind the scenes at OpenAI or Anthropic; the models are essentially of the same type. But Google releases theirs to undercut the profitability of their competitors.
You can add more channels, sure, but each channel makes it less and less likely for you to boot. Look at modern AM5 struggling to boot at over 6000 with more than two sticks.
So you’d have to get an insane six channels to match the bus width, at which point your only choice to be stable would be to lower the speed so much that you’re back to the same orders of magnitude difference, really.
Now we could instead solder that RAM, move it closer to the GPU and cross-link channels to reduce noise. We could also increase the speed and oh, we just invented soldered-on GDDR…
The bus width is the number of channels. They don't call them channels when they're soldered but 384 is already the equivalent of 6. The premise is that you would have more. Dual socket Epyc systems already have 24 channels (12 channels per socket). It costs money but so does 256GB of GDDR.
> Look at modern AM5 struggling to boot at over 6000 with more than two sticks.
The relevant number for this is the number of sticks per channel. With 16 channels and 64GB sticks you could have 1TB of RAM with only one stick per channel. Use CAMM2 instead of DIMMs and you get the same speed and capacity from 8 slots.
But boy, a standard GPU socket so you could easily BYO cooler would be nice.
The problem is that the signals and features that the motherboard and CPU expect are different between generations. We use different sockets on different generations to prevent you plugging in incompatible CPUs.
We used to have cross-generational sockets in the 386 era because the hardware supported it. Motherboards weren't changing so you could just upgrade the CPU. But then the CPUs needed different voltages than before for performance. So we needed a new socket to not blow up your CPU with the wrong voltage.
That's where we are today. Each generation of CPU wants different voltages, power, signals, a specific chipset, etc. Within the same +-1 generation you can swap CPUs because they're electrically compatible.
To have universal CPU sockets, we'd need a universal electrical interface standard, which is too much of a moving target.
AMD would probably love to never have to tool up a new CPU socket. They don't make money on the motherboard you have to buy. But the old motherboards just can't support new CPUs. Thus, new socket.
Although... Is it possible to pair a fast GPU with one? Right now my inference setup for large MoE LLMs has shared experts in system memory, with KV cache and dense parts on a GPU, and a Spark would do a better job of handling the experts than my PC, if only it could talk to a fast GPU.
[edit] Oof, I forgot these have only 128GB of RAM. I take it all back, I still don’t find them compelling.
> Asus made a crypto-mining motherboard that supports up to 20 GPUs
https://www.theverge.com/2018/5/30/17408610/asus-crypto-mini...
For LLMs you'll probably want a different setup, with some memory too, some m.2 storage.
Generally, scalability on consumer GPUs falls off between 4-8 GPUs for most. Those running more GPUs are typically using a higher quantity of smaller GPUs for cost effectiveness.
Yes. They're basically laptop chips at this point. The thermals are worse but the chips are perfectly modern and can handle reasonably large workloads. I've got an 8 core Ryzen 7 with Radeon 780 Graphics and 96GB of DDR5. Outside of AAA gaming this thing is absolutely fine.
The power draw is a huge win for me. It's like 6W at idle. I live remotely so grid power is somewhat unreliable and saving watts when using solar batteries extends their lifetime massively. I'm thrilled with them.
However, a full-size desktop computer seldom makes sense as a personal computer, i.e. as the computer that interfaces to a human via display, keyboard and graphic pointer.
For most of the activities done directly by a human, i.e. reading & editing documents, browsing Internet, watching movies and so on, a mini-PC is powerful enough. The only exception is playing games designed for big GPUs, but there are many computer users who are not gamers.
In most cases the optimal setup is to use a mini-PC as your personal computer and a full-size desktop as a server on which you can launch any time-consuming tasks, e.g. compilation of big software projects, EDA/CAD simulations, testing suites etc.
The desktop used as server can use Wake-on-LAN to stay powered off when not needed and wake up whenever it must run some task remotely.
Also, having to buy two computers also costs money. It makes sense to use 1 for both use cases if you have to buy the desktop anyway.
It’s just 1 vCPU with 4 Gb ram, and you know what? It’s more than enough for these needs. I think hardware manufactures falsely convinced us that every professional needs beefy laptop to be productive.
Keeps the desk nice and tidy while “the beasts” roar in a soundproofed closet.
With the right software support from say pytorch this could suddenly make old GPUs and underpowered PCs like in TFA into very attractive and competitive solutions for training and inference.
That would help in latency-constrained workloads, but I don't think it would make much of a difference for AI or most HPC applications.
I do wonder how long the cards are going to need host systems at all. We've already seen GPUs with m.2 ssd attached! Radeon Pro SSG hails back from 2016! You still need a way to get the model on that in the first place to get work in and out, but a 1Gbe and small RISC-V chip (which Nvidia already uses formanagement cores) could suffice. Maybe even an rpi on the card. https://www.techpowerup.com/224434/amd-announces-the-radeon-...
Given the gobs of memory cards have, they probably don't even need storage; they just need big pipes. Intel had 100Gbe on their Xeon & Xeon Phi cores (10x what we saw here!) in 2016! GPUs that just plug into the switch and talk across 400Gbe or UltraEthernet or switched CXL, that run semi independently, feel so sensible, so not outlandish. https://www.servethehome.com/next-generation-interconnect-in...
It's far off for now, but flash makers are also looking at radically many channel flash, which can provide absurdly high GB/s, High Bandwidth Flash. And potentially integrated some extremely parallel tensorcores on each channel. Switching from DRAM to flash for AI processing could be a colossal win for fitting large models cost effectively (& perhaps power efficiently) while still having ridiculous gobs of bandwidth. With that possible win of doing processing & filtering extremely near to the data too. https://www.tomshardware.com/tech-industry/sandisk-and-sk-hy...
Of course prefill is going to be GPU bound. You only send a few thousand bytes to it, and don't really ask to return much. But after prefill is done, unless you use batched mode, you aren't really using your GPU for anything more that it's VRAM bandwidth.
If you prefer not to see his posts on the HN list pages, a practical solution is to use a browser extension (such as Stylus) to customise the HN styling to hide the posts.
Here is a specific CSS style which will hide submissions from Jeff's website:
tr.submission:has(td a[href="from?site=jeffgeerling.com"]),
tr.submission:has(td a[href="from?site=jeffgeerling.com"]) + tr,
tr.submission:has(td a[href="from?site=jeffgeerling.com"]) + tr + tr {
opacity: 0.05
}
In this example, I've made it almost invisible, whilst it still takes up space on the screen (to avoid confusion about the post number increasing from N to N+2). You could use { display: none } to completely hide the relevant posts.The approach can be modified to suit any origin you prefer to not come across.
The limitation is that the style modification may need refactoring if HN changes the markup structure.