Oh, they've been here a while...

Are chiplets enough to save Moore’s Law? (68 points, 1 year ago, 73 comments) https://news.ycombinator.com/item?id=36371651

Intel enters a new era of chiplets (128 points, 2 years ago, 164 comments) https://news.ycombinator.com/item?id=32619824

U.S. focuses on invigorating ‘chiplets’ to stay cutting-edge in tech (85 points, 2 years ago, 48 comments) https://news.ycombinator.com/item?id=35917361

Intel, AMD, and other industry heavyweights create a new standard for chiplets (10 points, 3 years ago) https://news.ycombinator.com/item?id=30533548

Intel Brings Chiplets to Data Center CPUs (49 points, 3 years ago, 50 comments) https://news.ycombinator.com/item?id=28250076

Deep Dive: Nvidia Inference Research Chip Scales to 32 Chiplets (47 points, 5 years ago) https://news.ycombinator.com/item?id=20343550

Chiplets Gaining Steam (4 points, 7 years ago) https://news.ycombinator.com/item?id=15636572

Tiny Chiplets: A New Level of Micro Manufacturing (36 points, 12 years ago, 12 comments) https://news.ycombinator.com/item?id=5517695

I can go earlier: I was designing them in the 90s.
Chip-on-board has been around since the 1970s or '80s in some consumer products. Putting dies on a PCB isn't the same thing, and they were wire-bonded, but the idea of connecting directly to the die was already there.
nyeah · 2 weeks ago
Reliability seems to have improved a lot, but yeah.
I worked on a small team that created a CPU in the late 1990s and we stacked DDR dies on top of the CPU die and ran vertical traces. It was cool to have the entire computer inside one package.
I imagine cooling would be the main issue in scaling this.
nyeah · 2 weeks ago
Yep. And under the name "multi-chip module" it goes back way further than that.
EE Times is behind the times, for sure.
The biggest concern I have with chiplets is the impact on concepts like async/await. If the runtime isn't NUMA-aware, it will wind up with cross-die continuations of logical work, increasing latency and burning interconnect bandwidth. Even if it is aware of the architecture, you still have a messy, non-obvious decision to make about whether or not to take the latency hit.

Managing your own threads and always assuming the worst-case scenario (i.e., that they're all a microsecond away from each other) will give you the most defensive solution as architectures change over time.

I really don't know the best answer. Another NUMA node implies some unavoidable additional complexity. It would seem the wonder of 192 cores in a socket comes with tradeoffs that the software people now need to compensate for. You can't get all that magic for free.
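For illustration, here's a minimal sketch of that defensive pinning approach on Linux (assuming glibc's pthread_setaffinity_np; the core numbers are placeholders, not a real chiplet map):

    /* Minimal sketch: pin each worker thread to a fixed core so its work
     * stays on one die instead of migrating across the interconnect.
     * Build: gcc -pthread pin.c */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void *worker(void *arg) {
        long core = (long)arg;        /* placeholder core id, not a real topology map */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        /* ... latency-sensitive work runs here without hopping dies ... */
        printf("worker pinned to core %ld\n", core);
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (long i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

Pinning like this trades scheduler flexibility for predictable placement, which is roughly the "assume the worst case" posture described above.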

Chiplets don't require NUMA. The Ryzen 9s have their cores split between two chiplets and still have UMA. Of course, if you're on over a hundred cores you will be on a NUMA platform, but NUMA isn't really new, either, and has been around since at least the first multi-socket systems. Async/await isn't more of a problem with it than regular old preemption, and that has generally been solved through virtualization and/or per-process processor affinity.
> Chiplets don't require NUMA

The nuance is that even if you "hide" the extra NUMA tier and have oversized lanes connecting everything, you still have some cores that are further away than others in time. It doesn't matter how it shows up in device manager; the physical constraint is still there.

Ryzen chips have a different network topology than Threadripper and EPYC chips. All external communication goes through the IO die, so any given core on either processor die is the same effective distance from the memory modules. The speed of signal propagation being finite, individual cores within the same die might have different effective latencies, depending on how the bus is routed internally (although that's a problem you have on any single-die multicore design), but it's not possible when comparing cores at mirrored positions on opposite dies.

Threadripper and EPYC are totally different. They actually have NUMA topologies, where different dies are wired such that they have more distance to traverse to reach a given memory module, and accessing the farther ones involves some negotiation with a routing component.
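If you want to see how (or whether) that extra distance is actually exposed to software, here's a minimal sketch using libnuma (an assumption on my part: Linux with libnuma installed, linked with -lnuma) that prints the node distance table the firmware reports:

    /* Minimal sketch: dump the NUMA node distance matrix the OS/firmware
     * reports; a UMA machine shows up as a single node.
     * Build: gcc dist.c -lnuma */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            printf("NUMA API not available: the OS presents a single (UMA) node\n");
            return 0;
        }
        int nodes = numa_max_node() + 1;
        for (int i = 0; i < nodes; i++) {
            for (int j = 0; j < nodes; j++)
                printf("%4d", numa_distance(i, j)); /* local is conventionally 10; larger means farther */
            printf("\n");
        }
        return 0;
    }

On a desktop Ryzen this typically prints a single row; on a multi-die NUMA part you get a matrix whose off-diagonal entries are larger than the diagonal.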

I don't see how the physical constraint matters if you can't observe it as a performance penalty. I don't really know whether that's the case, but it's what the person you're replying to is implying.
My gen-1 EPYC chip appeared as 4 NUMA nodes. It seems like they've moved away from that, though.
I believe the first generation of Zen Epyc chips was designed to be scalable in the sense that each die had its own CCX, memory controllers, and serdes blocks (for PCIe, SATA, and 10GKR). A single-die Epyc was a single NUMA node, a dual-die Epyc was 2 NUMA nodes, and a quad-die Epyc was 4 NUMA nodes. All the resources from the other dies really were in other NUMA nodes: accessing another die's caches, memory controllers, or serdes blocks required using the Infinity Fabric.

In later generations of Zen Epyc CPUs, there's a central IO die that has the memory controllers and serdes blocks, plus small CCX dies that effectively just have CPU cores and caches on them. This allowed different process nodes or manufacturing technologies to be used to optimize each die for its specific tasks. It also allowed much more modular configurations of CPU core counts and frequencies while still providing a high number of memory controllers and serdes lanes, to address different market demands.
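As a rough illustration of what that meant for software on the gen-1 parts, here's a small libnuma sketch (again assuming Linux with libnuma, linked with -lnuma; node 0 is just an example choice) that counts the exposed nodes and places a buffer on a specific one:

    /* Minimal sketch: count exposed NUMA nodes and back a buffer with memory
     * from one chosen node. Build: gcc nodes.c -lnuma */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            printf("single UMA node\n");
            return 0;
        }
        printf("configured NUMA nodes: %d\n", numa_num_configured_nodes());
        size_t len = 1 << 20;                    /* 1 MiB example buffer */
        void *buf = numa_alloc_onnode(len, 0);   /* allocate on node 0 (example) */
        if (buf) {
            /* ... work on data that should stay close to node 0's cores ... */
            numa_free(buf, len);
        }
        return 0;
    }

On a gen-1 four-die Epyc, where each node's memory controllers lived on a specific die, placement like this was the difference between a local access and a trip over the Infinity Fabric.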

Back in 1997, if you bought a Pentium II [1], you didn't just get a single chip with all the cache made on the same process. Instead, the CPU and the cache were made on different processes, tested, and then integrated into a 'cartridge' you plugged into the 'CPU slot'.

[1] https://en.wikipedia.org/wiki/Pentium_II

1997 was also a great time to be a Celeron owner; IIRC all you had to do was cover one contact on the "cartridge" with nail polish, and you could trivially overclock it beyond what the much more expensive Pentium II achieved out of the box. That easily kept the rig good enough for many of the hits that followed: StarCraft, Settlers 3, HoMM3, WA, heck, even The Sims was playable.
I always thought that was a cool idea, and I was disappointed when it was essentially abandoned after the PIII. I still think there's value there for things like customizing processors for certain applications (having a standard base chip and external interface, but being able to customize the rest of the processor board).
So this is like an SoC, but with more separation between the blocks because they're on different dies?

(There are times when I'd rather ask HN than AI.)

Is it kind of like taking a step backward: going back to the ancient classic circuit board with multiple chips and shrinking that design as-is, but without putting everything on one die?

m3kw9 · 2 weeks ago
This time for real