I've written about it at length and I'm sure that anyone who's seen my comments is sick of me sounding like a broken record. But there's truly a vast realm of uncharted territory there. I believe that transputers and reprogrammable logic chips like FPGAs failed because we didn't have languages like Erlang/Go and GNU Octave/MATLAB to orchestrate a large number of processes or handle SIMD/MIMD simultaneously. Modern techniques like passing by value via copy-on-write (used by UNIX forking, PHP arrays and Clojure state) were suppressed when mainstream imperative languages using pointers and references captured the market. And it's really hard to beat Amdahl's law when we're worried about side effects. I think that anxiety is what inspired Rust, but there are so many easier ways of avoiding those problems in the first place.
High bandwidth memory on-package with 352 AMD Zen 4 cores!
With 7 TB/s memory bandwidth, it’s basically an x86 GPU.
This is the future of high performance computing. It used to be available only for supercomputers but it’s trickling down to cloud VMs you can rent for reasonable money. Eventually it’ll be standard for workstations under your desk.
Just as a thought experiment, consider the fact that the i80486 has 1.2 million transistors. An eight core Ryzen 9700X has around 12 billion. The difference in clock speed is roughly 80 times, and the difference in number of transistors is 1,250 times.
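A back-of-envelope check, for what it's worth (the 1,250 figure only works per core; total vs. total would be closer to 10,000x):

    /* Rough per-core transistor comparison, assuming ~12e9 transistors
       for an 8-core Ryzen 9700X and ~1.2e6 for an i80486. */
    #include <stdio.h>

    int main(void) {
        double ryzen_total = 12e9;   /* whole 8-core chip */
        double i486_total  = 1.2e6;  /* single-core 486   */
        printf("per core: ~%.0fx\n", (ryzen_total / 8.0) / i486_total); /* ~1250  */
        printf("total:    ~%.0fx\n", ryzen_total / i486_total);         /* ~10000 */
        return 0;
    }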
These are wild generalizations, but let's ask ourselves: if a Ryzen takes 1,250 times the transistors for one core, does one core run 1,250 times faster (even taking hyperthreading into account) than an i80486 at the same clock? 500 times? 100 times?
It doesn't, because massive amounts of those transistors go to keeping things in sync, dealing with changes in execution, folding instructions, decoding a horrible instruction set, et cetera.
So what might we be able to do if we didn't need to worry about figuring out how long our instructions are? Didn't need to deal with Spectre and Meltdown issues? If we made out-of-order execution work in ways where much more could be in flight and the compilers / assemblers would know how to avoid stalls based on dependencies, or how to schedule around them? What if we took expensive operations, like semaphores / locks, and built solutions into the chip?
Would we get to 1,250 times faster for 1,250 times the number of transistors? No. Would we get a lot more performance than we get out of a contemporary x86 CPU? Absolutely.
Modern CPUs also have a lot of things integrated into the "CPU" that used to be separate chips. The i486 didn't have on-die memory or PCI controllers etc., and those things were themselves less complicated then (e.g. a single memory channel and a shared peripheral bus for all devices). The i486SX didn't even have a floating point unit. The Ryzen 9000 series die contains an entire GPU.
that's basically x86 without 16- and 32-bit support, no real mode, etc.
The CPU starts initialized in 64-bit mode without all that legacy crap.
that's IMO a great idea. I think every few decades we need to stop, think again about what works best, and take a fresh start or drop some legacy, unused features.
RISC-V has only a mandatory basic set of instructions, as little as possible while still being Turing complete, and everything else is an extension that can (theoretically) be removed in the future.
this could also be used to remove legacy parts without disrupting the architecture
Would be interesting to see a benchmark on this.
If we restricted it to 486 instructions only, I'd expect the Ryzen to be 10-15x faster. The modern CPU will perform out-of-order execution, with some instructions even running in parallel, even in single-core, single-threaded execution, not to mention superior branch prediction and more cache.
If you allowed modern instructions like AVX-512, then the speedup could easily be 30x or more.
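One crude way to eyeball this yourself (a sketch; the flags assume GCC on x86-64 Linux with 32-bit multilib installed, and bench.c is just a placeholder name): compile the same kernel once restricted to roughly 486-era codegen and once with everything the compiler can use, then compare wall-clock times.

    /* bench.c -- same kernel, two builds:
         gcc -O3 -m32 -march=i486 -mno-sse bench.c -o bench_486
         gcc -O3 -march=native bench.c -o bench_native          */
    #include <stdio.h>
    #include <time.h>

    #define N (1 << 20)
    static float a[N], b[N];

    int main(void) {
        for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }
        clock_t t0 = clock();
        for (int rep = 0; rep < 2000; rep++)
            for (int i = 0; i < N; i++)
                a[i] = a[i] * 0.999f + b[i];   /* vectorizable on the native build */
        printf("a[0]=%f  %.2f s\n", a[0], (double)(clock() - t0) / CLOCKS_PER_SEC);
        return 0;
    }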
> Would we get to 1,250 times faster for 1,250 times the number of transistors? No. Would we get a lot more performance than we get out of a contemporary x86 CPU? Absolutely.
I doubt you'd get significantly more performance, though you'd likely gain power efficiency.
Half of what you described in your hypothetical instruction set is already implemented in ARM.
For serial, branchy code it isn't a million times faster, but that has almost nothing to do with legacy and everything to do with the nature of serial code: you can't linearly improve serial execution with architecture and transistor counts (you can improve it sublinearly); the linear gains historically came from Dennard scaling.
It is worth noting, though, that purely via Dennard scaling (clock speed), Ryzen is already >100x faster! And via architecture (those transistors) it is several multiples beyond that.
In general compute, if you clocked it down to 33 or 66 MHz, a Ryzen would still be much faster than a 486, due to using those transistors for ILP (instruction-level parallelism) and TLP (thread-level parallelism). But you won't see any TLP in a single serial program that a 486 would have been running, and you won't get any of the SIMD benefits either, so you won't get anywhere near that in practice on 486 code.
The key to contemporary high performance computing is having more independent work to do, and organizing the data/work to expose the independence to the software/hardware.
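To make that concrete, here's a minimal C sketch of the same reduction written two ways: the first forms one long dependency chain, the second exposes independent partial sums that out-of-order hardware (or SIMD, or threads) can actually overlap.

    double sum_serial(const double *x, long n) {
        double s = 0.0;
        for (long i = 0; i < n; i++)
            s += x[i];                 /* every add waits on the previous one */
        return s;
    }

    double sum_split(const double *x, long n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        long i = 0;
        for (; i + 4 <= n; i += 4) {   /* four independent chains in flight */
            s0 += x[i];   s1 += x[i+1];
            s2 += x[i+2]; s3 += x[i+3];
        }
        for (; i < n; i++) s0 += x[i];
        return (s0 + s1) + (s2 + s3);
    }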
I'm pretty sure that these goals will conflict with one another at some point. For example, the way one solves Spectre/Meltdown issues in a principled way is by changing the hardware and system architecture to have some notion of "privacy-sensitive" data that shouldn't be speculated on. But this will unavoidably limit the scope of OOO and the amount of instructions that can be "in-flight" at any given time.
For that matter, with modern chips, semaphores/locks are already implemented with hardware builtin operations, so you can't do that much better. Transactional memory is an interesting possibility but requires changes on the software side to work properly.
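For reference, a minimal sketch of what "built into the chip" already looks like today: a spinlock on top of C11 atomics, which compiles down to the hardware's atomic read-modify-write instructions (e.g. lock-prefixed ops on x86, ldaxr/stlxr or LSE atomics on ARM).

    #include <stdatomic.h>

    typedef struct { atomic_flag f; } spinlock;
    #define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

    static void spin_lock(spinlock *l) {
        /* atomic test-and-set; acquire ordering pairs with the release below */
        while (atomic_flag_test_and_set_explicit(&l->f, memory_order_acquire))
            ;  /* spin */
    }

    static void spin_unlock(spinlock *l) {
        atomic_flag_clear_explicit(&l->f, memory_order_release);
    }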
That kind of takes the Spectre/Meltdown thing out of the way to some degree, I would think, although privilege escalation can happen in the darndest places.
But maybe I'm being too optimistic
A 16-core Zen 5 CPU achieves more than 2 TFLOPS FP64, so number-crunching performance has scaled very well.
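Where that figure plausibly comes from, assuming two 512-bit FMA pipes per Zen 5 core (my assumption, not a measurement):

    /* Peak FP64 estimate: cores x GHz x FMA pipes x 8 doubles x 2 FLOPs/FMA.
       All figures below are assumptions for illustration. */
    #include <stdio.h>

    int main(void) {
        double cores = 16, ghz = 5.0, fma_pipes = 2, lanes = 8, flops_per_fma = 2;
        double gflops = cores * ghz * fma_pipes * lanes * flops_per_fma;
        printf("peak: ~%.1f TFLOPS FP64\n", gflops / 1000.0);  /* ~2.6 */
        return 0;
    }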
It is weird that the best consumer GPU can only do about 4 TFLOPS. Some years ago GPUs were an order of magnitude or more faster than CPUs. Today GPUs are likely to be artificially limited [1].
[1] https://www.techpowerup.com/gpu-specs/radeon-pro-vii.c3575
These aren't realistic numbers in most cases because you're almost always limited by memory bandwidth, and even if memory bandwidth is not an issue you'll have to worry about thermals. The theoretical CPU compute ceiling is almost never the real bottleneck. GPUs have a very different architecture with higher memory bandwidth, and they run their chips a lot slower and cooler (lower clock frequency), so they can reach much higher numbers in practical scenarios.
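A roofline-style back-of-envelope makes the point (all numbers below are illustrative assumptions): attainable throughput is the smaller of peak FLOPS and arithmetic intensity times memory bandwidth, and for streaming code on desktop memory the bandwidth term wins by a wide margin.

    #include <stdio.h>

    int main(void) {
        double peak_flops = 2.0e12;  /* ~2 TFLOP/s FP64 peak (assumed)          */
        double bandwidth  = 1.0e11;  /* ~100 GB/s dual-channel DDR5 (assumed)   */
        double intensity  = 0.125;   /* FLOPs per byte for a STREAM-like kernel */
        double mem_bound  = intensity * bandwidth;
        double attainable = mem_bound < peak_flops ? mem_bound : peak_flops;
        printf("attainable: ~%.1f GFLOP/s of %.0f GFLOP/s peak\n",
               attainable / 1e9, peak_flops / 1e9);  /* ~12.5 of 2000 */
        return 0;
    }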
It's a useful comparison in terms of achievable performance per transistor count.
Zen 3 example: https://www.reddit.com/r/Amd/comments/jqjg8e/quick_zen3_die_...
So, more like 85%, or around 6 orders of magnitude difference from your guess. ;)
what makes it more likely to work this time?
I should've written per core.
and spending millions on patent lawsuits ...
And from each individual core:
- 25% per core L1/L2 cache
- 25% vector stuff (SSE, AVX, ...)
- from the remaining 50% only about 20% is doing instruction decoding
CPUs scaled tall, with specialized instructions to make the single thread go faster. No, the amount of work done per transistor does not scale anywhere near linearly; very many of the transistors are dark on any given cycle, compared to a much simpler core that would have much higher utilization.
And that's not sarcasm, I'm serious.
With GPUs you have all these challenges while also building a massively complicated set of custom compilers and interfaces on the software side, while at the same time trying to keep broken user software written against some other company's interface not only functional, but performant.
I couldn't find any buy-it-now links, but 512 GB sticks don't seem to be fantasies either: https://news.samsung.com/global/samsung-develops-industrys-f...
Since it seems A100s top out at 80 GB and appear to start at $10,000, I'd say it's a steal.
Yes, I'm acutely aware that bandwidth matters, but my mental model is that the rest of that sentence is "up to a point," since those "self-hosted LLM" threads are filled to the brim with people measuring tokens per minute or even running inference on CPU.
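The back-of-envelope for "up to a point": memory-bound decode has to stream the active weights once per generated token, so tokens/s is roughly capped at bandwidth divided by weight bytes (the numbers below are illustrative assumptions).

    #include <stdio.h>

    int main(void) {
        double bandwidth_gbps = 7000.0;  /* ~7 TB/s HBM (assumed)             */
        double weights_gb     = 140.0;   /* e.g. 70B params at fp16 (assumed) */
        printf("rough ceiling: ~%.0f tokens/s per stream\n",
               bandwidth_gbps / weights_gb);  /* ~50 */
        return 0;
    }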
I'm not hardware adjacent enough to try such a stunt, but there was also recently a submission of a BSD-3-Clause implementation of Google's TPU <https://news.ycombinator.com/item?id=44111452>
Meanwhile, Micron (Crucial) 64 GB DDR5 (SO-)DIMMs have been available for a few months now.
Do compilers optimize for specific RISC-V CPUs, not just profiles/extensions? Same for drivers and kernel support.
My understanding was that if it's RISC-V compliant, no extra work is needed for existing software to run on it.
A simple example is that the CPU might support running two specific instructions better if they were adjacent than if they were separated by other instructions ( https://en.wikichip.org/wiki/macro-operation_fusion ). So the optimizer can try to put those instructions next to each other. LLVM has target features for this, like "lui-addi-fusion" for CPUs that will fuse a `lui; addi` sequence into a single immediate load.
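A hedged sketch of what that looks like from C (actual codegen varies by toolchain and flags): loading a 32-bit constant typically becomes a lui+addi pair, and a core with lui-addi fusion only gets to fuse it if the scheduler keeps the two instructions adjacent.

    /* On RISC-V this usually compiles to something like:
           lui  a0, 0x12345
           addi a0, a0, 0x678
       A fusing core executes the adjacent pair as one macro-op; a tuned
       scheduling model (e.g. -mcpu=<that core>) tries to keep it adjacent. */
    unsigned int load_constant(void) {
        return 0x12345678u;
    }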
A more complex example is keeping track of the CPU's internal state. The optimizer models the state of the CPU's functional units (integer, address generation, etc) so that it has an idea of which units will be in use at what time. If the optimizer has to allocate multiple instructions that will use some combination of those units, it can try to lay them out in an order that will minimize stalling on busy units while leaving other units unused.
That information also tells the optimizer about the latency of each instruction, so when it has a choice between multiple ways to compute the same operation it can choose the one that works better on this CPU.
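A tiny illustration of that last point (a sketch, not a claim about any particular compiler's output): x*5 can be lowered as a plain multiply or, with the Zba extension, as a shift-and-add; which one is better depends on the latency and throughput numbers in the target's scheduling model.

    /* Same operation, two possible lowerings on RISC-V:
         li a1, 5 ; mul a0, a0, a1     (integer multiply)
         sh2add a0, a0, a0             (Zba: a0*4 + a0, often cheaper)
       The per-CPU scheduling model's numbers decide which one the
       compiler prefers for a given core. */
    unsigned long times5(unsigned long x) {
        return x * 5;
    }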
See also: https://myhsu.xyz/llvm-sched-model-1/ https://myhsu.xyz/llvm-sched-model-1.5/
If you don't do this your code will still run on your CPU. It just won't necessarily be as optimal as it could be.
It's not that things won't run, but this is necessary for compilers to generate well optimized code.
For GPUs today and in the foreseeable future, there are still good reasons for them to remain discrete, in some market segments. Low-power laptops have already moved entirely to integrated GPUs, and entry-level gaming laptops are moving in that direction. Desktops have widely varying GPU needs ranging from the minimal iGPUs that all desktop CPUs now already have, up to GPUs that dwarf the CPU in die and package size and power budget. Servers have needs ranging from one to several GPUs per CPU. There's no one right answer for how much GPU to integrate with the CPU.
And for low-power consumer devices like laptops, "matrix multiplication coprocessor for AI tasks" is at least as likely to mean NPU as GPU, and NPUs are always integrated rather than discrete.
Calling something a GPU tends to make people ask for (good, performant) support for OpenGL, Vulkan, Direct3D... which seems like a huge waste of effort if you want to be an "AI-coprocessor".
Completely irrelevant to consumer hardware, in basically the same way as NVIDIA's Hopper (a data center GPU that doesn't do graphics). They're ML accelerators that for the foreseeable future will mostly remain discrete components and not be integrated onto Xeon/EPYC server CPUs. We've seen a handful of products where a small amount of CPU gets grafted onto a large GPU/accelerator to remove the need for a separate host CPU, but that's definitely not on track to kill off discrete accelerators in the datacenter space.
> Calling something a GPU tends to make people ask for (good, performant) support for OpenGL, Vulkan, Direct3D... which seems like a huge waste of effort if you want to be an "AI-coprocessor".
This is not a problem outside the consumer hardware market.
But there is much more to discrete GPUs than vector instructions or parallel cores. They have very different memory and cache systems with very different synchronization tradeoffs. A discrete GPU is like an embedded computer hanging off your PCIe bus, and this computer does not have the same stable architecture as the general-purpose CPU running the host OS.
In some ways, the whole modern graphics stack is a sort of integration and commoditization of the supercomputers of decades ago. What used to be special vector machines and clusters full of regular CPUs and RAM has moved into massive chips.
But as other posters said, there is still a lot more abstraction in the graphics/numeric programming models and a lot more compiler and runtime tools to hide the platform. Unless one of these hidden platforms "wins" in the market, it's hard for me to imagine general purpose OS and apps being able to handle the massive differences between particular GPU systems.
It would easily be like prior decades, where multicore wasn't taking off because most apps couldn't really use it, or where special things like the Cell processor in the PlayStation required very dedicated development to use effectively. The heterogeneity of system architectures makes it hard for general-purpose reuse and hard to "port" software that wasn't written with the platform in mind.
https://www.notebookcheck.net/Intel-CEO-abruptly-trashed-Roy...
RISC-V is the fifth in a series of academic chip designs from Berkeley (hence its name).
In terms of design philosophy, it's probably closest to MIPS of the major architectures; I'll point out that some of its early whitepapers are explicitly calling out ARM and x86 as the kind of architectural weirdos to avoid emulating.
Says every new system without legacy concerns.
Also, I don't mean to come off as confrontational; I genuinely don't know.
Given that the core motivation of RISC was to be a maximally performant approach to architecture, the authors of RISC-V would disagree with you that their approach compromises performance.
Interestingly, I recently completed a masters-level computer architecture course and we used MIPS. However, starting next semester the class will use RISC-V instead.
There actually have been changes for "today's needs," and they're usually things like AES acceleration. ARM tried to run Java natively with Jazelle, but it's still best to think of it as a frontend, and the fact that Android is mostly Java on ARM, yet this feature got dropped, says a lot.
The fact that there haven't been that many changes shows they got the fundamental operations and architecture styles right. What's lacking today is where GPUs step in: massively wide SIMD.
AArch64 is pretty much a completely new ISA built from the ground up.
Your CPU changes with every app, tab, and program you open, changing from one core to n cores plus AI/GPU and back. This idea that you have to write it all in stone always seemed wild to me.
So we have:
CISC – which is still used outside the x86 bubble;
RISC – which is widely used;
Hybrid RISC/CISC designs – excluding x86, that would be the IBM z/Architecture (i.e. mainframes);
EPIC/VLIW – which has been largely unsuccessful outside DSPs and a few niches.
They all deal with registers, moves, and testing conditions, though, and one can't say that an ISA 123 that effectively does the same thing as an ISA 456 is older or newer. SIMD instructions have been the latest addition, and they also follow the same well-known mental and compute models. Radically different designs, such as the Intel iAPX 432, Smalltalk machines, and Java CPUs, have not received any meaningful acceptance, and it seems that the idea of a CPU architecture tied to a higher-level compute model has been eschewed in perpetuity. Java CPUs were the last massively hyped attempt to change that, and that was 30 years ago.
What other viable alternatives outside the von Neumann architecture are available to us? I am not sure.
It exists, and was specifically designed to go wide since clock speeds have limits, but ILP can be scaled almost infinitely if you are willing to put enough transistors into it: AArch64.
SFCompute
And so on … definitely not out of trend
There's plenty of people who would be fine doing unexciting dead end work if they were compensated well enough (pay, work-life balance, acknowledgement of value, etc).
This is ye olde Creative Destruction dilemma. There's too much inertia and politics internally to make these projects succeed in house. But if a startup were owned by the org, and they mapped out a path for how to absorb it after it takes off, they would then reap the rewards rather than watch yet another competitor eat their lunch.
The only way I've seen anyone deal with this issue successfully is with rather small companies which don't have nearly as much of the whole agency cost of management to deal with.
Are they going to make one with 16384 cores for AI / graphics or are they going to make one with 8 / 16 / 32 cores that can each execute like 20 instructions per cycle?
The biggest roadblock would be lack of support on the software side.
What it can't be is something like the Mill if they implement the RISC-V ISA.
I came to this thread looking for a comment about this. I've been patiently following along for over a decade now and I'm not optimistic anything will come from the project :(
The lack of high-performance RISC-V designs means that C/C++ compilers produce all-around good but generic code that can run on most RISC-V CPUs, from microcontrollers to the few commercially available desktops or laptops, but it can't exploit the high-performance design features of a specific CPU (e.g. exploit instruction timings or specific instruction sequences recommended for each generation). The real issue is that high-performance RISC-V designs are yet to emerge.
Producing a highly performant CPU is only one part of the job, and the next part requires compiler support, which can't exist unless the vendor publishes extensive documentation that explains how to get the most out of it.
I wish them success, plus I hope they do not do what Intel did with its add-ons.
Hoping for an open system (which I think RISC-V is) and nothing even close to Intel ME or AMT.
https://en.wikipedia.org/wiki/Intel_Management_Engine
https://en.wikipedia.org/wiki/Intel_Active_Management_Techno...
The architecture is independent of additional silicon with separate functions. The "only" thing that makes RISC-V open is that the specifications are freely available and freely usable.
Intel ME is, by design, separate from the actual CPU. Whether the CPU uses x86 or RISC-V is essentially irrelevant.
The fact that California housing costs pushed Intel to Oregon probably helped lead to its failures. Every time a company relocates to get cost-of-living (and thus payroll) costs down by moving to a place with fewer potential employees and fewer competing employers, modernity slams on the brakes.
This wiki page has a list of Intel fab starts; you can see them being constructed in Oregon until 2013, after which all new construction moved elsewhere. https://en.wikipedia.org/wiki/List_of_Intel_manufacturing_si...
I can imagine this slow disinvestment in Oregon would only encourage some architects to quit and found a RISC-V startup.
Arizona is also a mistake: a far worse place for high tech than Oregon! It is a desert real-estate Ponzi scheme with no top-tier schools and no history of top-tier, high-skill intellectual job markets. In general, the Sun Belt (including LA) is the land of stupid.
The electoral college is always winning out over the best economic geography, and it sucks.