I hope the IHVs have a look at it, because current DX12 seems semi-abandoned: it still doesn't support buffer pointers even though every GPU made in the last 10 (or more!) years can do pointers just fine. Meanwhile Vulkan won't do a 2.0 release that cleans things up, so it carries a lot of baggage and, especially, tons of drivers that don't implement the extensions that really improve things.
If this API existed, you could emulate OpenGL on top of it faster than the current OpenGL-to-Vulkan layers do, and something like SDL3 GPU would get a 3x/4x boost too.
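For context, Vulkan does already expose buffer pointers via VK_KHR_buffer_device_address (core since 1.2), which is roughly the feature the comment wants DX12 to match. A minimal host-side C++ sketch, error handling omitted:

```cpp
#include <vulkan/vulkan.h>

// The buffer must be created with VK_BUFFER_USAGE_SHADER_DEVICE_ADDRESS_BIT
// (and its allocation with VK_MEMORY_ALLOCATE_DEVICE_ADDRESS_BIT).
VkDeviceAddress GetBufferPointer(VkDevice device, VkBuffer buffer) {
    VkBufferDeviceAddressInfo info{};
    info.sType  = VK_STRUCTURE_TYPE_BUFFER_DEVICE_ADDRESS_INFO;
    info.buffer = buffer;
    return vkGetBufferDeviceAddress(device, &info);  // raw 64-bit GPU address
}

// The address can then be passed to shaders like any other constant and
// dereferenced via GL_EXT_buffer_reference / SPIR-V PhysicalStorageBuffer.
void PushBufferPointer(VkCommandBuffer cmd, VkPipelineLayout layout,
                       VkDeviceAddress address) {
    vkCmdPushConstants(cmd, layout, VK_SHADER_STAGE_ALL, 0,
                       sizeof(address), &address);
}
```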
This isn't really the case, at least on desktop side.
All three desktop GPU vendors support Vulkan 1.4 (or most of the features via extensions) on all major platforms even on really old hardware (e.g. Intel Skylake is 10+ years old and has all the latest Vulkan features). Even Apple + MoltenVK is pretty good.
Even mobile GPU vendors have pretty good support in their latest drivers.
The biggest issue is that Android consumer devices don't get GPU driver updates so they're not available to the general public.
Vulkan is another mess: even if there were a 2.0, how are devs supposed to actually use it, especially on Android, the biggest consumer Vulkan platform?
But soon? Hopefully
In hindsight it really would have been better to have a separate VulkanES which is specialized for mobile GPUs.
Also, there is nothing that inherently blocks extensions by default. A reasonable core that can optionally do more, similar to CPU extensions (e.g. vector extensions), feels like the way to go here.
There are also some ARM laptops that just run Qualcomm chips, the same as some phones (tablets with a keyboard, basically, but a bit more "PC"-like due to running Windows).
AFAICT the fusion seems likely to be an accurate prediction.
I think this puts a floor on supported hardware though, like Nvidia 30xx and Radeon 5xxx. And of course motherboard support is a crapshoot until 2020 or so.
Bindless textures never needed any kind of resizable BAR; you have been able to use them since the early 2010s in OpenGL through an extension. Buffer pointers have never needed it either.
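That extension path (GL_ARB_bindless_texture, around since ~2013) is tiny; a minimal sketch, assuming a context and loader that expose the extension:

```cpp
#include <GL/glew.h>  // or any loader that exposes GL_ARB_bindless_texture

GLuint64 MakeBindlessHandle(GLuint texture) {
    // Turn a regular texture object into a 64-bit handle...
    GLuint64 handle = glGetTextureHandleARB(texture);
    // ...and make it resident so shaders can sample through it.
    glMakeTextureHandleResidentARB(handle);
    return handle;
}

// The handle can then be stored in any buffer (e.g. an SSBO) and used in GLSL
// as a sampler2D with '#extension GL_ARB_bindless_texture : require'.
// No resizable BAR, no descriptor sets.
```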
> Graphics APIs and shader languages have significantly increased in complexity over the past decade. It’s time to start discussing how to strip down the abstractions to simplify development, improve performance, and prepare for future GPU workloads.
Unless the link of the article has changed since your comment?
Meaning ... SSDs initially reused IDE/SATA interfaces, which had inherent bottlenecks because those standards were designed for spinning disks.
To fully realize SSD performance, a new transport had to be built from the ground up, one that eliminated those legacy assumptions, constraints and complexities.
A lot of this post went over my head, but I've struggled enough with GLSL for this to be triggering. Learning gets brutal for the lack of middle ground between reinventing every shader every time and using an engine that abstracts shaders from the render pipeline. A lot of open-source projects that use shaders are either allergic to documenting them or are proud of how obtuse the code is. Shadertoy is about as good as it gets, and that's not a compliment.
The only way I learned anything about shaders was from someone who already knew them well. They learned what they knew by spending a solid 7-8 years of their teenage/young adult years doing nearly nothing but GPU programming. There's probably something in between that doesn't involve giving up and using node-based tools, but in a couple decades of trying and failing to grasp it I've never found it.
https://lettier.github.io/3d-game-shaders-for-beginners/inde...
I agree on the other points. GPU graphics programming is hard in large part because of terrible documentation, or the lack of it.
I also think that the way forward is to go back to software rendering, however this time around those algorithms and data structures are actually hardware accelerated as he points out.
Note that this is already an ongoing trend in the VFX industry; about 5 years ago OTOY ported their OctaneRender to CUDA as the main rendering API.
> “Inbetween” is never written as one word. If you have seen it written in this way before, it is a simple typo or misspelling. You should not use it in this way because it is not grammatically correct as the noun phrase or the adjective form. https://grammarhow.com/in-between-in-between-or-inbetween/
Matthew 7:3 "And why beholdest thou the mote that is in thy brother's eye, but considerest not the beam that is in thine own eye?"
If you intend for people to click the link, then you might just as well delete all the prose before it.
And linguists think it would be a bad idea to have one:
https://archive.nytimes.com/opinionator.blogs.nytimes.com/20...
Meanwhile GPU raytracing was a purely software affair until quite recently when fixed-function raytracing hardware arrived. It's fast but also opaque and inflexible, only exposed through high-level driver interfaces which hide most of the details, so you have to let Jensen take the wheel. There's nothing stopping someone from going back to software RT of course but the performance of hardware RT is hard to pass up for now, so that's mostly the way things are going even if it does have annoying limitations.
I think it's fair to say that for most gamers, Vulkan/DX12 hasn't really been a net positive: the PSO problem affected many popular games, and while Vulkan has been trying to improve, WebGPU is tricky as it has its roots in the first versions of Vulkan.
Perhaps it was a bad idea to go all in on a low-level API that exposes so many details when the hardware underneath is evolving so fast. Maybe CUDA, as the post says in some places, with its more generic computing support, is the right way after all.
For example: https://github.com/StafaH/mujoco_warp/blob/render_context/mu...
(A new simple raytracer that compiles to CUDA, used for robotics reinforcement learning; it renders at up to 1 million fps at low resolution, 64x64, with textures and shadows.)
https://semiengineering.com/knowledge_centers/standards-laws...
There is a constant cycle between domain-specific hardware-hardcoded-algorithm design, and programmable flexible design.
VAO is the last feature I was missing prior.
Also the other cores will do useful gameplay work so one CPU core for the GPU is ok.
4 CPU cores is also enough for eternity. 1GB shared RAM/VRAM too.
Let's build something good on top of the hardware/OSes/APIs/languages we have now? 3588/linux/OpenGL/C+Java specifically!
Hardware has permanently peaked in many ways, only soft internal protocols can now evolve, I write mine inside TCP/HTTP.
In the before times, upgrading the CPU meant everything ran faster. Who didn't like that? Today, we need code that scales across CPU cores indefinitely for that to remain true. 16-thread CPUs have been around for a long time; I'd like my software to make the most of them.
When we have 480+Hz monitors, we will probably need more than 1 CPU core for GPU rendering to make the most of them.
Uh oh https://www.amazon.com/ASUS-Swift-Gaming-Monitor-PG27AQDP/dp...
When you look at the DirectX 12 documentation and best-practice guides, you’re constantly warned that certain techniques may perform well on one GPU but poorly on another, and vice versa. That alone shows how fragile this approach can be.
Which makes sense: GPU hardware keeps evolving and has become incredibly complex. Maybe graphics APIs should actually move further up the abstraction ladder again, to a point where you mainly upload models, textures, and a high-level description of what the scene and objects are supposed to do and how they relate to each other. The hardware (and its driver) could then decide what’s optimal and how to turn that into pixels on the screen.
Yes, game engines and (to some extent) RHIs already do this, but having such an approach as a standardized, optional graphics API would be interesting. It would allow GPU vendors to adapt their drivers closely to their hardware, because they arguably know best what their hardware can do and how to do it efficiently.
Because that control is only as good as you can master it, and not all game developers do well on that front. Just check out enhanced barriers in DX12 and all of the rules around them as an example. You almost need to train as a lawyer to digest that clusterfuck.
> The hardware (and its driver) could then decide what’s optimal and how to turn that into pixels on the screen.
We should go in the other direction: have a goddamn ISA you can target across architectures, like an x86 for GPUs (though ideally not that encumbered by licenses), and let people write code against it. Get rid of all the proprietary driver stack while you're at it.
What would be one good primer to be able to comprehend all the design issues raised?
Bonus points if you then look at CUDA “hello world” and consider that it can do nontrivial work on the same hardware (sans fixed function accelerators) with much less boilerplate (and driver overhead).
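For comparison, a sketch of that kind of CUDA "hello world": a complete program that does real GPU work with essentially no setup or binding code:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Each thread scales one element of the array.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* data;
    cudaMallocManaged(&data, n * sizeof(float));    // unified memory: no explicit copies
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n); // launch: grid size, block size, args
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);              // prints 2.000000
    cudaFree(data);
}
```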
I wish I still had this level of motivation :)
> Meshlet has no clear 1:1 lane to vertex mapping, there’s no straightforward way to run a partial mesh shader wave for selected triangles. This is the main reason mobile GPU vendors haven’t been keen to adapt the desktop centric mesh shader API designed by Nvidia and AMD. Vertex shaders are still important for mobile.
I get that there's no mapping from vertex/triangle to tile until after the mesh shader runs. But even with vertex shaders there's also no mapping from vertex/triangle to tile until after the vertex shader runs. The binning of triangles to tiles has to happen after the vertex/mesh shader stage. So I don't understand why mesh shaders would be worse for mobile TBDR.
I guess this is suggesting that TBDR implementations split the vertex shader into two parts, one that runs before binning and only calculates positions, and one that runs after and computes everything else. I guess this could be done but it sounds crazy to me, probably duplicating most of the work. And if that's the case why isn't there an extension allowing applications to explicitly separate position and attribute calculations for better efficiency? (Maybe there is?)
Edit: I found docs on Intel's site about this. I think I understand now. https://www.intel.com/content/www/us/en/developer/articles/g...
Yes, you have to execute the vertex shader twice, which is extra work. But if your main constraint is memory bandwidth, not FLOPS, then I guess it can be better to throw away the entire output of the vertex shader except the position, rather than save all the output in memory and read it back later during rasterization. At rasterization time when the vertex shader is executed again, you only shade the triangles that actually went into your tile, and the vertex shader outputs stay in local cache and never hit main memory. And this doesn't work with mesh shaders because you can't pick a subset of the mesh's triangles to shade.
It does seem like there ought to be an extension to add separate position-only and attribute-only vertex shaders. But it wouldn't help the mesh shader situation.
In fact, Qualcomm's documentation spells this out: https://docs.qualcomm.com/nav/home/overview.html?product=160...
* GPU virtualization (e.g., the D3D residency APIs), to allow many applications to share GPU resources (e.g., HBM). (A minimal sketch of those residency calls follows after this list.)
* Undefined behavior: how easy is it for applications to accidentally or intentionally take a dependency on undefined behavior? This can make it harder to translate this new API to an even newer API in the future.
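For reference, the existing D3D12 residency calls mentioned above look roughly like this; a minimal sketch with error handling omitted (an ID3D12Heap is used because it is an ID3D12Pageable):

```cpp
#include <d3d12.h>

void PageHeapInAndOut(ID3D12Device* device, ID3D12Heap* heap) {
    ID3D12Pageable* pageable[] = { heap };
    // Ask the OS/driver to bring the heap's allocation into GPU memory.
    device->MakeResident(1, pageable);
    // ... submit work that uses resources placed in this heap ...
    // Allow it to be paged out again when memory pressure demands it.
    device->Evict(1, pageable);
}
```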
I was an only-half-joking champion of ditching vertex attrib bindings when we were drafting WebGPU and WGSL, because it's a really nice simplification, but it was felt that would be too much of a departure from existing APIs. (Spending too many of our "Innovation Tokens" on something that would cause dev friction in the beginning)
In WGSL we tried (for a while?) to build language features as "sugar" when we could. You don't have to guess what order or scope a `for` loop uses when we just spec how it desugars into a simpler, more explicit (but more verbose) core form/dialect of the language.
That said, this powerpoint-driven-development flex knocks this back a whole seriousness and earnestness tier and a half:

> My prototype API fits in one screen: 150 lines of code. The blog post is titled “No Graphics API”. That’s obviously an impossible goal today, but we got close enough. WebGPU has a smaller feature set and features a ~2700 line API (Emscripten C header).
Try to zoom out on the API and fit those *160* lines on one screen! My browser gives up at 30%, and I am still only seeing 127. This is just dishonesty, and we do not need more of this kind of puffery in the world.
And yeah, it's shorter because it is a toy PoC, even if one I enjoyed seeing someone else's take on it. Among other things, the author pretty dishonestly elides the number of lines the enums would take up. (A texture/data format enum on one line? That's one whole additional Pinocchio right there!)
I took WebGPU.webidl and did a quick pass through removing some of the biggest misses of this API (queries, timers, device loss, errors in general, shader introspection, feature detection) and some of the irrelevant parts (anything touching canvas, external textures), and immediately got it down to 241 declarations.
This kind of dishonest puffery holds back an otherwise interesting article.
WebGPU could have also introduced CUDA's simple launch model for graphics APIs. Instead of all that insane binding boilerplate, just provide the bindings as launch args to the draw call, like draw(numTriangles, args), with args being something like draw(numTriangles, {uniformBuffer, positions, uvs, samplers}), depending on whatever the shaders expect.
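A purely hypothetical sketch of that launch-style draw; none of these types or calls exist in WebGPU, they are just stand-ins to show the shape of the idea:

```cpp
#include <cstdint>

// Hypothetical API sketch, not real WebGPU: the argument struct plays the role
// of CUDA kernel parameters, replacing bind group layouts and descriptor sets.
struct GpuBuffer  { uint64_t address; };   // stand-in handle types
struct GpuSampler { uint32_t id; };

struct DrawArgs {
    GpuBuffer  uniforms;    // whatever the bound shaders expect
    GpuBuffer  positions;
    GpuBuffer  uvs;
    GpuSampler albedoSampler;
};

struct CommandList {
    // One call: triangle count plus arguments, no binding boilerplate.
    void draw(uint32_t numTriangles, const DrawArgs& args) {
        // A real implementation would copy 'args' into a small constant block
        // and record the draw; omitted here.
        (void)numTriangles; (void)args;
    }
};
```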
Among other things, that covers everything running on non-apple, non-nvidia ARM devices, including freshly bought.
...at the cost of creating PSOs at random times which is an expensive operation :/
IMHO a small number of immutable state objects is the best middle ground (similar to D3D11 or Metal, but reshuffled like described in Seb's post).
In particular, this fork: https://github.com/RobertBeckebans/nvrhi which adds some niceties and quality of life improvements.
Will it be possible to hallucinate the frame of a game at a similar speed to rendering it with a mesh and textures?
We're already seeing the hybrid version of this where you render a lower res mesh and hallucinate the upscaled, more detailed, more realistic looking skin over the top.
I wouldn't want to be in the game engine business right now :/
https://www.youtube.com/watch?v=P6UKhR0T6cs&t=2315s
> "research from the 70s especially, there was tons of work going on on hidden surface removal, these clever different algorithmic ways - today we just kill it with a depth buffer. We just throw megabytes and megabytes of memory and the problem gets solved much much easier."
Of course, "megabytes" of memory was unthinkable in the 70s, but for us, it's unthinkable to have real-time frame inferencing. I can't help but draw the parallel to our current-day "clever algorithmic ways" of drawing pixels to the screen.
I definitely agree with the take that in the grand scheme of things, all this pixel rasterizing business will be a transient moment that will be washed away with a much simpler petaflop/exaflop local TPU that runs at 60W under load, and it simply 'dreams' frames and textures for you.
And it's quite a bit simpler than what we have in the "modern" GPU APIs atm.
What the modern APIs give you is less CPU driver overhead and new functionality like ray tracing. If you're not CPU-bound to begin with and don't need those new features, then there's not much of a reason to switch. The modern APIs require way more management than the prior ones; memory management, CPU-GPU synchronization, avoiding resource hazards, etc.
Also, many of those AAA games are also moving to UE5, which is basically DX12 under the hood (presumably it should have a Vulkan backend too, but I don't see it used much?)
Vulkan has the same issues (and more) as D3D12, you just don't hear much about it because there are hardly any games built directly on top of Vulkan. Vulkan is mainly useful as Proton backend on Linux.
Or has the use of Middleware like Unreal Engine largely made them irrelevant? Or should EPIC put out a new Graphics API proposal?
Game developers create a RHI (rendering hardware interface) like discussed on the article, and go on with game development.
Because the greatest innovation thus far has been ray tracing and mesh shaders, and still they are largely ignored, so why keep on pushing forward?
Yes, the centralization of engines to Unreal, Unity, etc. means there's less interest in pushing the boundaries; they are still pushed, just on the GPU side.
From a CPU API perspective, it's very close to just plain old buffer mapping and go. We would need a hardware shift that adds something more to the pipeline than what we currently do, like when tessellation shaders came about from geometry shader practices.
- It doesn't expose raw GPU addresses; SDL3_GPU has buffer objects instead. Also, you're much more limited in how you use buffers in SDL3 (e.g. no coherent buffers, and you're forced to use a transfer buffer if you want to do a CPU -> GPU upload; see the upload sketch at the end of this comment).
- In SDL3_GPU, synchronization is done automatically, without the user specifying barriers (helped by a technique called cycling: https://moonside.games/posts/sdl-gpu-concepts-cycling/).
- More modern features such as mesh shading are not exposed in SDL3_GPU, and it keeps the traditional rendering pipeline as the main way to draw stuff. Also, bindless is a first class citizen in Aaltonen's proposal (and the main reason for the simplification of the API), while SDL3_GPU doesn't support it at all and instead opts for a traditional descriptor binding system.
This "no api" proposal requires hardware from the last 5-10 years :)
Thankfully later versions have added escape hatches which bypass much of that unnecessary bureaucracy, but it was grim for a while, and all that early API cruft is still there to confuse newcomers.
Per-drawcall cost goes to the nanosecond scale, assuming you do drawcalls at all, of course. This also makes bindless and indirect rendering a bit easier, so you could drop CPU cost to near zero in a renderer.
It would also highly mitigate shader compiler hitches, due to having a split pipeline instead of a monolithic one.
The simplification of barriers could improve performance a significant amount: currently, most engines that deal with Vulkan and DX12 need to keep track of individual texture layouts and transitions, and this proposal removes that entirely.
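For a sense of what that per-texture tracking means today, here is a minimal Vulkan layout-transition sketch; the engine has to know the previous layout and access mask for every image it touches:

```cpp
#include <vulkan/vulkan.h>

// Transition a color image from render-target output to shader sampling.
void TransitionToSampled(VkCommandBuffer cmd, VkImage image) {
    VkImageMemoryBarrier barrier{};
    barrier.sType               = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
    barrier.srcAccessMask       = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
    barrier.dstAccessMask       = VK_ACCESS_SHADER_READ_BIT;
    barrier.oldLayout           = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
    barrier.newLayout           = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image               = image;
    barrier.subresourceRange    = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,  // producing stage
        VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,          // consuming stage
        0, 0, nullptr, 0, nullptr, 1, &barrier);
}
```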
- Would lead to reduced memory usage on the driver side, due to eliminating all the state tracking for "legacy" APIs and all the PSO/shader duplication for the "modern" APIs (who doesn't like using less memory? It won't show up on a microbenchmark, but a reduced working set leads to globally increased performance in most cases, thanks to a higher cache hit rate).
- A much reduced cost per API operation. I don't just mean drawcalls but everything else too, and allowing more asynchrony without the "here's 5 types of fences and barriers" kind of mess. As the article says, you can currently choose between mostly implicit sync (OpenGL, DX11) and tracking all your resources yourself (Vulkan), then feeding all that data into the API, which mostly ignores it. This one wouldn't really speed up existing applications so much as unlock new possibilities, for example massively improving scene variety with cheap drawcalls and doing more procedural objects/materials instead of the standard PBR pipeline. Yes, drawindirect and friends exist, but they aren't exactly straightforward to use and require you to structure your problem in a specific way (a minimal indirect-draw sketch follows below).
The cost/compromise is dropping support for outdated GPUs.
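For reference, the Vulkan flavor of "drawindirect and friends" mentioned above is a GPU-visible array of small structs plus one call; the awkward part is restructuring your scene so that the CPU or a compute pass can fill that array:

```cpp
#include <vulkan/vulkan.h>

// 'argsBuffer' holds 'drawCount' tightly packed VkDrawIndexedIndirectCommand
// entries: {indexCount, instanceCount, firstIndex, vertexOffset, firstInstance}.
// A single call then issues all of the draws.
void RecordIndirectDraws(VkCommandBuffer cmd, VkBuffer argsBuffer,
                         uint32_t drawCount) {
    vkCmdDrawIndexedIndirect(cmd, argsBuffer, /*offset*/ 0, drawCount,
                             sizeof(VkDrawIndexedIndirectCommand));
}
```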
Have you taken a look at the codebase of some game engines? It's a complete cluster-fk, because some simple tasks take 800 lines of code, and in the end the drivers don't even use the complexity graphics APIs force upon you.
Improving on this is not an accomplishment?
But then game/engine devs want to use a vertex shader that produces a UV coordinate and a normal together with a pixel shader that only reads the UV coordinate (or neither, for shadow mapping), and they don't want to pay for the bandwidth of the unused vertex outputs (or the cost of calculating them).
Or they want to be able to randomly enable any other pipeline stage like tessellation or geometry and the same shader should just work without any performance overhead.
Basically do what most engines do - have preprocessor constants and use different paths based on what attributes you need.
I also don't see how separated pipeline stages are against this - you already have this functionality in existing APIs where you can swap different stages individually. Some changes might need a fixup from the driver side, but nothing which can't be added in this proposed API's `gpuSetPipeline` implementation...