DX12 / Vulcan and new GPU architecture (?)

stateofmind · Jul 12, 2015

Hi people

I know most current GPUs will run DX12 and all, but I wonder whether the "new" way to do things with DX12/Vulcan will mean that different GPU architectures will be more efficient.

Now that you can push a lot of smaller draw calls - are current GPUs (like NV Maxwell) optimized to do the job that way?

I'm trying to wrap my head around it.

Thanks

monstercameron · Jul 12, 2015

I'm no expert but from the current trends id image that
Most gpu uarches will focus on compute shaders rather than the traditional rasterization pipeline.

ThatBuzzkiller · Jul 12, 2015

monstercameron said:
I'm no expert but from the current trends id image that
Most gpu uarches will focus on compute shaders rather than the traditional rasterization pipeline.

This ...

The future is Larrabee comparable programmability, fixed function is mostly dead end and dead silicon ...

stateofmind · Jul 12, 2015

ThatBuzzkiller said:
This ...

The future is Larrabee comparable programmability, fixed function is mostly dead end and dead silicon ...

why?
Also, I was thinking more about the layout of the cores (for special Maxwell's SMMs). I would think that a lot of smaller commands will require a different structure to have optimized performance or otherwise, the GPU might not get feed optimally. Maybe something similar to what happened with the VLIW5 vs VLIW4

TheELF · Jul 12, 2015

stateofmind said:
I would think that a lot of smaller commands will require a different structure to have optimized performance or otherwise, the GPU might not get feed optimally.

First of the "a lot of smaller commands" ,that every benchmarks is showing,is only a showcase,it's the one thing that gets a very big boost from lower api overhead but it is not that common in games,a game has a set number of draw calls and devs will keep that number within the capabilities of the ps4/xbone gpu so any mid and higher vga on the desktop will have no problems at all.
So nothing is going to change,some games will work better on some arch and others will work worse dependant on effects,gameworks and the likes.

NTMBK · Jul 13, 2015

Asynchronous compute and shading will shake things up nicely. http://www.anandtech.com/show/9124/amd-dives-deep-on-asynchronous-shading Early GCN, Kepler and Maxwell 1 will struggle with this (Kepler and Maxwell 1 can't do it at all, and can only have 1 graphics context running), but Maxwell 2 and later GCN cards should handle it nicely.

stateofmind · Jul 13, 2015

NTMBK said:
Asynchronous compute and shading will shake things up nicely. http://www.anandtech.com/show/9124/amd-dives-deep-on-asynchronous-shading Early GCN, Kepler and Maxwell 1 will struggle with this (Kepler and Maxwell 1 can't do it at all, and can only have 1 graphics context running), but Maxwell 2 and later GCN cards should handle it nicely.

which coincides well with many smaller commands, right? I'm asking because although I can see the diagrams and graphs, I don't really know what's going on below

werepossum · Jul 13, 2015

TheELF said:
First of the "a lot of smaller commands" ,that every benchmarks is showing,is only a showcase,it's the one thing that gets a very big boost from lower api overhead but it is not that common in games,a game has a set number of draw calls and devs will keep that number within the capabilities of the ps4/xbone gpu so any mid and higher vga on the desktop will have no problems at all.
So nothing is going to change,some games will work better on some arch and others will work worse dependant on effects,gameworks and the likes.

That was my understanding, that DX12 will be an improvement for everyone but huge for low end graphics cards and CPUs, especially AMD. (Because both Intel CPUs and NVidia GPUs already do well on large draw call counts.)

NTMBK said:
Asynchronous compute and shading will shake things up nicely. http://www.anandtech.com/show/9124/amd-dives-deep-on-asynchronous-shading Early GCN, Kepler and Maxwell 1 will struggle with this (Kepler and Maxwell 1 can't do it at all, and can only have 1 graphics context running), but Maxwell 2 and later GCN cards should handle it nicely.

By later GCN cards, does this include the R9 300 series?

Also, are the XBone and PS4 capable of taking advantage of this, or will it be limited to PC-only paths?

NTMBK · Jul 13, 2015

werepossum said:
By later GCN cards, does this include the R9 300 series?

Also, are the XBone and PS4 capable of taking advantage of this, or will it be limited to PC-only paths?

Some of the 300 series, but not all Hawaii, Fiji and Tonga all have beefed up GPU queues, while Bonaire and Pitcairn do not. (More details in the Anandtech article I linked.) They aren't completely limited like Kepler, but they don't have as much flexibility as the later models.

As for the consoles- yes, they will support this. The PS4 has 8 asynchronous compute engines (like Hawaii and Tonga), while the XBox One has 2 (like Pitcairn). This is in addition to their graphics command processor.

This image from the original GCN launch should give you some idea of what an ACE is:

http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/5

The ACEs can feed compute based tasks directly to the compute units, bypassing the traditional graphics pipeline. The GCP handles traditional graphics ("here is a list of polygons, go rasterize them!").

werepossum · Jul 13, 2015

NTMBK said:
Some of the 300 series, but not all Hawaii, Fiji and Tonga all have beefed up GPU queues, while Bonaire and Pitcairn do not. (More details in the Anandtech article I linked.) They aren't completely limited like Kepler, but they don't have as much flexibility as the later models.

As for the consoles- yes, they will support this. The PS4 has 8 asynchronous compute engines (like Hawaii and Tonga), while the XBox One has 2 (like Pitcairn). This is in addition to their graphics command processor.

This image from the original GCN launch should give you some idea of what an ACE is:

http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/5

The ACEs can feed compute based tasks directly to the compute units, bypassing the traditional graphics pipeline. The GCP handles traditional graphics ("here is a list of polygons, go rasterize them!").

I mean specifically Hawaii, I just didn't think to say. Sometimes I forget that not just the 290/290X were rebranded, but also older models of GCN.

This is why I've switched from leaning toward a GTX970 to an R9 390 even though I'll be gaming at 1080p. Seems to me that the Hawaii chips should benefit more from DX12 because of similarities with the consoles' GPUs as well as being better designed for a low level API (Mantle.)

NTMBK · Jul 13, 2015

werepossum said:
I mean specifically Hawaii, I just didn't think to say. Sometimes I forget that not just the 290/290X were rebranded, but also older models of GCN.

This is why I've switched from leaning toward a GTX970 to an R9 390 even though I'll be gaming at 1080p. Seems to me that the Hawaii chips should benefit more from DX12 because of similarities with the consoles' GPUs as well as being better designed for a low level API (Mantle.)

Ah, the 970 is Maxwell 2, so it should benefit just as much as Hawaii

werepossum · Jul 13, 2015

NTMBK said:
Ah, the 970 is Maxwell 2, so it should benefit just as much as Hawaii

You think the Tier 3 Resource Binding won't be used heavily, or just that Maxwell's architecture will compensate with Tier 2? I'm thinking specifically of mega-textures combined with heavy draw calls here (even though the very preliminary testing I've seen has been pretty contrived) and assuming that devs will leverage the XBone's ability (to balance the PS4's better speed) and thus the PC DX12 port's.

stateofmind · Jul 13, 2015

NTMBK said:
Asynchronous compute and shading will shake things up nicely. http://www.anandtech.com/show/9124/amd-dives-deep-on-asynchronous-shading Early GCN, Kepler and Maxwell 1 will struggle with this (Kepler and Maxwell 1 can't do it at all, and can only have 1 graphics context running), but Maxwell 2 and later GCN cards should handle it nicely.

I've read it all and it seems that my knowledge and my google abilities are not sufficient. Why Physics/Lighting/Memory parts can be executed in parallel / separately? What exactly are they doing?
I can't find a single good explanation or some kind of an example pipeline graph

Azix · Jul 13, 2015

NTMBK said:
Ah, the 970 is Maxwell 2, so it should benefit just as much as Hawaii

Still hoping someone explains the maxwell 2 situation. Needs to be looked into. I suspect they still don't properly support it. That or nvidia thinks its not important.

stateofmind said:
I've read it all and it seems that my knowledge and my google abilities are not sufficient. Why Physics/Lighting/Memory parts can be executed in parallel / separately? What exactly are they doing?
I can't find a single good explanation or some kind of an example pipeline graph

It does seem like things can be done simultaneously. With dx11 it would appear as if the compute and graphics queues could not operate at the same time. So in DX12 you can perform compute tasks while handling graphics tasks (compute can also be used for graphics?). Should be huge benefit. eg. physics on GPU would use compute and the graphics would wait for the physics simulation to finish before its all sent to display as one frame. Should reduce how long it takes to construct each frame.

Anyone who knows better feel free to correct me.

3DVagabond · Jul 13, 2015

This is interesting:

There is also one exception to the DX11 rule that well get to in depth a bit later, but in short that exception is custom middleware like LiquidVR. Even in a DX11 context LiquidVR can leverage some (but not all) of the async shading functionality of GCN GPUs to do things like warping asynchronously, as it technically sits between DX11 and the GPU. This in turn is why async shading is so important to AMD's VR plans, as all of their GCN GPUs are capable of this and it can be exposed in the current DX11 ecosystem.

So, AMD can perform Async shading. Liquid VR is not an API, like I've read some saying but actually middleware that functions between DX and the GPU.

It appears that AT might have this wrong.

Meanwhile Maxwell 2 has 32 queues, composed of 1 graphics queue and 31 compute queues (or 32 compute queues total in pure compute mode).

AMD has 8 ACE's and each one can support 8 queues. That's 64 queues compared to Maxwell 2's 32.

Anyone else reading this differently, or can confirm?

Silverforce11 · Jul 14, 2015

From my other post..

AT's article is wrong, people told Ryan already in the comments but he refused to fix it. He's quoting "compute engines", which is fine, on later GCN there's 8. But for NV, he lists 32 which is plain wrong. That's 32 queues. It has only 1 engine. It's not an accurate chart, either go with engines or queues, not mix and match to inflate NV's and deflate AMD's uarch async abilities.

AMD GCN from Hawaii onwards has 1 CP + 8 ACE (Compute Engines).

In pure compute mode, it has 64 queues.
In rendering/compute, it has 1 rendering + 8 compute (each compute engine is independent) queues.

Maxwell 2 has 1 Compute Engine, it can handle up to 32 compute threads when operating on full compute.

In mixed mode, this single engine can handle 1 rendering + 1 compute queue asynchronously.

This is better than Kepler, since it can't do that at all, it has to wait for one task to finish before doing the next.

What this means is if async compute is used in DX12 games, Kepler is crippled (780Ti will be behind 970) and GCN pulls ahead.

The more async compute used, the more GCN can flex its 8 ACE engines!

Edit: I suspect the poor showing in Kepler in recent titles is due to the shift towards using compute for deferred lighting, such as in Ryse, Evolve, Witcher 3.
Also, Kepler/Maxwell has 2 DMA engines but from my reading, only 1 is enabled on consumer SKU, 2 is fully enabled on Teslas to use HyperQ (basically ensuring its single Engine can reach peak 32 queues).

We should see Pascal significantly improve on the Compute capabilities as DX12 matures, async compute will routinely be leveraged by game devs.

In before some anti-AMD hater claims GCN is out-dated or obsolete.. basically GCN was made for DX12 (aka. Mantle/Vulkan with a MS logo on it)...

zlatan · Jul 14, 2015

The queue engine or engines are just one specific aspect of async shader needs. The efficiency is more limited by the state management than the queue engines. Most of today's architectures have limited async shader efficiency, because the compute shader needs a specific hardware state. This wasn't a problem when the workload was serialized in the actual APIs, but with parallel execution the hardware don't able to execute two parallel pipelines when those are need a different hardware state.
GCN has a very big advantage in async workloads, because the hardware can run compute shaders with any hardware state. Basically the compute is stateless with this architecture. Also this is the main reason why the VR experience is so fluid with GCN. It can execute the async timewarps very efficiently.

Silverforce11 · Jul 14, 2015

@zlatan
Interesting you bring that up, its the same from Sebbbi (gamedev) at B3D

https://forum.beyond3d.com/threads/direct3d-feature-levels-discussion.56575/page-18#post-1851420

AMDs asyncronous compute implementation is also very good, as the fully bindless nature of their GPU means that the CUs can do very fine grained simultaneous execution of multiple shaders. Don't get fooled by the maximum amount of compute queues (shown by some review sites). Big numbers don't tell anything about the performance. Usually running two tasks simultaneously gives the best performance. Running significantly more just trashes the data and instruction caches.

The terminology used is exactly as the lead programmer at Lionhead during AMD's E3 PC Gaming presentation, for Async Compute in DX12 in Fable for light, physics, effects. The lead programmer basically said due to the nature of GCN, fine grained simultaneous usage of shaders for async compute basically makes those features "free" in performance (does not detract from rendering performance).

werepossum · Jul 14, 2015

Silverforce11 said:
@zlatan
Interesting you bring that up, its the same from Sebbbi (gamedev) at B3D

https://forum.beyond3d.com/threads/direct3d-feature-levels-discussion.56575/page-18#post-1851420

The terminology used is exactly as the lead programmer at Lionhead during AMD's E3 PC Gaming presentation, for Async Compute in DX12 in Fable for light, physics, effects. The lead programmer basically said due to the nature of GCN, fine grained simultaneous usage of shaders for async compute basically makes those features "free" in performance (does not detract from rendering performance).

This was my understanding, that basically it wasn't that NVidia didn't prepare for DX12, but rather that with Kepler's architecture simulations showed no significant benefit above two simultaneous threads/tasks (i.e. the delay in waiting for a task erased any advantage in speed from the multiplicity.) So there was no point in making huge design changes in architecture, thereby delaying Maxwell, only enough change to handle one simultaneous compute. But honestly it's been a year or more since I've read up on it, so my recall as well as my understanding may be betraying me. lol

My understanding was further that the extreme similarities between the consoles' GPU structure and AMD's GCN would make it easier to leverage the benefits of DX12 for GCN than for NVidia's architecture. Also, I have been assuming that due to the XBox One's slight weakness compared to the PS4, developers would be looking for ways to leverage its advantages (one being Tier 3 Resource Binding) to avoid providing a subpar XBone experience and thus, a fine AMD port would be relatively painless (for the developer) compared to a fine NVidia port. But as it's been even longer since I read up on console architecture, YMMV. Hell, MMMV.

It's also worth pointing out that even on the off chance that I'm correct, this advantage would largely disappear in AAA titles, where developers have ample resources to work with NVidia to maximize their engine's performance. But it's given me the notion (right or wrong) that the R9 390 will more often get a bigger DX12 boost than will the GTX970, even though it's a truly amazing piece of kit. That and cold feet about future Fallout 4 texture mods hitting the 3.5GB VRAM wall are making me lean toward the R9 390 even though it's heavily leveraged (thus heavily stressed) compared to the GTX970.

Thanks for everyone helping drag me into DX12 familiarity and hopefully avoiding a $300+ mistake. (Not that either card could really be considered a mistake.)

stateofmind · Jul 14, 2015

Silverforce11 said:
@zlatan
Interesting you bring that up, its the same from Sebbbi (gamedev) at B3D

https://forum.beyond3d.com/threads/direct3d-feature-levels-discussion.56575/page-18#post-1851420

The terminology used is exactly as the lead programmer at Lionhead during AMD's E3 PC Gaming presentation, for Async Compute in DX12 in Fable for light, physics, effects. The lead programmer basically said due to the nature of GCN, fine grained simultaneous usage of shaders for async compute basically makes those features "free" in performance (does not detract from rendering performance).

Can you please explain to me how come that physics and lighting can be executed at the same time? I'm not familiar with the actual works.
Intuitively, lighting and physics are related..

Fox5 · Jul 14, 2015

stateofmind said:
Can you please explain to me how come that physics and lighting can be executed at the same time? I'm not familiar with the actual works.
Intuitively, lighting and physics are related..

Kinematics modeling and lighting are both physics, but they are computed separately. Games don't use a Grand Unified Theory of everything.

Now then, lighting is still dependent on object positions. Most likely they just do time steps. For instance, Physx typically runs at 50hz, so 50 times a second, regardless of framerate, physics are updated. Lighting would most likely be computed at frame rate.

stateofmind · Jul 14, 2015

Fox5 said:
Kinematics modeling and lighting are both physics, but they are computed separately. Games don't use a Grand Unified Theory of everything.

Now then, lighting is still dependent on object positions. Most likely they just do time steps. For instance, Physx typically runs at 50hz, so 50 times a second, regardless of framerate, physics are updated. Lighting would most likely be computed at frame rate.

Many thanks for taking the time to help!
Can you refer to some good sources? I'm trying to wrap my head around that

So, lighting is really broken up to parts of the scene?

(I guessed that really physics and lighting and probably other stuff is computed for parts of the scene to reduce load or something, but it is not obvious from any of the articles in any hardware site in English for years)

Fox5 · Jul 14, 2015

stateofmind said:
Many thanks for taking the time to help!
Can you refer to some good sources? I'm trying to wrap my head around that

So, lighting is really broken up to parts of the scene?

(I guessed that really physics and lighting and probably other stuff is computed for parts of the scene to reduce load or something, but it is not obvious from any of the articles in any hardware site in English for years)

Lighting can be computed in two ways.

The classic way of doing it is forward rendering. You basically have to rerender the scene for each light to compute the change in lighting, so your performance is cut by 1/# lights. It's hard to get above about 8 lights in a scene like this, but it doesn't really cause any other issues.

There's also deferred lighting, in which case lighting is deferred until the very end and just rendered on the final 2d image. Lighting is rendered at the resolution of the image, and you can do thousands of lights because it's more of a filter on the final image. However, it breaks anti aliasing and in general doesn't look as good as forward rendering.

Physics computations are unrelated to the lighting. The objects will be moved into the correct positions each frame prior to any lighting being rendered. You could just calculate positions every frame, but if things are non-deterministic (like dynamic physics) then you really don't want a variable number of calculations per second or the results could change. You also don't want to compute something very expensive more than needed, so linking to framerate would lower framerate too.

NTMBK · Jul 15, 2015

Interesting post on this topic from Scott Wasson at the Tech Report:

One possible answer to the question of why Fiji (in both Fury and X forms) seems to underperform is something one of the other commenters has alluded to. The front end of Fiji looks very much like that of Tonga or Hawaii; it just has more CUs per cluster than Hawaii. It's possible the front end of the GPU is a bottleneck in many games, which could explain Fiji's similar performance to Hawaii, despite all the extra resources elsewhere.

If that's the case, and if the issue is just not "feeding the beast" quickly enough (and not just rasterization rates), then it's possible that the coming shift to DX12 and Vulkan could be a big boon to AMD's GCN-based GPUs. They may then be able to use their eight ACE engines to schedule lots of work in parallel and keep those big shader arrays active. Doing so could lead to a surprising turnaround in relative GPU performance.

I also expect DX12 and Vulkan to lead to much lower frame times from games generally thanks mostly to a reduction in serialization and single-thread CPU overhead. This development could help AMD more than Nvidia--in part because AMD needs the help more, and in part because of GCN's dormant ACE engines. Also, these "thin" APIs will move a lot of control back to game developers, taking the ability to optimize things behind the scenes out of the hands of the GPU driver guys--at least in theory. Fascinated to see how that plays out.

http://techreport.com/discussion/28...ury-graphics-card-reviewed?post=921201#921201

He makes a very good point- the ACEs will let AMD bypass Fiji's biggest bottleneck, the inability to provide enough work to the shader clusters.

Silverforce11 · Jul 15, 2015

That's nothing new from what people (gamedevs as well as directly from AMD's engineers) have been saying for a long time. GCN is built for a Mantle-like API, as its uarch cannot be taken advantage of fully with DX11. It was a forward looking uarch, meant to last them a long time.

I am hoping in the not too distant future, I can buy a Zen APU with GCN 2 and HBM2, that would make for an awesome base for an expanding rig, ie. plug in a GCN 2 dGPU, DX12 games take advantage of both.

DX12 / Vulcan and new GPU architecture (?)

Senior member

Diamond Member

Golden Member

Senior member

Diamond Member

Lifer

Senior member

Elite Member

Lifer

Elite Member

Lifer

Elite Member

Senior member

Golden Member

Lifer

Lifer

Senior member

Lifer

Elite Member

Senior member

Diamond Member

Senior member

Diamond Member

Lifer

Lifer