DX12 / Vulkan and new GPU architectures (?)

Status
Not open for further replies.

stateofmind

Senior member
Aug 24, 2012
245
2
76
www.glj.io
Hi people

I know most current GPUs will run DX12 and all, but I wonder whether the "new" way of doing things with DX12/Vulkan will mean that different GPU architectures end up being more efficient.

Now that you can push a lot of smaller draw calls - are current GPUs (like NV Maxwell) optimized to do the job that way?

I'm trying to wrap my head around it.

Thanks
 

monstercameron

Diamond Member
Feb 12, 2013
3,818
1
0
I'm no expert, but from the current trends I'd imagine that
most GPU uarches will focus on compute shaders rather than the traditional rasterization pipeline.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
I'm no expert, but from the current trends I'd imagine that
most GPU uarches will focus on compute shaders rather than the traditional rasterization pipeline.

This ...

The future is Larrabee-like programmability; fixed function is mostly a dead end and dead silicon ...
 

stateofmind

Senior member
Aug 24, 2012
245
2
76
www.glj.io
This ...

The future is Larrabee-like programmability; fixed function is mostly a dead end and dead silicon ...

Why?
Also, I was thinking more about the layout of the cores (for example Maxwell's SMMs). I would think that a lot of smaller commands would require a different structure for optimized performance; otherwise, the GPU might not get fed optimally. Maybe something similar to what happened with VLIW5 vs. VLIW4.
 

TheELF

Diamond Member
Dec 22, 2012
4,026
753
126
I would think that a lot of smaller commands would require a different structure for optimized performance; otherwise, the GPU might not get fed optimally.
First off, the "a lot of smaller commands" case that every benchmark is showing is only a showcase; it's the one thing that gets a very big boost from lower API overhead, but it is not that common in games. A game has a set number of draw calls, and devs will keep that number within the capabilities of the PS4/XBone GPU, so any mid-range or higher card on the desktop will have no problems at all.
So nothing is going to change: some games will work better on one arch and worse on another, depending on effects, GameWorks and the like.
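
To put some rough numbers on the overhead point (the per-draw CPU costs below are made-up assumptions for illustration, not measurements of any real driver), here's a quick toy in C++ showing why the draw-call count only matters once the old, fatter API path blows the frame budget:

Code:
// Toy model of CPU-side draw submission cost (numbers are assumed, purely illustrative).
#include <cstdio>

int main() {
    const double dx11_us_per_draw = 40.0;  // assumed CPU cost per draw call on a DX11-style path
    const double dx12_us_per_draw = 10.0;  // assumed CPU cost per draw call on a thin DX12-style path
    const int draws[] = {2000, 10000, 50000};

    for (int n : draws) {
        double dx11_ms = n * dx11_us_per_draw / 1000.0;
        double dx12_ms = n * dx12_us_per_draw / 1000.0;
        std::printf("%6d draws: DX11-ish CPU time %7.1f ms, DX12-ish %7.1f ms\n",
                    n, dx11_ms, dx12_ms);
    }
    // At console-like draw counts both paths fit inside a 16.6 ms frame budget;
    // the gap only matters once the draw count (or a slow CPU) pushes the old path past it.
    return 0;
}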
 

stateofmind

Senior member
Aug 24, 2012
245
2
76
www.glj.io
Asynchronous compute and shading will shake things up nicely. http://www.anandtech.com/show/9124/amd-dives-deep-on-asynchronous-shading Early GCN, Kepler and Maxwell 1 will struggle with this (Kepler and Maxwell 1 can't do it at all, and can only have 1 graphics context running), but Maxwell 2 and later GCN cards should handle it nicely.

which coincides well with many smaller commands, right? I'm asking because although I can see the diagrams and graphs, I don't really know what's going on underneath.
 

werepossum

Elite Member
Jul 10, 2006
29,873
463
126
First off, the "a lot of smaller commands" case that every benchmark is showing is only a showcase; it's the one thing that gets a very big boost from lower API overhead, but it is not that common in games. A game has a set number of draw calls, and devs will keep that number within the capabilities of the PS4/XBone GPU, so any mid-range or higher card on the desktop will have no problems at all.
So nothing is going to change: some games will work better on one arch and worse on another, depending on effects, GameWorks and the like.
That was my understanding, that DX12 will be an improvement for everyone but huge for low end graphics cards and CPUs, especially AMD. (Because both Intel CPUs and NVidia GPUs already do well on large draw call counts.)

Asynchronous compute and shading will shake things up nicely. http://www.anandtech.com/show/9124/amd-dives-deep-on-asynchronous-shading Early GCN, Kepler and Maxwell 1 will struggle with this (Kepler and Maxwell 1 can't do it at all, and can only have 1 graphics context running), but Maxwell 2 and later GCN cards should handle it nicely.
By later GCN cards, does this include the R9 300 series?

Also, are the XBone and PS4 capable of taking advantage of this, or will it be limited to PC-only paths?
 

NTMBK

Lifer
Nov 14, 2011
10,292
5,256
136
By later GCN cards, does this include the R9 300 series?

Also, are the XBone and PS4 capable of taking advantage of this, or will it be limited to PC-only paths?

Some of the 300 series, but not all. Hawaii, Fiji and Tonga all have beefed-up GPU queues, while Bonaire and Pitcairn do not. (More details in the Anandtech article I linked.) They aren't completely limited like Kepler, but they don't have as much flexibility as the later models.

As for the consoles- yes, they will support this. The PS4 has 8 asynchronous compute engines (like Hawaii and Tonga), while the XBox One has 2 (like Pitcairn). This is in addition to their graphics command processor.

This image from the original GCN launch should give you some idea of what an ACE is:



http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/5

The ACEs can feed compute based tasks directly to the compute units, bypassing the traditional graphics pipeline. The GCP handles traditional graphics ("here is a list of polygons, go rasterize them!").
 

werepossum

Elite Member
Jul 10, 2006
29,873
463
126
Some of the 300 series, but not all. Hawaii, Fiji and Tonga all have beefed-up GPU queues, while Bonaire and Pitcairn do not. (More details in the Anandtech article I linked.) They aren't completely limited like Kepler, but they don't have as much flexibility as the later models.

As for the consoles- yes, they will support this. The PS4 has 8 asynchronous compute engines (like Hawaii and Tonga), while the XBox One has 2 (like Pitcairn). This is in addition to their graphics command processor.

This image from the original GCN launch should give you some idea of what an ACE is:



http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute/5

The ACEs can feed compute based tasks directly to the compute units, bypassing the traditional graphics pipeline. The GCP handles traditional graphics ("here is a list of polygons, go rasterize them!").
I mean specifically Hawaii, I just didn't think to say. Sometimes I forget that not just the 290/290X were rebranded, but also older models of GCN.

This is why I've switched from leaning toward a GTX970 to an R9 390 even though I'll be gaming at 1080p. Seems to me that the Hawaii chips should benefit more from DX12 because of similarities with the consoles' GPUs as well as being better designed for a low level API (Mantle.)
 

NTMBK

Lifer
Nov 14, 2011
10,292
5,256
136
I mean specifically Hawaii, I just didn't think to say. Sometimes I forget that not just the 290/290X were rebranded, but also older models of GCN.

This is why I've switched from leaning toward a GTX970 to an R9 390 even though I'll be gaming at 1080p. Seems to me that the Hawaii chips should benefit more from DX12 because of similarities with the consoles' GPUs as well as being better designed for a low level API (Mantle.)

Ah, the 970 is Maxwell 2, so it should benefit just as much as Hawaii
 

werepossum

Elite Member
Jul 10, 2006
29,873
463
126
Ah, the 970 is Maxwell 2, so it should benefit just as much as Hawaii
Do you think Tier 3 Resource Binding won't be used heavily, or just that Maxwell's architecture will compensate with Tier 2? I'm thinking specifically of mega-textures combined with heavy draw calls here (even though the very preliminary testing I've seen has been pretty contrived), and I'm assuming that devs will leverage the XBone's ability (to offset the PS4's better speed) and that the PC DX12 port will inherit it.
 

stateofmind

Senior member
Aug 24, 2012
245
2
76
www.glj.io
Asynchronous compute and shading will shake things up nicely. http://www.anandtech.com/show/9124/amd-dives-deep-on-asynchronous-shading Early GCN, Kepler and Maxwell 1 will struggle with this (Kepler and Maxwell 1 can't do it at all, and can only have 1 graphics context running), but Maxwell 2 and later GCN cards should handle it nicely.

I've read it all, and it seems that my knowledge and my Google abilities are not sufficient. Why can the physics/lighting/memory parts be executed in parallel / separately? What exactly are they doing?
I can't find a single good explanation or some kind of example pipeline graph.
 

Azix

Golden Member
Apr 18, 2014
1,438
67
91
Ah, the 970 is Maxwell 2, so it should benefit just as much as Hawaii

Still hoping someone explains the Maxwell 2 situation. It needs to be looked into. I suspect they still don't properly support it; that, or Nvidia thinks it's not important.

I've read it all, and it seems that my knowledge and my Google abilities are not sufficient. Why can the physics/lighting/memory parts be executed in parallel / separately? What exactly are they doing?
I can't find a single good explanation or some kind of example pipeline graph.

It does seem like things can be done simultaneously. With DX11 it would appear as if the compute and graphics queues could not operate at the same time. So in DX12 you can perform compute tasks while handling graphics tasks (compute can also be used for graphics?). That should be a huge benefit: e.g. physics on the GPU would use compute, and otherwise the graphics would wait for the physics simulation to finish before it's all sent to the display as one frame. Overlapping them should reduce how long it takes to construct each frame.

Anyone who knows better feel free to correct me.
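
For what it's worth, here's a minimal sketch of what that looks like at the API level in D3D12 (Windows-only, error handling stripped, and it only shows queue creation, not a full renderer): the app creates separate graphics and compute command queues, and the driver/hardware is then free to overlap work submitted to them where it can.

Code:
// Minimal D3D12 sketch: one direct (graphics) queue + one independent compute queue.
// Windows only; link with d3d12.lib. Error handling omitted for brevity.
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
#include <cstdio>

using Microsoft::WRL::ComPtr;

int main() {
    ComPtr<ID3D12Device> device;
    if (FAILED(D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device))))
        return 1;

    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;       // graphics + compute + copy
    ComPtr<ID3D12CommandQueue> gfxQueue;
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&gfxQueue));

    D3D12_COMMAND_QUEUE_DESC computeDesc = {};
    computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // compute + copy only
    ComPtr<ID3D12CommandQueue> computeQueue;
    device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));

    // Recorded command lists get submitted to either queue with ExecuteCommandLists();
    // ID3D12Fence objects express only the ordering the app actually needs
    // (e.g. "the lighting compute must finish before the final composite").
    std::printf("Created a graphics queue and an independent compute queue.\n");
    return 0;
}

As I understand it, on GCN the compute queues map onto the ACEs, which is what lets the compute work run alongside the graphics work instead of waiting behind it.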

 

3DVagabond

Lifer
Aug 10, 2009
11,951
204
106
This is interesting:
There is also one exception to the DX11 rule that we’ll get to in depth a bit later, but in short that exception is custom middleware like LiquidVR. Even in a DX11 context LiquidVR can leverage some (but not all) of the async shading functionality of GCN GPUs to do things like warping asynchronously, as it technically sits between DX11 and the GPU. This in turn is why async shading is so important to AMD's VR plans, as all of their GCN GPUs are capable of this and it can be exposed in the current DX11 ecosystem.

So, AMD can perform async shading. LiquidVR is not an API, as I've read some people saying, but actually middleware that functions between DX and the GPU.

It appears that AT might have this wrong.

Meanwhile Maxwell 2 has 32 queues, composed of 1 graphics queue and 31 compute queues (or 32 compute queues total in pure compute mode).



AMD has 8 ACEs, and each one can support 8 queues. That's 64 queues compared to Maxwell 2's 32.

Anyone else reading this differently, or can confirm?
 
Feb 19, 2009
10,457
10
76
From my other post..

AT's article is wrong; people told Ryan already in the comments, but he refused to fix it. He's quoting "compute engines", which is fine: on later GCN there are 8. But for NV he lists 32, which is plainly wrong. That's 32 queues; it has only 1 engine. It's not an accurate chart. Either go with engines or with queues, don't mix and match to inflate NV's and deflate AMD's uarch async abilities.

AMD GCN from Hawaii onwards has 1 CP + 8 ACE (Compute Engines).

In pure compute mode, it has 64 queues.
In rendering/compute, it has 1 rendering + 8 compute (each compute engine is independent) queues.



Maxwell 2 has 1 Compute Engine; it can handle up to 32 compute queues when operating in pure compute mode.

In mixed mode, this single engine can handle 1 rendering + 1 compute queue asynchronously.

This is better than Kepler, which can't do that at all; it has to wait for one task to finish before doing the next.



What this means is if async compute is used in DX12 games, Kepler is crippled (780Ti will be behind 970) and GCN pulls ahead.



The more async compute used, the more GCN can flex its 8 ACE engines!

Edit: I suspect the poor showing of Kepler in recent titles is due to the shift towards using compute for deferred lighting, as in Ryse, Evolve and The Witcher 3.
Also, Kepler/Maxwell has 2 DMA engines, but from my reading only 1 is enabled on consumer SKUs; both are fully enabled on Teslas to use Hyper-Q (basically ensuring the single engine can reach its peak of 32 queues).

We should see Pascal significantly improve on the Compute capabilities as DX12 matures, async compute will routinely be leveraged by game devs.

In before some anti-AMD hater claims GCN is outdated or obsolete... basically GCN was made for DX12 (aka Mantle/Vulkan with an MS logo on it)...

 

zlatan

Senior member
Mar 15, 2011
580
291
136
The queue engine or engines are just one specific aspect of what async shaders need. Efficiency is limited more by state management than by the queue engines. Most of today's architectures have limited async shader efficiency because the compute shader needs a specific hardware state. This wasn't a problem when the workload was serialized in the current APIs, but with parallel execution the hardware is not able to run two parallel pipelines when they need different hardware states.
GCN has a very big advantage in async workloads because the hardware can run compute shaders with any hardware state; basically, compute is stateless on this architecture. This is also the main reason why the VR experience is so fluid with GCN: it can execute the async timewarps very efficiently.
 
Feb 19, 2009
10,457
10
76
@zlatan
Interesting you bring that up; it's the same thing from sebbbi (a game dev) over at B3D:

https://forum.beyond3d.com/threads/direct3d-feature-levels-discussion.56575/page-18#post-1851420

AMD's asynchronous compute implementation is also very good, as the fully bindless nature of their GPU means that the CUs can do very fine grained simultaneous execution of multiple shaders. Don't get fooled by the maximum amount of compute queues (shown by some review sites). Big numbers don't tell anything about the performance. Usually running two tasks simultaneously gives the best performance. Running significantly more just trashes the data and instruction caches.

The terminology is exactly what the lead programmer at Lionhead used during AMD's E3 PC Gaming presentation about async compute in DX12 in Fable, for lighting, physics and effects. The lead programmer basically said that due to the nature of GCN, fine-grained simultaneous use of the shaders for async compute basically makes those features "free" in performance (it does not detract from rendering performance).
 

werepossum

Elite Member
Jul 10, 2006
29,873
463
126
@zlatan
Interesting you bring that up; it's the same thing from sebbbi (a game dev) over at B3D:

https://forum.beyond3d.com/threads/direct3d-feature-levels-discussion.56575/page-18#post-1851420



The terminology is exactly what the lead programmer at Lionhead used during AMD's E3 PC Gaming presentation about async compute in DX12 in Fable, for lighting, physics and effects. The lead programmer basically said that due to the nature of GCN, fine-grained simultaneous use of the shaders for async compute basically makes those features "free" in performance (it does not detract from rendering performance).
This was my understanding: basically it wasn't that NVidia didn't prepare for DX12, but rather that with Kepler's architecture, simulations showed no significant benefit above two simultaneous threads/tasks (i.e. the delay in waiting for a task erased any speed advantage from the multiplicity). So there was no point in making huge design changes to the architecture, thereby delaying Maxwell, only enough change to handle one simultaneous compute task. But honestly it's been a year or more since I've read up on it, so my recall as well as my understanding may be betraying me. lol

My understanding was further that the extreme similarities between the consoles' GPU structure and AMD's GCN would make it easier to leverage the benefits of DX12 for GCN than for NVidia's architecture. Also, I have been assuming that due to the XBox One's slight weakness compared to the PS4, developers would be looking for ways to leverage its advantages (one being Tier 3 Resource Binding) to avoid providing a subpar XBone experience and thus, a fine AMD port would be relatively painless (for the developer) compared to a fine NVidia port. But as it's been even longer since I read up on console architecture, YMMV. Hell, MMMV.

It's also worth pointing out that even on the off chance that I'm correct, this advantage would largely disappear in AAA titles, where developers have ample resources to work with NVidia to maximize their engine's performance. But it's given me the notion (right or wrong) that the R9 390 will more often get a bigger DX12 boost than will the GTX970, even though it's a truly amazing piece of kit. That and cold feet about future Fallout 4 texture mods hitting the 3.5GB VRAM wall are making me lean toward the R9 390 even though it's heavily leveraged (thus heavily stressed) compared to the GTX970.

Thanks for everyone helping drag me into DX12 familiarity and hopefully avoiding a $300+ mistake. (Not that either card could really be considered a mistake.)
 

stateofmind

Senior member
Aug 24, 2012
245
2
76
www.glj.io
@zlatan
Interesting you bring that up; it's the same thing from sebbbi (a game dev) over at B3D:

https://forum.beyond3d.com/threads/direct3d-feature-levels-discussion.56575/page-18#post-1851420



The terminology is exactly what the lead programmer at Lionhead used during AMD's E3 PC Gaming presentation about async compute in DX12 in Fable, for lighting, physics and effects. The lead programmer basically said that due to the nature of GCN, fine-grained simultaneous use of the shaders for async compute basically makes those features "free" in performance (it does not detract from rendering performance).

Can you please explain to me how it is that physics and lighting can be executed at the same time? I'm not familiar with the actual workings.
Intuitively, lighting and physics are related...
 

Fox5

Diamond Member
Jan 31, 2005
5,957
7
81
Can you please explain to me how it is that physics and lighting can be executed at the same time? I'm not familiar with the actual workings.
Intuitively, lighting and physics are related...

Kinematics modeling and lighting are both physics, but they are computed separately. Games don't use a Grand Unified Theory of everything.

Now then, lighting is still dependent on object positions. Most likely they just do time steps. For instance, PhysX typically runs at 50 Hz, so 50 times a second, regardless of framerate, the physics is updated. Lighting would most likely be computed at frame rate.
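
A bare-bones sketch of that fixed-timestep pattern (the function names are just placeholders, not from any real engine): physics advances in constant 50 Hz steps driven by an accumulator, while rendering and lighting happen once per frame.

Code:
// Bare-bones fixed-timestep loop: physics at a constant 50 Hz, rendering once per frame.
#include <chrono>

void update_physics(double dt) { /* advance rigid bodies, cloth, etc. by dt seconds */ }
void render_frame()            { /* draw the scene; lighting is computed here, once per frame */ }

int main() {
    using clock = std::chrono::steady_clock;
    const double physics_dt = 1.0 / 50.0;   // PhysX-style fixed 50 Hz step
    double accumulator = 0.0;
    auto previous = clock::now();

    for (int frame = 0; frame < 1000; ++frame) {    // stand-in for "while (running)"
        auto now = clock::now();
        accumulator += std::chrono::duration<double>(now - previous).count();
        previous = now;

        // Run as many fixed physics steps as the elapsed time calls for...
        while (accumulator >= physics_dt) {
            update_physics(physics_dt);
            accumulator -= physics_dt;
        }
        // ...then render once, lighting whatever positions the last physics step produced.
        render_frame();
    }
    return 0;
}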
 

stateofmind

Senior member
Aug 24, 2012
245
2
76
www.glj.io
Kinematics modeling and lighting are both physics, but they are computed separately. Games don't use a Grand Unified Theory of everything.

Now then, lighting is still dependent on object positions. Most likely they just do time steps. For instance, PhysX typically runs at 50 Hz, so 50 times a second, regardless of framerate, the physics is updated. Lighting would most likely be computed at frame rate.

Many thanks for taking the time to help!
Can you refer me to some good sources? I'm trying to wrap my head around this.

So, lighting is really broken up into parts of the scene?

(I guessed that physics, lighting and probably other stuff are computed for parts of the scene to reduce load or something, but that hasn't been obvious from any of the articles on any English-language hardware site over the years.)
 

Fox5

Diamond Member
Jan 31, 2005
5,957
7
81
Many thanks for taking the time to help!
Can you refer me to some good sources? I'm trying to wrap my head around this.

So, lighting is really broken up into parts of the scene?

(I guessed that physics, lighting and probably other stuff are computed for parts of the scene to reduce load or something, but that hasn't been obvious from any of the articles on any English-language hardware site over the years.)

Lighting can be computed in two ways.

The classic way of doing it is forward rendering. You basically have to re-render the scene for each light to compute its lighting contribution, so your performance is roughly divided by the number of lights. It's hard to get above about 8 lights in a scene like this, but it doesn't really cause any other issues.

There's also deferred lighting, in which case lighting is deferred until the very end and just rendered on the final 2D image. Lighting is computed at the resolution of the image, and you can do thousands of lights because it's more of a filter on the final image. However, it breaks anti-aliasing and in general doesn't look as good as forward rendering.

Physics computations are unrelated to the lighting. The objects will be moved into the correct positions each frame prior to any lighting being rendered. You could just calculate positions every frame, but if things are non-deterministic (like dynamic physics) then you really don't want a variable number of calculations per second, or the results could change between runs. You also don't want to compute something very expensive more often than needed, and tying it to the framerate would drag the framerate down too.
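
To make the deferred part concrete, here's a very stripped-down CPU-side sketch of deferred lighting (plain C++ standing in for what a pixel or compute shader would do; the structs and the linear falloff are simplified assumptions). The point is that the cost scales with pixels × lights rather than with re-drawing the scene geometry per light.

Code:
// Toy deferred lighting: light the final image from a G-buffer; no geometry re-rendered per light.
#include <algorithm>
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };
struct GBufferTexel { Vec3 position; Vec3 normal; Vec3 albedo; }; // written by the geometry pass
struct PointLight   { Vec3 position; Vec3 color; float radius; };

static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// One pass over the screen-sized G-buffer, accumulating every light's contribution per pixel.
std::vector<Vec3> shade(const std::vector<GBufferTexel>& gbuffer,
                        const std::vector<PointLight>& lights) {
    std::vector<Vec3> out(gbuffer.size(), Vec3{0, 0, 0});
    for (size_t i = 0; i < gbuffer.size(); ++i) {
        const GBufferTexel& px = gbuffer[i];
        for (const PointLight& l : lights) {            // cost ~ pixels * lights
            Vec3 toLight{l.position.x - px.position.x,
                         l.position.y - px.position.y,
                         l.position.z - px.position.z};
            float dist = std::sqrt(dot(toLight, toLight));
            if (dist > l.radius || dist <= 0.0f) continue;  // outside the light's range
            Vec3 dir{toLight.x / dist, toLight.y / dist, toLight.z / dist};
            float ndotl = std::max(0.0f, dot(px.normal, dir));
            float atten = 1.0f - dist / l.radius;           // crude linear falloff
            out[i].x += px.albedo.x * l.color.x * ndotl * atten;
            out[i].y += px.albedo.y * l.color.y * ndotl * atten;
            out[i].z += px.albedo.z * l.color.z * ndotl * atten;
        }
    }
    return out;
}

int main() { return 0; }  // shade() would be called with a real G-buffer and light list

In a forward renderer, the inner work would instead involve re-drawing the scene geometry for each light, which is why the light count is so much more limited there.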
 

NTMBK

Lifer
Nov 14, 2011
10,292
5,256
136
Interesting post on this topic from Scott Wasson at the Tech Report:

One possible answer to the question of why Fiji (in both Fury and X forms) seems to underperform is something one of the other commenters has alluded to. The front end of Fiji looks very much like that of Tonga or Hawaii; it just has more CUs per cluster than Hawaii. It's possible the front end of the GPU is a bottleneck in many games, which could explain Fiji's similar performance to Hawaii, despite all the extra resources elsewhere.

If that's the case, and if the issue is just not "feeding the beast" quickly enough (and not just rasterization rates), then it's possible that the coming shift to DX12 and Vulkan could be a big boon to AMD's GCN-based GPUs. They may then be able to use their eight ACE engines to schedule lots of work in parallel and keep those big shader arrays active. Doing so could lead to a surprising turnaround in relative GPU performance.

I also expect DX12 and Vulkan to lead to much lower frame times from games generally thanks mostly to a reduction in serialization and single-thread CPU overhead. This development could help AMD more than Nvidia--in part because AMD needs the help more, and in part because of GCN's dormant ACE engines. Also, these "thin" APIs will move a lot of control back to game developers, taking the ability to optimize things behind the scenes out of the hands of the GPU driver guys--at least in theory. Fascinated to see how that plays out.

http://techreport.com/discussion/28...ury-graphics-card-reviewed?post=921201#921201

He makes a very good point- the ACEs will let AMD bypass Fiji's biggest bottleneck, the inability to provide enough work to the shader clusters.
 
Feb 19, 2009
10,457
10
76
That's nothing new from what people (game devs as well as AMD's engineers directly) have been saying for a long time. GCN is built for a Mantle-like API, as its uarch cannot be taken full advantage of with DX11. It was a forward-looking uarch, meant to last them a long time.

I am hoping that in the not-too-distant future I can buy a Zen APU with GCN 2 and HBM2; that would make for an awesome base for an expanding rig, i.e. plug in a GCN 2 dGPU and have DX12 games take advantage of both.
 