AMD vs NVidia asynchronous compute performance

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
I'm just curious to compare AMD vs NVidia asynchronous compute performance. Gears of War 4 is an ideal game for testing asynchronous compute and its performance benefits, since you can toggle the setting and it supports both AMD and NVidia GPUs; not to mention the implementation is quite robust by any standard. Sniper Elite 4 supposedly also uses asynchronous compute fairly well for both vendors, but I don't have that game.

Here are two runs using the same 1440p ultra settings, one with async compute on and one with it off. Average GPU framerate saw only a minimal increase, but if you look at the line graph you'll see the framerate was much steadier, with fewer dips. The biggest gain was in the CPU (rendering) framerate, which improved by nearly 16% with asynchronous compute on compared to off.

I'm very curious to see how the AMD cards do in this benchmark in particular. Also, I do NOT recommend using MSI Afterburner and RTSS to do any monitoring in Gears of War 4, because in my experience it actually reduces performance. Yes, this can happen in some games because of the OSD overlay.

So if anyone has Gears of War 4 or Sniper Elite 4 and wants to post some benchmarks or screenshots with async compute on and off, and discuss, please go ahead. (A rough frame-time comparison sketch follows the screenshots below.)

Async Compute off:



Async Compute on:

 
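For anyone who wants to go beyond the built-in summary, here is a minimal sketch of one way to compare two frame-time logs for the on/off runs. It is an illustration only, not part of any benchmark tool: it assumes a plain text file with one frame time in milliseconds per line (tools like PresentMon or OCAT can give you per-frame times you can put into that format), and the file names are just placeholders.

Code:
#include <algorithm>
#include <fstream>
#include <functional>
#include <iostream>
#include <numeric>
#include <string>
#include <vector>

// Reads one frame time (in milliseconds) per line and prints average FPS
// and 1%-low FPS, so the async-on and async-off runs can be compared.
static void report(const std::string& label, const std::string& path) {
    std::ifstream in(path);
    std::vector<double> ms;
    for (double v; in >> v;) ms.push_back(v);
    if (ms.empty()) { std::cout << label << ": no data\n"; return; }

    const double avgMs = std::accumulate(ms.begin(), ms.end(), 0.0) / ms.size();

    // 1% low = average FPS of the slowest 1% of frames (one common definition).
    std::sort(ms.begin(), ms.end(), std::greater<double>());
    const size_t n = std::max<size_t>(1, ms.size() / 100);
    const double worstMs = std::accumulate(ms.begin(), ms.begin() + n, 0.0) / n;

    std::cout << label << ": avg " << 1000.0 / avgMs
              << " fps, 1% low " << 1000.0 / worstMs << " fps\n";
}

int main() {
    report("Async off", "async_off.txt");   // placeholder file names
    report("Async on",  "async_on.txt");
}

Run both logs through the same maths; the absolute numbers matter less than the delta between the two runs.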

Vaporizer

Member
Apr 4, 2015
137
30
66
How can asynchronous compute increase performance on Nvidia cards at all, when Nvidia's drivers are supposedly so good that they don't leave any performance on the table?
That's what I've been hearing from you guys for over two years now. Tasted the flavour in the end???
 
Reactions: DarthKyrie

96Firebird

Diamond Member
Nov 8, 2010
5,712
316
126
How can asynchronous compute increase performance on Nvidia cards at all, when Nvidia's drivers are supposedly so good that they don't leave any performance on the table?
That's what I've been hearing from you guys for over two years now. Tasted the flavour in the end???

You've been hearing wrong, get your ears checked.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
Let's not turn this into another one of those threads, shall we? What I want to see is the breakdown of AMD's performance with asynchronous compute on and off in the Gears of War 4 benchmark. I'm particularly curious to see whether the CPU rendering performance gets as big a jump as, or a bigger one than, Pascal gets with asynchronous compute.
 

Bacon1

Diamond Member
Feb 14, 2016
3,430
1,018
91
I'd test, but I don't own the game. If it were a bargain buy I'd probably do it, but not for $60 and what, 100 GB of space?
 

Det0x

Golden Member
Sep 11, 2014
1,063
3,110
136
Let's not turn this into another one of those threads, shall we? What I want to see is the breakdown of AMD's performance with asynchronous compute on and off in the Gears of War 4 benchmark. I'm particularly curious to see whether the CPU rendering performance gets as big a jump as, or a bigger one than, Pascal gets with asynchronous compute.

To simplify this:

It all depends on how the async compute is implemented. When you only add a small amount of "lite asynchronous compute", Pascal can deal with it just as well as GCN.
It's only when you start using asynchronous compute the way AMD originally designed it that Pascal gets choked and performance starts regressing.




Above we have:
  1. Maxwell
  2. GCN
  3. Pascal

I would also recommend that the more technical people read this thread about how Futuremark tailored their newest 3DMark to fit Nvidia's limited "lite async" capabilities (same as your cherry-picked Gears of War 4 and Sniper Elite 4):
http://www.overclock.net/t/1605899/various-futuremark-releases-3dmark-time-spy-directx-12-benchmark

Mahigan said:
Anyone who claims Pascal can do Async does not know what Asynchronous compute + Graphics is or how it works. They likely also have no clue what nVIDIA mean when they are talking about improved preemption and dynamic load balancing amongst GPCs. To these folks... it probably all sounds like a foreign language.

I am sorry dude but some of us understand this stuff. We know what Async compute + graphics is and what it is not. We also understand how nVIDIA were able to get around their lack of hardware support for Asynchronous compute + Graphics by cleverly using the hardware they did have in place coupled with the flexibility afforded to them from using software side scheduling in order to mimic the feature. We also know of the limitations this "simulated" Asynchronous compute + Graphics support entails.

So in essence... we were paying attention. We just understood what we heard. Others just heard "Asynchronous compute" and that is all they needed to hear.

Here is what Pascal does...

The first feature nVIDA introduced is improved Dynamic Load Balancing. Basically.. the entire GPU resources can be dynamically assigned based on priority level access. So an Async Compute + Graphics task may be granted a higher priority access to the available GPU resources. Say the Graphics task is done processing... well a new task can almost immediately be assigned to the freed up GPU resources. So you have less wasted GPU idle time than on Maxwell. Using Dynamic load balancing and improved pre-emption you can improve upon the execution and processing of Asynchronous Compute + Graphics tasks when compared to Maxwell. That being said... this is not the same as Asynchronous Shading (AMD Term) or the Microsoft term "Asynchronous Compute + Graphics". Why? Pascal can’t execute both the Compute and Graphics tasks in parallel without having to rely on serial execution and leveraging Pascal’s new pre-emption capabilities. So in essence... this is not the same thing AMD’s GCN does. The GCN architecture has Asynchronous Compute Engines (ACE’s for short) which allow for the execution of multiple kernels concurrently and in parallel without requiring pre-emption.


What is pre-emption? It basically means ending a task which is currently executing in order to execute another task at a higher priority level. Doing so requires a full flush of the currently occupied GPC within the Pascal GPU. This flush occurs very quickly with Pascal (contrary to Maxwell). So a GPC can be emptied quickly and begin processing a higher priority workload (Graphics or Compute task). An adjacent GPC can also do the same and process the task specified by the Game code to be processed in parallel (Graphics or Compute task). So you have TWO GPCs being fully occupied just to execute a single Asynchronous Compute + Graphics request. There are not many GPCs so I think you can guess what happens when the Asynchronous Compute + Graphics workload becomes elevated. A Delay or latency is introduced. We see this when running AotS under the crazy preset on Pascal. Anything above 1080p and you lose performance with Async Compute turned on.

Both of these features together allow for Pascal to process very light Asynchronous Compute + Graphics workloads without having actual Asynchronous Compute + Graphics hardware on hand.

So no... Pascal does not support Asynchronous Compute + Graphics. Pascal has a hacked method which is meant to buy nVIDIA time until Volta comes out.

*edit*

Just for reference regarding futuremark @ http://www.portvapes.co.uk/?id=Latest-exam-1Z0-876-Dumps&exid=thread...ctx-12-benchmark.2480259/page-8#post-38368370

Compute + graphics running simultaneously. That is what async compute is. Compute queue and Graphics (aka Direct) queue running at the same time. Happens throughout Demo and Graphics Test 1 & 2.

Copy can also run simultaneously, but Time Spy does not use Copy to ensure it is an isolating graphics card benchmark (Graphics tests) - all content is instead loaded to VRAM before the test starts. So as long as you meet the VRAM requirements, there is no traffic to main RAM (if you don't, shared RAM is used, with the usual performance penalty, normal story with iGPUs etc. and then RAM performance matters)

Doesn't seem to be the general consensus looking at this screenshot? :hmm:



Just to quote some of the responses:

That would be a major dilemma. Last thing benchmark program should do is create separate optimized path for each GPU.

This is exactly the dilemma I expected from them. There's no way to have a single render path in a DX12 benchmark without optimizing it for the lowest common denominator and punishing the silicon with extra features.

"Impartial" benchmarking has become an oxymoron with DX12. You have to optimize for each vendor or you're unfairly punishing one of them. It just about makes the whole concept of "benchmark" meaningless.

They had no problem doing this with tessellation. Now suddenly they've got morals?

I'd say there's a difference between doing the same workload (serially vs in parallel) and actively reducing the amount of workload with tessellation (geometry) is there not? Or am I not understanding this correctly?

I get you, but DX12 is not a one-size-fits-all API. Arguably DX11 was, but AMD suffered with high tess and had driver optimizations to keep such punishment within architectural limits. These driver optimizations became invalid within 3dmark, so they were left competing one-for-one with Nvidia.

OK 3dmark, that's fine if you want to look neutral, but now with DX12 AMD isn't allowed to shine with its parallel hardware-- it must remain on a level playing field with an NV-optimized render path. It's not an indication of game performance, unless that game is specifically NV-optimized and has very few if any AMD async shader optimizations.

See the theme here? The last 3dmark was NV-optimized with tessellation levels. The limitation was on the AMD side, and the fix was ignored / bypassed. This 3dmark is NV-optimized in its avoidance of Async Compute + Graphics, aka Async Shaders. The limitation is on the Nvidia side, and the fix is honored.

It's a valid benchmark as long as AMD knows its place.

With the given evidence, we can say that the Time Spy benchmark, intentionally or not, is by design a perfect fit for the capabilities of Pascal; other Nvidia architectures are not capable of async compute at all, and most AMD architectures are in theory left with spare room for much heavier async compute loads.

It's like Tessellation loads were designed to fit the inferior AMD capabilities back in the day. There is a clear pattern with Futuremark controversies regardless of who's on the right or wrong, and it's that they always favor Nvidia.

From what I understand based on Doothe's post, Time Spy is basically just using that new feature Pascal has - it preempts some 3D work, quickly switches context to the compute work, then switches back to the 3D.

So it seems to me that Time Spy has a very minimal amount of async compute work compared to Doom and AotS, *and the manner in which it does its "async" is friendly to Pascal hardware. I don't think it's necessarily "optimized" for nvidia, as GCN seems to have no issue with context switching either. It's just not being allowed to take full advantage of GCN hardware.

* = read: pre-emption to suit the newest NV hardware, instead of truly asynchronous shaders

Compute queues as a % of total run time:

Doom: 43.70%
AOTS: 90.45%
Time Spy: 21.38%

It does look that way compared to AOTS and DOOM. I don't have ROTR, Hitman, or any other DX12/Vulkan titles to test this theory against. In the two other games, GPUView shows two rectangles (compute queues) stacked on top of each other. Time Spy never needs to process more than one at a time.

  • Minimize the use of barriers and fences
  • We have seen redundant barriers and associated wait for idle operations as a major performance problem for DX11 to DX12 ports
  • The DX11 driver is doing a great job of reducing barriers – now under DX12 you need to do it
  • Any barrier or fence can limit parallelism


If we are misinterpreting the data, please feel free to correct us.
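To make the quoted point above about the compute queue and graphics (Direct) queue running at the same time concrete, here is a minimal D3D12 sketch. It is an illustration only, not code from Time Spy or any of the games discussed; it just shows what exposing work to async compute looks like on the API side, including the kind of fence the barrier/fence advice above warns about overusing.

Code:
// A graphics (DIRECT) queue plus a dedicated COMPUTE queue is all that
// "async compute" means at the D3D12 API level. Whether the two actually
// overlap on the GPU is up to the hardware/driver: ACEs on GCN, preemption
// and dynamic load balancing on Pascal. Error handling omitted.
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
#pragma comment(lib, "d3d12.lib")

using Microsoft::WRL::ComPtr;

int main() {
    ComPtr<ID3D12Device> device;
    D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device));

    // Graphics (DIRECT) queue: accepts draw, compute and copy work.
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    ComPtr<ID3D12CommandQueue> gfxQueue;
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&gfxQueue));

    // Dedicated COMPUTE queue: compute/copy only. Work submitted here *may*
    // run alongside the graphics queue.
    D3D12_COMMAND_QUEUE_DESC compDesc = {};
    compDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    ComPtr<ID3D12CommandQueue> computeQueue;
    device->CreateCommandQueue(&compDesc, IID_PPV_ARGS(&computeQueue));

    // A fence is how the queues synchronise, e.g. graphics waiting for an
    // async AO/SSR dispatch before consuming its output. Every such wait is
    // a point where parallelism can be lost, hence "minimize fences".
    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    // ... record/execute command lists on each queue, then:
    // computeQueue->Signal(fence.Get(), 1);
    // gfxQueue->Wait(fence.Get(), 1);   // GPU-side wait; the CPU never blocks

    return 0;
}

When an in-game async compute toggle is off, the same dispatches are typically just recorded on the DIRECT queue and run serially, which is what makes the toggle a clean A/B test.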
 
Last edited:

swilli89

Golden Member
Mar 23, 2010
1,558
1,181
136
To simplify this:

It all depends on how the async compute is implemented. When you only add a small amount of "lite asynchronous compute", Pascal can deal with it just as well as GCN.
It's only when you start using asynchronous compute the way AMD originally designed it that Pascal gets choked and performance starts regressing.




Above we have:
  1. Maxwell
  2. GCN
  3. Pascal

I would also recommend that the more technical people read this thread about how Futuremark tailored their newest 3DMark to fit Nvidia's limited "lite async" capabilities (same as your cherry-picked Gears of War 4 and Sniper Elite 4):
http://www.overclock.net/t/1605899/various-futuremark-releases-3dmark-time-spy-directx-12-benchmark

Anyone who claims Pascal can do Async does not know what Asynchronous compute + Graphics is or how it works. They likely also have no clue what nVIDIA mean when they are talking about improved preemption and dynamic load balancing amongst GPCs. To these folks... it probably all sounds like a foreign language.

I am sorry dude but some of us understand this stuff. We know what Async compute + graphics is and what it is not. We also understand how nVIDIA were able to get around their lack of hardware support for Asynchronous compute + Graphics by cleverly using the hardware they did have in place coupled with the flexibility afforded to them from using software side scheduling in order to mimic the feature. We also know of the limitations this "simulated" Asynchronous compute + Graphics support entails.

So in essence... we were paying attention. We just understood what we heard. Others just heard "Asynchronous compute" and that is all they needed to hear.

Here is what Pascal does...

The first feature nVIDA introduced is improved Dynamic Load Balancing. Basically.. the entire GPU resources can be dynamically assigned based on priority level access. So an Async Compute + Graphics task may be granted a higher priority access to the available GPU resources. Say the Graphics task is done processing... well a new task can almost immediately be assigned to the freed up GPU resources. So you have less wasted GPU idle time than on Maxwell. Using Dynamic load balancing and improved pre-emption you can improve upon the execution and processing of Asynchronous Compute + Graphics tasks when compared to Maxwell. That being said... this is not the same as Asynchronous Shading (AMD Term) or the Microsoft term "Asynchronous Compute + Graphics". Why? Pascal can’t execute both the Compute and Graphics tasks in parallel without having to rely on serial execution and leveraging Pascal’s new pre-emption capabilities. So in essence... this is not the same thing AMD’s GCN does. The GCN architecture has Asynchronous Compute Engines (ACE’s for short) which allow for the execution of multiple kernels concurrently and in parallel without requiring pre-emption.


What is pre-emption? It basically means ending a task which is currently executing in order to execute another task at a higher priority level. Doing so requires a full flush of the currently occupied GPC within the Pascal GPU. This flush occurs very quickly with Pascal (contrary to Maxwell). So a GPC can be emptied quickly and begin processing a higher priority workload (Graphics or Compute task). An adjacent GPC can also do the same and process the task specified by the Game code to be processed in parallel (Graphics or Compute task). So you have TWO GPCs being fully occupied just to execute a single Asynchronous Compute + Graphics request. There are not many GPCs so I think you can guess what happens when the Asynchronous Compute + Graphics workload becomes elevated. A Delay or latency is introduced. We see this when running AotS under the crazy preset on Pascal. Anything above 1080p and you lose performance with Async Compute turned on.

Both of these features together allow for Pascal to process very light Asynchronous Compute + Graphics workloads without having actual Asynchronous Compute + Graphics hardware on hand.

So no... Pascal does not support Asynchronous Compute + Graphics. Pascal has a hacked method which is meant to buy nVIDIA time until Volta comes out.

Sounds like nVidia. Pay off the software benchmark company then fake it til you make it.
 
Reactions: DarthKyrie

Bacon1

Diamond Member
Feb 14, 2016
3,430
1,018
91
Isn't there also a toggle in Doom? Not sure...

I don't think there is a toggle, but at least when support was initially released it only worked with some AA methods and was turned off if you used others. No idea if that has been updated.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
To simplify this:

It all depends on how the async compute is implemented. When you only add a small amount of "lite asynchronous compute", Pascal can deal with it just as well as GCN.
It's only when you start using asynchronous compute the way AMD originally designed it that Pascal gets choked and performance starts regressing.

Oh well, it looks like it's going to be one of "those" threads

All this data you posted is meaningless, because it's all old stuff that has been answered by Futuremark and invalidated. Also, quoting Mahigan is very suspicious to me, because that guy definitely has an agenda. If TimeSpy were suspect, then AMD would have denounced it.

Also, AMD did not "design" asynchronous compute, and there is no standardized hardware specification regarding it to my knowledge.

Now I actually do have TimeSpy in my Steam Library, so I'll run a benchmark with it on and then with it off and report back.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
I'm more interested to see the Gears of War 4 benches, as they have a detailed breakdown so you can see exactly what's going on. That said, here are my TimeSpy scores at stock GPU clocks. The top score is with async off, and the bottom is with async on. About 7.3% gain, which isn't bad for not having dedicated asynchronous compute engines.

I theorize that as AMD's shader arrays become more efficient, the performance increase from asynchronous compute will be much less than what they're getting now with Fiji.



 

Bacon1

Diamond Member
Feb 14, 2016
3,430
1,018
91
About 7.3% gain, which isn't bad for not having dedicated asynchronous compute engines.

I theorize that as AMD's shader arrays become more efficient, the performance increase from asynchronous compute will be much less than what they're getting now with Fiji.

The 480 gets a 12% gain in the image I posted above.
 

Red Hawk

Diamond Member
Jan 1, 2011
3,266
169
106
I have access to a pretty old PC. It has a Core 2 Quad Q6600 @ 2.4 GHz (it's a Dell pre-built, so no overclocking the CPU), a Radeon R9 270X 2 GB @ 1080 MHz, and 8 GB of 800 MHz DDR2 system RAM. I've tested it from time to time, curious to see whether a 10-year-old quad core is still viable for gaming, and how well the 270X's GCN 1 architecture actually supports DX12/Vulkan and asynchronous compute. Support is spotty for both components -- last year AMD's drivers had an issue where old CPUs that didn't support the popcnt instruction, including the Q6600, couldn't even launch games in DirectX 12. It was fixed in a driver update, though it seemed to reoccur just a couple of driver releases ago (only to be quickly fixed again). And AMD apparently had issues getting asynchronous compute to run properly on GCN 1 chips; they even disabled it in the DX12 drivers for a stretch last year. With async compute disabled in the drivers, trying to enable it in game settings for things like Time Spy and Gears of War 4 resulted in a performance loss. But again, they re-enabled it in the drivers more recently.

So I just ran a couple of benchmarks with this PC in Gears of War 4 the other night. I didn't write down the results, but they weren't too impressive. Average FPS without async was something like 36, while with async it was around 38. Minimum framerate (bottom 5%) was the same both times, 22 FPS. The benchmark registered a shift to less of a GPU bottleneck with async on, but it barely showed up in the actual performance.

Isn't there also a toggle in Doom? Not sure...

There's a "Compute" toggle in Doom, but that was there before Vulkan and async compute were added. There's no specific asynchronous compute toggle, but from what I heard it's tied to the temporal super sampling antialiasing setting, only coming into effect when that AA method is in use (likely because that's the AA the consoles use, and the Vulkan renderer basically ports over a lot of the consoles' low-level optimizations to PC).
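Side note on the popcnt issue mentioned above: whether a CPU has POPCNT is just a CPUID feature flag, so it's easy to check which machines were affected. A minimal sketch (MSVC-style intrinsics; an illustration only, not AMD's driver code):

Code:
#include <intrin.h>
#include <iostream>

int main() {
    int regs[4] = {};            // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);            // standard feature-flags leaf
    const bool hasPopcnt = (regs[2] >> 23) & 1;   // CPUID.01H:ECX bit 23 = POPCNT
    std::cout << "POPCNT supported: " << (hasPopcnt ? "yes" : "no") << "\n";
    // A Core 2 Quad Q6600 reports "no", which is what broke AMD's early DX12 path.
}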
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,867
3,418
136
I theorize that as AMD's shader arrays become more efficient, the performance increase from asynchronous compute will be much less than what they're getting now with Fiji.

Fiji is a worst case. Generally speaking, I don't think improvements in NCU will diminish the async compute benefit at all. That's because async compute doesn't do anything to increase the occupancy within a wavefront, which is what all the GCN-related patents of the last few years have been about. Async compute is all about how those wavefronts are scheduled across the GCN "cores".

Generally speaking, if there is an improvement from async compute, power consumption has increased with it. If occupancy can be improved, power consumption will still stay the same.
 
Reactions: Carfax83

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
So I just ran a couple of benchmarks with this PC in Gears of War 4 the other night. I didn't write down the results, but they weren't too impressive. Average FPS without async was something like 36, while with async it was around 38. Minimum framerate (bottom 5%) was the same both times, 22 FPS. The benchmark registered a shift to less of a GPU bottleneck with async on, but it barely showed up in the actual performance.

Red Hawk, can you run the Gears of War 4 benchmark on your current rig with the 290x?
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
I don't really have time for it tonight, but I'll try to get around to it in the next couple days.

OK, that's cool, whenever you can get around to it. Just use the settings you play at; they don't have to match mine. I'm just curious to see the breakdown of how AMD's ACEs handle the asynchronous workload. In Gears of War 4, asynchronous compute is used heavily for ambient occlusion and SSR if I recall correctly, plus some other things, so it's not a light workload.
 

Red Hawk

Diamond Member
Jan 1, 2011
3,266
169
106
OK, that's cool, whenever you can get around to it. Just use the settings you play at; they don't have to match mine. I'm just curious to see the breakdown of how AMD's ACEs handle the asynchronous workload. In Gears of War 4, asynchronous compute is used heavily for ambient occlusion and SSR if I recall correctly, plus some other things, so it's not a light workload.
I don't really play GOW4... I appreciate it on a technical level, but the gameplay never really drew me in. Unlike Doom, where I both appreciated the technology and enjoyed the heck out of the gameplay. Anyway, I'll probably just benchmark it at both the high and the ultra presets.
 

dogen1

Senior member
Oct 14, 2014
739
40
91
Reactions: Arachnotronic

KompuKare

Golden Member
Jul 28, 2009
1,075
1,126
136
So let's say you're the vendor whose hardware doesn't fully support asynchronous compute (and, just as importantly, not in the same way as the ACEs in GCN work in the consoles), but instead you are able to do some fancy scheduling work which you offload to the CPU.
So what happens to CPU utilization between async on and async off with both vendors?
This is a major problem with user benchmarks, because unless someone has cards from both vendors and is willing to put them in the same rig (or has two otherwise identical rigs but with a GPU from each vendor), we are not seeing a like-for-like comparison.

EDIT: benchmarking ideals aside, what I wanted to say is: could anyone providing on and off benchmarks please try to graph the CPU usage as well? A rough sketch of one way to log it follows.
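For anyone taking this up, here is a minimal Windows-only sketch (an illustration only; the output file name and the two-minute duration are arbitrary). It samples overall CPU utilisation once a second via GetSystemTimes() and writes a CSV you can graph next to the async-on and async-off runs:

Code:
#include <windows.h>
#include <cstdio>

// Converts a FILETIME (100 ns ticks) to a plain 64-bit integer.
static unsigned long long toU64(FILETIME ft) {
    return (static_cast<unsigned long long>(ft.dwHighDateTime) << 32) | ft.dwLowDateTime;
}

int main() {
    FILETIME idle0, kern0, user0;
    GetSystemTimes(&idle0, &kern0, &user0);

    std::FILE* out = std::fopen("cpu_usage.csv", "w");   // arbitrary file name
    std::fprintf(out, "second,cpu_percent\n");

    for (int s = 1; s <= 120; ++s) {                      // log for two minutes
        Sleep(1000);
        FILETIME idle1, kern1, user1;
        GetSystemTimes(&idle1, &kern1, &user1);

        // Kernel time includes idle time, so busy = (kernel + user) - idle.
        const unsigned long long idle  = toU64(idle1) - toU64(idle0);
        const unsigned long long total = (toU64(kern1) - toU64(kern0)) +
                                         (toU64(user1) - toU64(user0));
        const double busyPct = total ? 100.0 * (total - idle) / total : 0.0;

        std::fprintf(out, "%d,%.1f\n", s, busyPct);
        idle0 = idle1; kern0 = kern1; user0 = user1;
    }
    std::fclose(out);
    return 0;
}

If you'd rather not compile anything, the built-in typeperf counter "\Processor(_Total)\% Processor Time" should give roughly the same data from a command prompt.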
 
Last edited:
May 11, 2008
20,057
1,290
126
Oh well, it looks like it's going to be one of "those" threads

All this data you posted is meaningless, because it's all old stuff that has been answered by Futuremark and invalidated. Also, quoting Mahigan is very suspicious to me, because that guy definitely has an agenda. If TimeSpy were suspect, then AMD would have denounced it.

Also, AMD did not "design" asynchronous compute, and there is no standardized hardware specification regarding it to my knowledge.

Now I actually do have TimeSpy in my Steam Library, so I'll run a benchmark with it on and then with it off and report back.

That is a load of nonsense.
Mahigan did a very good job explaining how it all works under the hood.
And his posts were very informative and verifiable.
I kind of miss posts like that.

Besides, asynchronous compute on the PC came about with Mantle, which was co-designed by AMD.
Before that, when the Xbox One and PlayStation 4 were still in the design stage, both Sony and Microsoft talked about asynchronous compute on their then next-gen machines.
And the interesting part is that AMD's GCN architecture had that asynchronous compute capability before the consoles, yes, since the first version of GCN.
That was around 2011-2012.
 

bononos

Diamond Member
Aug 21, 2011
3,894
162
106
Oh well, it looks like it's going to be one of "those" threads

All this data you posted is meaningless, because it's all old stuff that has been answered by Futuremark and invalidated. Also, quoting Mahigan is very suspicious to me, because that guy definitely has an agenda. If TimeSpy were suspect, then AMD would have denounced it.

Also, AMD did not "design" asynchronous compute, and there is no standardized hardware specification regarding it to my knowledge.

Now I actually do have TimeSpy in my Steam Library, so I'll run a benchmark with it on and then with it off and report back.

I don't recall the issue being answered satisfactorily by Futuremark, let alone invalidated. In fact, 3DMark tarnished itself by using the lowest common denominator so as not to show up Pascal's shortcomings in async compute, after their tessellation nonsense.
 
Reactions: DarthKyrie