Ashes of the Singularity User Benchmarks Thread

Mahigan · Aug 22, 2015

sontin said:
So, Oxide hasn't optimized the engine for nVidia hardware and now it is nVidia's job to fix this mess?

Yeah, not a great explanation...

Umm, another false statement...

Often we get asked about fairness, that is, usually in regards to treating Nvidia and AMD equally. Are we working closer with one vendor than another? The answer is that we have an open access policy. Our goal is to make our game run as fast as possible on everyone's machine, regardless of what hardware our players have.

To this end, we have made our source code available to Microsoft, Nvidia, AMD and Intel for over a year. We have received a huge amount of feedback. For example, when Nvidia noticed that a specific shader was taking a particularly long time on their hardware, they offered an optimized shader that made things faster which we integrated into our code.

We only have two requirements for implementing vendor optimizations: We require that it not be a loss for other hardware implementations, and we require that it doesn't move the engine architecture backward (that is, we are not jeopardizing the future for the present).

http://www.oxidegames.com/2015/08/16/the-birth-of-a-new-api/

sontin · Aug 22, 2015

Yes, nVidia noticed it. Not Oxide. nVidia noticed a broken MSAA implementation on their hardware while Oxide sent out copies to reviewers without even testing it on nVidia's hardware.

In this state the game isn't optimized for nVidia or Intel GPUs.

Mahigan · Aug 22, 2015

sontin said:
No, it is not. It is optimized for AMD hardware.
Otherwise nVidia hardware would be faster with DX12 like they are in King of Wushu or Fable Legends.

Umm no. I know... you're one of those fanbois who doesn't know the first thing about how GPUs function. The MSAA also wasn't broken. May I suggest you read the Extreme Tech article on the topic.

As for what's going on... here's a course...

It's not about optimizations. It's about Hardware limitations. These limitations affect both nVIDIA and AMD GPUs under Ashes of the Singularity but in different ways.

Take Parallelism for example...

The difference between Maxwell and Maxwell 2 is that Maxwell's Grid Management Unit can only send either a Graphics task or 32 Compute tasks to the work Distributor. It cannot send both in Parallel.

Maxwell 2 changes this. Therefore now, with Maxwell2, the communication between the Grid Management Unit and Work Distributor works in Parallel.

The problem is that this doesn't change the fact that Maxwell 2 still only contains a single Grid Management Unit. This still remains as a bottleneck.

nVIDIA's Parallelism, under Maxwell 2, is thus limited to 1 Graphics and 31 Compute tasks. AMD's Parallelism, under GCN 1.1 (290 series) and GCN 1.2, is limited to 1 Graphics and 64 Compute tasks.

Another difference is that AMD's GCN 1.1 (290 series)/GCN 1.2 have 8 independent Asynchronous Compute Engines, each able to schedule and prioritize work independently of one another. With Maxwell 2, it's a single Grid Management Unit. You can see why GCN 1.1 (290 series)/GCN 1.2 can best take advantage of the available compute resources.

Take a look at all of those light sources floating around in Ashes of the Singularity. Each unit emits its own light sources in Parallel to other units. Each one of those light sources is a Compute task.

Therefore if there are more than 31 Compute tasks (assuming there is a Graphics task, which there ought to be because of the amount of Rasterization going on), it takes two cycles for Maxwell 2 to assign the tasks to the Work Distributor. This looks to be the culprit (explaining why Maxwell 2 tends to match, but not beat, AMD's GCN 1.1/1.2 architecture).
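To put numbers on that, here is a minimal sketch of the claim as stated in this post. It assumes the slot counts quoted above (31 compute tasks alongside 1 graphics task on Maxwell 2, 64 on GCN 1.1/1.2) and models dispatch as a simple ceiling division. The function name `dispatch_passes` is made up for illustration, and real hardware scheduling is certainly more involved than this.

```python
import math

# Slot counts as claimed in the post, not measured values.
MAXWELL2_COMPUTE_SLOTS = 31  # alongside 1 graphics task
GCN_COMPUTE_SLOTS = 64       # alongside 1 graphics task


def dispatch_passes(compute_tasks: int, compute_slots: int) -> int:
    """Passes needed to hand every compute task to the distributor,
    assuming each pass carries at most `compute_slots` tasks."""
    if compute_tasks <= 0:
        return 0
    return math.ceil(compute_tasks / compute_slots)


if __name__ == "__main__":
    for tasks in (20, 31, 32, 64, 100):
        print(
            tasks,
            dispatch_passes(tasks, MAXWELL2_COMPUTE_SLOTS),
            dispatch_passes(tasks, GCN_COMPUTE_SLOTS),
        )
```

Under this toy model, the 32nd simultaneous compute task is what pushes the 31-slot case into a second pass, while the 64-slot case still fits in one.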

I'm quite certain that Pascal will incorporate more than a single Grid Management Unit for this very reason.

Since Ars Technica showed that a 290x can nearly match a GTX 980 Ti and a GTX 980 Ti is a near match to a Fury-X then we can conclude that the 290x and the Fury-X are a near match under Ashes of the Singularity. This points to a common bottleneck between both Hawaii and Fiji architectures.

So we have to look at the nature of Ashes of the Singularity. Ashes of the Singularity does two things in a big way.

1. Makes ample use of Asynchronous Shading.
2. Draws MANY units onto the screen (requiring many Triangles or Polygons).

Since both Fury-X and the 290x share the same Asynchronous Compute Engines, but with Fury-X having more compute resources at its disposal, we can conclude that if Asynchronous Shading and Compute resources were the bottleneck for Fiji and Hawaii... we'd see Fiji faring better than Hawaii. This is not the case.

Since both Fiji and Hawaii retain the same amount of Hardware Rasterizers (and the same Peak Rasterization rate expressed in Gtris/s) we can conclude that both are bottlenecked by their Peak Rasterization rate (ability to draw triangles/polygons).

Since the GTX 980 Ti has a much higher peak rasterization rate, we would expect the GTX 980 Ti to overpower the Fiji and Hawaii cards. This is not the case. Therefore we can conclude that the GTX 980 Ti is being limited by its Asynchronous Compute capabilities.

Fiji and Hawaii are bottlenecked by their Peak Rasterization rates under Ashes of the Singularity while Maxwell 2 is bottlenecked by its ability to handle Asynchronous Shading.

Warning issued for personal attack.
-- stahlhart

AtenRa · Aug 22, 2015

sontin said:
No, it is not. It is optimized for AMD hardware.
Otherwise nVidia hardware would be faster with DX12 like they are in King of Wushu or Fable Legends.

Care to post the links for those benchmarks ??

sontin · Aug 22, 2015

Mahigan said:
It's not about optimizations. It's about Hardware limitations. These limitations affect both nVIDIA and AMD GPUs under Ashes of the Singularity but in different ways.

So, nVidia has less hardware limitations with DX11 than with DX12? :hmm:

The difference between Maxwell and Maxwell 2 is that Maxwell's Grid Management Unit can only send either a Graphics task or 32 Compute tasks to the work Distributor. It cannot send both in Parallel.

Explains why Maxwell 2 loses as much performance as Kepler and Maxwell 1 with DX12. :|

Another difference is that AMD's GCN 1.1 (290 series)/GCN 1.2 have 8 independent Asynchronous Compute Engines, each able to schedule and prioritize work independently of one another. With Maxwell 2, it's a single Grid Management Unit. You can see why GCN 1.1 (290 series)/GCN 1.2 can best take advantage of the available compute resources.

The Grid Management Unit has 1000s of pending grids waiting to submit to the Work Distributor Unit while the WDU is distributing the grids to the available units...
I guess you didn't really understand Hyper-Q.

Therefore if there are more than 31 Compute tasks (assuming there is a Graphics task, which there ought to be because of the amount of Rasterization going on), it takes two cycles for Maxwell 2 to assign the tasks to the Work Distributor. This looks to be the culprit (explaining why Maxwell 2 tends to match, but not beat, AMD's GCN 1.1/1.2 architecture).

Nope. These are draw calls and have nothing to do with asynchronous compute. Star Swarm is using the same approach and it doesn't show negative scaling.

Since Ars Technica showed that a 290x can nearly match a GTX 980 Ti and a GTX 980 Ti is a near match to a Fury-X then we can conclude that the 290x and the Fury-X are a near match under Ashes of the Singularity. This points to a common bottleneck between both Hawaii and Fiji architectures.

What? It only matches a GTX980TI because DX12 is slower than DX11. The GTX980TI is faster with DX11 than the 290X with DX12.

Fiji and Hawaii are bottlenecked by their Peak Rasterization rates under Ashes of the Singularity while Maxwell 2 is bottlenecked by its ability to handle Asynchronous Shading.

DX11 doesn't support multi-engine. This is a new approach in DX12. DX11 can never be faster when they are using multi-engine. The GPU utilization will always be better than it is under DX11. They can do so much more at the same time.
In DX12 it is possible to use as many engines as the hardware supports. Hardware with 32 supported compute queues will not be slower than DX11 with only one queue. A GTX770 isn't slower than a GTX960 in the benchmark, so the performance impact of DX12 over DX11 is not a result of using multi-engine.

AtenRa said:
Care to post the links for those benchmarks ??

No problem:
https://www.youtube.com/watch?v=AB5iuX8UDHk
https://www.youtube.com/watch?v=Z_XLX7qYmGY

boozzer · Aug 22, 2015

Silverforce11 said:
Warner Bros refused AMD optimized code.

Ubisoft REMOVED dx10.1 implementation in AC.

Project Cars Devs think that sharing 20 Steam codes with AMD is doing their part.

GameWork DEVs are fully bought out, very unethical bunch, unlike Oxide.

and there are posters defending this, shudders.

Digidi · Aug 22, 2015

If the Fury X were bottlenecked by its rasterizers, then the Fury X would not do so well in the 3DMark overhead test, because it's a pure polygon output test. So what do we see in the overhead test? The Fury X beats the Titan X. So the Fury X is not rasterizer bottlenecked.

Nvidia has problems with the command processor and with feeding its 6 rasterizers. AMD has the problem of feeding its shaders usefully.

Shivansps · Aug 22, 2015

And still in this "king of benchmarks", DX11 is horrible on AMD without Gameworks as a scapegoat.

Carfax83 · Aug 22, 2015

Silverforce11 said:
Here you are, hindsight. What about 3-4 years ago when Maxwell was designed & planned? Would NV know that a Mantle-like API would form the basis of next-gen DX12?

See, this is where your timeline is messed up. DX12 was also in planning and development for about 4 years, and obviously, Maxwell has great DX12 compatibility. The first DX12 demo if you recall ran on NVidia hardware..

There's a lot of collaboration between Microsoft and the IHVs when it comes to things like this. Likely, there are already people having conferences about the next incremental version of Direct3D, as well as the next major version.

I think Mantle is basically DX12 without the cross architectural compatibility. Because it doesn't have the cross architectural compatibility, AMD were able to get it out the door faster than Microsoft did with DX12..

Red Hawk · Aug 22, 2015

I also have to extend a welcome to Mahigan! I don't have much to add, but I find your posts very insightful.

Shivansps · Aug 22, 2015

Mahigan said:
Umm, another false statement...

http://www.oxidegames.com/2015/08/16/the-birth-of-a-new-api/

A whole shader? That's it? I hope you have way more than that to back you up.

Carfax83 · Aug 22, 2015

Mahigan said:
It's a single unit. A single block. A single block is not as Dynamic or Parallel as 8 separate units. This single unit is also limited in the amount of queues it can prioritize when compared to the competition. Maxwell2 is thus not nearly as Parallel, as a matter of fact, as GCN 1.1 (290 series)/GCN 1.2. There is no disputing this. It is a hardware fact.

This is purely semantics. It's like arguing which is more parallel, a quad core CPU or a dual core CPU with hyperthreading..

The GMU is a single unit, but since it doesn't execute anything, it doesn't really matter. The execution is done by the SMM units, and those are all expressly parallel.

It is only a big deal if you have games which make use of the amount of Parallelism that Ashes of the Singularity uses. nVIDIA's response, to the Ashes benchmark, points to the fact that nVIDIA doesn't believe that Ashes of the Singularity is an overall good example of future DX12 titles. This is likely the same logic on which they based their decisions when building Maxwell2.

If NVidia were as crippled with asynchronous compute as you claim, I believe that the performance deficit would be much larger. But that's not what we see when we look at the benchmarks.

AMD has a lead yes, but it's not huge or anything. In the PCper review, the 390x has at most a 10.6% lead over the GTX 980:

Fury-X only has a big bandwidth advantage on paper. In practice, however, nVIDIA's compression algorithms even up the score. See here: http://techreport.com/review/28513/amd-radeon-r9-fury-x-graphics-card-reviewed/4

The compression algorithm doesn't take MSAA into account. MSAA in and of itself has a massive impact on bandwidth..

If you looked at the benchmark you posted, you'll see that the medium preset was run with MSAA and the high without. In the medium preset, the Fury X is 12% faster than the 980 Ti in DX12. On the high preset with MSAA turned off, the Fury X has a 7% lead.

If you factor in that the 980 Ti is somehow faster in DX11 than in DX12, then the Fury X's lead is even smaller.

Since Ars Technica showed that a 290x can nearly match a GTX 980 Ti, and after looking at that memory bandwidth graph, we can tell that memory bandwidth is not what grants the Fury-X its lead.

I'm skeptical of the Ars Technica benchmarks, as they are the only ones showing such to my knowledge.

Since both Fiji and Hawaii retain the same amount of Hardware Rasterizers (and the same Peak Rasterization rate expressed in Gtris/s) we can conclude that both are bottlenecked by their Peak Rasterization rate (ability to draw triangles/polygons)

It's extremely unlikely that Fiji and Hawaii are bottlenecked by rasterization. I don't think I've ever seen any modern GPU being bottlenecked by its rasterizers in an actual game, only in synthetic benchmarks..

RussianSensation · Aug 22, 2015

sontin said:
No, it is not. It is optimized for AMD hardware.
Otherwise nVidia hardware would be faster with DX12 like they are in King of Wushu or Fable Legends.

You mean unless NV throws marketing $ at developers via GameWorks, the game isn't well-optimized for NV when a developer is brand agnostic and gives full opportunities for both AMD/NV to optimize their drivers by exposing the entire source code for their game?

What's next you are going to tell us a DX11 game isn't well-optimized for NV and is biased towards AMD when a brand agnostic developer doesn't take GW bribe/marketing $? Interestingly enough in these brand agnostic non-GW infested/gimped DX11/DX12 games, Kepler actually performs great - surprise, surprise, but, it's not NV-optimized right because NV loses anyway, right??!

Warning issued for member callout.
-- stahlhart

Mahigan · Aug 22, 2015

sontin said:
So, nVidia has less hardware limitations with DX11 than with DX12? :hmm:

Explains why Maxwell 2 loses as much performance as Kepler and Maxwell 1 with DX12. :|

DX11 is Serial, DX12 is Parallel. nVIDIA's hardware was built to function better under Serial conditions (Kepler, Maxwell 1). Maxwell 2 is a bit of a mixed bag. A Middle of the road between Serial and Parallel. Take a look at the GTX 980 Ti. It gains from using DX12 over DX11.

The Grid Management Unit has 1000s of pending grids waiting to submit to the Work Distributor Unit while the WDU is distributing the grids to the available units...
I guess you didn't really understand Hyper-Q.

Grids are not Queues. The Grid Management Unit can hold 1 Graphics and 31 Compute Queues. An x amount of grids equals a Queue.

Nope. These are draw calls and have nothing to do with asynchronous compute. Star Swarm is using the same approach and it doesn't show negative scaling.

Star Swarm does not use Asynchronous Shading. Star Swarm makes use of 100,000 Draw Calls (Ashes of the Singularity is also heavy on Draw Calls). Neither Star Swarm nor Ashes of the Singularity comes close to reaching the peak limitations of either GCN or Kepler/Maxwell as it pertains to Draw Calls. Star Swarm does make heavy use of the Rasterizing capabilities of a GPU to draw all of the Triangles/Polygons which make up the units displayed on the screen. You really should read the last 5 pages or so of this thread, as I've explained all of this already.

What? It only matches a GTX980TI because DX12 is slower than DX11. The GTX980TI is faster with DX11 than the 290X with DX12.

DX12 is not slower than DX11. The GTX 980 Ti is slower under DX12 than DX11 under certain conditions.

DX11 doesn't support multi-engine. This is a new approach in DX12. DX11 can never be faster when they are using multi-engine. The GPU utilization will always be better than it is under DX11. They can do so much more at the same time.
In DX12 it is possible to use as many engines as the hardware supports. Hardware with 32 supported compute queues will not be slower than DX11 with only one queue. A GTX770 isn't slower than a GTX960 in the benchmark, so the performance impact of DX12 over DX11 is not a result of using multi-engine.

DX11 can be faster. There are some cases where DX11 is faster than DX12. This can happen under lower CPU load conditions. It can also happen if the DX11 driver is providing shortcuts rather than rendering the Game Engine's desired work, such as when a DX11 driver replaces shader commands. This is to be expected because DX11 provides more opportunities for driver intervention. Such driver interventions come at the cost of using more CPU resources. Therefore if the game is making little use of the CPU (GPU bottleneck) then a driver intervention can be made with little to no impact on performance. This is what nVIDIA have been doing for the past x number of years. Think about it, nVIDIA had 2-3 cores, of a Quad Core Processor, getting little to no use. They used these un-used resources in order to perform Shader swaps etc. Since DX12 uses the CPU to a greater degree, even if you could intervene by way of the driver, it would not be recommended.

As for your comparing of Apples and Oranges with the GTX 770 and GTX 960.
The GTX 960 is a slower card on paper. It has half the memory bandwidth and almost 1 Tflops less of compute power. The fact that the GTX 960 can pretty much run neck and neck with a GTX 770 is proof of what I'm saying. With 1 Tflops less of Compute power, half the Texture fill rate and half the memory bandwidth (128 bit bus) the GTX 960 should be far below the GTX 770, but it isn't.

GTX 770 GPU Engine Specs:
1536 CUDA Cores
1046 Base Clock (MHz)
1085 Boost Clock (MHz)
134 Texture Fill Rate (billion/sec)

GTX 770 Memory Specs:
7.0 Gbps Memory Speed
2048 MB Standard Memory Config
GDDR5 Memory Interface
256-bit Memory Interface Width
224.3 Memory Bandwidth (GB/sec)

Compute Power: 3.2 Teraflops

GTX 960 Engine Specs:
1024 CUDA Cores
1127 Base Clock (MHz)
1178 Boost Clock (MHz)
72 Texture Fill Rate (GigaTexels/sec)

GTX 960 Memory Specs:
7.0 Gbps Memory Clock
2 GB Standard Memory Config
GDDR5 Memory Interface
128-bit Memory Interface Width
112 Memory Bandwidth (GB/sec)

Compute Power: 2.3 Teraflops
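
For anyone who wants to check the on-paper figures in the two spec lists, here is a rough sketch. It assumes peak memory bandwidth is bus width times per-pin data rate divided by 8, and peak FP32 throughput is 2 FLOPs (one fused multiply-add) per CUDA core per clock at the listed base clock. The function names are made up for illustration; only the input numbers come from the lists above.

```python
def memory_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s: bits per transfer * Gbps per pin / 8."""
    return bus_width_bits * data_rate_gbps / 8


def peak_fp32_tflops(cuda_cores: int, clock_mhz: float) -> float:
    """Peak FP32 throughput in TFLOPS, assuming 2 FLOPs per core per clock."""
    return 2 * cuda_cores * clock_mhz * 1e6 / 1e12


# Inputs taken from the spec lists in this post (base clocks).
CARDS = {
    "GTX 770": {"cores": 1536, "clock_mhz": 1046, "bus_bits": 256, "gbps": 7.0},
    "GTX 960": {"cores": 1024, "clock_mhz": 1127, "bus_bits": 128, "gbps": 7.0},
}

if __name__ == "__main__":
    for name, c in CARDS.items():
        bandwidth = memory_bandwidth_gbs(c["bus_bits"], c["gbps"])
        tflops = peak_fp32_tflops(c["cores"], c["clock_mhz"])
        print(f"{name}: {bandwidth:.1f} GB/s, {tflops:.2f} TFLOPS")
```

With these assumptions the results land close to the listed 3.2 and 2.3 Teraflops, and the bandwidth works out to half on the 960. The 770's listed 224.3 GB/s is slightly above what a flat 7.0 Gbps gives, presumably because the listed memory speed is rounded.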

Mahigan · Aug 22, 2015

Carfax83 said:
This is purely semantics. It's like arguing which is more parallel, a quad core CPU or a dual core CPU with hyperthreading..

The GMU is a single unit, but since it doesn't execute anything, it doesn't really matter. The execution is done by the SMM units, and those are all expressly parallel.

It does matter. You have a Multi-Core (Parallel) CPU feeding 8 Asynchronous Compute Engines independently vs a Multi-Core (Parallel) CPU feeding a single unit. That's on one end.

On the other you have 8 Asynchronous Compute Engines prioritizing work and sending the work off, independently, to the available compute resources (Out of Order) vs. a single unit feeding a work distributor (extra level of latency) which then assigns the work to the available compute resources (In Order).

That's a pretty big difference.

If NVidia were as crippled with asynchronous compute as you claim, I believe that the performance deficit would be much larger. But that's not what we see when we look at the benchmarks.

nVIDIA also have the capacity to better draw all of the units on the screen (see Star Swarm Benchmark results of GTX 980 vs 290x). Better Peak Rasterization Rate. Therefore nVIDIA would, technically, have a lead. A large lead. Until you throw in Asynchronous Shading into the mix.

AMD has a lead yes, but it's not huge or anything. In the PCper review, the 390x has at most a 10.6% lead over the GTX 980:

It doesn't need to be a huge lead and I don't expect AMD to have a huge lead. I'm not rooting for AMD. I'm objective. AMD lacks the Rasterization rate needed to draw all the triangles which make up the swarm of units on the screen.

The compression algorithm doesn't take MSAA into account. MSAA in and of itself has a massive impact on bandwidth..

If you looked at the benchmark you posted, you'll see that the medium preset was run with MSAA and the high without. In the medium preset, the Fury X is 12% faster than the 980 Ti in DX12. On the high preset with MSAA turned off, the Fury X has a 7% lead.

If you factor in that the 980 Ti is somehow faster in DX11 than in DX12, then the Fury X's lead is even smaller.

More units are drawn in the heavy preset vs the medium preset (peak rasterization rate coming into play). You would have to bench the medium preset with and without MSAA to draw any conclusions regarding memory bandwidth but I'm not sure what you're trying to say here regardless. You won't draw any logical conclusions from comparing apples to oranges.

I'm skeptical of the Ars Technica benchmarks, as they are the only ones showing such to my knowledge.

Nobody else benched a 290x. PCPer benched a 390x but did not bench a GTX 980 Ti. Therefore we cannot derive any conclusions from that.

It's extremely unlikely that Fiji and Hawaii are bottlenecked by rasterization. I don't think I've ever seen any modern GPU being bottlenecked by its rasterizers in an actual game, only in synthetic benchmarks..

Have you even seen a game like Ashes of the Singularity? Nope... no game has that many drawn simultaneous units on the screen.

Shivansps · Aug 22, 2015

Again, what's your source on Nvidia working with them? Because writing a shader means nothing.

Digidi · Aug 22, 2015

But if AMD's rasterizer were bad, then why is AMD leading the overhead test in 3DMark?
The 3DMark overhead test is also a pure rasterizer test with a huge polygon output!!!

Mahigan · Aug 22, 2015

Digidi said:
But if AMD's rasterizer were bad, then why is AMD leading the overhead test in 3DMark?
The 3DMark overhead test is also a pure rasterizer test with a huge polygon output!!!

The 3D Mark Overhead test only checks for the amount of Draw Calls a card can deal with. It does not actually require the card to do all of the work. Each geometry is a unique, procedurally-generated, indexed mesh containing 112-127 triangles. The geometries are drawn with a simple shader, without post processing.

To do this, Futuremark has written a relatively simple test that draws out a very simple scene with an ever-increasing number of objects in order to measure how many draw calls a system can handle before it becomes saturated. As expected for a synthetic test, the underlying rendering task is very simple – render an immense amount of building-like objects at both the top and bottom of the screen – and the bottleneck is in processing the draw calls. Generally speaking, under this test you should either be limited by the number of draw calls you can generate (CPU limited) or limited by the number of draw calls you can consume (GPU’s command processor limited), and not the GPU’s actual rendering capabilities. The end result is that the API Overhead Feature Test can push an even larger number of draw calls than Star Swarm could.

- Anandtech
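
The "CPU limited or command processor limited" point in that quote boils down to taking the smaller of two rates. A toy sketch, assuming the sustained draw-call rate is simply the minimum of what the CPU can submit and what the GPU front end can consume; the function name and the example rates are hypothetical, not measurements from any card.

```python
from typing import Tuple


def sustained_draw_calls(cpu_submit_rate: float,
                         gpu_consume_rate: float) -> Tuple[float, str]:
    """Return the sustained draw calls per second and which side limits it.

    Toy model: whichever side is slower sets the overall rate, and the
    GPU's actual rendering work is assumed to be negligible.
    """
    if cpu_submit_rate <= gpu_consume_rate:
        return cpu_submit_rate, "CPU limited"
    return gpu_consume_rate, "GPU command processor limited"


if __name__ == "__main__":
    # Hypothetical rates in draw calls per second.
    print(sustained_draw_calls(2.0e6, 15.0e6))
    print(sustained_draw_calls(18.0e6, 15.0e6))
```

Nothing in this model involves triangle throughput, which is the argument being made: a high score in that test says little about rasterization.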

Silverforce11 · Aug 22, 2015

sontin said:
No, it is not. It is optimized for AMD hardware.
Otherwise nVidia hardware would be faster with DX12 like they are in King of Wushu or Fable Legends.

Link or you making stuff up again like your other statements?

ps. Youtube demo showcasing DX12 isn't a BENCHMARK.

As for FABLE, this is what their lead developer had to say regarding DX12 and GCN, listen closely:
https://youtu.be/7MEgJLvoP2U?t=20m59s

Carfax83 · Aug 22, 2015

Mahigan said:
It does matter. You have a Multi-Core (Parallel) CPU feeding 8 Asynchronous Compute Engines independently vs a Multi-Core (Parallel) CPU feeding a single unit. That's on one end.

On the other you have 8 Asynchronous Compute Engines prioritizing work and sending the work off, independently, to the available compute resources (Out of Order) vs. a single unit feeding a work distributor (extra level of latency) which then assigns the work to the available compute resources (In Order).

That's a pretty big difference.

Let's look at this differently. The entire purpose of the GMU from what I've read, is to attempt to utilize the GPU resources as much as possible in an efficient manner.

So for your assertion to be true, you would have to believe that NVidia's engineers are so incompetent that they would stick with a single GMU through two architectures despite your notion that a single GMU obviously isn't enough to keep the GPU busy..

But when you look at the real world evidence, it becomes clear that NVidia's Maxwell architecture is much more efficient than AMD's GCN, as it's able to do more with less basically.

In fact, it's AMD's GCN architecture which has difficulty utilizing the shaders effectively, not Maxwell.. So in light of this, I still don't think your argument has any merit.

It doesn't need to be a huge lead and I don't expect AMD to have a huge lead. I'm not rooting for AMD. I'm objective. AMD lacks the Rasterization rate needed to draw all the triangles which make up the swarm of units on the screen.

What do you think about Digidi's comment that AMD leads in the 3D Mark overhead test?

More units are drawn in the heavy preset vs the medium preset (peak rasterization rate coming into play). You would have to bench the medium preset with and without MSAA to draw any conclusions regarding memory bandwidth but I'm not sure what you're trying to say here regardless. You won't draw any logical conclusions from comparing apples to oranges.

To me it's logical deduction. The medium preset had less objects to draw but used MSAA, whilst the high preset obviously had more objects to draw and didn't use MSAA; likely due to the massive performance hit that it would incur.

You don't need to be a 3D programmer to know that MSAA plus high resolution eats up a lot of bandwidth..

Nobody else benched a 290x. PCPer benched a 390x but did not bench a GTX 980 Ti. Therefore we cannot derive any conclusions from that.

PCper and Computerbase.de used a 390x and a 390 respectively, both of which are either faster or equal to the 290x. In the PCPer review, the 390x was competing against a GTX 980 and whilst the 980 was a bit slower, it was far from being trounced.

In the computerbase.de review, the GTX 980 Ti faced a Fury X, which is much faster than a 290x, yet the performance was very similar between the two..

Have you even seen a game like Ashes of the Singularity? Nope... no game has that many drawn simultaneous units on the screen.

I don't think Ashes of the Singularity is unique when it comes to the amount of objects drawn.. What it is unique in doing, is how they are drawn. The objects or units are all likely unique in some manner and require separate draw calls like Star Swarm..

With DX11, developers used tricks such as instancing to increase the amount of objects on screen, without having the CPU grind to a halt.. Basically, for a GPU like the Fury X to be maxed out rasterization wise, would require the entire screen to be cluttered with so many objects that you wouldn't even be able to see what was going on..

Mahigan · Aug 22, 2015

Carfax83 said:
Let's look at this differently. The entire purpose of the GMU from what I've read, is to attempt to utilize the GPU resources as much as possible in an efficient manner.

The GMU was designed with OpenCL and CUDA programming in mind. It was not designed with DirectX 12 in mind. It was patched, with Maxwell 2, in order to add a degree of Parallelism which Maxwell and Kepler both lacked.

So for your assertion to be true, you would have to believe that NVidia's engineers are so incompetent that they would stick with a single GMU through two architectures despite your notion that a single GMU obviously isn't enough to keep the GPU busy..

nVIDIA's engineers weren't incompetent. They're rather conservative. They take their time to adopt new features. AMD engineers, on the other hand, take enormous risks. The risk they took with the ACEs, which sat there doing nothing since the 290x was released, is now paying off. nVIDIA is banking on Asynchronous Shading to not be a huge factor until Pascal releases. Once Pascal releases, you'll notice all of the ACE-like units it will have and you can think back to this thread.

But when you look at the real world evidence, it becomes clear that NVidia's Maxwell architecture is much more efficient than AMD's GCN, as it's able to do more with less basically.

In fact, it's AMD's GCN architecture which has difficulty utilizing the shaders effectively, not Maxwell.. So in light of this, I still don't think your argument has any merit.

Under DX11. A Serial API. Your Statements are correct. Under DX12. A Parallel API. Your Statements are erroneous.

What do you think about Digidi's comment that AMD leads in the 3D Mark overhead test?

I answered him. I'll copy paste it here:

The 3D Mark Overhead test only checks for the amount of Draw Calls a card can deal with. It does not actually require the card to do all of the work. Each geometry is a unique, procedurally-generated, indexed mesh containing 112-127 triangles. The geometries are drawn with a simple shader, without post processing.

"To do this, Futuremark has written a relatively simple test that draws out a very simple scene with an ever-increasing number of objects in order to measure how many draw calls a system can handle before it becomes saturated. As expected for a synthetic test, the underlying rendering task is very simple – render an immense amount of building-like objects at both the top and bottom of the screen – and the bottleneck is in processing the draw calls. Generally speaking, under this test you should either be limited by the number of draw calls you can generate (CPU limited) or limited by the number of draw calls you can consume (GPU’s command processor limited), and not the GPU’s actual rendering capabilities. The end result is that the API Overhead Feature Test can push an even larger number of draw calls than Star Swarm could." - Anandtech

PCper and Computerbase.de used a 390x and a 390 respectively, both of which are either faster or equal to the 290x.

Incorrect statements once again. We're talking DX12 and Asynchronous Shading. Asynchronous shading maximizes compute throughput by prioritizing and scheduling streams to available compute resources.

A 290x is slightly slower than a 390x (core clock and memory bandwidth being the only defining characteristics). Core Clock being the characteristic which matters most for Compute. A 390x is a repackaged 290x with some updated clocks and 8GB of GDDR5 vs 4GB of GDDR5 on the 290x.

A 290x is faster than a 390 in compute tasks. A 390 is a repackaged 290.

In the PCPer review, the 390x was competing against a GTX 980 and whilst the 980 was a bit slower, it was far from being trounced.

In the computerbase.de review, the GTX 980 Ti faced a Fury X, which is much faster than a 290x, yet the performance was very similar between the two..

Your first statement is irrelevant.

Your second statement ignores the logical work I made to deduce that Peak Rasterization rate is holding the AMD GCN parts back in Ashes of the Singularity. Because of this we see a 290x performing similarly to a Fury-X. (a little slower that's it). We know that Ashes of the Singularity is GPU bottlenecked because the Benchmark analysis tool lets us know what the CPU frame rate is.

Therefore something that is similar between the 290x and Fury-X is their shared bottleneck. Peak Rasterization rate fits the bill when we look at what Ashes of the Singularity does. It draws a TON of Triangles/Polygons in order to put an enormous array of individual units on the screen. This is something which has never been done before (save for the Star Swarm time demo, but that one lacked Asynchronous Shading).

I don't think Ashes of the Singularity is unique when it comes to the number of objects drawn. What is unique is how they are drawn. The objects or units are all likely unique in some manner and require separate draw calls, like Star Swarm.

They require separate Draw Calls, sure, calls made in order to draw Triangles/Polygons amongst other things. The Draw call rate doesn't even begin to saturate either GCN or Maxwell GPUs.

Take a 290x: it can handle several million draw calls, yet in Star Swarm, with only 100,000 draw calls, it is bottlenecked. Why?

Digidi · Aug 22, 2015

Yes, it's a draw call test. And what is a draw call test? Sending as many polygons as possible to the GPU! These polygons have to be RASTERIZED! So the draw call test is also a rasterizer test!!! If the rasterizer can't handle the huge number of draw calls, you will see really bad results!

A draw call test gives the GPU as many polygons as you can. The GPU has to turn the polygons into pixels, and this is exactly the job of the rasterizer!

So if you have a bad rasterizer, then your draw call test result is bad. That's the nature of the draw call test: feed the rasterizer with draw calls till it gives up!

AMD has no bad results. Nvidia has bad results. But Nvidia has a problem feeding the rasterizer!

There are two bottlenecks. Nvidia can't feed the rasterizer, and AMD is bad at feeding the shaders.

Silverforce11 · Aug 22, 2015

Is it possible that NV performs worse in DX12 because their hardware is gimped for async compute/shaders? Yes/No?

Unless you know for a fact, then yes, it's possible.

Time will tell all.

Digidi · Aug 22, 2015

AMD has a huge number of shaders to make up for its weakness in shader feeding. Nvidia doesn't have many shaders. I think this is Nvidia's problem: a totally unbalanced GPU. Too much rasterizer capacity, which can't be fed by the command processor, and not enough shaders to color the pixels.

Enigmoid · Aug 22, 2015

Mahigan said:
The difference between Maxwell and Maxwell 2 is that Maxwell's Grid Management Unit can only send either a Graphics task or 32 Compute tasks to the work Distributor. It cannot send both in Parallel.

Therefore you're correct in pointing out that with Maxwell 2 the communication between the Grid Management Unit and Work Distributor is now Parallel.

The problem is that this doesn't change the fact that Maxwell 2 still only contains a single Grid Management Unit. This still remains as a bottleneck.

nVIDIA's Parallelism, under Maxwell 2, is thus limited to 1 Graphics and 31 Compute tasks. AMD's Parallelism, under GCN 1.1 (290 series) and GCN 1.2, is limited to 1 Graphics and 64 Compute tasks.

Another difference is that AMD's GCN 1.1 (290 series)/GCN 1.2 have 8 independent Asynchronous Compute Engines, each able to schedule and prioritize work independently of one another. With Maxwell 2, it's a single Grid Management Unit. You can see why GCN 1.1 (290 series)/GCN 1.2 can best take advantage of the available compute resources.

Don't assume that quantity is all that matters. Also, don't assume that AMD's and Nvidia's approaches can be compared by pure numbers, as you are doing.

Carfax83 said:
Let's look at this differently. The entire purpose of the GMU from what I've read, is to attempt to utilize the GPU resources as much as possible in an efficient manner.

So for your assertion to be true, you would have to believe that NVidia's engineers are so incompetent that they would stick with a single GMU through two architectures despite your notion that a single GMU obviously isn't enough to keep the GPU busy.

But when you look at the real world evidence, it becomes clear that NVidia's Maxwell architecture is much more efficient than AMD's GCN, as it's able to do more with less basically.

In fact, it's AMD's GCN architecture which has difficulty utilizing the shaders effectively, not Maxwell. So in light of this, I still don't think your argument has any merit.

Basically this. Fury (and also AMD's other cards) are really front end limited and cannot extract full hardware parallelism. This can easily be seen in the poor scaling exhibited by the Fury X over Hawaii. This is also why GCN does so well with asynchronous compute - the GPU simply cannot keep the whole chip active because of bottlenecks on the front end.

People are looking at this the wrong way. Asynchronous compute shows gains because the GPU cannot be fully utilized with one task in serial. This is bad. Asynchronous compute is a method to mitigate this by running other tasks on the unutilized hardware (i.e., running physics calculations while the geometry units are busy with tessellation).
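A minimal sketch of that argument, assuming a toy model in which the graphics work leaves a fixed fraction of the shader array idle and compute work can fill that idle capacity without slowing the graphics work down. The function names and all numbers are hypothetical, not measurements from any GPU.

```python
def frame_time_serial(graphics_ms, compute_ms):
    """Graphics and compute run back to back on the whole chip."""
    return graphics_ms + compute_ms


def frame_time_async(graphics_ms, compute_ms, idle_fraction):
    """Compute overlaps with graphics on the idle part of the chip.

    compute_ms is the time the compute work would take on the full chip.
    idle_fraction is the assumed share of shader capacity that the
    graphics work leaves unused (0.0 means fully utilized).
    """
    # Compute work absorbed while graphics is running.
    absorbed = min(compute_ms, graphics_ms * idle_fraction)
    # Whatever is left runs afterwards at full chip speed.
    return graphics_ms + (compute_ms - absorbed)


# Hypothetical frame: 10 ms of graphics, 4 ms of compute.
for idle in (0.0, 0.25, 0.5):
    serial = frame_time_serial(10.0, 4.0)
    overlapped = frame_time_async(10.0, 4.0, idle)
    print(f"idle={idle:.2f}: serial {serial} ms, async {overlapped} ms")
```

With no idle capacity the two figures are identical, so in this model a large async gain is a sign of hardware that was underutilized to begin with, which is the claim being made here.
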

It appears for Nvidia that asynchronous compute is not as much of a benefit. This is likely because the front end isn't holding the GPU back as much.

Looking over the documentation I can't find anything that says that AMD's 8 queues per ACE are submitted in parallel. The ACE units operate in parallel and can manage 8 queues but nothing appears to say that those 8 queues are done in parallel.

Furthermore, AMD's recent blog post states the following:

http://developer.amd.com/community/blog/2015/06/05/concurrency-in-modern-3d-graphics-apis/

Copy queues support all kinds of copy operations, including format conversions, multi-sample anti-aliasing (MSAA) resolves, and swizzling
Compute queues are a superset of copy queues, and also support dispatching compute tasks
Graphics queues are a superset of compute queues and also support rendering operations

AMD drivers currently support one queue of each type.
This seems to indicate that, while the hardware is there, more than one queue of each type is not yet enabled by the drivers.

If anyone can find anything on this, it would be appreciated.
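
The superset relationship in the quoted list can be sketched with plain sets. The capability names here are made up for illustration and are not identifiers from any AMD or Direct3D API.

```python
# Each queue type is modeled as the set of operation kinds it accepts.
COPY_QUEUE = frozenset({"copy"})
COMPUTE_QUEUE = COPY_QUEUE | {"dispatch"}   # copy queue + compute dispatches
GRAPHICS_QUEUE = COMPUTE_QUEUE | {"draw"}   # compute queue + rendering


def can_submit(queue_caps, operation):
    """Return True if a queue with these capabilities accepts the operation."""
    return operation in queue_caps


# A compute queue takes copies and dispatches, but not draws.
print(can_submit(COMPUTE_QUEUE, "copy"))      # True
print(can_submit(COMPUTE_QUEUE, "dispatch"))  # True
print(can_submit(COMPUTE_QUEUE, "draw"))      # False
```

This only models which kinds of work each queue type accepts. It says nothing about how many queues of each type a driver exposes or whether they execute in parallel, which is the open question here.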
