Ashes of the Singularity Beta1 DirectX 12 Benchmarks

Feb 19, 2009
10,457
10
76
How can "graphics rendering" stall the pipeline when a GTX980TI is >50% faster in graphics workloads? Do you even think about this?!

And how will "more compute workload" reduce the performance advantage of the GTX980TI over the GTX970 when the execution happens after the graphics rendering? For example, 0.62ms + 0.62ms is still 38% less than 1ms + 1ms...

It was just presented to you above by Mahigan. The chart specifically measures the latency for shaders to complete a compute task.

There is an inherent cost associated with the context switch. Regardless of the actual workload, even a small one, flipping the context of the shaders hurts performance.

So the more compute you shove into the same serial pipeline, the more the shaders stall due to having to context switch.
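To put rough numbers on that, here's a toy model (the timings are made-up assumptions, not measurements): every compute chunk injected into the serial pipeline forces a switch into the compute context and back, and that fixed switch cost piles up.

```python
# Toy model of a serial graphics + compute pipeline with a fixed
# context-switch penalty. All timings are illustrative assumptions.

def frame_time_ms(graphics_ms, compute_chunks, chunk_ms, switch_ms):
    """Total frame time when each compute chunk forces a switch into
    the compute context and back into graphics."""
    switches = 2 * compute_chunks
    return graphics_ms + compute_chunks * chunk_ms + switches * switch_ms

for chunks in (1, 4, 16):
    t = frame_time_ms(graphics_ms=10.0, compute_chunks=chunks,
                      chunk_ms=0.5, switch_ms=0.2)
    print(f"{chunks:2d} compute chunks -> {t:4.1f} ms/frame")
# 1 chunk -> 10.9 ms, 16 chunks -> 24.4 ms: the more compute shoved
# into the serial pipeline, the more time the shaders spend stalled.
```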

As for your belief that X% more paper specs should equal X% more performance, that has rarely ever been true. Very few games scale perfectly. There's always a bottleneck somewhere, particularly in DX11. DX12 aims to improve this with the multi-engine approach, but when your hardware only has one engine, it can't take advantage of it.

Just have a look:

[benchmark chart]
Why is the 980Ti/Titan X not 50% above the 980??
 

dogen1

Senior member
Oct 14, 2014
739
40
91
Those settings are not "heavily GPU-Bound":

With DX11 the GTX980TI is 10% faster, with DX12 only 27% faster than the GTX970. This card should be ~47-50% better.

Spec-wise it's only ~20-35% higher. Perhaps this game is more compute-bound than most? The 980 Ti is only 22% more powerful than a 980 in that area.
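For what it's worth, here's that paper math from the published CUDA core counts and reference boost clocks (a rough sketch; it ignores memory bandwidth, ROPs, and everything else):

```python
# Paper-spec comparison: peak flops scale as 2 ops (FMA) x cores x clock.
# Core counts and reference boost clocks (MHz) are the published figures.
cards = {
    "GTX 970":    (1664, 1178),
    "GTX 980":    (2048, 1216),
    "GTX 980 Ti": (2816, 1075),
}

def tflops(cores, mhz):
    return 2 * cores * mhz * 1e6 / 1e12

ti, g980, g970 = (tflops(*cards[c]) for c in ("GTX 980 Ti", "GTX 980", "GTX 970"))
print(f"980 Ti vs 980: +{(ti / g980 - 1) * 100:.0f}%")  # ~ +22%
print(f"980 Ti vs 970: +{(ti / g970 - 1) * 100:.0f}%")  # ~ +54%
```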
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
GPUs need thousands of threads to hide latency. Using fewer threads to fill the GPU is just an unoptimized workload.



Seriously, what have you just written? More work will result in a smaller advantage? That doesn't make any sense. A 60% advantage will result in 60% more performance. It is the nature of a compute architecture.
Yes, and Maxwell doesn't support asynchronous compute + graphics, so compute jobs go in one batch and graphics jobs in another. They're not executed concurrently, so Maxwell doesn't get the benefit of higher GPU utilization.

As for TFLOPS, that's a theoretical figure which ignores cache and register issues. The calculation for the theoretical TFLOPS of a GTX 980 Ti is 2 ops (FMA) × 2816 cores × clock; at the 1000 MHz reference base clock that works out to about 5.6 TFLOPS.
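Plugged in (a quick sketch using the reference clocks):

```python
# Theoretical peak for a GTX 980 Ti: 2 ops (FMA) x 2816 cores x clock.
cores = 2816
for label, mhz in (("base", 1000), ("boost", 1075)):
    tflops = 2 * cores * mhz * 1e6 / 1e12
    print(f"{label:5s}: {tflops:.2f} TFLOPS")  # 5.63 base, 6.05 boost
```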

That's the theoretical peak. The problem with Maxwell is that the CUDA cores within an SMM share resources; CUDA cores share the same L1 cache.


An SMM comprises 128 CUDA cores split into 4 quadrants, i.e. 32 CUDA cores per quadrant. Two quadrants (64 CUDA cores) share one pool of L1 texture/L1 cache. This cache can be used either as texture cache, for graphics operations, or as L1 cache for compute operations. While an SMM is filled with compute jobs, it cannot process graphics jobs, and vice versa. There's also a 64KB pool of shared memory used by the whole SMM (including the PolyMorph Engine).


Contrast that with a GCN CU. Its texture cache is separate from the LDS (Local Data Share). The CU only handles texturing and compute jobs, and each has its own dedicated resources.

Now take a step back to the shader-engine view: every 4 CUs share a cache, but no elements from the rendering pipeline share any resources with GCN's compute units, at this stage just as at the last.

Not only does every element in GCN have its own cache, but there are also shared caching pools: the GDS (Global Data Share) and the L2 cache.

So GCN is more apt to hit its theoretical flops than Maxwell, provided it is fed in parallel. Adding CUs does not hinder GCN's per-SIMD performance the way adding SMMs does on Maxwell. GCN3 is bottlenecked elsewhere (the front end), but not compute-wise.
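To recap the sharing story in one place (just a schematic of what I described above, not an authoritative hardware model):

```python
# Schematic of the resource sharing described above.
maxwell_smm = {
    "cuda_cores": 128,  # 4 quadrants of 32 cores each
    "l1_texture_cache": "one pool per 2 quadrants (64 cores); "
                        "texture cache OR compute L1, not both at once",
    "shared_memory": "64KB pool for the whole SMM (incl. PolyMorph Engine)",
}
gcn_cu = {
    "simd_lanes": 64,  # 4 x SIMD-16
    "texture_cache": "dedicated per CU, separate from the LDS",
    "lds": "64KB Local Data Share, dedicated per CU",
    "shared_pools": "GDS and L2 cache, shared chip-wide",
}
# The point: in an SMM, graphics and compute contend for the same local
# cache; in a CU, texturing and compute each have their own resources.
print(maxwell_smm)
print(gcn_cu)
```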

So if you remove SMMs from Maxwell, you have fewer units accessing the shared L2 cache.

So you gain reduced latency for every SIMD. This shows up as an increase in compute efficiency.
 
Last edited:

Mahigan

Senior member
Aug 22, 2015
573
0
0
Are you really asking this? Maybe because the reference GTX980 clocks (much) higher than the GM200 cards...

And you still haven't understood DX12. You don't mix draw and dispatch calls in one queue. This is wrong. If a developer is doing this, then he has no clue what he is doing.



At the same clock the GTX980TI has 37.5% more compute performance (22 SMM vs. 16 SMM).

You can mix Graphics and Compute into the same batch on GCN. The cost of a context switch is only 1 cycle (vs up to 1000 on Maxwell).

But I'm not even talking about execution, I'm talking about processing. While graphics work is populating two SMM quadrants, neither can do compute and vice versa. That's because they share the same cache. CUs, however, can process texturing jobs while processing compute jobs.

The Maxwell SM retains the same number of instruction issue slots per clock and reduces arithmetic latencies compared to the Kepler design.
Oh really NVIDIA, tell me more...

However the maximum number of active thread blocks per multiprocessor has been doubled over SMX to 32, which should result in an automatic occupancy improvement for kernels that use small thread blocks of 64 or fewer threads (assuming available registers and shared memory are not the occupancy limiter).

So increased occupancy, over Kepler, results in...
Reduced Arithmetic Instruction Latency

Another major improvement of SMM is that dependent arithmetic instruction latencies have been significantly reduced. Because occupancy (which translates to available warp-level parallelism) is the same or better on SMM than on SMX, these reduced latencies improve utilization and throughput.
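The arithmetic behind that quote, assuming (as NVIDIA says) that registers and shared memory are not the occupancy limiter; the 2048-thread residency cap is the published per-multiprocessor figure for both architectures:

```python
# Occupancy for small thread blocks on Kepler SMX vs Maxwell SMM.
MAX_RESIDENT_THREADS = 2048  # per multiprocessor on both designs

def occupancy(max_blocks, threads_per_block):
    resident = min(max_blocks * threads_per_block, MAX_RESIDENT_THREADS)
    return resident / MAX_RESIDENT_THREADS

print(f"SMX (16 blocks) with 64-thread blocks: {occupancy(16, 64):.0%}")  # 50%
print(f"SMM (32 blocks) with 64-thread blocks: {occupancy(32, 64):.0%}")  # 100%
```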

But this is hampered by what I bolded above: shared memory.

So if you increase the number of units (compute, raster, ROPs, texture, etc.), you increase the load on shared memory, which raises the occupancy of the available memory resources, which in turn drives latencies up.

So like I said, and NVIDIA agrees, a GTX 980 Ti will have poorer per-SIMD performance than a GTX 970 if both are stressed. The number of SIMDs on the GTX 980 Ti will outweigh the increase in latency, but 60% more flops does not result in 60% more compute performance. Efficiency is key.
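Or in toy-model form (the efficiency figures below are invented purely to illustrate the shape of the argument, not measurements):

```python
# Toy model: extra SIMDs add paper flops but lower per-SIMD efficiency,
# so the net gain is smaller than the paper gain.
def effective_perf(paper_flops, efficiency):
    return paper_flops * efficiency

perf_970 = effective_perf(paper_flops=1.00, efficiency=0.90)
perf_980ti = effective_perf(paper_flops=1.60, efficiency=0.80)

print("paper gain:     +60%")
print(f"effective gain: +{(perf_980ti / perf_970 - 1) * 100:.0f}%")  # ~ +42%
```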

Thanks NVIDIA

https://devblogs.nvidia.com/paralle...ould-know-about-new-maxwell-gpu-architecture/
 
Last edited:

Shivansps

Diamond Member
Sep 11, 2013
3,873
1,527
136
The pclab review is so confusing; from a Celeron G3900 to a 6700K, they all perform nearly the same...

http://pclab.pl/art67995.html

Why are reviewers getting such wild results out of this? It's like the Star Swarm crap all over again, except now it's async fuss instead of raw draw calls...
 
Last edited:
Feb 19, 2009
10,457
10
76
The pclab review is so confusing; from a Celeron G3900 to a 6700K, they all perform nearly the same...

http://pclab.pl/art67995.html

Why are reviewers getting such wild results out of this? It's like the Star Swarm crap all over again, except now it's async fuss instead of raw draw calls...

When has pclab ever been consistent with the rest of the tech sites?

This is why they are considered a joke site, similar to ABT.

[H] is doing all they can to be added to that list with their forum tirade against AMD too. :/ Not impressed at all.
 

Samwell

Senior member
May 10, 2015
225
47
101
Very interesting stuff measured by the German Tom's Hardware:
http://www.tomshardware.de/ashes-of...tx-12-dx12-gaming,testberichte-242049.html#p5

I thought async might improve efficiency more, but it's staying more or less the same for the Fury X. The GTX980 is staying the same, which I expected for NVIDIA, but the 980TI is losing efficiency with DX12 and even more with async. Only the 390X profits from async, but at 330W it's very inefficient anyway.
 

monstercameron

Diamond Member
Feb 12, 2013
3,818
1
0
Very interesting stuff measured by the German Tom's Hardware:
http://www.tomshardware.de/ashes-of...tx-12-dx12-gaming,testberichte-242049.html#p5

I thought async might improve efficiency more, but it's staying more or less the same for the Fury X. The GTX980 is staying the same, which I expected for NVIDIA, but the 980TI is losing efficiency with DX12 and even more with async. Only the 390X profits from async, but at 330W it's very inefficient anyway.

Efficiency isn't about high power usage, it's about performance per watt.

Also, we can definitively say that @zlatan is a legitimate source.
 

kondziowy

Senior member
Feb 19, 2016
212
188
116
Last edited:

Erenhardt

Diamond Member
Dec 1, 2012
3,251
105
101
Very interesting stuff measured by the German Tom's Hardware:
http://www.tomshardware.de/ashes-of...tx-12-dx12-gaming,testberichte-242049.html#p5

I thought async might improve efficiency more, but it's staying more or less the same for the Fury X. The GTX980 is staying the same, which I expected for NVIDIA, but the 980TI is losing efficiency with DX12 and even more with async. Only the 390X profits from async, but at 330W it's very inefficient anyway.


Here is my post from 2013 on this matter:
http://forums.anandtech.com/showpost.php?p=35576139&postcount=624

Though I expected Mantle to use this, and did not know that DX12 would be based on Mantle.

Still, you heard it here first!
 
Feb 19, 2009
10,457
10
76
Here is my post from 2013 on this matter:
http://forums.anandtech.com/showpost.php?p=35576139&postcount=624

Though I expected Mantle to use this, and did not know that DX12 would be based on Mantle.

Still, you heard it here first!

Some custom GPUs have a more liberal TDP limit in their BIOS than others. The Fury X seems to obey its paper TDP.

But certainly, a GPU stressing more units at once due to parallel graphics + compute is going to use more power.

Also, nice thread, a blast from the past, 2013! Funny; as I said back then, all the Mantle haters in that thread would soon be running games built on Mantle's code. They detested anything to do with AMD, but it's gonna be everywhere!!
 

Samwell

Senior member
May 10, 2015
225
47
101
Efficiency isn't about high power usage, it's about performance per watt.

Also, we can definitively say that @zlatan is a legitimate source.

No idea what you're talking about. I'm talking about Perf/W getting even slightly worse with Async with Fury in AotS.

Here is my post from 2013 on this matter:
http://forums.anandtech.com/showpost.php?p=35576139&postcount=624

Though I expected Mantle to use this, and did not know that DX12 would be based on Mantle.

Still, you heard it here first!

Yeah nice

Some custom GPUs have a more liberal TDP limit in their BIOS than others. The Fury X seems to obey its paper TDP.

But certainly, a GPU stressing more units at once due to parallel graphics + compute is going to use more power.

Yes, higher power was expected, but why is perf/W going down with Fury and async, even if only slightly? It somehow makes no sense to me.
 

Dygaza

Member
Oct 16, 2015
176
34
101
Yes, higher power was expected, but why is perf/W going down with Fury and async, even if only slightly? It somehow makes no sense to me.

That difference is very small, so it can very well be within the margin of error. Run a second pass and get 0.5 fps more at the same wattage, and the efficiency looks better. But of course it's also always possible that there is a small inefficiency in the async code.
 

Riek

Senior member
Dec 16, 2008
409
14
76
Very interesting stuff measured by the German Tom's Hardware:
http://www.tomshardware.de/ashes-of...tx-12-dx12-gaming,testberichte-242049.html#p5

I thought async might improve efficiency more, but it's staying more or less the same for the Fury X. The GTX980 is staying the same, which I expected for NVIDIA, but the 980TI is losing efficiency with DX12 and even more with async. Only the 390X profits from async, but at 330W it's very inefficient anyway.

It would be interesting to see if the GPU clocks are affected by TDP limitations.
 

kondziowy

Senior member
Feb 19, 2016
212
188
116
Why would perf/W improve, when you are using more hardware and need to do more work to distribute tasks? It would make no sense to me.
 
Feb 19, 2009
10,457
10
76
That difference is very small, so it can very well be within the margin of error. Run a second pass and get 0.5 fps more at the same wattage, and the efficiency looks better. But of course it's also always possible that there is a small inefficiency in the async code.

Well, AT got a 20% perf gain from Fury X with Async on vs off at 4K.

Tom's got less. But it's very close in terms of % perf gained and % power-use increase. Within the margin of error.

Now that 390X at Tom's, with no TDP limit, looks as if it's mining coins! lol
 

Erenhardt

Diamond Member
Dec 1, 2012
3,251
105
101
Well, AT got a 20% perf gain from Fury X with Async on vs off at 4K.

Tom's got less. But it's very close in terms of % perf gained and % power-use increase. Within the margin of error.

Now that 390X at Tom's, with no TDP limit, looks as if it's mining coins! lol

Power consumption seems questionable for the 390X. I mean, in DX12 without AC it barely gets a boost, yet power consumption is through the roof. I mean, like 50% more. It doesn't make much sense.

Increased power consumption along with increased performance is to be expected, but we don't see it in this example.
 

Dygaza

Member
Oct 16, 2015
176
34
101
Power consumption seems questionable for the 390X. I mean, in DX12 without AC it barely gets a boost, yet power consumption is through the roof. I mean, like 50% more. It doesn't make much sense.

Increased power consumption along with increased performance is to be expected, but we don't see it in this example.

Could power gating simply work differently because the GPU is fed differently than it is under DX11?
 

Abwx

Lifer
Apr 2, 2011
11,167
3,862
136
No idea what you're talking about. I'm talking about Perf/W getting even slightly worse with Async with Fury in AotS.


16% higher power for 14% higher performance; that's less than slight, actually...
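For reference, the perf/W arithmetic on those numbers:

```python
# Perf/W change when performance rises 14% while power rises 16%.
perf_gain, power_gain = 1.14, 1.16
print(f"perf/W: {(perf_gain / power_gain - 1) * 100:+.1f}%")  # about -1.7%
```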

Most interesting TDP-wise is that with DX12 the 980 Ti Gaming uses 220-225W with or without async enabled in the game, and its efficiency is no better than Fiji's...

Power consumption seems questionable for the 390X. I mean, in DX12 without AC it barely gets a boost, yet power consumption is through the roof. I mean, like 50% more. It doesn't make much sense.

Increased power consumption along with increased performance is to be expected, but we don't see it in this example.

It's also the case for the 980TI...
 
Last edited:

Udgnim

Diamond Member
Apr 16, 2008
3,664
111
106
Power consumption seems questionable for the 390X. I mean, in DX12 without AC it barely gets a boost, yet power consumption is through the roof. I mean, like 50% more. It doesn't make much sense.

Increased power consumption along with increased performance is to be expected, but we don't see it in this example.

They're using an MSI model.

MSI 390 models tend to use more power than other manufacturers' models.
 

sontin

Diamond Member
Sep 12, 2011
3,273
149
106
Why would perf/W improve, when you are using more hardware and need to do more work to distribute tasks? It would make no sense to me.

You can reduce the CPU workload and time:

https://developer.nvidia.com/transitioning-opengl-vulkan

NVIDIA cards don't take advantage of DX12, so the CPU overhead should be much lower. But this doesn't happen on a GTX980TI... More proof that the rendering path is not optimized for NVIDIA's hardware.
 