GP100 and GP104 are different architectures

NTMBK

Lifer
Nov 14, 2011
10,269
5,134
136
  • GP100 has 64 FP32 cores per SM, GP104 has 128 FP32 cores per SM
  • GP100 has 2 processing blocks per SM, GP104 has 4 processing blocks per SM
  • GP100 has 64KB of shared memory per SM, GP104 has 96KB of shared memory per SM
  • GP100 has 32k registers per processing block, GP104 has 16k registers per processing block
  • GP100 has 32 FP64 cores per SM, GP104 has 4 FP64 cores per SM
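
A rough sketch of the arithmetic behind the list above, in Python. The dictionary layout and field names are made up for illustration, and "32k"/"16k" registers are assumed to mean 32 x 1024 and 16 x 1024; only the per-SM figures come from the list.

```python
# Per-SM figures copied from the list above. The dict layout and the
# field names are hypothetical, chosen only for this sketch.
SM_SPECS = {
    "GP100": {"fp32_cores": 64, "blocks": 2, "regs_per_block_k": 32,
              "fp64_cores": 32, "shared_kb": 64},
    "GP104": {"fp32_cores": 128, "blocks": 4, "regs_per_block_k": 16,
              "fp64_cores": 4, "shared_kb": 96},
}


def derived(spec):
    """Derive per-block and per-core ratios from the per-SM figures."""
    regs_per_sm = spec["regs_per_block_k"] * 1024 * spec["blocks"]  # assumes k = 1024
    return {
        "fp32_per_block": spec["fp32_cores"] // spec["blocks"],
        "regs_per_sm_k": regs_per_sm // 1024,
        "regs_per_fp32_core": regs_per_sm // spec["fp32_cores"],
        "fp32_per_fp64": spec["fp32_cores"] // spec["fp64_cores"],
        "shared_bytes_per_fp32_core": spec["shared_kb"] * 1024 // spec["fp32_cores"],
    }


for chip, spec in SM_SPECS.items():
    print(chip, derived(spec))
```

Taken at face value, both chips end up with 32 FP32 cores per processing block and 64k registers per SM; the GP100 SM simply has half as many FP32 cores sharing those registers, plus a far higher FP64-to-FP32 ratio.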

Hardware.fr has some helpful diagrams:





I know it's controversial to say, but... GP104 looks like a straight die shrink of Maxwell. Whereas GP100 looks like a new, compute oriented architecture.
 

Flapdrol1337

Golden Member
May 21, 2014
1,677
93
91
GP100 can split its FP32 units into 2x FP16; GP104 can't. It is quite different.

It's not a Maxwell shrink either, though.

It has much quicker preemption, and performance doesn't go backward with async.
 

Flapdrol1337

Golden Member
May 21, 2014
1,677
93
91
Hm, curious: DX11 is back on top at 4K. Though the regression at 4K with async is 0.2%, not 2%, so no worse than the 980 Ti.
 

airfathaaaaa

Senior member
Feb 12, 2016
692
12
81
The regression is 3% as per the chart (2.8% in reality).
It's worse than it looks: given the compute power the 1080 has, one would assume it would be 10-15% ahead of AMD in DX12 titles, yet on serious sites it is only 8-10% faster.
 

renderstate

Senior member
Apr 23, 2016
237
0
0
At higher resolution it's likely the 1080 already has enough work to fully utilize its units, and there is not much to gain from async compute, at least in this game. The exact opposite behavior on Fury could be a sign that it's not able to work efficiently on gfx only.
Fury has similar flops and even higher mem bandwidth; if it were truly efficient it should be on par with a 1080.

Let me repeat this again: a lack of performance improvement with async compute does not mean it's not working, it could mean instead that the GPU is already operating in an efficient way and there are not many idle cycles to fill with other work to do.

The irony is that a GPU that is not very efficient at keeping itself busy with gfx work but that has great support for async compute will see more benefits than a GPU that already does a good job at staying busy with gfx work.
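
To put the idle-cycle argument into numbers, here is a toy model in Python. It is only a sketch: the function names and all timings are hypothetical, and it assumes compute work can fill graphics bubbles perfectly, which no real GPU will manage.

```python
def frame_time_ms(gfx_busy_ms, gfx_idle_ms, compute_ms, async_on):
    """Toy frame time.

    The graphics pass keeps the GPU busy for gfx_busy_ms and leaves
    gfx_idle_ms of bubbles. With async on, compute work is assumed to
    fill those bubbles first and only the overflow runs afterwards.
    With async off, all compute runs after graphics.
    """
    gfx_ms = gfx_busy_ms + gfx_idle_ms
    if not async_on:
        return gfx_ms + compute_ms
    overflow_ms = max(0.0, compute_ms - gfx_idle_ms)
    return gfx_ms + overflow_ms


def async_gain(gfx_busy_ms, gfx_idle_ms, compute_ms):
    """Relative speedup from turning async on (0.10 means 10% faster)."""
    off = frame_time_ms(gfx_busy_ms, gfx_idle_ms, compute_ms, False)
    on = frame_time_ms(gfx_busy_ms, gfx_idle_ms, compute_ms, True)
    return off / on - 1.0


# Hypothetical numbers, not measurements of any real GPU.
print(f"GPU with few bubbles:  {async_gain(9.5, 0.5, 3.0):.1%}")
print(f"GPU with many bubbles: {async_gain(7.0, 3.0, 3.0):.1%}")
```

In this model the GPU that already stays busy on gfx gains a few percent, while the one with lots of idle time gains much more from the same feature, which is the point being made above.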

To really understand what is going on we would need a real-time breakdown of the GPU cores to see when they are not idle and what they are working on throughout the frame. I am not sure any freely available tool can provide such information for AMD and NVIDIA architectures.

Lastly, since I am sick of this, if you are not interested in an HONEST technical discussion and you are just a blind supporter of this or that company and you have nothing technical to add PLEASE IGNORE ME and go trash some other thread. Thank you. (and I wish moderator would take action on the usual 2 or 3 people that spend their days on this forum destroying every interesting discussion!)
 

ViRGE

Elite Member, Moderator Emeritus
Oct 9, 1999
31,516
167
106
They're no more different than GF100 and GF104 were. In fact it's less so. There's a difference, and sometimes it's important, but adding HPC features to one does not make them radically different. All the fundamentals are the same; the organization is just a bit different.
 

airfathaaaaa

Senior member
Feb 12, 2016
692
12
81
At higher resolution it's likely the 1080 already has enough work to fully utilize its units, and there is not much to gain from async compute, at least in this game. The exact opposite behavior on Fury could be a sign that it's not able to work efficiently on gfx only.
Fury has similar flops and even higher mem bandwidth; if it were truly efficient it should be on par with a 1080.

Let me repeat this again: a lack of performance improvement with async compute does not mean it's not working, it could mean instead that the GPU is already operating in an efficient way and there are not many idle cycles to fill with other work to do.

The irony is that a GPU that is not very efficient at keeping itself busy with gfx work but that has great support for async compute will see more benefits than a GPU that already does a good job at staying busy with gfx work.

To really understand what is going on we would need a real-time breakdown of the GPU cores to see when they are not idle and what they are working on throughout the frame. I am not sure any freely available tool can provide such information for AMD and NVIDIA architectures.

Lastly, since I am sick of this, if you are not interested in an HONEST technical discussion and you are just a blind supporter of this or that company and you have nothing technical to add PLEASE IGNORE ME and go trash some other thread. Thank you. (and I wish moderator would take action on the usual 2 or 3 people that spend their days on this forum destroying every interesting discussion!)
Yep: https://m.reddit.com/r/hardware/comments/4iaxx6/private_pascal_architecture_briefing_supposedly/
Have fun. It's clear that a lot of people confuse async timewarp with async compute, and think that because Pascal can now do the first it can do the second too...
But in reality the very same thing that happens on Maxwell is happening on Pascal.
 

Det0x

Golden Member
Sep 11, 2014
1,065
3,115
136
Has much quicker preemption and performance doesn't go backward with async.

You are correct, but it doesn't increase either.



http://www.bitsandchips.it/52-engli...scal-in-trouble-with-asyncronous-compute-code

Seems like they were semi correct 2 months ago

Pascal in trouble with Asynchronous Compute code

According to our sources, NVIDIA's next GPU microarchitecture, Pascal, will be in trouble if it has to make heavy use of Asynchronous Compute code in video games.

Broadly speaking, Pascal will be an improved version of Maxwell, especially in FP64 performance, but not in Asynchronous Compute performance. NVIDIA will bet on raw power instead of Asynchronous Compute abilities. This means that Pascal cards will be highly dependent on driver optimizations and game developers' kindness. So GameWorks optimizations will play a fundamental role in the company's strategy.

*edit*

I can't say too much on this topic, but Pascal will be an improvement over Maxwell, especially at this feature. But no, it won't have GCN-like capabilities. It will be close to GCN 1.0, but nothing more.
There are 2 ACE units in the 7970 and 8 in the 290, as I recall (?)

Seems like he knew what he was talking about
 

Det0x

Golden Member
Sep 11, 2014
1,065
3,115
136
http://wccftech.com/nvidia-gtx-1080-async-compute-detailed/

Dynamic load balancing and improved pre-emption both improve the performance of async compute code considerably on Pascal compared to Maxwell. Although principally this is not exactly the same as Asynchronous Shading or Computing. Because Pascal still can’t execute async code concurrently without pre-emption. This is quite different from AMD’s GCN architecture which has Asynchronous Compute engines that enable the execution of multiple kernels concurrently without pre-emption.

AMD has long touted the asynchronous compute capabilities of its GCN graphics architecture. The company built what it calls ACEs, Asynchronous Compute Engines, into its hardware. It’s available in all of AMD’s GCN architecture based graphics cards, including the now more than four year old HD 7970.
What Nvidia is doing with preemption and dynamic load balancing right now, while not exactly async compute, can be used to accomplish similar goals.

End result:

No async compute in sight, but at least there is no performance decline using AC, as there is with Maxwell
 

Lepton87

Platinum Member
Jul 28, 2009
2,544
9
81
http://wccftech.com/nvidia-gtx-1080-async-compute-detailed/



End result:

No async compute in sight, but at least there is no performance decline using AC, as there is with Maxwell
How is that an improvement over simply disabling it on Maxwell? Surely NV users are wise enough to turn off features that only lower the frame rates? Things like tessellating flat walls are something only AMD's users are stupid enough not to disable; that's why NV pushed to enable such frivolous features, so AMD users had the worse experience, which they deserved simply by owning forbidden hardware.
 

airfathaaaaa

Senior member
Feb 12, 2016
692
12
81
How is that an improvement over simply disabling it on Maxwell? Surely NV users are wise enough to turn off features that only lower the frame rates? Things like tessellating flat walls are something only AMD's users are stupid enough not to disable; that's why NV pushed to enable such frivolous features, so AMD users had the worse experience, which they deserved simply by owning forbidden hardware.

You know AMD didn't have any sort of tessellation control until relatively recently.
In Batman, no matter what the setting, the game always overrode it, so it didn't really matter.
 

xpea

Senior member
Feb 14, 2014
449
150
116
At higher resolution it's likely the 1080 already has enough work to fully utilize its units, and there is not much to gain from async compute, at least in this game. The exact opposite behavior on Fury could be a sign that it's not able to work efficiently on gfx only.
Fury has similar flops and even higher mem bandwidth; if it were truly efficient it should be on par with a 1080.

Let me repeat this again: a lack of performance improvement with async compute does not mean it's not working, it could mean instead that the GPU is already operating in an efficient way and there are not many idle cycles to fill with other work to do.

The irony is that a GPU that is not very efficient at keeping itself busy with gfx work but that has great support for async compute will see more benefits than a GPU that already does a good job at staying busy with gfx work.

To really understand what is going on we would need a real-time breakdown of the GPU cores to see when they are not idle and what they are working on throughout the frame. I am not sure any freely available tool can provide such information for AMD and NVIDIA architectures.

Lastly, since I am sick of this, if you are not interested in an HONEST technical discussion and you are just a blind supporter of this or that company and you have nothing technical to add PLEASE IGNORE ME and go trash some other thread. Thank you. (and I wish moderator would take action on the usual 2 or 3 people that spend their days on this forum destroying every interesting discussion!)
+100
+1000
This post should be pinned :thumbsup:

It's incredible to see AMD's biggest architecture failure, i.e. very poor usage of their raw power (both in FLOPs and memory bandwidth), being turned into praise :'(

Even worse, on AMD, devs have to do all the work and band-aid the hardware deficiencies with async compute to finally extract the performance level you'd expect from the raw numbers. And that's a good thing? :\

For months AMD has tried to hide and spin the truth with this async BS, but the reality is that AMD GCN can't reach its full potential without deep, heavy optimization from the devs, at least much more than with Nvidia.


PS: don't get me wrong, the ability to run compute and gfx queues in parallel and to switch from one to the other without penalty are good features. But we should not forget that a GPU must also be able to maximize its performance under pure gfx load.
 

Lepton87

Platinum Member
Jul 28, 2009
2,544
9
81
+100
+1000
This post should be pinned :thumbsup:

It's incredible to see AMD's biggest architecture failure, i.e. very poor usage of their raw power (both in FLOPs and memory bandwidth), being turned into praise :'(

Even worse, on AMD, devs have to do all the work and band-aid the hardware deficiencies with async compute to finally extract the performance level you'd expect from the raw numbers. And that's a good thing? :\

For months AMD has tried to hide and spin the truth with this async BS, but the reality is that AMD GCN can't reach its full potential without deep, heavy optimization from the devs, at least much more than with Nvidia.


PS: don't get me wrong, the ability to run compute and gfx queues in parallel and to switch from one to the other without penalty are good features. But we should not forget that a GPU must also be able to maximize its performance under pure gfx load.

Yeah GCN is such a failure! Look how great Kepler is from the same period!
Fury is clearly a stop-gap solution so it doesn't have the front-end necessary for such a huge number of shaders yet it still matches GM200. Overclocking potential is the only thing that makes it worse. And that is terrible? Right....
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
Renderstate, I have to disagree.

What you're saying would require that a GPU be capable of completing compute and graphics tasks in "EXACTLY" the same time. Meaning that if a compute task took 5ms, so would a graphics task. And this would have to happen without fences (synchronization points).

This would need to be the case 100% of the time. If it weren't the case then we would see either a performance boost or loss, depending on the situation, between having Async on and off.

What we see, instead, is no performance increase or loss between having Async on or off. The most likely explanation for this is that the GTX 1080 is not executing graphics and compute tasks in parallel. In other words, the GTX 1080 does not support Async compute + graphics. Instead, the GTX 1080, like Maxwell before it, is executing graphics and compute tasks quickly but in a serial manner.

This doesn't mean that Pascal doesn't support Async Compute (executing graphics and compute tasks without a defined order) but it does mean that it doesn't support Async Compute + Graphics.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
What Pascal has fixed is preemption, so switching between compute and graphics tasks no longer requires a flush of the SM.

This minimizes idle time and as such minimizes the performance impact of synchronization points (fences). Due to this, performance between having Async turned on and off is pretty much the same. No more performance loss.

Hope that explains it.
 

sontin

Diamond Member
Sep 12, 2011
3,273
149
106
Renderstate, I have to disagree.
What you're saying would require that a GPU be capable of completing compute and graphics tasks in "EXACTLY" the same time. Meaning that if a compute task took 5ms, so would a graphics task. And this would have to happen without fences (synchronization points).

Fences are used to synchronize both queues. Without fences there wouldn't be a need for load balancing. :\

What we see, instead, is no performance increase or loss between having Async on or off. The most likely explanation for this is that the GTX 1080 is not executing graphics and compute tasks in parallel. In other words, the GTX 1080 does not support Async compute + graphics. Instead, the GTX 1080, like Maxwell before it, is executing graphics and compute tasks quickly but in a serial manner.

This doesn't mean that Pascal doesn't support Async Compute (executing graphics and compute tasks without a defined order) but it does mean that it doesn't support Async Compute + Graphics.
And that is your typical nonsense. We see a low to no improvement because Pascal is nearly running at 100%. There is no room left to execute something else on the GPU.

What Pascal has fixed is preemption, so switching between compute and graphics tasks no longer requires a flush of the SM.

This minimizes idle time and as such minimizes the performance impact of synchronization points (fences). Due to this, performance between having Async turned on and off is pretty much the same. No more performance loss.

Hope that explains it.

Preemption has nothing to do with Async Compute.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
Sontin, don't opine on topics you're not well versed in.

Preemption is not tied to Async Compute directly but comes into play when switching between Graphics and Compute loads. What Async Compute + Graphics does is execute both a Compute and a Graphics task in parallel. So a switch is involved if your architecture can't execute both tasks in parallel. One task will be executed before the other with a fence placed to synchronize both queues. So preemption is involved here.

I'm fully aware of what fences are used for. I taught you.

The point of my comment was that, with fences absent from the equation, both tasks would need to have an identical execution time for renderstate's comment to be correct. The fence would be pointless in that case, as synchronization wouldn't need enforcement.

Example:
Async Compute + Graphics working
Graphics 10ms
Compute 5ms
Fence placed
Total execution time = 10ms

Async Compute + Graphics not working
Graphics 10ms
Compute 5ms
Fence placed
Total execution time = 15ms

Now you'd expect a performance loss on the latter, but with Pascal vs Fury we're talking about two different architectures. Pascal's higher clocks lower the execution times of both compute and graphics tasks.

Example:
Graphics 5ms
Compute 3ms
Fence placed
Total Execution time = 8ms

So Pascal, although it doesn't support Async Compute + Graphics, still bests Fiji because it can execute both tasks in serial quicker than Fiji can in Parallel.

What we observe is about a 5FPS lead (10%) over Fiji (which is tiny) under AotS. This compared to the huge lead Pascal enjoys in non-Async compute + graphics titles. This makes sense if the above is true.

Because in non-Async compute + graphics titles, Fiji would take 15ms, not 10ms, vs Pascal's 8ms.

Understand?
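
The examples above boil down to max() versus sum. A minimal Python sketch, using the illustrative timings from this post (not measurements) and assuming perfect overlap with no contention when both tasks run in parallel; the function name is made up:

```python
def time_to_fence_ms(graphics_ms, compute_ms, overlapped):
    """Time until the fence is reached.

    If the two queues truly overlap, the fence waits for the longer
    task; if they run back to back, it waits for the sum of both.
    """
    if overlapped:
        return max(graphics_ms, compute_ms)
    return graphics_ms + compute_ms


# Figures from the example in the post; illustrative only.
fiji_parallel = time_to_fence_ms(10, 5, overlapped=True)
fiji_serial = time_to_fence_ms(10, 5, overlapped=False)
pascal_serial = time_to_fence_ms(5, 3, overlapped=False)

print("Fiji, async compute + graphics:", fiji_parallel, "ms")
print("Fiji, serial:", fiji_serial, "ms")
print("Pascal, serial:", pascal_serial, "ms")
```

With these made-up figures the serial-but-faster chip still finishes first, and its lead shrinks when the other chip is allowed to overlap its work.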

As for Maxwell vs Pascal. The context switch involved between Graphics and Compute tasks caused a performance loss with Async compute + graphics turned on. Maxwell would process the compute task and then the graphics task with the fence enforcing synchronization between the two. This caused a performance loss.

With Pascal, that issue is fixed. So now Pascal only loses 0.1 FPS when Async compute + graphics is turned on. Pascal doesn't gain anything, because it doesn't support the feature, but it doesn't really lose anything either.

 

maddie

Diamond Member
Jul 18, 2010
4,787
4,771
136
Fences are used to synchronize both queues. Without fences there wouldn't be a need for load balancing. :\

And that is your typical nonsense. We see a low to no improvement because Pascal is nearly running at 100%. There is no room left to execute something else on the GPU.



Preemption has nothing to do with Async Compute.
That is the thing that you and a few others here seem either unable or unwilling to accept.

With modern complex programs, it might be true to say that you always have room to execute something else on the processor, except for brief periods of full usage. Thus true Async Compute is beneficial.

Of course, admitting this would also mean you have to admit that Nvidia has a deficiency in their GPUs. Impossible, thus the denial.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
Oh, and it's not necessarily preemption that is involved but a context switch. Preemption is based on allowing for smooth context switching without incurring a performance loss from the latency involved in the switch.

Asynchronous Compute has nothing to do with context switching. Asynchronous compute + graphics, however, makes use of a context switch. This is because you're switching between a graphics context and a compute context.

With Maxwell, the SM required a flush prior to a switch. With Pascal, this is no longer the case. Pascal can switch contexts at a finer grained level, instruction level, rather than a coarse grained level, thread block level.
 

Mahigan

Senior member
Aug 22, 2015
573
0
0
I'll add more...

What is preemption used for and why is a context switch involved?

Preemption is used to pause/stop currently executing work in order to push through a higher priority workload.

When is preemption used? VR and more specifically when an Asynchronous Time Warp is required. Why an Asynchronous time warp? Imagine you're playing a game, in VR, and you move your head. This movement of your head changes your viewing angle. What an asynchronous time warp does is that it takes the last rendered frame and warps it slightly in order to take the new viewing angle into account. An asynchronous time warp uses a compute shader in order to warp the last rendered frame.

So your GPU is executing a graphics task and you move your head, that graphics task is paused, an asynchronous time warp compute shader is executed at a higher priority and then the graphics task execution continues.

So what happened was that you switched from a graphics context to a compute context and back to a graphics context. In other words, context switching.

Preemption makes use of context switching.

Some architectures, like Maxwell, suffer from slow context switching. What that means is this:

Graphics task is executing and you move your head. The application waits on the graphics task to finish execution (thread block), the SM then has to be flushed and then the compute task is pushed through. So you end up with a longer execution time (frame latency) which affects your FPS because you have to wait for the draw call to complete before you can preempt it with a compute task.

So switching between compute and graphics tasks, even if they're executed one after the other, incurs latency if your architecture suffers from slow context switching.

Pascal rectifies this.
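
A toy Python model of the difference described above. It is a sketch under stated assumptions: the function name, the two granularity labels and every timing are hypothetical, and the only behaviour modelled is "wait for the running thread block, then switch" versus "switch right away".

```python
def wait_before_timewarp_ms(block_remaining_ms, switch_cost_ms, granularity):
    """How long a high-priority compute task waits before it can start.

    "thread_block": the running thread block must finish first, then the
    switch (flush) cost is paid, as described for Maxwell above.
    "instruction": the running work is interrupted immediately, so only
    the switch cost is paid, as described for Pascal.
    """
    if granularity == "thread_block":
        return block_remaining_ms + switch_cost_ms
    if granularity == "instruction":
        return switch_cost_ms
    raise ValueError(f"unknown granularity: {granularity!r}")


# Hypothetical numbers: 2 ms of draw work still in flight, 0.1 ms to switch.
coarse = wait_before_timewarp_ms(2.0, 0.1, "thread_block")
fine = wait_before_timewarp_ms(2.0, 0.1, "instruction")
print(f"coarse-grained wait: {coarse:.1f} ms, fine-grained wait: {fine:.1f} ms")
```

The longer the in-flight graphics work, the worse the coarse-grained case gets, while the fine-grained wait stays flat; that is why this matters most for latency-sensitive work like an asynchronous time warp.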
 

sontin

Diamond Member
Sep 12, 2011
3,273
149
106
Preemption is not tied to Async Compute directly but comes into play when switching between Graphics and Compute loads. What Async Compute + Graphics does is execute both a Compute and a Graphics task in parallel. So a switch is involved if your architecture can't execute both tasks in parallel. One task will be executed before the other with a fence placed to synchronize both queues. So preemption is involved here.

There is no switch involved when you preempt one workload after another. This is the way nVidia is dealing with Graphics and Compute under DX12.

I'm fully aware of what fences are used for. I taught you.

The point of my comment was that, with fences absent from the equation, both tasks would need to have an identical execution time for renderstate's comment to be correct. The fence would be pointless in that case, as synchronization wouldn't need enforcement.

Example:
Async Compute + Graphics working
Graphics 10ms
Compute 5ms
Fence placed
Total execution time = 10ms

Async Compute + Graphics not working
Graphics 10ms
Compute 5ms
Fence placed
Total execution time = 15ms

Fences are a synchronisation point for the queues. Without fences the driver and hardware can schedule new workload immediately - so no need for load balancing.

Understand?

Maxwell loses performance with Async Compute. Pascal would be losing the same amount of performance. Pascal gains performance in lower resolution or with less workload. This would be impossible if you were right.
 

NTMBK

Lifer
Nov 14, 2011
10,269
5,134
136
Oh god, I didn't mean for this to turn into yet another Async Compute argument D: I just found it interesting that NVidia has made such a different design for their gaming card vs. their compute card.
 