GP100 and GP104 are different architectures

NTMBK · May 18, 2016

GP100 has 64 FP32 cores per SM, GP104 has 128 FP32 cores per SM
GP100 has 2 processing blocks per SM, GP104 has 4 processing blocks per SM
GP100 has 64KB of shared memory per SM, GP104 has 96KB of shared memory per SM
GP100 has 32k registers per processing block, GP104 has 16k registers per processing block
GP100 has 32 FP64 cores per SM, GP104 has 4 FP64 cores per SM
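
A quick way to compare the two SM layouts is to derive per-block and per-core ratios from the figures listed above. The following is only a sketch: the dictionary keys and the `derived` helper are made-up names for illustration, and the inputs are the numbers quoted in this post, not values read from hardware.

```python
# Per-SM figures as quoted in the post above (not queried from a device).
SPECS = {
    "GP100": {"fp32_per_sm": 64, "blocks_per_sm": 2,
              "regs_per_block": 32 * 1024, "fp64_per_sm": 32, "shared_kb": 64},
    "GP104": {"fp32_per_sm": 128, "blocks_per_sm": 4,
              "regs_per_block": 16 * 1024, "fp64_per_sm": 4, "shared_kb": 96},
}


def derived(spec):
    """Derive per-block and per-core ratios from the quoted per-SM figures."""
    regs_per_sm = spec["regs_per_block"] * spec["blocks_per_sm"]
    return {
        "fp32_per_block": spec["fp32_per_sm"] // spec["blocks_per_sm"],
        "regs_per_sm": regs_per_sm,
        "regs_per_fp32_core": regs_per_sm // spec["fp32_per_sm"],
        "fp32_per_fp64": spec["fp32_per_sm"] // spec["fp64_per_sm"],
        "shared_kb_per_fp32_core": spec["shared_kb"] / spec["fp32_per_sm"],
    }


if __name__ == "__main__":
    for chip, spec in SPECS.items():
        print(chip, derived(spec))
```

With these inputs, both chips end up with 32 FP32 cores per processing block and the same register count per SM, but GP100 has twice the registers per FP32 core and a 2:1 rather than 32:1 FP32-to-FP64 ratio.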

Hardware.fr has some helpful diagrams:

I know it's controversial to say, but... GP104 looks like a straight die shrink of Maxwell, whereas GP100 looks like a new, compute-oriented architecture.

Flapdrol1337 · May 18, 2016

GP100 can split its FP32 units into 2x FP16; GP104 can't. It is quite different.

It's not a Maxwell shrink either, though.

It has much quicker preemption, and performance doesn't go backward with async.

airfathaaaaa · May 18, 2016

You should check your link more carefully and provide the 4K result too:
http://www.computerbase.de/2016-05/...agramm-ashes-of-the-singularity-async-compute
Preemption has nothing to do with async compute; it's for async time warp.

What you see at 4K is basically the card running out of steam to brute-force it, so it just regresses 2-3%, and that is worse than the 980 Ti...

Flapdrol1337 · May 18, 2016

Hm, curious: DX11 is back on top at 4K. The regression at 4K with async is 0.2%, though, not 2%, and that's not worse than the 980 Ti.

airfathaaaaa · May 18, 2016

The regression is 3% as per the chart (2.8% in reality).
It's worse than that: given the compute power the 1080 has, one would assume it would be 10-15% ahead of AMD in DX12 titles, yet on serious sites it is only 8-10% faster.

renderstate · May 18, 2016

At higher resolution it's likely the 1080 already has enough work to fully utilize its units and there is not much to gain from async compute, at least in this game. The exact opposite behavior on Fury could be a sign that it's not able to work efficiently on gfx only.
Fury has similar flops and even higher mem bandwidth; if it were truly efficient it should be on par with a 1080.

Let me repeat this again: a lack of performance improvement with async compute does not mean it's not working, it could mean instead that the GPU is already operating in an efficient way and there are not many idle cycles to fill with other work to do.

The irony is that a GPU that is not very efficient at keeping itself busy with gfx work but that has great support for async compute will see more benefits than a GPU that already does a good job at staying busy with gfx work.
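
The argument can be illustrated with a toy frame-time model. This is only a sketch under strong assumptions: it treats the idle fraction of the graphics pass as capacity that compute work can fill perfectly, and every number in it is hypothetical rather than measured on any GPU.

```python
def frame_time_serial(gfx_ms, compute_ms):
    """Compute runs only after graphics has finished."""
    return gfx_ms + compute_ms


def frame_time_async(gfx_ms, compute_ms, idle_frac):
    """Compute fills the idle share of the graphics pass; the rest runs afterwards.

    idle_frac is the (hypothetical) fraction of the graphics pass during which
    shader units sit idle and could take on compute work.
    """
    hidden_ms = min(compute_ms, gfx_ms * idle_frac)
    return gfx_ms + compute_ms - hidden_ms


def async_gain(gfx_ms, compute_ms, idle_frac):
    """Relative speedup from enabling async compute in this toy model."""
    serial = frame_time_serial(gfx_ms, compute_ms)
    return serial / frame_time_async(gfx_ms, compute_ms, idle_frac) - 1.0


if __name__ == "__main__":
    # Hypothetical: same workload, one GPU mostly busy, one with many idle cycles.
    print(f"busy GPU (5% idle):  {async_gain(10.0, 3.0, 0.05):.1%}")
    print(f"idle GPU (50% idle): {async_gain(10.0, 3.0, 0.50):.1%}")
```

In this model a GPU with no idle cycles gains nothing from async compute even if the feature works perfectly, which is the point being made above.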

To really understand what is going on we would need a real-time breakdown of the GPU cores to see when they are not idle and what they are working on throughout the frame. I am not sure any freely available tool can provide such information for AMD and NVIDIA architectures.

Lastly, since I am sick of this, if you are not interested in an HONEST technical discussion and you are just a blind supporter of this or that company and you have nothing technical to add PLEASE IGNORE ME and go trash some other thread. Thank you. (and I wish moderator would take action on the usual 2 or 3 people that spend their days on this forum destroying every interesting discussion!)

ViRGE · May 18, 2016

They're no more different than GF100 and GF104 were. In fact it's less so. There's a difference, and sometimes it's important, but adding HPC features to one does not make them radically different. All the fundamentals are the same; the organization is just a bit different.

airfathaaaaa · May 18, 2016

renderstate said:
At higher resolution it's likely the 1080 already has enough work to fully utilize its units and there is not much to gain from async compute, at least in this game. The exact opposite behavior on Fury could be a sign that it's not able to work efficiently on gfx only.
Fury has similar flops and even higher mem bandwidth; if it were truly efficient it should be on par with a 1080.

Let me repeat this again: a lack of performance improvement with async compute does not mean it's not working, it could mean instead that the GPU is already operating in an efficient way and there are not many idle cycles to fill with other work to do.

The irony is that a GPU that is not very efficient at keeping itself busy with gfx work but that has great support for async compute will see more benefits than a GPU that already does a good job at staying busy with gfx work.

To really understand what is going on we would need a real-time breakdown of the GPU cores to see when they are not idle and what they are working on throughout the frame. I am not sure any freely available tool can provide such information for AMD and NVIDIA architectures.

Lastly, since I am sick of this, if you are not interested in an HONEST technical discussion and you are just a blind supporter of this or that company and you have nothing technical to add PLEASE IGNORE ME and go trash some other thread. Thank you. (and I wish moderator would take action on the usual 2 or 3 people that spend their days on this forum destroying every interesting discussion!)

yeap https://m.reddit.com/r/hardware/comments/4iaxx6/private_pascal_architecture_briefing_supposedly/
Have fun. It's clear that people confuse async timewarp with async compute a lot, and they think that because it can now do the first, it can do the second too...
But in reality the very same thing that is happening on Maxwell is happening on Pascal.

Det0x · May 18, 2016

Flapdrol1337 said:
Has much quicker pre emption and performance doesn't go backward with async.

You are correct, but it doesn't increase either..

http://www.bitsandchips.it/52-engli...scal-in-trouble-with-asyncronous-compute-code

Seems like they were semi correct 2 months ago

Pascal in trouble with Asynchronous Compute code

According to our sources, NVIDIA's next GPU microarchitecture, Pascal, will be in trouble if it has to make heavy use of Asynchronous Compute code in video games.

Broadly speaking, Pascal will be an improved version of Maxwell, especially in FP64 performance, but not in Asynchronous Compute performance. NVIDIA will bet on raw power instead of Asynchronous Compute abilities. This means that Pascal cards will be highly dependent on driver optimizations and game developers' kindness. So GameWorks optimizations will play a fundamental role in the company's strategy.

*edit*

zlatan said:
I can't say to much on this topic, but Pascal will be an improvement over Maxwell especially at this feature. But no, it won't have GCN-like capabilities. It will be close to GCN 1.0, but nothing more.

There are 2 ACE units in the 7970 and 8 in the 290, as I recall (?)

Seems like he knew what he was talking about

Det0x · May 18, 2016

http://wccftech.com/nvidia-gtx-1080-async-compute-detailed/

Dynamic load balancing and improved pre-emption both improve the performance of async compute code considerably on Pascal compared to Maxwell. Although principally this is not exactly the same as Asynchronous Shading or Computing. Because Pascal still can’t execute async code concurrently without pre-emption. This is quite different from AMD’s GCN architecture which has Asynchronous Compute engines that enable the execution of multiple kernels concurrently without pre-emption.

AMD has long touted the asynchronous compute capabilities of its GCN graphics architecture. The company built what it calls ACEs, Asynchronous Compute Engines, into its hardware. It’s available in all of AMD’s GCN architecture based graphics cards, including the now more than four year old HD 7970.
What Nvidia is doing with preemption and dynamic load balancing right now, while not exactly async compute, can be used to accomplish similar goals.

End result:

No async compute in sight, but at least there is no performance decline using AC, as there is with Maxwell

Lepton87 · May 18, 2016

Det0x said:
http://wccftech.com/nvidia-gtx-1080-async-compute-detailed/

End result:

No async compute in sight, but at least there is no performance decline using AC, as there is with Maxwell

How is that an improvement over simply disabling it on Maxwell? Surely NV users are wise enough to turn off features that only lower the frame rate? Tessellating flat walls is something that only AMD's users are stupid enough not to disable; that's why NV pushed to enable such frivolous features, so that AMD users had the worse experience they deserved simply for owning forbidden hardware.

airfathaaaaa · May 18, 2016

Lepton87 said:
How is that an improvement over simply disabling it on Maxwell? Surely NV users are wise enough to turn off features that only lower the frame rate? Tessellating flat walls is something that only AMD's users are stupid enough not to disable; that's why NV pushed to enable such frivolous features, so that AMD users had the worse experience they deserved simply for owning forbidden hardware.

You know AMD didn't have any sort of tessellation control until relatively recently.
In Batman, no matter what the setting, the game always overrode it, so it didn't really matter.

xpea · May 18, 2016

renderstate said:
At higher resolution it's likely the 1080 already has enough work to fully utilize its units and there is not much to gain from async compute, at least in this game. The exact opposite behavior on Fury could be a sign that it's not able to work efficiently on gfx only.
Fury has similar flops and even higher mem bandwidth; if it were truly efficient it should be on par with a 1080.

Let me repeat this again: a lack of performance improvement with async compute does not mean it's not working, it could mean instead that the GPU is already operating in an efficient way and there are not many idle cycles to fill with other work to do.

The irony is that a GPU that is not very efficient at keeping itself busy with gfx work but that has great support for async compute will see more benefits than a GPU that already does a good job at staying busy with gfx work.

To really understand what is going on we would need a real-time breakdown of the GPU cores to see when they are not idle and what they are working on throughout the frame. I am not sure any freely available tool can provide such information for AMD and NVIDIA architectures.

Lastly, since I am sick of this, if you are not interested in an HONEST technical discussion and you are just a blind supporter of this or that company and you have nothing technical to add PLEASE IGNORE ME and go trash some other thread. Thank you. (and I wish moderator would take action on the usual 2 or 3 people that spend their days on this forum destroying every interesting discussion!)

+100
+1000
This post should be pinned :thumbsup:

It's incredible to see AMD's biggest architectural failure, i.e. very poor usage of their raw power (both in FLOPs and memory bandwidth), turned into praise :'(

Even worse, on AMD, devs have to do all the work and band-aid the hardware deficiencies with async compute to finally extract the performance level you'd expect from the raw numbers. And that's a good thing? :\

For months, AMD has tried to hide and spin the truth with this async BS, but the reality is that AMD GCN can't reach its full potential without deep, heavy optimization from the devs, at least much more than with Nvidia.

PS: don't get me wrong, the ability to run compute and gfx queues in parallel and to switch from one to the other without penalty are good features. But we should not forget that a GPU must also be able to maximize its performance under a pure gfx load.

Lepton87 · May 18, 2016

xpea said:
+100
+1000
This post should be pinned :thumbsup:

It's incredible to see AMD's biggest architectural failure, i.e. very poor usage of their raw power (both in FLOPs and memory bandwidth), turned into praise :'(

Even worse, on AMD, devs have to do all the work and band-aid the hardware deficiencies with async compute to finally extract the performance level you'd expect from the raw numbers. And that's a good thing? :\

For months, AMD has tried to hide and spin the truth with this async BS, but the reality is that AMD GCN can't reach its full potential without deep, heavy optimization from the devs, at least much more than with Nvidia.

PS: don't get me wrong, the ability to run compute and gfx queues in parallel and to switch from one to the other without penalty are good features. But we should not forget that a GPU must also be able to maximize its performance under a pure gfx load.

Yeah GCN is such a failure! Look how great Kepler is from the same period!
Fury is clearly a stop-gap solution so it doesn't have the front-end necessary for such a huge number of shaders yet it still matches GM200. Overclocking potential is the only thing that makes it worse. And that is terrible? Right....

sweetusernames · May 19, 2016

what

isn't it even more sad that with GCN's poor shader usage, it can still compete with Maxwell?

Mahigan · May 19, 2016

Renderstate, I have to disagree.

What you're saying would require that a GPU be capable of completing compute and graphics tasks in "EXACTLY" the same time, meaning that if a compute task took 5ms, so would a graphics task, and all this without fences (synchronization points).

This would need to be the case 100% of the time. If it weren't the case then we would see either a performance boost or loss, depending on the situation, between having Async on and off.

What we see, instead, is no performance increase or loss between having Async on or off. The most likely explanation for this is that the GTX 1080 is not executing graphics and compute tasks in parallel. In other words, the GTX 1080 does not support Async compute + graphics. Instead, the GTX 1080, like Maxwell before it, is executing graphics and compute tasks quickly but in a serial manner.

This doesn't mean that Pascal doesn't support Async Compute (executing graphics and compute tasks without a defined order) but it does mean that it doesn't support Async Compute + Graphics.

Mahigan · May 19, 2016

What Pascal has fixed is preemption, so switching between compute and graphics tasks no longer requires a flush of the SM.

This minimizes idle time and as such minimizes the performance impact of synchronization points (fences). Due to this, performance between having Async turned on and off is pretty much the same. No more performance loss.

Hope that explains it.

sontin · May 19, 2016

Mahigan said:
Renderstate, I have to disagree.
What you're saying would require that a GPU be capable of completing compute and graphics tasks in "EXACTLY" the same time, meaning that if a compute task took 5ms, so would a graphics task, and all this without fences (synchronization points).

Fences are used to synchronize both queues. Without fences there wouldnt be a need for load balancing. :\

What we see, instead, is no performance increase or loss between having Async on or off. The most likely explanation for this is that the GTX 1080 is not executing graphics and compute tasks in parallel. In other words, the GTX 1080 does not support Async compute + graphics. Instead, the GTX 1080, like Maxwell before it, is executing graphics and compute tasks quickly but in a serial manner.

This doesn't mean that Pascal doesn't support Async Compute (executing graphics and compute tasks without a defined order) but it does mean that it doesn't support Async Compute + Graphics.

And that is your typical nonsense. We see a low to no improvement because Pascal is nearly running at 100%. There is no room left to execute something else on the GPU.

Mahigan said:
What Pascal has fixed is preemption, so switching between compute and graphics tasks no longer requires a flush of the SM.

This minimizes idle time and as such minimizes the performance impact of synchronization points (fences). Due to this, performance between having Async turned on and off is pretty much the same. No more performance loss.

Hope that explains it.

Preemption has nothing to do with Async Compute.

Mahigan · May 19, 2016

Sontin, don't opine on topics you're not well versed in.

Preemption is not tied to Async Compute directly but comes into play when switching between Graphics and Compute loads. What Async Compute + Graphics does is execute both a Compute and a Graphics task in parallel. So a switch is involved if your architecture can't execute both tasks in parallel. One task will be executed before the other with a fence placed to synchronize both queues. So preemption is involved here.

I'm fully aware of what fences are used for. I taught you.

The point of my comment was that absent fences, from the equation, both tasks would need to have an identical execution time for renderstates comment to be correct. The fence would be pointless in that case as synchronization wouldn't need enforcement.

Example:
Async Compute + Graphics working
Graphics 10ms
Compute 5ms
Fence placed
Total execution time = 10ms

Async Compute + Graphics not working
Graphics 10ms
Compute 5ms
Fence placed
Total execution time = 15ms

Now you'd expect a performance loss in the latter case, but with Pascal vs Fury we're talking about two different architectures. Pascal's higher clocks lower the execution times of both compute and graphics tasks.

Example:
Graphics 5ms
Compute 3ms
Fence placed
Total Execution time = 8ms

So Pascal, although it doesn't support Async Compute + Graphics, still bests Fiji because it can execute both tasks in serial quicker than Fiji can in Parallel.

What we observe is about a 5FPS lead (10%) over Fiji (which is tiny) under AotS. This compared to the huge lead Pascal enjoys in non-Async compute + graphics titles. This makes sense if the above is true.

Because in non-Async compute + graphics titles, Fiji would take 15ms, not 10ms, vs Pascal's 8ms.

Understand?

As for Maxwell vs Pascal. The context switch involved between Graphics and Compute tasks caused a performance loss with Async compute + graphics turned on. Maxwell would process the compute task and then the graphics task with the fence enforcing synchronization between the two. This caused a performance loss.

With Pascal, that issue is fixed. So now Pascal only loses 0.1 FPS when Async compute + graphics is turned on. Pascal doesn't gain anything, because it doesn't support the feature, but it doesn't really lose anything either.
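
The timing examples in this post reduce to two formulas: with overlap, the frame costs the longer of the two tasks; without it, the sum. A minimal sketch using the post's own illustrative numbers (they are the poster's hypothetical figures, not benchmark results, and the function names are made up here):

```python
def overlapped_ms(gfx_ms, compute_ms):
    """Graphics and compute run in parallel; the fence waits for the slower one."""
    return max(gfx_ms, compute_ms)


def serial_ms(gfx_ms, compute_ms):
    """Graphics and compute run back to back."""
    return gfx_ms + compute_ms


if __name__ == "__main__":
    # Hypothetical per-task times taken from the examples in the post.
    fiji_async = overlapped_ms(10, 5)   # "Async Compute + Graphics working"
    fiji_serial = serial_ms(10, 5)      # "Async Compute + Graphics not working"
    pascal_serial = serial_ms(5, 3)     # faster tasks, but executed serially

    print("Fiji, async:    ", fiji_async, "ms")
    print("Fiji, serial:   ", fiji_serial, "ms")
    print("Pascal, serial: ", pascal_serial, "ms")
    print("Pascal ahead even against async Fiji:", pascal_serial < fiji_async)
```

Under these assumed numbers the gap is 8ms vs 10ms when the other card overlaps its work and 8ms vs 15ms when it does not, which is the shape of the argument made above.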

maddie · May 19, 2016

sontin said:
Fences are used to synchronize both queues. Without fences there wouldnt be a need for load balancing. :\

And that is your typical nonsense. We see a low to no improvement because Pascal is nearly running at 100%. There is no room left to execute something else on the GPU.

Preemption has nothing to do with Async Compute.

That is the thing that you and a few others here seem either unable or unwilling to accept.

With modern complex programs, it might be true to say, that you always have room to execute something else on the processor, except for brief periods of full usage. Thus true Async Compute is beneficial.

Of course, admitting this would also mean you have to admit that Nvidia has a deficiency in their GPUs. Impossible, thus the denial.

Pottuvoi · May 19, 2016

Flapdrol1337 said:
GP100 can split it's fp32 units in 2x fp16, 104 can't. It is quite different.

Has this been confirmed?

Would have thought that the FP16 performance would have been one of the big improvements when compared to Maxwell.

Mahigan · May 19, 2016

Oh and its not necessarily preemption that is involved but a context switch. Preemption is based on allowing for smooth context switching without incurring a performance loss from latency involved in the switch.

Asynchronous Compute has nothing to do with context switching. Asynchronous compute + graphics, however, makes use of a context switch. This is because you're switching between a graphics context and a compute context.

With Maxwell, the SM required a flush prior to a switch. With Pascal, this is no longer the case. Pascal can switch contexts at a finer grained level, instruction level, rather than a coarse grained level, thread block level.

Mahigan · May 19, 2016

I'll add more...

What is preemption used for and why is a context switch involved?

Preemption is used to pause/stop currently executing work in order to push through a higher priority workload.

When is preemption used? VR and more specifically when an Asynchronous Time Warp is required. Why an Asynchronous time warp? Imagine you're playing a game, in VR, and you move your head. This movement of your head changes your viewing angle. What an asynchronous time warp does is that it takes the last rendered frame and warps it slightly in order to take the new viewing angle into account. An asynchronous time warp uses a compute shader in order to warp the last rendered frame.

So your GPU is executing a graphics task and you move your head, that graphics task is paused, an asynchronous time warp compute shader is executed at a higher priority and then the graphics task execution continues.

So what happened was that you switched from a graphics context to a compute context and back to a graphics context. In other words, context switching.

Preemption makes use of context switching.

Some architectures, like Maxwell, suffer from slow context switching. What that means is this:

Graphics task is executing and you move your head. The application waits on the graphics task to finish execution (thread block), the SM then has to be flushed and then the compute task is pushed through. So you end up with a longer execution time (frame latency) which affects your FPS because you have to wait for the draw call to complete before you can preempt it with a compute task.

So switching between compute and graphics tasks, even if they're executed one after the other, incurs latency if your architecture suffers from slow context switching.

Pascal rectifies this.
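
The difference between the two preemption granularities described above can be sketched as a simple latency model. This follows the description in this post only; the function, its parameters, and all the numbers are hypothetical and are not measurements of Maxwell or Pascal.

```python
def time_to_start_high_priority(remaining_block_ms, flush_ms, switch_ms, fine_grained):
    """Delay before a high-priority compute task (e.g. a time warp) can start.

    Coarse-grained: wait for the in-flight graphics work to finish, flush,
    then switch. Fine-grained: interrupt immediately and only pay the switch.
    """
    if fine_grained:
        return switch_ms
    return remaining_block_ms + flush_ms + switch_ms


if __name__ == "__main__":
    # Hypothetical numbers: 4 ms of draw call left, 1 ms flush, 0.1 ms switch.
    coarse = time_to_start_high_priority(4.0, 1.0, 0.1, fine_grained=False)
    fine = time_to_start_high_priority(4.0, 1.0, 0.1, fine_grained=True)
    print(f"coarse-grained preemption: {coarse:.1f} ms before the warp starts")
    print(f"fine-grained preemption:   {fine:.1f} ms before the warp starts")
```

In this model the coarse-grained delay grows with the length of whatever graphics work happens to be in flight, while the fine-grained delay stays constant.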

sontin · May 19, 2016

Mahigan said:
Preemption is not tied to Async Compute directly but comes into play when switching between Graphics and Compute loads. What Async Compute + Graphics does is execute both a Compute and a Graphics task in parallel. So a switch is involved if your architecture can't execute both tasks in parallel. One task will be executed before the other with a fence placed to synchronize both queues. So preemption is involved here.

There is no switch involved when you preempt the workload after another. This is the way nVidia is dealing with graphics and compute under DX12.

I'm fully aware of what fences are used for. I taught you.

The point of my comment was that absent fences, from the equation, both tasks would need to have an identical execution time for renderstates comment to be correct. The fence would be pointless in that case as synchronization wouldn't need enforcement.

Example:
Async Compute + Graphics working
Graphics 10ms
Compute 5ms
Fence placed
Total execution time = 10ms

Async Compute + Graphics not working
Graphics 10ms
Compute 5ms
Fence placed
Total execution time = 15ms

Fences are a synchronisation point for the queues. Without fences the driver and hardware can schedule new workload immediately - so no need for load balancing.
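
What a fence does between two queues can be sketched on the CPU with two threads and an event. This is only an analogy for the signal/wait pattern, not GPU code and not any driver's implementation; the thread functions and log messages are invented for illustration.

```python
import threading


def run_with_fence():
    """Run a 'graphics' and a 'compute' queue that synchronize on one fence."""
    log = []
    lock = threading.Lock()
    fence = threading.Event()  # stands in for a GPU fence

    def record(message):
        with lock:
            log.append(message)

    def compute_queue():
        record("compute: produce result")
        fence.set()  # signal: the result is ready

    def graphics_queue():
        record("graphics: independent work")
        fence.wait()  # block until the compute queue has signalled
        record("graphics: consume compute result")

    threads = [threading.Thread(target=graphics_queue),
               threading.Thread(target=compute_queue)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    for line in run_with_fence():
        print(line)
```

The order of the first two log lines can vary between runs, but the consume step can never appear before the produce step, which is the only ordering the fence enforces.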

Understand?

Maxwell loses performance with Async Compute. Pascal would be losing the same amount of performance. Pascal gains performance in lower resolution or with less workload. This would be impossible if you were right.

NTMBK · May 19, 2016

Oh god, I didn't mean for this to turn into yet another Async Compute argument D: I just found it interesting that NVidia has made such a different design for their gaming card vs. their compute card.
