NVIDIA Pascal Thread


jpiniero

Lifer
Oct 1, 2010
15,176
5,717
136
The only reason there's no GP102 for gaming is that yields of a bigger ~450mm2 chip will be worse.

Any working GP100 chip is going to need to be multiple K. It's just unrealistic to think they could get many gamers to buy into that. High-end Quadros, yes, and I'm sure they could get Tesla buyers with fewer cores too. Yields at TSMC are obviously better than you think, otherwise you wouldn't be seeing a 610 mm2 die at this point.

Even a theoretical 450 mm2 GP102 would still be crazy expensive, but you could probably sell full dies as mid-range Quadros and cut-down dies as $1099+ Titans.
 
Feb 19, 2009
10,457
10
76
@jpiniero
Not necessarily.

It's about the $ per wafer they can get. If a bunch of dies go to $12K Teslas, a bunch more go to $5K Quadros, the rest can be $1.5K GTX Titans. Overall they will make a lot per wafer that way. Tesla/Quadro, in effect, subsidizes the existence of a huge high-end gaming chip from NV, like it has for many generations.
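
A rough sketch of that $/wafer argument; the defect density, bin split and prices below are illustrative assumptions, not real TSMC/NVIDIA numbers:

```python
import math

WAFER_DIAMETER_MM = 300
DEFECT_DENSITY = 0.2  # defects per cm^2 -- assumed for illustration

def dies_per_wafer(die_area_mm2):
    """Gross dies per wafer, classic approximation (area minus edge loss)."""
    r = WAFER_DIAMETER_MM / 2
    return int(math.pi * r ** 2 / die_area_mm2
               - math.pi * WAFER_DIAMETER_MM / math.sqrt(2 * die_area_mm2))

def poisson_yield(die_area_mm2):
    """Poisson defect model: fraction of dies with zero defects."""
    return math.exp(-(die_area_mm2 / 100) * DEFECT_DENSITY)

for area in (610, 450, 300):
    good = dies_per_wafer(area) * poisson_yield(area)
    print(f"{area} mm2: ~{good:.0f} fully good dies per wafer")

# Hypothetical bin split for the 610 mm2 die: 30% Tesla, 30% Quadro,
# 40% Titan. Salvaged (partially defective) dies sold as cut-down parts
# would add to this, so it's a conservative floor.
good = dies_per_wafer(610) * poisson_yield(610)
revenue = good * (0.3 * 12_000 + 0.3 * 5_000 + 0.4 * 1_500)
print(f"~${revenue:,.0f} per wafer -- Tesla/Quadro carry the Titans")
```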
 

raghu78

Diamond Member
Aug 23, 2012
4,093
1,475
136
If Polaris 10 doesn't get GDDR5X, it's bottlenecked. Then a 2500-2800sp cut-down Vega 11 part is going to be much faster. Not to mention it could have a different TMU/ROP layout as well.

Dude, look back at AMD's true last-gen stack (without Fiji):

HD 7770 - 123 mm2 - 640 SP
HD 7870 - 212 mm2 - 1280 SP
HD 7970 - 360 mm2 - 2048 SP
R9 290X - 438 mm2 - 2816 SP

So whatever Polaris 10's SP count is, Vega 10 will double it easily.
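
For what it's worth, SP count per mm2 is roughly flat across that stack, which is what the "double the area, double the SPs" argument leans on:

```python
# SPs per mm2 across the stack listed above:
stack = [("HD 7770", 123, 640), ("HD 7870", 212, 1280),
         ("HD 7970", 360, 2048), ("R9 290X", 438, 2816)]
for name, mm2, sp in stack:
    print(f"{name}: {sp / mm2:.1f} SP/mm2")
# ~5.2 to ~6.4 SP/mm2: roughly linear, so ~2x the area buys ~2x the SPs.
```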
 

alcoholbob

Diamond Member
May 24, 2005
6,311
357
126
If GP104 really only has 2560 shaders, Polaris 10 might actually beat it.

I think if the GP100 launch is any indication, Nvidia may be trying to win this generation with clockspeed. It's 2560 cores, but if the base clock is 1.5 or 1.6GHz you are looking at 60-65% faster than a stock 980, and about as fast as an overclocked 1.3-1.4GHz 980 Ti.
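
A back-of-envelope version of that estimate, assuming throughput scales linearly with cores x clock (a best case; real games scale worse) and taking the 980's stock boost as ~1216MHz:

```python
cores_980, boost_980 = 2048, 1.216  # GTX 980 stock boost clock, GHz
cores_gp104 = 2560                  # rumoured GP104 core count

for clock in (1.5, 1.6):
    gain = cores_gp104 * clock / (cores_980 * boost_980) - 1
    print(f"GP104 @ {clock:.1f} GHz: ~{gain:.0%} over a stock 980")
# ~54% and ~64% -- in the ballpark of the 60-65% guess above.
```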
 

Head1985

Golden Member
Jul 8, 2014
1,867
699
136
The 980 Ti is only ~22% faster than the GTX 980. If GP104 is 65% faster than the GTX 980 then it will be ~35% faster than the 980 Ti.
I really don't know why everyone overestimates 980 Ti performance. It's really not some miracle card.
It's just 20-23% faster than the GTX 980.

Btw if it's ~35% faster than the GTX 980 Ti then the 980 Ti can't match it even at 1500MHz.
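
For the record, the ratios divide rather than subtract; a minimal check:

```python
gp104_vs_980 = 1.65  # hypothetical GP104 vs GTX 980, from the post above
ti_vs_980 = 1.22     # 980 Ti vs GTX 980

print(f"GP104 vs 980 Ti: ~{gp104_vs_980 / ti_vs_980 - 1:.0%}")
# ~35%: 1.65 / 1.22, not 65% - 22%.
```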

Btw I also think Polaris 10 = GP106.
GP104 is a 300mm2 SKU; Polaris can't match it with 232mm2, but it will probably beat the 200mm2 GP106/1060.
 

Mondozei

Golden Member
Jul 7, 2013
1,043
41
86
But you just wait, when Pascal GTX debuts, the VR hype from NV is gonna be all aboard that latency train.


Nvidia will hype VR regardless of what happens, simply because it's the new hot thing in gaming. That in and of itself doesn't prove anything, even if it needs to be said that I basically agree with your premise about the importance of preemption for VR in general.

I think if the GP100 launch is any indication, Nvidia may be trying to win this generation with clockspeed. It's 2560 cores, but if the base clock is 1.5 or 1.6GHz you are looking at 60-65% faster than a stock 980, and about as fast as an overclocked 1.3-1.4GHz 980 Ti.

Yep.
 

extide

Senior member
Nov 18, 2009
261
64
101
www.teraknor.net
The 980 Ti is only ~22% faster than the GTX 980. If GP104 is 65% faster than the GTX 980 then it will be ~35% faster than the 980 Ti.
I really don't know why everyone overestimates 980 Ti performance. It's really not some miracle card.
It's just 20-23% faster than the GTX 980.

Btw if it's ~35% faster than the GTX 980 Ti then the 980 Ti can't match it even at 1500MHz.

Btw I also think Polaris 10 = GP106.
GP104 is a 300mm2 SKU; Polaris can't match it with 232mm2, but it will probably beat the 200mm2 GP106/1060.

We don't REALLY know that Polaris 10 is 232mm^2. I would say that Polaris 10 is actually NOT the chip you are talking about; the one on the LinkedIn profile is something that will never come to retail, especially since it was posted publicly. That's my opinion on the matter.
 

coercitiv

Diamond Member
Jan 24, 2014
6,631
14,069
136
What Pascal chip goes into the Drive PX2?

I'm asking because it occurs to me that not only is there not going to be a gaming-focused big chip this generation, it may just be that all the chips in the Pascal line are HPC-focused this time around.
 

USER8000

Golden Member
Jun 23, 2012
1,542
780
136
We don't REALLY know that Polaris 10 is 232mm^2. I would say that Polaris 10 is actually NOT the chip you are talking about; the one on the LinkedIn profile is something that will never come to retail, especially since it was posted publicly. That's my opinion on the matter.

Plus it is a denser process AMD is using too.
 

Adored

Senior member
Mar 24, 2016
256
1
16
Plus it is a denser process AMD is using too.

Plus everybody keeps forgetting about the primitive discard accelerator which ought to help with memory bandwidth as well. I think a 232mm2 Polaris can be close to a 294mm2 Pascal.
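
A toy model of where the discard savings would come from; the fractions are assumptions for illustration, not measured numbers:

```python
tris = 1_000_000
backfacing = 0.40      # a closed mesh faces roughly half its triangles away
zero_coverage = 0.10   # tiny triangles that hit no sample -- assumed fraction

survivors = tris * (1 - backfacing) * (1 - zero_coverage)
print(f"rasterized: {survivors:,.0f} of {tris:,} ({survivors / tris:.0%})")
# Everything discarded up front never costs raster, shading or bandwidth.
```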
 

antihelten

Golden Member
Feb 2, 2012
1,764
274
126
So you can tell VR gamers not to move their heads until a graphics draw call completes?

Don't be silly man.

Async Timewarp needs to fire as soon as people move their heads; that's the entire point of it. You cannot make people time their movements to fall in line and prevent stalls of Async Timewarp. -_-

But you just wait, when Pascal GTX debuts, the VR hype from NV is gonna be all aboard that latency train. We can come back and discuss how you are wrong, again.

Async timewarp does not need to fire as soon as people move their heads (people are constantly moving their heads, and that movement is measured at 1000Hz by the IMUs). Async timewarp needs to fire at the last possible moment of the rendering pipeline, but still early enough that it can be applied in time for the next frame refresh.

And as Sontin said, Async Timewarp and Async Compute have essentially nothing to do with each other. An Async Timewarp could in theory be performed with Async Compute (and thus be allowed to run alongside the graphics rendering), or you can do it the way Nvidia is doing it here, by preempting, which basically just means that it gets inserted into the rendering pipeline at the most opportune time (instead of having to wait for another draw call to finish first).
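
A small sketch of that "last possible moment" scheduling; the refresh rate, warp cost and margin are assumed numbers:

```python
refresh_hz = 90               # e.g. Rift/Vive panels
frame_ms = 1000 / refresh_hz  # ~11.1 ms between refreshes
warp_cost_ms = 2.0            # assumed GPU time for the warp pass
margin_ms = 1.0               # assumed scheduling safety margin

latest_start_ms = frame_ms - warp_cost_ms - margin_ms
print(f"kick the warp ~{latest_start_ms:.1f} ms into the {frame_ms:.1f} ms frame,")
print("using the freshest ~1000 Hz IMU sample available at that moment")
```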
 
Feb 19, 2009
10,457
10
76
Plus everybody keeps forgetting about the primitive discard accelerator which ought to help with memory bandwidth as well. I think a 232mm2 Polaris can be close to a 294mm2 Pascal.

Depends on how the new GCN turns out, but they have actually got Hyper-Threading for SPs. For REAL!

http://forums.anandtech.com/showpost.php?p=38154409&postcount=19

^ There's a patent paper there for next-gen GCN. Take some time to read it, it's mind blowing stuff.

On paper, there's potential for 4x the throughput for each SP. I suspect that's under a perfect scenario, but still, 1x to 2x per-SP performance (game-load dependent) vs the older GCN SP is on the table.

Polaris GCN has gone wide, with each SP being able to run multiple threads in parallel, a feat that's pretty crazy when you realize the amount of synchronization it requires to keep the hardware scheduler aware of each ALU's uptime, so the warp scheduler can keep it busy.

There's also per-SP power gating and clock boost, so if an SP is only running one thread, it will auto-boost to finish the task quicker.

Insane changes TBH, more than I expected.
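
An illustrative toy model of the latency-hiding idea (classic SMT reasoning, not a claim about the patent's actual mechanism):

```python
def alu_busy(n_threads, compute=1, stall=3, cycles=1000):
    """One issue slot per cycle; each thread computes 1 cycle, stalls 3."""
    ready_at = [0] * n_threads  # cycle at which each thread can issue again
    busy = 0
    for cycle in range(cycles):
        for t in range(n_threads):
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + compute + stall
                busy += 1
                break  # only one thread can issue per cycle
    return busy / cycles

for n in (1, 2, 4):
    print(f"{n} thread(s) per SP: ALU busy {alu_busy(n):.0%}")
# 1 thread: ~25%. 2 threads: ~50%. 4 threads hide the stall entirely,
# which lines up with the "up to 4x, realistically 1-2x" reading above.
```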
 
Feb 19, 2009
10,457
10
76
you can do it the way Nvidia is doing it here, by preempting, which basically just means that it gets inserted into the rendering pipeline at the most opportune time (instead of having to wait for another draw call to finish first).

Which wasn't possible on Maxwell and older, because it has to wait for graphics to finish first before being able to context switch the pipeline to handle compute.



With Pascal, that's an instantaneous change very much like GCN.

It works best if the compute timewarp can run in parallel, as Async Compute.

It works okay if the timewarp can go on a priority context and there's no delay for context switching. This is Pascal's uarch change, based on that article.

It works the least well the current way.

This is why NV says Maxwell at BEST is only capable of 25ms motion-to-photon latency via async timewarp. Still above the recommended 20ms.

http://www.geforce.com/whats-new/ar...us-the-only-choice-for-virtual-reality-gaming

The standard VR pipeline from input in (when you move your head) to photons out (when you see the action occur in-game) is about 57 milliseconds (ms). However, for a good VR experience, this latency should be under 20ms.

Combined, and with the addition of further NVIDIA-developer tweaks, the VR pipeline is now only 25ms.

With this change in Pascal, they will get below that 20ms mark and there will be a lot of hoorah!

Why do I say this is "basic" Async Compute?

Because currently NV GPUs actually take a performance hit when AC is run. This is again because of their slow context switch. It causes stalls in the pipeline, wasting time where no work can be done.

The change in Pascal means even if they cannot run graphics + compute in parallel, devs calling for Async Compute, or even general games that use a lot of compute, won't cause stalls. In theory, it should behave like GCN where the graphics/compute context switch is fast.

Short of actually having multi-engine hardware like ACEs, this is a good fix by NV to add to Pascal, as it resolves their weakness of performance regression with AC, and of poor VR preemption due to stalls from slow context switches.



^ In the first case, preemption: think of the slow context switch as adding idle time where the shaders cannot run while the pipeline is switching between graphics and compute. This is the problem with NV's current uarch, as pointed out by AMD's Robert Hallock when the Async Compute debacle started. This leads to NV losing performance when AC is used.

Basically, with Pascal it's much faster: what they call "fine-grained preemption" is more flexible and can happen at any time. And compute on a priority context can even override the current graphics task if needed.
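
A toy timeline of the difference; the draw-call lengths and switch costs are made up for illustration:

```python
draw_calls_ms = [0.3, 4.0, 1.2, 0.5]  # queued draw calls, made-up lengths
ctx_switch_ms = 0.5                   # assumed cost of the heavyweight switch

# Draw-call-boundary preemption (Maxwell-style): worst case, the warp
# request lands just as the longest call starts and must wait it out.
boundary_worst = max(draw_calls_ms) + ctx_switch_ms
# Fine-grained preemption (Pascal-style, per the article): stop within
# roughly an instruction/pixel -- modelled here as a small fixed cost.
fine_grained_ms = 0.1

print(f"boundary preemption, worst case: {boundary_worst:.1f} ms")
print(f"fine-grained preemption:        ~{fine_grained_ms:.1f} ms")
# The gap is added motion-to-photon latency, exactly what VR can't afford.
```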
 

antihelten

Golden Member
Feb 2, 2012
1,764
274
126
Which wasn't possible on Maxwell and older, because it has to wait for graphics to finish first before being able to context switch the pipeline to handle compute.



With Pascal, that's an instantaneous change very much like GCN.

It works best if the compute timewarp can run in parallel, as Async Compute.

It works okay if the timewarp can go on a priority context and there's no delay for context switching. This is Pascal's uarch change, based on that article.

It works the least well the current way.

Obviously this is an improvement over Maxwell, but that still doesn't make it Async compute.

Why do I say this is "basic" Async Compute?

Because currently NV GPUs actually take a performance hit when AC is run. This is again because of their slow context switch. It causes stalls in the pipeline, wasting time where no work can be done.

The change in Pascal means even if they cannot run graphics + compute in parallel, devs calling for Async Compute, or even general games that use a lot of compute, won't cause stalls. In theory, it should behave like GCN where the graphics/compute context switch is fast.

Just because Pascal no longer incurs a performance penalty from async compute doesn't mean that it supports async compute, not even a "basic" async compute. The whole point of async compute is to improve performance, not simply maintain a status quo in performance.

Pascal no longer incurring a performance penalty simply means that they have fixed their "workaround" (preemption) for not having async compute support. Of course, instruction-level preemption will likely be useful in many other areas as well, so it's not just a fix for the lack of async support.
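
That distinction in one back-of-envelope calc, with all the durations assumed:

```python
G, C = 12.0, 3.0  # per-frame graphics and compute work, ms (assumed)
stall = 1.5       # assumed loss from heavyweight context switches

with_stalls = G + C + stall   # Maxwell-style behaviour
status_quo = G + C            # penalty fixed, still serial
overlapped = max(G, C) + 1.0  # true async; +1.0 ms assumed contention

print(f"with stalls: {with_stalls} ms | stalls fixed: {status_quo} ms "
      f"| real overlap: {overlapped} ms")
# Fixing preemption recovers 16.5 -> 15 ms; actual async compute is
# what buys the 15 -> 13 ms.
```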
 

sontin

Diamond Member
Sep 12, 2011
3,273
149
106
Which wasn't possible on Maxwell and older, because it has to wait for graphics to finish first before being able to context switch the pipeline to handle compute.

Nonsense. They must wait until a draw call is finished to run the Async Timewarp workload. There is no context switch involved.

This is why NV say Maxwell at the BEST is only capable of 25ms motion to photon latency via async timewarp. Still above the 20ms recommended.

http://www.geforce.com/whats-new/ar...us-the-only-choice-for-virtual-reality-gaming
The 13ms comes from the 75Hz display. :\
Did you even read the article?!

Basically with Pascal, it's much faster, what they call "fine-grained preemption", more flexible, anytime. And compute on priority can even override the current graphics task if that is needed.
This is possible today, too - after a draw call.
The performance penalty happens because of wrongly scheduled compute queues.
 
Feb 19, 2009
10,457
10
76
@sontin

Read and learn man. You come up with such random stuff that goes against what even NVIDIA tells developers about what their hardware is capable of.

https://developer.nvidia.com/sites/...works/vr/GameWorks_VR_2015_Final_handouts.pdf



^ NV claims it supports priority context...



^ They even say their priority context takes over the whole GPU and preempts whatever it's working on to switch to the new task... LOL



^ Except it can't. It's not actually a priority context at all. It gets stuck in traffic like everything else.

It's the same fault as when they claimed their hardware supports Async Compute... except it can't.

Pascal changes this; it's a major change for them. Pascal has real priority preemption and instant context switching between graphics and compute rendering.

Celebrate it instead of pretending Maxwell can do it when NV clearly says it cannot.

Need I remind you, you were so insistent that Maxwell also supports Async Compute. There were lots of threads where I tried to educate you otherwise, but nope, you were determined... and wrong.
 
Feb 19, 2009
10,457
10
76
Just because Pascal no longer incurs a performance penalty from async compute doesn't mean that it supports async compute, not even a "basic" async compute. The whole point of async compute is to improve performance, not simply maintain a status quo in performance.

Pascal no longer incurring a performance penalty simply means that they have fixed their "workaround" (preemption) for not having async compute support. Of course, instruction-level preemption will likely be useful in many other areas as well, so it's not just a fix for the lack of async support.

In many ways it's similar to GCN 1.0: it doesn't gain much performance from Async Compute, but it doesn't regress in performance either. The outcome is that Pascal will do better for NV in DX12 games that use AC, and much better in VR.
 

sontin

Diamond Member
Sep 12, 2011
3,273
149
106
@sontin

Read and learn man. You come up with such random stuff that goes against what even NVIDIA tells developers about what their hardware is capable of.

And in none of these slides is it mentioned that preemption has a penalty.

You are the one who doesn't care what is said and is making up fanfiction. Stop it. :thumbsdown:
 

USER8000

Golden Member
Jun 23, 2012
1,542
780
136
@sontin

Read and learn man. You come up with such random stuff that goes against what even NVIDIA tells developers about what their hardware is capable of.

https://developer.nvidia.com/sites/...works/vr/GameWorks_VR_2015_Final_handouts.pdf



^ NV claims it supports priority context...



^ They even say their priority context takes over the whole GPU and preempts whatever it's working on to switch to the new task... LOL



^ Except it can't. It's not actually a priority context at all. It gets stuck in traffic like everything else.

It's the same fault as when they claimed their hardware supports Async Compute... except it can't.

Pascal changes this; it's a major change for them. Pascal has real priority preemption and instant context switching between graphics and compute rendering.

Celebrate it instead of pretending Maxwell can do it when NV clearly says it cannot.

Need I remind you, you were so insistent that Maxwell also supports Async Compute. There were lots of threads where I tried to educate you otherwise, but nope, you were determined... and wrong.

And in none of these slides is it mentioned that preemption has a penalty.

You are the one who doesn't care what is said and is making up fanfiction. Stop it. :thumbsdown:

Nvidia is making up fanfiction?? What??
 
Feb 19, 2009
10,457
10
76
Nvidia is making up fanfiction?? What??

Heh. They even tell developers not to frequently mix graphics and compute in queues because their context switch is a costly one. Read the dev handout.

http://wccftech.com/nvidia-devs-computegraphics-toggle-heavyweight-switch/

On this topic, I have to commend zlatan; he has clearly known what he is talking about ever since he graced this forum.

Outside of us forum warriors, he's probably one of the very few actual developers with deep experience.

Read what he says, everything turned out true.



There are earlier posts where he said Maxwell could not do Async Compute (as I did, from 2014 on).

He called it on Pascal having this fine-grained preemption ability.
 

sontin

Diamond Member
Sep 12, 2011
3,273
149
106
You can only have a context switch within a queue. Async Compute uses different queues, so there is no context switch involved. The performance penalty happens because of the barriers and fences used to synchronise these queues. If you can't let them run in parallel, every architecture will see a performance regression.
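
Schematically, with assumed durations:

```python
gfx_ms, compute_ms = 10.0, 4.0  # assumed per-frame queue workloads

# Fence makes compute wait on the *end* of the graphics queue: serial.
badly_fenced = gfx_ms + compute_ms
# Fence only guards the one resource written early on: queues overlap.
well_fenced = max(gfx_ms, compute_ms)

print(f"badly placed fence: {badly_fenced} ms/frame")
print(f"well placed fence:  {well_fenced} ms/frame")
# Same queues, same hardware; where the sync points sit decides whether
# "async" is a gain or a regression.
```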
 

jpiniero

Lifer
Oct 1, 2010
15,176
5,717
136
It's about the $ per wafer they can get. If a bunch of dies go to $12K Teslas, a bunch more go to $5K Quadros, the rest can be $1.5K GTX Titans. Overall they will make a lot per wafer that way. Tesla/Quadro, in effect, subsidizes the existence of a huge high-end gaming chip from NV, like it has for many generations.

There's only so far you can cut, though, before it starts to not make sense.

The P100 is 56 SMs. I bet they could sell a cheaper 50-52 SM Tesla and then 45-50 SM Quadros sometime in early 2017.

Theoretically the full GP104 is 40? (2560). I guess doing 2560 cores with high clock speeds shouldn't be a surprise, but at 300 mm2 it's way too big for the anticipated price range. This node is going to be all about transistor usage efficiency, and it looks bad for nVidia right now. Maybe that's why Volta has also shown up on the radar so soon: nVidia knows that Pascal is basically Fermi 2.0.
 

nvgpu

Senior member
Sep 12, 2014
629
202
81
http://www.anandtech.com/show/7166/nvidia-announces-quadro-k6000

Quadro K6000 launched with full 2880 CUDA cores enabled.

http://www.anandtech.com/show/9096/nvidia-announces-quadro-m6000-quadro-vca-2015

http://www.anandtech.com/show/10179/nvidia-announces-24gb-quadro-m6000

Quadro M6000 launched with full 3072 CUDA cores enabled.

The Quadro P6000(?) will launch with all 3840 CUDA cores enabled and hopefully with 8-Hi HBM2 stacks for 32GB of RAM, mass production permitting, since you don't want to ship a 16GB Quadro flagship a year after you shipped the 24GB Quadro M6000.
 

Kris194

Member
Mar 16, 2016
112
0
0
There's only so far you can cut, though, before it starts to not make sense.

The P100 is 56 SMs. I bet they could sell a cheaper 50-52 SM Tesla and then 45-50 SM Quadros sometime in early 2017.

Theoretically the full GP104 is 40? (2560). I guess doing 2560 cores with high clock speeds shouldn't be a surprise, but at 300 mm2 it's way too big for the anticipated price range. This node is going to be all about transistor usage efficiency, and it looks bad for nVidia right now. Maybe that's why Volta has also shown up on the radar so soon: nVidia knows that Pascal is basically Fermi 2.0.

Two years later is soon? How? And how is 2560 CC too big when the GTX 980 on the 28nm node has 2048 CC?
 