NVIDIA Pascal Thread

NTMBK · Apr 6, 2016

antihelten said:
Actually Maxwell was perfectly capable of this as well, it just wasn't enabled in the desktop cards, only in mobile SoCs (i.e. Tegra X1):

Desktop Maxwell != mobile Maxwell. For a start, mobile Maxwell was 20nm! It was also compute capability 5.3 (the only part which was): http://docs.nvidia.com/cuda/cuda-c-programming-guide/#compute-capabilities

The whole history of Maxwell and Pascal is a bit weird. Pascal is a bit of a change from Maxwell though, with SMs only having 64 shaders each. Each GPC has the same number of FP32 shaders (640), but with 10 SMs instead of 5. More registers and shared memory per shader, which will help with complex HPC kernels.

NTMBK · Apr 6, 2016

Silverforce11 said:
GP104 could well be GP100 minus all the FP64 cores that aren't needed. That could put it down to ~400mm2?

Wouldn't be the first time they did it- they've been doing that ever since the GTX 460. In early Kepler days, they had 2xGK104 Teslas for FP32 workloads, and GK110 for FP64. Then in Maxwell days they had GM200 for FP32, and GK210 for FP64.

Here's my GP104 guess:

-half the FP32 shaders of GP100 (3 GPCs, 30 SMs, 1920 FP32 shaders)
-half the memory bandwidth- 2xHBM2 stacks, 8GB memory
-reduced FP64 ratio- 1/4 or 1/8
-no NVLink connectors, or reduced to 1/2 links
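
As a quick sanity check on that guess, here is a minimal Python sketch of the arithmetic. It assumes the GP100 layout discussed in this thread (6 GPCs, 10 SMs per GPC, 64 FP32 shaders per SM) plus 4 HBM2 stacks of 4GB each; the dictionary keys and function names are made up for illustration.

```python
# Assumed GP100 layout, as discussed in this thread (not an official spec sheet).
GP100 = {"gpcs": 6, "sms_per_gpc": 10, "fp32_per_sm": 64,
         "hbm2_stacks": 4, "gb_per_stack": 4}

def halve(chip):
    """Return a hypothetical chip with half the GPCs and half the HBM2 stacks."""
    half = dict(chip)
    half["gpcs"] = chip["gpcs"] // 2
    half["hbm2_stacks"] = chip["hbm2_stacks"] // 2
    return half

def summary(chip):
    """Derive SM count, FP32 shader count and memory size from the layout."""
    sms = chip["gpcs"] * chip["sms_per_gpc"]
    return {
        "sms": sms,
        "fp32": sms * chip["fp32_per_sm"],
        "memory_gb": chip["hbm2_stacks"] * chip["gb_per_stack"],
    }

gp104_guess = summary(halve(GP100))
print(gp104_guess)
```

Halving the GPC count reproduces the 30 SM / 1920 shader / 8GB figures in the list.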

antihelten · Apr 6, 2016

NTMBK said:
Desktop Maxwell != mobile Maxwell. For a start, mobile Maxwell was 20nm! It was also compute capability 5.3 (the only part which was): http://docs.nvidia.com/cuda/cuda-c-programming-guide/#compute-capabilities

The whole history of Maxwell and Pascal is a bit weird. Pascal is a bit of a change from Maxwell though, with SMs only having 64 shaders each. Each GPC has the same number of FP32 shaders (640), but with 10 SMs instead of 5. More registers and shared memory per shader, which will help with complex HPC kernels.

I never said desktop Maxwell was the same as mobile Maxwell (the fact that they have different feature sets should make this obvious), however they are still both Maxwell, just like Maxwell 1 and Maxwell 2 can also both be said to be Maxwell. In other words the name Maxwell is a bit of an umbrella term that covers several different iterations of the same architecture, with at least one of these iterations being capable of double speed FP16 operations, like what is being advertised for Pascal.

Also I don't see what the process node has to do with anything, since that doesn't have any influence on the feature set of the architecture.

Also I would be careful about making any conclusions about how Pascal differs from Maxwell based upon GP100, since I think there's a fairly high chance that the makeup of the Pascal SMs will look significantly different in the gaming focused GP106/104.

Mahigan · Apr 6, 2016

Anyone notice something? P100 is more GCN-like. 4 Texture Mapping Units per SM, 64 SIMD cores per SM. NVIDIA have delivered a VERY similar design to GCN organization wise.

The lower CUDA cores per SM means that P100 won't run into this issue:

Whereas increased parallelism past 16 concurrent Warps maxed out the available local caches and began to spill into L2 cache.

This makes P100 a more parallel architecture than Maxwell, but it is essentially Maxwell tweaked to look more GCN-like. I'm now interested to see what Vega brings to the table.

ROp wise, P100 looks to have 128 ROps. If we look at GM107:

We see a ROp to memory controller ratio of 8:1. Each 8 ROps had access to 1MB of L2 cache and its own 64-bit memory controller.

GM200/204 changed this to a ratio of 16:1. Each 16 ROps had access to 512KB of L2 cache and its own 64-bit memory controller. This solution had issues memory bandwidth wise, which is why NVIDIA used ample amounts of color compression. Even with no other work but ROp work, GM200/204 could not hit its theoretical ROp throughput. This makes it highly unlikely that NVIDIA would increase the ROp ratio further with Pascal. Notice a loss of 10GPixels/s for each GM200/204 card when compared to theoretical performance.

There are 8x 512-bit memory controllers on P100 which means you can pair 8 groups of ROps. If we assume 16:1 ratio that's 128 ROps.

There's 4MB of L2 cache on P100, if we assume 512KB per 16 ROps that's 8x16 for 128 ROps.
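
Both estimates can be written out as a rough Python sketch. The 16:1 ROp ratio and the 512KB of L2 per ROp group are the assumptions stated above, carried over from GM200/204; nothing here is a confirmed P100 spec, and the function names are made up.

```python
def rops_from_controllers(num_controllers, rops_per_controller):
    """ROp count if every memory controller is paired with one ROp group."""
    return num_controllers * rops_per_controller

def rops_from_l2(l2_total_kb, l2_kb_per_group, rops_per_group):
    """ROp count if every L2 slice of the given size serves one ROp group."""
    groups = l2_total_kb // l2_kb_per_group
    return groups * rops_per_group

# Assumed inputs: 8 controllers, 4MB of L2, GM200/204-style 16 ROps per group.
by_controllers = rops_from_controllers(8, 16)
by_l2 = rops_from_l2(4 * 1024, 512, 16)
print(by_controllers, by_l2)
```

Both routes land on 128, which is only as good as the 16:1 assumption behind them.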

My take at least.

Silverforce11 · Apr 6, 2016

Mahigan said:
Anyone notice something? P100 is more GCN-like. 4 Texture Mapping Units per SM, 64 SIMD cores per SM. NVIDIA have delivered a VERY similar design to GCN organization wise.

Yup, I posted about it several times in this thread earlier.

Computerbase.de also noticed it, other tech sites did not mention the fact. Those clever Germans.

Their max warp size is 64, it's a perfect match for 64 CC per SM. Bang, one warp and it reaches 100% peak utilization.

It just so happens that GCN works best with a 64 wavefront/warp, and so console optimized engines will benefit Pascal as a consequence. Win-win scenario here for NV.

All they need is a multi-engine design and they are set for the console/DX12/Vulkan era, nullifying AMD's "long term chess game".

Head1985 · Apr 6, 2016

If they remove the DP units (not just disable them) on GP104 and GP106 and keep the same pattern as Maxwell, 6 down to 2 GPCs, we will have:

GP100: 6x GPC, 3840 SP, 60 SMX
GP104: 4x GPC, 2560 SP, 40 SMX
GP106: 2x GPC, 1280 SP, 20 SMX

Maxwell was the same, 6 down to 2 GPCs:
GM200: 6 GPC, 3072 SP
GM204: 4 GPC, 2048 SP
GM206: 2 GPC, 1024 SP
They basically add only 25% more SP.

GP104 and GP106 will be very small SKUs. GP104 will be around 230-250mm2 and GP106 120-150mm2.
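
A small Python sketch of that scaling, assuming every Pascal GPC keeps GP100's 10 SMs of 64 FP32 cores each (which is exactly the open question in this thread). The Maxwell SP counts are the ones listed above; the names are made up.

```python
SMS_PER_GPC = 10   # assumption: same as GP100
FP32_PER_SM = 64   # assumption: same as GP100

# (Pascal chip, GPC count, Maxwell counterpart, Maxwell SP count)
PAIRS = [
    ("GP100", 6, "GM200", 3072),
    ("GP104", 4, "GM204", 2048),
    ("GP106", 2, "GM206", 1024),
]

def pascal_sp(gpcs):
    """FP32 shader count for a hypothetical Pascal chip with this many GPCs."""
    return gpcs * SMS_PER_GPC * FP32_PER_SM

for pascal, gpcs, maxwell, maxwell_sp in PAIRS:
    sp = pascal_sp(gpcs)
    gain = sp / maxwell_sp - 1
    print(f"{pascal}: {gpcs * SMS_PER_GPC} SMs, {sp} SP, {gain:+.0%} vs {maxwell}")
```

Every tier comes out 25% above its Maxwell counterpart under these assumptions.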

Mahigan · Apr 6, 2016

Silverforce11 said:
Yup, I posted about it several times in this thread earlier.

Computerbase.de also noticed it, other tech sites did not mention the fact. Those clever Germans.

Their max warp size is 64, it's a perfect match for 64 CC per SM. Bang, one warp and it reaches 100% peak utilization.

It just so happens that GCN works best with a 64 wavefront/warp, and so console optimized engines will benefit Pascal as a consequence. Win-win scenario here for NV.

All they need is a multi-engine design and they are set for the console/DX12/Vulkan era, nullifying AMD's "long term chess game".

They can increase parallelism now seeing as the lower CUDA cores per SM means more local cache is available for concurrent Warps without spilling into L2 cache.

If we remember, Dan Baker said that Maxwell, in the early stages of developing AotS, could run Asynchronous Compute + Graphics, but that the result was an unmitigated disaster in terms of performance and conformance. This makes sense, seeing as anything past 16 concurrent warps results in local caches spilling into L2 cache, which is already strained by the ROps.

NVIDIA have since shut down Async Compute + Graphics in their driver. There was also a notable CPU cost to using the feature which is something I had originally mentioned in my very first article on the topic over at overclock.net.

So, in reality, I don't think Pascal will support Asynchronous Compute + Graphics due to the CPU costs associated with emulating it through the static scheduling nature of Kepler/Maxwell/Pascal.

antihelten · Apr 6, 2016

I still very much doubt we will see the same SM layout in GP106/104 as we're seeing in GP100.

If they leave in the FP64 units and simply disable them (to make sure that the Geforce lineup doesn't encroach on Quadro/Tesla), then they would end up with a horribly inefficient design from a core/mm2 perspective.

For instance you would need 48 Pascal SMs to match the core count of GM200. 48 SM would be 80% of what GP100 has, and thus you're looking at a 500mm2 16nm die simply to match GM200. Not going to happen imho.
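
Roughly, as a Python sketch. It assumes 64 FP32 cores per SM, 60 SMs on a full GP100, and die area scaling linearly with SM count (it won't exactly, since memory controllers and other uncore don't shrink with the SMs). The 610mm2 GP100 die size is NVIDIA's announced figure, not something stated in this post.

```python
import math

GM200_CORES = 3072
PASCAL_FP32_PER_SM = 64   # GP100-style SM
GP100_SMS = 60            # full die
GP100_DIE_MM2 = 610       # assumption: NVIDIA's announced GP100 die size

# SMs needed to match GM200's FP32 core count.
sms_needed = math.ceil(GM200_CORES / PASCAL_FP32_PER_SM)
# Share of a full GP100 that this represents.
fraction_of_gp100 = sms_needed / GP100_SMS
# Naive area estimate: assumes area is proportional to SM count.
est_die_mm2 = fraction_of_gp100 * GP100_DIE_MM2

print(sms_needed, f"{fraction_of_gp100:.0%}", f"~{est_die_mm2:.0f}mm2")
```

That gives 48 SMs, 80% of GP100, and a die just under 500mm2 on the naive scaling.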

More likely they will have removed the FP64 units completely from GP106/104, but seeing as that is a fairly drastic change, they might also change a number of other things along the way, such as returning to the same 128 cores per SM layout that Maxwell had.

Head1985 · Apr 6, 2016

antihelten said:
I still very much doubt we will see the same SM layout in GP106/104 as we're seeing in GP100.

If they leave in the FP64 units and simply disable them (to make sure that the Geforce lineup doesn't encroach on Quadro/Tesla), then they would end up with a horribly inefficient design from a core/mm2 perspective.

For instance you would need 48 Pascal SMs to match the core count of GM200. 48 SM would be 80% of what GP100 has, and thus you're looking at a 500mm2 16nm die simply to match GM200. Not going to happen imho.

More likely they will have removed the FP64 units completely from GP106/104, but seeing as that is a fairly drastic change, they might also change a number of other things along the way, such as returning to the same 128 cores per SM layout that Maxwell had.

http://forums.anandtech.com/showpost.php?p=38147270&postcount=1128

Btw, if the 1080 has 2560 SP:
GTX 980 vs 1080
+25% SP
+25% clock
We are at +50% above the 980 without any new architecture changes. Titan X is currently 35% faster than the GTX 980. If we add 10% from the new architecture, we are at 60% above the GTX 980.
Not bad for a 250mm2 SKU (if DP is disabled).

16nm FinFET with a HUGE clock boost helps a lot..

beginner99 · Apr 6, 2016

Silverforce11 said:
They could make the same chip without any FP64 CC and it'll be ~400mm2. Hmm.

Any more guesses for GP104?

Exactly. That's why even a GP102 makes sense to me without all the FP64 or much less of it. Or as you say this is just GP104. Name does not really matter.

Point is, I doubt we will see GP100 as a GPU, as a similarly powerful GPU could be made with a much smaller die size and lower power use. Do gaming GPUs ever need FP64? They could actually remove it completely.

Timmah! · Apr 6, 2016

Head1985 said:
If they remove the DP units (not just disable them) on GP104 and GP106 and keep the same pattern as Maxwell, 6 down to 2 GPCs, we will have:

GP100: 6x GPC, 3840 SP, 60 SMX
GP104: 4x GPC, 2560 SP, 40 SMX
GP106: 2x GPC, 1280 SP, 20 SMX

Maxwell was the same, 6 down to 2 GPCs:
GM200: 6 GPC, 3072 SP
GM204: 4 GPC, 2048 SP
GM206: 2 GPC, 1024 SP
They basically add only 25% more SP.

GP104 and GP106 will be very small SKUs. GP104 will be around 230-250mm2 and GP106 120-150mm2.

Yeah, not good enough. I don't need GP100 with FP64, so I can live with 104. But it has to have at least the same number of FP32 cores as GP100. Call such a chip 102 and sell it as Titan only from the start, I don't care. But I am not buying a chip with fewer cores than Titan X has, even if it's technically somewhat faster than Titan X because of higher clocks and some other IPC improvements.

Head1985 · Apr 6, 2016

Timmah! said:
Yeah, not good enough. I don't need GP100 with FP64, so I can live with 104. But it has to have at least the same number of FP32 cores as GP100. Call such a chip 102 and sell it as Titan only from the start, I don't care. But I am not buying a chip with fewer cores than Titan X has, even if it's technically somewhat faster than Titan X because of higher clocks and some other IPC improvements.

You forgot about the clock increase with 16nm FinFET.

If the 1080 has 2560 SP:
GTX 980 vs 1080
+25% SP
+25% clock
We are at +50% above the 980 without any new architecture changes. Titan X is currently 35% faster than the GTX 980. If we add 10% from the new architecture, we are at 60% above the GTX 980. That's 25% faster than Titan X and 30% above the 980 Ti. And if the new architecture brings even more than 10% IPC...
Not bad for a 250mm2 SKU (if DP is disabled).

16nm FinFET with a HUGE clock boost helps a lot..
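
Percentage gains compound rather than add, so here is a rough Python sketch of the estimate done both ways. The inputs (+25% SP, +25% clock, +10% IPC, Titan X at 1.35x a GTX 980) are the post's own guesses, and the sketch assumes performance scales linearly with each factor, which real games rarely do. The function names are made up.

```python
def additive(*gains):
    """Total speedup if the fractional gains are simply summed."""
    return 1.0 + sum(gains)

def compound(*gains):
    """Total speedup if each fractional gain multiplies the previous ones."""
    factor = 1.0
    for gain in gains:
        factor *= 1.0 + gain
    return factor

GAINS = (0.25, 0.25, 0.10)   # SP count, clock, IPC (guesses from the post)
TITAN_X_VS_980 = 1.35        # figure quoted in the post

for name, total in (("additive", additive(*GAINS)), ("compound", compound(*GAINS))):
    vs_titan = total / TITAN_X_VS_980
    print(f"{name}: {total:.2f}x GTX 980, {vs_titan:.2f}x Titan X")
```

With these inputs the summed total is 1.60x a GTX 980 and the compounded total is about 1.72x, i.e. roughly 1.19x and 1.27x a Titan X respectively.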

antihelten · Apr 6, 2016

Head1985 said:
http://forums.anandtech.com/showpost.php?p=38147270&postcount=1128

People have actually been commenting on this (gaming oriented pascal having a different layout from GP100, i.e 128 FP32 cores per SM instead of 64) several times in the thread so far:

http://forums.anandtech.com/showpost.php?p=38145828&postcount=931
http://forums.anandtech.com/showpost.php?p=38146592&postcount=1051
http://forums.anandtech.com/showpost.php?p=38147243&postcount=1125

I guess we're going around in circles at this point, since there's no real way to get any closer to the answer until more info on GP106/104 is released

Silverforce11 · Apr 6, 2016

Head1985 said:
If they remove the DP units (not just disable them) on GP104 and GP106 and keep the same pattern as Maxwell, 6 down to 2 GPCs, we will have:

GP100: 6x GPC, 3840 SP, 60 SMX
GP104: 4x GPC, 2560 SP, 40 SMX
GP106: 2x GPC, 1280 SP, 20 SMX

GP104 and GP106 will be very small SKUs. GP104 will be around 230-250mm2 and GP106 120-150mm2.

Either that, or the gaming-focused chips like GP104 will be big, with a lot of FP32 CCs. But I think that's about right: GP104 at ~300mm2, 2560 SP, 4x GPC/40 SMX.

With improved IPC and higher clocks, it'll beat the Titan X easy.

Head1985 · Apr 6, 2016

antihelten said:
People have actually been commenting on this (gaming oriented pascal having a different layout from GP100, i.e 128 FP32 cores per SM instead of 64) several times in the thread so far:

http://forums.anandtech.com/showpost.php?p=38145828&postcount=931
http://forums.anandtech.com/showpost.php?p=38146592&postcount=1051
http://forums.anandtech.com/showpost.php?p=38147243&postcount=1125

I guess we're going around in circles at this point, since there's no real way to get any closer to the answer until more info on GP106/104 is released

It will be 96 FP32 units per SMX, not 128.
Pascal has 64 FP32 + 32 FP64 in each SMX.
Btw, I don't think it will happen. It would be too fast with the TSMC 16nm FinFET clock increase.

The 1080 would then have 3840 SP and GP100 5760 SP, and with a 25% clock increase those cards would be more than 100% faster than the GTX 980 and Titan X.
A 1080 with 2560 SP should be 60% faster than the GTX 980 with a 25% clock increase and 10% architectural IPC. That's 25% faster than Titan X and 30% faster than the 980 Ti.
And the best thing is, if GP104 has zero DP units it will be only 230-250mm2 large.

Silverforce11 · Apr 6, 2016

The more apt comparison is between GK110 and GK104, since big Kepler also had some FP64 CC per SM.

Basically the relationship between GP104 vs GP100 will be FP64 stripped out.

@Head1985

Technically not a 25% clock speed increase, because you have to compare it to the boost clocks, which on Maxwell reference cards already hit ~1.25GHz.

But GP104 with 2560 CC, better IPC and the clock speed gain could definitely match the leaked benchmark results, aka ~15% above Titan X. In games that are GCN optimized, it will be even faster compared to Titan X due to the new SM layout.

Timmah! · Apr 6, 2016

Head1985 said:
You forgot about the clock increase with 16nm FinFET.

If the 1080 has 2560 SP:
GTX 980 vs 1080
+25% SP
+25% clock
We are at +50% above the 980 without any new architecture changes. Titan X is currently 35% faster than the GTX 980. If we add 10% from the new architecture, we are at 60% above the GTX 980. That's 25% faster than Titan X and 30% above the 980 Ti. And if the new architecture brings even more than 10% IPC...
Not bad for a 250mm2 SKU (if DP is disabled).

16nm FinFET with a HUGE clock boost helps a lot..

I am not interested in something marginally faster than Titan X. And yeah, 25 percent is a marginal increase to me. Given that Titan X has 12GB of VRAM versus most likely 8GB on GP104, and that I need VRAM as much as maximum speed, GP104 is not really much of an upgrade in my eyes. I want those 3840 cores and 16GB of VRAM, and I am actually willing to pay 1000 for that, unlike 500 for a 2560 SP/8GB card.

Silverforce11 · Apr 6, 2016

Timmah! said:
I am not interested in something marginally faster than Titan X. And yeah, 25 percent is a marginal increase to me. Given that Titan X has 12GB of VRAM versus most likely 8GB on GP104, and that I need VRAM as much as maximum speed, GP104 is not really much of an upgrade in my eyes. I want those 3840 cores and 16GB of VRAM, and I am actually willing to pay 1000 for that, unlike 500 for a 2560 SP/8GB card.

You're not, but it's a mid-range chip. It will be marginally faster than the top-end chip of the previous gen. It's not going to be 50-70% faster; that's never happened in recent GPU history.

Titan X + 15% (or Titan X + 30% in GCN optimized titles) at much less TDP makes it a winner really. Can't go wrong with that from a small mid-range chip.

Head1985 · Apr 6, 2016

Timmah! said:
I am not interested in something marginally faster than Titan X. And yeah, 25 percent is a marginal increase to me. Given that Titan X has 12GB of VRAM versus most likely 8GB on GP104, and that I need VRAM as much as maximum speed, GP104 is not really much of an upgrade in my eyes. I want those 3840 cores and 16GB of VRAM, and I am actually willing to pay 1000 for that, unlike 500 for a 2560 SP/8GB card.

25% is decent, and it's the worst-case scenario. The 25% assumes only +10% IPC from the new architecture.
It could be even 10% more.
If Pascal delivers 20% IPC over Maxwell, then the GTX 1080 will be 35% faster than Titan X and 40% faster than the 980 Ti.
That would be a bigger jump than GTX 680 vs 580.
980 vs 1080:
25% more SP
25% clock increase
20% IPC
That's a 70% increase over the GTX 980. But even if it's only 10% IPC, it will still be 30% faster than the 980 Ti. That's pretty decent for a 250mm2 SKU. And if you want more, then wait for GP100 in 2017.

jpiniero · Apr 6, 2016

Has anyone gotten confirmation that the GP100 does not have any ROPs (and that they're not just disabled on the Tesla)?

antihelten · Apr 6, 2016

Head1985 said:
It will be 96 FP32 units per SMX, not 128.
Pascal has 64 FP32 + 32 FP64 in each SMX.
Btw, I don't think it will happen. It would be too fast with the TSMC 16nm FinFET clock increase.

The 1080 would then have 3840 SP and GP100 5760 SP, and with a 25% clock increase those cards would be more than 100% faster than the GTX 980 and Titan X.
A 1080 with 2560 SP should be 60% faster than the GTX 980 with a 25% clock increase and 10% architectural IPC. That's 25% faster than Titan X and 30% faster than the 980 Ti.
And the best thing is, if GP104 has zero DP units it will be only 230-250mm2 large.

You're assuming that FP64 cores will be replaced with FP32 cores on a 1 to 1 basis. The 128 FP32 cores per SM theory assumes that since FP64 cores are almost certainly bigger die space wise, a single FP64 core can be replaced with 2 FP32 cores, thus reaching 128 cores per SM, just like Maxwell. We don't really know which one it might be (it could also be just 64 FP32 cores per SM, if they simply remove the FP64 units without changing anything else)

How many cores 1080 has would depend entirely upon how many SMs it has. It may have 24 (to match GM200, if they go with 128 cores per SM), 28 (half of P100, and with 128 cores per SM, the same total number of FP32 cores as P100, i.e 3584), 30 (half of GP100), 40 (what you suggested) or something else entirely, who knows. The size of GP104 obviously also depends upon how many SMs they throw in it.
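
To make the scenarios concrete, here is a small Python sketch that tabulates the FP32 core count for each SM count mentioned above under both the 64 and 128 cores-per-SM layouts. All of these are guesses from this thread, not known GP104 configurations, and the names are made up.

```python
SM_COUNTS = (24, 28, 30, 40)   # SM counts floated in this thread
LAYOUTS = (64, 128)            # FP32 cores per SM: GP100-style vs Maxwell-style

def total_fp32(sms, fp32_per_sm):
    """FP32 core count for a chip with the given SM count and SM layout."""
    return sms * fp32_per_sm

table = {
    (sms, per_sm): total_fp32(sms, per_sm)
    for sms in SM_COUNTS
    for per_sm in LAYOUTS
}

for sms in SM_COUNTS:
    row = ", ".join(f"{per_sm}/SM -> {table[(sms, per_sm)]}" for per_sm in LAYOUTS)
    print(f"{sms} SMs: {row}")
```

The 24 and 28 SM cases at 128 cores per SM give 3072 (GM200) and 3584 (P100) respectively, while 40 SMs at 64 per SM gives the 2560 suggested earlier.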

Silverforce11 said:
The more apt comparison is between GK110 and GK104, since big Kepler also had some FP64 CC per SM.

Basically the relationship between GP104 vs GP100 will be FP64 stripped out.

It might be worth mentioning here, but small Kepler (i.e. everything below GK110) and Maxwell are not totally devoid of FP64 cores.

Small Kepler has 8 FP64 units per SM (versus 64 for GK110), and Maxwell has 4 FP64 units per SM. So GP104 would likely still have some FP64 units left in (if it follows the same ratio as GK104 to GK110, then GP104 would have 4 FP64 units per SM).
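
The ratio argument as a minimal Python sketch. The FP64 unit counts are the ones quoted in this thread (including 32 per SM for GP100); the 192 FP32 cores per Kepler SM is not from this thread and is added here as an assumption to show the resulting FP64:FP32 rates.

```python
from fractions import Fraction

# FP64 units per SM, as quoted in this thread.
FP64_PER_SM = {"GK110": 64, "GK104": 8, "GM2xx": 4, "GP100": 32}
# FP32 cores per SM; the Kepler figure (192) is an assumption not stated in the thread.
FP32_PER_SM = {"GK110": 192, "GK104": 192, "GM2xx": 128, "GP100": 64}

# Small-to-big ratio on Kepler, applied to GP100 to guess GP104.
small_to_big = Fraction(FP64_PER_SM["GK104"], FP64_PER_SM["GK110"])
gp104_fp64_per_sm = FP64_PER_SM["GP100"] * small_to_big

print("GK104:GK110 FP64 ratio:", small_to_big)
print("Guessed GP104 FP64 units per SM:", gp104_fp64_per_sm)
for chip in FP64_PER_SM:
    rate = Fraction(FP64_PER_SM[chip], FP32_PER_SM[chip])
    print(f"{chip}: FP64 rate {rate} of FP32")
```

Applying Kepler's 1/8 small-to-big ratio to GP100's 32 units gives the 4 FP64 units per SM mentioned above.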

Cookie Monster · Apr 6, 2016

Silverforce11 said:
So was Fermi 480 and Kepler Titan/780Ti, these big chips were HPC focused with 1/3 FP64. They were good for gaming too.

Why the expectations that it will be different this time?

Because of several things. One is that Pascal not only has a 1:2 DP rate, it also packs a peak 5.3 TFLOPs of double precision performance. That is roughly 3.5x a GK110. It packs way too many FP64 CCs for it to be useful in a 3D gaming environment. On top of that, it has things like NVLink (IBM only), which is also meaningless for gaming. Another is that this GPU might not have any ROPs present.
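
For reference, the 5.3 TFLOPs figure can be reproduced with a back-of-envelope Python sketch. It assumes 56 enabled SMs on the Tesla P100 (consistent with the 3584 FP32 cores mentioned in this thread), 32 FP64 units per SM, one fused multiply-add (2 FLOPs) per unit per clock, and a boost clock of about 1480 MHz. The boost clock is an assumption taken from NVIDIA's published P100 specs, not from this thread.

```python
def peak_tflops(sms, units_per_sm, clock_mhz, flops_per_unit_per_clock=2):
    """Theoretical peak throughput in TFLOPs (FMA counted as 2 FLOPs)."""
    flops = sms * units_per_sm * flops_per_unit_per_clock * clock_mhz * 1e6
    return flops / 1e12

P100_SMS = 56          # enabled SMs on Tesla P100 (assumed)
BOOST_MHZ = 1480       # assumed boost clock

fp64 = peak_tflops(P100_SMS, 32, BOOST_MHZ)
fp32 = peak_tflops(P100_SMS, 64, BOOST_MHZ)
print(f"FP64: {fp64:.1f} TFLOPs, FP32: {fp32:.1f} TFLOPs")
```

The 1:2 DP rate falls straight out of having half as many FP64 units as FP32 cores in each SM.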

Unlike the previous approaches, where they could get away with a single GPU to do both, it looks like they literally went all out for KNL, maybe? (It sort of reminds me of the GT200 and the can of w***po ass, but this time this thing might actually kick some serious ass in HPC/GPGPU/compute applications.) Actually, I wouldn't mind having one at work!

I expect game-tailored GPU(s), probably because a) they can afford multiple GPUs and b) the FP64 cores waste way too much power and die space for nothing. That's one reason why Fury, for instance, has a 1/16 FP64 rate.

Plus, if such a "GP102" exists, it could also be sold as Quadros/Teslas, because it would pack way more FP32 CCs, i.e. its single precision performance would be very high. Perhaps it will feature GDDR5X, while GP104 features GDDR5 and limited FP64.

Whatever it is, I doubt there will be any P100-based consumer gaming product, nor will such a product use HBM2.

gamervivek · Apr 6, 2016

Silverforce11 said:
So was Fermi 480 and Kepler Titan/780Ti, these big chips were HPC focused with 1/3 FP64. They were good for gaming too.

Why the expectations that it will be different this time?

It's relative. This thing will get murdered by AMD's 600mm2 behemoth. nvidia don't have the luxury of going up only against 300-350mm2 chips from AMD.

I can see even the 400mm2 Vega taking GP100 down in gaming depending on clockspeed, 600mm2 chip from AMD would just beat it silly. Unless of course AMD go down the same road as nvidia which seems pretty unlikely or don't have a 600mm2 chip in the works at all which is more likely.

PhonakV30 · Apr 6, 2016

Apparently there are 2 GPUs.
TAIWAN 1543
N7C410 000
LR21 ( Downwards )
http://cdn.wccftech.com/wp-content/uploads/2016/04/NVIDIA-Tesla-P100-GP100-GPU_3.jpg

TAIWAN 1540A1
N6P800 L8(?)P
LR21 ( Upward )
http://cdn.wccftech.com/wp-content/uploads/2016/04/NVIDIA-Tesla-P100-GP100-GPU_4.jpg

nvgpu · Apr 6, 2016

http://nvidianews.nvidia.com/news/n...ouble-speed-of-europe-s-fastest-supercomputer

CSCS plans to upgrade the system later this year with 4,500 Pascal-based GPUs.

Nvidia will be selling every GP100 they can get from TSMC, demand is high.
