NVIDIA Pascal Thread


Timmah!

Golden Member
Jul 24, 2010
1,463
729
136
So it pretty much comes down to this: is that FP64 unit basically what allows the regular FP32 CUDA cores (half of them) to function as 64-bit ones, or is it the only thing needed for 64-bit computing, with the other 3840 cores having no role in it...

Regardless of Nvidia's nomenclature, I am siding with Aten Ra that the first option is more likely, although I may be totally wrong. It's worth mentioning, though, that in that SM diagram the 64-bit parts are named "DP Unit", not cores, unlike the regular cores. One would assume that if this were a full CUDA core with 64-bit capability, they would call it a "DP Core" or something...

 

antihelten

Golden Member
Feb 2, 2012
1,764
274
126
Why would they do that? It doesn't make any sense when you consider that a GCN-like architecture would give an additional boost in games developed with GCN in mind. It's win-win for Nvidia. Also, why do people say that the consumer version of GP100 (GP102?) will have 25% higher clocks than GM200, when Tesla P100 already offers a 40% higher base clock than Tesla M40?

Because a 128-core SM could arguably achieve better shader/mm² density than a 64- or 96-core SM, whilst still maintaining better efficiency than a 192-core SM.

So it pretty much comes down to this: is that FP64 unit basically what allows the regular FP32 CUDA cores (half of them) to function as 64-bit ones, or is it the only thing needed for 64-bit computing, with the other 3840 cores having no role in it...

Regardless of Nvidia's nomenclature, I am siding with Aten Ra that the first option is more likely, although I may be totally wrong. It's worth mentioning, though, that in that SM diagram the 64-bit parts are named "DP Unit", not cores, unlike the regular cores. One would assume that if this were a full CUDA core with 64-bit capability, they would call it a "DP Core" or something...


Seriously, just go and read some of the tons of information that's available out there on the Kepler and Maxwell architectures (and the little info that's available for Pascal).

The FP64 units/cores, whatever you want to call them, are distinct, separate units, and they do not make use of the normal FP32 cores. FP32 CUDA cores capable of running FP64 operations (taking 2 cycles instead of 1) were a feature of older architectures (i.e. Fermi), but that capability was removed in Kepler (and Maxwell).
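
If you want to see that ratio from software rather than diagrams, a quick microbenchmark makes the throughput gap visible. A minimal sketch, assuming a CUDA toolchain and any CUDA-capable GPU (the grid size and loop count are arbitrary illustration values):

[code]
// Minimal FP32-vs-FP64 sketch: the same dependent FMA loop,
// instantiated for float and for double, timed with CUDA events.
#include <cstdio>
#include <cuda_runtime.h>

template <typename T>
__global__ void fma_loop(T* out, T seed, int iters) {
    T a = seed + static_cast<T>(threadIdx.x), b = seed, c = seed;
    for (int i = 0; i < iters; ++i)
        a = a * b + c;          // compiles to FFMA (float) or DFMA (double)
    out[threadIdx.x] = a;       // store so the loop isn't optimized away
}

template <typename T>
float time_fma(int iters) {
    T* out = nullptr;
    cudaMalloc(&out, 256 * sizeof(T));
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    fma_loop<T><<<64, 256>>>(out, static_cast<T>(1.0001), 1);    // warm-up
    cudaEventRecord(start);
    fma_loop<T><<<64, 256>>>(out, static_cast<T>(1.0001), iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaFree(out);
    return ms;
}

int main() {
    const int iters = 1 << 18;
    printf("FP32: %7.2f ms\n", time_fma<float>(iters));
    printf("FP64: %7.2f ms\n", time_fma<double>(iters));
    return 0;
}
[/code]

On a chip with 1/2-rate FP64 like GP100, the second run should take roughly twice as long as the first; on Maxwell's 1/32-rate parts, dramatically longer.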
 

xpea

Senior member
Feb 14, 2014
449
150
116
There's more to strip with Pascal since its FP64 rate is 1/2, so you might get better scaling there than in the Kepler generation. If AMD's experience with HBM is anything to go by, switching back to a GDDR5(X) interface for GP104 would cause an area increase vs GP100. I think anyone expecting a 4-GPC, 2560-CC GP104 to come in at or under 250mm² is being quite optimistic. That's 63% more FP32 cores per mm² of chip than GP100, over the whole chip.
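
(For reference, that 63% figure is just the quoted numbers worked through: 2560 cores / 250 mm² ≈ 10.2 FP32 cores per mm², versus 3840 / 610 mm² ≈ 6.3 for GP100, and 10.2 / 6.3 ≈ 1.63.)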
Pascal also has the very costly NVLink interface. From the AnandTech DGX-1 article:
In terms of construction, as hinted by in NVIDIA’s sole diagram of a component-level view of the DGX-1, the Tesla cards sit on their own carrier board, with the Xeon CPUs, DRAM, and most other parts occupying their own board. The carrier board in turn serves two functions: it allows for a dedicated board for routing the NVLink connections – each P100 requires 800 pins, 400 for PCIe + power, and another 400 for the NVLinks, adding up to nearly 1600 board traces for NVLinks alone – and it also allows easier integration into OEM designs, as OEMs need only supply an interface for the carrier board. It’s a bit of an amusing juxtaposition then, as the carrier board essentially makes for one massive 8 GPU compute card, being fed with PCIe lanes from the CPU and power from the PSU.
I'm sure that the NVLink + HBM interfaces take more space than GDDR5 alone.
 

antihelten

Golden Member
Feb 2, 2012
1,764
274
126
Pascal also has the very costly NVLink interface. From the AnandTech DGX-1 article:

I'm sure that the NVLink + HBM interfaces take more space than GDDR5 alone.

The HBM interface actually takes up relatively little space. On Fiji, the 4096-bit HBM1 interface used roughly 2/3 of the space that Hawaii spent on its 512-bit GDDR5 interface, largely because much of the control logic is moved onto the HBM memory modules themselves.
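
(Worked through: Fiji's bus is 8x as wide as Hawaii's, 4096 bits vs 512, in roughly 2/3 the area, so about 12x the bus width per mm² of interface.)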

No idea about NVLink.
 

gamervivek

Senior member
Jan 17, 2011
490
53
91
I'm with you, but people here were talking about 3840 + 1920 = 5760 CUDA cores, which of course is not true.

You're right, though I was accepting the 5760-core version because just 3840 cores on 610mm² of chip real estate sounded too low.

The Pascal architecture’s computational prowess is more than just brute force: it increases performance not only by adding more SMs than previous GPUs, but by making each SM more efficient. Each SM has 64 CUDA cores and four texture units, for a total of 3840 CUDA cores and 240 texture units.

https://devblogs.nvidia.com/parallelforall/inside-pascal/#_ftn1

They've upped the frequency to make up for it and are hitting the 300W TDP limit. The return of Fermi?
 

MrTeal

Diamond Member
Dec 7, 2003
3,587
1,748
136
The HBM interface actually takes up relatively little space. On Fiji, the 4096-bit HBM1 interface used roughly 2/3 of the space that Hawaii spent on its 512-bit GDDR5 interface, largely because much of the control logic is moved onto the HBM memory modules themselves.

No idea about NVLink.

NVLink's a bit of a wildcard since it hasn't been used before. Each link is 20GB/s (compared to 16GB/s for PCIe 3.0 x16), and each GP100 appears to have four of them. On Kepler, the x16 PCIe interface appears to be around 5.5mm², and I think it's reasonable to assume that a 20GB/s NVLink interface on 16nm, on a die designed for an interposer, isn't larger than that. Somewhere in the vicinity of 20-25mm² seems a reasonable upper limit on the size of NVLink, IMO.
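
Worth noting that NVLink changes nothing in the programming model; it just accelerates the existing peer-to-peer path. A rough sketch of the transfer it speeds up, assuming a machine with two CUDA GPUs (the buffer size is an arbitrary illustration value):

[code]
// Peer-to-peer copy between two GPUs; over NVLink-connected parts
// the same calls simply run faster.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    if (n < 2) { printf("Need two GPUs for a P2P test.\n"); return 0; }

    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    printf("GPU0 -> GPU1 peer access possible: %s\n", canAccess ? "yes" : "no");

    const size_t bytes = 256u << 20;          // 256 MiB test buffer
    void *src = nullptr, *dst = nullptr;
    cudaSetDevice(0);
    cudaMalloc(&src, bytes);
    if (canAccess) cudaDeviceEnablePeerAccess(1, 0);  // direct GPU0 <-> GPU1 path
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);

    // This copy is what NVLink accelerates: ~16 GB/s max over
    // PCIe 3.0 x16, vs ~20 GB/s per direction over a single NVLink.
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaDeviceSynchronize();

    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);
    return 0;
}
[/code]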
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
The potentially below-average improvement in perf/mm² and perf/watt is awfully reminiscent of Fermi...

The Pascal SM is still very far away from the GCN CU. GCN has a scalar unit; Pascal does not. GCN has a wavefront size of 64; Pascal's warp size is still 32. (Most people claiming that Pascal's warp width is 64 are wrong.)
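
The warp width is trivial to confirm from code, since the runtime exposes it directly (assuming a CUDA toolchain):

[code]
// Query the warp width the hardware actually reports.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Reports 32 on every NVIDIA architecture to date, Pascal included.
    printf("%s: warpSize = %d\n", prop.name, prop.warpSize);
    return 0;
}
[/code]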
 
Feb 19, 2009
10,457
10
76
Nvidia has been using dedicated FP64 units ever since Kepler. Fermi (and earlier) used multiple FP32 cores to simulate an FP64 core.

Yep, because Fermi had a hardware scheduler complex enough for that, the same as GCN. There's no dedicated FP64 SP: FP32 units can combine to handle FP64 workloads, and they can also handle 2x FP16.

Basically, Pascal adds the latter feature, 2x FP16, and calls it a revolution for computing.
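
For anyone wondering what 2x FP16 looks like from the programmer's side: it's exposed through the packed half2 intrinsics in cuda_fp16.h, where one instruction operates on two FP16 values at once. A minimal sketch; it needs nvcc targeting sm_53 or newer to execute natively, and the values are arbitrary:

[code]
// Packed FP16 demo: one __hfma2 performs two half-precision FMAs.
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

__global__ void half2_demo(float* out) {
    __half2 a = __floats2half2_rn(2.0f, 2.0f);  // pack two FP16 values
    __half2 x = __floats2half2_rn(1.0f, 3.0f);
    __half2 y = __floats2half2_rn(0.5f, 0.5f);
    __half2 r = __hfma2(a, x, y);   // (2*1+0.5, 2*3+0.5) in one instruction
    out[0] = __low2float(r);        // unpack for inspection
    out[1] = __high2float(r);
}

int main() {
    float* out = nullptr;
    cudaMallocManaged(&out, 2 * sizeof(float));
    half2_demo<<<1, 1>>>(out);
    cudaDeviceSynchronize();
    printf("r = (%g, %g)\n", out[0], out[1]);   // expect (2.5, 6.5)
    cudaFree(out);
    return 0;
}
[/code]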

People thought GCN was old, obsolete tech, but we're seeing it hold up very well today: the non-reference 290/X is a 390/X, and that's been beating the 970/980 lately. With the DX12 era in full swing in a few months, the lead will only grow.

Now Pascal is arranging its SM clusters to be similar to GCN as well.

All they need is a real hardware scheduler and several ACEs and it's going to be awesome for this new era. Almost like a timewarp.
 
Feb 19, 2009
10,457
10
76
The potentially below-average improvement in perf/mm² and perf/watt is awfully reminiscent of Fermi...

The Pascal SM is still very far away from the GCN CU. GCN has a scalar unit; Pascal does not. GCN has a wavefront size of 64; Pascal's warp size is still 32. (Most people claiming that Pascal's warp width is 64 are wrong.)

Pascal GP100 has 2 Warp Schedulers per SM.

[GP100 SM diagram]

^ Unless I'm blind, that's one Warp scheduler per 32x FP32 CC.

GCN loves warp sizes of 64, as that lights up all of its SPs in the SM. It will be the same for Pascal.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
Pascal GP100 has 2 Warp Schedulers per SM.

[GP100 SM diagram]

^ Unless I'm blind, that's one Warp scheduler per 32x FP32 CC.

GCN loves warp sizes of 64, as that lights up all of its SPs in the SM. It will be the same for Pascal.

That doesn't mean their warp width changed to 64; it just means a Pascal SM has two warp schedulers. How would combining warp schedulers work when the register files aren't shared?! (It's almost like the reverse hyper-threading nonsense.)

Another downside is that Pascal will need more thread parallelism, like GCN did compared to VLIW...
 
Feb 19, 2009
10,457
10
76
That doesn't mean their warp width changed to 64; it just means a Pascal SM has two warp schedulers. How would combining warp schedulers work when the register files aren't shared?! (It's almost like the reverse hyper-threading nonsense.)

Another downside is that Pascal will need more thread parallelism, like GCN did compared to VLIW...

The warp width per WS is 32, but it hits all 32 FP32 CCs in that block. The other WS and block will be the same. As long as they can split a 64-wide wavefront into 2x 32, one for each WS, they will instantly hit peak utilization of all 64 CCs.
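
To make the 2x 32 split concrete, here's a toy kernel (any CUDA GPU) that prints how a 64-thread block decomposes into two 32-wide warps:

[code]
#include <cstdio>
#include <cuda_runtime.h>

__global__ void warp_map() {
    // warpSize is a built-in device constant (32 on all NVIDIA GPUs so far)
    int warp = threadIdx.x / warpSize;
    int lane = threadIdx.x % warpSize;
    if (lane == 0)
        printf("block thread %2d starts warp %d\n", (int)threadIdx.x, warp);
}

int main() {
    warp_map<<<1, 64>>>();   // one 64-thread block -> two 32-wide warps
    cudaDeviceSynchronize();
    return 0;
}
[/code]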

I don't know what the TPC is in their GP100 diagram, but it occurs after the Gigathread engine for each SM cluster.

I looked it up, TPC = Thread Processing Cluster.

It wasn't on GM200 or GK110 chip diagrams.
 

gamervivek

Senior member
Jan 17, 2011
490
53
91
As long as these cards are powerful, I don't care whether they consume 250W or 350W.

Yeah, best performance was Fermi's saving grace in the gaming space, but AMD didn't have 400mm²+ chips back then. As I said earlier, it's relative, and just how 'powerful' these chips look in the gaming market will depend on AMD's releases as well, or the lack of them.
 

Kris194

Member
Mar 16, 2016
112
0
0
The warp width per WS is 32, but it hits all 32 FP32 CCs in that block. The other WS and block will be the same. As long as they can split a 64-wide wavefront into 2x 32, one for each WS, they will instantly hit peak utilization of all 64 CCs.

I don't know what the TPC is in their GP100 diagram, but it occurs after the Gigathread engine for each SM cluster.

I looked it up, TPC = Thread Processing Cluster.

It wasn't on GM200 or GK110 chip diagrams.

TPC is a normal part of Nvidia chips.
 
Feb 19, 2009
10,457
10
76
I'm asking one more time: where do you see a TPC on the diagram?

[GP100 GPC diagram]

A GPC block of 10 SMs.

Each SM is controlled by 1 TPC. In terms of flow, it's directly after the Gigathread Engine.

TPC was absent in Kepler & Maxwell.

GM200:

[GM200 block diagram]

GK110:

[GK110 block diagram]

Fermi doesn't show it either.

[Fermi block diagram]

Have to go back to GT200 to see a TPC on NV's GPU diagrams.

[GT200 block diagram]

^ But back then it was referenced as a Texture Processing Cluster that included the TMUs.

In the newer diagrams the TMUs are separate. In GP100's diagram the TMUs are also separate, but the TPC returns.
 

Kris194

Member
Mar 16, 2016
112
0
0
Thanks, I didn't notice it earlier. The TPC is a normal part of Tesla products, as you can see in the specification.
 

Aristotelian

Golden Member
Jan 30, 2010
1,246
11
76
Will you care if your high-end Pascal doesn't overclock so well? Because that's the same thing as 250W vs 350W.

I would imagine that the interest in overclocking is about pushing the performance limits of the video card. If the card is already at that limit (clocked highly from the beginning) and overclocking is not really possible, I'm not sure what the problem is.

For me, overclocking has always been about getting the maximum performance I can out of the product I buy. The TDP is not that relevant; it's more about the effort in the cooling solution, trying to keep things quiet.
 

Adored

Senior member
Mar 24, 2016
256
1
16
I would imagine that the interest in overclocking is about pushing the performance limits of the video card. If the card is already at that limit (clocked highly from the beginning) and overclocking is not really possible, I'm not sure what the problem is.

For me, overclocking has always been about getting the maximum performance I can out of the product I buy. The TDP is not that relevant; it's more about the effort in the cooling solution, trying to keep things quiet.

Yes but presumably you also overclock your current card to the limit as well. Is that how you will compare your current card to the new cards?

This is an interesting point actually, given that for years Nvidia has managed to get their cards overclocked for benchmarks. I wonder if we'll see superclocked or stock 980 Ti's when the 1080 is up for benching.
 
Feb 19, 2009
10,457
10
76
Yes but presumably you also overclock your current card to the limit as well. Is that how you will compare your current card to the new cards?

This is an interesting point actually, given that for years Nvidia has managed to get their cards overclocked for benchmarks. I wonder if we'll see superclocked or stock 980 Ti's when the 1080 is up for benching.

NV's boost is a great feature; it's about time AMD did auto-OC too. Nobody (unless they're enthusiasts!) wants to mess with manual vcore adjustments or whatever. Just a slider, and let the hardware handle itself.
 

Adored

Senior member
Mar 24, 2016
256
1
16
NV's boost is a great feature; it's about time AMD did auto-OC too. Nobody (unless they're enthusiasts!) wants to mess with manual vcore adjustments or whatever. Just a slider, and let the hardware handle itself.

Yes, but I'm talking about factory-OC'd cards being used instead of the reference in certain benchmarks. I wonder if that'll be seen when the 1080 is launched, or if suddenly there will be a whole lot of reference 980 Tis around in the press.
 
Feb 19, 2009
10,457
10
76
Yes, but I'm talking about factory-OC'd cards being used instead of the reference in certain benchmarks. I wonder if that'll be seen when the 1080 is launched, or if suddenly there will be a whole lot of reference 980 Tis around in the press.

If they were smarter, they wouldn't release the reference model to the press, only custom models with huge OCs.

That's the nice thing about NV's boost, it's just so easy. No tweaks required; move the slider to the right, voilà!
 

Timmah!

Golden Member
Jul 24, 2010
1,463
729
136
Yep, because Fermi had a hardware scheduler complex enough for that, the same as GCN. There's no dedicated FP64 SP: FP32 units can combine to handle FP64 workloads, and they can also handle 2x FP16.

Basically, Pascal adds the latter feature, 2x FP16, and calls it a revolution for computing.

People thought GCN was old, obsolete tech, but we're seeing it hold up very well today: the non-reference 290/X is a 390/X, and that's been beating the 970/980 lately. With the DX12 era in full swing in a few months, the lead will only grow.

Now Pascal is arranging its SM clusters to be similar to GCN as well.

All they need is a real hardware scheduler and several ACEs and it's going to be awesome for this new era. Almost like a timewarp.

So Nvidia's architecture is becoming similar to GCN. I wonder if they won't lose their superior efficiency/power consumption in the process. We may see the return of "Thermi" eventually :-D Full circle and all that.
 
Feb 19, 2009
10,457
10
76
So Nvidia's architecture is becoming similar to GCN. I wonder if they won't lose their superior efficiency/power consumption in the process. We may see the return of "Thermi" eventually :-D Full circle and all that.

GP100 Tesla, cut-down chip (56/60), 300W TDP.

The last time that happened was Fermi.

A strong compute uarch takes a hit on perf/W.
 