Question: 'Ampere'/Next-gen gaming uarch speculation thread


Ottonomous

Senior member
May 15, 2014
559
292
136
How much gain is the Samsung 7nm EUV process expected to provide?
How will the RTX components be scaled/developed?
Any major architectural enhancements expected?
Will VRAM be bumped to 16/12/12 for the top three?
Will there be further fragmentation in the lineup? (Keeping Turing at cheaper prices, while offering 'beefed up RTX' options at the top?)
Will the top card be capable of more than 4K60, or even 4K90?
Would Nvidia ever consider an HBM implementation in the gaming lineup?
Will Nvidia introduce new proprietary technologies again?

Sorry if this is imprudent/uncalled for, just interested in the forum members' thoughts.
 

uzzi38

Platinum Member
Oct 16, 2019
2,702
6,405
146
Both of you forgot the asterisk: "when performing matrix calculations".

It is not a universal perf/W improvement. A100 is unparalleled for AI workloads, but for raw number crunching... well, there's a reason AMD won those exascale supercomputer contracts.
 

Saylick

Diamond Member
Sep 10, 2012
3,385
7,151
136
I wonder if AT and other benchmarking sites will have to report different power consumption figures for use cases that do and do not leverage the tensor cores, e.g. DLSS. The majority of the perf/W improvements come from efficient use of the tensor cores, so I can see it skewing the power consumption figures and, to the untrained eye, leading to unintended deception or confusion.
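
To illustrate the concern, here's a napkin-math sketch; the fps and wattage numbers are entirely made up for the example, not measurements:

```python
# Hypothetical numbers (not measurements) showing how DLSS can skew perf/W:
# rendering at a lower internal resolution and upscaling on the tensor cores
# raises frame rate far more than it raises board power.

native = {"fps": 60.0, "watts": 320.0}  # assumed: 4K native rendering
dlss = {"fps": 90.0, "watts": 330.0}    # assumed: 1440p render + DLSS upscale

for name, run in (("native", native), ("DLSS", dlss)):
    print(f"{name}: {run['fps'] / run['watts']:.3f} fps/W")

# native: 0.188 fps/W
# DLSS:   0.273 fps/W -> ~45% better perf/W that disappears with DLSS off
```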
 
Reactions: ozzy702 and maddie

Glo.

Diamond Member
Apr 25, 2015
5,761
4,666
136
Both of you forgot the asterisk: "when performing matrix calculations".

It is not a universal perf/W improvement. A100 is unparalleled for AI workloads, but for raw number crunching... well, there's a reason AMD won those exascale supercomputer contracts.
For anything else, the A100 is worse in perf/W than Volta. It's as simple as that.

Let's start a discussion about die configs beyond just the highest end, i.e. the dies people actually care about.

107 die: 24 SM, 192-bit GDDR6 memory bus, 2-2.1 GHz boost clock, RTX 2060-2060 Super performance levels.
106 die: 36 SM, 256-bit GDDR6 memory bus, 2.1 GHz boost clock, RTX 2070 Super-RTX 2080 performance levels.

The first die's TDP would be around 100-110W, the second's around 150-160W. Does that make sense?
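
As a rough sanity check on those targets: assuming Turing-like per-SM throughput (a big assumption), relative performance scales with SM count times boost clock. The Turing SM counts and clocks below are the published reference specs; the proposed configs are from the post above:

```python
# Napkin math on the proposed configs, assuming Turing-like per-SM throughput
# (relative perf ~ SM count x boost clock; real-world scaling is worse).

cards = {
    "RTX 2060 (ref)": (30, 1.68),   # (SMs, boost GHz), published specs
    "RTX 2080 (ref)": (46, 1.71),
    "x107 proposal":  (24, 2.05),   # midpoint of the 2-2.1 GHz guess above
    "x106 proposal":  (36, 2.10),
}

for name, (sms, ghz) in cards.items():
    print(f"{name}: {sms * ghz:.1f}")

# RTX 2060: 50.4 vs x107: 49.2 -> roughly 2060-class, as proposed
# RTX 2080: 78.7 vs x106: 75.6 -> just shy of a 2080, as proposed
```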
 
Last edited:

maddie

Diamond Member
Jul 18, 2010
4,787
4,771
136
I wonder if AT and other benchmarking sites will have to report different power consumption figures for use cases that do and do not leverage the tensor cores, e.g. DLSS. The majority of the perf/W improvements come from efficient use of the tensor cores, so I can see it skewing the power consumption figures and, to the untrained eye, leading to unintended deception or confusion.
You, sir, are nominated for the diplomatic corps.
 
Reactions: Saylick

DXDiag

Member
Nov 12, 2017
165
121
116
Both of you forgot the asterisk: "when performing matrix calculations".
Yes, because those are the intended workloads of the chip; it's not called the A100 Tensor Core GPU for nothing. NVIDIA has improved the throughput of each tensor core 4x.



It is not a universal perf/W improvement. A100 is unparalleled for AI workloads, but for raw number crunching... well, there's a reason AMD won those exascale supercomputer contracts.
Because of one thing only: CPU-GPU heterogeneous access and cache sharing between Ryzen and CDNA.
 

uzzi38

Platinum Member
Oct 16, 2019
2,702
6,405
146
Because of one thing only: CPU-GPU heterogeneous access and cache sharing.
Bzzzt, that's the wrong answer.

It's an excellent selling point for AMD, but it wouldn't have been enough had CDNA's raw performance not been good enough.

That, and it wouldn't have happened had AMD not committed to a proper software stack. You'd be surprised how high a priority it is there now; the times, they are a-changin'.
 

uzzi38

Platinum Member
Oct 16, 2019
2,702
6,405
146
Nope, even AMD officials have clearly mentioned that reason.
Oh yes, because AMD was going to declare the performance details for their custom CPU and GPU combination, weren't they?

If they talked about the performance and featureset in more detail, they'd be giving too many hints on the absolute monsters that are to come.

(Psst, I probably gave a bit more of a hint than I should have about Frontier).
 

sontin

Diamond Member
Sep 12, 2011
3,273
149
106
The both of you forgot the asterisk of "when performing matrix calculations".

It is not a universal perf/w improvement. A100 is unparalleled for AI workloads but for raw number crunching... well there's a reason AMD won over those exascale supercomputers

Nope. 19.5 TFLOPS DP performance.

BTW: Japan is building a 750-petaflop supercomputer with ARM cores. It doesn't matter for nVidia because government involvement doesn't mean anything for the broader market.
 

Glo.

Diamond Member
Apr 25, 2015
5,761
4,666
136
Nope. 19.5 TFLOPS DP performance.

BTW: Japan is building a 750-petaflop supercomputer with ARM cores. It doesn't matter for nVidia because government involvement doesn't mean anything for the broader market.
"Cough" 19.5 TFLOPS of FP32 (single precision) performance "cough"
 

xpea

Senior member
Feb 14, 2014
449
148
116
"Cough" 19.5 TFLOPs of FP32(Single Precision) performance "Cough"
it's 19.5 TFLOPS IEEE-compliant FP64 Tensor Core instructions for HPC
And I see how convenient the nay sayers forget that A100 400W TDP includes 600GB/sec 12 NVLINK (double from V100), which is already nearly 100W of TDP... (hint, look at the power consumption of a 100GB/sec network card).
GeForce Ampere are much more efficient. But this time Nvidia has competition so they up their game to stay ahead.
 
Reactions: french toast

Glo.

Diamond Member
Apr 25, 2015
5,761
4,666
136
It's 19.5 TFLOPS of IEEE-compliant FP64 via Tensor Core instructions for HPC.
And I see how conveniently the naysayers forget that the A100's 400W TDP includes 12 NVLink links at 600GB/s (double V100), which is already nearly 100W of the TDP... (hint: look at the power consumption of a 100GB/s network card).
GeForce Ampere will be much more efficient. But this time Nvidia has competition, so they have to up their game to stay ahead.
Even their own spec sheet for the GA100 chip shows that FP64 runs at half the FP32 rate.

So no, it's 9.7 TFLOPS FP64.
 

Saylick

Diamond Member
Sep 10, 2012
3,385
7,151
136
It's 19.5 TFLOPS of IEEE-compliant FP64 via Tensor Core instructions for HPC.
And I see how conveniently the naysayers forget that the A100's 400W TDP includes 12 NVLink links at 600GB/s (double V100), which is already nearly 100W of the TDP... (hint: look at the power consumption of a 100GB/s network card).
GeForce Ampere will be much more efficient. But this time Nvidia has competition, so they have to up their game to stay ahead.
Even their own spec sheet for the GA100 chip shows that FP64 runs at half the FP32 rate.

So no, it's 9.7 TFLOPS FP64.
You guys are both right. It just depends on whether or not you want to include the acceleration afforded by the tensor cores: A100 Blogpost

I recall reading somewhere in the past that Volta tensor cores were not separate, discrete execution units but were basically ganged-up FP32 cores doing matrix math; this was suspected because you can't use the FP32 cores in the same clock cycle as the tensor cores. I imagine it's the same with Ampere, except they've added more math formats to the cores, including FP64. Can someone remind me whether the FP64 units are discrete as well, or are they paired-up FP32 units? I am leaning towards the former, in which case the Ampere tensor cores probably gang up the FP32 units to do FP64 math somehow, in conjunction with the dedicated FP64 units.
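
For reference, both headline numbers fall straight out of the published A100 specs (108 SMs enabled, 1410 MHz boost, 64 FP32 and 32 FP64 CUDA cores per SM), so the dispute really is just about whether the tensor-core FP64 path counts:

```python
# Both figures follow from the published A100 specs; FMA counts as 2 FLOPs.

sms, boost_ghz = 108, 1.41

fp32 = sms * 64 * 2 * boost_ghz / 1000       # TFLOPS via FP32 CUDA cores
fp64_cuda = sms * 32 * 2 * boost_ghz / 1000  # half the FP32 rate
fp64_tensor = 2 * fp64_cuda                  # tensor cores double the FP64 rate

print(f"FP32: {fp32:.1f}, FP64 (CUDA): {fp64_cuda:.1f}, "
      f"FP64 (tensor): {fp64_tensor:.1f} TFLOPS")
# FP32: 19.5, FP64 (CUDA): 9.7, FP64 (tensor): 19.5 TFLOPS
```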
 

DXDiag

Member
Nov 12, 2017
165
121
116
Even their own spec sheet for the GA100 chip shows that FP64 runs at half the FP32 rate.
Nope, Tensor FP64 is ~20 TFLOPS, and it covers IEEE FP64 code as well. This is a substantial upgrade for HPC workloads.

And I see how conveniently the naysayers forget that the A100's 400W TDP includes 12 NVLink links at 600GB/s (double V100), which is already nearly 100W of the TDP... (hint: look at the power consumption of a 100GB/s network card).
I explained that above, but the fantasy narrative must be maintained so that Ampere looks less power efficient.

Once more, the A100 is an SXM4 form factor; it ups power to 400W to accommodate 40GB of HBM2 memory, which is probably anywhere from 110W to 125W (16GB of HBM2 consumes 50W). It also accommodates the huge 600GB/s NVLink, while allowing for more stable boost clocks. So in fact the A100 is hugely more power efficient than V100, which had to be pushed to 450W to accommodate just 32GB of HBM2, 300GB/s NVLink, and more stable boost clocks.

None of the usual suspects here understands any of that because they lack the necessary data center knowledge.
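
Tallying the figures claimed above gives a rough picture of the budget; note the HBM2 and NVLink numbers are this post's estimates, not measured values:

```python
# Rough tally of the claimed A100 SXM4 power budget (all estimates).

tdp = 400                    # A100 SXM4 module TDP, in watts
hbm2_40gb = (110 + 125) / 2  # midpoint of the 110-125W estimate above
nvlink = 100                 # ~100W claimed above for 600GB/s NVLink

gpu_logic = tdp - hbm2_40gb - nvlink
print(f"left for GPU logic and everything else: ~{gpu_logic:.1f}W")  # ~182.5W
```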

I recall reading somewhere in the past that Volta tensor cores were not separate, discrete execution units but were basically ganged-up FP32 cores doing matrix math
That's nonsense, they are physically present on the die shots.
 

Krteq

Senior member
May 22, 2015
993
672
136
According to recent rumors, something seems to be really fishy with the GA10x GPUs.

- RTX 3080, as the top mainstream card, based on GA102 (the biggest GPU for the consumer segment)
- Reported power consumption for the top high-end model in the range of 300-375W

So either something went wrong with consumer Ampere GPUs, or NV is overreacting to "Big Navi" for some reason.

... or all those "leaks" are just a bunch of nonsense
 
Last edited:

pepone1234

Member
Jun 20, 2014
36
8
81
Why did Nvidia select Samsung 10nm for this generation of huge-die graphics cards?
I thought TSMC and Nvidia had a better relationship, since TSMC made that 12nm process for Nvidia's graphics cards.
Now that Huawei must stop using TSMC, I don't think there would be a huge problem allocating TSMC capacity on the 7nm process.
 

sontin

Diamond Member
Sep 12, 2011
3,273
149
106
Why are you conflating tensor/matrix multiply with general FMA/FMAC operations? How much tensor throughput do I get if I only have lots of A*B or A+B or A*B+C?

No difference with FMA. And Ampere has a new instruction set: one instruction replaces 8x MMA instructions. With tensor cores, the A100 has 2.5x higher throughput than V100.
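
A sketch of why plain elementwise FMA gains nothing from the tensor cores: a matrix multiply reuses each operand many times, while elementwise a*b+c touches each element exactly once and is bandwidth-bound long before the matrix units could matter. The matrix size n here is an arbitrary example value:

```python
# Arithmetic-intensity gap between matrix multiply (tensor-core territory)
# and elementwise FMA on the same n x n operands.

n = 4096  # hypothetical square-matrix dimension

matmul_flops = 2 * n**3       # n^2 outputs, n multiply-adds each -> heavy reuse
elementwise_flops = 2 * n**2  # one multiply-add per element -> no reuse

print(matmul_flops // elementwise_flops)  # 4096x more math per element loaded
```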
 