Linus Torvalds: Discrete GPUs are going away

AtenRa · Jul 8, 2014

Lets take it from the start, today.

Today Intels fastest iGPU is the Iris pro 5200. The Quad Core + GT3 die size is 264mm^2 of which 90-100mm^2 is the iGPU cores. The GT3e is close to NVIDIAs GT640 in performance (118mm^2 including 128bit memory controller).

Now, there will be no new Iris Pro in 2014 and Broadwell Iris Pro will arrive in Q2-Q3 2015.

If we double the iGPU to 80 EUs at 14nm the die size of a Quad Core will be close to 200mm^2. Now I highly doubt Broadwell Iris Pro with 80 EUs will even match the GT650Ti. So even in 2015 the GT-750Ti will be faster and cheaper.

In 2016 NVIDIA will start to release 16nm FF products. They will be able to have a 70mm^2 dies with GT-750Ti performance and 50% less power consumption. They will also be able to have 100-110mm^2 dies with GTX-760 performance and 50% less power consumption.

In 2016 Intel will still be at 14nm with Skylake. Even if they will raise Iris Pro to 120EUs and die size to 260mm^2 , they will at best match the GT750 performance with an enormous die size(iGPU) of more than 140-150mm^2.

At the same time NV will have a 70mm^2 die as an entry level Chip at 35W TDP that will still be a little faster than Skylake but much much cheaper to produce. Also in the $100+ segment they will have a 100-110mm^2 die with 70-80W TDP with 50-60% more performance than GT-750Ti or Skylake.

Things will continue to be like that in the future but iGPUs will never be able to reach dGPU performance above the 100W TDP unless they will produce dies like the ones in consoles of more than 100W. But I highly doubt Intel or AMD will ever produce that big dies for the consumer market.

_Rick_ · Jul 8, 2014

njdevilsfan87 said:
Kepler is ahead in parallel performance compared to Knight's Landing. And, Maxwell's feature is going to be to give it the same "functionality" as Knight's Landing (as in be able to run by itself without the need of a CPU). But Xeon Phi would probably funciton better as the "all around device" as its has far less cores, thus each core would handle the generic single threaded tasks much better.

For Nvidia, their concerns are
1) Maintain their performance lead over Intel
2) Make CUDA easier to use (one of the advantages of Knight's Landing will be ease of use). It's already really easy to setup, but needs some work in the coding, which is what the current CUDA 6.X is already working on.
3) Look into building one super strong core alongside everything else in a future architecture if we are indeed heading toward these running entire systems.

Nvidia as of right now seems very well off here. Intel is still obviously at the end of the R&D phase and has plenty of room to improve, and likely will. I really have no idea where AMD is falling into all of this. Their cards are certainly compute capable, but they rely too much on OpenCL which is not an advantage when CUDA itself doesn't cost anything. AMD needs to offer something as robust as CUDA, or officially support OpenCL (for example "OpenCL Toolkit") or offer some other incentive to use their GPUs over Nvidia or Intel, because pricing won't cut it in professional development where all of these GPUs are aiming.

The thing is, nVidia is losing out on the platform side. Even in an HPC compute node, the CPU will remain king, and the way it can schedule/distribute/collect calculations is what ultimately affects how far you can scale.
Scalability is important, because it allows potential better performance/money at the different HPC performance points that the vendors/integrators/clients aim for. Having a native GPU-CPU interconnect with shared memory etc. puts AMD and Intel ahead. Especially socketed GPUs and multi-socket nodes where a client can spec how many CPUs and GPUs and memory they want are a game-changer, where nVidia will have trouble competing, since their GPU will forever be a second class citizen, unless someone licenses their fabric protocol, or they develop their own platform from scratch. But since they lack any form of CPU credentials, this is unlikely to happen, except in extreme cases, where a high-end ARM core is good enough to run the IO/OS/interconnect. Or they put a separate external port on the cards and run their fabric in a side channel Probably too expensive, and you still need enough integer computing power to actually do all the scheduling, problem pre-treatment etc.

Essentially, your point 3 is where I have little confidence, that nVidia can pull it off, seeing that even AMD isn't able to, and they were in that game for decades, while nV has done CPUs for barely five years, and hasn't had any products that were really what the market wanted.

I'm not too worried about front-end (CUDA, OpenCL) development at this point, since it's bound to be flawed, since the backend is only now barely getting to the point, where hybrid computing becomes relevant. Ideally this will be solved by vector libraries in the big compilers, much like it is done for SSE these days. Maybe Intel could even benefit from the Itanium experience here, since a wrapped VLIW approach might actually work pretty well to effectively encode the hybrid instructions.
AMD hopefully is working on a dev platform at the same time, as they are working on the hybridization of their iGPU. If they can pull it off directly in hardware, and just dispatch x86 instructions to shaders, then they've got the holy grail, and their APUs may at last become interesting, purely because the weak x86 part doesn't matter as much anymore.

Currently, with CUDA nVidia are still in good health, but they need to work hard to maintain that advantage, both on the software and the hardware side.

On the other hand, I'm not sure this has a lot of impact (beside cost) on the video card market. But maybe the iGPU in the CPU will become useful for people running video cards. Just think back to the likes of PhysX. With properly hybridized CPUs, we could unlock quite a bit of performance/eye-candy for gaming.

witeken · Jul 8, 2014

njdevilsfan87 said:
Kepler is ahead in parallel performance compared to Knight's Landing.

For Nvidia, their concerns are
1) Maintain their performance lead over Intel

Knights Landing has a DP performance >3TFLOPS, while Nvidia's is half of that.

njdevilsfan87 · Jul 8, 2014

witeken said:
Knights Landing has a DP performance >3TFLOPS, while Nvidia's is half of that.

http://www.intel.com/content/dam/ww...xeon-phi-product-family-performance-brief.pdf

7120X = 1.2 TFLOPS of performance for DP. K40 is 1.4 TFLOPS and Titan Black is 1.7TFLOPS (due to running at a higher clock than K40). Where the is >3TFLOPS for Intel figure coming from?

AFAIK, right now Nvidia is slightly better for DP, and a lot better for SP. And SP does matter too because it could cut down development time substantially. If SP gives end results that are close enough, one could develop using SP and then just run final production using DP.

f1sherman · Jul 8, 2014

njdevilsfan87 said:
http://www.intel.com/content/dam/ww...xeon-phi-product-family-performance-brief.pdf

7120X = 1.2 TFLOPs of performance for DP. Where is the greater than 3 figure coming from?

From the future.

Notice that everyone mentioning Knights Landing 3 TFLOPS DP compare it with 2012 K20x.

It's getting pretty old by now. Comparing theoretical DP peak of future with yesteryear chips :yawn:

witeken · Jul 8, 2014

AtenRa said:
Lets take it from the start, today.

Today Intels fastest iGPU is the Iris pro 5200. The Quad Core + GT3 die size is 264mm^2 of which 90-100mm^2 is the iGPU cores. The GT3e is close to NVIDIAs GT640 in performance (118mm^2 including 128bit memory controller).

Now, there will be no new Iris Pro in 2014 and Broadwell Iris Pro will arrive in Q2-Q3 2015.

If we double the iGPU to 80 EUs at 14nm the die size of a Quad Core will be close to 200mm^2. Now I highly doubt Broadwell Iris Pro with 80 EUs will even match the GT650Ti. So even in 2015 the GT-750Ti will be faster and cheaper.

In 2016 NVIDIA will start to release 16nm FF products. They will be able to have a 70mm^2 dies with GT-750Ti performance and 50% less power consumption. They will also be able to have 100-110mm^2 dies with GTX-760 performance and 50% less power consumption.

In 2016 Intel will still be at 14nm with Skylake. Even if they will raise Iris Pro to 120EUs and die size to 260mm^2 , they will at best match the GT750 performance with an enormous die size(iGPU) of more than 140-150mm^2.

At the same time NV will have a 70mm^2 die as an entry level Chip at 35W TDP that will still be a little faster than Skylake but much much cheaper to produce. Also in the $100+ segment they will have a 100-110mm^2 die with 70-80W TDP with 50-60% more performance than GT-750Ti or Skylake.

Things will continue to be like that in the future but iGPUs will never be able to reach dGPU performance above the 100W TDP unless they will produce dies like the ones in consoles of more than 100W. But I highly doubt Intel or AMD will ever produce that big dies for the consumer market.

Your estimates are way off. All estimates about any upcoming IGP of Intel are doomed to fail because we know literally nothing about Gen8, even less about Gen9 or 10.

We know that Gen8 will be a huge improvement. Does that mean 40% faster? Twice as fast (per EU)? Even more? And what about power consumption, is it a 2x improvement like Maxwell or more or less?
We also know that every Broadwell GT will have 20% more EUs, so that would be 96 instead of 80 (although there won't be a GT4 Broadwell SKU). And lastly we know that 14nm will be something like 1.4x more efficient and ~2.2x more dense.

So we know 3 things for sure but 1 thing that could totally change all estimates. You can use as much bold text as much as you want, but that won't make it real, and I can confidently say it won't be.

Here are some corrections:

*20 Haswell EUs weigh in at 260-177 = 83mm² at most. A short calculation shows that the 20 extra EUs are 22% of the full die, which means 58mm² for 20EUs.

Everything we can now do is make a Gen7 IGP with 120EUs which comes in at 160mm² at the 14nm process. This GPU has 1920 GFLOPS at 1GHZ, comparable to the GTX 760.

*However, we can't comments on power, performance and die area because we don't know how much Gen9 will change compared to Gen7.5 in performance/watt, performance/area, performance/EU and area/EU.

*You now have to try to calculate Maxwell performance, area and power consumption. Good luck, but I'm sure you will come up with wrong answers. Your 50% lower power consumption is quite optimistic and not backed up by real world products.

*Even if it's true, your Maxwell calculations are realistic and you somehow magically managed to get a good estimate of Gen9 performance, price, area and power:

Congratulations, you've now compared a Q2 2015 product to a product that will be released somewhere in 2016. It will compete against an Intel 10nm Gen10 architecture. So even if Gen10 is still 2X less efficient than Maxwell (because your FinFETs use 50% as much energy), the transistor innovation of 10nm (Germanium and/or III-V) will surely make up for that, if that's still necessary because you forgot the improvements of 14nm.
And even if for some magical reason Intel is still behind, Intel has loads more die area available because of another 2x higher density than 14nm, which was already ahead anyway.

TL;DR: Any estimate you see or make will be wrong, for sure. The best thing possible is to compare the performance, power, area, transistor cost and release schedule of the process nodes, and neglecting any difference in architecture or other uncertainties. Such estimates, thanks to Occam's Razor, are for more reliable.

witeken · Jul 8, 2014

njdevilsfan87 said:
http://www.intel.com/content/dam/ww...xeon-phi-product-family-performance-brief.pdf

7120X = 1.2 TFLOPS of performance for DP. K40 is 1.4 TFLOPS and Titan Black is 1.7TFLOPS (due to running at a higher clock than K40). Where the is >3TFLOPS for Intel figure coming from?

AFAIK, right now Nvidia is slightly better for DP, and a lot better for SP. And SP does matter too because it could cut down development time substantially. If SP gives end results that are close enough, one could develop using SP and then just run final production using DP.

Did you miss the recent news?

Intels "Knights Landing" Xeon Phi Coprocessor Detailed

ShintaiDK · Jul 8, 2014

njdevilsfan87 said:
AFAIK, right now Nvidia is slightly better for DP, and a lot better for SP. And SP does matter too because it could cut down development time substantially. If SP gives end results that are close enough, one could develop using SP and then just run final production using DP.

Be careful to compare flops for example. A flop doesnt equal a flop. A K40 for example could need say 4 flosp to calculate the same that a Xeon Phi only uses 2 flops on. A classic example is the bitmining with nVidia vs AMD due to extra instructions.

SP isnt used much.

Anyway, we can see the Xeon Phi is moving forward fast in the HPC crowd.
http://www.top500.org/list/2014/06/

njdevilsfan87 · Jul 8, 2014

witeken said:
Did you miss the recent news?

Intels "Knights Landing" Xeon Phi Coprocessor Detailed

I did miss it. However, important to note that Nvidia's road map shows a performance per watt of 15 GFLOPS per watt on Maxwell, versus 5 on Kepler. If that holds true, Nvidia could be at 3.75 TFLOPS next year with their high end Maxwell parts, and the performance gap will be similar to what it is today.

Regardless, for Intel to be coming along like this is very impressive. I think it's fantastic they are.

IHateMyJob2004 · Jul 8, 2014

Does anyone remember math coprocessors?

The same thing is going to happen to GPUs. they will get integrated into the CPU.

witeken · Jul 8, 2014

Roadmap:

Not really a 3X improvement.

f1sherman · Jul 8, 2014

witeken said:
Roadmap:

Not really a 3X improvement.

So back there you missed only what... couple of hundred %

Not to mention that NV roadmap is not even DGEMM

witeken · Jul 8, 2014

Double precision roadmap:

Is the same.

Exophase · Jul 8, 2014

ShintaiDK said:
Be careful to compare flops for example. A flop doesnt equal a flop. A K40 for example could need say 4 flosp to calculate the same that a Xeon Phi only uses 2 flops on. A classic example is the bitmining with nVidia vs AMD due to extra instructions.

For the most part today, a FLOP represents the same thing across the board: part of an FMA. Most FP workloads are dominated by FADD/FSUB, FMUL, and FMA/FMS. There are special units to acceleration FDIV, FSQRT, or transcendental function calculation, but if these dominate your workload you're probably better off calculating them with Newton/Raphson iteration or series approximations, which use FMAs.

Bitcoin mining doesn't even use FLOPs at all, it uses integer instructions, mainly shifts and XORs.

ams23 · Jul 8, 2014

witeken said:
Roadmap:

Not really a 3X improvement.

The graph you linked to here is single precision normalized measured GFLOPS per watt. That has nothing to do with double precision peak theoretical GFLOPS per watt.

According to the peak theoretical DP GFLOPS per watt roadmap, Maxwell will have anywhere between 8-16 DP GFLOPS per watt. Realistically this means that we should expect somewhere ~ 12-14 DP GFLOPS per watt at either end of 2014 or sometime in 2015 (a previous non-logarithmic graphical version even listed Maxwell as up to 15 DP GFLOPS per watt). Using simple math, this means that theoretical DP throughput will easily be over 3000 GFLOPS throughput within ~ 215-250w power envelope.

Maxwell may not achieve exactly 3x improvement in DP perf. per watt vs. Kepler, but it may be in the 2-2.5x improvement range based on these graphs, which is more than enough to achieve > 3 TFLOPS theoretical DP throughput.

AtenRa · Jul 8, 2014

witeken said:
Your estimates are way off.

Your 50% lower power consumption is quite optimistic and not backed up by real world products.

TL;DR: Any estimate you see or make will be wrong, for sure.

40nm to 28nm gave ~2x density and 40% or more lower power consumption.

From 40nm HD6970 of 389mm^2 and 250W TDP we have 28nm HD7870 of 212mm^2 and 190W TDP. Not only that but HD7870 is faster than HD6970.

28nm to 16nm FF will bring ~2x density and 55% lower power consumption.

That means from 28nm HD7870 of 212mm^2 and 190W TDP we can go down to ~110mm^2 and ~80-100W TDP.

And that is without even consindering the Architectural gains in performance and efficiency.

So, at 16nm AMD will be able to have a HD7870 performance or higher at 100-110mm^2 and NVIDIA will have GTX660 performance or higher at the same 100-110mm^2.

Do you actually believe that Intel will have the same performance at 110mm^2 space at 14nm ??? Not a chance in Hell.

witeken · Jul 9, 2014

AtenRa said:
Do you actually believe that Intel will have the same performance at 110mm^2 space at 14nm ??? Not a chance in Hell.

Yes, 14nm will be 1.5x more dense (1.3x vs FF+) and consume ~1.4x less power. Intel's Gen8 must be substantially worse (like >3x worse performance/watt for the pure architecture) to be unable to compete.

Edit: don't forget that FinFETs are best for low performance SoCs, so I expect the benefits of FF for 300W GPUs to be substantially less.

Cookie Monster · Jul 9, 2014

ShintaiDK said:
There is nothing pointing to that dGPUs get ARM chips.

Also the fastest supercomputer uses Xeon Phi. And if you check the list, you see the Xeon Phi gaining momentum rather quickly.

Yes gaining momentum mainly because their intel but they have butt load of work ahead of them. The Tesla stronghold is not easily crack-able especially when the current Phi offerings offer less performance/W (just look at the top500 list) while using a process node one generation ahead of their competition.. This is on top of a solid 5 year CUDA eco-system that I have admit is pretty impressive given the timeframe. A K40 on intel's 22nm would blow the doors off any Xeon Phi available and this is a fact.

Heres one review comparing a K20X to a Xeon Phi 7120P
http://blog.xcelerit.com/intel-xeon-phi-vs-nvidia-tesla-gpu/

And the K40 is about 20% faster.

itsmydamnation · Jul 9, 2014

witeken said:
Edit: don't forget that FinFETs are best for low performance SoCs, so I expect the benefits of FF for 300W GPUs to be substantially less.

thats incorrect. finfets operate better at lower voltages. GPU's run at lower voltages but are very wide thus the low clocks speeds ( 1ghz) with high throughput and high total power consumption.

NTMBK · Jul 9, 2014

ShintaiDK said:
There is nothing pointing to that dGPUs get ARM chips.

That was the original Denver dream:

As you may have seen, NVIDIA announced today that it is developing high-performance ARM-based CPUs designed to power future products ranging from personal computers to servers and supercomputers.

Known under the internal codename Project Denver, this initiative features an NVIDIA CPU running the ARM instruction set, which will be fully integrated on the same chip as the NVIDIA GPU.

http://blogs.nvidia.com/blog/2011/01/05/project-denver-processor-to-usher-in-new-era-of-computing/

Integrated CPU and GPU for supercomputers. A standalone Tesla driven by Denver cores.

witeken · Jul 9, 2014

But they run nowhere near the voltages concerned in your graph.

itsmydamnation · Jul 9, 2014

they can if designed to. why pay a price now for no benefit until finfet.

AtenRa · Jul 9, 2014

witeken said:
Yes, 14nm will be 1.5x more dense (1.3x vs FF+) and consume ~1.4x less power. Intel's Gen8 must be substantially worse (like >3x worse performance/watt for the pure architecture) to be unable to compete.

14nm will have 2.2x density and up to 30% lower power consumption at the same performance as 22nm, according to Intel.

So lets say you can double the EUs of the Haswell GT3 to 80-90EUs for Broadwell GT3 and keep the iGPU die size the same at 100-110mm^2. Do you actually believe a 80-90 EU Broadwell GT3 (Gen 8) even with 128mb eDRAM will be able to match the HD7870 or GTX660 ??
Broadwell GT3e will not even reach GT-650Ti levels of performance. GT-750ti will still be way faster and cheaper.

witeken said:
Edit: don't forget that FinFETs are best for low performance SoCs, so I expect the benefits of FF for 300W GPUs to be substantially less.

FinFet highest efficiency is sub 1Volt, that is where dGPUs can also operate. TDP or Wattage has nothing to do with the FinFets highest efficiency curve. You can have a 300W TDP dGPU operating at sub 1Volt .

witeken · Jul 9, 2014

AtenRa said:
14nm will have 2.2x density and up to 30% lower power consumption at the same performance as 22nm, according to Intel.

So lets say you can double the EUs of the Haswell GT3 to 80-90EUs for Broadwell GT3 and keep the iGPU die size the same at 100-110mm^2. Do you actually believe a 80-90 EU Broadwell GT3 (Gen 8) even with 128mb eDRAM will be able to match the HD7870 or GTX660 ??
Broadwell GT3e will not even reach GT-650Ti levels of performance. GT-750ti will still be way faster and cheaper.

If Intel borrows Maxwell architecture and puts it on their 14nm process, it will easily outperform Nvidia's 16FF Maxwell. The question really is how much worse is Gen8/9 in comparison to Maxwell (performance/watt, performance/area)? If it's as good as Maxwell or slightly worse (up to ~2x), Broadwell/Skylake will be able to compete in those 2 areas (because 14nm gives Intel extra headroom). If Gen8/9 is even worse, than you'll be right that Maxwell will be better.

Except for time to market.

But like I said, we don't know anything about Gen8/9's competitiveness, even less about pricing (GTX 750 Ti is $150).
Broadwell won't have 80-90EUs. It will have 48EUs, so the total die area of the APU with the uncore, 8MB of cache and 4 cores will be well within 130mm² (GTX 750 Ti alone is already 148mm²).

FinFet highest efficiency is sub 1Volt, that is where dGPUs can also operate. TDP or Wattage has nothing to do with the FinFets highest efficiency curve. You can have a 300W TDP dGPU operating at sub 1Volt .

It's possible, but the die area would increase significantly.

itsmydamnation · Jul 9, 2014

witeken said:
It's possible, but the die area would increase significantly.

thats your assumption, simpler circuits, higher densities , different trade offs in "storage" vs compute etc. complex devices are never defined by one metric. Have you every tried to undervolting a GPU? all mine have pretty big head room downwards, likely about maximizing yields, so even changes in binning could have a big impact.

pretty much all amd coin miners undervolted for example.
also generally speaking your going to use less voltage on the smaller nodes anyway.

so add all the possibilities up......

edit: you will find everyone's fav andrew from intel talking about how removing complexity and having a drop in efficiency can and does end up being more power efficient for the same throughput in many cases. lots of things to tweak......

Linus Torvalds: Discrete GPUs are going away

Lifer

Diamond Member

Diamond Member

Platinum Member

Platinum Member

Diamond Member

Diamond Member

Lifer

Platinum Member

Lifer

Diamond Member

Platinum Member

Diamond Member

Diamond Member

Senior member

Lifer

Diamond Member

Diamond Member

Diamond Member

Lifer

Diamond Member

Diamond Member

Lifer

Diamond Member

Diamond Member