The importance of PCIe bandwidth for GPUs is greatly exaggerated. Knowing that the theoretical bandwidth of PCIe 3.0 x4 equals that of PCIe 1.0 x16, that for single-card performance it makes no real difference, and that an m.2 GPU would be size/heat constrained anyway, I think there could be a niche within a niche somewhere.
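For reference, here's the back-of-envelope maths on that bandwidth equivalence (the per-lane rates are the commonly quoted theoretical figures, so treat this as approximate):

```python
# PCIe per-lane throughput, one direction:
# PCIe 1.0: 2.5 GT/s with 8b/10b encoding  -> 250 MB/s per lane
# PCIe 3.0: 8 GT/s with 128b/130b encoding -> ~985 MB/s per lane
pcie_1_x16 = 250 * 16  # 4000 MB/s
pcie_3_x4 = 985 * 4    # 3940 MB/s
print(pcie_1_x16, pcie_3_x4)  # near enough identical
```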
My point exactly. Also, given that this would be for relatively low power cards (<50W), you're not talking about the kind of power where PCIe bandwidth matters much anyway. Laptop gaming at >1080p is as yet more or less a pipe dream (it exists, but at massive cost and with little to no upgradeability, i.e. no longevity). What I'm proposing here is a compact solution for ~1080p gaming in small form factor devices.
Let's use the GTX 980m that was brought up before as a baseline for a thought experiment. And yes, I'm entirely unqualified for this. But I'll give it a go anyway.
The 980m performs at a certain level - more than enough for gaming at 1080p. It does that at ~100W average power (given that it's a Maxwell card, I suspect it spikes far higher than that, but drops far lower as well).
What's a reasonable expectation for power reductions moving from 28nm to 16nm? Not a 50% drop, obviously, but quite a lot. After all, Intel dropped the TDP of their mainstream laptop CPUs from M class (35W) to U class (17W) moving from 32nm to 22nm, only sacrificing base clocks slightly. Let's say this move reduces GPU power consumption by 25%. That gives us 100% performance at 75W average power.
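In numbers (the 25% figure is purely my assumption, deliberately more modest than Intel's cut):

```python
# Intel's M -> U move for comparison, then my assumed GPU shrink benefit.
intel_tdp_drop = 1 - 17 / 35  # ~0.51, i.e. a 51% TDP cut
gpu_power_28nm = 100          # W, rough GTX 980m average from above
assumed_drop = 0.25           # my guess for the 28nm -> 16nm move
gpu_power_16nm = gpu_power_28nm * (1 - assumed_drop)  # 75 W, same performance
print(intel_tdp_drop, gpu_power_16nm)
```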
Now, let's talk about that 400mm2 GPU. If area scaling worked like simple maths, moving from 28nm to 16nm would turn a 20x20mm chip into a roughly 12x12mm chip. Of course, it's not that simple. But would I be wrong in thinking that something like 16x16mm would be doable? After all, Samsung's Exynos SoCs shrank by 40% moving from 28nm to 14nm (http://www.anandtech.com/show/9330/exynos-7420-deep-dive) - and while a straight comparison is impossible, isn't that a reasonable pointer? After all, that shrink happened while moving to "larger" and more complex CPU cores.
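Spelling out the "simple maths" next to my hedged 16x16mm guess (node names are marketing more than physics, so the ideal number is only an upper bound on the shrink):

```python
# Ideal scaling: linear dimensions shrink by 16/28, area by the square of that.
side_28nm = 20.0                  # mm, for a 400 mm^2 die
ideal_side = side_28nm * 16 / 28  # ~11.4 mm if scaling were perfect
ideal_area = ideal_side ** 2      # ~131 mm^2
guessed_area = 16 ** 2            # 256 mm^2, my more conservative 16x16mm guess
shrink = 1 - guessed_area / side_28nm ** 2  # ~0.36, close to Exynos' ~40%
print(round(ideal_side, 1), round(ideal_area), round(shrink, 2))
```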
The GTX 980m has 1536 cores and runs at "1038 + boost" MHz. For additional area and power savings, let's drop that down to, say, 1280 cores and 900 MHz + boost. (This would of course shrink the die even further.) That's a 17% drop in cores and a 13% drop in clock speed. Given that power doesn't scale linearly with clock speed, along with architectural improvements, power usage should drop far more than performance from this. I'd go out on a limb and say you could get perhaps 80% of GTX 980m performance at 50W of power.
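Here's a crude sanity check on that limb, assuming dynamic power scales roughly as cores x frequency x voltage squared, and that voltage can drop in step with frequency (both sizeable assumptions):

```python
# Crude dynamic-power model: P ~ cores * f * V^2, with V assumed to track f.
cores_scale = 1280 / 1536               # ~0.83 (the 17% core cut)
clock_scale = 900 / 1038                # ~0.87 (the 13% clock cut)
perf_scale = cores_scale * clock_scale  # ~0.72, ignoring architectural gains
power_scale = cores_scale * clock_scale ** 3  # ~0.54 under the f * V^2 assumption
print(round(perf_scale, 2), round(75 * power_scale))  # ~0.72x perf at ~41 W
```

That lands in the same ballpark as my 80%-at-50W guess, if you let architectural improvements claw back some of the performance.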
Now, add in the reduced power usage of HBM. According to Anandtech (http://www.anandtech.com/show/9266/amd-hbm-deep-dive/4), four stacks of HBM1 should use ~14W. A single stack of HBM2, even at 4GB instead of 1GB, should use less power than that. Anandtech's figures also show that to be roughly half the VRAM power usage of the Titan X and the R9 290X, both of which use GDDR5. As the 980m has fewer GDDR5 chips running slower and on a narrower bus than those cards, it uses less power for RAM, but a 10W drop still doesn't seem unreasonable to me. All the while, the bus width would quadruple to 1024 bits and effective bandwidth would increase to 256GB/s (numbers from this news post about HBM2: http://www.extremetech.com/gaming/2...specifics-up-to-16gb-of-vram-1tb-of-bandwidth). Heck, you could even squeeze an 8GB stack in there if you really wanted to, though bandwidth would stay the same and power usage would increase. Also, as Anandtech noted, HBM saves area by requiring less complex power delivery to the RAM chips.
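For what it's worth, the bandwidth numbers check out against the per-pin rates those articles quote (2 Gbps/pin is the headline HBM2 figure; the 980m comparison assumes its stock 5 Gbps GDDR5 on a 256-bit bus):

```python
# Single HBM2 stack vs. the GTX 980m's GDDR5 setup.
hbm2_bw = 1024 * 2.0 / 8   # 1024-bit bus at 2 Gbps/pin -> 256 GB/s
gddr5_bw = 256 * 5.0 / 8   # 256-bit bus at 5 Gbps      -> 160 GB/s
print(hbm2_bw, gddr5_bw)   # a 60% bandwidth increase on top of the power savings
```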
Put all of this together, and I'm pretty sure you'd be able to get 80% of GTX 980m performance at half the power usage, with significant area savings to boot.
How significant are those area savings?
[Image: memory chips, GPU and memory traces in red; rough outline of an m.2 22110 card in green for comparison.]
We now have a 16x16mm GPU. HBM1 stacks are 5x7mm (I haven't been able to find any information on HBM2 die size, but I'd guess it grows at least somewhat - a 4x density increase sounds unlikely, no?).
Now I'm speculating quite a bit, but I'd guess a lot of the package area outside the die goes to routing traces for the VRAM. If that's right, the package should shrink quite a lot with HBM. Narrowing the external interface to PCIe 3.0 x4 should shrink it further. And, for small form factors like this, why not integrate the substrate into the board itself, i.e. attach the interposer directly to the PCB? I'd guess this would have to be done at the chip manufacturing plant, but would it be any harder than what they're doing today? Using HBM takes the choice of RAM chips away from board OEMs anyway, so why not integrate production accordingly?
Would making the package 20x40mm be impossible? If not, this leaves more than half the length of an m.2 22110 card for power delivery.
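The length budget, assuming my 20x40mm package guess holds:

```python
# m.2 22110 = 22 mm wide, 110 mm long.
card_length, package_length = 110, 40         # mm
power_section = card_length - package_length  # 70 mm left for power delivery
print(power_section, round(power_section / card_length, 2))  # 70 mm, ~64%
```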
I'd also suggest standardizing "external" 12V power connectors for modules like this (pads or some form of connector at the far end of the card?), removing the limitation of the m.2 connector's 3.3V power supply. This would also make for a very tidy board layout: power pins -> power circuitry -> chip -> m.2 connector. It also removes the need for routing large power traces under/around the chip.
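The current draw is what makes the 3.3V rail a non-starter at these power levels (50W being my estimate from above):

```python
# Current needed to deliver ~50 W at each voltage.
power = 50              # W, the target from the estimates above
amps_3v3 = power / 3.3  # ~15.2 A -- far more than m.2 pins can sensibly carry
amps_12v = power / 12   # ~4.2 A  -- much more manageable
print(round(amps_3v3, 1), round(amps_12v, 1))
```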
I know I'm probably wrong in at least half of the assumptions and estimations I make in this post. But is it really such a stretch?
Tl;dr: Wild, more or less unfounded speculations resulting in ~80% of GTX 980m performance at 50W in an m.2 card.