Yeah, Arm is obviously cheaper for hyperscalers.
You mean to say doing your own designs and having them fabbed to get exactly what you need plus cut out the middleman's profit (whether Intel, AMD, or Ampere) is cheaper for hyperscalers.
Haha. Yes, Doug.
Correct. But this is extremely pedantic, because (and even as a shill I will say this) merchant Arm servers are a joke, and I don’t anticipate that changing.
Ampere indeed is a meme.
Also Oracle. But that's exactly because the hyperscalers are able to do everything the merchant ARM vendors could do. That leaves Ampere etc. chasing smaller customers who aren't big enough for it to be worth it to roll their own, but are big enough that the savings from switching from "industry standard" x86 to ARM are worth the hassle. It's hard to make a living in a niche that's squeezed on both sides.
If that diagram represents an accurate accounting of silicon spent on NPUs, it augurs bad tidings for mobile GPUs going forward.
Samsung is also poised to put their own chips in PCs next year.
According to rumours, there will be two versions of their flagship Exynos 2500 chip. The 2500-A and 2500-B.
The 2500-B is the version intended for Windows AI PCs. It will most likely be used in Samsung's own Galaxy Books.
Sure. But it's also that the advantages Arm (mobile) vendors that traditionally focus on low-power fabrics have in a PC (notice it's mostly laptops) don't seem to be as big a deal in servers, for obvious reasons. With hyperscalers, though, they could literally use the exact same core and design as AMD/Intel and they'd still save money from cutting out a middleman. I mean, power is still a big deal for TCO, but not in the same way.
Not at all. Lol.
And that's trashcan-level NPU. New NV NPU block does 130 TOPS/W/nm2 on N3P for the basic CoPilot redacted app and AI PC moniker. Serious AI will be done on Blackwell Tensor.
The 34 TOPS NPU in Apple A17 uses only 5 mm²
130 TOPS per Watt per "nm²"?
So are you talking about Nvidia's upcoming AI PC SoC, or their Blackwell GPU?
I was talking about the next Nvidia N3P SoC for these new shiny Microsoft AI PCs.
nanometer squared??
Test chip was on TSMC 5nm and reached nearly 100 TOPS/W (see attached slides). On next N3P SoC, efficiency should be 30% higher, thus my 130 TOPS/W number. So this was correct.
But I remembered wrong on the area. The test chip is 0.153 mm² (still on 5nm) without the interconnect. I expect double the area for the complete block. But the final SoC is on N3P so it should be smaller. The real number is difficult to estimate, but NV will dedicate less than 3 mm² for the NPU (with local cache) for 300~500 TOPS (I don't have the final number) and less than 5W. By the way, for the story, this VS-Quant INT4 DL NPU test chip is present in a corner of each NVSwitch silicon.
That's insane numbers.
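A quick back-of-envelope check of the figures above (this is just my arithmetic; the 100 TOPS/W, 30% N3P gain, 0.153 mm² test chip, and "double for the full block" numbers are taken from the post at face value):

```python
# Back-of-envelope check of the NPU numbers quoted above.

n5_tops_per_w = 100            # test chip efficiency on TSMC 5nm (per the post)
n3p_gain = 1.30                # claimed N5 -> N3P efficiency improvement

n3p_tops_per_w = n5_tops_per_w * n3p_gain
print(f"Projected N3P efficiency: {n3p_tops_per_w:.0f} TOPS/W")  # → 130

full_block_mm2 = 0.153 * 2     # test chip area doubled for interconnect etc.
print(f"Estimated full block on N5: ~{full_block_mm2:.2f} mm²")  # → ~0.31

# For scale, the Apple A17 NPU quoted upthread: 34 TOPS in 5 mm²
print(f"A17 NPU density: {34 / 5:.1f} TOPS/mm²")                 # → 6.8
```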
All in all, the important fact is that Nvidia will bring an NPU magnitudes faster and more power-efficient than anything on the market right now (and I expect that to still be true against the 2025-26 competition). It was one of the points that put Microsoft on notice to make Nvidia the next preferred vendor for AI PCs on ARM.
How are they going to feed it?
LPDDR6 to the rescue?
INT4 needs very low bandwidth, but yes, it's LPDDR6 on package.
Do you have any numbers for the NPU efficiency of competitors (Intel/AMD/Qualcomm) ?
Okay, this is something I have been wondering for a long time.
Ampere don't really make desktop systems.
Still, these are far behind the numbers you are quoting for Nvidia, which sounds unbelievable. 300 TOPS in 3 mm²!?
He's reading the graph and paper wrong. He's taking the efficiency at Vmin, which is always going to be the most efficient point in operation, and applying that to the whole operating range.
No, he's very intentionally shilling.
34 TOPS of FP16/BF16
Apple used INT8 on A17 Pro and M4.
M3 is FP16.
The classic story of the half-empty glass...
Due to the low frequency at that point, you'd need somewhere around 30 mm² of silicon to get to 50 TOPS. Going by another table in the paper, they're at 11.7 TOPS/mm² at Vmax (INT8), so that puts them at ~4 mm² to get 45 TOPS, not including interconnect.
So potentially quite good, but nowhere close to the 300 TOPS in 3 mm² figure he cites.
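To make the gap concrete (my calculator work only, using the Vmax density as quoted from the paper):

```python
# Checking the area arithmetic above with the paper's quoted Vmax density.

vmax_density = 11.7                 # TOPS/mm² at Vmax, INT8 (as quoted)
print(f"{45 / vmax_density:.1f} mm² for 45 TOPS")        # → 3.8 mm²

# The contested claim of 300 TOPS in 3 mm² would imply:
implied = 300 / 3                   # TOPS/mm²
print(f"Implied density: {implied:.0f} TOPS/mm², "
      f"{implied / vmax_density:.1f}x the paper's Vmax figure")
```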
No, Windows on ARM PCs aren't likely to feature Thunderbolt ports, but that does not mean you can't use Thunderbolt accessories.
No way.
The new wave of Windows on ARM Copilot+ PCs may seem at risk of losing access to Thunderbolt because of this, but we've already seen new Windows on ARM PCs boasting USB4 ports, which can do all the same things in theory.
The issue with USB4 is that OEMs might put in the lowest-spec version of USB4 and call it a day. Even if it is branded as 40 Gbps USB4, it may lack features like PCIe tunneling or DisplayPort alt mode, which are mandatory in the Thunderbolt 4 specification.
Think of it this way: having access to TB would not have stopped OEMs from using the lowest USB4 implementation. As always, it's up to them to bring the appropriate connectivity based on their asking price.
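A tiny illustration of the point: "USB4" on a spec sheet does not guarantee the capabilities Thunderbolt 4 makes mandatory. The feature sets below are my rough summary of the public USB4/TB4 requirements, not a quote from either spec, and the "budget port" is a hypothetical lowest-spec example.

```python
# "USB4" branding vs. Thunderbolt 4's mandatory capability set.
# Feature lists are a rough summary (assumption), not spec text.

TB4_MANDATORY = {"40 Gbps", "PCIe tunneling", "DisplayPort tunneling"}

def missing_vs_tb4(usb4_port: set) -> list:
    """List TB4-mandatory capabilities a given USB4 port lacks."""
    return sorted(TB4_MANDATORY - usb4_port)

# Hypothetical lowest-spec USB4 port an OEM might ship:
budget_port = {"20 Gbps", "DisplayPort tunneling"}
print(missing_vs_tb4(budget_port))   # → ['40 Gbps', 'PCIe tunneling']
```

So a buyer checking only for the "USB4" label could still end up without eGPU-style PCIe devices working at all.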