Yeah, Arm is obviously cheaper for hyperscalers.
You mean to say doing your own designs and having them fabbed to get exactly what you need plus cut out the middleman's profit (whether Intel, AMD, or Ampere) is cheaper for hyperscalers.
Haha. Yes, Doug.
Correct. But this is extremely pedantic, because (and even as a shill I will say this) merchant Arm servers are a joke, and I don’t anticipate that changing.
Ampere indeed is a meme.
Also Oracle. But that's exactly because the hyperscalers are able to do everything the merchant ARM vendors could do. That leaves Ampere etc. chasing smaller customers who aren't big enough for it to be worth it to roll their own, but are big enough that the savings from switching from "industry standard" x86 to ARM are worth the hassle. It's hard to make a living in a niche that's squeezed on both sides.
If that diagram represents an accurate accounting of silicon spent on NPUs, it augurs bad tidings for mobile GPUs going forward.
Samsung is also poised to put their own chips in PCs next year.
According to rumours, there will be two versions of their flagship Exynos 2500 chip. The 2500-A and 2500-B.
The 2500-B is the version intended for Windows AI PCs. It will most likely be used in Samsung's own Galaxy Books.
Sure. But it's also that the advantages Arm (mobile) vendors that traditionally focus on low-power fabrics have in a PC (notice it's mostly laptops) don't seem to be as big a deal in servers, for obvious reasons. With hyperscalers, though, they could literally use the exact same core and design as AMD/Intel and they'd still save money from cutting out a middleman. I mean, power is still a big deal for TCO, but not in the same way.
Not at all. Lol.
And that's trashcan-level NPU. New NV NPU block does 130 TOPS/W/nm2 on N3P for the basic CoPilot redacted app and AI PC moniker. Serious AI will be done on Blackwell Tensor.
The 34 TOPS NPU in Apple A17 uses only 5 mm²
130 TOPS per Watt per "nm²"?
So are you talking about Nvidia's upcoming AI PC SoC, or their Blackwell GPU?
I was talking about the next Nvidia N3P SoC for these new shiny Microsoft AI PCs.
nanometer squared??
Test chip was on TSMC 5nm and reached nearly 100 TOPS/W (see attached slides). On next N3P SoC, efficiency should be 30% higher, thus my 130 TOPS/W number. So this was correct.
But I remembered wrong on the area. The test chip is 0.153 mm² (still on 5nm) without the interconnect. I expect double the area for the complete block. But the final SoC is on N3P so it should be smaller. The real number is difficult to estimate, but NV will dedicate less than 3 mm² for the NPU (with local cache) for 300~500 TOPS (I don't have the final number) and less than 5W. By the way, for the story, this VS-Quant INT4 DL NPU test chip is present in a corner of each NVSwitch silicon.
That's insane numbers.
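A quick back-of-envelope check of the figures above (this is just my arithmetic; the 100 TOPS/W, 30% N3P gain, 0.153 mm² test chip, and "double for the full block" numbers are taken from the post at face value):

```python
# Back-of-envelope check of the NPU numbers quoted above.

n5_tops_per_w = 100            # test chip efficiency on TSMC 5nm (per the post)
n3p_gain = 1.30                # claimed N5 -> N3P efficiency improvement

n3p_tops_per_w = n5_tops_per_w * n3p_gain
print(f"Projected N3P efficiency: {n3p_tops_per_w:.0f} TOPS/W")  # → 130

full_block_mm2 = 0.153 * 2     # test chip area doubled for interconnect etc.
print(f"Estimated full block on N5: ~{full_block_mm2:.2f} mm²")  # → ~0.31

# For scale, the Apple A17 NPU quoted upthread: 34 TOPS in 5 mm²
print(f"A17 NPU density: {34 / 5:.1f} TOPS/mm²")                 # → 6.8
```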
All in all, the important fact is that Nvidia will bring an NPU magnitudes faster and more power-efficient than anything on the market right now (and I expect that to still be true against the 2025-26 competition). It was one of the points that put Microsoft on notice to make Nvidia the next preferred vendor for AI PCs on ARM.
How are they going to feed it?
LPDDR6 to the rescue?
INT4 needs very low bandwidth, but yes, it's LPDDR6 on package.
Do you have any numbers for the NPU efficiency of competitors (Intel/AMD/Qualcomm) ?
Okay, this is something I have been wondering for a long time.
Ampere don't really make desktop systems.
Still, these are far behind the numbers you are quoting for Nvidia, which sounds unbelievable. 300 TOPS in 3 mm²!?
He's reading the graph and paper wrong. He's taking the efficiency at Vmin, which is always going to be the most efficient point in operation, and applying that to the whole operating range.
No, he's very intentionally shilling.
34 TOPS of FP16/BF16
Apple used INT8 on A17 Pro and M4.
M3 is FP16.
The classic story of the half-empty glass...
Due to the low frequency at that point, you'd need somewhere around 30 mm² of silicon to get to 50 TOPS. Going by another table in the paper, they're at 11.7 TOPS/mm² at Vmax (INT8), so that puts them at ~4 mm² to get 45 TOPS, not including interconnect.
So potentially quite good, but nowhere close to the 300 TOPS in 3 mm² figure he cites.
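To make the gap concrete (my calculator work only, using the Vmax density as quoted from the paper):

```python
# Checking the area arithmetic above with the paper's quoted Vmax density.

vmax_density = 11.7                 # TOPS/mm² at Vmax, INT8 (as quoted)
print(f"{45 / vmax_density:.1f} mm² for 45 TOPS")        # → 3.8 mm²

# The contested claim of 300 TOPS in 3 mm² would imply:
implied = 300 / 3                   # TOPS/mm²
print(f"Implied density: {implied:.0f} TOPS/mm², "
      f"{implied / vmax_density:.1f}x the paper's Vmax figure")
```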
No, Windows on ARM PCs aren't likely to feature Thunderbolt ports, but that does not mean you can't use Thunderbolt accessories.
No way.
The new wave of Windows on ARM Copilot+ PCs may seem at risk of losing access to Thunderbolt because of this, but we've already seen new Windows on ARM PCs boasting USB4 ports, which can do all the same things in theory.
The issue with USB4 is that OEMs might put in the lowest-spec version of USB4 and call it a day. Even if it is branded as 40 Gbps USB4, it may lack features like PCIe tunneling or DisplayPort alt mode, which are mandatory in the Thunderbolt 4 specification.
Think of it this way: having access to TB would not have stopped OEMs from using the lowest USB4 implementation. As always, it's up to them to bring the appropriate connectivity based on their asking price.
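A tiny illustration of the point: "USB4" on a spec sheet does not guarantee the capabilities Thunderbolt 4 makes mandatory. The feature sets below are my rough summary of the public USB4/TB4 requirements, not a quote from either spec, and the "budget port" is a hypothetical lowest-spec example.

```python
# "USB4" branding vs. Thunderbolt 4's mandatory capability set.
# Feature lists are a rough summary (assumption), not spec text.

TB4_MANDATORY = {"40 Gbps", "PCIe tunneling", "DisplayPort tunneling"}

def missing_vs_tb4(usb4_port: set) -> list:
    """List TB4-mandatory capabilities a given USB4 port lacks."""
    return sorted(TB4_MANDATORY - usb4_port)

# Hypothetical lowest-spec USB4 port an OEM might ship:
budget_port = {"20 Gbps", "DisplayPort tunneling"}
print(missing_vs_tb4(budget_port))   # → ['40 Gbps', 'PCIe tunneling']
```

So a buyer checking only for the "USB4" label could still end up without eGPU-style PCIe devices working at all.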