Discussion Apple Silicon SoC thread


Eug

Lifer
Mar 11, 2000
23,749
1,281
126
M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24,576 concurrent threads
2.6 teraflops
82 gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options: 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core count). Basically, Apple is taking the same approach with these chips as it does with iPhones and iPads: just one SKU (excluding the X variants), which is the same across all iDevices (aside from occasional slight clock speed differences).
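For anyone who wants to double-check these figures on their own machine, macOS exposes the per-cluster core counts and shared L2 sizes through the hw.perflevel sysctls on Apple Silicon. A minimal sketch (assumes macOS 12 or later and that your SDK ships these keys):

Code:
// Minimal sketch: read Apple Silicon core counts and shared L2 sizes via sysctl.
// Assumes the hw.perflevel* keys exposed on Apple Silicon Macs (macOS 12+).
#include <stdio.h>
#include <stdint.h>
#include <sys/sysctl.h>

static long long read_sysctl(const char *name) {
    long long value = 0;
    size_t len = sizeof(value);
    if (sysctlbyname(name, &value, &len, NULL, 0) != 0)
        return -1;                      // key not present on this system
    return value;
}

int main(void) {
    // perflevel0 = performance cores, perflevel1 = efficiency cores
    printf("P cores: %lld, shared L2: %lld bytes\n",
           read_sysctl("hw.perflevel0.logicalcpu"),
           read_sysctl("hw.perflevel0.l2cachesize"));
    printf("E cores: %lld, shared L2: %lld bytes\n",
           read_sysctl("hw.perflevel1.logicalcpu"),
           read_sysctl("hw.perflevel1.l2cachesize"));
    printf("Unified memory: %lld bytes\n", read_sysctl("hw.memsize"));
    return 0;
}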

EDIT:



M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:


M1 Ultra discussion here:


M2 discussion here:


M2
Second-generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K H.264, H.265 (HEVC), ProRes

M3 Family discussion here:


M4 Family discussion here:

 
Last edited:

roger_k

Member
Sep 23, 2021
102
215
86
Still, it is not untrue that Apple's P-core IPC gains have levelled off in the last few years, with huge power consumption increments due to pushing frequency.

It is true, to a certain degree. We do observe some power inflation since A12. For example, M4 cores seem to use ~7 watts compared to M1's 5 watts. Some have interpreted this as stagnation; others see it as a transition from a smartphone-focused core to a more desktop-focused core.

Also, it's important to keep in mind that while the power consumption is indeed increasing, we are still looking at sub-10-watt per-core power draw (at least according to Instruments). I'd say that Apple still has a lot of headroom here.
 
Reactions: darkswordsman17

roger_k

Member
Sep 23, 2021
102
215
86
No, streaming SVE2 is kinda slow in many places versus just doing NEON on the actual CPU.

I am getting 250 GFLOPS (FP32) doing vector FMA on M4's SME unit. The hardware is, in principle, capable of 2 TFLOPS, but there is not enough register file bandwidth, which means that you are limited to 2x 512b SIMD slices out of the available 16x. If Apple cares about vector performance, they might increase this in the future.

A caveat is that using SVE registers as a destination is slow on Apple hardware. Those are likely microcoded operations that read back from the tile storage and prevent effective pipelining. If you want to get good throughput, you need to use ZA as an accumulator.
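As a rough sanity check on those numbers: 2 slices x 16 FP32 lanes (512b / 32b) x 2 flops per FMA x 3.9 GHz comes out to about 250 GFLOPS, which lines up with the measurement above. For the curious, below is a minimal sketch of the keep-the-accumulator-in-ZA pattern using the ACLE SME intrinsics. It shows the outer-product form (FMOPA), not the exact vector-FMA loop benchmarked above, and it assumes a recent Clang with <arm_sme.h>; the attribute spelling/placement follows the ACLE SME spec and may need adjusting for your toolchain.

Code:
// Sketch only: FP32 outer-product accumulation kept in ZA tile 0, read back
// once at the end. Assumes a toolchain with SME ACLE support (recent Clang);
// __arm_new("za") / __arm_streaming placement per the ACLE spec, adjust as needed.
#include <stddef.h>
#include <stdint.h>
#include <arm_sme.h>

__arm_new("za")
void rank1_updates(const float *a, const float *b, float *c, int steps)
    __arm_streaming
{
    const uint64_t vl = svcntw();            // FP32 lanes per streaming vector
    const svbool_t pg = svptrue_b32();

    svzero_za();                             // clear ZA tile storage
    for (int k = 0; k < steps; ++k) {
        svfloat32_t col = svld1_f32(pg, &a[(size_t)k * vl]);
        svfloat32_t row = svld1_f32(pg, &b[(size_t)k * vl]);
        // FMOPA accumulates into the ZA tile; nothing is written back to the
        // SVE register file inside the loop, so successive updates pipeline.
        svmopa_za32_f32_m(0, pg, pg, col, row);
    }
    // Move results out of ZA exactly once, one horizontal slice per row.
    for (uint32_t r = 0; r < (uint32_t)vl; ++r) {
        svfloat32_t out = svread_hor_za32_f32_m(svdup_n_f32(0.0f), pg, 0, r);
        svst1_f32(pg, &c[(size_t)r * vl], out);
    }
}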

Sure, but the main purpose of implementing SME is to do matrices. Doing vectors is a secondary purpose.

While that is true, vector processing is still fast on M4 using SME. They give you 2x 512b additional SIMD units running at 3.9 GHz, which improves the vector throughput of the M4 cluster by ~50%.
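Back-of-the-envelope, that ~50% figure checks out: assuming each M4 P core has four 128-bit FP/SIMD pipes (an assumption on my part), the 4-core P cluster peaks around 4 cores x 4 pipes x 4 FP32 lanes x 2 flops x ~4 GHz ≈ 0.5 TFLOPS with NEON, so an extra ~250 GFLOPS from the shared SME unit is roughly a 50% uplift.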
 

The Hardcard

Member
Oct 19, 2021
124
177
86
I am getting 250 GFLOPS (FP32) doing vector FMA on M4's SME unit. The hardware is, in principle, capable of 2 TFLOPS, but there is not enough register file bandwidth, which means that you are limited to 2x 512b SIMD slices out of the available 16x. If Apple cares about vector performance, they might increase this in the future.

A caveat is that using SVE registers as a destination is slow on Apple hardware. Those are likely microcoded operations that read back from the tile storage and prevent effective pipelining. If you want to get good throughput, you need to use ZA as an accumulator.



While that is true, vector processing is still fast on M4 using SME. They give you 2x 512b additional SIMD units running at 3.9 GHz, which improves the vector throughput of the M4 cluster by ~50%.
More important for AI workloads is the 16 TOPS of int8. Model inferencing maintains nearly 99% accuracy at eight bits, and above 95% even down to four bits.
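(For anyone wondering what running at "eight bits" means in practice, the usual recipe is symmetric int8 quantization: pick one scale per tensor so that w ≈ scale × q with q in [-127, 127]. A generic sketch below; this is the textbook scheme, not Apple's implementation.)

Code:
// Generic illustration of symmetric per-tensor int8 quantization -- the kind
// of scheme the 8-bit accuracy figures refer to. Not Apple's implementation.
#include <math.h>
#include <stddef.h>
#include <stdint.h>

// Quantize float weights so that w[i] ~= scale * q[i], with q[i] in [-127, 127].
float quantize_int8(const float *w, int8_t *q, size_t n) {
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; ++i)
        max_abs = fmaxf(max_abs, fabsf(w[i]));
    float scale = (max_abs > 0.0f) ? (max_abs / 127.0f) : 1.0f;
    for (size_t i = 0; i < n; ++i) {
        float r = roundf(w[i] / scale);
        if (r > 127.0f)  r = 127.0f;    // clamp to the int8 range
        if (r < -127.0f) r = -127.0f;
        q[i] = (int8_t)r;
    }
    return scale;                        // keep the scale to dequantize outputs
}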

You bring up an important question I have regarding Apple's plans. It concerns Apple's design being, in effect, a GPU with an integrated CPU rather than the other way around. Apple's CPU clusters have limited bandwidth, and in the Pros, Maxes, and derivatives they can't saturate the chip's memory bus. It's not clear that the neural engine can either. (Very little is clear about the neural engine, sadly, though it seems to be designed for small tasks at very low power rather than big-boy inferencing.)

Especially given the strong rumors that Apple's datacenters are going to eat their own dog food, it would seem beneficial to put some type of matrix units, or at least outer-product units, into the GPU.
 

FlameTail

Diamond Member
Dec 15, 2021
3,145
1,792
106
The 16 TOPS figure for the M4 CPU, is that across both the P-core cluster and the E-core cluster?

I believe the P-core and E-core clusters have one SME unit each. That's how it was with AMX.

 

FlameTail

Diamond Member
Dec 15, 2021
3,145
1,792
106
So that would be about 20 TOPS from the CPU alone.

What is Apple's hardware AI strategy going forward? Right now, it seems there is a lot of overlap between their CPU, GPU, and Neural Engine.

On the other side of the river, Microsoft is going all in on NPUs.
 

roger_k

Member
Sep 23, 2021
102
215
86
What is Apple's hardware AI strategy going forward? Right now, it seems there is a lot of overlap between their CPU, GPU, and Neural Engine.

Different functional blocks, optimized for different purposes. The GPU is least specialized (and least efficient), the NPU is most specialized.
 

FlameTail

Diamond Member
Dec 15, 2021
3,145
1,792
106
I think this slide from Nvidia sums up Apple's dilemma.

Nvidia divides AI at the edge into 'Light AI' and 'Heavy AI'. Light AI is powered by NPUs (~45 TOPS) and Heavy AI is powered by dGPUs (up to 1,300 TOPS). This classification suits them perfectly, because they are a GPU company; they do not make CPUs or NPUs for PCs (at least for now).

Circling back to Apple: Apple does not make dGPUs. Their Apple Silicon combines the CPU, GPU and NPU into one chip. This then raises the question of how Apple is going to approach AI hardware.

Right now, in the M3 series used in Macs, the 17 TOPS NPU is evidently not powerful enough for heavy on-device AI. People running Heavy AI models such as Llama 3 on MacBooks have found that it's the GPU that gets utilised when running such models. This makes sense, because the GPU is more powerful than the NPU.

But in the recently unveiled M4, the CPU got SME, which on its own gives about 20 TOPS of INT8. The NPU also got a leg up, to 38 TOPS of INT8.

This then raises the question of what Apple's future AI hardware strategy is. Will they scale up the NPU to larger sizes/TOPS in the Pro/Max/Ultra SoCs? Or will the GPU remain the first-class citizen for running on-device Heavy AI? And where does the CPU with SME fit into this tapestry?
 
Reactions: Mopetar

eek2121

Diamond Member
Aug 2, 2005
3,043
4,264
136
so I had a spare moment, bounced over to this thread, and saw the quotes below:

What? It's night and day for sustained loads, man. Lower wattage doesn't matter in this case, because this behavior is universal for every M1-M4 chip. Every single one of them is equivalent on/off the wall, which would only ever be the case in a Windows machine that is engineered to be as low-TDP as possible. The second the PC had an all-core load, it would fold. This has not changed radically in the time since, not in a way that would make someone who owns a MacBook Pro say "dang, should've bought that Core Ultra 7 155H".
View attachment 100126

This is quibbling around the point; everyone knows that Windows PCs cannot compare to ARM Macs when they're off AC power. If that WERE the case, we'd never hear the end of it. A MacBook Air/Pro can run through its entire battery without dropping an ounce of performance vs. being on the wall. Show me a single Windows machine that can do this in a real-world workload.
You guys do understand the whole “throttle on battery” is a software thing, right? I disabled it on my laptop. My system performs the same way on battery or plugged in.
 
Reactions: igor_kavinski

oak8292

Member
Sep 14, 2016
87
69
91
I think this slide from Nvidia sums up Apple's dilemma
View attachment 100239
Nvidia divides AI at the edge into 'Light AI' and 'Heavy AI'. Light AI is powered by NPUs (~45 TOPS) and Heavy AI is powered by dGPUs (up to 1,300 TOPS). This classification suits them perfectly, because they are a GPU company; they do not make CPUs or NPUs for PCs (at least for now).

Circling back to Apple: Apple does not make dGPUs. Their Apple Silicon combines the CPU, GPU and NPU into one chip. This then raises the question of how Apple is going to approach AI hardware.

Right now, in the M3 series used in Macs, the 17 TOPS NPU is evidently not powerful enough for heavy on-device AI. People running Heavy AI models such as Llama 3 on MacBooks have found that it's the GPU that gets utilised when running such models. This makes sense, because the GPU is more powerful than the NPU.

But in the recently unveiled M4, the CPU got SME, which on its own gives about 20 TOPS of INT8. The NPU also got a leg up, to 38 TOPS of INT8.

This then raises the question of what Apple's future AI hardware strategy is. Will they scale up the NPU to larger sizes/TOPS in the Pro/Max/Ultra SoCs? Or will the GPU remain the first-class citizen for running on-device Heavy AI? And where does the CPU with SME fit into this tapestry?
Nvidia is a company with a 'hammer'. Nvidia's mobile offerings had great promise, but too much of the power budget was allocated to the GPU. Great for handheld games but not mobile phones. Power matters, and you have to use what the technology of the day gives you. 'Heavy AI' is apparently going to be fairly limited in scope if it needs 200-1,300 TOPS. Limited in scope means millions and not billions. Apple's market is 200 million-plus smartphones (<10 watts), tens of millions of mobile devices (10-50 watts), and maybe a million devices greater than 50 watts.

Blackwell on the N4 node needs about 50 watts for the minimum 200 TOPS of 'heavy AI'. Is that feasible for Apple given their market? How much engineering effort do you expend to be at the bottom of the 'heavy AI' market as defined by Nvidia?

My guess is that Apple is focused on what they can do for the 200 million smartphone users or the tens of millions of iPad/laptop users, and will leave the 200+ watt desktop AI to others.
 

name99

Senior member
Sep 11, 2010
443
333
136
Right now, in the M3 series used in Macs, the 17 TOPS NPU is evidently not powerful enough for heavy on-device AI. People running Heavy AI models such as Llama 3 on MacBooks have found that it's the GPU that gets utilised when running such models. This makes sense, because the GPU is more powerful than the NPU.
Models that are designed for dGPUs run on the M-series GPU. Big surprise.

That doesn't mean a model designed for ANE can't run on ANE...



I think Apple's OpenELM also runs on the ANE, but I can't get definitive confirmation.
The other Apple stuff created with CoreNet, like the MobileOne vision net, definitely runs on the ANE, and if you read between the lines, the papers describing the model are mostly describing changes made to fit it better to the ANE.

 
Reactions: igor_kavinski

Doug S

Platinum Member
Feb 8, 2020
2,479
4,035
136
More important for AI workloads is the 16 TOPS of int8. Model inferencing maintains nearly 99% accuracy at eight bits, and above 95% even down to four bits.

You bring up an important question I have regarding Apple's plans. It concerns Apple's design being, in effect, a GPU with an integrated CPU rather than the other way around. Apple's CPU clusters have limited bandwidth, and in the Pros, Maxes, and derivatives they can't saturate the chip's memory bus. It's not clear that the neural engine can either. (Very little is clear about the neural engine, sadly, though it seems to be designed for small tasks at very low power rather than big-boy inferencing.)

Especially given the strong rumors that Apple's datacenters are going to eat their own dog food, it would seem beneficial to put some type of matrix units, or at least outer-product units, into the GPU.


I don't buy that Apple would use their existing chips, unchanged, if they were going to build their own dedicated hardware for AI cloud clusters. They might use something existing for a pilot to give the software guys something to work with (which I'm willing to bet is where the M2 Ultra rumors come from) but they'd design a new chip for something they'd deploy for real. It is too inefficient to use existing chips that have a lot of die area devoted to stuff that isn't helping the AI cause, plus LPDDR just isn't appropriate to their needs for dedicated AI clusters.
 

mikegg

Golden Member
Jan 30, 2010
1,815
445
136
I think this slide from Nvidia sums up Apple's dilemma
View attachment 100239
Nvidia divides AI at the edge into 'Light AI' and 'Heavy AI'. Light AI is powered by NPUs (~45 TOPS) and Heavy AI is powered by dGPUs (up to 1,300 TOPS). This classification suits them perfectly, because they are a GPU company; they do not make CPUs or NPUs for PCs (at least for now).

Circling back to Apple: Apple does not make dGPUs. Their Apple Silicon combines the CPU, GPU and NPU into one chip. This then raises the question of how Apple is going to approach AI hardware.

Right now, in the M3 series used in Macs, the 17 TOPS NPU is evidently not powerful enough for heavy on-device AI. People running Heavy AI models such as Llama 3 on MacBooks have found that it's the GPU that gets utilised when running such models. This makes sense, because the GPU is more powerful than the NPU.

But in the recently unveiled M4, the CPU got SME, which on its own gives about 20 TOPS of INT8. The NPU also got a leg up, to 38 TOPS of INT8.

This then raises the question of what Apple's future AI hardware strategy is. Will they scale up the NPU to larger sizes/TOPS in the Pro/Max/Ultra SoCs? Or will the GPU remain the first-class citizen for running on-device Heavy AI? And where does the CPU with SME fit into this tapestry?
They'll likely continue to increase the NPU at a higher rate than the GPU, but it will never be able to run inference on the best models locally. So they'll put something like "Siri Pro" on the server while normal Siri stays local.
 
Mar 8, 2024
61
165
66
so I had a spare moment, bounced over to this thread, and saw the quotes below:




You guys do understand the whole “throttle on battery” is a software thing, right? I disabled it on my laptop. My system performs the same way on battery or plugged in.

And you, in turn, understand that 3 hours of full-tilt battery usage is a lot less than the full day's charge you get on any MacBook?
 