Discussion Apple Silicon SoC thread


Eug

Lifer
Mar 11, 2000
23,749
1,281
126
M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24,576 concurrent threads
2.6 teraflops
82 gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options: 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core count). Basically, Apple is taking the same approach with these chips as it does with iPhones and iPads: just one SKU (excluding the X variants), which is the same across all iDevices (aside from occasional slight clock speed differences).
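For anyone who wants to double-check these figures on their own machine, macOS exposes the per-cluster core counts and shared L2 sizes through the hw.perflevel sysctls on Apple Silicon. A minimal sketch (assumes macOS 12 or later and that your SDK ships these keys):

Code:
// Minimal sketch: read Apple Silicon core counts and shared L2 sizes via sysctl.
// Assumes the hw.perflevel* keys exposed on Apple Silicon Macs (macOS 12+).
#include <stdio.h>
#include <stdint.h>
#include <sys/sysctl.h>

static long long read_sysctl(const char *name) {
    long long value = 0;
    size_t len = sizeof(value);
    if (sysctlbyname(name, &value, &len, NULL, 0) != 0)
        return -1;                      // key not present on this system
    return value;
}

int main(void) {
    // perflevel0 = performance cores, perflevel1 = efficiency cores
    printf("P cores: %lld, shared L2: %lld bytes\n",
           read_sysctl("hw.perflevel0.logicalcpu"),
           read_sysctl("hw.perflevel0.l2cachesize"));
    printf("E cores: %lld, shared L2: %lld bytes\n",
           read_sysctl("hw.perflevel1.logicalcpu"),
           read_sysctl("hw.perflevel1.l2cachesize"));
    printf("Unified memory: %lld bytes\n", read_sysctl("hw.memsize"));
    return 0;
}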

EDIT:



M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:


M1 Ultra discussion here:


M2 discussion here:


M2
Second-generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K H.264, H.265 (HEVC), ProRes

M3 Family discussion here:


M4 Family discussion here:

 
Last edited:

roger_k

Member
Sep 23, 2021
102
215
86
Still, it is not untrue that Apple's P-core IPC gains have levelled off in the last few years, with huge power consumption increments due to pushing frequency.

It is true, to a certain degree. We do observe some power inflation since A12. For example, M4 cores seem to use ~7 watts compared to M1's 5 watts. Some have interpreted this as stagnation; others see it as a transition from a smartphone-focused core to a more desktop-focused core.

Also, it's important to keep in mind that while the power consumption is indeed increasing, we are still looking at sub-10-watt per-core power draw (at least according to Instruments). I'd say that Apple still has a lot of headroom here.
 
Reactions: darkswordsman17

roger_k

Member
Sep 23, 2021
102
215
86
No, streaming SVE2 is kinda slow in many places versus just doing NEON on the actual CPU.

I am getting 250 GFLOPS (FP32) doing vector FMA on M4's SME unit. The hardware is, in principle, capable of 2 TFLOPS, but there is not enough register file bandwidth, which means that you are limited to 2x 512b SIMD slices out of the available 16x. If Apple cares about vector performance, they might increase this in the future.

A caveat is that using SVE registers as a destination is slow on Apple hardware. Those are likely microcoded operations that read back from the tile storage and prevent effective pipelining. If you want to get good throughput, you need to use ZA as an accumulator.
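As a rough sanity check on those numbers: 2 slices x 16 FP32 lanes (512b / 32b) x 2 flops per FMA x 3.9 GHz comes out to about 250 GFLOPS, which lines up with the measurement above. For the curious, below is a minimal sketch of the keep-the-accumulator-in-ZA pattern using the ACLE SME intrinsics. It shows the outer-product form (FMOPA), not the exact vector-FMA loop benchmarked above, and it assumes a recent Clang with <arm_sme.h>; the attribute spelling/placement follows the ACLE SME spec and may need adjusting for your toolchain.

Code:
// Sketch only: FP32 outer-product accumulation kept in ZA tile 0, read back
// once at the end. Assumes a toolchain with SME ACLE support (recent Clang);
// __arm_new("za") / __arm_streaming placement per the ACLE spec, adjust as needed.
#include <stddef.h>
#include <stdint.h>
#include <arm_sme.h>

__arm_new("za")
void rank1_updates(const float *a, const float *b, float *c, int steps)
    __arm_streaming
{
    const uint64_t vl = svcntw();            // FP32 lanes per streaming vector
    const svbool_t pg = svptrue_b32();

    svzero_za();                             // clear ZA tile storage
    for (int k = 0; k < steps; ++k) {
        svfloat32_t col = svld1_f32(pg, &a[(size_t)k * vl]);
        svfloat32_t row = svld1_f32(pg, &b[(size_t)k * vl]);
        // FMOPA accumulates into the ZA tile; nothing is written back to the
        // SVE register file inside the loop, so successive updates pipeline.
        svmopa_za32_f32_m(0, pg, pg, col, row);
    }
    // Move results out of ZA exactly once, one horizontal slice per row.
    for (uint32_t r = 0; r < (uint32_t)vl; ++r) {
        svfloat32_t out = svread_hor_za32_f32_m(svdup_n_f32(0.0f), pg, 0, r);
        svst1_f32(pg, &c[(size_t)r * vl], out);
    }
}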

Sure, but the main purpose of implementing SME is to do matrices. Doing vectors is a secondary purpose.

While that is true, vector processing is still fast on M4 using SME. They give you 2x 512b additional SIMD units running at 3.9 GHz, which improves the vector throughput of the M4 cluster by ~50%.
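Back-of-the-envelope, that ~50% figure checks out: assuming each M4 P core has four 128-bit FP/SIMD pipes (an assumption on my part), the 4-core P cluster peaks around 4 cores x 4 pipes x 4 FP32 lanes x 2 flops x ~4 GHz ≈ 0.5 TFLOPS with NEON, so an extra ~250 GFLOPS from the shared SME unit is roughly a 50% uplift.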
 

The Hardcard

Member
Oct 19, 2021
124
177
86
I am getting 250 GFLOPS (FP32) doing vector FMA on M4's SME unit. The hardware is, in principle, capable of 2 TFLOPS, but there is not enough register file bandwidth, which means that you are limited to 2x 512b SIMD slices out of the available 16x. If Apple cares about vector performance, they might increase this in the future.

A caveat is that using SVE registers as a destination is slow on Apple hardware. Those are likely microcoded operations that read back from the tile storage and prevent effective pipelining. If you want to get good throughput, you need to use ZA as an accumulator.



While that is true, vector processing is still fast on M4 using SME. They give you 2x 512b additional SIMD units running at 3.9 GHz, which improves the vector throughput of the M4 cluster by ~50%.
More important for AI workloads is the 16 TOPS of int8. Model inferencing maintains nearly 99% accuracy at eight bits, and above 95% even down to four bits.
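(For anyone wondering what running at "eight bits" means in practice, the usual recipe is symmetric int8 quantization: pick one scale per tensor so that w ≈ scale × q with q in [-127, 127]. A generic sketch below; this is the textbook scheme, not Apple's implementation.)

Code:
// Generic illustration of symmetric per-tensor int8 quantization -- the kind
// of scheme the 8-bit accuracy figures refer to. Not Apple's implementation.
#include <math.h>
#include <stddef.h>
#include <stdint.h>

// Quantize float weights so that w[i] ~= scale * q[i], with q[i] in [-127, 127].
float quantize_int8(const float *w, int8_t *q, size_t n) {
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; ++i)
        max_abs = fmaxf(max_abs, fabsf(w[i]));
    float scale = (max_abs > 0.0f) ? (max_abs / 127.0f) : 1.0f;
    for (size_t i = 0; i < n; ++i) {
        float r = roundf(w[i] / scale);
        if (r > 127.0f)  r = 127.0f;    // clamp to the int8 range
        if (r < -127.0f) r = -127.0f;
        q[i] = (int8_t)r;
    }
    return scale;                        // keep the scale to dequantize outputs
}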

You bring up an important question I have regarding Apple's plans. It concerns Apple's design being, in effect, a GPU with an integrated CPU rather than the other way around. Apple's CPU clusters have limited bandwidth, and in the Pros, Maxes, and derivatives they can't saturate the chip's memory bus. It's not clear that the neural engine can either. (Very little is clear about the neural engine, sadly, though it seems to be designed for small tasks at very low power rather than big-boy inferencing.)

Especially given the strong rumors that Apple's datacenters are going to eat their own dog food, it would seem beneficial to put some type of matrix units, or at least outer-product units, into the GPU.
 

FlameTail

Diamond Member
Dec 15, 2021
3,145
1,792
106
The 16 TOPS figure for the M4 CPU, is that across both the P-core cluster and the E-core cluster?

I believe the P-core and E-core clusters have one SME unit each. That's how it was with AMX.

 

FlameTail

Diamond Member
Dec 15, 2021
3,145
1,792
106
So that would be about 20 TOPS from the CPU alone.

What is Apple's hardware AI strategy going forward? Right now, it seems there is a lot of overlap between their CPU, GPU, and Neural Engine.

On the other side of the river, Microsoft is going all in on NPUs.
 

roger_k

Member
Sep 23, 2021
102
215
86
What is Apple's hardware AI strategy going forward? Right now, it seems there is a lot of overlap between their CPU, GPU, and Neural Engine.

Different functional blocks, optimized for different purposes. The GPU is least specialized (and least efficient), the NPU is most specialized.
 

FlameTail

Diamond Member
Dec 15, 2021
3,145
1,792
106
I think this slide from Nvidia sums up Apple's dilemma.

Nvidia divides AI at the edge into 'Light AI' and 'Heavy AI'. Light AI is powered by NPUs (~45 TOPS) and Heavy AI is powered by dGPUs (up to 1,300 TOPS). This classification suits them perfectly, because they are a GPU company; they do not make CPUs or NPUs for PCs (at least for now).

Circling back to Apple: Apple does not make dGPUs. Their Apple Silicon combines the CPU, GPU and NPU into one chip. This then raises the question of how Apple is going to approach AI hardware.

Right now, in the M3 series used in Macs, the 17 TOPS NPU is evidently not powerful enough for heavy on-device AI. People running Heavy AI models such as Llama 3 on MacBooks have found that it's the GPU that gets utilised when running such models. This makes sense, because the GPU is more powerful than the NPU.

But in the recently unveiled M4, the CPU got SME, which on its own gives about 20 TOPS of INT8. The NPU also got a leg up, to 38 TOPS of INT8.

This then raises the question of what Apple's future AI hardware strategy is. Will they scale up the NPU to larger sizes/TOPS in the Pro/Max/Ultra SoCs? Or will the GPU remain the first-class citizen for running on-device Heavy AI? And where does the CPU with SME fit into this tapestry?
 
Reactions: Mopetar

eek2121

Diamond Member
Aug 2, 2005
3,043
4,264
136
so I had a spare moment, bounced over to this thread, and saw the quotes below:

What? It's night and day for sustained loads, man. Lower wattage doesn't matter in this case, because this behavior is universal for every M1-M4 chip. Every single one of them is equivalent on/off the wall, which would only ever be the case in a Windows machine that is engineered to be as low-TDP as possible. The second the PC had an all-core load, it would fold. This has not changed radically in the time since, not in a way that would make someone who owns a MacBook Pro say "dang, should've bought that Core Ultra 7 155H".
View attachment 100126

This is quibbling around the point; everyone knows that Windows PCs cannot compare to ARM Macs when they're off AC power. If that WERE the case, we'd never hear the end of it. A MacBook Air/Pro can run through its entire battery without dropping an ounce of performance vs. being on the wall. Show me a single Windows machine that can do this in a real-world workload.
You guys do understand the whole “throttle on battery” is a software thing, right? I disabled it on my laptop. My system performs the same way on battery or plugged in.
 
Reactions: igor_kavinski

oak8292

Member
Sep 14, 2016
87
69
91
I think this slide from Nvidia sums up Apple's dilemma
View attachment 100239
Nvidia divides AI at the edge into 'Light AI' and 'Heavy AI'. Light AI is powered by NPUs (~45 TOPS) and Heavy AI is powered by dGPUs (up to 1,300 TOPS). This classification suits them perfectly, because they are a GPU company; they do not make CPUs or NPUs for PCs (at least for now).

Circling back to Apple: Apple does not make dGPUs. Their Apple Silicon combines the CPU, GPU and NPU into one chip. This then raises the question of how Apple is going to approach AI hardware.

Right now, in the M3 series used in Macs, the 17 TOPS NPU is evidently not powerful enough for heavy on-device AI. People running Heavy AI models such as Llama 3 on MacBooks have found that it's the GPU that gets utilised when running such models. This makes sense, because the GPU is more powerful than the NPU.

But in the recently unveiled M4, the CPU got SME, which on its own gives about 20 TOPS of INT8. The NPU also got a leg up, to 38 TOPS of INT8.

This then raises the question of what Apple's future AI hardware strategy is. Will they scale up the NPU to larger sizes/TOPS in the Pro/Max/Ultra SoCs? Or will the GPU remain the first-class citizen for running on-device Heavy AI? And where does the CPU with SME fit into this tapestry?
Nvidia is a company with a 'hammer'. Nvidia's mobile offerings had great promise, but too much of the power budget was allocated to the GPU. Great for handheld games but not mobile phones. Power matters, and you have to use what the technology of the day gives you. 'Heavy AI' is apparently going to be fairly limited in scope if it needs 200-1,300 TOPS. Limited in scope means millions and not billions. Apple's market is 200 million-plus smartphones (<10 watts), tens of millions of mobile devices (10-50 watts), and maybe a million devices greater than 50 watts.

Blackwell on the N4 node needs about 50 watts for the minimum 200 TOPS of 'heavy AI'. Is that feasible for Apple given their market? How much engineering effort do you expend to be at the bottom of the 'heavy AI' market as defined by Nvidia?

My guess is that Apple is focused on what they can do for the 200 million smartphone users or the tens of millions of iPad/laptop users, and will leave the 200+ watt desktop AI to others.
 

name99

Senior member
Sep 11, 2010
443
333
136
Right now, in the M3 series used in Macs, the 17 TOPS NPU is evidently not powerful enough for heavy on-device AI. People running Heavy AI models such as Llama 3 on MacBooks have found that it's the GPU that gets utilised when running such models. This makes sense, because the GPU is more powerful than the NPU.
Models that are designed for dGPUs run on the M-series GPU. Big surprise.

That doesn't mean a model designed for ANE can't run on ANE...



I think Apple's OpenELM also runs on the ANE, but I can't get definitive confirmation.
The other Apple stuff created with CoreNet, like the MobileOne vision net, definitely runs on the ANE, and if you read between the lines, the papers describing the model are mostly describing changes made to fit it better to the ANE.

 
Reactions: igor_kavinski

Doug S

Platinum Member
Feb 8, 2020
2,479
4,035
136
More important for AI workloads is the 16 TOPS of int8. Model inferencing maintains nearly 99% accuracy at eight bits, and above 95% even down to four bits.

You bring up an important question I have regarding Apple's plans. It concerns Apple's design being, in effect, a GPU with an integrated CPU rather than the other way around. Apple's CPU clusters have limited bandwidth, and in the Pros, Maxes, and derivatives they can't saturate the chip's memory bus. It's not clear that the neural engine can either. (Very little is clear about the neural engine, sadly, though it seems to be designed for small tasks at very low power rather than big-boy inferencing.)

Especially given the strong rumors that Apple's datacenters are going to eat their own dog food, it would seem beneficial to put some type of matrix units, or at least outer-product units, into the GPU.


I don't buy that Apple would use their existing chips, unchanged, if they were going to build their own dedicated hardware for AI cloud clusters. They might use something existing for a pilot to give the software guys something to work with (which I'm willing to bet is where the M2 Ultra rumors come from) but they'd design a new chip for something they'd deploy for real. It is too inefficient to use existing chips that have a lot of die area devoted to stuff that isn't helping the AI cause, plus LPDDR just isn't appropriate to their needs for dedicated AI clusters.
 

mikegg

Golden Member
Jan 30, 2010
1,815
445
136
I think this slide from Nvidia sums up Apple's dilemma
View attachment 100239
Nvidia divides AI at the edge into 'Light AI' and 'Heavy AI'. Light AI is powered by NPUs (~45 TOPS) and Heavy AI is powered by dGPUs (up to 1,300 TOPS). This classification suits them perfectly, because they are a GPU company; they do not make CPUs or NPUs for PCs (at least for now).

Circling back to Apple: Apple does not make dGPUs. Their Apple Silicon combines the CPU, GPU and NPU into one chip. This then raises the question of how Apple is going to approach AI hardware.

Right now, in the M3 series used in Macs, the 17 TOPS NPU is evidently not powerful enough for heavy on-device AI. People running Heavy AI models such as Llama 3 on MacBooks have found that it's the GPU that gets utilised when running such models. This makes sense, because the GPU is more powerful than the NPU.

But in the recently unveiled M4, the CPU got SME, which on its own gives about 20 TOPS of INT8. The NPU also got a leg up, to 38 TOPS of INT8.

This then raises the question of what Apple's future AI hardware strategy is. Will they scale up the NPU to larger sizes/TOPS in the Pro/Max/Ultra SoCs? Or will the GPU remain the first-class citizen for running on-device Heavy AI? And where does the CPU with SME fit into this tapestry?
They'll likely continue to increase the NPU at a higher rate than the GPU, but it will never be able to run inference on the best models locally. So they'll put something like "Siri Pro" on the server while normal Siri stays local.
 
Mar 8, 2024
61
165
66
so I had a spare moment, bounced over to this thread, and saw the quotes below:




You guys do understand the whole “throttle on battery” is a software thing, right? I disabled it on my laptop. My system performs the same way on battery or plugged in.

And you, in turn, understand that 3 hours of full-tilt battery usage is a lot less than the full day's charge you get on any MacBook?
 