Discussion Apple Silicon SoC thread


Eug

Lifer
Mar 11, 2000
23,924
1,525
126
M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 Teraflops
82 Gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core count). Basically, Apple is taking the same approach with these chips as it does with the iPhones and iPads: just one SKU (excluding the X variants), which is the same across all iDevices (aside from maybe slight clock speed differences occasionally).

EDIT:



M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:


M1 Ultra discussion here:


M2 discussion here:


M2
Second-generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K H.264, HEVC (H.265), ProRes

M3 Family discussion here:


M4 Family discussion here:

 

MS_AT

Senior member
Jul 15, 2024
364
798
96
Intel build used 64GB and I doubt using 64GB on 9950X will make it beat the M4 Max let alone make up for the power consumption difference.
Thanks for pointing it out; I have taken another look at the data. Since I am not using this benchmarking framework myself I am not sure: do I read this right, that the Intel and AMD CPUs were using GCC 13.2 to compile the codebases in the test?

The M4 Max entry lists gcc 16, clang 16 and Xcode 16. This is curious, because GCC 16 does not exist yet. A little digging suggests that macOS will alias gcc to clang if GCC itself is not installed on the system.

So if I read it right, the M4 Max was compiling with clang, and the x64 systems with GCC. At work, clang on the same machine can range from as fast as GCC to almost twice as fast.

So, tl;dr: it seems to me that different compilers were used, and from my experience the M4 Max was using the faster one. Unless I misread the data, that is; then feel free to ignore my rambling.

Of course at an iso-compiler comparison the M4 Max might win, and compiler choice will not reduce the power draw, but I would like to see such an apples-to-apples comparison first.
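
For anyone who wants to check the aliasing on their own machine, here is a minimal sketch (just standard predefined compiler macros, nothing tied to the benchmark harness): build it with whatever the harness invokes as gcc and see which toolchain actually answers.

/* compiler_check.c - print which compiler actually built this binary.
 * On macOS, /usr/bin/gcc is typically a shim for Apple clang, which still
 * defines __GNUC__ for compatibility, so test __clang__ first. */
#include <stdio.h>

int main(void) {
#if defined(__clang__)
    printf("clang %d.%d.%d\n", __clang_major__, __clang_minor__, __clang_patchlevel__);
#elif defined(__GNUC__)
    printf("gcc %d.%d.%d\n", __GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__);
#else
    printf("unknown compiler\n");
#endif
    printf("version string: %s\n", __VERSION__);
    return 0;
}

On a stock Mac, "gcc compiler_check.c && ./a.out" reports clang, which is presumably how a harness ends up labelling an Xcode toolchain as "gcc 16".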
 

poke01

Platinum Member
Mar 8, 2022
2,581
3,409
106
Thanks for pointing it out; I have taken another look at the data. Since I am not using this benchmarking framework myself I am not sure: do I read this right, that the Intel and AMD CPUs were using GCC 13.2 to compile the codebases in the test?

The M4 Max entry lists gcc 16, clang 16 and Xcode 16. This is curious, because GCC 16 does not exist yet. A little digging suggests that macOS will alias gcc to clang if GCC itself is not installed on the system.

So if I read it right, the M4 Max was compiling with clang, and the x64 systems with GCC. At work, clang on the same machine can range from as fast as GCC to almost twice as fast.

So, tl;dr: it seems to me that different compilers were used, and from my experience the M4 Max was using the faster one. Unless I misread the data, that is; then feel free to ignore my rambling.

Of course at an iso-compiler comparison the M4 Max might win, and compiler choice will not reduce the power draw, but I would like to see such an apples-to-apples comparison first.
Good catch.

I really hope Intel has something great with Nova Lake. I can accept the longer compile times, but the power consumption difference is too much. I do think Arctic Wolf will diminish this if the rumoured IPC increase is true.
 

Doug S

Platinum Member
Feb 8, 2020
2,888
4,911
136

The one part that really stood out to me:

Total energy used to complete one thread is therefore over 23 J when run on P cores, and less than 1.7 J when run on E cores. E cores therefore use only 7% of the energy that P cores do performing the same task.
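
The quoted figures are easy to sanity-check; a trivial sketch using the article's ~23 J and ~1.7 J numbers (not measured here):

/* Rough check of the quoted per-thread energy figures. */
#include <stdio.h>

int main(void) {
    double p_core_joules = 23.0;  /* energy per thread on a P core (article's figure) */
    double e_core_joules = 1.7;   /* energy per thread on an E core (article's figure) */
    printf("E-core energy is %.1f%% of P-core energy\n",
           100.0 * e_core_joules / p_core_joules);  /* prints ~7.4% */
    return 0;
}

Since the article says "over 23 J" and "less than 1.7 J", the true ratio is a bit below 7.4%, hence the "only 7%".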
 

Eug

Lifer
Mar 11, 2000
23,924
1,525
126
The one part that really stood out to me:
I remember back in the day there was an ad/demo from Cyrix/VIA showing they could run a game on their chip with no heatsink without it being fried to a crisp, and that was considered absolutely crazy and shocking. However, that chip was way, way slower than the other x86 chips at the time, maybe >5 years behind.

Contrast that to M4.
 

johnsonwax

Member
Jun 27, 2024
96
160
66
At low QoS, and so running at 1GHz…
Running at full speed (ie as an augmentation to P-cores) it’s not quite so impressive, but still pretty good.
I don't think they were ever designed to augment P-cores; in a Unix-based OS there are always plenty of other processes to task the E-cores with, and the scheduler is designed to ensure that those processes don't touch the P-cores, which is probably the bigger performance benefit.
 

name99

Senior member
Sep 11, 2010
526
412
136
I don't think they were ever designed to augment P-cores; in a Unix-based OS there are always plenty of other processes to task the E-cores with, and the scheduler is designed to ensure that those processes don't touch the P-cores, which is probably the bigger performance benefit.
I am referring to when enough "P"-threads exist that some "P-work" overflows to E-cores.
It is primarily in those circumstances that the E-cores run at ~2.6GHz rather than 1GHz, delivering a rather better performance(!), but also at not quite so low an energy usage.
 

digitaldreamer

Junior Member
Mar 23, 2007
20
14
81
I am referring to when enough "P"-threads exist that some "P-work" overflows to E-cores.
It is primarily in those circumstances that the E-cores run at ~2.6GHz rather than 1GHz, delivering a rather better performance(!), but also at not quite so low an energy usage.
Which brings up a question I've always wondered: How does the core know when to run at full clock speed or not? And, how does the microprocessor know what threads to run on the P-cores and which to run on the E-cores?
 

The Hardcard

Senior member
Oct 19, 2021
271
351
106
Which brings up a question I've always wondered: How does the core know when to run at full clock speed or not? And, how does the microprocessor know what threads to run on the P-cores and which to run on the E-cores?

The frequency ramp is just a time function: if the code is not complete at particular intervals, the core will ramp to top frequency, within milliseconds. macOS has several Quality of Service levels (I believe four are provided to developers), with all but the lowest going to an E core first, quickly maxing it out, then switching to the P cores and filling them. The E cores engage when there are more threads than P cores.

Apple’s OSes have a thread pool system originally called Grand Central Dispatch (now just Dispatch) that is the preferred way for developers to go multithreaded.

The exception is QoS 9, which allows a developer to declare their code as background level. Those threads stay on an E core at low speed; background-level code never leaves the E cores. If only background threads are running, as many as 30 or more of them will jam into the E cores, with the P cores remaining off. However, the frequency will rise to max if there are multiple background tasks.
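
For concreteness, here is a minimal sketch of how those QoS tiers look from the developer's side using Dispatch, written as a plain C command-line tool (the work in the blocks is just a placeholder):

/* qos_dispatch.c - submit work at two different QoS tiers via Dispatch (GCD).
 * QOS_CLASS_BACKGROUND has the raw value 0x09, the "QoS 9" mentioned above;
 * the scheduler keeps that tier on the E cores at low clocks.
 * Build on macOS with: clang qos_dispatch.c -o qos_dispatch */
#include <dispatch/dispatch.h>
#include <sys/qos.h>
#include <stdio.h>

int main(void) {
    dispatch_group_t group = dispatch_group_create();

    /* Background tier: stays on the E cores at low frequency. */
    dispatch_group_async(group, dispatch_get_global_queue(QOS_CLASS_BACKGROUND, 0), ^{
        printf("background work (E cores)\n");
    });

    /* User-initiated tier: eligible for the P cores, with E cores as overflow. */
    dispatch_group_async(group, dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0), ^{
        printf("user-initiated work (P cores preferred)\n");
    });

    /* Wait for both blocks to finish before exiting. */
    dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
    return 0;
}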
 

name99

Senior member
Sep 11, 2010
526
412
136
Which brings up a question I've always wondered: How does the core know when to run at full clock speed or not? And, how does the microprocessor know what threads to run on the P-cores and which to run on the E-cores?
The OS has some control; for example, background tasks are marked as such (given a QoS flag) and will run on an E-core at the lowest frequency. All well-written macOS/iOS code indicates a QoS (the role of the code, e.g. background code or UI code or whatever).

Other tasks indicate when they need to be completed, and give intermediate progress reports; frequency can be set based on this. This might seem a rare case, but think games (where you need to complete the frame, but being faster than 60Hz or whatever won't help) or media playback, or even UI animations.

Meanwhile at the CPU level itself, the hardware can track the type of code that is running and slow the clock if appropriate. For example if the code is primarily moving data around (so that most of the time the CPU is waiting on DRAM) after this happens a few times, power will be saved by reducing the CPU clock until we exit this DRAM-limited portion of the code.

It's an imperfect business but Apple, far more so than anyone else, has made it work well through a combination of
- aggressive tracking of many many metrics across every IP block in the SoC
- reporting those metrics to centralized controllers (eg first to a CPU, then upward to a cluster controller, then upward to a SoC frequency/power controller)
- giving the OS quick and easy access to some of those metrics
- aggressively encouraging developers (and making it easy to do so) to indicate the ROLE of each piece of code they provide to the OS, so that the OS knows how important it is.
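
To make the "indicate the ROLE" point concrete, here is a minimal sketch of a thread declaring its own QoS class through the macOS pthread extensions (the work itself is a placeholder; Dispatch is the higher-level way to do the same thing):

/* thread_qos.c - a thread declaring its role so the OS can pick a core type
 * and frequency for it. Build on macOS with: clang thread_qos.c -o thread_qos */
#include <pthread.h>
#include <pthread/qos.h>
#include <stdio.h>

static void *maintenance_work(void *arg) {
    (void)arg;
    printf("deferrable work running at utility QoS\n");
    return NULL;
}

int main(void) {
    /* Declare the worker's QoS up front via thread attributes: utility class
     * is latency-tolerant, so the scheduler may choose an E core and/or a
     * lower clock for it. */
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_set_qos_class_np(&attr, QOS_CLASS_UTILITY, 0);

    pthread_t worker;
    pthread_create(&worker, &attr, maintenance_work, NULL);

    /* A thread can also lower its own QoS at runtime; here the main thread
     * demotes itself while it merely waits for the worker. */
    pthread_set_qos_class_self_np(QOS_CLASS_UTILITY, 0);

    pthread_join(worker, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}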
 

name99

Senior member
Sep 11, 2010
526
412
136
Don’t rule it out. In 2026, there will be a president willing to set sky high tariffs, even 200 to 500 percent to force what he thinks should happen and there is chatter that some on his team want US companies to use US fabs, for now that would be Intel.
Apple are probably endlessly engaged in talking to Intel about what they offer, and looking at their tools. This is just common sense - when I was there we, for example, spent some time talking to IBM about how Cell worked and what we could do with it. Then people see who's present at what parking lot (or even someone within Intel spills who was at a meeting last week) and the rumors begin.

But talking doesn't turn into action unless the item being talked about actually delivers some value. That wasn't true for Cell, and I suspect it won't be true for Intel as of 2026.
However I wouldn't be surprised if Apple try to "compile" the A20 design to Intel fab specs and (depending on what progress looks like, and maybe whether Intel will spring for the masks...) even try to run a test wafer and see what happens.

It's obviously in Intel's interests to go along with this as much as possible. Nothing would be more convincing that their foundry is not a joke than landing Apple. And even if the 2026 Apple test run fails, presumably the fab should learn from it what the most demanding customer in the world requires.

At the technical level, the next interesting step is BSPDN. GAA, sure, but that's more of the same, and everyone's doing the same thing. With BSPDN, Intel has chosen a path that is (probably) easier but also (probably) not as performant or capable as TSMC's. This (maybe) gets Intel bragging rights and the ability to cheer FRIST!!! for a year, but I suspect Apple will accept the delay and prioritize the extra capabilities of the TSMC path over Intel's. Apple are already filing patents for how to design denser SRAM based on backside signal routing, and as far as I can tell, PowerVia, at least what we have seen of it, doesn't give them the functionality they need for this.
 

jdubs03

Golden Member
Oct 1, 2013
1,079
746
136
Don’t rule it out. In 2026, there will be a president willing to set sky high tariffs, even 200 to 500 percent to force what he thinks should happen and there is chatter that some on his team want US companies to use US fabs, for now that would be Intel.
Despite all the talk, and his prior statements going back over 30 years, his economic team will surely know that he'd tank the technology sector, and specifically Apple, if he implements those sky-high tariffs. 60% is what he originally said, and that would be pretty bad as it is.
 

DrMrLordX

Lifer
Apr 27, 2000
22,184
11,887
136
Don’t rule it out. In 2026, there will be a president willing to set sky high tariffs, even 200 to 500 percent to force what he thinks should happen and there is chatter that some on his team want US companies to use US fabs, for now that would be Intel.
All that manufacturing is moving to Vietnam anyway to avoid high costs in China.
 

Doug S

Platinum Member
Feb 8, 2020
2,888
4,911
136
However I wouldn't be surprised if Apple try to "compile" the A20 design to Intel fab specs and (depending on what progress looks like, and maybe whether Intel will spring for the masks...) even try to run a test wafer and see what happens.

I really doubt they'd use a full-sized SoC like the A20 for such testing. I'm willing to bet they already have a test chip design, with just enough functionality to verify performance and yield, that they use in the early (pre-risk) stages to characterize future TSMC processes. It would be a lot cheaper to port that design to Intel's foundry than a full-featured SoC.

If Apple starts using Intel's foundry they'd start small, with something lower volume that doesn't impact the timelines of their most critical products. Maybe make SoCs for the Watch; if a new model was delayed by a quarter or two it doesn't impact them too much. Or maybe make cellular modems for the Watch and non-Pro iPad: neither is critical to the functionality of the product, since cellular is optional in those and, at least in the Watch's case, lower spec than the modems going into the phones.
 

oak8292

Member
Sep 14, 2016
112
116
116
I really doubt they'd use a full-sized SoC like the A20 for such testing. I'm willing to bet they already have a test chip design, with just enough functionality to verify performance and yield, that they use in the early (pre-risk) stages to characterize future TSMC processes. It would be a lot cheaper to port that design to Intel's foundry than a full-featured SoC.

If Apple starts using Intel's foundry they'd start small, with something lower volume that doesn't impact the timelines of their most critical products. Maybe make SoCs for the Watch; if a new model was delayed by a quarter or two it doesn't impact them too much. Or maybe make cellular modems for the Watch and non-Pro iPad: neither is critical to the functionality of the product, since cellular is optional in those and, at least in the Watch's case, lower spec than the modems going into the phones.
Apple probably isn’t doing test chips at Intel as ARM is working with Intel to provide physical IP on ARM cores. The relative performance of the ARM cores will give Apple the data they need and then they will deal with capacity issues.

“IFS and Arm will undertake design technology co-optimization (DTCO), in which chip design and process technologies are optimized together to improve power, performance, area and cost (PPAC) for Arm cores targeting Intel 18A process technology.”

 

johnsonwax

Member
Jun 27, 2024
96
160
66
Don’t rule it out. In 2026, there will be a president willing to set sky high tariffs, even 200 to 500 percent to force what he thinks should happen and there is chatter that some on his team want US companies to use US fabs, for now that would be Intel.
That presumes Intel would be able to make an A20 that wasn't slower than a TSMC A19. If there were such a market disruption, I don't see how Apple could avoid a cycle without new silicon.

I mean, Apple's A/M series volume is roughly equal to Intel's total volume - including their stuff on older nodes. There is no universe in which even if Intel had a competitive node, they would remotely have the volume that Apple requires.
 