Discussion Apple Silicon SoC thread


Eug

Lifer
Mar 11, 2000
23,926
1,527
126
M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 Teraflops
82 Gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core count). Basically, Apple is taking the same approach with these chips as it does with the iPhones and iPads: just one SKU (excluding the X variants), which is the same across all iDevices (aside from occasional slight clock speed differences).

EDIT:



M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:


M1 Ultra discussion here:


M2 discussion here:


Second Generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K H.264, HEVC, and ProRes

M3 Family discussion here:


M4 Family discussion here:

 
Last edited:

The Hardcard

Senior member
Oct 19, 2021
271
353
106
That presumes Intel would be able to make an A20 that wasn't slower than a TSMC A19. If there were such a market disruption, I don't see how Apple could avoid a cycle without new silicon.

I mean, Apple's A/M series volume is roughly equal to Intel's total volume - including their stuff on older nodes. There is no universe in which even if Intel had a competitive node, they would remotely have the volume that Apple requires.

If only technical considerations weren't so far down on the list when stands are taken for political righteousness. I don't doubt Apple has numerous agents working hard to convince the incoming administration to leave them alone in picking where and how Apple Silicon is fabbed.

If they are really smart, they have made powerful, dazzling presentations that show how using TSMC will allow Apple to make the greatest, most glorious contributions to the new age of American supremacy.

The “buy American, build in America” faction has fire in the belly. With an indefinite hold on the Supreme Court, a 4 year hold on the White House, and a 2 year lock on both legislative houses, don’t rule anything out.
 

name99

Senior member
Sep 11, 2010
526
412
136
Really doubt they'd use a full sized SoC like A20 for such testing. I'm willing to bet they already have a test chip design that has just enough functionality to verify performance and yield that they already use in early stages (pre risk) to characterize future TSMC processes. It would be a lot cheaper to port that design to Intel's foundry than a full featured SoC.

If Apple starts using Intel's foundry they'd start small with something lower volume that doesn't impact the timelines of their most critical products. Maybe make SoCs for the Watch - if a new model was delayed by a quarter or two it doesn't impact them too much. Or maybe make cellular modems for the Watch and non-Pro iPad, neither are critical to the functionality of the product since it is optional in those and at least in the Watch's case lower spec than the ones going into the phones.
Of course you start with a test chip.

But at some point you need proof that Intel can make your actual big boy chip, at the yields required...

You don't want to learn *in production* about something neither you nor Intel thought of (turns out PowerVia interacts badly with top metal -- but only once you have 20+ metal layers -- or with large MIM capacitors, or whatever).
 

name99

Senior member
Sep 11, 2010
526
412
136
If only technical considerations weren't so far down on the list when stands are taken for political righteousness. I don't doubt Apple has numerous agents working hard to convince the incoming administration to leave them alone in picking where and how Apple Silicon is fabbed.

If they are really smart, they have made powerful, dazzling presentations that show how using TSMC will allow Apple to make the greatest, most glorious contributions to the new age of American supremacy.

The “buy American, build in America” faction has fire in the belly. With an indefinite hold on the Supreme Court, a 4 year hold on the White House, and a 2 year lock on both legislative houses, don’t rule anything out.
Guys, can we move this elsewhere?
There are a million places on the internet to discuss opinions about Trump; let's try to keep this one of the few havens that is technical.
 

mvprod123

Member
Jun 22, 2024
186
198
76
Testing M4 Pro access latency curves for large/small cores

L1d: 128 KB for the large cores, 64 KB for the small cores, both 3 cycles (4 cycles for a non-simple pointer chase).
For a 4.5 GHz large core, its L1 is top-of-class among current processors in absolute latency, cycle count, and capacity.

L2: 16+16 MB for the large cores, ranging from 27 cycles (near) to 90+ cycles (far); 4 MB at 14-15 cycles for the small cores. The large-core L2's structure is easier to understand from the bandwidth data.
 

mvprod123

Member
Jun 22, 2024
186
198
76

Single-threaded bandwidth on the M4 Pro, compared to x86. Unlike the latency test, the bandwidth test makes it easy to see that a single core can access all 32 MB of L2 across both clusters at full speed, with bandwidth staying near 120 GB/s.

It is also easy to see that Apple's big advantage over x86 is its 128-bit SIMD throughput, whereas Zen 5 needs 256/512-bit SIMD to extract the full bandwidth from each cache level.
 
Reactions: name99 and Gideon

mvprod123

Member
Jun 22, 2024
186
198
76


Lastly, multi-core: the current-generation M4 Pro can sustain 220+ GB/s of memory bandwidth with read-only traffic from a single 5-core cluster, no longer hitting the single-cluster bandwidth limitation of the M1 era. This is probably because a P cluster can now not only use another P cluster's cache, but also read and write memory via the other cluster's data path.

The memory bandwidth of 3 small cores is about 44 GB/s (32 GB/s for a single core); the cluster-level bottleneck is more obvious there.
 

FlameTail

Diamond Member
Dec 15, 2021
4,238
2,594
106
So M4 Pro (and presumably M4 Max) has the same amount of L2 cache as M3 Max (16 MB + 16 MB + 4 MB).

It's interesting that M4 Pro's P-cores can use the L2 cache of another cluster. I wonder if this is a new behaviour introduced in the M4 generation, or was it there in M3?

Also, the L1 cache sizes of both P and E cores are the same from M1 to M4:

       P-core    E-core
L1i    192 KB    128 KB
L1d    128 KB    64 KB

Will M5 bring a change?
 
Last edited:

jdubs03

Golden Member
Oct 1, 2013
1,079
746
136
Some possibility of Apple using Intel 18A with A20 is a rumor from Weibo. Next year's iPhone with A19 chip is N2P.
N3P*

Count me highly skeptical that Intel will secure orders for the latest and greatest A and M series chips in 2026 or any other year. It would take a major event like an invasion of Taiwan to change that. They should have a backup plan, like others have suggested, but it’s unlikely to happen unless Taiwan is invaded. Apple would need to see the performance, yields, and cost of the 18A or A-P chips to be comparable to the N2 and A16 chips for them to consider switching.
 

johnsonwax

Member
Jun 27, 2024
96
160
66
N3P*

Count me highly skeptical that Intel will secure orders for the latest and greatest A and M series chips in 2026 or any other year. It would take a major event like an invasion of Taiwan to change that. They should have a backup plan, like others have suggested, but it’s unlikely to happen unless Taiwan is invaded. Apple would need to see the performance, yields, and cost of the 18A or A-P chips to be comparable to the N2 and A16 chips for them to consider switching.
Right, and because supply chains rule the world: if Intel were lining up equipment contracts for 18A, what kind of volume were they planning for? Because you don't just tack Apple's 250M units onto your existing volume. Again, Apple's volume exceeds Intel's current volume. How far out would they need to line up the many equipment suppliers to be prepared to survive Apple flying in from orbit (and to convince Apple to do so)?
 

LightningZ71

Golden Member
Mar 10, 2017
1,910
2,260
136
I'm not even sure that the combined capacity of Samsung's latest process at their rumored sub-25% yields, plus all of Intel's capacity at 18A, could even handle half of Apple's leading-product volume...
 

name99

Senior member
Sep 11, 2010
526
412
136

Single-threaded bandwidth on the M4 Pro, compared to x86. Unlike the latency test, the bandwidth test makes it easy to see that a single core can access all 32 MB of L2 across both clusters at full speed, with bandwidth staying near 120 GB/s.

It is also easy to see that Apple's big advantage over x86 is its 128-bit SIMD throughput, whereas Zen 5 needs 256/512-bit SIMD to extract the full bandwidth from each cache level.
This statement is, in fact, misleading.
Apple can get that L1 load throughput (48B/cycle) with integer load-pair.

The paths from L1 into the core are 16 B wide (so two 64-bit integer registers, or ONE NEON register), and in fact if you submit NEON load-pair instructions they are "cracked" (kinda) into two successive 16 B loads from the L1.

Point is that Apple really has the sweet spot here. With far fewer wires than x86 (none of these crazy 256-bit or 512-bit paths to L1) they get far higher bandwidth in the ways that matter, namely better bandwidth for latency-sensitive code (i.e., up to 6 integer loads per cycle).

The vector throughput seems like a problem, but any real throughput task is larger than the L1D, so it doesn't matter how fast the L1D loads are, soon enough you're running at L2 speeds anyway.
My guess is that Apple have plenty of simulations as to the real code they run on NEON, and have good reasons to maintain the same 16B width to L1 that they had in the M1.
(And of course any really really serious throughput code will bypass L1D anyway, because it will be running as vectors and matrices on AMX/SME, tapping directly into L2).
 

johnsonwax

Member
Jun 27, 2024
96
160
66
I'm not even sure that the combined capacity of Samsung's latest process at their rumored sun 25% yields combined with all of Intel's capacity at 18 could even handle half of Apple's leading product volume...
Yeah, that's my sense too. Tracking manufacturing _volume_ is usually more important to predicting Apple's moves than tracking manufacturing _advancement_. And they trick us a lot. That secret DMG Mori plant that was cranking out mills just for Apple meant that people tracking the mill market to try to understand Apple's potential volume for aluminum unibody enclosures got it all wrong, because those mills were all out of view.

The dependencies in the semi manufacturing market make it hard to hide these moves. If TSMC is ordering equipment for volume that assumes Apple as a customer, I don't think there's enough equipment in the supply chain for leading nodes to allow for Apple to go anywhere else. Intel would have to buy that from TSMC or that decision would need to be made early enough to adjust those contracts. Seems to me that window has to be closing if not closed for 2026.
 
Reactions: moinmoin

name99

Senior member
Sep 11, 2010
526
412
136
Testing M4 Pro access latency curves for large/small cores

L1d: 128 KB for the large cores, 64 KB for the small cores, both 3 cycles (4 cycles for a non-simple pointer chase).
For a 4.5 GHz large core, its L1 is top-of-class among current processors in absolute latency, cycle count, and capacity.

L2: 16+16 MB for the large cores, ranging from 27 cycles (near) to 90+ cycles (far); 4 MB at 14-15 cycles for the small cores. The large-core L2's structure is easier to understand from the bandwidth data.
27 cycles for P L2 seems high! M1 was around 16 cycles.
But the actual access time is much the same in both cases, about 5.5 ns. In principle you could imagine this is a physical limit, but I'd expect almost every element of the lookup to scale with core frequency.

So I guess this is a deliberate choice? Allow things to take more cycles as some combination of
- 6 rather than 4-wide access
- probably more smarts in terms of tracking lines to decide which get replaced when replacement is necessary?

Also he seems to think the P-L2 is 32MB. I'm not at all convinced. What I see is that bandwidth is flat out to 32MB, then drops. But big latency jumps at 3M out to ~12..15M.
Suggests to me something like a per-core near L2, with (some fraction of) the rest of the L2 available at a few more cycles. (BTW, having a large L2 provide faster or slower service depending on how far the data needs to move seems an obvious optimization. But it's not free, because you now have to worry about what happens when data from both a far segment and a near segment arrive at the same time... You need to add extra buffering and checks to deal with these sorts of collisions.)

One possible model is per-core L2 segment of about 2.5M, total L2 is 15M, with another 16M or so for SLC.
So latency jumps at ~2.5, ~15M, ~31M, but flat throughput all the way till we have to hit DRAM?

(My preference, for these latency plots, is to provide a linear/linear plot around each apparent jump point. When some of the data can be serviced from the L2 and some from SLC, you'd expect linear-ish transitions around the jump point, but once you go to log-log, it's harder to see the jump clearly :-( )
 

name99

Senior member
Sep 11, 2010
526
412
136
So M4 Pro (and presumably M4 Max) has same amount of L2 cache as M3 Max (16 MB + 16 MB + 4 MB).

It's interesting that M4 Pro's P-cores can use the L2 cache of another cluster. I wonder if this is a new behaviour introduced in the M4 generation, or was it there in M3?

Maybe, but I don't see definite proof in the graphs. Might be easier to disentangle w/ linear plots.

Since before M1, Apple has in principle had the possibility of using SRAM in unused IP blocks to augment the SLC. This works because the SLC has tags that cover every coherent IP block.
But it's going to be slower than the SLC, not faster, because the request goes from L2 to SLC, the tags hit in the other P L2, and so the request has to be routed downwards to that other P L2. You'd need a fine-grained linear-linear latency plot to be sure you understood what was happening.
The main reason to do this, I think, is less performance than power (as always...), if it's cheaper to move data around to that secondary L2 than to hit DRAM.

The question is, when does Apple do this? One use case that seems reasonable to me is when moving data around between throughput blocks (eg from camera to media encoder). If you can temporarily stash the data in an unused L2, you can avoid either
- cost of transiting through DRAM OR
- cost of throwing away valuable data in SLC
and you know how long you will need to hold onto the data, so you can do the sums about whether it is worth doing.

For CPU access it seems to me more iffy. You're less sure about whether the data will be reused again, and if so when. Keeping the L2 powered on indefinitely is not free.
So my guess is that, in M1 and still in M4, they play these games for DMA'd data, maybe even for GPU and ANE data, but not so much for CPU which is the case where you have the least ability to predict the likely future use of the data.
 
Reactions: FlameTail

Doug S

Platinum Member
Feb 8, 2020
2,888
4,912
136
Some possibility of Apple using Intel 18A with A20 is a rumor from Weibo. Next year's iPhone with A19 chip is N2P.

There is zero chance of that. Apple isn't going to want to dual-source iPhone SoCs, especially not with what is essentially a "startup foundry". Nor would Intel want to impact their own production capacity by taking on Apple as a customer until they have the Ohio fabs online -- they learned that lesson the hard way with their 14nm shortages caused by producing modems for Apple. Even producing the non-Pro versions of an iPhone SoC means running enough volume for, at a guess, 75 million SoCs a year. The volume for the Pro line is even higher.
 

LightningZ71

Golden Member
Mar 10, 2017
1,910
2,260
136
Again, I don't doubt that Apple does their due diligence and lays the groundwork at each competing fab for product, should things go south in Taiwan. It would be foolish not to, and the cost would be relative pocket change for them.
 

Doug S

Platinum Member
Feb 8, 2020
2,888
4,912
136
Again, I don't doubt that Apple does their due diligence and lays the groundwork at each competing fab for product, should things go south in Taiwan. It would be foolish not to, and the cost would be relative pocket change for them.

I think the TSMC fabs in Arizona are the backup plan, and they probably have some deal in place giving them first dibs of that output if Taiwan halts production. It would mean having an older node so instead of getting newer/better SoCs with the next iPhone/Mac you'd get more of the same, but that would still put them way ahead of everyone else who couldn't ship product at all.
 

jdubs03

Golden Member
Oct 1, 2013
1,079
746
136
I think the TSMC fabs in Arizona are the backup plan, and they probably have some deal in place giving them first dibs of that output if Taiwan halts production. It would mean having an older node so instead of getting newer/better SoCs with the next iPhone/Mac you'd get more of the same, but that would still put them way ahead of everyone else who couldn't ship product at all.
Egh. That’s a shit deal.
 