Discussion Apple Silicon SoC thread


Eug

Lifer
Mar 11, 2000
23,953
1,567
126
M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 Teraflops
82 Gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core count). Basically, Apple is taking the same approach with these chips as it does with the iPhones and iPads: just one SKU (excluding the X variants), which is the same across all iDevices (aside from the occasional slight clock speed difference).

EDIT:



M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:


M1 Ultra discussion here:


M2 discussion here:


Second Generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K H.264, H.265 (HEVC), and ProRes

M3 Family discussion here:


M4 Family discussion here:

 
Last edited:

FlameTail

Diamond Member
Dec 15, 2021
4,384
2,754
106
(a) They are not trivial to even think of. I saw NO-ONE in the academic literature suggesting these before Apple implemented them.
Wow, so this is stuff that even the GPU leader Nvidia doesn't have? That's impressive engineering by Apple.
(d) These improve performance on REAL code. But they generally don't do much for fake code (ie microbenchmarks), especially if the micro-benchmark doesn't even realize what it should be testing for. If marketing insists that it's important you look good on microbenchmarks...
That's Apple's philosophy of building processors targeting real-world workloads instead of benchmarks.
 

SpudLobby

Golden Member
May 18, 2022
1,027
695
106
OK so you're turning the question around - why have bigger caches on the Pro? Power. The extra size may not contribute all that much extra performance since it is further down the long tail of cache hits, but it costs over an order of magnitude more power to fetch a line from LPDDR than it does to fetch a line that's already resident in the SLC. Gotta pay for that extra GPU core's power draw somehow.

That's why the SLC is smaller on Apple Silicon than it is on iPhone. Because even the Macs running on battery have an order of magnitude more battery than an iPhone, so the extra SLC isn't needed there.
Again, I know this, my point was just that it’s interesting it didn’t show very much on SpecInt, or early reports in gaming, or GB6, at least in test devices.

It's there, but only like 5-10%. That's arguably expected, but it really is impressive that Apple is not milking cache as much as people believe.
I have also used the base M chips ironically to demonstrate this point — 8MB SLC and 12-16MB of cache shared between 4 P cores since M1. Really not that crazy.

But yes, the fact that the base-cut A18 still has 50% more SLC than the M chip (12 MB vs 8 MB) I suspect comes from internal empirical tests showing some drop-off point, that and/or internal budgeting.
 

SpudLobby

Golden Member
May 18, 2022
1,027
695
106
(a) They are not trivial to even think of. I saw NO-ONE in the academic literature suggesting these before Apple implemented them.

(b) Apple has patented them. And justifiably so. I see very few BS Apple patents, and these are certainly not amongst those.

(c) Even if you understand the idea, these are not trivial to implement.

(d) These improve performance on REAL code. But they generally don't do much for fake code (ie microbenchmarks), especially if the micro-benchmark doesn't even realize what it should be testing for. If marketing insists that it's important you look good on microbenchmarks...
To me the impressive thing is that Apple manages a very agile GPU architecture that has also consistently clocked modest to even high relative to Qualcomm's obvious wide-and-slow strategy, yet manages to match them anyway in raster workloads and blow them out elsewhere (though that might be changing with the 830's architecture). Apple's GPUs are clearly A++ IP.
 

DrMrLordX

Lifer
Apr 27, 2000
22,368
12,175
136
They certainly need a LOT of FP compute to do all the matrix multiplies for the MIMO and the FEC.

Okay, that makes sense (thanks to other respondents who said essentially the same thing).

Calling this TOPs may be a high-speed tweeting mistake. It may be a misspeak/mispronunciation by the presenter.

Okay that makes sense too. I wasn't sure if it was a typo or if it was Dr. Cutress being sarcastic/making fun of AI hype. Though if they somehow figure out how to use the NPU to handle high speed Wifi 8 traffic then that would certainly be novel.

@Doug S

Most people don't need all the features of Wifi 8, and let's be honest, most home routers won't support those speeds anyway. But if it offers lower latency in consumer products then great!
 

johnsonwax

Member
Jun 27, 2024
118
195
76
My back-of-the-envelope calculations suggest that by the time the rumored 2026 MacBook Pro redesign comes out, anyone upgrading from an M1-era machine should see a performance jump close to the initial Intel-to-Apple-Silicon jump. Hopefully Apple will have also improved the SSD controller's random performance by then, to keep the disk I/O subsystem proportional.
Given how garbage my top-of-the-line i7 MBP (which I replaced a year later with an M1 Max) performed in compute-heavy Python, I'm not sure that's possible. Sure, for a few minutes it was plenty fast and likely stacked up well in benchmarks, but there was nothing the i7 liked better than throttling. Throttling the M1 Max is damn near impossible. And I had borrowed some other i7 non-Apple laptops to make sure it wasn't just mine, and they weren't notably different.

Desktop users might have a different experience, but there's no universe you could make a MBP have that kind of performance gap unless it had the ability to go backward in time and give me results before I start the job.
 
Reactions: Nothingness

poke01

Diamond Member
Mar 8, 2022
3,035
4,008
106
Given how garbage my top-of-the-line i7 MBP (which I replaced a year later with an M1 Max) performed in compute-heavy Python, I'm not sure that's possible. Sure, for a few minutes it was plenty fast and likely stacked up well in benchmarks, but there was nothing the i7 liked better than throttling. Throttling the M1 Max is damn near impossible. And I had borrowed some other i7 non-Apple laptops to make sure it wasn't just mine, and they weren't notably different.

Desktop users might have a different experience, but there's no universe you could make a MBP have that kind of performance gap unless it had the ability to go backward in time and give me results before I start the job.
It's possible to do that jump again; they just need a 30-40% IPC increase.
 

The Hardcard

Senior member
Oct 19, 2021
300
386
106
Again, I know this, my point was just that it’s interesting it didn’t show very much on SpecInt, or early reports in gaming, or GB6, at least in test devices.

It's there, but only like 5-10%. That's arguably expected, but it really is impressive that Apple is not milking cache as much as people believe.
I have also used the base M chips ironically to demonstrate this point — 8MB SLC and 12-16MB of cache shared between 4 P cores since M1. Really not that crazy.

But yes, the fact that the base-cut A18 still has 50% more SLC than the M chip (12 MB vs 8 MB) I suspect comes from internal empirical tests showing some drop-off point, that and/or internal budgeting.
I still say photography/video is the central point; any other gains are side benefits. Apple is touting superior video capture, and every compute IP gets involved and must be performant, especially since the A18 Pro is touted as something that can be relied on to capture high-speed 4K log video in a production someone has invested money in. That aura fades if people can demonstrate that it can't be trusted.

I think the A18 Pro L2 cache is to make sure there are virtually never dropped frames in video capture processing.
 

SpudLobby

Golden Member
May 18, 2022
1,027
695
106
Chiplets are not some silver bullet. They have specific uses
- either optionality (the way AMD uses them for compute tiles) or
- put non-scaling features (IO, analog, SRAM) on a cheaper process (eg V-Cache)

Both of these are nuanced (eg N2, while not much denser than N5 for SRAM, will allow for FASTER SRAM - which you may care about...) And chiplet to chiplet transfers cost power and latency.

Apple already has a chiplet design in the Ultras, essentially matching point 1 (optionality).

Then you get Intel which, in their usual fashion of design by marketing, decided that Tiles was a cool new buzz-word and decided to build everything on them. To what benefit? Unclear to me or anyone else. I don't see anything in Lunar Lake or Meteor Lake that couldn't have been done better, cheaper, and at slightly lower latency and power on a monolithic design.
I still say photography/video is the central point; any other gains are side benefits. Apple is touting superior video capture, and every compute IP gets involved and must be performant, especially since the A18 Pro is touted as something that can be relied on to capture high-speed 4K log video in a production someone has invested money in. That aura fades if people can demonstrate that it can't be trusted.

I think the A18 Pro L2 cache is to make sure there are virtually never dropped frames in video capture processing.
Yes, this is what I think it probably is. Some internal teams weighed the cost-benefit of the extra L2 for very specific cases, and the creative/pro use cases justified it.

Doug is repeating things I already know (not being rude here), but I was specifically pointing out that we've seen gaming and real integer tests, and the perf/W does differ, but fairly marginally (5-8%). To be fair, that might be different with more background processes going on, but it's pretty obvious Apple's architecture has not been that dependent on cache gains since the A14 and the changes to the cache structure (they've changed things about the L2 and SLC, IIRC, so maybe they get more out of less now). So the question was: why bother? Video, plus some internal budgeting and a post-hoc pro justification, makes sense.
 

mavere

Member
Mar 2, 2005
190
4
81
Most people don't need all the features of Wifi 8, and let's be honest, most home routers won't support those speeds anyway. But if it offers lower latency in consumer products then great!
There are already some initial features under investigation for WiFi 8. White paper here. TL;DR is that most of it is aimed at improving the median/worst-case experience for mesh/commercial deployments. The only thing I see that's useful for single-router home use is dynamic sub-channel operation, where the router can choose to split its 160-320 MHz fat channel into smaller ones to simultaneously serve multiple lower-end devices that may not be compatible with the entire fat channel.
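
A crude sketch of what that dynamic sub-channel idea could look like (my own toy model in Python; the names, the 20 MHz granularity, and the allocation rule are all my assumptions, not anything from the actual draft spec or real AP firmware): the router splits its fat channel only when some attached clients can't use the full width.

# Toy model of "dynamic sub-channel" allocation: split a fat channel into
# smaller slices for clients that can't use the full width.
# Illustrative only; not based on actual WiFi 8 draft mechanics.

FAT_CHANNEL_MHZ = 320
MIN_SLICE_MHZ = 20

def allocate(clients):
    """clients: list of (name, max_width_mhz). Returns {name: assigned_mhz}."""
    # If every client supports the full channel, there's no need to split.
    if all(width >= FAT_CHANNEL_MHZ for _, width in clients):
        return {name: FAT_CHANNEL_MHZ for name, _ in clients}

    # Otherwise serve the narrow clients their small slices simultaneously
    # and hand whatever is left of the fat channel to the wide clients.
    remaining = FAT_CHANNEL_MHZ
    plan = {}
    for name, width in sorted(clients, key=lambda c: c[1]):
        slice_mhz = min(width, remaining)
        slice_mhz -= slice_mhz % MIN_SLICE_MHZ   # align to 20 MHz granularity
        plan[name] = slice_mhz
        remaining -= slice_mhz
    return plan

# One laptop that can do 320 MHz plus two cheap devices stuck at 20/40 MHz.
print(allocate([("doorbell", 20), ("plug", 40), ("laptop", 320)]))
# doorbell: 20, plug: 40, laptop: 260 (the remainder of the fat channel)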
 

name99

Senior member
Sep 11, 2010
565
463
136
Wow, so this is stuff that even the GPU leader Nvidia doesn't have? That's impressive engineering by Apple.
nVidia has its own impressive stuff, for example
- co-ordination across units larger than a threadgroup, spanning several cores OR
- guaranteed forward progress, even in the presence of tricky branches

As everyone knows I have contempt for Intel. I don't have that same contempt for nVidia; they're a worthy competitor and they're not sitting around doing nothing for five years while enjoying their semi-monopoly.
 

name99

Senior member
Sep 11, 2010
565
463
136
Okay, that makes sense (thanks to other respondents who said essentially the same thing).



Okay that makes sense too. I wasn't sure if it was a typo or if it was Dr. Cutress being sarcastic/making fun of AI hype. Though if they somehow figure out how to use the NPU to handle high speed Wifi 8 traffic then that would certainly be novel.
There certainly IS a place for "AI" in these new WiFi (and cellular, and video encoding) standards. The common thread is that each provides a wide range of options for encoding each unit; and while you can mathematically express the optimal encoding for each unit, that mathematical expression is practically useless. What you have is the equivalent of a search over a very wide possible space - so something similar to, eg, playing a game like chess or go...
And to the extent that one version of AI is techniques for finding good (if not absolutely optimal) solutions in such a vast space, that version of AI is useful for WiFi, cellular, and video.

But the actual AI part of the computation lives in the control plane, not the data plane, and doesn't require the massive compute of the data plane.
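
As a toy illustration of that "search over a wide space" framing (my own sketch in Python; the knob names, the scoring function, and all the numbers are made up, not from any real WiFi, cellular, or codec implementation), here is a minimal exhaustive search over per-unit encoding choices, with a stand-in model playing the role of the control-plane "AI":

# Toy sketch: control-plane "AI" as guided search over encoding choices.
# All names and numbers are illustrative, not from any WiFi/codec spec.
from itertools import product

# Hypothetical per-unit encoding knobs (the real space is far larger).
MODULATIONS = {"BPSK": 1, "QPSK": 2, "16QAM": 4, "64QAM": 6, "256QAM": 8}  # bits/symbol
CODING_RATES = [0.5, 0.66, 0.75, 0.83]
SPATIAL_STREAMS = [1, 2, 4]

def predicted_error(bits_per_symbol, rate, streams, snr_db):
    """Stand-in for a learned model predicting packet error rate."""
    stress = bits_per_symbol * rate * streams   # how aggressive the encoding is
    headroom = snr_db / 30.0                    # crude normalized link quality
    return max(0.0, min(1.0, (stress / 16.0) - headroom))

def best_encoding(snr_db, max_error=0.1):
    """Brute-force search here; real systems prune or learn the search instead."""
    best, best_tput = None, -1.0
    for (mod, bps), rate, streams in product(MODULATIONS.items(), CODING_RATES, SPATIAL_STREAMS):
        err = predicted_error(bps, rate, streams, snr_db)
        if err > max_error:
            continue
        tput = bps * rate * streams * (1.0 - err)   # relative throughput score
        if tput > best_tput:
            best, best_tput = (mod, rate, streams), tput
    return best, best_tput

if __name__ == "__main__":
    for snr in (5, 15, 30):
        print(snr, "dB ->", best_encoding(snr))

The search itself is cheap and lives in the control plane, exactly as described above; the heavy per-symbol math stays in the data plane.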
 

johnsonwax

Member
Jun 27, 2024
118
195
76
I still say photography/video is the central point; any other gains are side benefits. Apple is touting superior video capture, and every compute IP gets involved and must be performant, especially since the A18 Pro is touted as something that can be relied on to capture high-speed 4K log video in a production someone has invested money in. That aura fades if people can demonstrate that it can't be trusted.

I think the A18 Pro L2 cache is to make sure there are virtually never dropped frames in video capture processing.
That's really likely. A lot of the tech jumps in the iPhone were there to enable the camera - NVMe to stream data off the sensor, ANE (which until M4 was larger on A series than M due to this).
 

Doug S

Diamond Member
Feb 8, 2020
3,005
5,167
136
I still say photography/video is the central point; any other gains are side benefits. Apple is touting superior video capture, and every compute IP gets involved and must be performant, especially since the A18 Pro is touted as something that can be relied on to capture high-speed 4K log video in a production someone has invested money in. That aura fades if people can demonstrate that it can't be trusted.

I think the A18 Pro L2 cache is to make sure there are virtually never dropped frames in video capture processing.

That's a good point, though I think you meant to say SLC not L2 cache, since the SLC is where all the various units you mentioned "meet" while L2 is for the P cores only. Perhaps they did some testing and found they need a certain SLC size to optimize 4K video processing, and an even bigger SLC on a Pro model which has more/better cameras.
 

Mopetar

Diamond Member
Jan 31, 2011
8,200
7,027
136
Even with the minor node upgrade (N3E -> N3P), I think Apple engineers will try to achieve 5 GHz, just for the M5GHz memes

We'll know this was the case when they fall just a bit short and release the M4.9 instead.

Btw. Do we have any good idea of why Apple did 8/12 MB L2/SLC for the A18, and 16/24MB L2/SLC for the A18 Pro that isn’t just “marketing” (most people don’t know that or even see the slide about it, it’s an entirely different die too).

The Pro may be intended for people that do more multitasking and are using multiple apps in conjunction for some purpose.

The larger cache isn't beneficial for a single app or benchmarking as the L2 is already generally large enough to fit that single program's needs. Adding more cache does little for the benchmark, but in real-world use where someone is switching between multiple apps and expects them to be performant, having the larger cache means the data for the other application isn't being evicted, and swapping back and forth is considerably more snappy.

It's similar to the X3D CPUs from AMD where individual programs may not see much benefit, but the system as a whole does. I don't think that anyone tries to measure that in any meaningful way so it doesn't show up in the benchmarks.
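
A quick toy model of why that is hard to capture in a benchmark (purely illustrative Python, nothing like a real SLC; the capacities and working-set sizes are made up): a single app fits in either cache size, but once two apps ping-pong, only the larger cache keeps both working sets resident.

# Toy LRU model of why a bigger cache shows up when switching apps,
# but not in a single-app benchmark. Purely illustrative numbers.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()

    def access(self, addr):
        hit = addr in self.lines
        if hit:
            self.lines.move_to_end(addr)
        else:
            self.lines[addr] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)   # evict least-recently used
        return hit

def hit_rate(capacity, working_sets, switches=20, passes_per_switch=3):
    cache = LRUCache(capacity)
    hits = total = 0
    for i in range(switches):
        ws = working_sets[i % len(working_sets)]  # "foreground app" for this slice
        for _ in range(passes_per_switch):
            for addr in ws:
                hits += cache.access(addr)
                total += 1
    return hits / total

# Two apps, each with a 6k-line working set (either one fits alone in either cache).
app_a = [("A", i) for i in range(6000)]
app_b = [("B", i) for i in range(6000)]

print("single app, small cache :", hit_rate(8000,  [app_a]))
print("single app, large cache :", hit_rate(16000, [app_a]))
print("switching,  small cache :", hit_rate(8000,  [app_a, app_b]))
print("switching,  large cache :", hit_rate(16000, [app_a, app_b]))

The single-app hit rates come out nearly identical for both sizes, while the app-switching case only stays warm with the larger capacity, which is roughly the effect described above and the reason it never shows up in single-program benchmarks.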
 

johnsonwax

Member
Jun 27, 2024
118
195
76
It's similar to the X3D CPUs from AMD where individual programs may not see much benefit, but the system as a whole does. I don't think that anyone tries to measure that in any meaningful way so it doesn't show up in the benchmarks.
Which is kind of wild, since the X3D CPUs are hailed as the best for gaming, which is pretty much the definition of "individual programs seeking benefit".

I'll note that I was part of a gaming event about a year ago that was about pushing the game to its computational limits (single-core CPU/memory bound). I asked for some help with a design I had that was at the limit of my M1 Max, and the 7950X3D folks (I think top of the line at the time) couldn't run it at full speed. They were shocked to learn I was sitting on my patio doing that on a laptop, not some overclocked gaming PC (PCMasterRace comeuppance is fun, even when it's just some corner case).

I'm sure in benchmarks the M1 was slower, and the game was originally x86 and ported to ARM (in a couple of days according to devs) but that doesn't mean it couldn't have been intrinsically better suited to ARM. Most likely the game itself wasn't well represented by the benchmark.
 

The Hardcard

Senior member
Oct 19, 2021
300
386
106
That's a good point, though I think you meant to say SLC not L2 cache, since the SLC is where all the various units you mentioned "meet" while L2 is for the P cores only. Perhaps they did some testing and found they need a certain SLC size to optimize 4K video processing, and an even bigger SLC on a Pro model which has more/better cameras.
I meant the L2 specifically. A lot of extra work is placed on the CPU cores, from running all the other compute units to generating the final video file with timecoded and synced audio as well as the necessary metadata. In addition, the CPU has to supply realtime information to the user, incorporate realtime input from said user, and adjust the parameters for the NPU, GPU, and ISP accordingly at up to 120 FPS.

I think the extra cache contributes similarly to how it does in gaming.
 

SpudLobby

Golden Member
May 18, 2022
1,027
695
106
We'll know this was the case when they fall just a bit short and release the M4.9 instead.



The Pro may be intended for people that do more multitasking and are using multiple apps in conjunction for some purpose.

The larger cache isn't beneficial for a single app or benchmarking as the L2 is already generally large enough to fit that single program's needs. Adding more cache does little for the benchmark, but in real-world use where someone is switching between multiple apps and expects them to be performant, having the larger cache means the data for the other application isn't being evicted, and swapping back and forth is considerably more snappy.

This is exactly what I had hypothesized: it might show up more with multiple working sets being swapped constantly, which may not be possible to show very scientifically in a benchmark across multiple phones, etc.
It's similar to the X3D CPUs from AMD where individual programs may not see much benefit, but the system as a whole does. I don't think that anyone tries to measure that in any meaningful way so it doesn't show up in the benchmarks.
Yeah.
 

Meteor Late

Senior member
Dec 15, 2023
266
291
96
I mean, Apple should be able to hit 5 GHz easily with N3P. Easily doesn't mean without a big power penalty, though; it means they should be able to do it without special binning.
So if an M4 core consumes 6W in a given workload, I have no reason not to believe that with N3P it can do 5 GHz at 10W, honestly. It's not worth it of course, but I'm sure the capability is there.
My math would be that Apple can get from 4.5 GHz to 4.7 or 4.8 GHz "for free" by jumping to N3P alone (a 5% increase in frequency for a 5% increase in power), and then the extra 200-300 MHz can be reached by considerably increasing power consumption: an extra 5% frequency for something like a 40% power increase. Not being able to do so would mean Apple is already at the very top of the voltage curve, which I really doubt considering they sip power in single-core workloads.
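
For what it's worth, under the usual rough assumption that dynamic power scales with f * V^2, that math pencils out about like this (a minimal sketch; the voltages and the 1.00 V baseline are made up for illustration, not measured M4 figures):

# Back-of-envelope check of the "~5% free, last ~5% costs ~40% power" claim.
# Assumes dynamic power ~ f * V^2; all voltages/frequencies are invented.
def rel_power(freq_ghz, volts, base_freq=4.5, base_volts=1.00):
    return (freq_ghz / base_freq) * (volts / base_volts) ** 2

scenarios = [
    ("M4-class core today",        4.5, 1.00),
    ("N3P uplift, same voltage",   4.7, 1.00),   # the "for free" part of the jump
    ("Pushed to 5.0 GHz",          5.0, 1.15),   # big voltage bump near the top of the curve
]

for label, f, v in scenarios:
    print(f"{label:28s} {f:.1f} GHz  ~{rel_power(f, v):.2f}x power")
# Prints roughly 1.00x, 1.04x, and 1.47x: the last ~6% of frequency costs
# on the order of 40% more power, in line with the estimate above.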
 
Reactions: SpudLobby

jdubs03

Golden Member
Oct 1, 2013
1,155
799
136
I mean, Apple should be able to hit 5 GHz easily with N3P. Easily doesn't mean without a big power penalty, though; it means they should be able to do it without special binning.
So if an M4 core consumes 6W in a given workload, I have no reason not to believe that with N3P it can do 5 GHz at 10W, honestly. It's not worth it of course, but I'm sure the capability is there.
My math would be that Apple can get from 4.5 GHz to 4.7 or 4.8 GHz "for free" by jumping to N3P alone (a 5% increase in frequency for a 5% increase in power), and then the extra 200-300 MHz can be reached by considerably increasing power consumption: an extra 5% frequency for something like a 40% power increase. Not being able to do so would mean Apple is already at the very top of the voltage curve, which I really doubt considering they sip power in single-core workloads.
Couple things.

Usually the frequency uplift is quoted at iso-power. And as for the additional increase in frequency, that would be at iso-architecture.
 
Reactions: SpudLobby

Meteor Late

Senior member
Dec 15, 2023
266
291
96
Couple things.

Usually the frequency uplift is quoted at iso-power. And as for the additional increase in frequency, that would be at iso-architecture.

Yeah, of course. I am assuming you take an M4 P core and fab it on N3P, without architectural changes. I also assume that the 5% uplift in performance at the same power is realized at a lower point of the voltage curve; that's why I'm assuming it's not as rosy at a higher point, which is where 4.5 GHz sits. So I'm giving a 5% power penalty to get there, meaning no improvement in efficiency at max frequency without improving the architecture.
 
Reactions: SpudLobby

FlameTail

Diamond Member
Dec 15, 2021
4,384
2,754
106
Reactions: Mopetar

mvprod123

Senior member
Jun 22, 2024
203
229
76
Rumour: Apple cancels M4 Extreme


I just so want them to make an Extreme chip because it would be a beast performance wise, and a marvel in terms of engineering.
I think they are still in the process of testing the new packaging. So M5 Extreme is quite possible.
 