Discussion Apple Silicon SoC thread


Eug

Lifer
Mar 11, 2000
23,953
1,567
126
M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 teraflops
82 gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from the GPU core count). Basically, Apple is taking the same approach with these chips as it does with the iPhones and iPads: just one SKU (excluding the X variants), which is the same across all iDevices (aside from occasional slight clock speed differences).

EDIT:



M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:


M1 Ultra discussion here:


M2 discussion here:


Second Generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K h.264, h.265 (HEVC), ProRes

M3 Family discussion here:


M4 Family discussion here:

 

SpudLobby

Golden Member
May 18, 2022
1,027
695
106
Yes. Cost.

If you're going to do a different die on the exact same process, the only reason for it is cost since the non-Pro models sell for less than the Pro models. And don't forget the A18 'non'-P is going into the SE next spring, which will have a significantly lower price than the base iPhone 16.

The SE coming out next spring is probably the only reason they did separate dies for the A18. It makes sense when they're jumping to a new process in the Pro line (like when they go N2 for Pro, the non-Pro will still be N3), but otherwise it is a fairly small difference. At least by chopping down the cache they recover some area, which nowadays takes up a constant percentage of the die instead of shrinking the way cache used to.
I know this Doug, come on man.

My point is if the hit is so minimal on the cache, why even bother on the Pro? The only people who might care are obsessives like us and even then that’s probably only true if we can find valid discrepancies.

Granted the Pro has other distinguishing features too. It just strikes me as interesting, the cache specifically will have very little relevance to the common person’s understanding of the Pro moniker.

Why not just settle at 8-10MB and 12MB SLC for both? Who knows, may be an internal thing too with targets.

But yes, obviously the A18 is about cost (rough numbers sketched below). What's impressive though is it shows all these guys talking out of their a&& about Apple and die area are just morons. It's a very efficient setup in area and actual cache constraints.

Another thing on this note: Apple looks even better ever since the 6-core clusters for the P cores, E cores, or both in M3/M4. Excellent use of cache and great MT.


Anyway, I suspect it's an instrumental thing to justify the Pro series markups, and some internal teams decided it helps with power and heat dissipation under niche scenarios for creatives on iPhones, had targets or whatever, and got it.
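
For a sense of scale on that cost argument, a hedged back-of-envelope: the wafer price and both die areas below are assumed placeholders, not known figures.

```python
import math

# All numbers assumed for illustration: an N3-class wafer price and
# hypothetical die areas for a full-cache vs trimmed-cache A18.
WAFER_COST = 20_000        # assumed $/wafer
WAFER_DIAM_MM = 300

def gross_dies(area_mm2: float) -> float:
    """Classic gross-dies-per-wafer estimate (ignores yield)."""
    d = WAFER_DIAM_MM
    return math.pi * (d / 2) ** 2 / area_mm2 - math.pi * d / math.sqrt(2 * area_mm2)

for name, area in [("full cache (assumed 110 mm^2)", 110.0),
                   ("trimmed cache (assumed 95 mm^2)", 95.0)]:
    n = gross_dies(area)
    print(f"{name}: ~{n:.0f} dies/wafer, ~${WAFER_COST / n:.2f}/die")
```

A few dollars per die is noise on one flagship, but multiplied across tens of millions of SE-class units it covers the cost of a second tape-out.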
 

name99

Senior member
Sep 11, 2010
565
463
136
And yet it doesn't matter. No one but Apple has made a core competitive even with the flaming piles of x86 garbage. Weird for something so inherently better.

In some specific measurements it may have advantages but in the end SPECint still shows x86 is good enough despite all these weird limitations. And in the server market no one is even on the performance per watt level of Zen 5. Strange for something inherently better. The only one who will make ARM look great here is Apple.

Commodity ARM designers aren't doing any better than commodity x86 designers.
1) Nuvia/Qualcomm may be ex-Apple but they are not Apple -- which matters if Apple has reasons for not selling directly to the data center.

2) This is about the DATA CENTER. If you think SPEC is a good proxy for what matters in the data center, I don't know what to tell you.

3) Technically this is about Instruction Fetch dominated code, as opposed to loop-dominated code. The Data Center is a canary in a coal mine in this regard, but the same concepts hold for some current consumer code (most obviously browsers).
Browser benchmarks are a godawful mess right now, and almost impossible to parse for patterns, but it's *probably* the case that ever more code will be written to be interpreted, and via disjoint libraries calling each other, meaning large, difficult to predict I-footprints.
 
Reactions: Nothingness

name99

Senior member
Sep 11, 2010
565
463
136
What is this silly statement?
More cores means a better efficiency curve

(A) 8 GPU cores @ 1.2 GHz
(B) 6 GPU cores @ 1.6 GHz

A and B theoretically deliver the same performance, but A will consume less power due to the nature of V/F curves.

Agreed. Even ordinary flagship Android phones have sophisticated cooling solutions with vapour chambers and the like.
A slightly better analysis is that what counts as a QC GPU core may not be what counts as an Apple GPU core. Regardless of what frequency they operate at, an Apple GPU core gets the equivalent of more IPC because of, eg
- it has multiple instruction-steering front-ends feeding the back-end, so there are fewer cycles where the datapath sits unused, waiting for an instruction
- it has dynamic remapping of SRAM to registers, threadblocks, or whatever else is needed, depending on the current needs of the code.
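
To put rough numbers on the quoted V/F argument, a minimal sketch - the linear V(f) relation and every value here are assumed for illustration, not measured from any real GPU:

```python
# Dynamic power scales roughly as N_cores * f * V^2 (arbitrary units).
# Assumed V(f): 0.6 V at 1.2 GHz, rising linearly to 0.8 V at 1.6 GHz.
def power(n_cores: int, f_ghz: float) -> float:
    v = 0.6 + 0.5 * (f_ghz - 1.2)
    return n_cores * f_ghz * v * v

# Identical theoretical throughput: 8 * 1.2 == 6 * 1.6 core-GHz.
print(f"(A) 8 cores @ 1.2 GHz: {power(8, 1.2):.2f}")  # ~3.46
print(f"(B) 6 cores @ 1.6 GHz: {power(6, 1.6):.2f}")  # ~6.14
```

Under these assumed numbers, (B) burns nearly 80% more power for the same nominal throughput, which is the quoted poster's point.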
 

name99

Senior member
Sep 11, 2010
565
463
136
Faster yes, but still measuring the wrong thing. And so not especially interesting.
These people all look under the lamppost, not where the keys are. That is, they test usage patterns that are relevant to x86, not ones that are relevant to ARM. It's as dumb as testing ARM for whether it has a fast path for rA EOR rA.

What they should be testing is things like remote atomics, ie the preferred patterns on ARM.
 

Doug S

Diamond Member
Feb 8, 2020
3,005
5,167
136
I know this Doug, come on man.

My point is if the hit is so minimal on the cache, why even bother on the Pro? The only people who might care are obsessives like us and even then that’s probably only true if we can find valid discrepancies.

Granted the Pro has other distinguishing features too. It just strikes me as interesting, the cache specifically will have very little relevance to the common person’s understanding of the Pro moniker.

Why not just settle at 8-10MB and 12MB SLC for both? Who knows, may be an internal thing too with targets.

OK so you're turning the question around - why have bigger caches on the Pro? Power. The extra size may not contribute all that much extra performance since it is further down the long tail of cache hits, but it costs over an order of magnitude more power to fetch a line from LPDDR than it does to fetch a line that's already resident in the SLC. Gotta pay for that extra GPU core's power draw somehow.

That's why the SLC is smaller on Apple Silicon Macs than it is on iPhone: even the Macs running on battery have an order of magnitude more battery capacity than an iPhone, so the extra SLC isn't needed there.
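
A back-of-envelope version of that argument, with assumed ballpark energies (on-die SRAM hits are commonly cited around 1 pJ/bit, and LPDDR accesses an order of magnitude or more above that; neither is Apple's actual figure):

```python
# Energy for a billion 64-byte line fetches at two assumed SLC hit rates.
E_SLC_PJ_PER_BIT  = 1.0    # assumed on-die SLC hit energy
E_DRAM_PJ_PER_BIT = 15.0   # assumed LPDDR access energy
LINE_BITS = 64 * 8

def joules(accesses: float, hit_rate: float) -> float:
    hits, misses = accesses * hit_rate, accesses * (1 - hit_rate)
    pj = LINE_BITS * (hits * E_SLC_PJ_PER_BIT + misses * E_DRAM_PJ_PER_BIT)
    return pj * 1e-12

for hit_rate in (0.60, 0.70):  # a bigger SLC buys a few points of hit rate
    print(f"hit rate {hit_rate:.0%}: {joules(1e9, hit_rate):.2f} J")
```

In this toy model, ten extra points of hit rate cut memory-access energy by roughly 20%, which is real money in a phone's power budget.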
 
Reactions: Nothingness

FlameTail

Diamond Member
Dec 15, 2021
4,384
2,754
106
Regardless of what frequency they operate at, an Apple GPU core gets the equivalent of more IPC because of
Yes, I have noticed this too. Apple's GPU gets more performance-per-ALU.
                     FP32 ALUs   Clock Speed   3DMark Steel Nomad Light
Apple A18 Pro           768        1.45 GHz        2100
Snapdragon 8 Elite     1536        1.1 GHz         2600
Dimensity 9400         1536        1.63 GHz        2700

3DMark SNL is purely a GPU benchmark, though. In actual games, the A18 Pro often trades blows with the 8 Elite and D9400 in terms of absolute performance (FPS).
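
To quantify that performance-per-ALU point from the table, a quick sketch, assuming each FP32 ALU retires one FMA (2 FLOPs) per clock - an assumption, since actual issue rates differ per architecture:

```python
# Steel Nomad Light score per theoretical FP32 TFLOP, from the table above.
chips = {
    "Apple A18 Pro":      (768,  1.45, 2100),  # (FP32 ALUs, GHz, SNL score)
    "Snapdragon 8 Elite": (1536, 1.10, 2600),
    "Dimensity 9400":     (1536, 1.63, 2700),
}

for name, (alus, ghz, score) in chips.items():
    tflops = alus * 2 * ghz / 1000  # assumed 2 FLOPs per ALU per clock
    print(f"{name}: {tflops:.2f} TFLOPs -> {score / tflops:.0f} pts/TFLOP")
```

By that rough measure the A18 Pro extracts about 940 points per theoretical TFLOP, versus roughly 770 for the 8 Elite and 540 for the D9400.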
- it has multiple instruction-steering front-ends feeding the back-end, so there are fewer cycles where the datapath sits unused, waiting for an instruction
- it has dynamic remapping of SRAM to registers, threadblocks, or whatever else is needed, depending on the current needs of the code.
Why don't Qualcomm and ARM implement these things in their GPU architectures?
 

name99

Senior member
Sep 11, 2010
565
463
136
Apple will move to a chiplet architecture according to TSMC’s roadmap and in the future go 3D on their phones.
Chiplets are not some silver bullet. They have specific uses:
- either optionality (the way AMD uses them for compute tiles) or
- putting non-scaling features (IO, analog, SRAM) on a cheaper process (eg V-Cache)

Both of these are nuanced (eg N2, while not much denser than N5 for SRAM, will allow for FASTER SRAM - which you may care about...). And chiplet-to-chiplet transfers cost power and latency.

Apple already has a chiplet design in the Ultras, essentially matching point 1 (optionality).

Then you get Intel which, in their usual fashion of design by marketing, decided that Tiles was a cool new buzzword and chose to build everything on them. To what benefit? Unclear to me or anyone else. I don't see anything in Lunar Lake or Meteor Lake that couldn't have been done better, cheaper, and at slightly lower latency and power on a monolithic design.
 

name99

Senior member
Sep 11, 2010
565
463
136
Wait, why would Wifi 8 need 30 TOPs to achieve maximum performance?
They certainly need a LOT of FP compute to do all the matrix multiplies for the MIMO and the FEC.

Calling this TOPs may be a high-speed tweeting mistake. It may be a misspeak/mispronunciation by the presenter. Or it may be a deliberate choice, because the compute engines on these modems do not need to match IEEE FP16 or FP32 or anything else. They may well be using some custom 21-bit FP format, or three different FP formats at different stages of the compute pipeline. And saying FLOPs then gets people asking questions about what type of FLOPs? IEEE? etc.
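
For a sense of why the FP load is large, a rough estimate of just the linear equalization step - the stream count and per-tone cost model are assumptions, not anything from a Wifi 8 spec:

```python
# Rough FLOPs/s for per-tone MIMO equalization. Illustrative only.
STREAMS = 8          # assumed spatial streams
BW_MHZ = 320
SC_KHZ = 78.125      # OFDM subcarrier spacing used by 802.11ax/be
SYM_US = 12.8 + 0.8  # symbol duration + short guard interval, microseconds

tones = BW_MHZ * 1000 / SC_KHZ   # ~4096 subcarriers
symbols_per_s = 1e6 / SYM_US     # ~73.5k OFDM symbols/s

# One NxN complex matrix-vector product per tone per symbol:
# N^2 complex MACs ~ 8 * N^2 real FLOPs.
flops = tones * symbols_per_s * 8 * STREAMS ** 2
print(f"~{flops / 1e9:.0f} GFLOPs/s for equalization alone")  # ~150
```

And that is before channel estimation, beamforming weight computation, and FEC decoding, all running continuously at link rate.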
 

name99

Senior member
Sep 11, 2010
565
463
136
Yes, I have noticed this too. Apple's GPU gets more performance-per-ALU.
                     FP32 ALUs   Clock Speed   3DMark Steel Nomad Light
Apple A18 Pro           768        1.45 GHz        2100
Snapdragon 8 Elite     1536        1.1 GHz         2600
Dimensity 9400         1536        1.63 GHz        2700

3DMark SNL is purely a GPU benchmark, though. In actual games, the A18 Pro often trades blows with the 8 Elite and D9400 in terms of absolute performance (FPS).

Why don't Qualcomm and ARM implement these things in their GPU architectures?
(a) They are not trivial to even think of. I saw NO-ONE in the academic literature suggesting these before Apple implemented them.

(b) Apple has patented them. And justifiably so. I see very few BS Apple patents, and these are certainly not amongst those.

(c) Even if you understand the idea, these are not trivial to implement.

(d) These improve performance on REAL code. But they generally don't do much for fake code (ie microbenchmarks), especially if the micro-benchmark doesn't even realize what it should be testing for. If marketing insists that it's important you look good on microbenchmarks...
 
Reactions: FlameTail

Eug

Lifer
Mar 11, 2000
23,953
1,567
126
This is not really applicable to the above WiFi 8 CPU performance discussion since this is old hardware and wired Ethernet, but I'll mention it anyway:

I was disappointed to learn that I can only get about 800 Mbps out of my D-Link USB-C 2.5 Gbps Ethernet dongle when connected to my 12-inch MacBook. That was using Safari to speed test. It was even worse with Chrome at about 500 Mbps. System Report confirmed the Mac was seeing the dongle as a 2.5 GbE device, but then I looked at Activity Monitor. It turns out Safari was using 150% CPU for networking alone, so I'm guessing this Mac's Core m3 is at least in part the bottleneck.

That same dongle works fine to max out my 1.9 Gbps internet on my M4 iPad Pro though.
 
Mar 23, 2007
29
16
81
This is not really applicable to the above WiFi 8 CPU performance discussion since this is old hardware and wired Ethernet, but I'll mention it anyway:

I was disappointed to learn that I can only get about 800 Mbps out of my D-Link USB-C 2.5 Gbps Ethernet dongle when connected to my 12-inch MacBook. That was using Safari to speed test. It was even worse with Chrome at about 500 Mbps. System Report confirmed the Mac was seeing the dongle as a 2.5 GbE device, but then I looked at Activity Monitor. It turns out Safari was using 150% CPU for networking alone, so I'm guessing this Mac's Core m3 is at least in part the bottleneck.

That same dongle works fine to max out my 1.9 Gbps internet on my M4 iPad Pro though.
I wonder how much of that is a macOS to iOS difference and how much is a Core m3 to M4 difference.
 

LightningZ71

Platinum Member
Mar 10, 2017
2,018
2,455
136
If memory serves, that's a Skylake Core m3 (m3-6Y30): a dual-core, four-thread, 1.1 GHz processor with LPDDR3 RAM. Just handling the overhead for the USB controller on top of the TCP/IP session processing for a gigabit-class Ethernet connection is going to keep at least one of those cores maxed out. I'm not shocked in the least that it struggled with a 2.5 Gbps USB Ethernet adapter!
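
A quick sanity check on that, with an assumed per-packet CPU cost - the 5,000-cycle figure is a guess covering syscall, TCP/IP stack, and USB transaction overhead, and real costs vary a lot:

```python
# Packet rate vs CPU budget for 2.5 GbE on a slow dual-core.
LINK_GBPS = 2.5
MTU_BYTES = 1500
CYCLES_PER_PACKET = 5_000  # assumed: syscalls, TCP/IP stack, USB overhead
CPU_GHZ = 1.1              # m3-class base clock discussed here

pkts_per_s = LINK_GBPS * 1e9 / (MTU_BYTES * 8)  # ~208k packets/s at line rate
cycles = pkts_per_s * CYCLES_PER_PACKET
print(f"{pkts_per_s / 1e3:.0f}k pkts/s needs ~{cycles / 1e9:.2f} GHz, "
      f"i.e. {cycles / (CPU_GHZ * 1e9):.0%} of one {CPU_GHZ} GHz core")
```

At full line rate the base clock is essentially consumed before the browser does any work of its own, which fits the ~800 Mbps ceiling reported above.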
 

Eug

Lifer
Mar 11, 2000
23,953
1,567
126
If memory serves, that's a skylake Core m3 (M3-6Y30) dual core, four thread 1.1Ghz processor with LPDDR3 RAM. Just handling the overhead for the USB controller on top of the TCP/IP session processing for a gigabit class ethernet connection is going to keep at least one of those cores maxed out. I'm not shocked in the least that it struggled with a 2.5Gbps USB ethernet adapter!
Very close. You described the 2016 model. Mine is the 2017 model with Kaby Lake m3-7Y32 dual-core, four-thread 1.2 GHz.

BTW, I waited specifically for this chip in order to get hardware h.265 HEVC 10-bit HDR acceleration.
 

mavere

Member
Mar 2, 2005
190
4
81
My back-of-envelope calculations suggest that by the time the rumored 2026 MacBook Pro redesign comes out, anyone upgrading from an M1-era machine should see a performance jump close to the initial Intel to Apple Silicon jump. Hopefully Apple will have also improved the SSD controller's random performance by then, to keep the disk I/O subsystem proportional.
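
The compounding math behind a guess like that, with an assumed average per-generation uplift (the 18% is a placeholder, not a measured figure):

```python
# Cumulative speedup from compounding per-generation gains.
PER_GEN_GAIN = 0.18  # assumed average generational CPU uplift
GENERATIONS = 5      # M1 -> a hypothetical 2026-era chip

print(f"~{(1 + PER_GEN_GAIN) ** GENERATIONS:.1f}x over M1")  # ~2.3x
```

Something around 2x or better is indeed in the same ballpark as the jump many machines saw going from Intel to M1.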
 

Doug S

Diamond Member
Feb 8, 2020
3,005
5,167
136
Wait, why would Wifi 8 need 30 TOPs to achieve maximum performance?

More to the point, why do we care about Wifi 8?

Seriously, who is asking for even more wifi speed than Wifi 7 provides? I'm sure not - my router isn't even Wifi 6 and I see no reason to upgrade it, as it is already faster than I need. Features like MU-MIMO are important for enterprise networks or large facilities like conference centers, but we don't need higher QAM to squeeze more bits per Hz out, which is the only reason it would need "30 TOPS" (saying it like that is ridiculous, but more complex encoding does need more computational resources).

More features to reduce latency or improve behavior during congestion, great we like that. Giving us 10 Gbit (theoretical) wireless links, nope don't need that. The audience for that is minuscule. Let them pay for the extra computation for that in some wifi extension, not make it part of the default and force everyone to pay for it.

This is just something Big Networking is going to try to push, just like they are going to try to push 6G in a few years, which is another thing we don't need, at least not from a theoretical max speed perspective.
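
On the QAM point specifically: bits per symbol grow only logarithmically with constellation size, which is why each step costs so much compute and SNR for so little gain. A one-liner to illustrate (the 16384-QAM step is hypothetical):

```python
import math

for qam in (1024, 4096, 16384):  # Wifi 6, Wifi 7, hypothetical next step
    print(f"{qam}-QAM: {math.log2(qam):.0f} bits/symbol")
# 10 -> 12 bits/symbol is only a 20% gain, yet it demands a much cleaner
# signal and considerably more DSP work.
```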
 