Discussion Apple Silicon SoC thread


Eug

Lifer
Mar 11, 2000
23,926
1,528
126
M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 teraflops
82 gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options: 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core count). Basically, Apple is taking the same approach with these chips as it does with the iPhones and iPads: just one SKU (excluding the X variants), shared across all iDevices (aside from maybe slight clock speed differences occasionally).

EDIT:



M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:


M1 Ultra discussion here:


M2 discussion here:


Second Generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K H.264, H.265 (HEVC), and ProRes

M3 Family discussion here:


M4 Family discussion here:

 
Last edited:
Jul 27, 2020
20,921
14,496
146
As for old machines, they run GB on them because they can. Like why not, esp. as a comparison when they get new hardware? It’s free and easy and takes just a couple of minutes.
It's frustrating searching for results. Yesterday the oldest result was from 2022, there were 300 pages, and most of them were these old MacBooks. The GB browser badly needs filtering options.
 

Eug

Lifer
Mar 11, 2000
23,926
1,528
126
It's frustrating searching for results. Yesterday the oldest result was from 2022, there were 300 pages, and most of them were these old MacBooks. The GB browser badly needs filtering options.
That is a valid complaint. While the database is freely searchable, it needs finer-grained filtering options. Actually, it was easier to search in GB 4 than it is now in GB 6.
 

poke01

Platinum Member
Mar 8, 2022
2,584
3,410
106
I don't know if this was previously posted: https://scalable.uni-jena.de/opt/sme/micro.html

So it is confirmed that M4 has SME with streaming SVE. Vector length is 512-bit. A single core can reach 31 FP32 GFLOPS with SVE fmla (that's mul+add), 111 GFLOPS with NEON, and >2000 GFLOPS with SME fmopa.
Funny, Intel gets rid of 512-bit vector extensions due to hybrid architecture in client CPUs, but Apple's like nah, here's 512-bit vectors in a base M chip with a hybrid architecture as well.
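As a rough sanity check on that fmla figure (my own back-of-the-envelope sketch; the ~4.4 GHz clock and the per-instruction accounting are my assumptions, not from the linked page):

```c
#include <stdio.h>

int main(void) {
    // A 512-bit SVE fmla touches 16 FP32 lanes and does a multiply + add
    // per lane, i.e. 32 FLOPs per instruction.
    const double flops_per_fmla = (512.0 / 32.0) * 2.0;
    const double measured_gflops = 31.0; // figure from the linked page
    const double assumed_ghz = 4.4;      // assumed M4 P-core clock
    double fmla_per_cycle = measured_gflops / flops_per_fmla / assumed_ghz;
    // ~0.22 fmla/cycle, i.e. roughly one fmla every 4-5 cycles: a rate
    // that looks latency-bound (a serialized dependency chain) rather
    // than throughput-bound.
    printf("implied fmla per cycle: %.2f\n", fmla_per_cycle);
    return 0;
}
```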
 

roger_k

Member
Sep 23, 2021
102
219
86
I don't know if this was previously posted: https://scalable.uni-jena.de/opt/sme/micro.html

So it is confirmed that M4 has SME with streaming SVE. Vector length is 512-bit. A single core can reach 31 FP32 GFLOPS with SVE fmla (that's mul+add), 111 GFLOPS with NEON, and >2000 GFLOPS with SME fmopa.

The SSVE result is surprisingly low. I wonder whether there is a problem with their code or whether the coprocessor is indeed not useful for vector operations.
 

roger_k

Member
Sep 23, 2021
102
219
86
Funny, Intel gets rid of 512-bit vector extensions due to hybrid architecture in client CPUs, but Apple's like nah, here's 512-bit vectors in a base M chip with a hybrid architecture as well.

Well, Apple has shipped a 512-bit outer-product engine in iPhones since 2018, if I remember correctly?
 
Last edited:
Jul 27, 2020
20,921
14,496
146
Funny, Intel gets rid of 512-bit vector extensions due to hybrid architecture in client CPUs, but Apple's like nah, here's 512-bit vectors in a base M chip with a hybrid architecture as well.
Intel dropped the ball at the wrong moment, on their feet!

Intel just didn't want to bother finding a solution, because they knew that AVX-512 and E-cores working together would drag performance down due to thermal throttling. It's not impossible to design software that queries which cores are available and then distributes its AVX-512-optimized threads to P-cores and non-AVX-512 threads to E-cores (see the sketch below). But that could leave too little power headroom for the P-cores to properly accelerate the AVX-512 workload.
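Something like this hypothetical Linux-only sketch could do the classification step (the CPUID leaf 0x1A core-type check is real Intel hybrid documentation; the rest is purely illustrative of how a scheduler might keep AVX-512 threads on the P-core set):

```c
#define _GNU_SOURCE
#include <cpuid.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    for (long cpu = 0; cpu < ncpu; cpu++) {
        // Pin the calling thread to one CPU so CPUID reports that core.
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0)
            continue; // CPU offline or not permitted
        unsigned eax = 0, ebx, ecx, edx;
        if (!__get_cpuid_count(0x1A, 0, &eax, &ebx, &ecx, &edx))
            continue; // leaf 0x1A unsupported (non-hybrid part)
        unsigned core_type = (eax >> 24) & 0xFF; // 0x40 = Core, 0x20 = Atom
        printf("cpu %ld: %s\n", cpu,
               core_type == 0x40 ? "P-core" :
               core_type == 0x20 ? "E-core" : "unknown");
    }
    return 0;
}
```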
 
Reactions: Orfosaurio

roger_k

Member
Sep 23, 2021
102
219
86
Intel dropped the ball at the wrong moment, on their feet!

Intel just didn't want to bother finding a solution, because they knew that AVX-512 and E-cores working together would drag performance down due to thermal throttling. It's not impossible to design software that queries which cores are available and then distributes its AVX-512-optimized threads to P-cores and non-AVX-512 threads to E-cores. But that could leave too little power headroom for the P-cores to properly accelerate the AVX-512 workload.

IMO, their problem was that they chose an approach that does not scale. Apple's strategy of splitting the wide-vector functionality into a separate hardware unit that feeds from L2 instead of L1 makes much more sense to me.
 
Jul 27, 2020
20,921
14,496
146
They benefit from not needing a wide data path to L1.
Interesting. If you don't mind, would you care to compare and contrast Intel's and Apple's approaches? And how is AMD's different, since they are able to do AVX-512 more energy-efficiently? Is it only because they don't have dedicated 512-bit units like Intel?
 

Mopetar

Diamond Member
Jan 31, 2011
8,149
6,861
136
Funny, Intel gets rid of 512-bit vector extensions due to hybrid architecture in client CPUs, but Apple's like nah, here's 512-bit vectors in a base M chip with a hybrid architecture as well.

I think Intel did it because their hardware was a physical 512-bit implementation, which was costly both in transistors and in the power to run it. AMD used a 256-bit physical hardware unit to support the 512-bit operations.

There's nothing that requires Apple to have a particularly large hardware unit to execute the vector instructions. If they just had a regular 64-bit wide unit, they'd just need to run it 8 times to process a 512-bit vector operation. Or they could have four 64-bit wide execution units that the instruction gets split across over two cycles.

The only reason to have a full 512-bit hardware unit is that you want to support operations on 512-bit operands that are actually that large. That's definitely something that you'd want in certain scientific workloads, but consumer software is just using it to process 16x 32-bit floats with a single instruction.
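To illustrate the decomposition idea (a sketch, not how any particular chip actually sequences it): a 512-bit add can be expressed as two 256-bit AVX halves, which is essentially what retiring a 512-bit instruction through 256-bit hardware over two passes looks like. Compile with -mavx:

```c
#include <immintrin.h>
#include <stdio.h>

// "Double pumping": one logical 512-bit add done as two 256-bit AVX adds,
// the way narrower hardware can service a wider vector instruction.
static void add512(const float *a, const float *b, float *out) {
    __m256 lo = _mm256_add_ps(_mm256_loadu_ps(a),     _mm256_loadu_ps(b));
    __m256 hi = _mm256_add_ps(_mm256_loadu_ps(a + 8), _mm256_loadu_ps(b + 8));
    _mm256_storeu_ps(out,     lo);
    _mm256_storeu_ps(out + 8, hi);
}

int main(void) {
    float a[16], b[16], out[16];
    for (int i = 0; i < 16; i++) { a[i] = (float)i; b[i] = 1.0f; }
    add512(a, b, out);
    printf("out[15] = %.1f\n", out[15]); // 16.0
    return 0;
}
```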
 

roger_k

Member
Sep 23, 2021
102
219
86
There's nothing that requires Apple to have a particularly large hardware unit to execute the vector instructions. If they just had a regular 64-bit wide unit, they'd just need to run it 8 times to process a 512-bit vector operation. Or they could have four 64-bit wide execution units that the instruction gets split across over two cycles.

The only reason to have a full 512-bit hardware unit is that you want to support operations on 512-bit operands that are actually that large. That's definitely something that you'd want in certain scientific workloads, but consumer software is just using it to process 16x 32-bit floats with a single instruction.

They do want good performance for wide vector and matrix workloads, so using a wide outer-product engine makes sense. Apparently it made enough sense for them to include it even on an iPhone.

What makes Apple's solution a bit more special, though, is that it is a coprocessor and not part of the CPU core.
 

Mopetar

Diamond Member
Jan 31, 2011
8,149
6,861
136
The vector width and the underlying hardware that backs it don't really matter. A 512-bit vector unit in hardware can crunch a 512-bit vector in one cycle (or whatever it takes), but you can use much smaller hardware units depending on what instructions you want to support. Vector instructions just let the hardware know that there are no data dependencies within the vector, so it could do the whole thing at once or process the individual pieces of data in any order.

I haven't looked at Apple's ISA to see what they support, but hasn't the overall trend been in the opposite direction, where people don't care as much about operations on large data values but instead want more granularity so that more operations in total can be done? If you just wanted to support INT8 operations, you could have a single 8-bit execution unit chew through the entire 512-bit vector. If you only want to do a bunch of INT4 operations, having hardware that only allows as low as 8-bit granularity means that half of that hardware is effectively wasted (see the toy example below). It doesn't matter if the matrix itself is large if you just want all of the values to be 4-bit integers.
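A toy example of the packing issue (purely illustrative): two signed 4-bit values fit in one byte, so hardware whose narrowest lane is 8 bits has to unpack them and process one value per lane, leaving half of each lane idle; native 4-bit granularity would double the parallel operations.

```c
#include <stdint.h>
#include <stdio.h>

// Extract the low/high signed nibble from a packed byte via sign extension.
static int8_t lo_nibble(uint8_t b) { return (int8_t)(b << 4) >> 4; }
static int8_t hi_nibble(uint8_t b) { return (int8_t)b >> 4; }

int main(void) {
    // Pack hi = 3, lo = -2 into one byte.
    uint8_t packed = (uint8_t)((3 << 4) | (0xF & -2));
    // An 8-bit lane sees one byte; the two INT4 values inside it must be
    // unpacked and multiplied one at a time.
    printf("lo=%d hi=%d\n", lo_nibble(packed), hi_nibble(packed)); // lo=-2 hi=3
    return 0;
}
```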

There's also no reason it has to be part of the core, though. I think the only reason Intel and AMD did that is that they first added vector instructions when they only had single-core CPUs, so it made sense to carry that decision forward. The x86 code those CPUs execute is probably similarly structured around the same assumptions.

Apple at least knows to what extent their own code makes use of these vector instructions and they probably realized that it wasn't enough for each core to duplicate the same hardware resources. It's no different than having the NPU/GPU separate from the CPU cores. It probably also makes building a larger physical vector unit a lot easier since Intel ran into issues where any core that was actively using it had to drop frequency to stay within TDP limits, which is going to create a performance penalty for mixed workloads.
 
Reactions: igor_kavinski

Doug S

Platinum Member
Feb 8, 2020
2,890
4,914
136
The SSVE result is surprisingly low. I wonder whether there is a problem with their code or whether the coprocessor is indeed not useful for vector operations.

Doesn't SME require SSVE? I think Apple doesn't care about SSVE; they only implemented it because they had to, and they don't intend for anyone to use it. With NEON being so much faster, that's what they want you to use.

They implemented their proprietary AMX to get what they wanted when they wanted it, and once they saw SME as a suitable replacement that could be "fully supported" from the ISA, unlike AMX, they made the switch. Probably very little changed in the AMX unit; aside from a few additions, I understand it to be almost identical to AMX. Heck, for all we know, Apple delivered AMX to ARM when they completed their internal spec and said "we'd like this," and ARM got feedback from others, added a few extras, and in the end pretty much standardized what had been Apple's proprietary instructions. SSVE coming along for the ride must not have been something they asked for. Maybe in the future they'll expand the "AMX" unit to handle it better; maybe they'll leave it as a red-headed stepchild and expect people to continue using NEON.
 

roger_k

Member
Sep 23, 2021
102
219
86
Doesn't SME require SSVE? I think Apple doesn't care about SSVE; they only implemented it because they had to, and they don't intend for anyone to use it. With NEON being so much faster, that's what they want you to use.

They implemented their proprietary AMX to get what they wanted when they wanted it, and once they saw SME as a suitable replacement that could be "fully supported" from the ISA, unlike AMX, they made the switch. Probably very little changed in the AMX unit; aside from a few additions, I understand it to be almost identical to AMX. Heck, for all we know, Apple delivered AMX to ARM when they completed their internal spec and said "we'd like this," and ARM got feedback from others, added a few extras, and in the end pretty much standardized what had been Apple's proprietary instructions. SSVE coming along for the ride must not have been something they asked for. Maybe in the future they'll expand the "AMX" unit to handle it better; maybe they'll leave it as a red-headed stepchild and expect people to continue using NEON.

The AMX unit also accelerates vector algebra routines, and it was previously much faster than these new results. Notably, their result is consistent with the use of a single accumulator. I looked at the code, and they use the Zx registers; maybe one gets better performance by using the ZA array as the accumulator (this would fit the AMX theme). Here you have results for the M1 Max, 180 GFLOPS for a single thread in vector mode: https://github.com/corsix/amx/blob/main/fma.md
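For what it's worth, the single-accumulator bottleneck is easy to show even with plain NEON (a hedged sketch; dot_multi_acc is my own illustrative function, not from their code): each FMA into a single accumulator must wait for the previous one, so throughput degrades to one FMA per FMA latency, while unrolling into independent accumulators lets the pipeline overlap them.

```c
// AArch64 NEON sketch; n is assumed to be a multiple of 16 for brevity.
#include <arm_neon.h>
#include <stddef.h>

float dot_multi_acc(const float *a, const float *b, size_t n) {
    // Four independent accumulators break the serial dependency chain.
    float32x4_t acc0 = vdupq_n_f32(0), acc1 = vdupq_n_f32(0);
    float32x4_t acc2 = vdupq_n_f32(0), acc3 = vdupq_n_f32(0);
    for (size_t i = 0; i < n; i += 16) {
        acc0 = vfmaq_f32(acc0, vld1q_f32(a + i),      vld1q_f32(b + i));
        acc1 = vfmaq_f32(acc1, vld1q_f32(a + i + 4),  vld1q_f32(b + i + 4));
        acc2 = vfmaq_f32(acc2, vld1q_f32(a + i + 8),  vld1q_f32(b + i + 8));
        acc3 = vfmaq_f32(acc3, vld1q_f32(a + i + 12), vld1q_f32(b + i + 12));
    }
    // Horizontal reduction of the four partial sums.
    return vaddvq_f32(vaddq_f32(vaddq_f32(acc0, acc1), vaddq_f32(acc2, acc3)));
}
```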

At any rate, all this tells us just how leaky these abstractions can be. I was very enthusiastic about the idea of scalable vectors; now I'm increasingly skeptical about how feasible it is to write hardware-agnostic high-performance algorithms. Maybe RVV has the right idea after all with its model.
 
Last edited:

carancho

Member
Feb 24, 2013
54
44
91
This is always overlooked, probably because the controversies created by measurement error fuel the reviews-and-benchmarks content industry. It is amazing, however, that no one produces content based on scraped GB scores. Imagine all that you could wrangle out of that database.
 

carancho

Member
Feb 24, 2013
54
44
91
Regarding the discussion about IPC improvements... I am becoming increasingly convinced that most of the discourse is useless because of a deeply flawed methodology. In different Geekerwan videos, the results reported for the A17 Pro have a relative error of almost 5%. Combine this with the imprecise frequency estimation and you end up with a huge relative error in the IPC estimates.

We need a clearer methodology, results from multiple devices (to circumvent device bias), and, most importantly, we need to start looking at the variance instead of point estimates. I tried to do this earlier for some of the GB6 data, and I hope I could illustrate how much more useful this approach is.
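To illustrate how the errors compound (with hypothetical numbers, not Geekerwan's exact figures): IPC is estimated as score divided by frequency, so independent relative errors combine roughly in quadrature:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    // Hypothetical: 5% relative error on the benchmark score, 3% on the
    // frequency estimate. For IPC = score / frequency with independent
    // errors, the relative errors add in quadrature.
    double score_err = 0.05, freq_err = 0.03;
    double ipc_err = sqrt(score_err * score_err + freq_err * freq_err);
    printf("IPC relative error ~ %.1f%%\n", ipc_err * 100.0); // ~5.8%
    return 0;
}
```

An error of that size easily swamps the few-percent generational IPC gains being argued about.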

Bottom line: the data is crap, the methodology is crap, the relative error is crap, meaning the results are crap.

This is always overlooked, probably because the controversies created by measurement error fuel the reviews-and-benchmarks content industry. It is amazing, however, that no one produces content based on scraped GB scores. Imagine all that you could wrangle out of that database.
I was replying to roger's message.
 

Nothingness

Diamond Member
Jul 3, 2013
3,137
2,153
136
This is always overlooked, probably because the controversies created by measurement error fuel the reviews-and-benchmarks content industry. It is amazing, however, that no one produces content based on scraped GB scores. Imagine all that you could wrangle out of that database.
The GB database has too many overclocked systems and plainly wrong or fake results. Of course there are methods to detect and exclude outliers, but it's not trivial (a quick sketch of one such method below).
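A minimal sketch of one such filter (illustrative numbers; a real pass over the database would also need to key on chip, OS, and build): flag scores more than ~3 robust standard deviations from the median, using the median absolute deviation (MAD) scaled by 1.4826 to approximate sigma for normal data.

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *p, const void *q) {
    double a = *(const double *)p, b = *(const double *)q;
    return (a > b) - (a < b);
}

static double median(double *v, size_t n) {
    qsort(v, n, sizeof *v, cmp);
    return n % 2 ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

int main(void) {
    // Made-up scores: a plausible cluster plus one inflated and one broken run.
    double scores[] = {2400, 2450, 2430, 2410, 3900, 2440, 1200, 2420};
    size_t n = sizeof scores / sizeof *scores;

    double tmp[8], dev[8];
    for (size_t i = 0; i < n; i++) tmp[i] = scores[i];
    double med = median(tmp, n);
    for (size_t i = 0; i < n; i++) dev[i] = fabs(scores[i] - med);
    double mad = median(dev, n);

    for (size_t i = 0; i < n; i++) {
        double z = (mad > 0) ? fabs(scores[i] - med) / (1.4826 * mad) : 0;
        if (z > 3.0) printf("outlier: %.0f (robust z = %.1f)\n", scores[i], z);
    }
    return 0;
}
```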
 