Discussion Apple Silicon SoC thread


Eug

Lifer
Mar 11, 2000
23,926
1,528
126
M1
5 nm
Unified memory architecture - LPDDR4X
16 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 12 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache
(Apple claims the 4 high-efficiency cores alone perform like a dual-core Intel MacBook Air)

8-core iGPU (but there is a 7-core variant, likely with one inactive core)
128 execution units
Up to 24576 concurrent threads
2.6 Teraflops
82 Gigatexels/s
41 gigapixels/s

16-core neural engine
Secure Enclave
USB 4

Products:
$999 ($899 edu) 13" MacBook Air (fanless) - 18 hour video playback battery life
$699 Mac mini (with fan)
$1299 ($1199 edu) 13" MacBook Pro (with fan) - 20 hour video playback battery life

Memory options 8 GB and 16 GB. No 32 GB option (unless you go Intel).

It should be noted that the M1 chip in these three Macs is the same (aside from GPU core count). Basically, Apple is taking the same approach with these chips as they do with the iPhones and iPads: just one SKU (excluding the X variants), which is the same across all iDevices (aside from maybe slight clock speed differences occasionally).

EDIT:



M1 Pro 8-core CPU (6+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 14-core GPU
M1 Pro 10-core CPU (8+2), 16-core GPU
M1 Max 10-core CPU (8+2), 24-core GPU
M1 Max 10-core CPU (8+2), 32-core GPU

M1 Pro and M1 Max discussion here:


M1 Ultra discussion here:


M2 discussion here:


M2
Second-generation 5 nm
Unified memory architecture - LPDDR5, up to 24 GB and 100 GB/s
20 billion transistors

8-core CPU

4 high-performance cores
192 KB instruction cache
128 KB data cache
Shared 16 MB L2 cache

4 high-efficiency cores
128 KB instruction cache
64 KB data cache
Shared 4 MB L2 cache

10-core iGPU (but there is an 8-core variant)
3.6 Teraflops

16-core neural engine
Secure Enclave
USB 4

Hardware acceleration for 8K h.264, h.265 (HEVC), and ProRes

M3 Family discussion here:


M4 Family discussion here:

 
Last edited:

SpudLobby

Senior member
May 18, 2022
991
684
106
Shipping a ~10-15% faster chip 6 months after shipping another ~15% faster chip is pretty good in any case. Who knows why that happened; it couldn't have been the plan.
I know some people are disappointed it's mainly in clock rate instead of IPC but what's the difference when they claim to be keeping power draw down too?
Apple has a great foundation to work from in IPC and energy efficiency, so clock gains with new nodes or tweaks while keeping power reasonable still work. But in the long run they’ll have to offer IPC upgrades (if they’re even still possible), because others in either the performance class (AMD/Intel) or the efficiency class (Arm and Qualcomm) will catch up to them on IPC if not surpass them, and then focus on their own targets too.

For example, in Apple’s case the A14 is still drawing 30-40% less power on ST than the 8 Gen 3 and its X4 — even though they’re about the same in IPC, and the X4 is clocked slightly higher but on N4P rather than N5 — so an architecture lead is evident here.

But they can work on that, too. And later this year Qualcomm will use Nuvia cores. So what happens when the competitors narrow efficiency gaps but also reach parity or take the lead on IPC?

So they’re still doing fine in the sense that the M4 will be a very competitive chip; if power is reasonable, which it probably will be, that’s great. It will still blow Lunar Lake and Zen 5 to shreds on ST perf/W while being ahead of the former at peak and closer to the latter (pending SME benchmark use).

But people are fretting because they see the direction Apple is headed. At best they will just be generic, and at worst some competitors are going to surpass them on performance in mobile profiles, efficiency, or both. The gap is smaller than ever.
 
Reactions: Vattila and Tlh97

Eug

Lifer
Mar 11, 2000
23,926
1,528
126
They are. It’s just that SME was added, plus support for SME in GB6.

I had the same reaction, but when you look at the messier integer subtests, or at GB5 which has no SME, the perf/GHz gains are just 8-9% since M1 — i.e., roughly 0% since A17/M3.

It’s a tweak for N3E design rules with higher clocks and SME instead of AMX.
Yeah, I was initially fooled by Geekbench 6.3. Then I saw the 3% IPC uplift claim sans SME in Geekbench 6.3, and then confirmed that with the Geekbench 5.5.1 comparison.
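
If anyone wants to sanity-check that kind of claim themselves, the arithmetic is just single-core score divided by clock, compared across chips. A minimal sketch in C, with placeholder scores and clocks rather than real measurements, so substitute actual GB5/GB6 results:

#include <stdio.h>

/* Crude "IPC proxy": score per GHz. The numbers below are placeholders,
   not measurements -- plug in real single-core scores and clocks. */
static double perf_per_ghz(double score, double ghz) { return score / ghz; }

int main(void) {
    double m1_score = 1000.0, m1_ghz = 3.2;   /* placeholder M1 ST score at its clock */
    double m4_score = 1500.0, m4_ghz = 4.4;   /* placeholder M4 ST score at its clock */

    double gain = perf_per_ghz(m4_score, m4_ghz) / perf_per_ghz(m1_score, m1_ghz) - 1.0;
    printf("perf/GHz gain: %.1f%%\n", gain * 100.0);
    return 0;
}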

Anyhow, it's just academic for me. I have no real use for more CPU speed vs even M2 on an iPad Pro. OTOH, hardware AV1 decode may make a meaningful usage and/or battery life difference for YouTube on this device.

It's good to see Apple is able to increase clock speed moderately though.
 

Nothingness

Diamond Member
Jul 3, 2013
3,137
2,153
136
The big L from this seems to be
1. People who always said Apple can’t tweak for higher clocks due to their width and size — iso-node it’s true they use denser cells, and they ride out node gains here instead of using less area-efficient stuff, but in terms of timing they’ve been able to make tweaks and make the cutoff.

I’ve heard that cope line for years now, and we went from Apple in the low-1 GHz range to 4.4 GHz. Denser nodes ease timing constraints for wide designs and allow Apple to hit higher clocks while keeping power low. They just don’t push clocks as high as AMD/Intel, but looking here they’re going to be about 80% of the way to AMD’s top Zen 5 mobile product.
I was one of those who thought that Apple could not push frequency too much without paying too high a power price. Process helped more than I expected. And I'm glad I was wrong.

But porting to a new process and enjoying the benefits can be very costly. Apple's implementation team definitely rocks.

Similarly, AMD and Intel have gotten wider too as usual and soon Lion Cove and Zen 5 will look much more like an Apple P core (Golden Cove was already a step that way), minus the cache stuff.
I don't buy leaks and claims from anyone until independent measurements are done. So let's wait for Intel and AMD new wonders before drawing conclusions.

2. That Apple wouldn’t adopt SME or SVE and would stick with AMX. This is proven totally wrong now. They adopted and built AMX before Arm had their own solution, but worked with them (it’s rumored) to produce SME, and now they’ve adopted that, and probably ditched AMX.
I always found that stance stupid. Whether or not they are vertically integrated, it is in their interest to converge as much as possible with the official Arm architecture. But we still have no proof that Apple implemented SME or SVE; the speedup could just come from using their library that wraps the AMX unit (and what about that library using the NPU where possible, a unit whose speed was doubled on M4?).
 

name99

Senior member
Sep 11, 2010
526
412
136
I think at this point I have to say GB6 is no longer a good benchmark.

It should have been called GB7 if something as important as SME was added. It skews the results.

I'm waiting for more workloads but nice to see Apple include SME.
I understand the logic that SME might be present because of the header discovery.
I don't fully understand the logic that it's speeding up Object Discovery.

Here's my issue: how does Object Discovery execute its neural nets?

I THOUGHT that GB6 CPU, deliberately, executes all such code (and similar things, like BLAS calls) as pure C/C++, NOT as API calls (like Accelerate), so that it could be sure that it was the CPU that was being tested, not some accelerator. (Obviously GB6 ML does perform API calls, since that is the point of that particular benchmark).

IF Object Discovery is calling into Accelerate Neural Net code, then we can't really know where it's executing. It COULD be executing on SME.
But it could also be executing on AMX.
Or even on ANE.
If it is executing on SME, why would SME be twice as fast as AMX? I'd expect the SME implementation to be much like the AMX implementation, with much the same performance (lots of tech details one can worry about, but that's the big picture - external accelerator attached to L2, not a particular core, performing 8*8 FP64 outer product).

IF Object Discovery is pure C/C++ code, then it seems highly unlikely that the existing, as-shipping, XCode, the one that compiled GB6, can generate SVE/SME. So SME won't execute even if present, not till after WWDC and new XCode and new GB6 compilation.

Having made the above claim, we then have the following!
Where do those strings come from?? Has XCode, for the past year, secretly been compiling code to SME and no-one noticed?

Thanks, Eug, for giving the 6.3 release notes that say that GB6 is compiled with SME, but that raises more questions! So there's an aarch64 binary in the GB6 package, presumably created by LLVM. BUT does the Mac (let alone the iPad) even run that binary (which is presumably the same binary that holds the SME strings)? Surely what the iPad runs is a Mach-O binary, created by XCode, and NOT created with SME support?

Update. @longhorn confirms that XCode 15 HAS SME support. Damn! It's been there for a year and no-one noticed?!?
 
Last edited:

SpudLobby

Senior member
May 18, 2022
991
684
106
I was one of those who thought that Apple could not push frequency too much without paying too high a power price. Process helped more than I expected. And I'm glad I was wrong.

But porting to a new process and enjoying the benefits can be very costly. Apple's implementation team definitely rocks.
Yeah. I never thought they were right based on how they went from 1.1 to 3.2GHz, but I was even more skeptical after the M2 and A15/A16.
I don't buy leaks and claims from anyone until independent measurements are done. So let's wait for Intel and AMD new wonders before drawing conclusions.
Well, Golden Cove alone is humongous and fairly wide, and it clocks to 6GHz. Irrespective of power or area whining I think it’s clear there are ways to make this stuff work either for AMD and Intel’s higher clocked designs (to a point) or for Apple and raising clocks.
I always found that stance stupid. Whether or not they are vertically integrated, it is in their interest to converge as much as possible with the official Arm architecture. But we still have no proof that Apple implemented SME or SVE; the speedup could just come from using their library that wraps the AMX unit (and what about that library using the NPU where possible, a unit whose speed was doubled on M4?).
Agreed
 
Reactions: Nothingness

Nothingness

Diamond Member
Jul 3, 2013
3,137
2,153
136
Update. @longhorn confirms that XCode 15 HAS SME support. Damn! It's been there for a year and no-one noticed?!?
LLVM/clang has had support for SME for about a year in git. I guess Apple merges mainline on a regular basis, so this might have come from such a merge. That's still no proof of silicon support IMHO; I expect people will be trying real SME opcodes on an M4.
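
FWIW, once people have M4 hardware in hand, a quick first check short of executing SME opcodes directly would be the hw.optional feature flags. A minimal sketch, assuming macOS/iPadOS exposes a hw.optional.arm.FEAT_SME key the way it does other FEAT_* flags (that key name is my assumption, not something I've verified on shipping hardware):

#include <stdio.h>
#include <sys/sysctl.h>

int main(void) {
    /* Assumed key name, mirroring the existing hw.optional.arm.FEAT_* flags.
       A sysctlbyname error just means the key isn't exposed on this OS/chip. */
    int has_sme = 0;
    size_t len = sizeof(has_sme);
    if (sysctlbyname("hw.optional.arm.FEAT_SME", &has_sme, &len, NULL, 0) != 0) {
        printf("FEAT_SME flag not exposed on this system\n");
        return 0;
    }
    printf("FEAT_SME = %d\n", has_sme);
    return 0;
}

Even then, the real proof is someone executing actual SME instructions on an M4.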

EDIT: this is more convincing:

 

name99

Senior member
Sep 11, 2010
526
412
136
Yes and no. I posed that question earlier but there are some considerations. Just rambling here...

1. Apple has occasionally updated their laptops really quickly, in well under a year. This has more typically been the MacBook Pros though IIRC.

2. OTOH, in the past Apple has sometimes let the MacBook Air languish with no CPU updates for years, even though it was always a strong seller due to price. I guess they could get away with it, because a lot of the low end SKUs for the MBA get purchased by students and grandmas and such, who don't usually care much about performance.

3. Apple just announced that the 13" MBA and 15" MBA are both the world's bestsellers in those size tiers. Keeping it fresh might help it continue that sales momentum.

4. But what does "keeping it fresh" mean? They don't necessarily have to be updated within the year. Even if they don't update it until mid 2025, it would likely continue to sell well. And in fact, in terms of performance, for those who care, M3 CPU likely performs roughly as well as M4 9-core anyway.

5. If N3E yields at 4.4 GHz are an issue, and if Apple did ship M4 MacBook Airs, they'd also start at 9-core, probably for the 8 GB tier. They'd reserve 10-core for the higher end configs.

6. Is N3B yield bad enough that they want to scrap it altogether? Would they really kill off M3 MacBook Air, M3 MacBook Pro, M3 Pro MacBook Pro, and M3 Max MacBook Pro all this soon?

7. Does a CPU core binned M4 negate the need for M4 Pro? Probably not because of the display support and video encoding speed.

I could see a scenario where they update the MacBook Pro quickly, killing off the production of M3 Pro and M3 Max, but keeping the M3 MacBook Air around longer. The MacBook Pro would get M4 10c, M4 Pro, and M4 Max, whereas the MacBook Air would stay on M3. This wouldn't necessarily be cheaper, but would be useful for market segmentation.

Thus I could see something like this happening:

2024 Q4 / 2025 Q1
MacBook Pro M4 10c, M4 Pro, M4 Max
iMac M4 9c & 10c
Mac mini M4 9c & 10c, M4 Pro

2025 Q2
Mac Studio M4 Max, M4 Ultra
Mac Pro M4 Ultra

2025 H2
MacBook Air M4 9c & 10c

However, some of the pundits disagree and think the MBAs would get refreshed before the Mac Studio and Mac Pro. That does have some support, because if TSMC wants to stop all N3B production asap, then it would make sense to just kill off M3 at about the same time they kill off M3 Pro and M3 Max.

An analogous situation to M3 series Macs would be the iPad 3, sort of. iPad 3 was a performance gimped release with A5X, and then 7 months later they released an ungimped iPad 4 with A6X. A5X was a very short lived chip. M3 isn't gimped in terms of performance, but it's likely still problematic in terms of cost/yield and TSMC's possible reluctance in continuing its manufacturing on N3B.

If M3 is discontinued, what would they do for the "cheap" US$999 MacBook Air then? They could just keep selling the M2 version. So, in the scenario where they kill off all M3 series chips, they could just sell the M2 MacBook Air alongside the M4 MacBook Air.
Supposedly TSMC has Arrow Lake GPU commitments to Intel using N3B. So in terms of that, I assume it will persist. (And Intel is probably not agile enough to easily move to N3E the way Apple can, especially since they NEED Arrow Lake to ship on schedule, otherwise their whole "4 Processes in 5 Years" claim becomes a joke).
So I think on the TSMC side, N3B will persist for another year at least.

But if N3E is cheaper for Apple than N3B, a rapid switch of MBA might be worth doing. Obviously Apple wants developers to use SME (and SVE?) as much as possible, and ramping up the installed base that has it available is the best way to get that. Likewise if the M4 has improved LLM support (as *seems* to be the case) and Apple wants Macs to be thought of as the premier "AI PC".
 
Reactions: Tlh97 and Eug

name99

Senior member
Sep 11, 2010
526
412
136
I was one of those who thought that Apple could not push frequency too much without paying too high a power price. Process helped more than I expected. And I'm glad I was wrong.

But porting to a new process and enjoying the benefits can be very costly. Apple's implementation team definitely rocks.


I don't buy leaks and claims from anyone until independent measurements are done. So let's wait for Intel and AMD new wonders before drawing conclusions.


I always found that stance stupid. Whether or not they are vertically integrated, it is in their interest to converge as much as possible with the official Arm architecture. But we still have no proof that Apple implemented SME or SVE; the speedup could just come from using their library that wraps the AMX unit (and what about that library using the NPU where possible, a unit whose speed was doubled on M4?).
There are two ways to push frequency higher.

Run the individual transistors faster, or
cut the pipeline into more stages that can each run faster because they do less sequential work in each stage.

My GUESS (given the power numbers) is that Apple has been concentrating on the second rather than the first. This is NOT easy to do (eg perform all the steps of executing 9-instruction dependency checks, then allocate their resources, in one cycle).

The general consensus has been that some pipeline stages (like Allocate and Issue) HAVE to be performed in one cycle, or else performance goes to hell. And this has determined a lot of, eg AMD and Intel thinking.
But this may be like many such results – If you demand PERFECTION in the stage (eg "perfectly OPTIMAL" issue) then you are forced to do it a certain, power hungry way. But if all you require is "good enough" heuristics, then you can view the problem in a different way, run it much wider (more transistors, but limited SEQUENTIAL transistor dependency, which is what limits your clock rate) and get great results. Maybe Apple issues .1% of instructions out of sub-optimal order, and maybe 1 in a thousand cycles Allocate hits a problem and has to stall for one cycle while it figures things out.

Overall, a great tradeoff – but one that requires lateral thinking, and the standard Apple willingness to use plenty of extra transistors because transistors are basically free.

This is similar to various "exponential" problems like optimization or traveling salesman. Yes, if you demand perfection, they are essentially impossible. But if all you demand is good enough, then heuristics actually work astonishingly well - as is most obviously the case right now with the point that massive NNs can in fact be optimized remarkably well, even if the optimization is not "total".
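
To put toy numbers on that stage-count tradeoff: cycle time is roughly the logic delay per stage plus a fixed latch/clocking overhead, so cutting the pipeline finer buys clock speed but lengthens the refill after a mispredict. A small illustrative calculation (the figures are made up, not Apple's):

#include <stdio.h>

/* Textbook pipelining tradeoff with purely illustrative numbers:
   splitting a fixed amount of sequential logic across more stages shortens
   the cycle, but each stage still pays a fixed latch/clocking overhead,
   and a branch mispredict has to refill more stages. */
int main(void) {
    const double total_logic_ps = 2500.0;   /* hypothetical total logic depth */
    const double latch_overhead_ps = 50.0;  /* hypothetical per-stage overhead */

    for (int stages = 8; stages <= 20; stages += 4) {
        double cycle_ps = total_logic_ps / stages + latch_overhead_ps;
        double f_ghz = 1000.0 / cycle_ps;               /* 1000 ps per ns */
        double mispredict_ns = stages * cycle_ps / 1000.0;
        printf("%2d stages: %.2f GHz, mispredict refill ~%.2f ns (~%d cycles)\n",
               stages, f_ghz, mispredict_ns, stages);
    }
    return 0;
}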
 

Doug S

Platinum Member
Feb 8, 2020
2,890
4,914
136
Supposedly TSMC has Arrow Lake GPU commitments to Intel using N3B. So in terms of that, I assume it will persist. (And Intel is probably not agile enough to easily move to N3E the way Apple can, especially since they NEED Arrow Lake to ship on schedule, otherwise their whole "4 Processes in 5 Years" claim becomes a joke).
So I think on the TSMC side, N3B will persist for another year at least.

I'm still not convinced of that. Intel may have originally planned to use N3 (before it was known to be flawed and became known as "N3B") a few years ago, when "Intel buying TSMC N3" was in the news - and note that those rumors had Intel getting in AHEAD OF Apple on N3. Which would have happened if they'd started shipping products containing N3B chips a year ago.

But Intel's schedule changed, and changed some time ago. Given that N3E is cheaper, faster, and lower power, and that Intel would have had plenty of time to port that GPU chiplet to N3E, I think that's exactly what they did. Heck TSMC might have helped them with that, since they have clear incentives to put N3B in the rear view mirror as well.
 
Last edited:

Doug S

Platinum Member
Feb 8, 2020
2,890
4,914
136
There are two ways to push frequency higher.

Run the individual transistors faster, or
cut the pipeline into more stages that can each run faster because they do less sequential work in each stage.

My GUESS (given the power numbers) is that Apple has been concentrating on the second rather than the first. This is NOT easy to do (eg perform all the steps of executing 9-instruction dependency checks, then allocate their resources, in one cycle).

If they're increasing pipeline stages it would show up in instruction latencies. You're doing plenty of low level testing on Apple Silicon hardware, have you seen instruction latency increasing?

I'm skeptical. I think they're just taking advantage of what the process gives them. When TSMC shows a process as e.g. 10% faster than an older one in a comparison, they are telling you the transistors will switch 10% faster at the same power. Between that, and other tweaks afforded by FinFlex, they don't need to do any major reorganization of pipeline stages to get the clock increases we've seen from them.

Now obviously we'd like to see some proper IPC increases from them as well, and it looks like this time around we didn't get any. But it is hard to complain about the fastest (non overclocked) GB result ever, in a passively cooled 5mm tablet. Given the short time gap between M3 and M4 it would be unsurprising if it was basically the same core, but hopefully M5 will be more than "clock bump + special instructions to accelerate a limited set of scenarios". But again, #1 in GB at the moment at least so hard to crap on them for that!
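
For anyone who wants to poke at the instruction-latency question themselves, the usual approach is a long serially dependent chain: time N back-to-back dependent ops and divide. A rough, aarch64-only sketch (converting ns/op into cycles still needs the actual clock frequency, and DVFS adds noise, so treat the output as a ballpark):

#include <stdio.h>
#include <stdint.h>
#include <time.h>

/* Rough dependent-chain latency probe: each add depends on the previous one,
   so the chain can't be overlapped; the loop bookkeeping runs in parallel on
   a wide out-of-order core. ns per iteration ~= latency of one dependent add. */
int main(void) {
    const uint64_t iters = 200000000ULL;
    uint64_t x = 1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < iters; i++) {
        __asm__ volatile("add %0, %0, #1" : "+r"(x));   /* serial dependency */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("x=%llu, %.3f ns per dependent add\n",
           (unsigned long long)x, ns / (double)iters);
    return 0;
}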
 

Doug S

Platinum Member
Feb 8, 2020
2,890
4,914
136
So if Apple is supporting SME now, do they still need their AMX instructions? Do those do anything that SME instructions can't? AFAIK SME requires ARMv9, so does that mean M4 is ARMv9?

Maybe it went like this - Apple was not ready to implement ARMv9 yet (and maybe that was dependent on the reported ALA extension with ARM that runs through 2040 or so?) but wanted matrix support without waiting on ARMv9. So they add AMX instructions, but hide them behind a library so they can pull those instructions later. When they do an ARMv9 core they add SME and drop AMX, but since all code out there is doing library calls that's transparent. The library on M4 Macs would use SME instructions instead of Apple's proprietary AMX instructions.

Because it doesn't make sense to me that Apple would implement SME if they are going to continue to support their AMX instructions as well. I mean, it would all map to basically the same unit since they're doing the same thing so it isn't a waste of silicon area, but it still creates unnecessary work for no visible gain unless AMX will always be a superset of SME.
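
That "hide it behind a library" model is essentially how it already works: matrix-heavy code calls into Accelerate, and the framework decides what the call actually runs on (NEON, the AMX unit, or presumably SME on M4), so swapping the underlying instructions is invisible to apps. A minimal sketch of such a call through the standard CBLAS interface Accelerate exposes (nothing here is M4-specific):

#include <stdio.h>
#include <Accelerate/Accelerate.h>   /* build with: clang gemm.c -framework Accelerate */

/* 2x2 single-precision matrix multiply through Accelerate's CBLAS interface.
   The caller never sees which hardware unit does the work -- that's the point
   of keeping AMX/SME behind the library boundary. */
int main(void) {
    float A[4] = {1, 2, 3, 4};
    float B[4] = {5, 6, 7, 8};
    float C[4] = {0, 0, 0, 0};

    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,        /* M, N, K */
                1.0f, A, 2,     /* alpha, A, lda */
                B, 2,           /* B, ldb */
                0.0f, C, 2);    /* beta, C, ldc */

    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);   /* expect 19 22 / 43 50 */
    return 0;
}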
 

Eug

Lifer
Mar 11, 2000
23,926
1,528
126

This time a more reputable source, supply chain analyst Jeff Pu, is saying Apple is building M2 Ultra and later M4 (M4 Ultra, probably) servers for AI
Now, Mark Gurman is repeating what Jeff Pu said. As in Apple plans to use M2 Ultra for AI datacenter cloud servers, and then eventually M4 series ones.

However, those statements are pretty vague. For all we know, they could just be existing Macs with hardware add-ons and custom software.
 
Reactions: Orfosaurio

SpudLobby

Senior member
May 18, 2022
991
684
106
If they're increasing pipeline stages it would show up in instruction latencies. You're doing plenty of low level testing on Apple Silicon hardware, have you seen instruction latency increasing?

I'm skeptical. I think they're just taking advantage of what the process gives them. When TSMC shows a process as e.g. 10% faster than an older one in a comparison, they are telling you the transistors will switch 10% faster at the same power. Between that, and other tweaks afforded by FinFlex, they don't need to do any major reorganization of pipeline stages to get the clock increases we've seen from them.
Yeah Doug I agree mostly. I think two caveats are in order:
They did increase L1 latency slightly it seems with M4? But that’s it. Don’t think they’re doing any major restructuring.


And then, I largely agree all they’re doing is taking advantage of dense cell performance increases — they’re not qualitatively jumping into AMD/Intel territory.

However, power has definitely increased more over time than the node improvements alone would account for. A11 to A12 was a small bump, A13 another small bump, around the same with A14. Then A15 bumped it again, and then A17 again.

Their cores are still very efficient (energy-wise), especially for the performance class — even an LPDDR5 Phoenix is not going to match an M1’s 1730 GB5 performance at the same power, nor an M2’s 1880. They can get 1900+ but at like 20 W. And tuning it down still wouldn’t get them close IMO.

But I do wish they wouldn’t push it as much, even if energy efficiency is still good. I suspect this is still going to be an ST power increase. Not 15+W territory, but I could see them hitting 9-11W.

Now obviously we'd like to see some proper IPC increases from them as well, and it looks like this time around we didn't get any. But it is hard to complain about the fastest (non overclocked) GB result ever, in a passively cooled 5mm tablet. Given the short time gap between M3 and M4 it would be unsurprising if it was basically the same core, but hopefully M5 will be more than "clock bump + special instructions to accelerate a limited set of scenarios". But again, #1 in GB at the moment at least so hard to crap on them for that!
Yeah I mean regardless of any power increase that they’re able to put it in a tablet is still impressive.

IPC is still a big ??? for the long term though.
 
Reactions: Vattila

SpudLobby

Senior member
May 18, 2022
991
684
106
So if Apple is supporting SME now, do they still need their AMX instructions? Do those do anything that SME instructions can't? AFAIK SME requires ARMv9, so does that mean M4 is ARMv9?
No, I don’t think AMX has anything that SME doesn’t, besides first-party support, but that’s probably fixed by now.
Maybe it went like this - Apple was not ready to implement ARMv9 yet (and maybe that was dependent on the reported ALA extension with ARM that runs through 2040 or so?) but wanted matrix support without waiting on ARMv9. So they add AMX instructions, but hide them behind a library so they can pull those instructions later. When they do an ARMv9 core they add SME and drop AMX, but since all code out there is doing library calls that's transparent. The library on M4 Macs would use SME instructions instead of Apple's proprietary AMX instructions.

Because it doesn't make sense to me that Apple would implement SME if they are going to continue to support their AMX instructions as well. I mean, it would all map to basically the same unit since they're doing the same thing so it isn't a waste of silicon area, but it still creates unnecessary work for no visible gain unless AMX will always be a superset of SME.
I think Apple had AMX before SME was finished and they would rather have adopted SME. The rumor is they had a lot of input to it too.

RE: visible gains — sure, for existing stuff it won’t change, but SME will be a standard part of Armv9 toolchains, so wherever it can be auto-emitted, that’s a boost, and Arm Macs will still run some code compiled generically for Armv9, one way or another.
 

name99

Senior member
Sep 11, 2010
526
412
136
If they're increasing pipeline stages it would show up in instruction latencies. You're doing plenty of low level testing on Apple Silicon hardware, have you seen instruction latency increasing?
My finances do not allow me to buy every shiny new Apple device that appears.
And I'm leaving the low level testing game to younger folk. I have other projects I need to work on. My hope is that over the next few years at least a few such people will take up the challenge.

All the students here who are continually moaning "how do I stand out? how do I get a job at a tier one company?"
THIS is how -- by doing something that no-one else is doing, by showing you are capable of original thought! Investigate low level details of the latest Apple chip. Compare details of one GPU vs another. Explore what the ANE does or doesn't do well. Hell, people like myself, Chips and Cheese, or Corsix, or dougallj can't do everything. Take up just one of the projects we've started investigating and burrow deep into it.
 

Eug

Lifer
Mar 11, 2000
23,926
1,528
126
There are not a lot of M4 Metal scores yet, but FWIW:

M3 Metal 6.3.0 score 48101
M4 Metal 6.3.0 score 53877 (+12.0%)

M3 Metal 5.5.1 score 34810
M4 Metal 5.5.1 score 40933 (+17.6%)

BTW, Geekbench's servers are getting slammed, or else some of the servers have gone down or something. I kept on getting this message:

This website is under heavy load (queue full)

We're sorry, too many people are accessing this website at the same time. We're working on this problem. Please try again later.
 

name99

Senior member
Sep 11, 2010
526
412
136
So if Apple is supporting SME now, do they still need their AMX instructions? Do those do anything that SME instructions can't? AFAIK SME requires ARMv9, so does that mean M4 is ARMv9?

Maybe it went like this - Apple was not ready to implement ARMv9 yet (and maybe that was dependent on the reported ALA extension with ARM that runs through 2040 or so?) but wanted matrix support without waiting on ARMv9. So they add AMX instructions, but hide them behind a library so they can pull those instructions later. When they do an ARMv9 core they add SME and drop AMX, but since all code out there is doing library calls that's transparent. The library on M4 Macs would use SME instructions instead of Apple's proprietary AMX instructions.

Because it doesn't make sense to me that Apple would implement SME if they are going to continue to support their AMX instructions as well. I mean, it would all map to basically the same unit since they're doing the same thing so it isn't a waste of silicon area, but it still creates unnecessary work for no visible gain unless AMX will always be a superset of SME.
I think SOMETHING like that occurred but (controversial claim), different ordering, and so there was a falling out between ARM and Apple over details.
If we look at SVE, the spec seems to have, uh, a troubled history. SVE, various obvious problems so SVE2, then a constant stream of minor updates fixing stupid issues that should never have been present in the first place. It seems like ARM, for whatever reason, was committed to shipping something like SVE before it was ready (agreement with Fujitsu?) and then spent years patching up the mistakes.

So in that time, presumably, we have some combination of Apple saying "**** this, here's how you do it correctly" as a demo to ARM, along with a whole lot of negotiation as to how to fix SVE and transform it into what Apple wants. Which finally achieved a unified vision maybe three or four years ago, and it took that long to transform the spec into silicon?

When I look at SME vs AMX (which is a VERY quick scan) I can't see any compelling reason to keep AMX. The two give basically identical functionality and performance on the same hardware, and AMX is hidden behind Accelerate calls anyway. So drop it ASAP and move on. Apple can still always sneak in any weird extra functionality they want via specialized instructions, as they have done repeatedly, then negotiate with ARM to validate them (as they have also done repeatedly).
 

name99

Senior member
Sep 11, 2010
526
412
136
Yeah Doug I agree mostly. I think two caveats are in order:
They did increase L1 latency slightly it seems with M4? But that’s it. Don’t think they’re doing any major restructuring.

What evidence do we have for this (increased L1 latency)?

BTW if the pipeline grows longer, that will NOT show up in "per instruction" latencies, only in branch misprediction cost (which is EXTREMELY difficult to measure correctly on an Apple-level chip, I haven't seen any good analyses).
If the pipeline has not been lengthened but simply "restructured" as I suggested, so that the hottest stages are now less "accurate" but have fewer back-to-back sequential transistor stages, that again won't show up as any sort of latency.
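
For reference, the crude starting point for that measurement is timing the same data-dependent branch over shuffled versus sorted data and attributing the difference to mispredicts. A minimal sketch, with the caveat above: on a deep out-of-order core this only yields a rough average penalty, not a clean pipeline-depth number:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Crude mispredict-cost probe: the same data-dependent branch is timed over
   shuffled (unpredictable, ~50% mispredicts) and sorted (predictable) data;
   the per-element time difference divided by the ~0.5 mispredict rate gives
   a rough average penalty. Ballpark only. */
#define N 10000000

static double time_pass(const int *v) {
    struct timespec t0, t1;
    volatile long acc = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++) {
        if (v[i] < RAND_MAX / 2)    /* data-dependent branch */
            acc += v[i];
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)acc;
    return (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
}

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void) {
    int *v = malloc(N * sizeof *v);
    if (!v) return 1;
    for (int i = 0; i < N; i++) v[i] = rand();

    double random_ns = time_pass(v);    /* unpredictable branch */
    qsort(v, N, sizeof *v, cmp_int);
    double sorted_ns = time_pass(v);    /* predictable branch */

    printf("rough penalty: %.2f ns per mispredict\n",
           (random_ns - sorted_ns) / (0.5 * N));
    free(v);
    return 0;
}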
 

SpudLobby

Senior member
May 18, 2022
991
684
106
What evidence do we have for this (increased L1 latency)?

BTW if the pipeline grows longer, that will NOT show up in "per instruction" latencies, only in branch misprediction cost (which is EXTREMELY difficult to measure correctly on an Apple-level chip, I haven't seen any good analyses).
If the pipeline has not been lengthened but simply "restructured" as I suggested, so that the hottest stages are now less "accurate" but have fewer back-to-back sequential transistor stages, that again won't show up as any sort of latency.
Post in thread 'Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)'
http://www.portvapes.co.uk/?id=Latest-exam-1Z0-876-Dumps&exid=thread...ranite-ridge-ryzen-9000.2607350/post-41184216


There was discussion about this elsewhere IIRC e.g. realworldtech.
 

poke01

Platinum Member
Mar 8, 2022
2,584
3,412
106
PPW matters; let’s see that breakdown. Also, this is not like the Skylake era, as I’ve seen some people on Twitter claiming it is.

Unlike Intel, Apple is not staying on the same node.
 

gdansk

Diamond Member
Feb 8, 2011
3,276
5,186
136
I want the same response for Zen 5: if some of the IPC improvements come from improved AVX-512 support, then people should also deduct IPC.

Let’s see how they play it.
I find AVX-512's flexibility to operate on data types larger than 16 bits makes it more applicable than SME. It's more akin to SVE (which no one is deducting?).
 