Question CPU Microarchitecture Thread

FlameTail · Dec 26, 2022

So I decided to finally make this thread so that one can find answers to minor queries regarding CPU micro architectures. Questions the like of which Google can't provide a good answer to, which is not surprising since this is a deep subject and there is a lot of inaccurate information out there.

DavidC1 · Nov 11, 2024

FlameTail said:
Oryon has a 192 KB L1i, and Lion Cove has a 192 KB L1d, rivalling the size of the Apple. They both use 4 KB page sizes iirc, so I don't think your statement is true.

Lion Cove's true L1 is what they call L0. It's basically marketing. L1i is only 64KB with 5 cycle latency. And it's "L1" has 9 cycle latency. It's L2 class latency of Nehalem days.

ARM's way of caches and clocks are far superior. Apple achieves better performance with lot lower power. The ARM chips aren't that far behind either. It is embarassing.

The scientists making these have long noted that power required for memory accesses is a big limiter in making it perform better. Having large L1 is a very good basic idea, which can't be done on 5.7GHz processors without insane latency.

The difference between a 19-stage processor and 9 stage one is only 27%. And x86 vendors have to pull out all the tricks to get there. Small, high latency cache levels, lowered uncore clocks, stability issues. It's not worth it. And that gap is going to shrink further. At what point you have to ask: Is it worth it?

adroc_thurston · Nov 11, 2024

DavidC1 said:
The difference between a 19-stage processor and 9 stage one is only 27%.

Not goddamn iso node comparison. Full stop. Don't.
It won't be a 5.7G core on N3e zoomzooms.

itsmydamnation · Nov 11, 2024

DavidC1 said:
Lion Cove's true L1 is what they call L0. It's basically marketing. L1i is only 64KB with 5 cycle latency. And it's "L1" has 9 cycle latency. It's L2 class latency of Nehalem days.

ARM's way of caches and clocks are far superior. Apple achieves better performance with lot lower power. The ARM chips aren't that far behind either. It is embarassing.

The scientists making these have long noted that power required for memory accesses is a big limiter in making it perform better. Having large L1 is a very good basic idea, which can't be done on 5.7GHz processors without insane latency.

The difference between a 19-stage processor and 9 stage one is only 27%. And x86 vendors have to pull out all the tricks to get there. Small, high latency cache levels, lowered uncore clocks, stability issues. It's not worth it. And that gap is going to shrink further. At what point you have to ask: Is it worth it?

Then you go look at ARM servers doing high core counts...... oh look far worse cache implementations......

so if i approach this like you ,

how many 144+ core apple SOC's are there , 0 so x86 cache implementation = infinity better
how many 144+ core AWS SOC's are there , 0 so x86 cache implementation = infinity better

lets look at graviton V4, only 32k L1D , only 2mb L2 per core with almost zero L3 which means lots of extra memory accesses then modern x86 server cores.......
lets look at ampere A192 64kb but its write thought ( great for power right ) 2mb L2 per core with zero L3 but memory control side cache. which means lots of extra memory accesses then modern x86 server cores, that totally doesn't show up in benchmarks like https://www.servethehome.com/ampere...permicro-nvidia-broadcom-kioxia-server-cpu/2/

man ARM is so far behind its EMBARASSING

Or lets not be idiots and actually evaluate things with consideration to their TAM/target markets, different vendors are making different trade offs.

FlameTail · Nov 11, 2024

itsmydamnation said:
Or lets not be idiots and actually evaluate things with consideration to their TAM/target markets, different vendors are making different trade offs.

That's a good point about TTM and target markets. AWS Graviton4, Nvidia Grace and Google Axion, all use the previous gen ARM Neoverse V2 cores. So it seems ARM IP takes a while to land in the market.

Speaking of ARM IP, ARM doesn't use very large L1 caches and shared L2 caches. Those are characteristics of Apple/Qualcomm's custom ARM CPU designs, which is what DavidC1 was talking about.

Neither Apple/Qualcomm make server chips, but that doesn't necessarily mean their L1/sL2 cache strategy is unsuitable for servers.

(1) According to rumours, Qualcomm/Nuvia had a 5nm 80 Oryon core server CPU in the works, which eventually was cancelled. Presumably, this thing had 10 of 8-core Oryon CPU clusters, with 12 MB of shared cache per cluster.

(2) Apple is deploying M2 Ultras in their own datacenters for internal use. So does that count as a server CPU?

Apple’s AI Strategy: Apple Datacenters, On-device, Cloud, And More

When to run on-device, in Apple datacenters, or in the cloud with OpenAI, Deal economics Nvidia continues to ramp their production to service the world’s insatiable demand for GPUs, and yet, our Ac…

semianalysis.com

itsmydamnation · Nov 11, 2024

FlameTail said:
That's a good point about TTM and target markets. Both AWS Graviton4 and Google Axion use the previous gen ARM Neoverse V2 cores. So it seems ARM IP takes a while to land in the market.

Speaking of ARM IP, ARM doesn't use very large L1 caches and shared L2 caches. Those are characteristics of Apple/Qualcomm's custom ARM CPU designs, which is what DavidC1 was talking about.

Yes and they pay a very large price for it, CPU's are a pretty small component in TCO servers of a server and that's ignoring software costs. So if i channel my inner DavidC1 then ARM are just Dumb Dumb Dumb..... yeah ?

FlameTail · Nov 11, 2024

Okay, this seems to be the perfect thread to discuss this tweet from Dr. Eric Quinnell, a respected industry veteran.

Hot takes summary:

* ISA does matter (ie var length)
* OoO SMP beats SMT always
* > 2-3GHz is negative perf/watt
* no DMA in transport protocols
* grep/bash > python at text parsing
* OoO brp > OoO predicates, pick one

Eric Quinnell - X

Some background about him;

Quinnell spent nearly four years at Tesla, after stints at Arm, Samsung Austin R&D, and AMD. At AMD, he worked on the low-power x86 architecture that went into the PS4 and Xbox One. He currently works at Amazon AWS.

FlameTail · Nov 11, 2024

ISA does matter (ie var length)

Fully agreed. There are some who argue that ISA doesn't matter, which is clearly wrong.

ISA does matter.

The discussion worth having is How much does it matter?

OoO SMP beats SMT always

I believe SMP stands for Symmetric Multi Processing;

Symmetric multiprocessing - Wikipedia

en.m.wikipedia.org

This hearkens back to what I posted 2 years ago;

FlameTail said:
(2) I have come across various people who have posited that Apple's P cores do not have SMT since they have a very large ROB. I'll try post links to the OPs who said so -I can find them, but the idea is that the need for multithreading is eliminated since the core is very good at Out-of-Order execution. Is this correct?

That explains why Apple hasn't implemented SMT, and why they have such large Re-Order Buffers.

OoO SMP > SMT

OoO brp > OoO predicates, pick one

What does brp stand for?

OneEng2 · Nov 11, 2024

DavidC1 said:
Lion Cove's true L1 is what they call L0. It's basically marketing. L1i is only 64KB with 5 cycle latency. And it's "L1" has 9 cycle latency. It's L2 class latency of Nehalem days.

ARM's way of caches and clocks are far superior. Apple achieves better performance with lot lower power. The ARM chips aren't that far behind either. It is embarassing.

The scientists making these have long noted that power required for memory accesses is a big limiter in making it perform better. Having large L1 is a very good basic idea, which can't be done on 5.7GHz processors without insane latency.

The difference between a 19-stage processor and 9 stage one is only 27%. And x86 vendors have to pull out all the tricks to get there. Small, high latency cache levels, lowered uncore clocks, stability issues. It's not worth it. And that gap is going to shrink further. At what point you have to ask: Is it worth it?

LOL. Ok, all this theory sounds great on paper, but obveously there is something wrong with it. Here is what happens when ARM 192 core meets Zen 5c 192 core: https://www.phoronix.com/review/amd-epyc-9965-ampereone#google_vignette

ARM is beaten badly .... and nearly across the board in every benchmark .... and by significant margins.

I think too much time is spent on synthetic benchmarks, and theoretical performance that don't translate into real world performance.

I'm just saying, there's more to the picture than being painted here.

Hitman928 · Nov 11, 2024

@FlameTail Branch if positive (BRP) AFAIK.

naukkis · Nov 11, 2024

gai said:
There's no difference between VIPT and PIPT when the cache is small enough, and this is where the implementation becomes much nicer. Hence, the 48 KiB Zen 5 L1D$ and the 48 KiB Lion Cove "L0" D$, which use 4K page sizes, are both 12-way PIPT.

When the first-level cache uses VIVT, it requires some alias analysis hardware to handle duplicate instances of the same cache line, and this hardware does not come for free.

L1 cache is only accessed with VIVT tags - actually mictotags which are only partial as 64bit address space ain't needed to access few kilobytes of memory. Every L1 cache line has also PIPT tags which are used when cache line is loaded or checked for coherency- either in L2 when L2 is inclusive or in L1 tag directory. Only data access from core to L1 cache is accessed with virtual address - all other cache handling through PIPT tags and absolutely same as doing it with only PIPT.

IBM is documented extremely well their micto-tagged access with directory based full physical tags sceheme for their Z-series cpus, I recommend looking it if want to know more about it.

MS_AT · Nov 11, 2024

FlameTail said:
ISA does matter.

Well, the thing is that what matters is what people are usually brushing aside What mostly gets discussed is the decode [fixed vs variable instruction length] but things like memory ordering or architectural registers are brushed aside.

FlameTail said:
OoO SMP > SMT

I don't think they should be compared directly. The biggest promise of SMT is that when you have 8C8T CPU with little area cost you can make it 8C16T and get noticeable boost in PPA, not that SMT should be preferable to SMP.

FlameTail said:
That explains why Apple hasn't implemented SMT

We don't know if Apple will implement SMT in the future. It seems so far they did not feel a need to do so. But if gigantic ROB alone would be a key to higher performance then Intel should have stayed ahead of AMD, as every Intel P-core had bigger ROB than contemporary AMD equivalent if I am not mistaken.

naukkis · Nov 11, 2024

DavidC1 said:
Having large L1 is a very good basic idea, which can't be done on 5.7GHz processors without insane latency.

IBM Z15 says hi, 128KB L1 with 4-cycle latency @5.2Ghz on 14nm node. It's doable, x86 vendors are just not at leading edge of designs.

FlameTail · Nov 11, 2024

naukkis said:
IBM Z15 says hi, 128KB L1 with 4-cycle latency @5.2Ghz on 14nm node. It's doable, x86 vendors are just not at leading edge of designs.

And their huge L2 cache.

IBM Tellum II = 32 MB pL2 @ 3.6ns latency

For comparison;

X Elite = 12 MB sL2 @ 5.28ns

Source;

Telum II at Hot Chips 2024: Mainframe with a Unique Caching Strategy

Mainframes still play a vital role in today, providing extremely high uptime and low latency for financial transactions.

chipsandcheese.com

adroc_thurston · Nov 11, 2024

naukkis said:
IBM Z15 says hi, 128KB L1 with 4-cycle latency @5.2Ghz on 14nm node. It's doable, x86 vendors are just not at leading edge of designs.

you forgot to mention power and area numbers.

naukkis · Nov 11, 2024

adroc_thurston said:
you forgot to mention power and area numbers.

Power wise x86 cpus puts everything possible and more to drive those max boost clocks. And area is that problem with cache size versus latency - signal delays limits area available for cache. x86 small L1 caches aren't technologically limited but designs choices. And those design choices need to change as rivals are better.

gdansk · Nov 11, 2024

naukkis said:
Power wise x86 cpus puts everything possible and more to drive those max boost clocks. And area is that problem with cache size versus latency - signal delays limits area available for cache. x86 small L1 caches aren't technologically limited but designs choices. And those design choices need to change as rivals are better.

Actually as I recall Z15 goes even further beyond when it comes to power and area. It's even bigger and hotter than any x86 chip yet built. Only 12 cores in a massive 700mm² die.

naukkis · Nov 11, 2024

gdansk said:
Actually as I recall Z15 goes even further beyond when it comes to power and area. It's even bigger and hotter than any x86 chip yet built. Only 12 cores in a massive 700mm² die.

Intel tuned Alderlake beyond silicon limitations. I haven't heard that IBM mainframes are dying same way. Intel server chips are also similar sized to max what could be build. Core count is irrelevant - IBM just use massive cache amounts because that what their target software needs. CPU core size is yet again limited by signal delays - it's a balance with pipeline stages, signal delays and clock frequency. Intel would go bigger in core size if that will increase performance, but many times they have gone too far and result have been poor and they needed to redesign to smaller cpu core.

gdansk · Nov 11, 2024

naukkis said:
Intel tuned Alderlake beyond silicon limitations. I haven't heard that IBM mainframes are dying same way.

Intel doesn't deploy them all under 5 lb heatsinks with 20k rpm fans. And if IBM CPUs fail everyone using them has them under a maintenance plan, of course you'd never hear of it.
The fact is one can more easily have a massive L1 if the plan is to have 12 cores instead of say 60 cores in a reticle limit chip.

naukkis · Nov 11, 2024

gdansk said:
Intel doesn't deploy them all under 5 lb heatsinks with 20k rpm fans. And if IBM CPUs fail everyone using them has them under a maintenance plan, of course you'd never hear of it.
You can more easily have a massive L1 of your plan is to have 12 cores instead of say 60 cores in a reticle limit.

That L1 cache designs reasons aren't size limited. IBM z15 is 14nm soc where half of die is edram-L3 cache. That IBM z-core isn't any bigger excluding L2 cache than x86 cores manufactured on same kind of process targeting same frequencies.

And try to remember that phone chips have even bigger L1-caches.

gdansk · Nov 11, 2024

naukkis said:
That IBM z-core isn't any bigger excluding L2 cache than x86 cores manufactured on same kind of process targeting same frequencies.

14nm Z15 core without L2: 15.96mm²
14nm Skylake core with L2: 7.9mm²
It's not even close? I don't know what to say.

adroc_thurston · Nov 11, 2024

naukkis said:
Power wise x86 cpus puts everything possible and more to drive those max boost clocks

Oh hell no.
There's a good reason mainframes are watercooled lmao.

DavidC1 · Nov 11, 2024

itsmydamnation said:
Yes and they pay a very large price for it, CPU's are a pretty small component in TCO servers of a server and that's ignoring software costs. So if i channel my inner DavidC1 then ARM are just Dumb Dumb Dumb..... yeah ?

This is like arguing Core 2 was a bad server CPU versus Opteron. Then when they put a proper uncore with Nehalem, it left Opteron in the dust.

Just cause the ARM guys didn't yet make a good server CPU doesn't mean they have a crappy CPU architecture(in fact it's the other way around). They have a massive client market so all the focus and attention is there.

@FlameTail

naukkis said:
IBM Z15 says hi, 128KB L1 with 4-cycle latency @5.2Ghz on 14nm node. It's doable, x86 vendors are just not at leading edge of designs.

There's this too. Design and execution trumps ISA. Right now ARM vendors have both.

Doug S · Nov 11, 2024

OneEng2 said:
LOL. Ok, all this theory sounds great on paper, but obveously there is something wrong with it. Here is what happens when ARM 192 core meets Zen 5c 192 core: https://www.phoronix.com/review/amd-epyc-9965-ampereone#google_vignette

ARM is beaten badly .... and nearly across the board in every benchmark .... and by significant margins.

I think too much time is spent on synthetic benchmarks, and theoretical performance that don't translate into real world performance.

I'm just saying, there's more to the picture than being painted here.

That's a terrible take. You're comparing a state of the art x86 core against a core far worse than the state of the art in ARM land. Put 192 M4 cores together with the same amount of inter-core communication and memory controller resources that the Zen 5c Epyc has and you'd see a totally different story. Or for a comparison where both sides are fighting with one hand tied behind their backs put 192 Bulldozer cores up against that AmpereOne.

Your example has absolutely nothing to do with "ARM not being able to translate between synthetic benchmarks and real world performance" and everything to do with AmpereOne having a crappy core, and its marketers having cherry picked a few benchmarks that show it off in the best light.

Yes I know what I'm suggesting to compare with doesn't exist, but it isn't because Apple couldn't build that 192 core monster it is because they have no interest in competing in the high end server market (or for that matter in the low end server market)

I'm not suggesting here that ARM is inherently superior, but the take some have that ARM's performance is only applicable to mobile and laptops and synthetic benchmarks, but it is unsuited for "real world performance" in high end gaming/workstation/servers is laughable.

NostaSeronx · Nov 11, 2024

gdansk said:
14nm Z15 core without L2: 15.96mm²
14nm Skylake core with L2: 7.9mm²
It's not even close? I don't know what to say.

Not sure if "manufactured on same kind of process targeting same frequencies" covers cell height differences.

GF14HP Std Cell in Z15 = 18-track, 5 p-fin+5 n-fin. Which can't really be considered the same kind of process/std. cells with Intel's 14nm/Skylake.

itsmydamnation · Nov 11, 2024

Doug S said:
That's a terrible take. You're comparing a state of the art x86 core against a core far worse than the state of the art in ARM land. Put 192 M4 cores together with the same amount of inter-core communication and memory controller resources that the Zen 5c Epyc has and you'd see a totally different story. Or for a comparison where both sides are fighting with one hand tied behind their backs put 192 Bulldozer cores up against that AmpereOne.

Your example has absolutely nothing to do with "ARM not being able to translate between synthetic benchmarks and real world performance" and everything to do with AmpereOne having a crappy core, and its marketers having cherry picked a few benchmarks that show it off in the best light.

Yes I know what I'm suggesting to compare with doesn't exist, but it isn't because Apple couldn't build that 192 core monster it is because they have no interest in competing in the high end server market (or for that matter in the low end server market)

I'm not suggesting here that ARM is inherently superior, but the take some have that ARM's performance is only applicable to mobile and laptops and synthetic benchmarks, but it is unsuited for "real world performance" in high end gaming/workstation/servers is laughable.

But thats the point , you cant do that with good performance because you have to have a cache control protocol that actually works across that. So then when you go and look at the latest ARM large scale fabrics they are significantly worse for data locality then what AMD are doing let alone what the ARM client cores look like.

Also servers still really care about 4k page size performance so what happens then with your apple core?

Question CPU Microarchitecture Thread

Diamond Member

Golden Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Senior member

Diamond Member

Senior member

Senior member

Senior member

Diamond Member

Diamond Member

Senior member

Diamond Member

Senior member

Diamond Member

Senior member

Diamond Member

Diamond Member

Golden Member

Diamond Member

Diamond Member

Diamond Member