Question CPU Microarchitecture Thread


FlameTail

Diamond Member
Dec 15, 2021
4,238
2,594
106
So I decided to finally make this thread so that one can find answers to minor queries regarding CPU microarchitectures - the kind of questions Google can't provide a good answer to. That's not surprising, since this is a deep subject and there is a lot of inaccurate information out there.
 
Reactions: Vattila

gai

Junior Member
Nov 17, 2020
10
28
91
L1 cache is only accessed with VIVT tags - actually microtags, which are only partial, since a 64-bit address space isn't needed to access a few kilobytes of memory. Every L1 cache line also has PIPT tags, which are used when a cache line is loaded or checked for coherency - either in L2 when L2 is inclusive, or in an L1 tag directory. Only data access from the core to the L1 cache uses the virtual address; all other cache handling goes through the PIPT tags and is exactly the same as doing it with PIPT only.

IBM has documented their microtagged access with a directory-based full-physical-tag scheme extremely well for their Z-series CPUs; I recommend looking it up if you want to know more about it.
Yes, you have briefly described how IBM has designed additional hardware to prevent physical aliasing across different sets in the first-level data cache. This additional hardware is not free and would not be required in a physically-indexed cache.

Virtually-indexed caches must provide alias resolution hardware, while physically-indexed caches must obey certain size constraints. These are the trade-offs.
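To make the size constraint concrete, here is a minimal sketch in C (all numbers illustrative; the 96 KiB 6-way geometry is the Qualcomm figure discussed later in this thread). A virtually-indexed, physically-tagged cache stays alias-free only while each way is no larger than a page, because only then do the set-index bits fall entirely within the page offset, which translation never changes:

#include <stdio.h>

/* Sketch of the classic VIPT alias-free constraint: the "way size"
 * (cache size / associativity) must not exceed the page size, or the
 * set index spills into translated address bits. */
int main(void) {
    const unsigned page_size = 4096;   /* 4 KiB pages          */
    const unsigned line_size = 64;     /* bytes per cache line */
    struct { const char *name; unsigned size_kib, ways; } cfg[] = {
        { "32 KiB  8-way", 32,  8 },
        { "48 KiB 12-way", 48, 12 },
        { "96 KiB  6-way", 96,  6 },
    };
    for (int i = 0; i < 3; i++) {
        unsigned way_size = cfg[i].size_kib * 1024 / cfg[i].ways;
        unsigned sets = way_size / line_size;
        printf("%s: way = %5u B, %3u sets -> %s\n",
               cfg[i].name, way_size, sets,
               way_size <= page_size
                   ? "index fits in page offset, alias-free"
                   : "index uses translated bits, aliases possible");
    }
    return 0;
}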
 

DavidC1

Golden Member
Dec 29, 2023
1,211
1,932
96
But that's the point: you can't do that with good performance, because you have to have a cache-coherency protocol that actually works across all of it. So when you go and look at the latest large-scale ARM fabrics, they are significantly worse for data locality than what AMD is doing, let alone what the ARM client cores look like.
We're talking about potential here. Revenue- and marketshare-wise, the x86 vendors have more incentive to focus on servers, and ARM has less.
I'm not suggesting here that ARM is inherently superior, but the take some have - that ARM's performance applies only to mobile, laptops, and synthetic benchmarks, and that it is unsuited for "real world performance" in high-end gaming/workstation/server use - is laughable.
It's the same reason I gave Core 2 as an example. Core 2 was a very good client core. If Intel had never developed Nehalem's point-to-point interconnect and integrated memory controller, you could have made the same argument about it.

Yet when they did, it truly showed what the Core was capable of in servers.
 

OneEng2

Senior member
Sep 19, 2022
259
358
106
That's a terrible take. You're comparing a state-of-the-art x86 core against a core far worse than the state of the art in ARM land. Put 192 M4 cores together with the same inter-core communication and memory controller resources that the Zen 5c Epyc has, and you'd see a totally different story. Or, for a comparison where both sides are fighting with one hand tied behind their backs, put 192 Bulldozer cores up against that AmpereOne.

Your example has absolutely nothing to do with "ARM not being able to translate between synthetic benchmarks and real world performance" and everything to do with AmpereOne having a crappy core, and its marketers having cherry-picked a few benchmarks that show it off in the best light.

Yes, I know what I'm suggesting to compare with doesn't exist, but that isn't because Apple couldn't build that 192-core monster; it's because they have no interest in competing in the high-end server market (or, for that matter, in the low-end server market).

I'm not suggesting here that ARM is inherently superior, but the take some have - that ARM's performance applies only to mobile, laptops, and synthetic benchmarks, and that it is unsuited for "real world performance" in high-end gaming/workstation/server use - is laughable.
Fair, but where is this 192-core M4 ARM server processor?

What I THINK is that ARM in its current form (yes, even the mighty M4) is likely not a good fit for a server processor in a highly threaded environment that has to deal with a per-core annual licensing fee for the software it runs.

Fundamentally, modern CISC- and RISC-based processors are not all that different. The idea that ARM, with its RISC instruction set, will somehow blow away every x86 server chip that came before it is just BS. As a few others have alluded to, there is much more going on in such a processor than just CISC or RISC. AMD and Intel have decades of experience making successful processors for the DC. Everyone seems to think that there is something magical about ARM and that the magic will work everywhere.

ARM has risen to supremacy in phones and tablets, where ultra-low-power operation is key and multi-thread is an afterthought. I'm not saying that this is ALL ARM can do; I am saying that these are its roots... and as such, a multi-hundred-core processor is not its bread and butter. Can they get there? Sure, but the opposite is also true: AMD and Intel could change gears and target phone and tablet processors... but that isn't where their bread and butter is, and those are not their roots.

The care and feeding of a massively parallel DC processor is more complex than just a high-IPC single-core design. It seems like lots of people are very willing to believe that AMD and Intel are simply full of stupid engineers, and that the ARM companies are so smart that of course they can take a core design originally targeting a cell phone and surpass them with little to no effort, just by using the current best-in-class ARM core.

I think that is pretty optimistic to say the least.

As I have said, let's start with getting an ARM processor that has decent SMT and a SIMD engine that can do more than 128-bit paths (you know, where x86 was over a decade ago). Then we might have the very beginning of a decent DC chip... but even then, there are lots more design considerations - ones I am positive I know nothing about - that ARM would be forced to learn tough lessons about before reaching where AMD and Intel are and will be in the 2025/2026 time frame.

Still, ARM does lots of things right. It is good to have competition.
 

FlameTail

Diamond Member
Dec 15, 2021
4,238
2,594
106
Fundamentally, modern CISC- and RISC-based processors are not all that different. The idea that ARM, with its RISC instruction set, will somehow blow away every x86 server chip that came before it is just BS.
Everyone seems to think that there is something magical about ARM and that the magic will work everywhere.
It seems like lots of people are very willing to believe that AMD and Intel are simply full of stupid engineers, and that the ARM companies are so smart
Pretty sure most people here aren't naive enough to believe these things. Perhaps in other places they do.

The best x86 server CPU beats the best ARM server CPU right now. That is a fact.

The argument is that the best ARM server CPU doesn't use the latest and best ARM IP, whereas the best x86 server CPU at the moment does.

As a member of Arm’s highest performance core line, Neoverse V2 is a large core with width and reordering capacity on par with AMD’s Zen 4. However, it stops short of being a monster like Golden Cove, Oryon, or Apple’s Firestorm. As implemented in Graviton 4, Neoverse V2 runs at up to 2.8 GHz.

Of course, there isn't an ARM server processor with 192 M4 cores or 192 Oryon-L cores. But what if there were? Would it beat Turin? We don't know. As you said, interconnects become very important when scaling to such high core counts, and the exceptional performance of these ARM cores in client workloads may not show up in server workloads.

As I have said, let's start with getting an ARM processor that has decent SMT
Several people have posited that an ARM core wouldn't benefit from SMT as much as an x86 core would - not necessarily due to the ISA, but due to other microarchitectural design choices. Indeed, the fact that none of ARM's server cores (except the first-gen Neoverse E1) have SMT might be a testament to that.
 
Reactions: marees

DavidC1

Golden Member
Dec 29, 2023
1,211
1,932
96
Several people have posited that an ARM core wouldn't benefit from SMT as much as an x86 core would - not necessarily due to the ISA, but due to other microarchitectural design choices. Indeed, the fact that none of ARM's server cores (except the first-gen Neoverse E1) have SMT might be a testament to that.
ARM designs are wide, so yes, they can take advantage of SMT.

SMT creates problems beyond just the extra transistors/power/mm2. It increases the difficulty of validation. That means extra risk and/or extra time to market; conversely, that risk budget could be spent elsewhere. Recent security issues make this worse.
ARM has risen to supremacy in phones and tablets, where ultra-low-power operation is key and multi-thread is an afterthought.
How is it an afterthought? The Apple M4 does pretty damn well in Cinebench 2024, at a fraction of the power.
The care and feeding of a massively parallel DC processor is more complex than just a high-IPC single-core design. It seems like lots of people are very willing to believe that AMD and Intel are simply full of stupid engineers, and that the ARM companies are so smart that of course they can take a core design originally targeting a cell phone and surpass them with little to no effort, just by using the current best-in-class ARM core.
Do you know what the excuse was before this? "Oh, ARM chips can never reach x86-level IPC in any way, shape, or form." Well, they did. And they're winning in absolute single-thread performance against desktop processors that use 3-4x the power, sometimes killing themselves to reach those clocks.

AMD/Intel don't have "stupid engineers". They have faltered and fallen flat on their faces numerous times, which set them back a few years. It's relatively easy to catch up against that.

Smartphone popularity caught the x86 vendors with their pants down, and all the capable people and management went there. That's a big reason why they are doing better.

PCs are only nice-to-have now, while smartphones are viewed as necessary. Apps like Instagram and Telegram can't be used without a phone, and desktop equivalents of some of them are almost nonexistent.
As I have said, let's start with getting an ARM processor that has decent SMT and a SIMD engine that can do more than 128-bit paths (you know, where x86 was over a decade ago).
Have you thought maybe they shouldn't have done 512-bit vectors? AVX512 should have been AVX3-256.

ARM just added more 128-bit units, which is superior when you can do so, because it benefits all FP code, even legacy code, while wider vectors need a recompile.
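To illustrate the recompile point, here is a minimal sketch in C (plain dot products; function names are hypothetical and loop tails are omitted for brevity). The 128-bit SSE version is the kind of code legacy binaries are already full of, and a core that adds more 128-bit FP units speeds it up as-is; the 512-bit version only helps after a rewrite/recompile, and only on hardware that has the instructions:

#include <immintrin.h>

/* Legacy-style 128-bit SIMD: a wider back end (more 128-bit units)
 * runs this faster without touching the binary. */
float dot_sse(const float *a, const float *b, int n) {
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a + i),
                                         _mm_loadu_ps(b + i)));
    float t[4];
    _mm_storeu_ps(t, acc);
    return t[0] + t[1] + t[2] + t[3];
}

#ifdef __AVX512F__
/* 512-bit version: needs new code and new hardware, and does nothing
 * for the installed base of binaries that never issue these ops. */
float dot_avx512(const float *a, const float *b, int n) {
    __m512 acc = _mm512_setzero_ps();
    for (int i = 0; i + 16 <= n; i += 16)
        acc = _mm512_add_ps(acc, _mm512_mul_ps(_mm512_loadu_ps(a + i),
                                               _mm512_loadu_ps(b + i)));
    return _mm512_reduce_add_ps(acc);
}
#endif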
 
Reactions: Viknet

moinmoin

Diamond Member
Jun 1, 2017
5,145
8,226
136
In a way, microarchitecture is meaningless for today's DC. What really matters there is the uncore, and scaling that up to many cores. Currently it's AMD that does this best, offering scaling of cores whose capabilities are plenty competitive in the client market as well. At the moment no other company can offer that on the free market. Intel's Sierra Forest with 288 cores/threads may well be the closest, but it offers less capability.

Apple would likely offer both higher performance and higher efficiency. But alas, their designs neither scale up to 192 cores/384 threads, nor are their chips available on the free market. It will be interesting to see whether Nuvia, as part of Qualcomm, ever gets to enter that market.
 

soresu

Diamond Member
Dec 19, 2014
3,323
2,599
136
It will be interesting to see whether Nuvia, as part of Qualcomm, ever gets to enter that market.
Well, QC did dip their toes in with Falkor - but like several others at the time, they bought into a market that wasn't yet ready for ARM servers.

I think they have a better chance now, though the relationships Arm Ltd has with various ARM vendors might make it a hard sell, given that the X925 is already competitive - in scalar IPC at least, and more than that in SIMD.

I'm still not sure what V3 is based on, but either way I don't think that QC will be able to just bull their way into that market.
 

MS_AT

Senior member
Jul 15, 2024
365
798
96
Have you thought maybe they shouldn't have done 512-bit vectors? AVX512 should have been AVX3-256.

ARM just added more 128-bit units, which is superior when you can do so, because it benefits all FP code, even legacy code, while wider vectors need a recompile.
AVX-512 is to AVX2 what SVE is to NEON: you would need to recompile anyway. What was stupid was the artificial market segmentation whereby client Skylake had AVX-512 disabled. It could have handled it the way Zen 1 handled AVX2, or the way Zen 4 handled AVX-512. Another not-so-wise decision was making the 512-bit width mandatory and 256/128-bit optional, which, as we can now see, lacked foresight.
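For context on why SVE, like AVX-512, needs a recompile relative to NEON: it is a separate, vector-length-agnostic ISA extension, so the same source has to be rebuilt against it. A minimal sketch in C with ARM SVE intrinsics (the function name is hypothetical; requires an SVE-capable compiler and CPU):

#include <arm_sve.h>

/* Vector-length-agnostic dot product: the same binary adapts at run
 * time to whatever SVE vector width (128..2048 bits) the CPU
 * implements, but it still had to be (re)compiled for SVE in the
 * first place - existing NEON binaries gain nothing from it. */
float dot_sve(const float *a, const float *b, int64_t n) {
    svfloat32_t acc = svdup_f32(0.0f);
    for (int64_t i = 0; i < n; i += svcntw()) {   /* floats per vector  */
        svbool_t pg = svwhilelt_b32(i, n);        /* predicate for tail */
        acc = svmla_f32_m(pg, acc, svld1_f32(pg, a + i),
                                   svld1_f32(pg, b + i));
    }
    return svaddv_f32(svptrue_b32(), acc);        /* horizontal sum     */
}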
 

MS_AT

Senior member
Jul 15, 2024
365
798
96
Intel probably did not have plans to make heterogeneous CPU configurations when AVX512 was first proposed.
I agree, but at the same time they wanted to keep the Skylake die as small as possible to maximize profit. If they had made 256-bit the default width, they could have provided the new instructions and the bigger architectural register pool on client too, and that would have widened adoption. But it's all ifs and maybes now.
 

naukkis

Senior member
Jun 5, 2002
962
829
136
Yes, you have briefly described how IBM has designed additional hardware to prevent physical aliasing across different sets in the first-level data cache. This additional hardware is not free and would not be required in a physically-indexed cache.

Virtually-indexed caches must provide alias resolution hardware, while physically-indexed caches must obey certain size constraints. These are the trade-offs.

There's no hardware other than physical tagging needed to solve the aliasing and homonym problems. Every memory-coherent multi-core system requires that every cache line in the system has physical tags. AMD has also documented how they do their L1 virtual tagging: L2 is inclusive of L1, every L1 cache line has physical tags in L2, and L2 handles L1 fills and evictions. Doing a TLB translation for every load means that for every 64-bit cache load you have to read and compare thousands of bits of metadata (from the TLB, which is a cache for the CPU's MMU) - none of which is needed when L1 is accessed from the core side with the linear address. IBM also described that doing physical tag checks and TLB translations for many loads per cycle makes those TLB accesses the hottest part of the CPU, limiting clock potential. For comparison, the 8-bit microtags AMD uses for L1 only require comparing 64 bits of L1 tags before an L1 load.
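A rough back-of-the-envelope tally of the two schemes, as a sketch in C - all widths here are assumed, illustrative parameters, not vendor specifications:

#include <stdio.h>

/* Rough tally of tag bits examined per load under the two schemes.
 * Every number below is an illustrative assumption. */
int main(void) {
    /* Microtag path: compare a small hashed virtual tag per way. */
    unsigned ways = 8, utag_bits = 8;
    printf("microtag compare: %u bits\n", ways * utag_bits);     /* 64 */

    /* Full PIPT path: first a TLB lookup (compare the virtual page
     * number against every entry of a fully associative L1 TLB),
     * then a full physical tag compare per way. */
    unsigned tlb_entries = 64;
    unsigned vpn_bits  = 52 - 12;   /* assumed 52-bit VA, 4 KiB pages */
    unsigned ptag_bits = 52 - 12;   /* assumed 52-bit PA              */
    unsigned tlb = tlb_entries * vpn_bits;   /* 2560 bits */
    unsigned tag = ways * ptag_bits;         /*  320 bits */
    printf("PIPT compare: %u (TLB) + %u (tags) = %u bits\n",
           tlb, tag, tlb + tag);
    return 0;
}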
 

gai

Junior Member
Nov 17, 2020
10
28
91
There's no hardware other than physical tagging needed to solve the aliasing and homonym problems. Every memory-coherent multi-core system requires that every cache line in the system has physical tags. AMD has also documented how they do their L1 virtual tagging: L2 is inclusive of L1, every L1 cache line has physical tags in L2, and L2 handles L1 fills and evictions. Doing a TLB translation for every load means that for every 64-bit cache load you have to read and compare thousands of bits of metadata (from the TLB, which is a cache for the CPU's MMU) - none of which is needed when L1 is accessed from the core side with the linear address. IBM also described that doing physical tag checks and TLB translations for many loads per cycle makes those TLB accesses the hottest part of the CPU, limiting clock potential. For comparison, the 8-bit microtags AMD uses for L1 only require comparing 64 bits of L1 tags before an L1 load.
Yes, after excluding the hardware and design choices that solve the aliasing problem in any particular processor, there's no additional hardware requirement. This is a tautology.

The cache systems that you are describing maintain physical indices to guarantee that synonyms cannot allocate to more than a single cache set. This is why Intel and AMD have 48 KiB, 12-way L1 data caches. In IBM's case, the L1 data cache is 32 KiB, 8-way. If these designs had larger caches, or if the associativities were lower, then two or more cache sets could contain synonyms. This would add complexity to the hardware that enforces the non-existence of synonyms when filling a new line into the L1 cache. Given this hardware, the L1 cache size can grow beyond the product of page size and way count. Perhaps Qualcomm has added such hardware to their latest processors.

Each memory operation that executes in the processor must check permissions in order to provide precise exceptions. Accordingly, it's not possible to fully escape the TLB. However, the TLB output is not always required to begin cache access. This is the critical advantage of virtual indexing, including cases where the virtual and physical indices are guaranteed to be identical, as in the caches discussed above.

The use of a smaller tag for fast, "good enough" hit/miss analysis does not mean that a full tag match is not required. A reduced-size tag can correctly detect the vast majority of cache misses, but it cannot prove that a cache hit has definitely occurred. This permits the full tag comparison to be delayed without significant harm to the performance of the processor, because it's rare for cache misses to be detected with the full tag.
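A small sketch of that filtering property in C (all structure names and widths are hypothetical): a microtag mismatch proves a miss in that way, while a microtag match is only a probable hit that the delayed full-tag compare must confirm.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-way tag state: a small hashed virtual microtag
 * checked up front, and the full physical tag checked later. */
struct way_tags { uint8_t utag; uint64_t full_ptag; bool valid; };

/* Fast path: a mismatch here is a definite miss for this way. A
 * match is only "probably hit" - two different lines can hash to
 * the same 8-bit microtag. */
static bool utag_maybe_hit(const struct way_tags *w, uint8_t lookup_utag) {
    return w->valid && w->utag == lookup_utag;
}

/* Slow path, off the load's critical latency: confirm with the full
 * physical tag (available once the TLB has translated the address).
 * Only this compare can prove the hit actually occurred. */
static bool full_tag_hit(const struct way_tags *w, uint64_t ptag) {
    return w->valid && w->full_ptag == ptag;
}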
 
Reactions: Nothingness

naukkis

Senior member
Jun 5, 2002
962
829
136
Each memory operation that executes in the processor must check permissions in order to provide precise exceptions. Accordingly, it's not possible to fully escape the TLB. However, the TLB output is not always required to begin cache access. This is the critical advantage of virtual indexing, including cases where the virtual and physical indices are guaranteed to be identical, as in the caches discussed above.

The use of a smaller tag for fast, "good enough" hit/miss analysis does not mean that a full tag match is not required. A reduced-size tag can correctly detect the vast majority of cache misses, but it cannot prove that a cache hit has definitely occurred. This permits the full tag comparison to be delayed without significant harm to the performance of the processor, because it's rare for cache misses to be detected with the full tag.

Permissions and all other TLB data can also be added to the cache tags, so they can be checked at the same time as the tag lookup. Delaying that check until after the cache load would just result in Meltdown. Of course the full tag needs to be checked, because the partial tags will have aliasing within their own linear address space. But there's no need to translate the address to physical; they just need to check that the full linear address is valid. Physical tag checking is needed for memory coherency and when loading a cache line; after the cache line is loaded, the CPU core can use it with the linear address only - address translation is not needed.

For example, Intel Crestmont seems to have a virtually tagged per-core L2 cache too: core-to-core loads go through L3, because physical tags are only checked at that level. For Skymont they added linear-address coherency logic to the L2, so that threads sharing the same linear address can share data at the L2 level. That required logic whereby the L2 keeps track of cache line state and will pull dirty cache lines from another core's L1 when that other core hits them in L2.
 

naukkis

Senior member
Jun 5, 2002
962
829
136
Yes, you have briefly described how IBM has designed additional hardware to prevent physical aliasing across different sets in the first-level data cache. This additional hardware is not free and would not be required in a physically-indexed cache.

Virtually-indexed caches must provide alias resolution hardware, while physically-indexed caches must obey certain size constraints. These are the trade-offs.

What they do is double tagging. All cache lines are physically tagged, which resolves all aliasing problems, since cache lines are loaded by physical address. That physical tagging is also needed for the coherency mechanism, which cannot work efficiently without it. But the CPU core doesn't need to reach data that is already in its cache through address translation; it could address it with a linear-address tag, if cache lines have one. There was a time, somewhere in the 90s, when doing address translation - work that costs power - was a better option than keeping double tags for every cache line, but that isn't true anymore. Silicon area is pretty much free, but power isn't. Silicon is especially free for an L1 tag directory, since the L1 cache itself needs to be in the CPU's critical path but the directory doesn't. An L1 hit has a 3-cycle latency, but the pipeline is about 20 cycles long, plus the store queue, so there's plenty of time to do the full tag check after the L1 load.
 

gai

Junior Member
Nov 17, 2020
10
28
91
The explanations above appear sound to me. Duplicating information in a CPU to reduce timing criticality is very frequently worth the area cost. However, none of these techniques provide any explanation for the originating question on page 1: why is Qualcomm's 96 KiB, 6-way L1D cache so much larger than the competition? The fact remains that IBM (32 KiB, 8-way), Intel (48 KiB, 12-way), and AMD (48 KiB, 12-way) are not paying the tax to handle aliases across multiple cache sets in their first-level data caches.

Looking up the duplicate tags to prevent aliases, regardless of when the lookup is performed, has a power cost. That cost scales with the number of tags that have to be checked. All of the other techniques that you have mentioned are important parts of the state of the art of microprocessor design, but they do not clearly answer the original question.

It may simply be the case that the reduced associativity of the cache compensates for the additional coherency management logic. Coherency traffic has to check 4 cache sets * 6 ways = 24 tags per cache line, but normal cache access for loads and stores is limited to the 6 ways in a single cache set. The density of the cache itself could also be a relevant factor.
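For the arithmetic behind that 24-tag figure, a small sketch in C, using the geometries as quoted in this post (assuming 4 KiB pages): the number of sets a physical line could live in is the way size divided by the page size, and a coherency probe must check every way of every candidate set.

#include <stdio.h>

/* Coherency-probe cost per cache line: a physical line can reside in
 * way_size / page_size different sets, so a probe checks that many
 * sets times the associativity. */
int main(void) {
    const unsigned page = 4096;
    struct { const char *name; unsigned size_kib, ways; } c[] = {
        { "IBM    32 KiB  8-way", 32,  8 },
        { "Intel  48 KiB 12-way", 48, 12 },
        { "AMD    48 KiB 12-way", 48, 12 },
        { "QC     96 KiB  6-way", 96,  6 },
    };
    for (int i = 0; i < 4; i++) {
        unsigned way_size = c[i].size_kib * 1024 / c[i].ways;
        unsigned cand = way_size / page;     /* candidate sets */
        if (cand == 0) cand = 1;
        printf("%s: %u candidate set(s) x %2u ways = %2u tags/probe\n",
               c[i].name, cand, c[i].ways, cand * c[i].ways);
    }
    return 0;
}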
 
Reactions: FlameTail

naukkis

Senior member
Jun 5, 2002
962
829
136
The explanations above appear sound to me. Duplicating information in a CPU to reduce timing criticality is very frequently worth the area cost. However, none of these techniques provide any explanation for the originating question on page 1: why is Qualcomm's 96 KiB, 6-way L1D cache so much larger than the competition? The fact remains that IBM (32 KiB, 8-way), Intel (48 KiB, 12-way), and AMD (48 KiB, 12-way) are not paying the tax to handle aliases across multiple cache sets in their first-level data caches.

Looking up the duplicate tags to prevent aliases, regardless of when the lookup is performed, has a power cost. That cost scales with the number of tags that have to be checked. All of the other techniques that you have mentioned are important parts of the state of the art of microprocessor design, but they do not clearly answer the original question.

It may simply be the case that the reduced associativity of the cache compensates for the additional coherency management logic. Coherency traffic has to check 4 cache sets * 6 ways = 24 tags per cache line, but normal cache access for loads and stores is limited to the 6 ways in a single cache set. The density of the cache itself could also be a relevant factor.

Aliases can only occur in a cache when a cache line is loaded without simultaneously checking all other cache lines for whether the data is already there. Aliasing won't happen in any multi-CPU memory-coherent system - at all. Memory-access aliasing can happen when cache access is done with only a partial address, or with a hash of the address.

AMD's and Intel's cache systems aren't independent L1, L2, and L3 caches: L2 is inclusive of L1, and L3 gets L2 victims. So dirty-line eviction policies are only needed on L3 eviction - L1 and L2 don't need to care about it. But a cache can't hold its full capacity of arbitrary data, because it is x-way indexed: an 8-way cache, no matter its size, can only hold 8 cache lines that alias to the same index. So to make that kind of scheme work, such that an L1 eviction can always be cached at the upper levels, they have to index all cache levels with the same address space - physical - which limits the L1 way size to 4 KB, the part of the address that is identical for virtual and physical addresses.

But that cache scheme is a design choice, not a technical limit. L1 cache ways could easily be made larger than the page size, as Qualcomm, Apple, and IBM demonstrate (the IBM Z-series L1D is 128 KB 8-way, and 4 KB pages are supported). The x86 vendors just seem to be sticking with the one scheme they have always used, and changing to another design seems to be a hard uphill climb. But they need to, because the cache solution they use now isn't power efficient: all data movement costs power, and keeping more data in L1 instead of shuffling it back and forth between cache levels would be more power efficient. Having a TSO memory-coherency policy might be part of why x86 sticks with that kind of scheme: stores that alias into the same L1 ways are a big problem, as there's always the possibility that, once the L1 ways run out, the whole CPU could be halted. Increasing L1 capacity doesn't solve the problem; more ways help, as does the ability to swap way-aliased cache lines with upper cache levels. TSO doesn't do anything useful after all, so it's just a handicap that x86 seems unable to get rid of.
 
Reactions: FlameTail

gai

Junior Member
Nov 17, 2020
10
28
91
But that cache scheme is a design choice, not a technical limit. L1 cache ways could easily be made larger than the page size, as Qualcomm, Apple, and IBM demonstrate (the IBM Z-series L1D is 128 KB 8-way, and 4 KB pages are supported). The x86 vendors just seem to be sticking with the one scheme they have always used, and changing to another design seems to be a hard uphill climb. But they need to, because the cache solution they use now isn't power efficient: all data movement costs power, and keeping more data in L1 instead of shuffling it back and forth between cache levels would be more power efficient. Having a TSO memory-coherency policy might be part of why x86 sticks with that kind of scheme: stores that alias into the same L1 ways are a big problem, as there's always the possibility that, once the L1 ways run out, the whole CPU could be halted. Increasing L1 capacity doesn't solve the problem; more ways help, as does the ability to swap way-aliased cache lines with upper cache levels. TSO doesn't do anything useful after all, so it's just a handicap that x86 seems unable to get rid of.
I had previously searched for only the cache sizes in IBM z17 (reportedly 32K L1I and 32K L1D), but I didn't look closely enough: the snippets from Hot Chips are for the DPU, a separate part of the system architecture from the main CPU cores. I apologize for the error. Indeed, IBM zSeries has been paying the power cost of the extra tag lookups that enable a larger L1 cache for several generations.

Apple is a less compelling counterexample, due to the minimum 16 KiB page size, which sidesteps the issue - at least for native code.
 

DavidC1

Golden Member
Dec 29, 2023
1,211
1,932
96
I agree, but at the same time they wanted to keep the Skylake die as small as possible to maximize profit. If they had made 256-bit the default width, they could have provided the new instructions and the bigger architectural register pool on client too, and that would have widened adoption. But it's all ifs and maybes now.
It has little to do with that. 256-bit should have been the maximum. In Skylake they skipped 512-bit because it needed too many compromises - clocks, area. Why bother with it when the vast majority are better off with a GPU? More and more so.

AVX-512 has lots of useful instructions, but the 512-bit part was unnecessary. Intel at that time thought they had leadership and turned their aim toward suppressing Nvidia: by increasing vector width, they thought they could stave off Nvidia's advances. Instead, they should have focused on a dedicated GPU, but since up until very recently they viewed GPUs the way they viewed HD Audio - something fading into irrelevance - they didn't know how.

I think even 256-bit was questionable. Stay at 128-bit and add more units - that benefits everyone, without recompiling.
Intel probably did not have plans to make heterogeneous CPU configurations when AVX512 was first proposed.
Because the two x86 vendors are broken. They can't go a decade without fumbling to the point where people start wondering whether the company can continue to exist.
 
Reactions: marees