Question CPU Microarchitecture Thread

FlameTail

Diamond Member
Dec 15, 2021
4,238
2,593
106
So I decided to finally make this thread so that one can find answers to minor queries regarding CPU microarchitectures: the kind of questions Google can't provide a good answer to, which is not surprising, since this is a deep subject and there is a lot of inaccurate information out there.
 
Reactions: Vattila

FlameTail

Diamond Member
Dec 15, 2021
4,238
2,593
106
(1) Is AVX-512 exclusive to x86 cores? AMD and Intel are certainly implementing it on their latest CPU cores. However, I haven't heard of an ARM core with AVX-512. Is this possible?
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,389
15,513
136
(1) Is AVX-512 exclusive to x86 cores? AMD and Intel are certainly implementing it on their latest CPU cores. However, I haven't heard of an ARM core with AVX-512. Is this possible?
It's not in Raptor Lake, so what Intel core are you referring to? Server parts?
 
Last edited:

FlameTail

Diamond Member
Dec 15, 2021
4,238
2,593
106
It's not in Raptor Lake, so what Intel core are you referring to? Server parts?

Yes, their server parts do have it. And the upcoming Meteor Lake is rumoured to have it too. But didn't Alder Lake already have AVX-512 in the Golden Cove cores? Intel disabled it via a BIOS update.
 

FlameTail

Diamond Member
Dec 15, 2021
4,238
2,593
106
(2) I have come across various people who have posited that Apple's P-cores do not have SMT because they have a very large ROB. I'll try to post links to the OPs who said so if I can find them, but the idea is that the need for multithreading is eliminated since the core is very good at out-of-order execution. Is this correct?

PS: I am not very knowledgeable with regard to the low-level details of CPU microarchitecture. I do have a hazy understanding of how decoders, ALUs, ROBs and all the other stuff inside a core work, though. So a simplified explanation would be much appreciated.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,104
136
(1) Is AVX-512 exclusive to x86 cores? AMD and Intel are certainly implementing it on their latest CPU cores. However, I haven't heard of an ARM core with AVX-512. Is this possible?
AVX is an extension to the x86 ISA, so it's not possible for ARM to support it by definition. However, the operations themselves can (and in some cases, do) have parallels in ARM via NEON and SVE.
But didn't Alder Lake already have AVX-512 in the Golden Cove cores? Intel disabled it via a BIOS update.
Yes, in terms of hardware, Golden Cove and Raptor Cove have AVX-512 support. It's disabled by what's ultimately a firmware lock.
(2) I have come across various people who have posited that Apple's P-cores do not have SMT because they have a very large ROB. I'll try to post links to the OPs who said so if I can find them, but the idea is that the need for multithreading is eliminated since the core is very good at out-of-order execution. Is this correct?
No, the argument is kind of a non sequitur. Apple has simply chosen to focus its efforts on other areas (single-thread performance, power, area) instead of throughput.
 

Doug S

Platinum Member
Feb 8, 2020
2,888
4,912
136
(2) I have come across various people who have posited that Apple's P-cores do not have SMT because they have a very large ROB. I'll try to post links to the OPs who said so if I can find them, but the idea is that the need for multithreading is eliminated since the core is very good at out-of-order execution. Is this correct?


Unlike the PC/server focus of Intel & AMD, Apple's primary market for their core designs is the iPhone, where SMT would be a negative.
 
Reactions: lightmanek

FlameTail

Diamond Member
Dec 15, 2021
4,238
2,593
106
Question:

Other microarchitectural differences aside, which cache setup would be best for gaming (theoretically)?

(1) Small, low-latency, private L2 + Very Large, high-latency, shared L3 [eg: Intel, AMD]

or

(2) Large, low-latency, shared L2 + No L3 [eg: Apple, Qualcomm]
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,978
3,656
136
Question:

Other microarchitectural differences aside, which cache setup would be best for gaming (theoretically)?

(1) Small, low-latency, private L2 + Very Large, high-latency, shared L3 [eg: Intel, AMD]

or

(2) Large, low-latency, shared L2 + No L3 [eg: Apple, Qualcomm]
Depends on sizes. The actual answer in my mind is neither; it's IBM's dynamic L2/L3, which gets the best of both worlds.
 

naukkis

Senior member
Jun 5, 2002
962
829
136
It's not about the cache itself. Large caches decrease the need for memory bandwidth; small-cache solutions are only possible when there's a robust memory subsystem with enough bandwidth. Caches are there to supplement the memory system.
 
Reactions: marees

GTracing

Member
Aug 6, 2021
168
396
106
It's not about the cache itself. Large caches decrease the need for memory bandwidth; small-cache solutions are only possible when there's a robust memory subsystem with enough bandwidth. Caches are there to supplement the memory system.
CPU Cache increases performance primarily by decreasing latency, not by making up for inadequate bandwidth.
 
Reactions: lightmanek

naukkis

Senior member
Jun 5, 2002
962
829
136
CPU Cache increases performance primarily by decreasing latency, not by making up for inadequate bandwidth.

Memory bandwidth is average memory access latency inverted. Nowadays CPUs have around a hundred simultaneous memory accesses in flight; the memory access latency itself isn't so important, but sustained throughput for the current load is. The larger the memory access latencies, the lower the sustained bandwidth. Halving a CPU's last-level cache will increase bandwidth demand by as much as the cache hit rate decreases, usually something like 10-50%. And yes, today's many-core CPUs are bandwidth-starved in MT loads. A higher-latency memory subsystem is well tolerated if the workload's memory access pattern still gets good effective bandwidth. x86 CPUs' memory controllers for LPDDR just aren't effective; conversely, I don't believe Apple loses anything from LPDDR's longer access latencies, as their memory subsystem can sustain high bandwidth even with hard-to-predict access patterns.
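The bandwidth/latency relationship being described here is essentially Little's law: sustained bandwidth equals outstanding misses times line size divided by miss latency. A rough Python sketch, with invented numbers purely for illustration:

```python
# Toy model (not from the post): Little's law for memory systems.
#   sustained bandwidth = outstanding_requests * line_size / latency

def sustained_bandwidth_gbs(outstanding: int, line_bytes: int, latency_ns: float) -> float:
    """Sustained bandwidth in GB/s for a given miss latency and memory-level parallelism."""
    return outstanding * line_bytes / latency_ns  # bytes per ns == GB/s

# Hypothetical numbers: 100 outstanding 64-byte misses at 100 ns latency.
print(sustained_bandwidth_gbs(100, 64, 100.0))  # 64.0 GB/s
# Double the latency and sustained bandwidth halves, as the post argues:
print(sustained_bandwidth_gbs(100, 64, 200.0))  # 32.0 GB/s
```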
 

GTracing

Member
Aug 6, 2021
168
396
106
Question:

Other microarchitectural differences aside, which cache setup would be best for gaming (theoretically)?

(1) Small, low-latency, private L2 + Very Large, high-latency, shared L3 [eg: Intel, AMD]

or

(2) Large, low-latency, shared L2 + No L3 [eg: Apple, Qualcomm]
I'll take a stab at answering this question.

Firstly, Apple's shared L2 cache doesn't come with no strings attached. Compare AnandTech's M1 and M1 Max latency charts: https://www.anandtech.com/show/17024/apple-m1-max-performance-review/2 It seems that each core can only write to a certain amount of L2. I would bet that the cores are organized in 4-core clusters, and that when a core needs data from L2 in another cluster there's a latency penalty.

AMD and Intel have a bigger shared cache, and it's split between more cores. Both of these increase latency, so they add a private L2 to bridge the L1 and the shared L3. (And Intel actually added an "L1.5" in Arrow Lake because the L2 is so large.)

So, to answer the question: gaming is a cache-sensitive workload, and more cache is better. I would say option 2, but that's mostly because of the large caches that come with that setup, not because of the cache hierarchy itself.

A few extra notes:
  • Apple's L1 is somehow both larger and lower latency. Intel and AMD might choose different L2 and L3 cache sizes if they could make an L1 as good as Apple's.
  • AMD's L3 is a victim cache (the only way data gets into the L3 is when it's evicted from the L2). This makes the private L2 more important (necessary?) on AMD.
  • Intel's Skymont has an L2 shared between 4 cores, and it's rumoured that a Skymont successor will take over as Intel's unified CPU architecture in a few years. So Intel's future CPUs might look a lot more Apple-like.
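The trade-off between the two hierarchies can be put in rough numbers with the standard average memory access time (AMAT) formula. The latencies and hit rates below are invented for illustration, not measurements of any real chip:

```python
# Standard AMAT (average memory access time) recurrence, with made-up cycle
# counts and miss rates purely for illustration:
#   AMAT = L1_hit + L1_miss% * (L2_hit + L2_miss% * (... + DRAM))

def amat(levels, dram_cycles):
    """levels: list of (hit_latency_cycles, miss_rate) from L1 outward."""
    t = dram_cycles
    for hit_cycles, miss_rate in reversed(levels):
        t = hit_cycles + miss_rate * t
    return t

# Option 1: small fast private L2 + large slow shared L3 (hypothetical numbers).
opt1 = amat([(4, 0.10), (14, 0.40), (50, 0.20)], dram_cycles=300)
# Option 2: large low-latency shared L2, no L3 (hypothetical numbers).
opt2 = amat([(4, 0.10), (18, 0.15)], dram_cycles=300)
print(f"{opt1:.1f} vs {opt2:.1f} cycles")
```

With these particular numbers the two options land close together, which is why the actual cache sizes and hit rates matter more than the number of levels.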
 
Reactions: FlameTail

FlameTail

Diamond Member
Dec 15, 2021
4,238
2,593
106
It's not lower latency; Apple CPUs just run at lower clock speeds.
Usually, latency increases as cache capacity becomes larger. So it is impressive that Apple (and Qualcomm) have been able to implement much larger caches than Intel/AMD while keeping similar latency.

The main reason it can be larger is that Apple uses a 16kB page size.
Oryon has a 192 KB L1i, and Lion Cove has a 192 KB L1d, rivalling the size of Apple's. They both use 4 KB page sizes IIRC, so I don't think your statement is true.
 
Last edited:
Reactions: Gideon

GTracing

Member
Aug 6, 2021
168
396
106
It's not lower latency; Apple CPUs just run at lower clock speeds.
No, even accounting for the lower clock speeds Apple has lower latency.

The Apple M4 P-core has an L1 latency of 0.68 ns. The 285K and the 9950X each have a max clock speed of 5.7 GHz, which gives a latency of 0.70 ns. And the x86 latency is higher on lower-end chips, when more cores are loaded, or in mobile chips.
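One plausible reading of the arithmetic behind these figures, assuming (hypothetically) a 4-cycle L1 hit at 5.7 GHz and a 3-cycle hit at roughly 4.4 GHz; the cycle counts and the M4 clock are assumptions, not vendor specs:

```python
# Converting cache latency from cycles to nanoseconds: ns = cycles / GHz.
# The cycle counts and clocks below are assumptions for illustration.

def latency_ns(cycles: int, ghz: float) -> float:
    return cycles / ghz

print(f"{latency_ns(4, 5.7):.2f} ns")  # 4 cycles at 5.7 GHz -> 0.70 ns
print(f"{latency_ns(3, 4.4):.2f} ns")  # 3 cycles at ~4.4 GHz -> 0.68 ns
```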
 

naukkis

Senior member
Jun 5, 2002
962
829
136
I've heard this, but I'm not convinced. Qualcomm also has large L1 caches but reportedly has the same 4 kB page size as Intel and AMD. How do they do it?

That page-size restriction on L1 way size applied when L1 accesses used VIPT tagging. Newer cores access their L1 with purely virtual address tags, so L1 way sizes aren't limited by the page size.
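For reference, the classic VIPT constraint is that the cache index bits must fit within the page offset, so each way can be at most one page and the maximum L1 size is associativity × page size. The arithmetic, sketched in Python:

```python
# VIPT constraint: index bits must come from the page offset, so each cache
# way can be at most one page in size:
#   max_L1_size = associativity * page_size

def max_vipt_l1_kib(ways: int, page_kib: int) -> int:
    return ways * page_kib

print(max_vipt_l1_kib(12, 4))   # 48 KiB  (e.g. a 12-way L1 with 4 KiB pages)
print(max_vipt_l1_kib(12, 16))  # 192 KiB (a 12-way L1 with 16 KiB pages)
print(192 // 4)                 # ways needed for 192 KiB with 4 KiB pages: 48
```

This is why 16 KiB pages make a 192 KiB VIPT L1 much easier to build than 4 KiB pages do.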
 
Reactions: GTracing

Gideon

Golden Member
Nov 27, 2007
1,842
4,379
136
Question:

Other microarchitectural differences aside, which cache setup would be best for gaming (theoretically)?

(1) Small, low-latency, private L2 + Very Large, high-latency, shared L3 [eg: Intel, AMD]

or

(2) Large, low-latency, shared L2 + No L3 [eg: Apple, Qualcomm]
I've thought for some time that ... both

A large low-latency shared L2, if done right, should improve performance in many consumer workloads, but if you can use 3D stacking, an (even larger) L3 would IMO make more sense than just bloating up the L2 caches.
 

OneEng2

Senior member
Sep 19, 2022
259
358
106
Yes. ARM uses its own extensions, such as NEON, SVE and SVE2, to do similar things.
RISC has simpler instructions than CISC, and in fact the first portion of the decode stage in a CISC design produces RISC-like micro-ops (fixed-length instructions and operands, so they can be pipelined easily).

SIMD (Single Instruction, Multiple Data) instructions like SSE, SSE2, AVX, etc. are a good idea for either processor design, since they allow fewer cycles to accomplish the same work as the bunch of series-parallel instruction chains they replace.
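The "fewer cycles for the same work" point above can be sketched with a toy model; this is plain Python standing in for real vector hardware, and the 8-lane width is illustrative only:

```python
# Toy illustration of the SIMD idea: one instruction operates on many lanes.
# This models an 8-wide vector add in plain Python; it is not real AVX/NEON code.

LANES = 8

def scalar_add(a, b):
    out, ops = [], 0
    for x, y in zip(a, b):
        out.append(x + y)  # one add instruction per element
        ops += 1
    return out, ops

def simd_add(a, b):
    out, ops = [], 0
    for i in range(0, len(a), LANES):
        # one vector instruction handles LANES elements at once
        out.extend(x + y for x, y in zip(a[i:i+LANES], b[i:i+LANES]))
        ops += 1
    return out, ops

a, b = list(range(32)), list(range(32))
print(scalar_add(a, b)[1], simd_add(a, b)[1])  # 32 vs 4 "instructions"
```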
(2) I have come across various people who have posited that Apple's P cores do not have SMT since they have a very large ROB. I'll try post links to the OPs who said so -I can find them, but the idea is that the need for multithreading is eliminated since the core is very good at Out-of-Order execution. Is this correct?

PS: I am not very knowledgeable with regard to the low-level details of CPU microarchitecture. I do have a hazy understanding of how decoders, ALUs, ROBs and all the other stuff inside a core work, though. So a simplified explanation would be much appreciated.
For MT, SMT provides a very large return on investment with respect to performance per area and performance per core. This is true because of the instruction-level parallelism built into a core, called "superscalar" design, which allows many calculations to go on in parallel in a single clock cycle (or over some number of clock cycles).

In order to achieve the maximum ILP in a core design, you need more execution units than necessary under MOST loads in order not to get bogged down on SOME loads. As a result, quite a few of those extra execution units are just sitting around most of the time, twiddling their thumbs... now enter SMT.

SMT allows a core to utilize those otherwise idle resources to work on a different thread of code, essentially turning a single core into 2 (or more, in some designs) logical cores.

I don't believe it is possible to compete in DC processors without SMT. It is also inefficient to chase high MT performance in desktops and laptops without it, since adding more cores duplicates MUCH more of the core's transistors than SMT requires.

The cost (IMO) is in the complexity of the schedulers and the validation of the design.
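The idle-resources argument above can be sketched as a toy issue-slot model. The machine width and per-thread IPC below are invented for illustration, and real SMT gains are usually far smaller, since the threads contend for caches and the front end:

```python
# Toy model: a wide core leaves issue slots idle, and SMT lets a second
# thread fill some of them. All numbers are invented for illustration.

def throughput(width: int, ipc_per_thread: float, threads: int) -> float:
    """Total IPC across all threads, capped by machine width."""
    return min(width, threads * ipc_per_thread)

WIDTH = 8  # hypothetical 8-wide core
one = throughput(WIDTH, 5.0, 1)  # a single thread extracts ~5 IPC
two = throughput(WIDTH, 5.0, 2)  # SMT: a second thread fills idle slots
print(one, two, f"+{(two - one) / one:.0%}")  # capped at width, not doubled
```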
Unlike the PC/server focus of Intel & AMD, Apple's primary market for their core designs is the iPhone, where SMT would be a negative.
Exactly.
CPU Cache increases performance primarily by decreasing latency, not by making up for inadequate bandwidth.
A large cache can eliminate the need for more main-memory bandwidth (not CPU-internal bandwidth) by keeping the needed data in cache and avoiding external memory fetches.
 

gai

Junior Member
Nov 17, 2020
10
28
91
That page-size restriction on L1 way size applied when L1 accesses used VIPT tagging. Newer cores access their L1 with purely virtual address tags, so L1 way sizes aren't limited by the page size.
There's no difference between VIPT and PIPT when the cache is small enough, and this is where the implementation becomes much nicer. Hence, the 48 KiB Zen 5 L1D$ and the 48 KiB Lion Cove "L0" D$, which use 4K page sizes, are both 12-way PIPT.

When the first-level cache uses VIVT, it requires some alias analysis hardware to handle duplicate instances of the same cache line, and this hardware does not come for free.
 