Question CPU Microarchitecture Thread

FlameTail

Diamond Member
Dec 15, 2021
4,238
2,593
106
So I decided to finally make this thread so that one can find answers to minor queries regarding CPU microarchitectures: the kind of questions Google can't provide a good answer to, which is not surprising, since this is a deep subject and there is a lot of inaccurate information out there.
 
Reactions: Vattila

FlameTail

Diamond Member
Dec 15, 2021
4,238
2,593
106
(1) Is AVX-512 exclusive to x86 cores? AMD and Intel are certainly implementing it on their latest CPU cores. However, I haven't heard of an ARM core with AVX-512. Is this possible?
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,389
15,513
136
(1) Is AVX-512 exclusive to x86 cores? AMD and Intel are certainly implementing it on their latest CPU cores. However, I haven't heard of an ARM core with AVX-512. Is this possible?
It's not in Raptor Lake, so what Intel core are you referring to? Server parts?
 
Last edited:

FlameTail

Diamond Member
Dec 15, 2021
4,238
2,593
106
It's not in Raptor Lake, so what Intel core are you referring to? Server parts?

Yes, their server parts do have it. And the upcoming Meteor Lake is rumoured to have it too. But didn't Alder Lake already have AVX-512 in the Golden Cove cores? Intel disabled it via a BIOS update.
 

FlameTail

Diamond Member
Dec 15, 2021
4,238
2,593
106
(2) I have come across various people who have posited that Apple's P-cores do not have SMT because they have a very large ROB. I'll try to post links to the OPs who said so if I can find them, but the idea is that the need for multithreading is eliminated since the core is very good at out-of-order execution. Is this correct?

PS: I am not very knowledgeable with regard to the low-level details of CPU microarchitecture. I do have a hazy understanding of how decoders, ALUs, ROBs and all the other stuff inside a core work, though. So a simplified explanation would be much appreciated.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,104
136
(1) Is AVX-512 exclusive to x86 cores? AMD and Intel are certainly implementing it on their latest CPU cores. However, I haven't heard of an ARM core with AVX-512. Is this possible?
AVX is an extension to the x86 ISA, so it's not possible for ARM to support it by definition. However, the operations themselves can (and in some cases, do) have parallels in ARM via NEON and SVE.
But didn't Alder Lake already have AVX-512 in the Golden Cove cores? Intel disabled it via a BIOS update.
Yes, in terms of hardware, Golden Cove and Raptor Cove have AVX-512 support. It's disabled by what's ultimately a firmware lock.
(2) I have come across various people who have posited that Apple's P-cores do not have SMT because they have a very large ROB. I'll try to post links to the OPs who said so if I can find them, but the idea is that the need for multithreading is eliminated since the core is very good at out-of-order execution. Is this correct?
No, the argument is kind of a non sequitur. Apple has simply chosen to focus its efforts on other areas (single-thread performance, power, area) instead of throughput.
 

Doug S

Platinum Member
Feb 8, 2020
2,888
4,912
136
(2) I have come across various people who have posited that Apple's P-cores do not have SMT because they have a very large ROB. I'll try to post links to the OPs who said so if I can find them, but the idea is that the need for multithreading is eliminated since the core is very good at out-of-order execution. Is this correct?


Unlike the PC/server focus of Intel & AMD, Apple's primary market for their core designs is the iPhone, where SMT would be a negative.
 
Reactions: lightmanek

FlameTail

Diamond Member
Dec 15, 2021
4,238
2,593
106
Question:

Other microarchitectural differences aside, which cache setup would be best for gaming (theoretically)?

(1) Small, low-latency, private L2 + Very Large, high-latency, shared L3 [eg: Intel, AMD]

or

(2) Large, low-latency, shared L2 + No L3 [eg: Apple, Qualcomm]
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,978
3,656
136
Question:

Other microarchitectural differences aside, which cache setup would be best for gaming (theoretically)?

(1) Small, low-latency, private L2 + Very Large, high-latency, shared L3 [eg: Intel, AMD]

or

(2) Large, low-latency, shared L2 + No L3 [eg: Apple, Qualcomm]
Depends on sizes. The actual answer in my mind is neither; it's IBM's dynamic L2/L3, which gets the best of both worlds.
 

naukkis

Senior member
Jun 5, 2002
962
829
136
It's not about the cache itself. Large caches decrease the need for memory bandwidth; small-cache solutions are only possible when there's a robust memory subsystem with enough bandwidth. Caches are there to supplement the memory system.
 
Reactions: marees

GTracing

Member
Aug 6, 2021
168
396
106
It's not about the cache itself. Large caches decrease the need for memory bandwidth; small-cache solutions are only possible when there's a robust memory subsystem with enough bandwidth. Caches are there to supplement the memory system.
CPU Cache increases performance primarily by decreasing latency, not by making up for inadequate bandwidth.
 
Reactions: lightmanek

naukkis

Senior member
Jun 5, 2002
962
829
136
CPU Cache increases performance primarily by decreasing latency, not by making up for inadequate bandwidth.

Memory bandwidth is average memory access latency inverted. Nowadays CPUs have around a hundred simultaneous memory accesses in flight; the memory access latency itself isn't so important, but sustained throughput for the current load is. The larger the memory access latencies, the lower the sustained bandwidth. Halving a CPU's last-level cache will increase bandwidth demand by as much as the cache hit rate decreases, usually something like 10-50%. And yes, today's many-core CPUs are bandwidth-starved in MT loads. A higher-latency memory subsystem is well tolerated if the workload's memory access pattern still gets good effective bandwidth. x86 CPUs' memory controllers for LPDDR just aren't effective; conversely, I don't believe Apple loses anything from LPDDR's longer access latencies, as their memory subsystem can sustain high bandwidth even with hard-to-predict access patterns.
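The bandwidth/latency relationship being described here is essentially Little's law: sustained bandwidth equals outstanding misses times line size divided by miss latency. A rough Python sketch, with invented numbers purely for illustration:

```python
# Toy model (not from the post): Little's law for memory systems.
#   sustained bandwidth = outstanding_requests * line_size / latency

def sustained_bandwidth_gbs(outstanding: int, line_bytes: int, latency_ns: float) -> float:
    """Sustained bandwidth in GB/s for a given miss latency and memory-level parallelism."""
    return outstanding * line_bytes / latency_ns  # bytes per ns == GB/s

# Hypothetical numbers: 100 outstanding 64-byte misses at 100 ns latency.
print(sustained_bandwidth_gbs(100, 64, 100.0))  # 64.0 GB/s
# Double the latency and sustained bandwidth halves, as the post argues:
print(sustained_bandwidth_gbs(100, 64, 200.0))  # 32.0 GB/s
```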
 

GTracing

Member
Aug 6, 2021
168
396
106
Question:

Other microarchitectural differences aside, which cache setup would be best for gaming (theoretically)?

(1) Small, low-latency, private L2 + Very Large, high-latency, shared L3 [eg: Intel, AMD]

or

(2) Large, low-latency, shared L2 + No L3 [eg: Apple, Qualcomm]
I'll take a stab at answering this question.

Firstly, Apple's shared L2 cache doesn't come with no strings attached. Compare AnandTech's M1 and M1 Max latency charts: https://www.anandtech.com/show/17024/apple-m1-max-performance-review/2 It seems that each core can only write to a certain amount of L2. I would bet that the cores are organized in 4-core clusters, and that when a core needs data from L2 in another cluster there's a latency penalty.

AMD and Intel have a bigger shared cache, and it's split between more cores. Both of these increase latency, so they add a private L2 to bridge the L1 and the shared L3. (And Intel actually added an "L1.5" in Arrow Lake because the L2 is so large.)

So, to answer the question: gaming is a cache-sensitive workload, and more cache is better. I would say option 2, but that's mostly because of the large caches that come with that setup, not because of the cache hierarchy itself.

A few extra notes:
  • Apple's L1 is somehow both larger and lower latency. Intel and AMD might choose different L2 and L3 cache sizes if they could make an L1 as good as Apple's.
  • AMD's L3 is a victim cache (the only way data gets into the L3 is when it's evicted from the L2). This makes the private L2 more important (necessary?) on AMD.
  • Intel's Skymont has an L2 shared between 4 cores, and it's rumoured that a Skymont successor will take over as Intel's unified CPU architecture in a few years. So Intel's future CPUs might look a lot more Apple-like.
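The trade-off between the two hierarchies can be put in rough numbers with the standard average memory access time (AMAT) formula. The latencies and hit rates below are invented for illustration, not measurements of any real chip:

```python
# Standard AMAT (average memory access time) recurrence, with made-up cycle
# counts and miss rates purely for illustration:
#   AMAT = L1_hit + L1_miss% * (L2_hit + L2_miss% * (... + DRAM))

def amat(levels, dram_cycles):
    """levels: list of (hit_latency_cycles, miss_rate) from L1 outward."""
    t = dram_cycles
    for hit_cycles, miss_rate in reversed(levels):
        t = hit_cycles + miss_rate * t
    return t

# Option 1: small fast private L2 + large slow shared L3 (hypothetical numbers).
opt1 = amat([(4, 0.10), (14, 0.40), (50, 0.20)], dram_cycles=300)
# Option 2: large low-latency shared L2, no L3 (hypothetical numbers).
opt2 = amat([(4, 0.10), (18, 0.15)], dram_cycles=300)
print(f"{opt1:.1f} vs {opt2:.1f} cycles")
```

With these particular numbers the two options land close together, which is why the actual cache sizes and hit rates matter more than the number of levels.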
 
Reactions: FlameTail

FlameTail

Diamond Member
Dec 15, 2021
4,238
2,593
106
It's not lower latency; Apple CPUs just run at lower clock speeds.
Usually, latency increases as cache capacity becomes larger. So it is impressive that Apple (and Qualcomm) have been able to implement much larger caches than Intel/AMD while keeping similar latency.

The main reason it can be larger is that Apple uses a 16kB page size.
Oryon has a 192 KB L1i, and Lion Cove has a 192 KB L1d, rivalling the size of Apple's. They both use 4 KB page sizes IIRC, so I don't think your statement is true.
 
Last edited:
Reactions: Gideon

GTracing

Member
Aug 6, 2021
168
396
106
It's not lower latency; Apple CPUs just run at lower clock speeds.
No, even accounting for the lower clock speeds Apple has lower latency.

The Apple M4 P-core has an L1 latency of 0.68 ns. The 285K and the 9950X each have a max clock speed of 5.7 GHz, which gives a latency of 0.70 ns. And the x86 latency is higher on lower-end chips, when more cores are loaded, or in mobile chips.
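One plausible reading of the arithmetic behind these figures, assuming (hypothetically) a 4-cycle L1 hit at 5.7 GHz and a 3-cycle hit at roughly 4.4 GHz; the cycle counts and the M4 clock are assumptions, not vendor specs:

```python
# Converting cache latency from cycles to nanoseconds: ns = cycles / GHz.
# The cycle counts and clocks below are assumptions for illustration.

def latency_ns(cycles: int, ghz: float) -> float:
    return cycles / ghz

print(f"{latency_ns(4, 5.7):.2f} ns")  # 4 cycles at 5.7 GHz -> 0.70 ns
print(f"{latency_ns(3, 4.4):.2f} ns")  # 3 cycles at ~4.4 GHz -> 0.68 ns
```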
 

naukkis

Senior member
Jun 5, 2002
962
829
136
I've heard this, but I'm not convinced. Qualcomm also has large L1 caches but reportedly has the same 4 kB page size as Intel and AMD. How do they do it?

That page-size restriction on L1 way size applied when L1 accesses used VIPT tagging. Newer cores access their L1 with purely virtual address tags, so L1 way sizes aren't limited by the page size.
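For reference, the classic VIPT constraint is that the cache index bits must fit within the page offset, so each way can be at most one page and the maximum L1 size is associativity × page size. The arithmetic, sketched in Python:

```python
# VIPT constraint: index bits must come from the page offset, so each cache
# way can be at most one page in size:
#   max_L1_size = associativity * page_size

def max_vipt_l1_kib(ways: int, page_kib: int) -> int:
    return ways * page_kib

print(max_vipt_l1_kib(12, 4))   # 48 KiB  (e.g. a 12-way L1 with 4 KiB pages)
print(max_vipt_l1_kib(12, 16))  # 192 KiB (a 12-way L1 with 16 KiB pages)
print(192 // 4)                 # ways needed for 192 KiB with 4 KiB pages: 48
```

This is why 16 KiB pages make a 192 KiB VIPT L1 much easier to build than 4 KiB pages do.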
 
Reactions: GTracing

Gideon

Golden Member
Nov 27, 2007
1,842
4,379
136
Question:

Other microarchitectural differences aside, which cache setup would be best for gaming (theoretically)?

(1) Small, low-latency, private L2 + Very Large, high-latency, shared L3 [eg: Intel, AMD]

or

(2) Large, low-latency, shared L2 + No L3 [eg: Apple, Qualcomm]
I've thought for some time that ... both

A large low-latency shared L2, if done right, should improve performance in many consumer workloads, but if you can use 3D stacking, an (even larger) L3 would IMO make more sense than just bloating up the L2 caches.
 

OneEng2

Senior member
Sep 19, 2022
259
358
106
Yes. ARM uses its own extensions, such as NEON, SVE and SVE2, to do similar things.
RISC has simpler instructions than CISC, and in fact the first portion of the decode stage in a CISC design produces RISC-like micro-ops (fixed-length instructions and operands, so they can be pipelined easily).

SIMD (Single Instruction, Multiple Data) instructions like SSE, SSE2, AVX, etc. are a good idea for either processor design, since they allow fewer cycles to accomplish the same work as the bunch of series-parallel instruction chains they replace.
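The "fewer cycles for the same work" point above can be sketched with a toy model; this is plain Python standing in for real vector hardware, and the 8-lane width is illustrative only:

```python
# Toy illustration of the SIMD idea: one instruction operates on many lanes.
# This models an 8-wide vector add in plain Python; it is not real AVX/NEON code.

LANES = 8

def scalar_add(a, b):
    out, ops = [], 0
    for x, y in zip(a, b):
        out.append(x + y)  # one add instruction per element
        ops += 1
    return out, ops

def simd_add(a, b):
    out, ops = [], 0
    for i in range(0, len(a), LANES):
        # one vector instruction handles LANES elements at once
        out.extend(x + y for x, y in zip(a[i:i+LANES], b[i:i+LANES]))
        ops += 1
    return out, ops

a, b = list(range(32)), list(range(32))
print(scalar_add(a, b)[1], simd_add(a, b)[1])  # 32 vs 4 "instructions"
```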
(2) I have come across various people who have posited that Apple's P cores do not have SMT since they have a very large ROB. I'll try post links to the OPs who said so -I can find them, but the idea is that the need for multithreading is eliminated since the core is very good at Out-of-Order execution. Is this correct?

PS: I am not very knowledgeable with regard to the low-level details of CPU microarchitecture. I do have a hazy understanding of how decoders, ALUs, ROBs and all the other stuff inside a core work, though. So a simplified explanation would be much appreciated.
For MT, SMT provides a very large return on investment with respect to performance per area and performance per core. This is true because of the instruction-level parallelism built into a core, called "superscalar" design, which allows many calculations to go on in parallel in a single clock cycle (or over some number of clock cycles).

In order to achieve the maximum ILP in a core design, you need more execution units than necessary under MOST loads in order not to get bogged down on SOME loads. As a result, quite a few of those extra execution units are just sitting around most of the time, twiddling their thumbs... now enter SMT.

SMT allows a core to utilize those otherwise idle resources to work on a different thread of code, essentially turning a single core into 2 (or more, in some designs) logical cores.

I don't believe it is possible to compete in DC processors without SMT. It is also inefficient to chase high MT performance in desktops and laptops without it, since adding more cores duplicates MUCH more of the core's transistors than SMT requires.

The cost (IMO) is in the complexity of the schedulers and the validation of the design.
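The idle-resources argument above can be sketched as a toy issue-slot model. The machine width and per-thread IPC below are invented for illustration, and real SMT gains are usually far smaller, since the threads contend for caches and the front end:

```python
# Toy model: a wide core leaves issue slots idle, and SMT lets a second
# thread fill some of them. All numbers are invented for illustration.

def throughput(width: int, ipc_per_thread: float, threads: int) -> float:
    """Total IPC across all threads, capped by machine width."""
    return min(width, threads * ipc_per_thread)

WIDTH = 8  # hypothetical 8-wide core
one = throughput(WIDTH, 5.0, 1)  # a single thread extracts ~5 IPC
two = throughput(WIDTH, 5.0, 2)  # SMT: a second thread fills idle slots
print(one, two, f"+{(two - one) / one:.0%}")  # capped at width, not doubled
```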
Unlike the PC/server focus of Intel & AMD, Apple's primary market for their core designs is the iPhone, where SMT would be a negative.
Exactly.
CPU Cache increases performance primarily by decreasing latency, not by making up for inadequate bandwidth.
A large cache can eliminate the need for more main-memory bandwidth (not CPU-internal bandwidth) by keeping the needed data in cache and avoiding external memory fetches.
 

gai

Junior Member
Nov 17, 2020
10
28
91
That page-size restriction on L1 way size applied when L1 accesses used VIPT tagging. Newer cores access their L1 with purely virtual address tags, so L1 way sizes aren't limited by the page size.
There's no difference between VIPT and PIPT when the cache is small enough, and this is where the implementation becomes much nicer. Hence, the 48 KiB Zen 5 L1D$ and the 48 KiB Lion Cove "L0" D$, which use 4K page sizes, are both 12-way PIPT.

When the first-level cache uses VIVT, it requires some alias analysis hardware to handle duplicate instances of the same cache line, and this hardware does not come for free.
 