Question CPU Microarchitecture Thread


FlameTail

Diamond Member
Dec 15, 2021
4,384
2,754
106
So I decided to finally make this thread so that one can find answers to minor queries regarding CPU micro architectures. Questions the like of which Google can't provide a good answer to, which is not surprising since this is a deep subject and there is a lot of inaccurate information out there.
 
Reactions: Vattila

soresu

Diamond Member
Dec 19, 2014
3,491
2,782
136
Reactions: FlameTail

soresu

Diamond Member
Dec 19, 2014
3,491
2,782
136

The Cambridge Architecture

As data grows bigger, caches are less effective and memory access becomes the dominant consumer of energy. This is unsustainable – industry needs a new approach. The Cambridge Architecture™ is a memory architecture that understands data structures: the next generation of Stored Program Machines.
  • Zero latency memory
  • Energy efficiency improved by more than 200%
  • Performance improvements 2x to 1000x
  • Especially well suited to big data, in-memory computing/databases

Memory acceleration use cases

Blueshift Memory offers significant performance benefits, accelerating use cases requiring access to large amounts of memory in databases, including:
  • Big data and in-memory databases
  • AI training
  • High-Frequency Trading
  • Image processing
  • Computer vision

Integration points

Blueshift memory technology is architectural and is independent from the applied memory cell technology. Cambridge Architecture™ IP can be integrated into any of the following:
  • a memory controller,
  • an accelerator card,
  • SSD/HDD data storage
  • network storage
Alternatively, as a SoC, it can be integrated into a CPU, TPU, GPU or an AI engine.
 

name99

Senior member
Sep 11, 2010
565
463
136
What do you mean by "100% correct stream of instructions"?
Take every branch correctly. Now you have a stream of instructions, one after the other, that represents the code you want to execute.
The magic is all in acquiring this stream of instructions, it's not in DECODING it - at least not for a sane ISA.
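
As a rough illustration of what "100% correct stream" means here, the toy sketch below walks a made-up program with every branch resolved by an oracle and emits the resulting dynamic instruction sequence. The block names, the instruction mnemonics, and the loop count are all hypothetical; a real front-end has to predict what the oracle simply knows.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A basic block: its instructions, plus a function returning the block that
# actually executes next (None means the program ends). Hypothetical model.
Block = Tuple[List[str], Callable[[dict], Optional[str]]]

def correct_stream(blocks: Dict[str, Block], entry: str, state: dict) -> List[str]:
    """Return the dynamic instruction stream with every branch taken correctly."""
    stream: List[str] = []
    current: Optional[str] = entry
    while current is not None:
        instructions, next_block = blocks[current]
        stream.extend(instructions)
        current = next_block(state)  # oracle: the real branch outcome
    return stream

def loop_branch(state: dict) -> str:
    # Backward branch: stay in the loop for three iterations, then fall out.
    state["i"] += 1
    return "loop" if state["i"] < 3 else "exit"

blocks: Dict[str, Block] = {
    "entry": (["mov", "cmp", "b.ne"], lambda s: "loop"),
    "loop": (["ldr", "add", "cmp", "b.lt"], loop_branch),
    "exit": (["str", "ret"], lambda s: None),
}

stream = correct_stream(blocks, "entry", {"i": 0})
print(len(stream), stream)
```

The stream crosses a block boundary every three or four instructions, so a front-end that wants to deliver more than one basic block per cycle needs the next fetch addresses before any of these instructions are decoded.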
 

MS_AT

Senior member
Jul 15, 2024
449
972
96
ALL the magic in a decent design is in achieving that 100% correct stream of instructions, especially if you want that stream to pull in more than a basic block per cycle.
This (of course..., sigh) is also completely misunderstood by people who live exclusively in x86 land, but it's a whole different issue from clustered decoding.
Misunderstood in which sense? I thought that Zen5 has fairly accurate predictors, they are just slow. Or did you mean something else than accuracy? Genuinely asking.
 

name99

Senior member
Sep 11, 2010
565
463
136
Misunderstood in which sense? I thought that Zen5 has fairly accurate predictors, they are just slow. Or did you mean something else than accuracy? Genuinely asking.
What's the difference between a Fetch Predictor and a Branch Predictor?
When should a Branch Predictor deliver predictions?
What's the data provided by a Fetch Predictor?
Why am I using the term Fetch Predictor, not BTB?

Sorry, but there is something about Fetch, even more than the rest of the CPU, that makes the x86 contingent lose their goddamn minds. EVERY TIME I try to discuss the issue, it's like talking to a brick wall. I'm sick of it and wasting time on it.

Here's the proof. Verilator is by far the toughest "easy-ish to measure" load from the point of view of the FRONT-END.
SPEC is useless for testing front end, the code working set is tiny.
"Server" workloads are what you want, but most of those are difficult to run, verilator is the one fairly easy case to run.

This is from James Aslan's site, https://zhuanlan.zhihu.com/p/704707254, which being a mainland site is a freaking pain in the ass to deal with! You will have to register if you want to see anything, and you will never be able to comment because commenting requires a second stage of registration that requires a mainland phone number.

Regardless, the point is that when we push the front-end hard, ARM (and especially team Apple/ex-Apple) does vastly better than team x86. [Zen 4 is basically the same sort of level as Intel].
And yet team x86 refuse to listen every time you tell them they are doing it wrong...

(Firestorm was M1, Avalanche was M2.
Even Blizzard, the M2 small core, does slightly better than Raptor Cove! That's on a different graph, but it achieves 2.50 on VTop1.)

OK, with all that rant out the way, go read:
especially volume 4. That will tell you how to handle instruction flow PROPERLY.
 

soresu

Diamond Member
Dec 19, 2014
3,491
2,782
136
Still wondering what perf improvement APX extensions will bring to x86 with the doubling of registers.

Intel's x86-S proposal has potential to shave the legacy cruft off the ISA too.

One of the significant problems limiting µArch optimisation of x86 is that legacy cruft.
 

FlameTail

Diamond Member
Dec 15, 2021
4,384
2,754
106

Now there are 3 vendors with their own set of custom ARM CPU cores (big + small).

Apple
Qualcomm
Huawei

A remarkable development. One year ago, it was only Apple.

According to the big gChips leak in October, Google is also working on custom ARM cores (Orion, Orion-E) for the Tensor G7 SoC, which is bound for their 2027 Pixel phone.

It's also interesting to compare the cache setups. Apple, Qualcomm and Google's designs all have big shared L2 caches. Whereas Huawei's Taishan core has a cache hierarchy similar to stock ARM cores (pL2, sL3).

We are witnessing a renaissance of custom ARM core designs.

I guess ARM the company isn't very happy about this, because ALAs have lower royalty rates than TLAs.

They all had various reasons to pursue custom ARM cores. Apple of course needs no explaining. They have been making custom ARM cores for over a decade.

For Qualcomm, it was so that they can:
(1) Create CPU cores that are more powerful and efficient than stock ARM cores.
(2) Reduce the royalty rates being paid to ARM.

As for Huawei, they had no other choice really. They can't license the big ARM cores (Cortex X, A7xx), because those are designed in America, and Huawei is under heavy US sanctions.

Not sure what Google's objectives might be. Will have to see when their custom core comes to market.
 

GTracing

Senior member
Aug 6, 2021
276
645
106
Chips and Cheese tested the 7950X3D with the micro op cache disabled.


With the cache disabled, the core loses 11.4% in SPEC int and 6.6% in SPEC fp. When two SPEC instances are run on two SMT threads, the core loses 16% in int and 11.3% in fp. Cyberpunk runs practically the same without the micro op cache.

The performance hit from disabling the micro op cache seems surprisingly low to me.
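
To put those losses the other way around, a quick sketch of the arithmetic: if disabling the cache costs a given percentage of the score, the speedup the cache provides is 1 / (1 - loss). This assumes the quoted percentages are relative to the score with the cache enabled; the function name and labels are just for illustration.

```python
def speedup_from_loss(loss_percent: float) -> float:
    """Speedup of cache-enabled over cache-disabled, given the % score lost."""
    return 1.0 / (1.0 - loss_percent / 100.0)

# Losses quoted above for the 7950X3D with the micro op cache disabled.
losses = {
    "SPEC int, 1 thread": 11.4,
    "SPEC fp, 1 thread": 6.6,
    "SPEC int, 2 SMT threads": 16.0,
    "SPEC fp, 2 SMT threads": 11.3,
}

for name, loss in losses.items():
    print(f"{name}: {speedup_from_loss(loss):.3f}x with the micro op cache")
```

Read this way, the gain from having the cache is always a little larger than the quoted loss figure.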
 