Discussion Zen 5 Architecture & Technical discussion

Page 14 - AnandTech Forums

DisEnchantment

Golden Member
Mar 3, 2017
1,749
6,614
136


Z4
The processor fetches instructions from the instruction cache in 32-byte blocks that are 16-byte aligned and contained within a 64-byte aligned block. The processor can perform a 32-byte fetch every cycle. The fetch unit sends these bytes to the decode unit through a 24-entry Instruction Byte Queue (IBQ), each entry holding 16 instruction bytes. In SMT mode each thread has 12 dedicated IBQ entries. The IBQ acts as a decoupling queue between the fetch/branch-predict unit and the decode unit.

Z5
The processor fetches instructions from the instruction cache in 32-byte blocks that are 32-byte aligned. Up to two of these blocks can be independently fetched every cycle to feed the decode unit’s two decode pipes. Instruction bytes from different basic blocks can be fetched and sent out of order to the 2 decode pipes, enabling instruction fetch-ahead which can hide latencies for TLB misses, Icache misses, and instruction decode. Each decode pipe has a 20-entry structure called the IBQ which acts as a decoupling queue between the fetch/branch-predict unit and the decode unit. IBQ entries hold 16-byte aligned fetch windows of the instruction byte stream. The decode pipes each scan two IBQ entries and output up to four instructions per cycle. In single thread mode the maximum throughput is 4 instructions per cycle. In SMT mode decode pipe 0 is dedicated to Thread 0 and decode pipe 1 is dedicated to Thread 1, supporting a maximum throughput of eight instructions per cycle.

So 2 individual IBQs, and the decode pipes can only work on different basic blocks. In short, when SMT is disabled, the second decode pipe is wasted and the 2x fetch throughput goes unused, unlike the retire queue for instance, where 1T gets all the resources. Great!
This design is quite odd to me.
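As a toy illustration of the decode limits quoted above (my own sketch; the function name and constants are distilled from the quoted text, not anything from AMD's guide):

```python
# Toy model of the Zen 5 legacy-decode throughput limits described in the
# quoted optimization-guide text (op cache path ignored). Illustration only.

def zen5_decode_width(smt_enabled: bool, active_threads: int) -> int:
    """Max instructions decoded per cycle across the whole core."""
    PIPES = 2      # two decode pipes...
    PER_PIPE = 4   # ...each delivering up to 4 instructions/cycle
    if smt_enabled:
        # Pipe 0 serves thread 0 and pipe 1 serves thread 1, so total
        # width scales with the number of active threads.
        return PER_PIPE * min(active_threads, PIPES)
    # Single-thread mode: per the guide, max throughput is still 4/cycle,
    # i.e. one thread cannot combine both pipes.
    return PER_PIPE

assert zen5_decode_width(smt_enabled=False, active_threads=1) == 4  # 1T: one pipe
assert zen5_decode_width(smt_enabled=True, active_threads=2) == 8   # SMT: both pipes
```

The model makes the complaint concrete: only the two-thread case ever reaches the full 8-wide decode.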
 

Bigos

Member
Jun 2, 2019
159
397
136
Yeah, it sounds like they did all of the work to make it possible to use both in a single thread (why else mention them being able to work out of order and being able to fetch-ahead) and then at the end "BTW, a single thread can only use one pipe". I am not an expert by any means, but this sounds like they did not manage to utilize both decode pipes in a single thread even though they tried. I would then predict this being fixed in Zen 6 (disclaimer: do not believe random predictions on the internet).
 

yuri69

Senior member
Jul 16, 2013
574
1,017
136
This just confirms they chased the SMT ghost with this weird implementation.

What makes this Zen 5 architectural decision even more WTF-like is the timeline:
* 2019 - public release of Sunny Cove - jump to 4+1-wide machine
* 2020 - unveil of Tremont - duplicated 3-wide Goldmont decoder for 1T
* 2021 - public release of Gracemont - even more efficient clustered decode for 1T
* 2021 - public release of Golden Cove - jump to 6-wide machine
// * 2024 - public announcement of Lion Cove - jump to 8-wide machine(?)

Zen 5 was in development since at least 2018.

Yet, they preferred to invest in this 2-BB oddity + super-large BTB instead of investing in the INT PRF/ROB.

// Btw no "multi-layer" VCache is intended for Zen 5:

Up to 96-Mbyte shared, victim L3, depending on configuration.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,749
6,614
136
Mispredict penalty

13 --> 15 cycles

Z5
The branch misprediction penalty is in the range from 12 to 18 cycles, depending on the type of mispredicted branch and whether the instructions are being fed from the Op Cache. The common case penalty is 15 cycles.
Z4
The branch misprediction penalty is in the range from 11 to 18 cycles, depending on the type of mispredicted branch and whether or not the instructions are being fed from the Op Cache. The common case penalty is 13 cycles.
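For scale, a back-of-envelope amortization of that two-cycle change (a hypothetical sketch; the branch fraction and misprediction rate below are made-up illustrative numbers, not measurements):

```python
# Back-of-envelope: what the common-case mispredict penalty change
# (13 cycles on Zen 4 -> 15 cycles on Zen 5) costs in amortized cycles
# per instruction, under assumed (made-up) branch statistics.

def penalty_cycles_per_instruction(penalty_cycles,
                                   branch_fraction=0.20,
                                   mispredict_rate=0.05):
    """Cycles lost to branch mispredicts, amortized over all instructions."""
    return branch_fraction * mispredict_rate * penalty_cycles

zen4 = penalty_cycles_per_instruction(13)  # 0.13 cycles/instruction
zen5 = penalty_cycles_per_instruction(15)  # 0.15 cycles/instruction
print(f"Zen 4: {zen4:.2f}, Zen 5: {zen5:.2f}, delta: {zen5 - zen4:.2f}")
```

At these assumed rates the extra two cycles cost about 0.02 cycles per instruction; branchier or less predictable code would feel it more.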
 

Josh128

Senior member
Oct 14, 2022
511
865
106
Looks like AMD put a lot of experimental tech into Zen 5, and the "low-hanging fruit" in Zen 6 will be both making it work and making better use of it.
Mike Clark hinted at this in a recent interview about Zen 5. Now, it would be wise to take MC's comments with a grain of salt, because he previously gushed about Zen 5 and the reality, so far, has been far less impressive than he led us to believe, IMO.

The biggest challenge of Zen 5 design​

TH: What was the biggest challenge you encountered with Zen 5 development?

MC: It was actually dealing with two technologies [designing Zen 5 for both the 4nm and 3nm process technologies], especially a technology that the previous generation was in. And trying to do so much change, and therefore the unavoidable reality that in 4nm it's going to be [consume] more power than it's going to be in 3nm, no matter how smart we are.

But we need that flexibility in our roadmap, and it makes sense. But still that was really hard to try to control having the two technologies and the features, and a feature that looks great in 3nm not looking so great in 4nm because of the power impact of the not-as-efficient transistor and how it affects the floorplan. Normally, we do the architecture in one, and then we port on the next one, and then you have a lot of time to deal in the floor plan with the two technologies. [..] It was just really challenging. But that gives Zen 6 a lot of room to improve.

And we're going to deliver 3nm here in short order with 4nm; basically, they're on top of each other. So the design teams are separate in building those, but we're trying to communicate and work together — it is still the same. We've tried to keep it simple for our own sanity. We have all these designs we have to validate and we have to build, and the more they're different, the more things just get out of control. It drives complexity.

That was a challenge, and one we love because, like I said, now that we've done it, we've learned a lot from it. We're going to be able to do it better the next time. That's what makes this job so fun: constantly learning, constantly new challenges, and new innovation.


He seems to indicate in the bolded passage that 3nm Zen 5 was the primary architecture design target; the fact that he admitted certain architected features look great in 3nm but not so great in 4nm is very telling and basically confirms this. He also says here (and in another interview I can't seem to find) that the Zen 6 team might come off looking great, but what they achieve is only possible due to the groundwork laid by the Zen 5 design.
 

Mahboi

Golden Member
Apr 4, 2024
1,035
1,900
96
Not directly arch related (and mired in that stupid x64 vs ARM meme), but I think we should start exploring that aspect a bit more:
I think the root issue is firstly that client and data center share the same architecture and for that reason they don’t have something specific to client for perf per watt. So trade offs are being made.
I don't want to amend the Zen 5 Info thread yet, but I feel like my hunch wasn't just a hunch. There is a fundamental inclination that Zen 5 has which is server oriented. It may be a growing pain in the future and I wouldn't be surprised if we start hearing more and more about dividing architectures into client and server (and no, not just the uncore).
Zen 5 increasingly strongly feels like a top tier DC/supercomputer/CPS arch that handles SIMD like nothing in the world, but also like its general design philosophy is simply not that client-capable. Like we've traded Zen 1 to 4's simple and solid design for something that was aiming to remain balanced, but with Zen 5 they crunched everything to the max and the part that clearly got the lion's share for now is server. I was defending Zen 5 as "probably not that imbalanced, just having a hard time with INT/scalar", but in reality we can objectively say that Apple does it better for less watts.

If Zen 3 was the moment where we more or less got gaming under wraps, and anything beyond is a nice bonus, Zen 5 may be the moment where it becomes easier to crunch massive amounts of data well, but the design philosophy needed for that trumps a smaller, less complex design that would fit better for client workloads, with less latency.
Zen 5 may be a mild wake up call for server and client architectures to start being split. Something which ironically, thanks to their e-cores, Intel seems to be already de facto doing, and AMD has sort of started with Zen 4/5c and the like.
 

LightningZ71

Golden Member
Mar 10, 2017
1,910
2,260
136
"Zen 5 may be a mild wake up call for server and client architectures to start being split"

We already see that with Strix Point having Zen5 cores with half width (effectively) AVX-512 capabilities. It will become more pronounced as we go along.
 

JustViewing

Senior member
Aug 17, 2022
225
408
106
I was able to squeeze about 9.98 uops/cycle out of Zen 3 with meaningless code.
IPC was about 7.7, hmm. Not sure I am measuring it correctly. If someone finds the reason for this, let me know.

Code:
.data
align 16
Var1    dd 0
Var2    real4 0.0
Var3    real8 262144.0

.code
Test2 proc
    push rbp
    push rsi
    push rbx
    sfence
    rdtsc
    mov rbp,    rax
    mov ecx,    1
    shl ecx, 18
    lea rsi,    Var1
align 16
@@:
    add     eax,    [rsi]                   ; 2
    vxorps  xmm5,   xmm5,   xmm5            ; 1
    add     edx,    [rsi]                   ; 2
    vxorps  xmm6,   xmm6,   xmm6            ; 1
    add     rbx,    [rsi]                   ; 2
    vxorps  xmm7,   xmm7,   xmm7            ; 1
    xor     eax,    eax                     ; 1
    cmp     rsi,    0                       ; 1
    jz @end                                 ; 1  
    vaddss  xmm0, xmm0, dword ptr[rsi+4]    ; 2  
    mov     rdi, rdi                        ; 1
    vaddss  xmm1, xmm1, dword ptr[rsi+4]    ; 2
    cmp     rsi,    0                       ; 1
    jz @end                                 ; 1
    mov     rbp, rbp                        ; 1
    vaddss  xmm2, xmm2, dword ptr[rsi+4]    ; 2
    xor     edx,    edx                     ; 1
    cmp     rsi,    0                       ; 1
    jz @end                                 ; 1
    vaddss  xmm3, xmm3, xmm3                ; 1

    dec ecx                                 ; 1
    jnz @b                                  ; 1         ; 22 instructions, 28 uops
    sfence
    rdtsc  
    sub rax, rbp
    vcvtsi2sd xmm4, xmm4, rax  
    vdivsd      xmm0, xmm4, Var3            ; ~ 2.83 Cycles, 9.98 uops/cycle,  7.7 IPC
   

    pop rbx
    pop rsi
    pop rbp

@end:
    ret
Test2 endp
END
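The quoted rates can be sanity-checked from the loop's own counts (a quick sketch of the arithmetic; the rdtsc caveat is a general observation, not something from the post):

```python
# Cross-check of the posted figures: the loop body is 22 instructions /
# 28 uops per iteration, executed 2**18 times (Var3 = 262144.0 divides
# the rdtsc delta to get ~2.83 ticks per iteration). Note that rdtsc
# counts invariant-TSC ticks rather than core cycles, so the derived
# rates are only exact when the core clock matches the TSC frequency;
# that mismatch could explain part of the oddness in the numbers.

ITERATIONS = 1 << 18      # matches "mov ecx, 1 / shl ecx, 18"
INSTR_PER_ITER = 22
UOPS_PER_ITER = 28
TICKS_PER_ITER = 2.83     # quoted rdtsc-based measurement

uops_per_cycle = UOPS_PER_ITER / TICKS_PER_ITER
ipc = INSTR_PER_ITER / TICKS_PER_ITER
print(f"{uops_per_cycle:.2f} uops/cycle, {ipc:.2f} IPC")  # 9.89 uops/cycle, 7.77 IPC
```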
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,505
2,080
136
Is that benchmark code to determine probable IPC of any x86-64 CPU?
It's a hand-optimized loop; it doesn't look like a benchmark that gives comparable results between different architectures. It seems to have been crafted to find the absolute maximum throughput on one particular machine.

IPC was about 7.7, hmm. Not sure I am measuring it correctly. If someone finds the reason for this, let me know.
I think retire bandwidth on Zen3 is 8 per cycle, and 7.7 is pretty close to it.
 

DavidC1

Golden Member
Dec 29, 2023
1,211
1,933
96
Mispredict penalty

13 --> 15 cycles

Z5: 12-18
Z4: 11-18
Interesting, it's a wider range than Intel chips. A 2-stage pipeline difference would be 4-8%, but they are basing the average on the uop cache hit case. It would likely be 11 on a hit and 18 on a miss, if that applies to all scenarios.

I could speculate, based on that result, that the uop cache might be more "capable" but at the cost of an increased minimum mispredict penalty. Also, if you assume the minimum equals a uop cache hit and the maximum a miss, then Zen 4 has a much higher hit rate than Zen 5 does.

If you also assume the pipeline stages stayed the same, then the gains would have been 16-17% on Int rather than 10%.
 

RnR_au

Platinum Member
Jun 6, 2021
2,115
5,098
106
This just confirms they chased the SMT ghost with this weird implementation.

What makes this Zen 5 architectural decision even more WTF-like is the timeline:
* 2019 - public release of Sunny Cove - jump to 4+1-wide machine
* 2020 - unveil of Tremont - duplicated 3-wide Goldmont decoder for 1T
* 2021 - public release of Gracemont - even more efficient clustered decode for 1T
* 2021 - public release of Golden Cove - jump to 6-wide machine
// * 2024 - public announcement of Lion Cove - jump to 8-wide machine(?)

Zen 5 was in development since at least 2018.

Yet, they preferred to invest in this 2-BB oddity + super-large BTB instead of investing in the INT PRF/ROB.

// Btw no "multi-layer" VCache is intended for Zen 5:
Just curious on this, what kind of limitations in its implementation of decoder arch does AMD have in terms of patents and such? Does Intel (or Arm) have 'foundational' patents in this space?
 

moinmoin

Diamond Member
Jun 1, 2017
5,145
8,226
136
I was defending Zen 5 as "probably not that imbalanced, just having a hard time with INT/scalar"
But that's not really the case. Zen 5 paid no attention to balance: only 512-bit AVX-512 got doubled width (in the ideal case up to doubling compute performance) and scalar integer got expanded (in the ideal case offering up to 35% more compute performance). Everything else is essentially untouched, and the improvement in scalar integer seems hard to make use of; that's the imbalance.

Zen 5 may be a mild wake up call for server and client architectures to start being split.
As @LightningZ71 already noted, that's actually happening with Zen 5: clearly client-oriented dies don't get the server Zen 5 512-bit AVX-512 implementation but keep the double-pumped 256-bit AVX-512 implementation from Zen 4. We can expect that separation to increase, perhaps similar to what happened with the RDNA and CDNA split for GPUs.
 