Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

Vattila · Oct 6, 2019

Except for the details about the improvements in the microarchitecture, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides did also show that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).

What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts!

Panino Manino · Nov 29, 2021

It's a shame that I may not be alive to see it, but I really want to see what Xilinx will add to AMD's portfolio. Zen 4 was too late, but maybe the Zen 5 generation will already come with some silicon by Xilinx? I wonder how they will change AMD's chips.

CHADBOGA · Nov 29, 2021

Panino Manino said:
It's a shame that I may not be alive to see it, but I really want to see what Xilinx will add to AMD's portfolio. Zen 4 was too late, but maybe the Zen 5 generation will already come with some silicon by Xilinx? I wonder how they will change AMD's chips.

How long do you think you have left?

jamescox · Nov 29, 2021

Doug S said:
I don't think it is the higher operating frequencies; it is the smaller capacitors. It isn't needed for DDR5 today, but DDR5's roadmap extends all the way to 64 Gb chips, 4x denser than today's 16 Gb DDR5 chips.

Seems like it would make sense to pursue multilayer designs like NAND did when the cells got too small, which allowed them to use much bigger cells and avoid the issues. I don't know enough about how DRAM is produced to know how feasible that is, obviously if it was easy they would already be doing it...

Smaller capacitors are more sensitive to cosmic rays and to thermal effects, so occasional bit flips are becoming more likely.

DrMrLordX · Dec 2, 2021

Joe NYC said:
But they are not being released at the same time, Genoa is ahead of Raphael and Bergamo.

True. But that's release. Assuming Raphael does use the same CCDs as Genoa, they went into design at the same time (essentially).

Joe NYC · Dec 2, 2021

DrMrLordX said:
True. But that's release. Assuming Raphael does use the same CCDs as Genoa, they went into design at the same time (essentially).

Yeah, that's the conventional wisdom. I was just speculating what could make things different.

DisEnchantment · Dec 2, 2021

One item I noticed from the leaked manual is support for TSX (HLE)

Interesting, considering the fact that Intel disabled them due to security bugs and Power10 removed the support. (I think SPR is going to add them again, so I guess they might be working in GC)

Reading it again, I think that was a mistake on my part: it is Fixed 0, i.e. no support.

https://twitter.com/x/status/1454025358022516736
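
For anyone who would rather check shipping hardware than a manual: a minimal sketch, assuming a Linux system where the kernel lists CPUID feature bits as flag names in /proc/cpuinfo. `hle` and `rtm` are the names Linux uses for the two TSX bits (CPUID leaf 7, EBX bits 4 and 11). The function name and the sample text are made up for illustration.

```python
def tsx_flags(cpuinfo_text):
    """Return which TSX-related flags appear in /proc/cpuinfo-style text."""
    found = {"hle": False, "rtm": False}
    for line in cpuinfo_text.splitlines():
        # x86 kernels list feature bits on lines starting with "flags"
        if line.startswith("flags") and ":" in line:
            flags = set(line.split(":", 1)[1].split())
            for name in found:
                found[name] = found[name] or name in flags
    return found


# Hypothetical sample; on a real system pass open("/proc/cpuinfo").read()
sample = "processor\t: 0\nflags\t\t: fpu vme sse2 avx2 rtm\n"
print(tsx_flags(sample))
```

A part where the bit is "Fixed 0" would simply never show `hle` in that list.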

Ajay · Dec 2, 2021

DisEnchantment said:
One item I noticed from the leaked manual is support for TSX (HLE)

Interesting, considering the fact that Intel disabled them due to security bugs and Power10 removed the support. (I think SPR is going to add them again, so I guess they might be working in GC)

Reading it again, I think that was a mistake on my part: it is Fixed 0, i.e. no support.

https://twitter.com/x/status/1454025358022516736

Hard to find info on this, but I wonder if it will be more like Intel's TSXLDTRK instructions in SPR. I assume it is TSX redesigned to prevent the side-channel attacks that plagued Intel's first two implementations.

DisEnchantment · Dec 3, 2021

First Major patch for Zen4

[PATCH 2/3] x86/MCE/AMD, EDAC/mce_amd: Add new SMCA Bank Types - Yazen Ghannam

A raft of RAS features for GMI. The GMI interface seems to be a serial one like IFIS, though of course it is not clear how many lanes are used or what the corresponding PHY is. The RAS features are similar to XGMI's. Hopefully it is a low-energy, wider serial PHY with repeaters, unlike the current-gen dual 32-bit unidirectional PHY.
Based on patents, I suppose there will be compression of data transfers as well.

This mysterious MPDMA is a huge IP block with connections to many things:

+ "Main SRAM [31:0] bank ECC or parity error",
+ "Main SRAM [63:32] bank ECC or parity error",
+ "Main SRAM [95:64] bank ECC or parity error",
+ "Main SRAM [127:96] bank ECC or parity error",
+ "Data Cache Bank A ECC or parity error",
+ "Data Cache Bank B ECC or parity error",
+ "Data Tag Cache Bank A ECC or parity error",
+ "Data Tag Cache Bank B ECC or parity error",
+ "Instruction Cache Bank A ECC or parity error",
+ "Instruction Cache Bank B ECC or parity error",
+ "Instruction Tag Cache Bank A ECC or parity error",
+ "Instruction Tag Cache Bank B ECC or parity error",
+ "Data Cache Bank A ECC or parity error",
+ "Data Cache Bank B ECC or parity error",
+ "Data Tag Cache Bank A ECC or parity error",
+ "Data Tag Cache Bank B ECC or parity error",
+ "Instruction Cache Bank A ECC or parity error",
+ "Instruction Cache Bank B ECC or parity error",
+ "Instruction Tag Cache Bank A ECC or parity error",
+ "Instruction Tag Cache Bank B ECC or parity error",
+ "Data Cache Bank A ECC or parity error",
+ "Data Cache Bank B ECC or parity error",
+ "Data Tag Cache Bank A ECC or parity error",
+ "Data Tag Cache Bank B ECC or parity error",
+ "Instruction Cache Bank A ECC or parity error",
+ "Instruction Cache Bank B ECC or parity error",
+ "Instruction Tag Cache Bank A ECC or parity error",
+ "Instruction Tag Cache Bank B ECC or parity error",
+ "System Hub Read Buffer ECC or parity error",
+ "MPDMA TVF DVSEC Memory ECC or parity error",
+ "MPDMA TVF MMIO Mailbox0 ECC or parity error",
+ "MPDMA TVF MMIO Mailbox1 ECC or parity error",
+ "MPDMA TVF Doorbell Memory ECC or parity error",
+ "MPDMA TVF SDP Slave Memory 0 ECC or parity error",
+ "MPDMA TVF SDP Slave Memory 1 ECC or parity error",
+ "MPDMA TVF SDP Slave Memory 2 ECC or parity error",
+ "MPDMA TVF SDP Master Memory 0 ECC or parity error",
+ "MPDMA TVF SDP Master Memory 1 ECC or parity error",
+ "MPDMA TVF SDP Master Memory 2 ECC or parity error",
+ "MPDMA TVF SDP Master Memory 3 ECC or parity error",
+ "MPDMA TVF SDP Master Memory 4 ECC or parity error",
+ "MPDMA TVF SDP Master Memory 5 ECC or parity error",
+ "MPDMA TVF SDP Master Memory 6 ECC or parity error",
+ "MPDMA PTE Command FIFO ECC or parity error",
+ "MPDMA PTE Hub Data FIFO ECC or parity error",
+ "MPDMA PTE Internal Data FIFO ECC or parity error",
+ "MPDMA PTE Command Memory DMA ECC or parity error",
+ "MPDMA PTE Command Memory Internal ECC or parity error",
+ "MPDMA PTE DMA Completion FIFO ECC or parity error",
+ "MPDMA PTE Tablewalk Completion FIFO ECC or parity error",
+ "MPDMA PTE Descriptor Completion FIFO ECC or parity error",
+ "MPDMA PTE ReadOnly Completion FIFO ECC or parity error",
+ "MPDMA PTE DirectWrite Completion FIFO ECC or parity error",
+ "SDP Watchdog Timer expired",
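
For context on how a string list like this gets used: the EDAC decoder keeps one array of descriptions per bank type and uses the extended error code from the MCA status as the index into it. A minimal Python sketch of that lookup, assuming that scheme; the table name, the `EXAMPLE_BANK` key, and the index positions are hypothetical and not copied from the patch.

```python
# Illustrative table: bank type -> descriptions, indexed by extended error code.
# Only a few strings from the list above are used; positions are assumptions.
SMCA_DESCS = {
    "EXAMPLE_BANK": [
        "Main SRAM [31:0] bank ECC or parity error",
        "Main SRAM [63:32] bank ECC or parity error",
        "Main SRAM [95:64] bank ECC or parity error",
        "Main SRAM [127:96] bank ECC or parity error",
    ],
}


def describe(bank_type, xec):
    """Look up the description for an extended error code, if known."""
    descs = SMCA_DESCS.get(bank_type)
    if descs is None or not 0 <= xec < len(descs):
        return None  # unknown bank type or code outside the table
    return descs[xec]


print(describe("EXAMPLE_BANK", 1))
print(describe("EXAMPLE_BANK", 40))
```

An out-of-range code or an unrecognised bank type just yields no description, which is roughly why new bank types need new tables before the kernel can print anything useful.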

HW mitigations for a whole bunch of vulnerabilities like STIBP, IBRS, SSB, Upper Address, Secure TSC (SNP) and VMSA protection (SNP)... found in Volume 2 of PPR version 3.33

TBytemaster · Dec 3, 2021

Great find!

As for the acronym, it's probably MPsoc DMA, and we're getting Xilinx IP integration early!

edit: Okay but seriously, anyone have guesses?

DisEnchantment · Dec 3, 2021

TBytemaster said:
As for the acronym, it's probably MPsoc DMA, and we're getting Xilinx IP integration early!

It could be some form of it, but the RAS messages indicate a far more sophisticated block.
As IAmChester of Chips and Cheese fame says, this block could be migrating pages between SCM and DRAM.
It is doing page table walks and migrating pages across memory, whereas the Xilinx MPSoC DMA does not seem to do anything similar beyond performing DMA without CPU intervention.

Ajay · Dec 3, 2021

TBytemaster said:
Great find!

As for the acronym, it's probably MPsoc DMA, and we're getting Xilinx IP integration early!

edit: Okay but seriously, anyone have guesses?

From: Yazen Ghannam
Subject: [PATCH 0/3] AMD SMCA Updates
Date: Fri, 3 Dec 2021 02:00:14 +0000

Hi all,

This set adds supports for SMCA changes in future AMD systems.

Patch 1 adds an "unknown" bank type so that sysfs initialization issues
can be avoided on systems with new bank types.

Patch 2 adds new bank types and error descriptions used in future AMD
systems.

Patch 3 adjusts how SMCA bank information is cached. Future AMD systems
will have different bank type layouts between logical CPUs. So having a
single system-wide cache of the layout won't be correct.

Thanks,
Yazen

Yazen Ghannam (3):
x86/MCE/AMD: Provide an "Unknown" MCA bank type
x86/MCE/AMD, EDAC/mce_amd: Add new SMCA Bank Types
x86/MCE/AMD, EDAC/mce_amd: Support non-uniform MCA bank type
enumeration

arch/x86/include/asm/mce.h | 26 ++---
arch/x86/kernel/cpu/mce/amd.c | 114 +++++++++++++-----
drivers/edac/mce_amd.c | 148 +++++++++++++++++++++---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 +-
4 files changed, 228 insertions(+), 62 deletions(-)

--
2.25.1

From: https://lkml.org/lkml/2021/12/2/1098

Above my paygrade ATM.
This appears, in fact, to be the work of an AMD engineer with a RAS and Linux background:

https://www.linkedin.com/in/yazenghannam

Have fun!

DisEnchantment · Dec 3, 2021

Ajay said:
Patch 3 adjusts how SMCA bank information is cached. Future AMD systems
will have different bank type layouts between logical CPUs. So having a
single system-wide cache of the layout won't be correct.

I could hazard a guess that it indicates that all the cores are not the same (but could be fully feature compatible for example). Otherwise no other explanation.
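
To picture what Patch 3 is getting at, here is a minimal sketch that assumes nothing about the real kernel structures. With one system-wide table, bank N has the same type on every CPU; with a per-CPU table, the same bank number can mean different things on different logical CPUs. All names and bank assignments here are hypothetical.

```python
# Old assumption: one layout for the whole system (bank number -> type).
system_wide = {0: "LS", 1: "IF", 2: "L2_CACHE"}

# New assumption: each logical CPU carries its own layout.
# The assignments are made up purely to show the lookup difference.
per_cpu = {
    0: {0: "LS", 1: "IF", 2: "L2_CACHE"},
    1: {0: "LS", 1: "IF", 2: "UNKNOWN"},
}


def bank_type(cpu, bank):
    """Resolve a bank type using the per-CPU layout, defaulting to UNKNOWN."""
    return per_cpu.get(cpu, {}).get(bank, "UNKNOWN")


for cpu in per_cpu:
    # A system-wide cache would report the same answer for both CPUs.
    print(cpu, bank_type(cpu, 2), system_wide[2])
```

The point is only that the lookup key grows from `bank` to `(cpu, bank)`; it says nothing about why the layouts differ.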
I would suppose that if AMD's big.BIGGER hybrid cores are real, they can just use CPPC2 to let the OS manage them normally, like they do now for preferred cores, because all cores handle the same instruction set and are feature compatible; peak perf would simply be decided by the CPPC2 preferred cores for that power plan. All cores/L3s can snoop and maintain coherency like regular non-hybrid CPUs.
Windows can handle this well, and if you check your System event logs you can already see this in action. Surprisingly, there is a new patch to introduce CPPC2 scheduling in Linux for AMD processors, called amd-pstate.

[PATCH v5 22/22] Documentation: amd-pstate: add amd-pstate driver introduction - Huang Rui

The only strange thing with this guess is that it is a bit unexpected for AMD to do this at this stage. Or could they be planning multi-socket configs with different CPUs?
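
On the preferred-cores point: on Linux systems with ACPI CPPC support, each CPU exposes a `highest_perf` value under `acpi_cppc` in sysfs, and that is one place the per-core ranking shows up. A minimal sketch, assuming that directory is present on the machine; the function name and the choice to sort by `highest_perf` are illustrative, not how any scheduler actually consumes the data.

```python
import glob
import os
import re


def rank_cores(sysfs_cpu_root="/sys/devices/system/cpu"):
    """Return (cpu_number, highest_perf) pairs, best-ranked core first."""
    results = []
    pattern = os.path.join(sysfs_cpu_root, "cpu[0-9]*", "acpi_cppc", "highest_perf")
    for path in glob.glob(pattern):
        # Only look at the part after the root so the root's name cannot match
        match = re.search(r"cpu(\d+)", path[len(sysfs_cpu_root):])
        if match is None:
            continue
        try:
            with open(path) as f:
                results.append((int(match.group(1)), int(f.read().strip())))
        except (OSError, ValueError):
            continue  # unreadable or non-numeric entry, skip it
    # Sort by descending highest_perf, then by CPU number for stable output
    return sorted(results, key=lambda item: (-item[1], item[0]))


for cpu, perf in rank_cores():
    print(f"cpu{cpu}: highest_perf={perf}")
```

On a machine without the `acpi_cppc` directory this prints nothing.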

DisEnchantment · Dec 3, 2021

DisEnchantment said:
HW mitigations for a whole bunch of vulnerabilities like STIBP, IBRS, SSB, Upper Address, Secure TSC (SNP) and VMSA protection (SNP)... found in Volume 2 of PPR version 3.33

Reading the security bulletin related to CVE-2020-12966, I found it strange that VMSA protection was declared fixed with a microcode update which introduces a new feature flag. That is nuts; I was never aware you can add new CPUID flags using microcode.

Ajay said:
Hard to find info on this, but I wonder if it will be more like Intel's TSXLDTRK instructions in SPR. I assume it is TSX redesigned to prevent the side-channel attacks that plagued Intel's first two implementations.

Yes it is hard. But there is this patent if you wanna read something related to AMD HLE feature.

Processor with accelerated lock instruction operation

https://www.freepatentsonline.com/10949201.html

Ajay · Dec 3, 2021

DisEnchantment said:
I could hazard a guess that it indicates that all the cores are not the same (but could be fully feature compatible for example). Otherwise no other explanation.
I would suppose that if AMD's big.BIGGER hybrid cores are real, they can just use CPPC2 to let the OS manage them normally, like they do now for preferred cores, because all cores handle the same instruction set and are feature compatible; peak perf would simply be decided by the CPPC2 preferred cores for that power plan. All cores/L3s can snoop and maintain coherency like regular non-hybrid CPUs.
Windows can handle this well, and if you check your System event logs you can already see this in action. Surprisingly, there is a new patch to introduce CPPC2 scheduling in Linux for AMD processors, called amd-pstate.

[PATCH v5 22/22] Documentation: amd-pstate: add amd-pstate driver introduction - Huang Rui

The only strange thing with this guess is that it is a bit unexpected for AMD to do this at this stage. Or could they be planning multi-socket configs with different CPUs?

I'm a bit lost ATM, since I can't find the meaning of SMCA banks. Obviously related to machine checks. Everything I look up points me back to Linux kernel code.
Then we have 'different bank type layouts': physical layouts or logical layouts???

DisEnchantment said:
Yes it is hard. But there is this patent if you wanna read something related to AMD HLE feature.

Processor with accelerated lock instruction operation

https://www.freepatentsonline.com/10949201.html

Thanks, starting to go blind bouncing around Linux kernel code (with some useful info from Phoronix - god bless the guy who runs that site!).
Time to watch Formula 1 race practice or play a video game.

TBytemaster · Dec 3, 2021

SMCA is 'Scalable MCA'

This has some more background info regarding SMCA:

[PATCH] x86/mce/AMD: Allow Reserved types to be overwritten in smca_banks[] - Yazen Ghannam

Edit: Apologies if this reads as patronizing, I misread your post a bit. I'm also out of my depth here.

Ajay · Dec 3, 2021

TBytemaster said:
SMCA is 'Scalable MCA'

This has some more background info regarding SMCA:

[PATCH] x86/mce/AMD: Allow Reserved types to be overwritten in smca_banks[] - Yazen Ghannam

Edit: Apologies if this reads as patronizing, I misread your post a bit. I'm also out of my depth here.

Thanks. No problem. So, Machine Check Architecture. I found that in an architecture overview from the XenProject in 2009. I'll just toddle off and play with my rubber ducky now.

DisEnchantment · Dec 4, 2021

Found this tidbit of info related to the GMI3 "ultra high-speed xxGbps" interconnect (GMI3 is used in Zen 4) on LinkedIn. On-die Coil/X3D/TSV design for 32 to 64 Gbps SerDes PHY on TSMC 5/3nm nodes.

moinmoin · Dec 4, 2021

DisEnchantment said:
Surprisingly, there is a new patch to introduce CPPC2 scheduling in Linux for AMD processors, called amd-pstate.
[PATCH v5 22/22] Documentation: amd-pstate: add amd-pstate driver introduction - Huang Rui
The only strange thing with this guess is that it is a bit unexpected for AMD to do this at this stage. Or could they be planning multi-socket configs with different CPUs?

Thanks for the reference. The initial mail for that patch set even lists some performance-per-watt benchmarks showing that amd-pstate fares worse than the current acpi-cpufreq (only 'performance' is superior, but still below the current 'ondemand'), so that may have been a reason AMD felt no urgency to port CPPC2 support over (it did its job under Windows and wasn't necessary under Linux). I guess they are porting it now because support for it becomes more important in the coming CPU gens.
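
For anyone wanting to see which of the two drivers and which governor their own box is running before comparing numbers: a minimal sketch, assuming the standard cpufreq sysfs files `scaling_driver` and `scaling_governor` are available. The function name and the `sysfs_cpu_root` parameter are illustrative.

```python
import os


def cpufreq_info(cpu=0, sysfs_cpu_root="/sys/devices/system/cpu"):
    """Return the cpufreq driver and governor for one CPU (None where missing)."""
    base = os.path.join(sysfs_cpu_root, f"cpu{cpu}", "cpufreq")
    info = {}
    for name in ("scaling_driver", "scaling_governor"):
        try:
            with open(os.path.join(base, name)) as f:
                info[name] = f.read().strip()
        except OSError:
            info[name] = None  # file missing, e.g. no cpufreq support
    return info


print(cpufreq_info())
```

The driver field is where you would expect to see `acpi-cpufreq` or `amd-pstate`, and the governor field is where 'performance' or 'ondemand' shows up.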

Ajay · Dec 4, 2021

moinmoin said:
Thanks for the reference. The initial mail for that patch set even lists some performance-per-watt benchmarks showing that amd-pstate fares worse than the current acpi-cpufreq (only 'performance' is superior, but still below the current 'ondemand'), so that may have been a reason AMD felt no urgency to port CPPC2 support over (it did its job under Windows and wasn't necessary under Linux). I guess they are porting it now because support for it becomes more important in the coming CPU gens.

Wow, I wish we had the relevant emails for many of these new features - very helpful (source code headers are less often useful**). Nice to see code names like Raphael being used, we are past the anonymous 'next gen cpu' and the like.

** As a former developer, I should know

MadRat · Dec 4, 2021

So is this all chicken-egg paradox, or do both developers and engineers work together to anticipate future roadblocks?

DisEnchantment · Dec 6, 2021

uzzi38 said:
Seems like N3 potentially might be worth it if you're willing to put in the effort with DTCO?

I don't like how TSMC didn't include N7 and N5 DTCO charts here though, we've already seen how much of an effect it can have (RDNA1 -> RDNA2).

moinmoin said:
Is DTCO even something that can be compared at node level? From what I'm aware of, DTCO (Design and Technology Co-optimization) is a feedback process the customer has to be willing to apply during silicon design to make the most of the node. With today's costly nodes I rather have to wonder who still doesn't do that at least to some degree.

Moving this to here to avoid boring the non x86 folks. It seems the problem with DTCO is longer lead time from physical design to bring up.

Zen 2 took longer to reach market because AMD spent a lot of time optimizing their devices, metal layers, etc.; Radeon VII, on the other hand, was fairly quick.
Even then, full optimization did not happen until Zen 3, when AMD was able to extract almost 5 GHz from a process which was not intended to run beyond 4.2 GHz (from the N7 shmoo plot).
That is fairly obvious from how feeding more power to Zen 3 does not yield any meaningful perf gain. The cost, or rather the tradeoff, of this optimization is density and power.

Going forward I don't know if AMD (or any other CPU designer with super long pipelines/high-clocking designs) will continue to do this; they had better stick to tweaking a few knobs here and there, "optimized for HPC", but nothing more. (If you have not seen the TSMC data, the real N5 HPC flavor has 2x the leakage of standard N5.)
N5 should really help with clocks without needing super deep optimizations. The question is how high standard N5 would clock.
If AMD can land 5+ GHz without deep optimizations and tradeoffs, it would greatly help with density and power efficiency.
Going way beyond 5 GHz is not going to work, with heat density/thermal hotspots being a problem and parasitics degrading efficiency. (GAA is advertised to solve the parasitics problem eventually; topic for another time.)

AMD is moving to a new SAPR concept to reduce time to market after the high level design is done. It is faster to do architectural iterations with RTL simulation than to optimize during physical design
Also, I believe Zen 5 on N3 arriving fast (according to rumors, as early as 2H2023) is probably down to fewer process/device optimizations and extensive use of highly automated SAPR.
This should help with aligning launches to yearly OEM updates.

Zen 4 is therefore very interesting in this regard: it is going to give an idea of how high-clocking designs will look in terms of efficiency/density on upcoming nodes.
This slide is therefore very interesting: N7-->N5 (efficiency with perf gain) while 14LPP-->N7 (efficiency at same perf).

LightningZ71 · Dec 6, 2021

The above is why I am of the opinion that AMD has an opportunity to compete better here by having separate mobile and desktop/server products. AMD can focus on getting the logic right and in production on a new product quickly by pushing out a first iteration on the desktop and server, where power and efficiency aren't quite as crucial as they are in mobile. Then, follow up with a mobile design that has been iterated at the process level enough to have better power/efficiency characteristics than the desktop. Finally, in moving to the next generation of products, the desktop part can be mildly tweaked and iterated on the existing or slightly improved node to offer additional, more desirable SKUs for older platforms while the next generation product is pushed out on a new process node. We've seen elements of this in the recent past, but I wonder if that's their targeted cycle?

Ajay · Dec 6, 2021

DisEnchantment said:
Moving this to here to avoid boring the non x86 folks. It seems the problem with DTCO is longer lead time from physical design to bring up.

Okay, apparently I don't understand DTCO yet.

DisEnchantment said:
AMD is moving to a new SAPR concept to reduce time to market after the high level design is done. It is faster to do architectural iterations with RTL simulation than to optimize during physical design

This has always been the case. The change is that larger scale HPC systems can iterate faster and more accurately than in years past. Don't know what SAPR stands for.

moinmoin · Dec 6, 2021

DisEnchantment said:
AMD is moving to a new SAPR concept to reduce time to market after the high level design is done. It is faster to do architectural iterations with RTL simulation than to optimize during physical design

Regarding the latter sentence, isn't AMD doing architectural iterations with RTL simulation already? I'm sure there's always room to automate and optimize even more processes, but that's one step I thought AMD already did.

LightningZ71 said:
The above is why I am of the opinion that AMD has an opportunity to compete better here by having separate mobile and desktop/server products.

But that's already the case (currently APUs vs CPUs)?

Ajay said:
Don't know what SAPR stands for.

Synthesis Auto Place & Route

DisEnchantment · Dec 6, 2021

moinmoin said:
Regarding the latter sentence, isn't AMD doing architectural iterations with RTL simulation already? I'm sure there's always room to automate and optimize even more processes, but that's one step I thought AMD already did.

Yes, anybody designing some circuit will do simulation.

What I meant is that you can improve perf through the architecture by doing quick design iterations with RTL simulation (provided, of course, that you are running your device within the best range of the shmoo plot), rather than sitting and optimizing the physical design for a few extra hundred MHz while trading off efficiency and density greatly.
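
A rough back-of-the-envelope for why those last few hundred MHz cost so much: dynamic power scales roughly with C * V^2 * f. The sketch below assumes, purely for illustration, that near the top of the curve a 10% clock bump needs about 10% more voltage; that ratio is a made-up simplification, not measured data for any Zen part, and leakage is ignored.

```python
def relative_dynamic_power(freq_scale, volt_scale):
    """Dynamic power relative to baseline, using P proportional to V^2 * f."""
    return volt_scale ** 2 * freq_scale


# Assumed scenario: 10% more clock needs 10% more voltage.
power = relative_dynamic_power(1.10, 1.10)
print(f"{power:.3f}x power for 1.10x clock")
print(f"perf per watt: {1.10 / power:.3f}x")
```

Under that assumption the chip burns roughly a third more dynamic power for a 10% clock gain, which is the efficiency tradeoff in a nutshell.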
