Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

Vattila · Oct 6, 2019

Except for the details about the improvements in the microarchitecture, we now know pretty well what to expect with Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging looks to be the same, with the same 9-die chiplet design and the same maximum core and thread-count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides did also show that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).

What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts!

MarkPost · Dec 8, 2022

Carfax83 said:
Interesting. Of course, I have no way of verifying whether what you say is accurate. Pov-Ray has been around for decades. I remember Pov-Ray from the Pentium IV era, or even possibly before then.

AVX2 has been supported by AMD for a long time now as well, so I have no idea why it would be deactivated on Ryzen CPUs assuming what you say is true.

Doing a quick Google search seems to indicate that AVX2 acceleration IS SUPPORTED by Pov-Ray for AMD CPUs. There are even posts on PovRay.org that speak of AMD CPUs using AVX2.

At any rate, I'm not a programmer so I can't investigate this problem properly. But I do know that some compilers are faster than others due to better vectorization. So maybe the compiler you're using has better vectorized code output perhaps.

Just a guess.

No.

You can check it for yourself, go to the povray github repositorie, specifically here: povray/optimizednoise.cpp at master · POV-Ray/povray · GitHub

That file defines intructions to use. Here:

Hitman928 · Dec 8, 2022

Exist50 said:
Sure, but you can't explain 100W difference as run to run variation or any of the typical sources of statistical noise. 100W, either the CPU or GPU or both are responsible.

It's not unheard of for the same CPU used in two different motherboards to use 40 - 70 W more system power in one motherboard than the other with only the CPU loaded. So no, the CPU and GPUs don't have to be the responsible parties.

So why didn't HWUB actually investigate? There're plenty of applications that can tell you approximate numbers for the CPU, GPU, etc., and that should be plenty sufficient to diagnose the cause here. Instead they take the lazy way out and publish clearly flawed data without any effort to explain themselves.

I find it funny you calling them lazy when they consistently publish pieces with tests of far more games than the vast majority of other sites/channels and do not just rely on built in benchmarks. This takes an enormous amount of time, far more than most anyone else is willing to put in. Even if they had only done a limited amount of games, it's not lazy, it's sticking to their testing methodology between tests which is exactly what they should do. Reviewers don't have infinite time and again, it's not their responsibility to completely debug any results that seem off. It is their responsibility to ensure consistent testing methodology and to triple check all of their settings and methods to make sure nothing is wrong on their end. If they've done that (which they say they have) then they should publish. The community can give feedback and are free to try to replicate their results. I still haven't heard anyone point out what they've done that is obviously flawed. They just point to a number and think it's too big. If anything, my complaint would be that they are using system power consumption to compare CPU efficiency in the first place because of the large amount of variables you can't really control to say which CPU is more efficient in this instance. I typically ignore system consumption numbers for CPU tests for this reason.

They're the only ones that seem to have this particular issue. The scheduler isn't random, it's deterministic. If it's actually a bug with the scheduler it should be reproducible by others.

I'm sure it's not actually random, I wasn't using that word in a technical or strict sense.

But the exact cause of the issue is besides the point. You have a hypothesis right here - that it's a scheduling issue. It would be absolutely trivial for HWUB to disable E-cores, rerun the test, and make a conclusion on that hypothesis. Yet they couldn't be bothered to put in that minimal amount of effort?

I don't understand why people insist they're such a reputable outlet while defending their unwillingness to put even a minimal amount of effort into validating their results.

Maybe they will investigate further, I hope they do, but let's just say that it is a bug with the hybrid setup, does that make their results invalid just because they didn't figure out what was causing it? The only way the result is invalid is if they made a mistake with their setup or how they tested. If they've retested 5 times and had the same results, that's an honest effort to make sure that the issue isn't on their end. No one is perfect and everyone can make mistakes, even after retesting, but that doesn't mean we automatically assume it's their mistake and label their results invalid. JT tested a larger number of games and had similar overall results but no one seems to be calling out that channel. TPUs gaming summary rankings had some significant shifts over short periods of time and significant shifts again when they tested a larger set of games, but no one seems to find these things interesting. It's weird to me that people get so up in arms over which tests are valid because it shows one side or the other winning by a meaningless amount.

eek2121 · Dec 10, 2022

Carfax83 said:
Well I definitely agree with you about this. We don't have enough data to really come to a final conclusion about what happened. In the end it doesn't really matter. I've already unsubscribed from HWUB and if anyone asks me whether they are a quality reviewer I will tell them no. Quality reviewers don't publish anomalous results without investigating what, why, how and when until they come up with a reasonable explanation.

CapFrameX's twitter account posted performance numbers for Plague Requiem and they found poor performance on the 7950x compared to the 13900K in a populated area of the game. And since I own the game, I posted screenshots with MSI Afterburner of what I thought was the area in question, but it turns out the area that CapFrameX was referring to was in a much later chapter. The area that I investigated was in the early part of the game and I found out that it had occlusion culling issues that caused low GPU usage without spiking CPU usage, which ruled out a CPU bottleneck. I'm sure it was occlusion culling related as the camera angle had a strong impact on the framerate.

Anyway, I think I'm in the chapter now that CapFrameX was alluding to so when I get to the particular area, I will do a framerate analysis and see if there is any shenanigans going on that could explain the low performance on the 7950x.

https://twitter.com/x/status/1591802741265166336

I would not believe a word that comes out of that account. He/she claimed forever that they could not get DDR5 6000 working with the 7950x. They also claimed to have tried multiple kits. I brought up the fact that I had a 64gb kit working just fine, not only at DDR5 6000, but with tightened timings. Nothing except silence. Why does this matter? They clearly did something wrong on their end and failed to correct it, instead blaming AMD (they actually did blame AMD). That is when they lost credibility with me. One thing to note is that there are several options in the BIOS that can improve gaming performance depending on your setup, AGESA version, and whether you are single/dual CCD. For example, enabling NUMA mode for L3 on dual CCD helped raise my framerates (minimums and averages) by 15% or more in the games I've tested (one game went from 110-120fps with dips into the 80s to 190-210fps with the minimums at 135) Regardless of that, they've already started a negative RDNA3 spin by yelling at AMD for not providing telemetry for an unreleased card.

I'm not keen to buy that game, but if I do, I can run tests if anyone wants.

I am not saying they are wrong btw, but to me the credibility for that account is MLID or Adored level: less than zero. Shoot, they appear to be biased toward Intel (vs being neutral), so I will place them on an even lower rung on the ladder.

Kaluan · Dec 21, 2022

I saw someone in one of the Intel threads post some angry dude's twitter reply to HXL's (or whomever it was) post about the Phoronix article on impressive Zen4 AVX512 performance/efficiency... that Alder Lake is just as impressive, coming short of calling him a fanb...

First off, Alder Lake doesn't effectively have AVX512 at all (Alder Lake, not Sapphire). You have to go through 100 hurdles to activate it and when you do, you lose 1. Your E cores and 2. Any bug fixes and benefits the newer BIOS/microcode bring.

And secondly, it's simply not true:

Phoronix also did some testa, but only of very specific AVX512 CPU Mining (yuck), and even there power usage exploded w/ 8 core AVX512 and didn't bring any sort of power/performance benefit in half the sub-tests.

Sad that some still wanna die on that hill.

But hey, on the "upside" I can't wait to have my ears blown off by the constant "Zen4 doesn't have AMX!!1" now 😅

Hitman928 · Jan 21, 2023

HWUB just tested the Intel 13500 with DDR4 and DDR5. As expected, the DDR5 config shows higher performance (11%). It still trails the AMD 7600 by about 10% but using fast DDR5 cuts the gap in half. Just data that shows the previous reviewer being ignorant in dismissing DDR5 as having no real performance advantage. The DDR4 system does have a small perf/$ advantage, but I don't know if $65 is really enough to not justify stepping up to DDR5 and higher performance. Good to have data for both though so the consumer can make an informed decision.

Mopetar · Feb 25, 2023

Eh, just use Factorio to get an idea of how those games will perform. That's another game that really scales well with the v-cache and it's pretty easy to design a replicable benchmark with it.

Joe NYC · Feb 25, 2023

DAPUNISHER said:
That would be red meat for the haters. They'd go HAM screeching about cherry picking. However, in a 50+ game suite, having a subset devoted to cache friendly titles makes sense.

Agreed. Some of the CPU intensive simulations would be great addition to the mix.

Whenever I see a "Gaming review" and everything is about FPS, I just reach for the bucket under my desk, just in case the review makes me throw up.

DAPUNISHER · Feb 26, 2023

DrMrLordX said:
Well yes, but as you said, they can include it in their media material along with all the other titles they already use. Not that AMD is currently using 50+ games to showcase their hardware.

Keep in mind that such performance metrics are a bit on the niche side. You probably have more people playing Fortnight on PC than every turn-based strategy game released in the same timeframe. Yes some people are going to want to know what produces the fastest turns in Europa Universalis or Stellaris, but not very many.

There we go with the niche claims again. How niche, and what does that even mean? A game not showing up on the Steam top played games doesn't mean there isn't a big number of players worldwide.

For example: WoW still has something like 8.5 million monthly subscribers. It is difficult to assess games available through game pass too. MS Flight Sim had an estimated 10 million players across all platforms in December. That's why it should be in every major review. Yet it isn't. Dispensing with the "I can't log in!" excuse, it's because it adds time to the testing process. Nonetheless they did it before "I can't login derp!" That it favors Intel by not having it appear is a nice ancillary benefit for them however. Those average game charts lacking most of the biggest wins for 3D cache skews things.

If reviewers wanted to be logically consistent about why they choose games for their test suites, they'd include more "niche" titles. Because again, every type of game is a niche. Arguing any given title is an outlier due to its small player base also rings hollow. The leaked bar graphs for the new Zen 4 3D have Tomb Raider and Far Cry 5. Rise of the Tomb Raider averaged 800 players on Steam the last month. Assetto Corsa averaged 10 times that, let me type that again, 10 times. Yet which is the staple of review suites?

Not testing titles like MSFS, Assetto Corsa, Factorio, Stellaris, or Planet Coaster over other games no one plays like Tomb Raider, is a bummer. Limit the scope, limit the findings. Which creates doubt concerning the reviewers desire to truly explore the reaches and effects of the new cache technology. Otherwise they'd include more non gaming tests that are cache sensitive, as suggested by others here. And include more cache sensitive games as I have suggested.

I'll end my remarks by saying it is beneficial to at some point, spend the time on those titles for the loyal fan bases they have if nothing else. While serving the dual purpose of investigating the performance, which is a big part of their mission statements. And it certainly isn't impossible to do. Here is Wendell testing the 3D cache in Stellaris, Factorio, and Dwarf Fortress. Spoiler, it made the contemporary Intel CPU look pedestrian.

Note for the anticipated responses about scientific methodology and repeatable results. They have plenty of those already. A bit more real gameplay like in Stellaris and Planet Coaster would be most welcomed.

NTMBK · Oct 6, 2019

My money is on no SMT4.

soresu · Nov 28, 2019

beginner99 said:
I think for Ryzen the socket will stay the same and will remain on ddr4. For epyc socket will probably change and move to ddr-5. That's the beauty of the IO die.

Could be, could be that TR4 will go DDR5 leaving the mainstream on DDR4 until prices come down from crazy town.

Tuna-Fish · Dec 7, 2019

soresu said:
You have to wonder what was in it for AMD in the first place

Survival.

Running fabs is ridiculously expensive, especially when you need to spend several billion every two years or so to upgrade. AMD was already heavily in debt from upgrading them the last time, and then Opteron revenue crashed and it looked like the options were bankruptcy and selling the fabs. The only interested buyer they could found was ATIC, but they wanted assurances that they will have customers in the future. Hence, the WSA.

soresu · Oct 27, 2020

eek2121 said:
The desktop chip will probably be 6nm with a new 7nm IO die. Higher clocks and IPC along with new instructions.

No we already know from AMD PR that it is at least some variant of 5nm for one die, likely the CCD.

The exact process is likely to be N5P I think.

inf64 · Nov 14, 2020

That "leak" on reddit is a complete BS.

randomhero · Dec 1, 2020

Several pages ago I've speculated about ST performance of Zen4.
I changed my mind after reading some of posts. I am of belief that with 5 nm clocks will go down. What I think I got wrong was that IPC would not compensate enough.
What I have forgotten is that new processes could cut down latencies CCD wide and what I have forgotten even more is advanced packaging. They can get higher inter CCD bandwidth also, from 32 bit to 64 bit per lane per cycle. They could get rid of of SerDes links on package and go wide with some sort of the silicon bridge(full interposer or silicon interconnect bridge) further improving on package bandwidth and possibly latencies as well.
After seeing what have they accomplished with Zen3, how they optimised design to extract as much as possible from limited resources(transistor performance,execution width, etc.), Zen4 on 5nm could come as quite of shock to industry regarding gen to gen uplift in performance.
I have definitely missed a metric ton of things that could be done to improve design from Zen3 to Zen4 since my knowledge on the subject is shallow as pudlle. Thankfully, that's where you guys(and galls!) come in! 🙂

scannall · Dec 1, 2020

GPU's and consoles are staying at 7. Only the chiplets are moving to 5. I wouldn't worry too much about supply. More worried that they will order an appropriate number of wafers.

Gideon · Dec 2, 2020

Ajay said:
FWIW, revenues for N7, and especially N5 are disproportionate to wafer volume due to their much higher cost.

I can't figure out what the heck Warhol would be on N7 since it's still Zen3, unless it's on some modified N7 process like EUV for a bit more performance. If Warhol is for real, then AMD is taking longer developing Zen4, or needs to wait longer till N5P is up and running in volume.

Yeah i even mentioned the fact that this is disproportional.

Regarding Warhol, my guess its just the same Zen3 chiplets on AM5, new packaging (possibly 2.5D) and new I/O die (with DDR5 support). Extracting more performance from the I/O and uncore side.

EDIT:
Spelling (sorry I tend to write type-messes on my phone)

dr1337 · Feb 28, 2021

uzzi38 said:
That was because AMD increased the number of cores per CCD from 6 -> 8 for Zen 2.

This doesn't make any sense lol. zen 2 was always going to be two ccxs of four cores each... Its functionally similar to zen1 just with all of the IO moved off die. There was never a six core chiplet design, outside of speculation/fake leaks/rumors.

fleshconsumed · Mar 1, 2021

Panino Manino said:
At the beginning of the Zen/Ryzen saga I was very excited, but now? Everything seems so excessive.
Wish we could go back to simpler times.

Back to simpler times of paying $400 for 4c8t? Years of 2-4% performance gain year over year? No thanks. I'll take my 16 cores without having to pay through the nose.

Makaveli · Apr 9, 2021

LightningZ71 said:
It really depends on what they manage to do with Zen3+. If it's like the 3000XT processors where it's just a few percent more Mhz, then it's not going to compete as well. If, instead, they can get both a few hundred Mhz of all core and boost, and also improve IPC by a few percent, then it'll be a different story. Personally, I think that AMD needs to do what they can to improve the all-core clocks the most as it appears that Zen3 was a nice single thread boost, but wasn't quite as much of an improvement in multi-thread scenarios. Given that N6 isn't much of a density improvement, leaning to the clocks/power side of the curve may make the most sense.

I thought the boost in all core clocks was pretty good for me going from a 3800X to 5800X

The Zen 2 chip would hit about 4.25Ghz on full load.

While the Zen 3 chip does about 4.55-4.6Ghz full load.

mikk · May 6, 2021

DrMrLordX said:
Genoa's shipping already, so. Unless you think Raphael is going to be radically different . . . you have your answer.

Do you have a source?

yuri69 said:
It was supposedly taped-out approximately in August 2020. So yeah, it's been design complete for a while. They are iterating on spins to get it ironed out for mass production.

This can't be true because AMD told that Zen4 is in design last October. Don't trust every rumor.

mikk · May 7, 2021

Hitman928 said:
Was this from an earnings call or some kind of presentation? Was it Lisa Su or someone else from AMD?

Edit: As a note, I went looking for this but couldn't find it. The closest I found was Mark Papermaster saying Zen 4 was in design phase but it was from Oct. 2019.

It was from the Zen 3 announcement stream October 2020:

On Zen3 AMD told us they have completed the design: https://www.tweaktown.com/news/67008/amds-new-zen-3-design-complete-4-ready-2021/index.html

We haven't heard something like this for Zen 4.

soresu · May 24, 2021

uzzi38 said:
GPUs certainly don't need PCIe Gen 5 any time soon unless 1080p 1800fps or something else stupid like that picks up traction, so really only SSDs are left.

I could see it playing a role in DirectStorage access from M2 SSD's to the GPU when UE5 facilitates more and more insane bandwidth requirements for virtualised geometry and textures in future games.

What's comfortable for a highly optimised platform like PS5 and XSX/S could require more BW for heavier scenes on Windows 10 PCs.

uzzi38 · May 27, 2021

exquisitechar said:
Didn't notice the image at first. Yeah, that seems wrong.

Has MLID ever even gotten anything AMD related right that wasn't already leaked by someone before him?

I don't know, and to be frank, I don't entirely care. He's said enough bollocks for me to know that he's more than happy to either make stuff up or trust things from absolutely anyone.

You guys remember this one? I know who made it up. Hell a couple of things on their I contributed to. It was total BS designed to see who would take the bait without attempting to verify their sources, and MLID posted it in a video within just hours of receiving it.

DisEnchantment · May 31, 2021

Gideon said:
Yet the 4-wide decoder is a serious bottleneck (also according to agner) and there is a limit of how much uop cache entries you can add. If they ever want to go wider, back to~3.5 Ghz and work up from there (which IMO they should, considering the future of process nodes) they have to figure something out or they can't widen their architecture.

I would really like to see another "zen" moment from AMD, widening cores and lowerin clock-speeds. Now obviously they can't replicate the increase Zen did as their previous core is kickass not a failure, but something between 30-40% IPC is totally doable IMO (M1 is a reference).

Say increasing IPC 35% and reducing clocks 20% might regress single-threded perf by a few % compared to "doing another zen3" (e.g adding ~20% IPC at 5 Ghz) but it will be way better across the stack. When designed right, real-life MT performance would go up about ~30% at roughly the same power consumption (due to the nature of the power/frequency curve) . That means way higher real-life performance in laptops, servers and HEDT (all of which are the main "money-makers").

All of this provided they execute well (which is hard!). If they do it like Samsung, they'll just waste power and die-area.

Yeah, 250W gaming PCs might not benefit much, but the reality is that once they have this new 3.5-4Ghz architecture it will eventually go up to 5Ghz again (as Core and Zen did).

Probably one of the reasons why the patent aims to keep uop cache size in check by virtualizing it. With the goal of having very less decoding in the first place.

https://www.freepatentsonline.com/y2021/0149672.html

I think we can expect a clock gated decoding unit ( most likely not even a complex unit ) to be added but the actual goal is to have them execute out of the virtualized uop cache.
This is one advantage ARM has over x86, decoding does not take much power, they can just keep adding without a severe power penalty

Kepler_L2 said:
Going wide with lower clocks is the way.

This depends on whether the bean counters will allow overly significant die size balloning. Cost per wafer is the same but the dies per wafer is different. For non fully integrated company a very tough choice to make.
Also every Zen generation we see conservative but slightly wider/improved backends but the front end remains largely same. Not saying front end will remain same, but to be realistic for an x86 core expect a very conservative increase and very measured approach here

Regarding the MT efficiency, some patents and research have indicated that the MT is resulting in the sharing of a number of important queues, and aims to put in place a mechanism to allow a thread in a core to compete for resources to be able to access more of them to be able to effective improve performance

https://www.freepatentsonline.com/y2021/0096914.html

DisEnchantment · Jun 13, 2021

uzzi38 said:
I dunno the differences between how the two worked specifically, I haven't seen anything to really explain the exact implementation. It's possible AMD's implementation may not run into whatever issues Apple's did. But... It does worry me that Apple of all companies dropped the idea.

Its a patent application, there is no reason to 'worry' .
We are in a technical speculation thread discussing likely possibilities (albeit not blue sky fantasy)
If anything, AMD are being very pragmatic in their approaches. Also note there is no prior art in the application.

Since I generally do not follow anything Apple, I would only like to comment on the ARM big.LITTLE.
The cores are independent and they are asymmetric SMP cores, coherent at L2.
Disabling a small core when migrating the load to a big core makes no big difference if your CPU is not power constrained at all (e.g. if you are on cutting edge nodes), maybe if you were on an older node perhaps and power/thermal envelope is/was an issue.

AMD's application is for cores that have their register file bridged. In this setup they cannot run in parallel (for obvious reason that they use the same register file or the register files are bridged)
The patent application goes as far as saying that during migration, the small core stalls and the execution resumes from the last program counter in the big core.
This is way beyond what the OS can handle. OS manages at TCBs and PCBs not instruction level.

The small core could in fact be the same big core with much more granular power/clock gate-able blocks
And most certainly not a separate die for such small cores. It would be just a few mm2 without the L3, see Tremont.

Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)

Senior member

Senior member

Diamond Member

Diamond Member

Senior member

Diamond Member

Diamond Member

Platinum Member

Super Moderator CPU Forum Mod and Elite Member

Lifer

Platinum Member

Golden Member

Platinum Member

Diamond Member

Member

Golden Member

Golden Member

Senior member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Platinum Member

Platinum Member

Golden Member

Golden Member