Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
805
1,394
136
Apart from the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread-count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts!
 
Last edited:
Reactions: richardllewis_01

Hans de Vries

Senior member
May 2, 2008
321
1,018
136
www.chip-architect.com
Only the Chips&Cheese link is there, but the actual value of L3 is known only at runtime, since CPUID does not have a fixed value to read out. So we won't know until the chips are in hand and read out by software.

Yes I found that as well:


Indeed, it shows XXXX for the L3 cache size:

One can do a calculation with the following data they give:
L3 cache number of bytes per cacheline = 64
L3 cache number of ways = 16
L3 cache number of sets = XXXX

For the latter, they have two defined values, 16384 and 32768, and the rest is reserved.
BTW: This is the same as in the new manual for the B2 stepping.....

So that would be 16 MB and 32 MB, but we know that the B2 stepping can also have 96 MB, a value which is not (yet) in the manual.
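For what it's worth, a quick back-of-the-envelope check of those numbers (size is just line bytes × ways × sets; the 98304-set case is my own extrapolation of what 96 MB would imply, not a documented value):

```c
#include <stdio.h>

/* Rough L3 size check from the manual parameters quoted above.
   line_bytes and ways are the documented values; 16384 and 32768 sets
   come from the manual, 98304 is only what a 96 MB L3 would imply. */
static unsigned long l3_bytes(unsigned long line_bytes,
                              unsigned long ways,
                              unsigned long sets)
{
    return line_bytes * ways * sets;
}

int main(void)
{
    const unsigned long sets[] = { 16384, 32768, 98304 };
    for (size_t i = 0; i < sizeof sets / sizeof sets[0]; i++) {
        unsigned long bytes = l3_bytes(64, 16, sets[i]);
        printf("sets=%lu -> %lu MiB\n", sets[i], bytes >> 20);
    }
    return 0;
}
```

So a 96 MB L3 would need a 98304-set encoding, which fits the observation that the value isn't in the manual yet.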

What is strange is that both Zen 3 and Zen 4 are family 19h; only the model numbers differ.
Zen 2 was family 17h.
Zen 3 is family 19h with many model numbers (01h, 21h, 51h)
Zen 4 is family 19h with model number 10h.

The Zen 4 manual from the Gigabyte leak has 9 volumes, with all the AVX512 stuff in it, the 1 MB of L2 cache and so on.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,684
6,227
136
Zen 4 is family 19h with model number 10h.
Isn't F19H Model A0h-AFh also Zen4? This one should be Raphael I think.

The Zen 4 manual from the Gigabyte leak has 9 volumes, with all the AVX512 stuff in it, the 1 MB of L2 cache and so on.
If you wouldn't mind me asking, what is the value of this reg, CPUID Fn8000_001A_EAX --> Bits [31:3]

I found a more precise value for GMI3

Current GMI2 is max 25Gbps, so max FCLK is 2500 MHz. This is transfer per lane, and the number of lanes could be different for GMI3. (GMI2 has 39Rx+31Tx lanes, 10 bits per transfer)

GMI3 is around 32-36 Gbps, around 30-40 percent higher transfer rate than GMI2. (Don't know if same number of bits per transfer)
But FCLK higher than 3000MHz should be easy considering DDR5-6000+ already planned for SP5 derivatives
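Just to make that arithmetic explicit (this only restates the assumption above that the per-lane rate divided by 10 bits per transfer gives FCLK; the GMI3 rates are the speculated ones, not confirmed):

```c
#include <stdio.h>

/* FCLK implied by a per-lane transfer rate, assuming 10 bits per
   transfer as on GMI2. The GMI3 rates are the speculated 32-36 Gbps. */
static double fclk_mhz(double gbps_per_lane, double bits_per_transfer)
{
    return gbps_per_lane * 1000.0 / bits_per_transfer;
}

int main(void)
{
    printf("GMI2 25 Gbps  -> FCLK %.0f MHz\n", fclk_mhz(25.0, 10.0));
    printf("GMI3 32 Gbps? -> FCLK %.0f MHz\n", fclk_mhz(32.0, 10.0));
    printf("GMI3 36 Gbps? -> FCLK %.0f MHz\n", fclk_mhz(36.0, 10.0));
    return 0;
}
```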
 
Last edited:

yuri69

Senior member
Jul 16, 2013
435
717
136
B2 Genoa with 96MB L3?

AMD model family seems to change along with the overall "core complex" topology change.

Anyway, Zen 4 has already multiple known model numbers:

* 10h = Stones = Genoa = EPYC 7004
* 60h = Raphael = Ryzen 7000 "CPU"
* 70h = Phoenix = Ryzen 7000/8000 APU
* A0h = ? = Bergamo = EPYC 700?
 

Ajay

Lifer
Jan 8, 2001
16,094
8,104
136
If you wouldn't mind me asking, what is the value of this reg, CPUID Fn8000_001A_EAX --> Bits [31:3]


AMD64 Architecture Programmer's Manual V1-5: https://www.amd.com/system/files/TechDocs/40332.pdf


Also, you had posted something from LinkedIn that indicated future GMI links had been modeled up to 64 Gbps. The image you posted indicates that GMI will be pegged at a max of 36 Gbps, including TSMC 3 nm based CPUs. Curious.
 

DisEnchantment

Golden Member
Mar 3, 2017
1,684
6,227
136
AMD64 Architecture Programmer's Manual V1-5: https://www.amd.com/system/files/TechDocs/40332.pdf
I wanna see it from the Zen4 leaked manual

Also, you had posted something from LinkedIn that indicated future GMI links had been modeled up to 64 Gbps. The image you posted indicates that GMI will be pegged at a max of 36 Gbps, including TSMC 3 nm based CPUs. Curious.
Indeed it was, but I think they are developing the 3 nm and 5 nm tech back to back. And both of the people mentioned said the same.
I believe GMI3 should be around 32-36 Gbps, assuming 10 bits/transfer and FCLK of 3200-3600 MHz. Nice 1:1 FCLK ratio with DDR5-6400+.
Not a radical change from Zen2/GMI2.

A 64 Gbps transfer rate sounds like a huge jump which a Zen3-derived core might not be able to make use of.
The whole thing makes Greymon55's rumor of Zen5 on 3nm a bit more believable.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,104
136
I wanna see it from the Zen4 leaked manual
Oh, Duh!

A 64 Gbps transfer rate sounds like a huge jump which a Zen3-derived core might not be able to make use of.
The whole thing makes Greymon55's rumor of Zen5 on 3nm a bit more believable.

And Zen5 is a new core, IIRC. The IOD for forward looking higher speed GMI links must be pretty epic for Epyc CPUs.
Ha! I crack myself up
 

Hans de Vries

Senior member
May 2, 2008
321
1,018
136
www.chip-architect.com
I wanna see it from the Zen4 leaked manual


This is Genoa: the clue is here above the "Bits" column, where it says _ccd[11:0] for the 12 CCDs of Genoa.
This manual has all the AVX512 stuff as well. You were probably looking for a similar 512-bit "full width" flag.
It is still 'reserved' here in this version. Can go either way...
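For anyone wanting to poke at that leaf on real hardware, a minimal sketch (GCC/Clang on x86). On shipping Zen 3 parts, Fn8000_001A_EAX only defines bits 0-2 (FP128/MOVU/FP256); everything from bit 3 up reads as reserved, which is exactly the field being watched for a possible FP512-style flag:

```c
#include <stdio.h>
#include <cpuid.h>

/* Dump CPUID Fn8000_001A EAX and the reserved bits [31:3] discussed
   above. Bits 0-2 are the documented FP128/MOVU/FP256
   performance-optimization flags; anything above is reserved today. */
int main(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    if (!__get_cpuid(0x8000001A, &eax, &ebx, &ecx, &edx)) {
        puts("CPUID leaf 0x8000001A not supported");
        return 1;
    }

    printf("Fn8000_001A EAX = 0x%08x\n", eax);
    printf("  FP128 (bit 0) = %u\n", eax & 1);
    printf("  MOVU  (bit 1) = %u\n", (eax >> 1) & 1);
    printf("  FP256 (bit 2) = %u\n", (eax >> 2) & 1);
    printf("  bits [31:3]   = 0x%08x\n", eax >> 3);
    return 0;
}
```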
 
Last edited:

Hans de Vries

Senior member
May 2, 2008
321
1,018
136
www.chip-architect.com
I found a more precise value for GMI3
GMI3 is around 32-36 Gbps, around 30-40 percent higher transfer rate than GMI2. (Don't know if same number of bits per transfer)
But FCLK higher than 3000MHz should be easy considering DDR5-6000+ already planned for SP5 derivatives

It won't go much beyond 32 Gbps for PCIe gen 5, but if they upgrade to PAM4 signaling for PCIe gen 6 then it should reach the 64 Gbps you mentioned.
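A quick illustration of the PAM4 point (PCIe rates here; whether GMI adopts the same signalling is the speculation above): the symbol rate stays at gen 5 levels and the bits per symbol double.

```c
#include <stdio.h>

/* Raw per-lane rate = symbol rate * bits per symbol.
   NRZ carries 1 bit/symbol, PAM4 carries 2; PCIe 6.0 keeps the
   32 GBd symbol rate of gen 5 and doubles the bits per symbol. */
int main(void)
{
    double symbol_rate_gbd = 32.0;
    printf("NRZ  (gen 5): %.0f Gbps per lane\n", symbol_rate_gbd * 1.0);
    printf("PAM4 (gen 6): %.0f Gbps per lane\n", symbol_rate_gbd * 2.0);
    return 0;
}
```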
 

MadRat

Lifer
Oct 14, 1999
11,923
259
126
They probably use 3 bits to define L3 cache. Maybe something like:

000 = no L3
100 = 16 MB
010 = 32 MB
011 = 96 MB

...and this probably means 110 (40 MB), 001 (64 MB), 101 (80 MB), and 111 (112 MB) are perfectly valid options.
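Purely as an illustration of that hypothetical 3-bit encoding (the bit patterns and sizes are the guesses above, not anything from the PPR), the decode would just be a lookup:

```c
#include <stdio.h>

/* Hypothetical decode of a 3-bit L3-size field, using the guessed
   mapping from the post above. None of these encodings are
   documented; this is speculation made executable. */
static const unsigned int l3_mb_by_code[8] = {
    [0] = 0,    /* 000: no L3           */
    [4] = 16,   /* 100: 16 MB (guess)   */
    [2] = 32,   /* 010: 32 MB (guess)   */
    [3] = 96,   /* 011: 96 MB (guess)   */
    [6] = 40,   /* 110: 40 MB (guess)   */
    [1] = 64,   /* 001: 64 MB (guess)   */
    [5] = 80,   /* 101: 80 MB (guess)   */
    [7] = 112,  /* 111: 112 MB (guess)  */
};

int main(void)
{
    for (unsigned int code = 0; code < 8; code++)
        printf("code %u%u%u -> %3u MB\n",
               (code >> 2) & 1, (code >> 1) & 1, code & 1,
               l3_mb_by_code[code]);
    return 0;
}
```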
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,419
1,749
136
I presume this is a matter of data width/packet size rather than pure latency, which appears to be ~10ns for both 6.0 (https://pcisig.com/sites/default/files/files/PCIe 6.0 Webinar_Final_.pdf) and DDR4 (and future DDR5)?

DDR4/5 has a separate out-of-band channel for commands/addresses, which it can use in parallel with the data bus. PCIe is a packet-based system where the equivalent data is placed into packet headers, and travels in the same link as the data. This, and minimum packet sizes, makes the practical latency that the bus adds much greater than that of DDR4/5.

Also note that you cannot use the latency-add numbers there to directly compare to practical DDR4/5 memory latencies, because most of the latency in DRAM comes from the time it takes to actually access the memory arrays, not from the bus. With CXL.memory, you have to first pay the latency of getting a request to the memory device, then the same memory array access latency as with any other DRAM, and then the latency of getting the response back. When doing random access to DDR, the request is sent in two parts, the first of which selects the row (which is typically the slowest operation), and it only takes iirc two bus cycles for the row activate action to actually start. Then the CAS can be done to coincide with the row activate completing. PCIe is several nanoseconds in the hole already at this point, because they have to put the entire memory operation into a single packet and transmit the whole thing through a narrow pipe, which means it will take many cycles to arrive.
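To put rough numbers on the DRAM side of that (nominal timings only; controller and fabric overhead come on top): unloaded random-access latency is roughly tRCD + tCAS, and for DDR the clock period in ns is 2000 / (data rate in MT/s).

```c
#include <stdio.h>

/* Unloaded DRAM random-access latency: (tRCD + CAS) clocks times the
   clock period. DDR transfers twice per clock, so
   tCK(ns) = 2000 / data_rate(MT/s). Example timings only. */
static double dram_latency_ns(double mt_per_s, int trcd, int tcas)
{
    double tck_ns = 2000.0 / mt_per_s;
    return (trcd + tcas) * tck_ns;
}

int main(void)
{
    printf("DDR4-3200 CL14-14: %.1f ns\n", dram_latency_ns(3200, 14, 14));
    printf("DDR4-3200 CL22-22: %.1f ns\n", dram_latency_ns(3200, 22, 22));
    printf("DDR5-6000 CL36-36: %.1f ns\n", dram_latency_ns(6000, 36, 36));
    return 0;
}
```

Tight DDR4 kits land around the ~17 ns floor quoted further down; JEDEC-timed parts are closer to 25-30 ns before any controller overhead.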
 
Last edited:

Thibsie

Senior member
Apr 25, 2017
808
885
136
Also, there's no DRAM on the planet that does 10ns latency. The unloaded latency of DDR is Trcd + Tcas, and the lowest I've ever seen that go is ~17ns.

Thanks for this.
What is the practical latency PCIe could provide (by itself I mean) ? That could give us a slight idea...
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
DRAM has the obvious disadvantage that traces have to be short and electrically tight, in places where there is a huge premium on space. And it's not like large-capacity DRAM is fast: registered and load-reduced DIMMs already add quite some latency.

The future will have multiple tiers of memory, and we are already moving towards that. I think the OS is not ready, yet vendors are coming out with at least 2-3 tiers already: on-package HBM, regular DDR5 DIMMs, and Optane DIMMs.
It does not take a huge leap of faith to add CXL-style memory somewhere in this tiered memory hierarchy. Latency of CXL-attached RAM should still sit between DDR and Optane; the question of price efficiency is a different one though.

The real challenge is getting apps/OS to properly use it. You would not want your JVM heap tier garbage collection on the outer tiers, and you don't want your OS file cache full of served web files on HBM either. It's complicated to manually manage and place, not speaking about any OS level automagic solution where you have multiple apps, each with own expectations.
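On Linux today, the closest manual knob is that CXL- or Optane-backed memory shows up as its own (usually CPU-less) NUMA node, so placement can be done explicitly with libnuma. A minimal sketch under that assumption (the node numbers are made up and machine-specific):

```c
#include <stdio.h>
#include <numa.h>   /* link with -lnuma */

/* Illustrative tier placement: put a latency-sensitive buffer on the
   local DRAM node and a large, colder buffer on a far node (e.g. a
   CPU-less node backing CXL/Optane memory). Nodes 0 and 2 are
   assumptions for illustration; query the real topology first. */
int main(void)
{
    if (numa_available() < 0) {
        puts("NUMA not available on this system");
        return 1;
    }

    size_t hot_sz  = (size_t)64 << 20;  /* 64 MiB, latency sensitive  */
    size_t cold_sz = (size_t)4  << 30;  /* 4 GiB, capacity over speed */

    void *hot  = numa_alloc_onnode(hot_sz, 0);  /* local DRAM node  */
    void *cold = numa_alloc_onnode(cold_sz, 2); /* far/CXL-ish node */

    if (!hot || !cold) {
        puts("allocation failed");
        return 1;
    }

    /* ... application would place its working sets accordingly ... */

    numa_free(hot, hot_sz);
    numa_free(cold, cold_sz);
    return 0;
}
```

Which only underlines the point: nothing here is automatic; every app has to know the topology and make these calls itself.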
 
Reactions: Tlh97 and dnavas

dnavas

Senior member
Feb 25, 2017
355
190
116
Also, there's no DRAM on the planet that does 10ns latency. The unloaded latency of DDR is Trcd + Tcas, and the lowest I've ever seen that go is ~17ns.

Yeah -- that's ~10ns for my purposes (I was aiming at order of magnitude), but here's a good source:

Note that the specified latency seems to be CAS-only, and the speeds are JEDEC-only. I'm not sure that always including Trcd is the best way to do a PCIe vs RAM latency comparison (it wasn't done in this article for strict RAM-vs-RAM), but I hope it's clear that the ways in which RAM-over-PCIe and local RAM should be accessed ought to be different, given the nature of the buses involved. Once the bandwidth of PCIe approaches DDR* territory, the question to ask (I think) is how small a chunk of memory you can access in a similar timeframe. In that case, including Trcd makes a lot of sense (because you shouldn't be talking about streaming/contiguous access at that point). I assume that's where you're coming from.

DDR4/5 has a separate out-of-band channel for commands/addresses, which it can use in parallel with the data bus. PCIe is a packet-based system where the equivalent data is placed into packet headers, and travels in the same link as the data. This, and minimum packet sizes, makes the practical latency that the bus adds much greater than that of DDR4/5.

The minimum TLP for PCIe 6 is 0 DW; however, FEC works on a fixed size, and I believe PCIe 6 sends 256-byte packets (in FLIT mode, with a 20-byte header, if I'm reading the spec correctly).

[...]Then the CAS can be done to coincide with the row activate completing. PCIe is several milliseconds in the hole already at this point because they have to put the entire memory operation into a single packet, and transmit the whole thing through a narrow pipe which means it will take many cycles to arrive.

Right, so here's where my theoretical understanding ends -- how is 256 bytes over a 64GT/s link "several milliseconds"? Obviously putting the (much smaller than 256byte) request into a single packet is going to stink from an efficiency standpoint, but surely we're not just sticking around until we have enough TLPs to stuff into a "full-ish" request? Also, this doesn't really track with network latencies which are talking up sub-microsecond latencies. I'm happy to attribute some level of bs to marketing, but nearly four orders of magnitude is a little hard to swallow without a few details.
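For scale, the raw serialization time alone (ignoring FEC, encoding and protocol overhead, and assuming a x16 link) works out to single-digit nanoseconds, which lines up with the correction below to "nanoseconds":

```c
#include <stdio.h>

/* Time to clock a 256-byte FLIT onto a PCIe 6.0 x16 link at the raw
   64 GT/s per-lane rate. This ignores FEC, encoding and protocol
   overhead, so it is only a lower bound on the wire time. */
int main(void)
{
    double gt_per_s  = 64e9;          /* per lane        */
    int    lanes     = 16;
    double flit_bits = 256.0 * 8.0;   /* 256-byte FLIT   */

    double link_bps  = gt_per_s * lanes;
    double time_ns   = flit_bits / link_bps * 1e9;

    printf("x16 @ 64 GT/s: 256 B FLIT ~ %.1f ns on the wire\n", time_ns);
    return 0;
}
```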

As always, changes are going to provide for new opportunities. I personally believe a programmable Xilinx unit fronting a memory pool would be very helpful to data mining use-cases, but we'll have to see how all of this evolves. Clearly machine-remote memory pools are even further removed from a latency perspective.
 

Doug S

Platinum Member
Feb 8, 2020
2,479
4,035
136
The real challenge is getting apps/OS to properly use it. You would not want your JVM heap tier garbage collection on the outer tiers, and you don't want your OS file cache full of served web files on HBM either. It's complicated to manually manage and place, not speaking about any OS level automagic solution where you have multiple apps, each with own expectations.

The OS can't manage this, things happen FAR too quickly to manage this in software. It will have to be done in hardware and managed like layers of cache currently are since the latencies are closer to that than they are to HDD latencies that storage tiering was designed around. There is no room for smarts as can be done for storage tiering, where slower storage devices leave the OS plenty of time to "think".

The only shot at software management would be a dedicated app, like if you were running Oracle on something it would know what can tolerate higher latencies and what cannot and maintain separate memory pools for the different types of "RAM". This is not something any general purpose apps will ever be written to do.
 
Reactions: Tlh97

Tuna-Fish

Golden Member
Mar 4, 2011
1,419
1,749
136
Right, so here's where my theoretical understanding ends -- how is 256 bytes over a 64GT/s link "several milliseconds"? Obviously putting the (much smaller than 256byte) request into a single packet is going to stink from an efficiency standpoint, but surely we're not just sticking around until we have enough TLPs to stuff into a "full-ish" request? Also, this doesn't really track with network latencies which are talking up sub-microsecond latencies. I'm happy to attribute some level of bs to marketing, but nearly four orders of magnitude is a little hard to swallow without a few details.

No, that was my mistake. Was supposed to be nanoseconds.
 
Reactions: Tlh97 and dnavas

dnavas

Senior member
Feb 25, 2017
355
190
116
No, that was my mistake. Was supposed to be nanoseconds.

Thanks -- I just want to underscore I appreciate the real-world experience. There's only so much I can glean from reading specs (and I often enough get it wrong anyway).

The only shot at software management would be a dedicated app [...] This is not something any general purpose apps will ever be written to do.

An interesting option is to stuff a JIT between the app and the hardware reality. You're right that there probably aren't cycles to devote in between (and imho, it's going to become harder in the storage realm as well), but optimizing and rewriting in parallel could be interesting.
 

LightningZ71

Golden Member
Mar 10, 2017
1,658
1,940
136
It's possible, to a certain degree, as long as the OS exposes to the app a method for designating what data goes in what pool. It's already possible to abstract RAM disks as file devices and assign various working sets to them for best performance, but that's an OS-level operation that is being used by a client program.

There is a space for things like custom ASICs to do automatic data migration in real time while the OS just sees a flat memory system. There are Intel's OS-level hooks for Optane devices, both at client scale and at server scale with persistent DIMMs. Some of that uses the existing virtual memory structure in Windows itself.

There's lots of ways to do this, and each has its pros and cons.
 
Reactions: Tlh97 and dnavas

jamescox

Senior member
Nov 11, 2009
640
1,104
136
I dare say that AMD isn't so constrained by available engineering resources (manpower and budget) that it would be impossible for them to devote resources to making an optimized IOD for desktop that isn't just a quadrant of the EPYC IOD. Granted, I think it would be awesome to have a triple-channel system for a 24-core Ryzen, but I think it highly unlikely. Though, come to think of it, with Threadripper being restricted to the 8 (maybe 12 for the next edition) memory-channel TR Pro/Workstation series, there's a veritable gulf between a dual-channel desktop and an 8-12 channel WS/HEDT platform...
I kind of doubt that there will be a Threadripper based on the full SP5 sized package. It is huge and expensive. With the last few generations, the IO die has been made at GF, on an older process. It is large, but likely very cheap to make, so using them in Threadripper, even if they are fully functional, is likely not a cost issue.

With Genoa on SP5, the IO die is now presumably much more expensive 6 or 7 nm class chip. It may still be plausible to use this for Threadripper, but it would be quite wasteful. I was thinking that it would be good to make a modular IO die where each quadrant is a separate chip; only around 100 mm2. That would allow for an in-between socket with half of the full Epyc IO die for lower end server, workstation, and HEDT. Perhaps they do need a channel to make use of salvage IO die. Although, a salvaged modular part could possibly still be used in desktop Ryzen since the full part would have 3 memory channels instead of 2. Going modular would have wasted a lot less silicon, but it probably would require embedded silicon bridges between the IO die chiplets for Epyc.

With PCIe 5 bandwidth off the CPU, it is plausible that they could provide significantly more IO by using multiple chipset chips. Due to the trace-length restrictions at PCIe 5 speeds, I could see them using 2 chipset chips daisy-chained or just used as PCIe bridges / expanders. They could do 8 in from the CPU and 24 out of each chip, for 64 total (16 remaining from the CPU and 48 from the chipsets). Those would likely be the IO die design reused, but made on a cheaper process. That could keep trace lengths short and provide Threadripper-level IO. It would have some shared paths, but that is likely not really an issue at PCIe 5 speeds.

There is still the 2 (or possibly 3) chiplet limitation though. Perhaps they should just make a dual socket AM5 to fill the gap between Ryzen and Epyc. That would be hilarious if they brought dual socket AM5 to HEDT parts. That would likely use 2 chipset chips and would not fit on smaller board sizes. The whole platform would likely still be cheaper than single socket SP5 / Genoa.
 
Last edited:
Reactions: Tlh97

jamescox

Senior member
Nov 11, 2009
640
1,104
136
It just really surprises me that after years of EPYC beating Intel, they are gaining so slowly. I mean, the major part of what Intel is shipping is 28-core Cascade Lake-SP that takes as much power as the 64-core EPYC boxes? Power/performance is critical in the data center, where twice the power is at least 4 times the cost, due to AC. It amazes me how stupid these data center people are.
Supplies may be better now, but we were having difficulty getting Milan chips. There were rumors of a COVID outbreak in late 2021 at packaging facilities in Malaysia that may have caused a massive backlog.

A lot of people are still dismissive of AMD as a serious enterprise solution, but part of this is that the server market just moves slower. A lot of the time it is longer-term deals, or a large number of servers spec'ed and then purchased over several years. I don't really think Intel will offer that strong of competition for a while yet, so AMD still has some time to gain mindshare and market share. Intel has some MCM / stacked solutions in the pipeline, but AMD also has a lot of things coming with Milan-X3D, Genoa, possibly Genoa-X3D, and Bergamo.
 

jamescox

Senior member
Nov 11, 2009
640
1,104
136
Nice job with this.

There seems to be quite a bit of empty space on 2 sides of this arrangement. It makes me wonder if AMD might be adding some additional silicon there. Could this possibly support putting some HBM memory in the package? Or some sort of FPGA or customized chips for things like encoding / decoding / networking...

If there is nothing else, just 12 chiplets connected to IO Die, is this the most optimized approach, given that there will be:
- additional cost for the fan-out package and assembly
- AMD still has the SerDes connection with its limitations
- bandwidth increase will only be linear,
- power savings may be limited.
- may limit the desktop / AM5 to higher market segments, no entry level CPUs

It seems that EMIB / EFB could be:
- cheaper
- no more SerDes, latency hit
- lower power use
- greater bandwidth potential

I wonder if this will end up an interim solution for a single generation.

Or, there may be more to this solution that we don't know yet. Something beyond just IOD-CCD links

One idea: Cutress had an article / video speculating that AMD could use an interposer, which could provide a mesh interconnect between cores of the package, replacing the ring bus

Alternatively, quadrants could be NUMA groups, and there could be a faster chiplet-to-chiplet interconnect within a quadrant constituting a NUMA group.

Another AMD exec was hypothesizing about a mesh chiplet to chiplet interconnect, possibly providing very fast access to each other's L3 caches. But that could be quite messy, and if all the links would need a SerDes, also power hungry.

Alternatively, it could be only quadrant to quadrant mesh of SerDes based connections, as opposed to chiplet to chiplet...
The package size is determined by the number of pins required, not really by what they put on it. They could still do something like adding HBM, but they seem to be going the stacked SRAM route instead. HBM would allow high bandwidth for streaming applications, but it isn’t low enough latency to act as a general cache. It is still DRAM with DRAM like latencies.

Stacked solutions are actually likely to take less area on package. I am hoping that Bergamo is a stacked solution using infinity cache bridge chips, either above or below the IO die and cpu chiplets. I don’t see why they wouldn’t use the same cache bridge chips for GPUs and CPUs. That would allow for 512 MB to 1 GB of cache without taking any 2D package area. It might be doable in a single reticle size, if they can actually route all of the necessary interconnect out of the package. If that is the route that they go, then it is unclear whether the off die cache will be L3 or L4. They may still have a small on-die L3, even if they have large stacked cache.
 
Reactions: Joe NYC

jamescox

Senior member
Nov 11, 2009
640
1,104
136
I don't think it would just those, just an early thought since we don't have details yet.
  • Zen4c would likely not use the same process as the HPC optimized Zen4. Very likely lower power and density optimized flavor of N5, just like N7 for mobile 5000 series.
    • E.g. Cezanne has 62.5 MTr/mm2 density compared to 51.5 MTr/mm2 for Zen3 CCD even though Cezanne has a lot more IO blocks. If we remove the IO blocks, Cezanne could probably be 70+ MTr/mm2 (Apple got 82 MTr/mm2 on N7 for example)
  • L3 will get a cut like mobile series to 2MB/slice
  • L2 will likely stay the same as normal Zen 4 at 1MB. Doubling the L2 of Zen 4 (i.e. 2MB) would make it 2x the size of L3 slice, in other words, 2MB L2 has similar size as 4 MB L3 slice.
  • V-Cache likely not, because they cut the L3 to begin with, why add again, plus they need room to add the LDOs and the TSVs. Zen3 has LDOs between the L3 gaps to power the V Cache.

25-30% density gain from using Mobile variant of process + 20% reduction in area by cutting L3 in half + Cutting TSV area 2-3 mm2 + reducing LDO area.
That might just be enough to fit 2 such CCX in Zen4c.

I agree with some of this, but one thing to keep in mind is that it is Zen4c for cloud, which likely means a high end part. These are for server applications where giant caches can make a big difference, so going with no stacked caches seems less likely to me. I would expect this to use stacked cache by default; either the type of cache used in regular Zen X3D (64 MB or larger stacked on top of cpu die) or infinity cache chips that will also be used for GPUs. The on die L3 could be small, but having 16 cores with only a half size L3 for server applications doesn’t seem like a good solution unless the L2 is massive. A lot of high thread count server applications are not very cacheable, but some of the target application likely can make good use of large caches.

For a lot of servers, massive AVX512 units are a complete waste of die area, so it seems like they would cut that down somehow. They could have a smaller number of units compared to regular Zen 4. A more radical design would be to just share the AVX units. That would likely require multiple cores sharing an L2 cache though, so a lot more design work.

Edit: Also, if it is using bridge chips to save power instead of serdes links to cpu chiplets, then the TSV area would replace the serdes GMI link.
 
Last edited:
Reactions: Joe NYC and Tlh97

MadRat

Lifer
Oct 14, 1999
11,923
259
126
If your cache sits between your memory and CPU it should be easier to customize servers to a wider market than making endless SKUs.

Seems that DDR5 would allow for grouping banks of memory together under a 'pseudo' memory controller. Memory used to be controlled externally from the CPU, but internal controllers offered many performance advantages. DDR5 shifted some of that control out of the CPU and allows individual chips to have some independence. So utilizing that architecture to hide a high-bandwidth cache isn't outlandish.

The CPU might see the memory controller as one big chip or at least some fraction of what is there. The memory controller hides the raw details and balances out performance at the hardware level. The memory controller would spread communication out to multiple sticks in a RAID 0. Use HBM to victim cache data like an L3. The cache would also buffer fetches across all the sub-channels tied to the controller. You are probably never going to saturate the HBM bandwidth with your DDR5 connections, but you can easily saturate data out from your memory controller to the CPU using a cache with HBM.

The memory controller might even be smart enough to keep volatile memory in the fastest chunks, and use slower sticks for what hardware recognizes as largely static chunks. Having a cache you can squeeze in close to the CPU sure would supercharge many worksets. Being able to keep your memory channel full both ways from memory to CPU, and vice versa, certainly wouldn't hurt. As a caveat, maybe you support more than just DDR5 memory, because the vendor can wire in chipset to do more than one function, while using one chipset version. Maybe your memory controller in a high end board is several layers of cache. Gives your board makers something to exploit for an edge over competitors. And being able to do it without software, makes this strategy even better.
 
Reactions: Joe NYC