Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
805
1,394
136
Except for details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3); the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides also showed that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least) and will come with a new platform (SP5), with new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts!
 
Last edited:
Reactions: richardllewis_01

Tuna-Fish

Golden Member
Mar 4, 2011
1,419
1,749
136
PCIe 6 will replace DRAM-specific buses like PCIe 3 has done for SSDs

Eventually, some PCIe probably will, but I suspect it will be a while still. The packet-based PCIe interface adds substantial latency over a normal DDR-style bus. My bet for the future would be that most devices get DRAM integrated onto the CPU package (or on top of the CPU die, if they figure out a way to manage the thermals), to reduce power and increase bandwidth. However, this will tie DRAM amount to your CPU choice, and so anyone who needs more than what their CPU has built in will have to attach it through PCIe.
 

moinmoin

Diamond Member
Jun 1, 2017
4,993
7,763
136
My bet for the future would be that most devices get DRAM integrated onto the CPU package (or on top of the CPU die, if they figure out a way to manage the thermals), to reduce power and increase bandwidth.
In the consumer market (more likely in the mobile one), maybe. But PCIe development is focused on servers, where integrating DRAM onto the CPU package may well be limited by size, considering the high amounts of RAM servers tend to be built with.

Latency is indeed an issue, but the big question is: would the latency of PCIe 6 be higher or lower than that of existing shared-memory solutions? If lower, it will already do its job fine in data centers.
 

Doug S

Platinum Member
Feb 8, 2020
2,470
4,026
136
Eventually, some PCIe probably will, but I suspect it will be a while still. The packet-based PCIe interface adds substantial latency over a normal DDR-style bus. My bet for the future would be that most devices get DRAM integrated onto the CPU package (or on top of the CPU die, if they figure out a way to manage the thermals), to reduce power and increase bandwidth. However, this will tie DRAM amount to your CPU choice, and so anyone who needs more than what their CPU has built in will have to attach it through PCIe.


Given the huge number of SKUs Intel currently manages, having SKUs with 16, 32, 128 or whatever gigabytes of integrated DRAM would not be a problem. Especially if they are able to reduce the number of SKUs vis-à-vis CPU choice via the 'upgrade' stuff: instead of having several dozen different SKUs based on the same physical die sold at different clock rates / TDP figures, they could have an order of magnitude fewer, with a "low end" (dies sold with bad cores that have limited upgradeability), "midrange", and "high end" (dies cherry-picked to hit the highest clock rates / lowest power draw).

This handful of standard SKUs is then "upgraded" by the OEM to the model with the features/clock rate/power profile required for the particular laptop/desktop/etc. they are selling. That would help them massively with inventory management, since they wouldn't have to worry about shipment delays from running out of a particular i7-12xxx, so keeping around a dozen different versions of each base die with different amounts of integrated DRAM wouldn't be a problem.

This wouldn't work as well for servers, since there is larger variation in DRAM amounts and more desire (at least outside cloud customers) to be able to upgrade - though nothing stops them from having the entire CPU+DRAM module socketed so you can upgrade your RAM that way. Buying DRAM that way would allow Intel to use their buying power to source DRAM at a lower price than all but the largest OEMs can, which would benefit the smaller OEMs who have to buy from the spot market.

It is really unclear what the benefit would be of using PCIe to attach DRAM instead of DIMM slots given the latency hit - which, keep in mind, will increase significantly (at least relative to the expected latency of DRAM) with PCIe 6.0 due to using PAM4 instead of NRZ. You save pins and board area, but pay in latency and granularity of upgrade (unless someone is proposing DIMM slots on a PCIe card, which would defeat the whole purpose).
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,101
136
CXL.mem will be hugely important. The inherent latency of PCIe really isn't that bad, and so it's perfectly feasible to have memory expansion cards providing the bulk of the system's capacity, whether that be in DDR5, Optane, or anything else. Would allow for a small, faster tier of system memory (HBM?).
 

LightningZ71

Golden Member
Mar 10, 2017
1,657
1,939
136
I think that's the catch. It is expected that CXL and other remote connected RAM will have a non-trivial latency hit. However, it's also understood that the processor packages that use it will have large caches and local memory pools to minimize that apparent latency hit.
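As a back-of-envelope sketch of why that works, here is a tiny average-memory-access-time model; every latency and hit rate in it is an illustrative assumption, not a vendor figure:

```c
/* Toy AMAT (average memory access time) model: a large cache plus a local
   DRAM tier hides most of the latency of CXL-attached far memory.
   All numbers are illustrative assumptions. */
#include <stdio.h>

int main(void) {
    const double t_cache = 15.0;   /* ns, big on-package cache hit (assumed) */
    const double t_local = 80.0;   /* ns, local DDR DRAM (assumed)           */
    const double t_cxl   = 150.0;  /* ns, CXL-attached far memory (assumed)  */

    const double f_cache = 0.90;   /* assumed fraction served by the cache   */
    const double f_local = 0.08;   /* assumed fraction served by local DRAM  */
    const double f_cxl   = 0.02;   /* assumed fraction that goes out to CXL  */

    double flat   = f_cache * t_cache + (f_local + f_cxl) * t_local;
    double tiered = f_cache * t_cache + f_local * t_local + f_cxl * t_cxl;

    printf("all-local DRAM behind the cache: %.1f ns average\n", flat);
    printf("with a CXL far-memory tier:      %.1f ns average\n", tiered);
    return 0;
}
```

With 90% of accesses caught by the cache and only 2% going out to far memory, the average access time moves by about 1.4 ns in this toy model - which is the "apparent latency" point.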
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,859
3,404
136
CXL.mem will be hugely important. The inherent latency of PCIe really isn't that bad, and so it's perfectly feasible to have memory expansion cards providing the bulk of the system's capacity, whether that be in DDR5, Optane, or anything else. Would allow for a small, faster tier of system memory (HBM?).
HBM isn't faster; it will probably always be slower than highly tuned xDDRx
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,101
136
I think that's the catch. It is expected that CXL and other remote connected RAM will have a non-trivial latency hit. However, it's also understood that the processor packages that use it will have large caches and local memory pools to minimize that apparent latency hit.
I think the latency is in the 10s of ns range, which is plenty acceptable. Though if you want to make things really interesting, CXL/Gen-Z over Ethernet for warehouse-scale memory scaling. Now that would be fun.

HBM isn't faster; it will probably always be slower than highly tuned xDDRx
It's vaguely comparable latency-wise, but much higher bandwidth. Depends what you're trying to build.
 
Reactions: Tlh97

dnavas

Senior member
Feb 25, 2017
355
190
116
Eventually, some PCIe probably will, but I suspect it will be a while still. The packet-based PCIe interface adds substantial latency over a normal DDR-style bus.

I presume this is a matter of data width/packet size rather than pure latency, which appears to be ~10ns for both 6.0 (https://pcisig.com/sites/default/files/files/PCIe 6.0 Webinar_Final_.pdf) and DDR4 (and future DDR5)?

If that's the case, bigger caches with bigger line sizes would seem to be a possible way forward.
 
Reactions: Vattila

eek2121

Diamond Member
Aug 2, 2005
3,042
4,258
136
I think the latency is in the 10s of ns range, which is plenty acceptable. Though if you want to make things really interesting, CXL/Gen-Z over Ethernet for warehouse-scale memory scaling. Now that would be fun.


It's vaguely comparable latency-wise, but much higher bandwidth. Depends what you're trying to build.
I presume this is a matter of data width/packet size rather than pure latency, which appears to be ~10ns for both 6.0 (https://pcisig.com/sites/default/files/files/PCIe 6.0 Webinar_Final_.pdf) and DDR4 (and future DDR5)?

If that's the case, bigger caches with bigger line sizes would seem to be a possible way forward.

I suspect you aren’t looking at the whole picture. Latency on PCIe 5.0 can be as high as 60 ns, and that is not a bidirectional number. As far as I am aware (I know little about this subject), PCIe memory solutions were about providing a secondary pool of memory beyond the primary pool. I haven’t kept up with things (I have ignored CXL almost completely), but PCIe is not ideal for a main memory pool because:

  1. High latencies that are extremely variable.
  2. Shared traffic with other devices.
  3. Long traces result in greater opportunity for errors, even with error correction.
  4. More expensive motherboards, due to more layers and better shielding being needed.
Sure, you could dedicate lanes specifically to memory. You might even be able to lower latencies, but then you are just replacing one standard with another for no material benefit whatsoever.

I did have a link to a GitHub page at one point that had a project that measures PCIe latency; however, I can no longer find it.

EDIT: that document is not referring to the latency of the PCIe bus itself, btw.

Final EDIT: This page has information on FLIT mode, which can allow for low-latency communication. Finally, this page talks a bit more about PCIe FLIT mode and latency.
 
Last edited:

dnavas

Senior member
Feb 25, 2017
355
190
116
I suspect you aren’t looking at the whole picture.

Oh, I suspect I don't understand the whole picture.
If I understand correctly, the forward error correction they've chosen makes worst-case latencies pretty bad, but in many cases it can decrease total latency. And yes, I don't think those are end-to-end latencies, and they certainly aren't bidirectional, so maybe the comparison was a little apples and oranges. Still, CXL 3 is aiming to be pretty aggressive on latency.

More expensive motherboards due to more layers and better shielding being needed.

Yeah, I'm worried when I see even PCIe 4 getting skipped on some Zen releases and motherboard cooling being applied to (most) recent AM4 designs. [almost made it back on topic]

Sure, you could dedicate lanes specifically to memory. You might even be able to lower latencies, but then you are just replacing 1 standard with another for no material benefit whatsoever.

The advantage of being able to treat everything with a local cache as just memory is that you can trim buffer copying, which, amusingly, is a way to trim latencies. It's going to cause hell for folks who rely on that latency to provide benefit "for free", but that's another discussion, which itself is two degrees off-topic.

Thanks for the reading materials, I'll try to slog my way through them!
 

DisEnchantment

Golden Member
Mar 3, 2017
1,682
6,197
136
CXL.mem will be hugely important. The inherent latency of PCIe really isn't that bad, and so it's perfectly feasible to have memory expansion cards providing the bulk of the system's capacity, whether that be in DDR5, Optane, or anything else. Would allow for a small, faster tier of system memory (HBM?).
CXL.mem + CXL.cache is basically what AMD's IF 3.0 is doing in Trento using custom links. By supporting CXL 1.1 on PCIe 5, Genoa+ CPUs could run compatible off-the-shelf accelerators with a coherent and unified memory layout.
Memory expansion per node, like the Samsung CXL memory expansion card, could be useful but is not going to change current server architecture much.

Memory pooling is the big thing that will change the ecosystem in a big way. GenZ memory pooling being folded into CXL 2.0+ mem pooling should make this a really interesting tech for future servers.
A memory only node that allows sharing memory across several nodes could be interesting.

OEMs could possibly go HBM near memory with CXL far memory and something like the MPDMA page migrator in between, and reduce the node DRAM interfaces/slots.
What could be done custom is to put CXL on top of other interconnects, like optical etc.
It does not sound like a good idea to go it alone and proprietary with next-gen servers. STH/Patrick wrote quite a bit on these.
 
Last edited:

DrMrLordX

Lifer
Apr 27, 2000
21,791
11,131
136
By supporting CXL 1.1 on PCIe 5, Genoa+ CPUs could run compatible off the shelf accelerators with coherent and unified memory layout.

Bingo. Being able to support external memory pools isn't necessarily about supplementing system RAM or CPU cache. It's about being able to properly address accelerators that have their own memory pools. NVLink has had this for a while (it's critical to CUDA's UVM model). Not sure if CCIX supported it at all, but since AMD is going all-in on CXL it doesn't really matter now.
 

Doug S

Platinum Member
Feb 8, 2020
2,470
4,026
136
Bingo. Being able to support external memory pools isn't necessarily about supplementing system RAM or CPU cache. It's about being able to properly address accelerators that have their own memory pools. NVLink has had this for a while (it's critical to CUDA's UVM model). Not sure if CCIX supported it at all, but since AMD is going all-in on CXL it doesn't really matter now.

OK, now this I agree with. This is an obvious win for an ESX farm: you can have normal DRAM installed on each server and then a big pool of "remote" memory accessible to supplement it, with the hypervisor responsible for divvying it up as required by the needs of the VMs running on each server instance. I would guess it would most likely be seen in a blade chassis, where one of the blades is a bunch of DRAM instead of a server. At most it would be intra-rack; I doubt you'd ever see it used inter-rack except in very limited circumstances (i.e. maybe a special HPC build where particularly massive amounts of DRAM per CPU core are required for one step in a calculation).
 

DrMrLordX

Lifer
Apr 27, 2000
21,791
11,131
136
@Doug S

You can also have multiple enterprise dGPUs assigned to the same compute task and maintain memory coherence between accelerators without (necessarily) hitting main system RAM. Which is what NVLink + CUDA can do already.
 

Doug S

Platinum Member
Feb 8, 2020
2,470
4,026
136
@Doug S

You can also have multiple enterprise dGPUs assigned to the same compute task and maintain memory coherence between accelerators without (necessarily) hitting main system RAM. Which is what NVLink + CUDA can do already.

OK thanks. I've never been involved at all with GPGPU so I know zero about the state of art in that field.
 

DrMrLordX

Lifer
Apr 27, 2000
21,791
11,131
136
OK thanks. I've never been involved at all with GPGPU so I know zero about the state of art in that field.

I'm a relative neophyte. I only know a little about it thanks to HSA and the SVM model. NV kind of took the concept a step further by creating a system-wide memory pool: unified virtual memory, or UVM. I think I pasted this link in another thread recently, but here's one of the newbie documents I've read on the subject:


Anyway, AMD and Intel want to be able to do the same thing with their own open standards, which is why (currently) everyone's lining up behind CXL. I think you could do this already with AMD accelerators via ROCm, but don't quote me on that. Not sure if AMD ever really had the hardware to pull it off, except maybe for that narrow period of time when EPYC shipped with CCIX-capable PCIe slots.
 

SteinFG

Senior member
Dec 29, 2021
517
608
106
View attachment 55740

Very interesting pic there. I hope Dylan is right on the packaging tech. Either that or I just saved myself from another subscription. (But if he is right I will sub him and pay)
Because I cannot see any hint of any fancy packaging tech in use there; granted, the grey structure obscures everything else. I cannot even see the LGA pattern.
My current theory is that AMD is using a fan-out package just on the IO die in order to decrease its manufacturing cost.
Looking at die shots of the 12nm server IO die, most of it is taken up by connectors, about 1/3 is logic, and there is a little bit of dead space. The dead space is probably a giveaway that the IO die is at the limit.
Moving the IO to 7nm decreases the bump pitch from 150 to 130 micron, which gives about a 33% increase in IO density, so connectors can be 33% smaller. And we actually see this when looking at die shots of Raven Ridge (12nm) vs Renoir (7nm).
But this is not enough: while IO area decreases by 33%, logic decreases by over 55%. This will introduce even more dead space.
Moving to fan-out decreases the bump pitch to 40 micron (I'm using TSMC info), which will shrink the connectors by up to 93%.
Optimistically, the IO die will shrink by 70-80%, and what we see at the center of this x-ray is a rectangular fan-out package with a small die at its center. But because of the bad quality it's impossible to see the die itself.
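A quick sanity check of that pitch arithmetic, assuming bumps on a square grid so areal density scales as 1/pitch² (the grid assumption is mine):

```c
/* Sanity check of the bump-pitch figures above: density ~ 1/pitch^2. */
#include <stdio.h>

static double density_gain(double old_um, double new_um) {
    return (old_um / new_um) * (old_um / new_um) - 1.0;  /* fractional gain */
}

int main(void) {
    /* 150 um -> 130 um bump pitch: the ~33% IO density increase above. */
    printf("150 -> 130 um: %+.0f%% density\n",
           100.0 * density_gain(150.0, 130.0));
    /* 150 um -> 40 um fan-out RDL pitch: ~93% less area per contact. */
    printf("150 -> 40 um:  -%.0f%% area per contact\n",
           100.0 * (1.0 - (40.0 * 40.0) / (150.0 * 150.0)));
    return 0;
}
```

150 → 130 micron works out to +33% density, and 150 → 40 micron to ~93% less area per contact, matching the figures above.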

edit: I haven't thought about how those RDL wires will carry the signal through the narrow fan-out plane, but assuming they are 2/2 micron line/space, it's probably solvable.
 
Last edited:

DisEnchantment

Golden Member
Mar 3, 2017
1,682
6,197
136
My current theory is that AMD is using a fan-out package just on the IO die in order to decrease its manufacturing cost.
Looking at die shots of the 12nm server IO die, most of it is taken up by connectors, about 1/3 is logic, and there is a little bit of dead space. The dead space is probably a giveaway that the IO die is at the limit.
Moving the IO to 7nm decreases the bump pitch from 150 to 130 micron, which gives about a 33% increase in IO density, so connectors can be 33% smaller. And we actually see this when looking at die shots of Raven Ridge (12nm) vs Renoir (7nm).
But this is not enough: while IO area decreases by 33%, logic decreases by over 55%. This will introduce even more dead space.
Moving to fan-out decreases the bump pitch to 40 micron (I'm using TSMC info), which will shrink the connectors by up to 93%.
Optimistically, the IO die will shrink by 70-80%, and what we see at the center of this x-ray is a rectangular fan-out package with a small die at its center. But because of the bad quality it's impossible to see the die itself.

edit: I haven't thought about how those RDL wires will carry the signal through the narrow fan-out plane, but assuming they are 2/2 micron line/space, it's probably solvable.
It is a reasonable proposition.
If the IOD is too small, it is reasonable to assume they would need to put it in a fan-out package, where the RDL would redistribute the contacts over a larger area to match the maximum bump density that can be implemented on the substrate.
AMD is already using the highest bump density their substrate supplier can do (as per their official Zen 2 briefings).
But bump-density constraints could also apply to the smaller CCD, which could be a problem if it gets more VDD and more GMI lines. For a smaller package like AM4, they might as well go fan-out for the CCDs as well.

what we see at the center of this x-ray is a rectangular fan-out package with a small die at its center. But because of the bad quality it's impossible to see the die itself.
Maybe, but the small rectangular texture could also be the SMD component area in the center that is devoid of the LGA pattern.
 

Hans de Vries

Senior member
May 2, 2008
321
1,018
136
www.chip-architect.com
I have not seen anything about the L3 cache per Zen 4 CCD so far (not in the Genoa manuals). Is there anything officially known?

I'm becoming cautiously optimistic.

Zen 3 on 7nm: 96 MB (32 MB on-die + 64 MB V-cache) = 36 mm² + ~36 mm² = ~72 mm²

With SRAM scaling from 7nm → 5nm given as 75%:

Zen 4 on 5nm: 64 MB = 3/4 × 2/3 × 72 mm² = 36 mm²

So 64 MB of L3 is not impossible for the ~72 mm² known die size.
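As a worked sketch of that arithmetic (taking the ~72 mm² for 96 MB on 7nm and the 75% scaling factor above at face value; both inputs are from this post, not official figures):

```c
/* Worked version of the L3 area estimate above. Inputs are the post's own
   numbers (96 MB in ~72 mm^2 on 7nm; 5nm SRAM at ~75% of 7nm area). */
#include <stdio.h>

int main(void) {
    const double mb_7nm    = 96.0;  /* 32 MB on-die L3 + 64 MB V-cache */
    const double area_7nm  = 72.0;  /* ~36 mm^2 + ~36 mm^2             */
    const double scale_5nm = 0.75;  /* 7nm -> 5nm SRAM area factor     */

    /* 64 MB is 2/3 of 96 MB; scale the proportional area to 5nm. */
    double area_5nm = (64.0 / mb_7nm) * area_7nm * scale_5nm;
    printf("64 MB L3 on 5nm: ~%.0f mm^2 of a ~72 mm^2 die\n", area_5nm);
    return 0;
}
```

That reproduces the ~36 mm² above, i.e. half the known die size.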
 

DisEnchantment

Golden Member
Mar 3, 2017
1,682
6,197
136
So 64 MB of L3 is not impossible for the ~72 mm² known die size.
Considering that special use cases relying on a huge L3 could always be handled by a V-Cache SKU, going to 64 MB of L3 would be a very poor way to spend the bigger transistor budget, but I would assume AMD knows better.

Do you have a link?
Only the Chips&Cheese link is there, but the actual L3 size is known only at runtime, since CPUID does not have a fixed value to read out. So we won't know until the chips are in hand and it's read out by software.
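To illustrate the readout: on AMD, software walks CPUID leaf 0x8000001D (the cache-topology leaf, which uses the same geometry encoding as Intel's leaf 4) and computes each cache's size as ways × partitions × line size × sets. A minimal sketch for GCC/Clang on x86-64, skipping the feature-bit checks a robust tool would do:

```c
/* Sketch: enumerate cache levels via AMD's cache-topology CPUID leaf
   0x8000001D. A robust version would first check the TopologyExtensions
   feature bit; skipped here. Build with: cc -O2 cache_size.c */
#include <stdio.h>
#include <cpuid.h>

int main(void) {
    for (unsigned sub = 0; ; sub++) {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid_count(0x8000001Du, sub, &eax, &ebx, &ecx, &edx))
            break;
        unsigned type = eax & 0x1F;            /* 0 = no more cache levels */
        if (type == 0)
            break;
        unsigned level = (eax >> 5) & 0x7;
        unsigned long long ways  = ((ebx >> 22) & 0x3FF) + 1;
        unsigned long long parts = ((ebx >> 12) & 0x3FF) + 1;
        unsigned long long line  = (ebx & 0xFFF) + 1;
        unsigned long long sets  = (unsigned long long)ecx + 1;
        printf("L%u %-7s: %llu KB\n", level,
               type == 1 ? "data" : type == 2 ? "instr" : "unified",
               ways * parts * line * sets / 1024);
    }
    return 0;
}
```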
 

eek2121

Diamond Member
Aug 2, 2005
3,042
4,258
136
I have not seen anything about the L3 cache per Zen 4 CCD so far (not in the Genoa manuals). Is there anything officially known?

I'm becoming cautiously optimistic.

Zen 3 on 7nm: 96 MB (32 MB on-die + 64 MB V-cache) = 36 mm² + ~36 mm² = ~72 mm²

With SRAM scaling from 7nm → 5nm given as 75%:

Zen 4 on 5nm: 64 MB = 3/4 × 2/3 × 72 mm² = 36 mm²

So 64 MB of L3 is not impossible for the ~72 mm² known die size.
Considering that special use cases relying on a huge L3 could always be handled by a V-Cache SKU, going to 64 MB of L3 would be a very poor way to spend the bigger transistor budget, but I would assume AMD knows better.


Only the Chips&Cheese link is there, but the actual L3 size is known only at runtime, since CPUID does not have a fixed value to read out. So we won't know until the chips are in hand and it's read out by software.
There have been a few leaks regarding this. AMD is not growing the L3 for Zen 4. The L2 cache has received some modifications.

EDIT: Also, a larger L3 would not help in many workloads anyway, if AMD is to be believed.
 

Mopetar

Diamond Member
Jan 31, 2011
8,000
6,433
136
I have not seen anything about the L3 cache per Zen 4 CCD so far (not in the Genoa manuals). Is there anything officially known?

I'm becoming cautiously optimistic.

Zen 3 on 7nm: 96 MB (32 MB on-die + 64 MB V-cache) = 36 mm² + ~36 mm² = ~72 mm²

With SRAM scaling from 7nm → 5nm given as 75%:

Zen 4 on 5nm: 64 MB = 3/4 × 2/3 × 72 mm² = 36 mm²

So 64 MB of L3 is not impossible for the ~72 mm² known die size.

I'm not sure the SRAM scaling is even that good. Someone did an analysis of the Apple A-series SoCs when they transitioned from 7nm to 5nm and noticed that the cache didn't shrink at all. Maybe the designs just weren't fully finished yet and Apple decided not to bother, but another possibility is that the scaling isn't that good for SRAM without using the density-optimized process.
 

Doug S

Platinum Member
Feb 8, 2020
2,470
4,026
136
I'm not sure the SRAM scaling is even that good. Someone did an analysis of the Apple A-series SoCs when they transitioned from 7nm to 5nm and noticed that the cache didn't shrink at all. Maybe the designs just weren't fully finished yet and Apple decided not to bother, but another possibility is that the scaling isn't that good for SRAM without using the density-optimized process.

According to TSMC, SRAM scales by 30% from N7 to N5 (and 20% from N5 to N3). Why Apple didn't manage even that is unknown, but they would have got less because they were coming from N7P to N5, not N7 to N5. Apple also may have added more ways, more ports, or done something that sacrificed area for lower power. Real cache is more than just an array of SRAM cells, after all.
 