Discussion Speculation: Zen 4 (EPYC 4 "Genoa", Ryzen 7000, etc.)


Vattila

Senior member
Oct 22, 2004
805
1,394
136
Except for the details of the microarchitectural improvements, we now know pretty well what to expect from Zen 3.

The leaked presentation by AMD Senior Manager Martin Hilgeman shows that EPYC 3 "Milan" will, as promised and expected, reuse the current platform (SP3), and the system architecture and packaging look to be the same, with the same 9-die chiplet design and the same maximum core and thread count (no SMT-4, contrary to rumour). The biggest change revealed so far is the enlargement of the compute complex from 4 cores to 8 cores, all sharing a larger L3 cache ("32+ MB", likely to double to 64 MB, I think).

Hilgeman's slides did also show that EPYC 4 "Genoa" is in the definition phase (or was at the time of the presentation in September, at least), and will come with a new platform (SP5), with new memory support (likely DDR5).



What else do you think we will see with Zen 4? PCI-Express 5 support? Increased core-count? 4-way SMT? New packaging (interposer, 2.5D, 3D)? Integrated memory on package (HBM)?

Vote in the poll and share your thoughts!
 
Last edited:
Reactions: richardllewis_01

jamescox

Senior member
Nov 11, 2009
640
1,104
136
FWIW, Charlie from SemiAccurate had some musings on Bergamo about a month ago (not even sure how reputable this info is), but I'll summarize it as follows:
- Bergamo takes the same IOD as Genoa, but puts 8 Bergamo CCDs on it instead of Genoa's 12.
- Each Bergamo CCD has (16) Zen 4c cores but the same 32 MB of L3 as Genoa's CCD.
- Each Zen 4c CCD splits its (16) Zen 4c cores into two CCXs, and each CCX shares half of the total L3 (i.e. 2 MB of L3 per Zen 4c core).
- Given that there are two CCXs on each Bergamo CCD, there is likely a latency penalty when a core on one CCX needs to access another core's data on a different CCX, even if that CCX is on the same CCD.
- AMD likely has figured out how to connect (12) memory channels to (8) CCDs given that Milan already handles this situation fine.
- Twice the socket performance of top Milan for all key foundational workloads.
- Bergamo can run in non-SMT mode, which helps with per-thread performance. On a thread vs thread basis, 128C/128T Bergamo is about 60% more performant than 64C/128T Milan.

Edit: If anyone has an issue with me summarizing this info because it technically was behind a paywall, let me know and I can delete it from this thread.

If that is what Bergamo is, then it isn’t really that interesting. It would still be a good product and might make sense given that a lot of server purchasers may be reluctant to use a stacked device of unknown reliability. They may be able to stay ahead of Intel; I don’t know when Intel’s stacked solutions are supposed to be available.

I was expecting some use of silicon bridge technology to save power. If it is the same IO die, then it would work the same way as current Milan processors. Each quadrant of the IO die has 2 memory controllers in Milan, supposedly 3 in Genoa. They are relatively independent of the CPU chiplets, so it doesn’t matter how many chiplets are present. In NPS1 mode, memory is interleaved across all 8 channels in Milan and Rome. Rome had some weird models with only 2 chiplets, so some quadrants did not have any attached CPUs. Milan only comes with 4 or 8 CPU dies.
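To picture what that interleaving looks like, here's a toy sketch of NPS1 channel mapping (my own simplification: plain modulo interleaving at cache-line granularity, rather than whatever hash the IO die actually uses):

```python
# Toy model of NPS1 memory interleaving on an 8-channel Milan/Rome-style IO die.
# Simplification: plain modulo interleaving at cache-line granularity; the real
# IO die uses a fancier hash. The point is only that the channel mapping is
# independent of how many CPU chiplets are attached.

CHANNELS = 8   # memory channels on the IO die (2 per quadrant)
LINE = 64      # interleave granularity in bytes (one cache line, assumed)

def nps1_channel(phys_addr: int) -> int:
    """Return the channel a physical address lands on in this toy NPS1 model."""
    return (phys_addr // LINE) % CHANNELS

# Consecutive cache lines spread across all 8 channels, whether 2, 4 or 8 CCDs
# are populated.
for addr in range(0, 8 * LINE, LINE):
    print(hex(addr), "-> channel", nps1_channel(addr))
```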

This design would have the same characteristics as Rome, where communications between CCX on the same die still have to go through the IO die. That must be done for cache coherency. It would be much more complicated to allow the two CCX on a single die to talk to each other. It would still need to send coherency information to the IO die anyway, so it isn’t worth it.

With RDNA3 rumored to use some kind of cache bridge chip (512 MB MCD), it seems like that would be used in other places. That is, if that rumor is actually correct; I don’t know how solid it is. Could they connect 4 CPU dies to the IO die with a cache bridge chip, rather than two GPUs? It seems like the sizes could be right. The 16 MB of cache per CCX seems like a step back unless it is more efficient in other ways. The lower-power process may have higher cache density, allowing for the smaller size. The rumor about the added V-Cache may make some sense in that context, but how would it be configured with 2 CCXs? Shared L4 cache? Some server applications are not very cacheable, so perhaps the smaller L3 isn’t an issue.
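Just to put numbers on that cache step-back, a quick back-of-the-envelope comparison (the figures are the rumoured ones from the summary above, nothing confirmed):

```python
# Rumoured L3 per core, per the SemiAccurate summary earlier in the thread
# (unconfirmed figures, just arithmetic).
configs = {
    "Milan CCD (Zen 3)":    {"cores": 8,  "l3_mb": 32},
    "Genoa CCD (Zen 4)":    {"cores": 8,  "l3_mb": 32},
    "Bergamo CCD (Zen 4c)": {"cores": 16, "l3_mb": 32},  # split as 2 CCX x 16 MB
}

for name, c in configs.items():
    print(f"{name}: {c['l3_mb'] / c['cores']:.1f} MB of L3 per core")

# Bergamo ends up at 2 MB/core vs 4 MB/core for Genoa, i.e. half the per-core
# L3, which is where the "16 MB per CCX" step-back comes from.
```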
 

tomatosummit

Member
Mar 21, 2019
184
177
116
If that is what Bergamo is, then it isn’t really that interesting. It would still be a good product and might make sense given that a lot of server purchasers may be reluctant to use a stacked device of unknown reliability. They may be able to stay ahead of Intel; I don’t know when Intel’s stacked solutions are supposed to be available.

[redacted] Some server applications are not very cacheable, so perhaps the smaller L3 isn’t an issue.

Bergamo is fully targeted towards cloud providers. It's going to result in three different server lines: Bergamo for cloud, EPYC Genoa for general usage, and Milan-X and its replacement for HPC.
Bergamo losing performance in some applications due to its differing cache structure is okay for its target market, which is also why I'm one of those who doubt the Bergamo CCDs will go into Ryzen desktop.
A bizarre thought: if you add in the additionally cache-crippled mobile Zen cores, then there are four performance targets for Zen CPUs, dictated primarily by the amount of cache.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,101
136
Bergamo is fully targeted towards cloud providers. It's going to result in three different server lines: Bergamo for cloud, EPYC Genoa for general usage, and Milan-X and its replacement for HPC.
Bergamo losing performance in some applications due to its differing cache structure is okay for its target market, which is also why I'm one of those who doubt the Bergamo CCDs will go into Ryzen desktop.
A bizarre thought: if you add in the additionally cache-crippled mobile Zen cores, then there are four performance targets for Zen CPUs, dictated primarily by the amount of cache.
I'm with you on the Zen 4c for desktop thing. Trading off single core speed for additional throughput and efficiency just doesn't make sense for a mainstream desktop part. Maybe for some of the embedded applications though.
 

jamescox

Senior member
Nov 11, 2009
640
1,104
136
I'm with you on the Zen 4c for desktop thing. Trading off single core speed for additional throughput and efficiency just doesn't make sense for a mainstream desktop part. Maybe for some of the embedded applications though.
If it is just a serdes connected chiplet, then it would be very easy to mix them. That is kind of how the 5900X is now. If there are no barriers to mixing them, you could have 8 high power cores and another 16 lower power cores that perhaps stay at base clock. Having that extra throughput could help a lot for massively multi-threaded applications, although I don’t know how much of a market there would be for that if we get new threadrippers. About the only advantage vs. a Threadripper type part would be power consumption, which isn’t too important on the desktop. It could possibly make a good high end mobile part.

I am hoping we just get a new middle socket between AM5 and SP5 for Threadripper and workstation products. That would allow some higher core count processors without going full Epyc SP5. There would be some market for a lot of these combinations, though; they might just not be big enough for AMD to target.

I am still wondering if they will make use of stacking somehow for Bergamo, even if they have the 2-CCX structure. With that structure, it sounds like they just pulled a CCX out of a mobile APU and put two of them on one die rather than CCX + GPU, so it may be a rather minor change from a design perspective. It appears that AMD will use silicon bridges in GPUs, and Apple is apparently using some form of silicon bridge for the M1 Ultra. A lot of server applications are not really cacheable anyway, so the mobile-CPU-sized caches might not be an issue. AMD has been good at keeping secrets lately, so who knows.
 

MadRat

Lifer
Oct 14, 1999
11,922
259
126
I wouldn't so much want slower chips scabbed on. That's really a bit pointless.

I'd want a super duper single core for gaming, with an 8 core on the second for other stuff. If you had a whole chiplet for one super duper core, how bad ass could you make it?
 

Thunder 57

Platinum Member
Aug 19, 2007
2,794
4,075
136
I wouldn't so much want slower chips scabbed on. That's really a bit pointless.

I'd want a super duper single core for gaming, with an 8 core on the second for other stuff. If you had a whole chiplet for one super duper core, how bad ass could you make it?

That sounds an awful lot like a PS3, which was regarded as rather difficult to write code for.
 

soresu

Platinum Member
Dec 19, 2014
2,935
2,160
136
That sounds an awful lot like a PS3, which was regarded as rather difficult to write code for.
Weren't the SPUs more of a general SIMD compute architecture?

I remember someone saying that they could supplement the GPU, perhaps even with greater programmability, only that the API to interact with them was not as developed as say DX compute or CUDA.
 
Reactions: Saylick

Saylick

Diamond Member
Sep 10, 2012
3,372
7,107
136
Weren't the SPUs more of a general SIMD compute architecture?

I remember someone saying that they could supplement the GPU, perhaps even with greater programmability, only that the API to interact with them was not as developed as say DX compute or CUDA.
Yeah, the SPEs were basically glorified SIMD units, but with the added annoyance of NUMA-like considerations because Cell used a ring bus: to really optimize for the SPEs, you had to account for the physical location of each SPE, because the ones further away from the memory controller took a little longer to get the data. The only general-purpose core available was the PPE. Long story short, Cell was much different from a hybrid architecture.
 

eek2121

Diamond Member
Aug 2, 2005
3,042
4,259
136
I'm with you on the Zen 4c for desktop thing. Trading off single core speed for additional throughput and efficiency just doesn't make sense for a mainstream desktop part. Maybe for some of the embedded applications though.

Going to disagree with this. Slapping a bunch of cores in a desktop part will only help performance, not hurt it. Not everyone uses their PC for gaming and Microsoft office. Video encoding, code compilation, 3D rendering, etc. can all take advantage of the additional cores. Shoot, even extracting a zip file or encoding music can be improved.

Also, Microsoft can potentially try new scheduler strategies, like having a single core dedicated to the UI, running certain background jobs only on low-power cores, running certain non-demanding apps on low-power cores, etc.

I would love to see AMD take such an approach.
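For what it's worth, the scheduler side doesn't even need new silicon to experiment with; on Linux you can already pin work to a chosen set of cores with os.sched_setaffinity. A rough sketch, with a completely hypothetical core split and made-up program names:

```python
import os
import subprocess

# Completely hypothetical core split: 0-7 = full-fat cores, 8-23 = dense cores.
BIG_CORES = set(range(0, 8))
DENSE_CORES = set(range(8, 24))

def run_on(cores, cmd):
    """Launch a command restricted to the given set of logical CPUs (Linux only)."""
    proc = subprocess.Popen(cmd)
    os.sched_setaffinity(proc.pid, cores)  # pin the new process to those cores
    return proc

# Keep the interactive app on the fast cores, push batch work to the dense ones.
run_on(BIG_CORES, ["./interactive_app"])                    # made-up binary
run_on(DENSE_CORES, ["ffmpeg", "-i", "in.mkv", "out.mkv"])  # background encode
```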
 
Jul 27, 2020
17,755
11,542
106
Also, Microsoft can potentially try new scheduler strategies, like having a single core dedicated to the UI, running certain background jobs only on low-power cores, running certain non-demanding apps on low-power cores, etc.

I would love to see AMD take such an approach.
This is what should have been done already before the Alder Lake launch. Funny if AMD ends up fixing Intel's E-core scheduling mess.
 

Abwx

Lifer
Apr 2, 2011
11,162
3,858
136
Well, 120W is such a nice round number.


170W, that could be a 16 Core 5GHz All Core Turbo top model.

16C would use less power even at 5GHz; a direct shrink of a 5950X with a 6nm-based IO die would consume about 55W @ 4GHz.

We can expect a Zen 4 based 16C to be within 65W at the same frequency; it would require 130W for 5GHz all-core and deliver 1.45x the MT perf of the previous gen.

They can also pack 32C@4.3GHz@170W for 2.5x the MT throughput of a 5950X.
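Rough arithmetic behind those numbers, with my own simplifying assumptions (dynamic power roughly proportional to frequency times voltage squared, MT throughput proportional to cores times all-core clock):

```python
# Back-of-the-envelope check of the figures above (estimates, not measurements).
# Assumptions: dynamic power ~ frequency * voltage^2, MT throughput ~ cores * frequency.

def scale_power(p0, f0, f1, v_ratio):
    """Scale power from frequency f0 to f1 given the accompanying voltage ratio."""
    return p0 * (f1 / f0) * v_ratio ** 2

# 65W @ 4GHz -> 5GHz: a ~26% voltage bump lands at roughly 130W.
print(round(scale_power(65, 4.0, 5.0, 1.26)))   # ~129

# Internal consistency of the two throughput claims (cores * all-core clock):
print((32 * 4.3) / (16 * 5.0))   # ~1.72x, 32C @ 4.3GHz vs 16C @ 5GHz
print(2.5 / 1.45)                # ~1.72x as well, so the 1.45x and 2.5x claims line up
```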
 

deasd

Senior member
Dec 31, 2013
551
864
136
Well, 120W is such a nice round number.

170W, that could be a 16 Core 5GHz All Core Turbo top model.

Sounds like AMD is letting us know the 170W SKU is their special top-bin model, which is forced to aim at the 13900K/S though.....
But since this is from the account which claimed Zen 4 supports DDR4 (or at least has the controller for it), I'll take everything he said with a grain of salt.....
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,688
1,222
136
Why is Zen 4 being referred to with the same family model as Zen 3???

EDIT: Maybe these patches are targeting Zen3+ rather than Zen 4?
Family 19h 00h-0Fh = Genesis(GN) (Milan) variants <-- Zen3
Family 19h 10h-1Fh = Stones(RS) (Genoa) variants <-- Zen4
Family 19h 30h-3Fh = Badami(BA) (Trento) variants <-- Zen3
Family 19h A0h-AFh = StonesDN(RSDN) (Bergamo) variants <-- Zen4c
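For anyone who wants to check such IDs themselves, the family and model come straight out of CPUID leaf 1 EAX via the usual extended-family/extended-model rule. A small sketch of that decode (the example EAX value is hypothetical, just chosen to land in the 19h/10h-1Fh range above):

```python
def decode_fms(eax: int):
    """Decode family/model/stepping from CPUID leaf 1 EAX (standard x86 rule)."""
    stepping   = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_fam   = (eax >> 8) & 0xF
    ext_model  = (eax >> 16) & 0xF
    ext_fam    = (eax >> 20) & 0xFF
    family = base_fam + ext_fam if base_fam == 0xF else base_fam
    model = (ext_model << 4) | base_model if base_fam in (0x6, 0xF) else base_model
    return family, model, stepping

# Hypothetical EAX value chosen to land in the ranges above: family 19h, model 10h,
# i.e. the Stones(RS)/Genoa 10h-1Fh bucket.
print([hex(v) for v in decode_fms(0x00A10F00)])   # ['0x19', '0x10', '0x0']
```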
 
Last edited:
May 17, 2020
123
233
116
16C would use less power even at 5GHz; a direct shrink of a 5950X with a 6nm-based IO die would consume about 55W @ 4GHz.

We can expect a Zen 4 based 16C to be within 65W at the same frequency; it would require 130W for 5GHz all-core and deliver 1.45x the MT perf of the previous gen.

They can also pack 32C@4.3GHz@170W for 2.5x the MT throughput of a 5950X.
Raphael is an APU, so you need to add the TDP of the graphics part to it.
 

ryanjagtap

Member
Sep 25, 2021
110
132
96
You know, after looking at this slide, I think AMD ought to decouple core+L2$ from the L3$ in their CCX. Then they can scale the logic, SRAM and analog in a structure somewhat like this:
Logic can go to leading process nodes, as it is scaling better (N5 -> N3 and so forth).
SRAM can maybe stop at N5, and then, just like they optimized a denser process on N7, they can create a denser SRAM-optimized process on N5.
Analog, I think, can stay on N6 for a long time if the area scaling according to the chart is accurate. The only reason to change it would be lower power draw on other nodes.
Then they can use 2.5D or 3D packaging, whichever is feasible and cost-effective, to make new CPUs.
If they use 2.5D, the EFB that they used in MI200 looks optimal, or they can use CoWoS.
For 3D, they can use the IOD as a base as well as LSI, for the core+L2$ die and L3$ die.

This is just a thought; feel free to tell me if it's not possible or just a ridiculous thought. And this is for CPUs after Zen 4; I'm just putting it here as there are no other speculation threads.
 

Attachments

  • AMD-EPYC-7003X-Milan-X-Large-Cache-Opportunity.jpg
  • AMD-3-1024x497.jpg

itsmydamnation

Platinum Member
Feb 6, 2011
2,860
3,407
136
Yes, you are wrong. How do you propose to interconnect your cores and handle cache coherency? If you stack cache, what are you stacking it on top of?

It seems to me people weigh the cost of the silicon itself way too high, and thus come up with crazy ideas. I think even people like Ian get it wrong when considering price vs cost for Zen 3 vs Zen 3D. There is a lot more to operating costs than just the die itself.
 
Last edited:

ryanjagtap

Member
Sep 25, 2021
110
132
96
Yes, you are wrong. How do you propose to interconnect your cores and handle cache coherency? If you stack cache, what are you stacking it on top of?

It seems to me people weigh the cost of the silicon itself way too high, and thus come up with crazy ideas. I think even people like Ian get it wrong when considering price vs cost for Zen 3 vs Zen 3D. There is a lot more to operating costs than just the die itself.
Okay.
I just thought using EFB traces to interconnect the dies would work.
I wasn't trying to weigh in on the cost of the die, more on the practical aspects of using nodes which don't show much area scaling or power reduction, but if it's not feasible then it's not feasible.
Thanks for listening to my crazy ideas and pointing out the mistakes, so that I can nip them in the bud.
 