Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)


Doug S

Platinum Member
Feb 8, 2020
2,430
3,932
136
Yeah, reading this 'news' it looks like on-die L3$ will wind up going the way of the dodo in a few node cycles. So stacked L3$ will rule the day then. I wonder if memory-side caches will come into play to alleviate some of the latency issues in the whole memory system (or an L4$, either of them on the IOD). I don't know if memory-side caches need to be snoopable; it seems like it wouldn't be worth it if that were the case.


Apple's SLC is already much more a memory-side cache than it is an L3. Any cache that's on the same silicon as your memory controllers can act as such, though AFAIK none have ever been designed that way by Intel or AMD.
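
To make the snooping point concrete, here is a minimal toy model (my own Python sketch, not how Apple, Intel or AMD actually implement anything) of a cache hanging off the memory controller, behind the coherence point. Because every request is ordered through the home agent first, the memory-side cache can never hand out data that a core holds dirty, so there is nothing to snoop on the memory side:

```python
# Toy model: a memory-side cache sits behind the coherence point, so it only ever holds
# what DRAM would hold and never needs to be snooped. All names here are illustrative.

class MemorySideCache:
    """Caches DRAM contents by physical address, next to the memory controller."""
    def __init__(self):
        self.lines = {}                        # addr -> data, mirrors DRAM's view

    def read(self, addr, dram):
        if addr not in self.lines:             # miss: fill from DRAM
            self.lines[addr] = dram.get(addr, 0)
        return self.lines[addr]

    def writeback(self, addr, data, dram):     # dirty evictions from the CPU side land here
        self.lines[addr] = data
        dram[addr] = data

class Core:
    def __init__(self):
        self.dirty = {}                        # lines this core holds in modified state

class HomeAgent:
    """The coherence point. CPU caches are snooped here, never on the memory side."""
    def __init__(self, cores, msc, dram):
        self.cores, self.msc, self.dram = cores, msc, dram

    def load(self, addr):
        for c in self.cores:                   # snoop the coherent caches first
            if addr in c.dirty:
                return c.dirty[addr]           # forwarded from the owning core
        return self.msc.read(addr, self.dram)  # otherwise the memory side serves it

dram = {0x40: 7}
core = Core()
ha = HomeAgent([core], MemorySideCache(), dram)
print(ha.load(0x40))   # 7 -- filled into the memory-side cache on the way back
core.dirty[0x40] = 9   # a core now holds the line modified
print(ha.load(0x40))   # 9 -- served by the snoop at the home agent; the memory-side cache is never asked
```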
 
Reactions: Lodix

jamescox

Senior member
Nov 11, 2009
640
1,104
136
Stacking may be it, and for now 3D stacking using hybrid bonding is the only way to go.

But there are 2 possible approaches in the future that would be 2.5D, i.e. horizontal: AMD has mentioned EFB with hybrid bonding, and TSMC has mentioned SoIC_H, which would be an interposer with chiplets attached via hybrid bonding.

Once either or both of these technologies make it to production, L3 (on-die or stacked) could migrate to a separate SRAM stack, or be stacked on top of the memory controller or I/O die. That SRAM could then work as a system-level cache, shared between different CPU chiplets and also graphics.
Those are not the only ways to go, and that doesn't seem to be what AMD is doing. Large interposers or "base die" would be very expensive. The SoIC (v-cache) style stacking is cheaper, but it is still an added cost and an added opportunity for something to go wrong, affecting yield and therefore cost. Hopefully something going wrong with v-cache stacking still leaves you with a usable cpu die.

For RDNA3, they are using MCD made on a cache optimized process with 16 MB infinity cache and dual channel GDDR6. The IO, cache, and logic are scaling differently, so making all of them on a different process is probably the most economical. The EFB connections are likely more expensive than the "infinity link fan out" that they came up with for connecting GCD and MCD. That tech can do something like 900 GB/s per MCD, so faster than most memory, except an HBM stack. This isn't really even 2.5D stacking. I don't think it is silicon based, so no embedded silicon die like EFB and no need to "elevate" everything else on the package.

I don't see why they wouldn't make use of MCDs in more products if the infinity link fan out is yielding well. They just need to make a version capable of driving DDR5 channels instead of GDDR6.

Some rumors have been saying that Zen 5 will use a large shared L2. I was expecting something like that for a while, but I was thinking more along the lines of 2 cores sharing an L2, allowing easy scaling to 16-core CCDs. Some rumors are saying it may be all 8 cores sharing an L2. I don't know if that leaves a hole in the cache hierarchy if it goes from L2 -> infinity cache on an MCD. The infinity link fan out should be lower latency than GMI in addition to higher bandwidth.

They may make an IO die that is just PCI Express and infinity link fan out connections for compute die, MCD die, and possibly other IO die if it isn't a monolithic IO die. I don't know how they would connect HBM into a set-up like this. I also don't know if the infinity link fan out is cheap enough to use everywhere. A monolithic APU is generally the cheapest way to go for low-end systems, but then you have to make the whole die on a very expensive process. It may be cheaper to crank out huge numbers of MCDs on an older, cheaper process for everything.

Also, the MCD seems very similar in size to the 64 MB v-cache die. If they designed them to be exactly the same size, then this could allow v-cache stacking on top of the MCD to be much cheaper than v-cache on top of a cpu die. The cpu die stacking requires a reconstituted wafer with the cache die and a bunch of filler silicon. If the dies are the same size, they could just do wafer-on-wafer and then dice. This turns the 16 MB MCD into an 80 MB MCD; 480 MB across 6 MCDs.
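
As a rough sanity check on those numbers (a back-of-the-envelope sketch; the 900 GB/s per-MCD link figure quoted earlier is a rough number, not an official spec):

```python
# Back-of-the-envelope capacity/bandwidth figures for stacking a 64 MB v-cache die on each MCD.
MCD_IC_MB = 16      # infinity cache per RDNA3 MCD
VCACHE_MB = 64      # capacity of one v-cache die
MCD_COUNT = 6       # MCDs on the top RDNA3 part
LINK_GBPS = 900     # rough per-MCD infinity link fan out bandwidth mentioned above

stacked_mcd_mb = MCD_IC_MB + VCACHE_MB            # 80 MB per stacked MCD
total_mb = stacked_mcd_mb * MCD_COUNT             # 480 MB across 6 MCDs
total_link_tbps = LINK_GBPS * MCD_COUNT / 1000    # ~5.4 TB/s aggregate GCD<->MCD bandwidth

print(f"stacked MCD: {stacked_mcd_mb} MB, total: {total_mb} MB, links: ~{total_link_tbps:.1f} TB/s")
```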
 

jamescox

Senior member
Nov 11, 2009
640
1,104
136
Now that the IO die is made on N6, an errant thought that's been bouncing around in the back of my head: why not put L3 on it and stack the CPU on top of it? For more v-cache, add the layers in between the CCD and IOD.

This would eliminate the fairly power-hungry and latency-adding interface between the IOD and the CCD, and allow making the L3 on the more economical N6 node, which is still quite competitive in SRAM density.

Of course, since it would mean every product is stacked, they would need a lot more manufacturing throughput. Probably not viable for Zen 5 yet.
I may have already said something about this. Stacking the cpu die on top has issues, since you would possibly need to pull a large amount of power through tiny TSVs. The number of TSVs may take a ridiculous amount of die area in the base die, making it unworkable. It seems more likely that MCDs would get used in more places to supplement the on-die caches. Most Zen 5 rumors say a larger, shared L2 cache, so the L3 cache may go away if it can be made up for by infinity cache in MCDs. They could also shrink it down and/or stack it with v-cache. Lower-end products may still be viable with no on-die/stacked L3 due to infinity cache. If they use the infinity link fan out (the connection used for MCD <-> GCD in RDNA3) between cpu chiplets and the IO die, then that would likely reduce power and latency significantly.
 

Joe NYC

Platinum Member
Jun 26, 2021
2,287
2,887
106
For RDNA3, they are using MCD made on a cache optimized process with 16 MB infinity cache and dual channel GDDR6. The IO, cache, and logic are scaling differently, so making all of them on a different process is probably the most economical. The EFB connections are likely more expensive than the "infinity link fan out" that they came up with for connecting GCD and MCD. That tech can do something like 900 GB/s per MCD, so faster than most memory, except an HBM stack. This isn't really even 2.5D stacking. I don't think it is silicon based, so no embedded silicon die like EFB and no need to "elevate" everything else on the package.

While the fan out is likely much cheaper, it uses micro bump connections and generates some power overhead. For RDNA 3, I recall seeing a figure in the 9 W ballpark, which can be fine in a 355 W card but not so good where a high degree of efficiency is needed (server, mobile).

Also, let's say for Zen 5 (since we are on the Zen 5 thread), the solution would have to work for the I/O die + 8-12 CCDs + whatever else will be in the package, and now you are at the limit (or beyond the limit) of the RDL packaging. I think it uses a wafer on which the whole package is assembled.

So I have some doubt this technology will be used for Zen 5.

I don't see why they wouldn't make use of MCDs in more products if the infinity link fan out is yielding well. They just need to make a version capable of driving DDR5 channels instead of GDDR6.

In RDNA3, MCDs are variable in number, and are a single-function die made on N6.

For the EPYC I/O die, while there may be differences between the Siena and Genoa sockets and in the count of memory channels, the node of the IO die is the same. It would probably be overkill to make an MCD for each memory channel of the EPYC I/O die. In Genoa, there would only be additional cost in packaging and additional power overhead (vs. a single monolithic die). No savings of any kind.

In the Siena socket, removing 2-4 memory channels would not be much of a saving. Probably none. There is a die-area overhead to make the external connections + power overhead + assembly overhead.

Some rumors have been saying that Zen 5 will use a large shared L2. I was expecting something like that for a while, but I was thinking more along the lines of 2 cores sharing an L2, allowing easy scaling to 16-core CCDs. Some rumors are saying it may be all 8 cores sharing an L2. I don't know if that leaves a hole in the cache hierarchy if it goes from L2 -> infinity cache on an MCD. The infinity link fan out should be lower latency than GMI in addition to higher bandwidth.

I think it is just a parallel link, as opposed to the serial link that is used through the substrate. So it has all of the advantages of a parallel link in latency and bandwidth, given enough connections.

The major drawback is power overhead.

I think the Zen 4 V-Cache will set the bar very high on bandwidth and latency. Anything that is NOT a hybrid bond stacked connection will be 2 steps back and 1 step forward.

Which is why I think we will need to see the two technologies I mentioned, which achieve 2.5D horizontal connections while using hybrid bonding.
 

moinmoin

Diamond Member
Jun 1, 2017
4,993
7,763
136
While the fan out is likely much cheaper, it uses micro bump connections and generates some power overhead. For RDNA 3, I recall seeing a figure in the 9 W ballpark, which can be fine in a 355 W card but not so good where a high degree of efficiency is needed (server, mobile).
The big question is whether, and with what, that power overhead scales. Does it depend on the bandwidth? That would be the ideal case for usage in CPUs as well: GPUs use much more bandwidth, so if the MCDs' power overhead is due to the high bandwidth that must be achieved on GPUs, it could be scaled back significantly. Obviously that doesn't work if the power overhead is there at idle, in which case the technology is pretty specifically suited to GPUs. Another negative is the reported increase in latency that the MCDs introduce. With GPUs and their higher emphasis on bandwidth this doesn't matter as much as it does for CPUs, where latency is actually far more important than bandwidth. So with fan out as realized with the MCDs, we may be looking at a technology AMD won't bring to CPUs in that form.
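
As a rough illustration of why the scaling question matters (toy numbers only: the 0.5 pJ/bit figure and the CPU link bandwidth below are assumptions, not measured values for AMD's links):

```python
# Rough model: link power ~= bandwidth x energy-per-bit. All figures are illustrative
# assumptions, not measured values for AMD's fan out links.
def link_power_w(bandwidth_gbytes_s, picojoules_per_bit):
    bits_per_s = bandwidth_gbytes_s * 1e9 * 8
    return bits_per_s * picojoules_per_bit * 1e-12

PJ_PER_BIT = 0.5   # assumed energy cost per bit for the fan out link

for label, bw in [("GPU GCD<->MCDs, ~5.3 TB/s aggregate", 5300),
                  ("one CPU CCD link, ~100 GB/s", 100)]:
    print(f"{label}: ~{link_power_w(bw, PJ_PER_BIT):.2f} W")

# If the cost really is per bit moved, the CPU case is ~50x cheaper; only a fixed idle
# overhead (PHY bias, clocking) would break that scaling, which is the open question above.
```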
 
Reactions: Tlh97 and Vattila

jamescox

Senior member
Nov 11, 2009
640
1,104
136
While the fan out is likely much cheaper, it uses micro bump connections and generates some power overhead. For RDNA 3, I recall seeing a figure in the 9 W ballpark, which can be fine in a 355 W card but not so good where a high degree of efficiency is needed (server, mobile).

Also, let's say for Zen 5 (since we are on the Zen 5 thread), the solution would have to work for the I/O die + 8-12 CCDs + whatever else will be in the package, and now you are at the limit (or beyond the limit) of the RDL packaging. I think it uses a wafer on which the whole package is assembled.

So I have some doubt this technology will be used for Zen 5.



In RDNA3, MCDs are variable in number, and are a single-function die made on N6.

For the EPYC I/O die, while there may be differences between the Siena and Genoa sockets and in the count of memory channels, the node of the IO die is the same. It would probably be overkill to make an MCD for each memory channel of the EPYC I/O die. In Genoa, there would only be additional cost in packaging and additional power overhead (vs. a single monolithic die). No savings of any kind.

In the Siena socket, removing 2-4 memory channels would not be much of a saving. Probably none. There is a die-area overhead to make the external connections + power overhead + assembly overhead.



I think it is just a parallel link, as opposed to the serial link that is used through the substrate. So it has all of the advantages of a parallel link in latency and bandwidth, given enough connections.

The major drawback is power overhead.

I think the Zen 4 V-Cache will set the bar very high on bandwidth and latency. Anything that is NOT a hybrid bond stacked connection will be 2 steps back and 1 step forward.

Which is why I think we will need to see the two technologies I mentioned, which achieve 2.5D horizontal connections while using hybrid bonding.
Where are you getting this 9 watt number from? Is the 9 watts supposed to be the overhead for the infinity link fan out vs. monolithic? I find that very hard to believe. Is it supposed to be the power consumption of a single MCD (cache plus memory controller)? I could possibly see that being the case, since driving IO takes a bit of power.

GDDR6 is not very low power; the driving forces behind HBM were die area for the high-speed interfaces getting too large, bandwidth needs, and the power consumption of GDDR being too high. HBM operates at a much lower clock and has much lower power consumption than high-speed GDDR. So I can believe that the MCDs pull quite a bit of power, but that wouldn't be the overhead for infinity link fan out vs. monolithic. The infinity link fan out is likely very power efficient due to the lower-clocked, more parallel connection, similar to HBM interfaces. Given the area of the interface, it can't burn that much power just from a thermal perspective; 9 watts is more than a cpu core would be consuming in many cases.

I don't think MCDs for Epyc are overkill. Zen 5 is likely a massive boost in FP performance in some manner. It is unclear how that will be accomplished, but a massive boost in FP performance requires a massive boost in bandwidth. We already have 12 DDR5 interfaces on Genoa (460 GB/s; essentially a 768-bit interface). RDNA3 seems to be 6 MCDs with a dual-channel GDDR6 interface each, which is 384 bits wide in total, but higher bandwidth.

If they have 128 or even 256 cpu cores with significantly increased floating point performance, they will likely need something like infinity cache and more memory channels to supply the necessary bandwidth. It is already a problem to require 12 memory modules just to fully populate the system. That is a lot of board area and it may not be enough going forward. How much bandwidth will 128 Zen 5 cores need compared to 96 Zen 4? That many memory controllers on a single, monolithic, IO die may be getting unreasonably large in addition to all of the other issues with that many channels. They can’t keep on scaling the number of memory interfaces. To even maintain the same bandwidth relationship as Genoa with DDR5 and scaling to 256 cores would require 32 memory channels. Using MCD to increase the effective bandwidth seems to make sense. AMD gets really good performance with only a 384-bit interface on RDNA3. It will be interesting to see how they tie HBM into these systems also.
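
For reference, the channel arithmetic behind that (a quick sketch assuming DDR5-4800, i.e. 38.4 GB/s per channel, which matches the ~460 GB/s Genoa figure above):

```python
# Rough bandwidth-per-core arithmetic for scaling Genoa-style memory to more cores.
CHANNEL_GBPS = 38.4      # one DDR5-4800 channel
GENOA_CHANNELS = 12
GENOA_CORES = 96

genoa_bw = CHANNEL_GBPS * GENOA_CHANNELS      # ~460.8 GB/s
bw_per_core = genoa_bw / GENOA_CORES          # ~4.8 GB/s per core

for cores in (128, 256):
    needed_bw = bw_per_core * cores
    needed_channels = needed_bw / CHANNEL_GBPS
    print(f"{cores} cores at Genoa's ratio: ~{needed_bw:.0f} GB/s = {needed_channels:.0f} DDR5-4800 channels")

# 256 cores -> 32 channels, which is why on-package cache (MCD-style infinity cache) or HBM
# starts to look more attractive than simply adding DRAM channels.
```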
 

BorisTheBlade82

Senior member
May 1, 2020
667
1,022
136
Each interconnect's consumption directly scales with bandwidth, so this needs to be taken into account. The 9 W figure at 1 TByte/s of bandwidth would mean 1.12 pJ/bit. So this number sounds wrong, as InFO-R is supposed to be around 0.3 pJ/bit. Where did you get that from?
Nevertheless: one IFoP connection in narrow mode at full load is 64 + 32 GByte/s with supposedly 1.5 pJ/bit, so around 1.2 W. This could be reduced to around 0.3 W with InFO-R, or used for more bandwidth (which would be nice, as it is a bottleneck for DDR5 ATM).

Please be aware that I have no idea if the pJ/bit numbers above only apply to the wiring or to the PHYs as well.
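
Those figures are easy to sanity-check (a quick sketch using the rough numbers quoted in this post):

```python
# Energy-per-bit sanity check for the figures above (inputs are the thread's own rough numbers).
def pj_per_bit(power_w, bandwidth_gbytes_s):
    return power_w / (bandwidth_gbytes_s * 1e9 * 8) * 1e12

def power_w(bandwidth_gbytes_s, pj_bit):
    return bandwidth_gbytes_s * 1e9 * 8 * pj_bit * 1e-12

print(pj_per_bit(9, 1000))       # ~1.12 pJ/bit, if 9 W really covered 1 TB/s of traffic
print(power_w(64 + 32, 1.5))     # ~1.15 W for one IFoP link in narrow mode at 1.5 pJ/bit
print(power_w(64 + 32, 0.3))     # ~0.23 W for the same traffic at an InFO-R-like 0.3 pJ/bit
```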
 
Reactions: Vattila

Exist50

Platinum Member
Aug 18, 2016
2,452
3,101
136
If they have 128 or even 256 cpu cores with significantly increased floating point performance, they will likely need something like infinity cache and more memory channels to supply the necessary bandwidth. It is already a problem to require 12 memory modules just to fully populate the system. That is a lot of board area and it may not be enough going forward. How much bandwidth will 128 Zen 5 cores need compared to 96 Zen 4? That many memory controllers on a single, monolithic, IO die may be getting unreasonably large in addition to all of the other issues with that many channels. They can’t keep on scaling the number of memory interfaces.
MCR/MR-DIMM and generational DDR5 scaling could buy them a good ~2x bandwidth for the same number of channels. That should be enough to carry them for another gen or two at least. That's just kicking the can down the road, however. Eventually it might make sense to split the IO die in half. Then they can use that single die for a similar split to what we're seeing with Genoa/Siena. It would also give them several different options to stitch the two back together.
 
Reactions: Vattila

BorisTheBlade82

Senior member
May 1, 2020
667
1,022
136
[...] Eventually it might make sense to split the IO die in half. Then they can use that single die for a similar split to what we're seeing with Genoa/Siena. It would also give them several different options to stitch the two back together.
Hey, I already trademarked the Split-IOD in this very thread 😉
 
Reactions: Kaluan

Kaluan

Senior member
Jan 4, 2022
503
1,074
106
MLID seems to think Granite Ridge is more than 2 (compute) chiplets. The main chiplet is N4X, tuned for performance (6 GHz, yadda yadda), and the other(s) on something like N3E 2-1 fin, focusing on small size/density and efficiency. Unclear if Zen 5 + Zen 5c or all Zen 5.

At first I thought meh, it's MLID, he's talking out of his nr 2 box, but then again... it's NOT not making sense.

Anyway, interesting idea, nothing too crazy, whatever merits it may have.
 
Last edited:
Reactions: BorisTheBlade82

MadRat

Lifer
Oct 14, 1999
11,922
259
126
It makes sense to do a checkerboard grid on one end, to spread the heat out from each core. Cram your cache into the middle, too, so that you can split it like Exist50 said and retain separate caches. If you could subdivide each half a second time, so it's in quadrants, even better. The factory edge of the original would be reduced to 2/3rds on a half, with 1/2 remaining on a quadrant. That would not impact interposer connections, but I imagine it would greatly reduce the available interconnects.
 

Exist50

Platinum Member
Aug 18, 2016
2,452
3,101
136
MLID seems to think Granite Ridge is more than 2 chiplets. The main chiplet is N4X, tuned for performance (6 GHz, yadda yadda), and the other(s) on something like N3E 2-1 fin, focusing on small size/density and efficiency. Unclear if Zen 5 + Zen 5c or all Zen 5.

At first I thought meh, it's MLID, he's talking out of his nr 2 box, but then again... it's NOT not making sense.

Anyway, interesting idea, nothing too crazy, whatever merits it may have.
Sounds like his usual BS to me. Why would they use N4X, a process that would be useless for mobile and server, and quite likely outperformed by N3E anyway? It's really not worth giving his ravings the time of day.
 
Reactions: uzzi38

Kaluan

Senior member
Jan 4, 2022
503
1,074
106
Sounds like his usual BS to me. Why would they use N4X, a process that would be useless for mobile and server, and quite likely outperformed by N3E anyway? It's really not worth giving his ravings the time of day.
Well, for starters:
N4X is more performant than N3E.
N4P could be an option as well.

It'll also likely be a good deal cheaper. Enough to offset the increase in die size.

Both start mass production around the end of the year.

And I highly doubt the process is "useless" for server lol
There's a very clear niche of servers that don't require hundreds of cores per board but high per-core performance.
And we know AMD is forking Zen core designs already with Zen4, same uArch but different caches, different nodes.
Wouldn't exactly be new ground for AMD.
 
Last edited:

Exist50

Platinum Member
Aug 18, 2016
2,452
3,101
136
N4X is more performant than N3E.
TSMC claims that N4X gives ~15% faster clocks than N5. Compare that to their claim that N3E is 15-20% higher performance, iso-power, than the same. Additionally, they claim N4X is only 4% faster than N4P at 1.2V (i.e. high voltage / best case). So what's leading you to conclude that N4X is more performant?

N4X's biggest selling point is a higher max voltage, but that's useless when server chips typically run in the 0.6V-1.0V range. And for that, you no doubt pay a severe leakage penalty. Iso-power, I wouldn't be surprised if N4X is worse than N4P across most of the V-F curve.
 

Kaluan

Senior member
Jan 4, 2022
503
1,074
106
TSMC claims that N4X gives ~15% faster clocks than N5. Compare that to their claim that N3E is 15-20% higher performance, iso-power, than the same.
Not the density optimized N3E variant tho from what I can tell. Hmm

But I suppose the node specifics don't matter right now; I was primarily focusing on the concept of a Zen 5 "master" chiplet at 70-80mm² (tuned for high performance / lightly threaded work) and 1 or 2 "slave" chiplets at 50-60mm² that focus on multithreaded throughput and efficiency.
 

Tigerick

Senior member
Apr 1, 2022
686
574
106

Timorous

Golden Member
Oct 27, 2008
1,723
3,124
136
So with the dual-CCD Zen 4 X3D parts being wonky designs, what are the chances AMD will do 8c Zen 5 with v-cache plus 16c Zen 5c without v-cache as their top-tier 8950X(3D?) part, to help get back the undisputed multithreaded crown?
 

Anhiel

Member
May 12, 2022
68
26
51
I think with the recent low-sales problem it's reasonable to expect AMD will be more cautious about producing costly chiplets for consumers. So I think it's likely the 8-core chiplet configuration will be kept; I doubt there will be a 16-core chiplet for consumers.
An N4 shrink and Zen 5(c) would make it possible to squeeze 3 chiplets into roughly the same space (+30%) as previously 2, if the long sides are rotated 90 degrees...
The key difference will be 2nd-gen XDNA. I think it's more likely to be its own chiplet, so it's more likely we get 2 Zen 5 chiplets and one XDNA chiplet. Given what we now know about MI300, its performance could be estimated at 1/3.
 