Yeah, reading this 'news' it looks like on-die L3$ will wind up going the way of the dodo in a few node cycles. So stacked L3$ will rule the day then. I wonder if memory-side caches will come into play to alleviate some of the latency issues in the whole memory system (or an L4$ on the IOD). I don't know if memory-side caches need to be snoopable; it seems like it wouldn't be worth it if that were the case.
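On the snooping question, the usual argument is that a memory-side cache sits behind the point of coherence, in front of DRAM, so it never needs a snoop path at all. A toy sketch of that placement, purely illustrative and not specific to any AMD design (all class names and sizes here are made up):

```python
# Toy sketch: why a memory-side cache doesn't need to be snooped. A CPU-side
# cache can hold dirty lines that other agents must be able to find, so it
# needs an invalidate/snoop path. A memory-side cache sits behind the point of
# coherence, in front of DRAM: every request that reaches it has already been
# resolved against the CPU caches upstream, so it has no snoop interface.

class CpuSideCache:
    def __init__(self):
        self.lines = {}                      # addr -> (data, dirty)

    def snoop_invalidate(self, addr):
        # Another agent wants the line: the coherence protocol must reach in.
        return self.lines.pop(addr, None)

class MemorySideCache:
    def __init__(self):
        self.lines = {}                      # addr -> data, mirrors DRAM contents

    def read(self, addr, dram):
        if addr not in self.lines:
            self.lines[addr] = dram[addr]    # fill on miss
        return self.lines[addr]

    def writeback(self, addr, data, dram):
        # Writebacks arriving here were already ordered by the coherence point.
        self.lines[addr] = data
        dram[addr] = data
    # Note: no snoop_invalidate() here. It only holds what DRAM would hold, and
    # any newer dirty copy in a CPU cache is found *before* a request gets here.

dram = {0x100: "A"}
cpu = CpuSideCache()
cpu.lines[0x100] = ("A'", True)
cpu.snoop_invalidate(0x100)        # coherence can reach into the CPU-side cache
msc = MemorySideCache()
print(msc.read(0x100, dram))       # miss -> fill from DRAM
print(msc.read(0x100, dram))       # hit, no coherence traffic in either case
```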
Stacking may be it, while 3D stacked, using hybrid bond, is the only way to go.
But there are 2 possible approaches in the future that would be 2.5D (horizontal): AMD mentioned EFB with hybrid bond, and TSMC mentioned SoIC_H, which would be an interposer with chiplets attached via hybrid bond.
Once either or both of these technologies make it to production, then L3 on die or stacked could migrate to a separate SRAM stack, stacked either on top of the memory controller or the I/O die. And then the SRAM can work as a system level cache, shared between different CPU chiplets or also graphics.
Those are not the only way to go, and that doesn't seem to be what AMD is doing. Large interposers or "base die" would be very expensive. The SoIC (v-cache) style stacking is cheaper, but still an added cost and an added opportunity for something to go wrong, affecting yield and therefore cost. Hopefully something going wrong with v-cache stacking still leaves you with a usable cpu die.
Now that the io-die is made on N6, an errant thought that's been bouncing around in the back of my head: why not put L3 on it and stack the CPU on it? For more v-cache, add the layers in between the ccd and iod.
This would eliminate the fairly power-hungry and latency-adding interface between the iod and the ccd, and allow making the L3 on the more economical N6 node, which is still quite competitive in SRAM density.
Of course, since it would mean every product is stacked, they would need a lot more manufacturing throughput. Probably not viable for Zen 5 yet.
I may have already said something about this. Stacking the cpu die on top has issues, since you would need to pull a possibly large amount of power through tiny TSVs. The number of TSVs may take a ridiculous amount of die area in the base die, making it unworkable. It seems more likely that the MCD would get used in more places to supplement the on-die caches. Most Zen 5 rumors say larger and shared L2 cache, so the L3 cache may go away if it can be made up for by infinity cache in MCDs. They could also shrink it down and/or stack it with v-cache. Lower end products may still be viable with no on-die/stacked L3 due to infinity cache. If they use the infinity link fan out (the connection used for MCD <-> GCD in RDNA3) between cpu chiplets and the IO die, then that would likely reduce power and latency significantly.
For RDNA3, they are using MCDs made on a cache-optimized process, each with 16 MB of infinity cache and dual-channel GDDR6. The IO, cache, and logic are scaling differently, so making each of them on a different process is probably the most economical. The EFB connections are likely more expensive than the "infinity link fan out" that they came up with for connecting the GCD and MCDs. That tech can do something like 900 GB/s per MCD, so faster than most memory, except an HBM stack. This isn't really even 2.5D stacking. I don't think it is silicon based, so no embedded silicon die like EFB and no need to "elevate" everything else on the package.
I don't see why they wouldn't make use of the MCD in more products if the infinity link fan out is yielding well. They just need to make a version capable of driving DDR5 channels instead of GDDR6.
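Rough numbers behind that, as a sanity check. The dual-channel (64-bit) GDDR6 interface per MCD and the ~900 GB/s fanout-link figure come from the posts above; the 20 Gbps GDDR6 pin rate and DDR5-4800 are my assumptions for the arithmetic:

```python
# Back-of-envelope per-MCD bandwidth: RDNA3-style 64-bit GDDR6 at an assumed
# 20 Gbps/pin vs a hypothetical DDR5 MCD driving one 64-bit DDR5-4800 channel.
# The ~900 GB/s is the infinity link fanout figure quoted above.

def bus_gb_s(width_bits, gbps_per_pin):
    return width_bits * gbps_per_pin / 8          # GB/s

gddr6_per_mcd = bus_gb_s(64, 20.0)    # ~160 GB/s (assumed 20 Gbps pins)
ddr5_per_mcd  = bus_gb_s(64, 4.8)     # ~38 GB/s  (assumed DDR5-4800)
fanout_link   = 900.0                 # GB/s, figure quoted above

print(f"GDDR6 MCD, memory side : {gddr6_per_mcd:.0f} GB/s")
print(f"DDR5 MCD, memory side  : {ddr5_per_mcd:.1f} GB/s")
print(f"fanout link per MCD    : {fanout_link:.0f} GB/s")
print(f"6 MCDs of GDDR6, total : {6 * gddr6_per_mcd:.0f} GB/s")
```

On those assumptions a link of that class is already heavily over-provisioned for the GDDR6 version, and would have enormous headroom for a DDR5 MCD, so bandwidth doesn't look like the limiting factor for reusing it.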
Some rumors have been saying that Zen 5 will use a large shared L2. I was expecting something like that for a while, but I was thinking more along the lines of 2 cores sharing an L2 allowing easy scaling to 16 core CCDs. Some rumors are saying it may be all 8 cores sharing an L2. I don't know if that leaves a hole in the cache hierarchy if it goes from L2 -> infinity cache on an MCD. The infinity link fan out should be lower latency than GMI in addition to high bandwidth.
While the fan out is likely much cheaper, it uses micro bump connections and generates some power overhead. For RDNA 3, I recall seeing a 9 Watt ballpark. Which can be fine in a 355 Watt card, but not so good where a high degree of efficiency is needed (server, mobile).
The big question is whether, and with what, that power overhead scales. Does it depend on the bandwidth? That would be the ideal case for usage in CPUs as well, since GPUs use much more bandwidth; if the MCDs' power overhead is due to the high bandwidth that must be achieved on GPUs, it could be significantly scaled back. Obviously that doesn't work if the power overhead is there at idle, in which case the technology is pretty specifically suited to GPUs. Another negative is the reported increase in latency that MCDs introduce. With GPUs and their higher emphasis on bandwidth this doesn't matter as much as for CPUs, where latency is actually far more important than bandwidth. So with fan out as realized with the MCDs, we may be looking at a technology AMD won't bring to CPUs in that form.
While the fan out is likely much cheaper, it uses micro bump connections and generates some power overhead. For RDNA 3, I recall seeing a 9 Watt ballpark. Which can be fine in a 355 Watt card, but not so good where a high degree of efficiency is needed (server, mobile).
Where are you getting this 9 watt number from? Is the 9 watts supposed to be the overhead for the infinity link fan out vs. monolithic? I find that very hard to believe. Is it supposed to be the power consumption of a single MCD (cache plus memory controller)? I could possibly see that being the case, since driving IO takes a bit of power. GDDR6 is not very low power; the driving forces behind HBM were the die area for the high speed interfaces getting too large, bandwidth needs, and the power consumption of GDDR being too high. HBM operates at a much lower clock and is much lower power than high speed GDDR. So I can believe that the MCDs pull quite a bit of power, but that wouldn't be the overhead for infinity link fan out vs. monolithic. The infinity link fan out is likely very power efficient due to the lower clocked, more parallel connection, similar to HBM interfaces. Given the area of the interface, it can't burn that much power just from a thermal perspective. Nine watts is more than a cpu core would be consuming in many cases.
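To put some arithmetic on the "does it scale with bandwidth" question from the last two posts: for a parallel link, power is roughly energy-per-bit times the traffic actually moved. The pJ/bit values below are placeholders of my own, not figures AMD has published as far as I know; the 9 W and 900 GB/s numbers are the ones quoted above:

```python
# Link power ~= energy per bit * bits actually transferred per second.
# The pJ/bit numbers are illustrative placeholders, not official figures.

def link_watts(sustained_gb_per_s, pj_per_bit):
    bits_per_s = sustained_gb_per_s * 8e9
    return bits_per_s * pj_per_bit * 1e-12          # watts

# If the whole 6-MCD card really burned ~9 W in the links, that's 1.5 W each.
# How much sustained traffic would 1.5 W correspond to at a given energy/bit?
for pj in (0.4, 0.8, 1.5):                          # assumed energy per bit
    gb_s = 1.5 / (pj * 1e-12) / 8e9
    print(f"{pj} pJ/bit -> 1.5 W per MCD link = {gb_s:.0f} GB/s sustained")

# And the flip side: per-link power at GPU-class vs CPU-class traffic levels.
print(f"{link_watts(900, 0.8):.1f} W per link at 900 GB/s (0.8 pJ/bit assumed)")
print(f"{link_watts(100, 0.8):.2f} W per link at ~100 GB/s of CPU-ish traffic")
```

Which is the crux of it: if the overhead is per-bit, it shrinks a lot at CPU-class bandwidth; if it's a fixed PHY or idle cost, it doesn't.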
Also, let's say for Zen 5 (since we are on the Zen 5 thread), the solution would have to work for I/O + 8-12 CCDs + whatever else will be in the package, and now you are at the limit (or beyond the limit) of the RDL packaging. I think it is using a wafer on which the whole package is assembled.
So I have some doubt this technology will be used for Zen 5.
In RDNA3, the MCDs are variable in number, and are a single-function die that is on N6.
For the EPYC I/O die, while there may be differences between the Siena and Genoa sockets and in the count of memory channels, the node of the IO die is the same. It would probably be overkill to make an MCD for each memory channel of the EPYC I/O die. In Genoa, there would only be additional cost in packaging and additional power overhead (vs. a single monolithic die). No savings of any kind.
In the Siena socket, removing 2-4 memory channels would not be much of a saving. Probably none. There is a die area overhead to make the external connections + power overhead + assembly overhead.
I think it is just a parallel link, as opposed to the serial link that is used through the substrate. So it has all of the advantages of a parallel link in latency and bandwidth, given enough connections.
The major drawback is power overhead.
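Just to make the parallel-vs-serial trade concrete; the wire counts and per-wire rates below are invented for illustration, not actual GMI or fanout specs:

```python
# Aggregate bandwidth is just wires * per-wire rate. A fanout-style parallel
# link uses many slow wires (no SerDes, so little serialization latency); a
# substrate SerDes link uses few fast wires. Numbers are purely illustrative.

def link_gb_s(wires, gbps_per_wire):
    return wires * gbps_per_wire / 8

parallel = link_gb_s(wires=1024, gbps_per_wire=4)    # wide and slow (assumed)
serial   = link_gb_s(wires=16,   gbps_per_wire=32)   # narrow and fast (assumed)

print(f"parallel fanout-style link: {parallel:.0f} GB/s")   # 512 GB/s
print(f"serial substrate link     : {serial:.0f} GB/s")     # 64 GB/s
# The wide link wins on bandwidth and skips the serialize/deserialize latency,
# at the cost of needing dense routing (hence fanout or interposer) and, per
# the post above, some power overhead for all those micro-bump drivers.
```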
I think the Zen 4 V-Cache will set the bar very high on bandwidth and latency. Anything that is NOT a hybrid bond stacked connection will be 2 steps back and 1 step forward.
Which is why I think we will need to see the two technologies I mentioned that achieve 2.5D horizontal connections while using hybrid bond.
If they have 128 or even 256 cpu cores with significantly increased floating point performance, they will likely need something like infinity cache and more memory channels to supply the necessary bandwidth. It is already a problem to require 12 memory modules just to fully populate the system. That is a lot of board area and it may not be enough going forward. How much bandwidth will 128 Zen 5 cores need compared to 96 Zen 4 cores? That many memory controllers on a single monolithic IO die may be getting unreasonably large, in addition to all of the other issues with that many channels. They can't keep on scaling the number of memory interfaces.
MCR/MR-DIMM and generational DDR5 scaling could buy them a good ~2x bandwidth for the same number of channels. That should be enough to carry them for another gen or two at least. That's just kicking the can down the road, however. Eventually it might make sense to split the IO die in half. Then they can use that single die for a similar split to what we're seeing with Genoa/Siena. It would also give them several different options to stitch the two back together.
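A quick sense of scale for that bandwidth question. The 12 channels of DDR5-4800 and the core counts are from the discussion above; the "~2x per channel" case is the speculative MCR/MR-DIMM doubling, not a product spec:

```python
# Per-core DRAM bandwidth for a 12-channel socket. One 64-bit DDR5-4800
# channel is 38.4 GB/s; the doubled rate below is the hypothetical ~2x from
# faster DIMMs mentioned above.

def socket_gb_s(channels, mt_per_s, bus_bytes=8):
    return channels * mt_per_s * bus_bytes / 1000

genoa_like = socket_gb_s(12, 4800)            # ~461 GB/s
print(f"12ch DDR5-4800        : {genoa_like:.0f} GB/s, "
      f"{genoa_like / 96:.1f} GB/s per core at 96 cores")
print(f"same socket, 128 cores: {genoa_like / 128:.1f} GB/s per core")

doubled = socket_gb_s(12, 9600)               # hypothetical ~2x per channel
print(f"~2x per channel       : {doubled:.0f} GB/s, "
      f"{doubled / 128:.1f} GB/s per core at 128 cores")
```

On those numbers, the ~2x from faster DIMMs roughly restores today's per-core bandwidth even at 128 cores, which is the "carry them another gen or two" point; past that, more channels or more cache (infinity-cache style) has to make up the difference.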
Hey, I already trademarked the Split-IOD in this very thread 😉
MLID seems to think Granite Ridge is more than 2 chiplets. The main chiplet is N4X, tuned for performance (6 GHz, yadda yadda), and the other(s) on something like N3E 2-1 Fin, focusing on small size/density and efficiency. Unclear if Zen 5 + Zen 5c or all Zen 5.
At first I thought meh, it's MLID, he's talking out of his nr 2 box, but then again... it's NOT not making sense.
Anyway, interesting idea, nothing too crazy, whatever merits it may have.
Sounds like his usual BS to me. Why would they use N4X, a process that would be useless for mobile and server, and quite likely outperformed by N3E anyway? It's really not worth giving his ravings the time of day.
Well, for starters: they did not specify which Zen 5 designs will utilise N3 at all.
N4X is more performant than N3E.
TSMC claims that N4X gives ~15% faster clocks than N5. Compare that to their claim that N3E is 15-20% higher performance, iso-power, than the same. Additionally, they claim N4X is only 4% faster than N4P at 1.2V (i.e. high voltage / best case). So what's leading you to conclude that N4X is more performant?
Not the density-optimized N3E variant though, from what I can tell. Hmm.
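Laying the quoted TSMC claims side by side, normalized to N5; these are just the marketing numbers from the post above, nothing measured:

```python
# Relative performance at iso-power, normalized to N5 = 1.00, using only the
# TSMC claims quoted above. Pure arithmetic on marketing numbers.

n5  = 1.00
n4x = n5 * 1.15                              # "~15% faster clocks than N5"
n3e_low, n3e_high = n5 * 1.15, n5 * 1.20     # "15-20% higher performance"

print(f"N4X : {n4x:.2f}x N5")
print(f"N3E : {n3e_low:.2f}x - {n3e_high:.2f}x N5")
# On these numbers N3E matches or beats N4X, which is the point being made;
# the open question in the reply above is whether that still holds for the
# density-optimized (2-1 fin) N3E flavor.
```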
So, with the dual-CCD Zen 4 X3D parts being wonky designs, what are the chances AMD will do 8c Zen 5 with v-cache and 16c Zen 5c without v-cache as their top-tier 8950X(3D?) part, to help get back the undisputed multithreaded crown?
We don't even know if Zen5 works the same way yet.