- But is it going to be ready for a top-to-bottom RDNA2 stack? This type of radical technological shift looks like a prime candidate for a pipe-cleaner product or a mid-gen refresh, not a top-to-bottom stack launch.
Wonder if this is the kind of thing that's being kept in the pipe for an RDNA3 launch or even further down the line.
After all the promises about new pathways, discard accelerators, etc. in Vega and Polaris, I would be less surprised if AMD managed to bork the physical design so the feature is useless than if it shipped working.
They do need some outside-of-the-box solution to compete with Nvidia. I don’t think they will compete well if they just make a bigger GPU. They are increasing efficiency significantly, but that seems necessary rather than sufficient. They may have done better than Nvidia on process, though, with TSMC vs. Samsung.
The cache rumors are still a bit odd. Such a cache may help significantly with ray tracing, though. If they are saying it will perform like it has a 384-bit bus with only 256 bits, then it seems like it would need something like 4 IFOP-style links to deliver that kind of bandwidth, given the speed of GDDR6. I guess it could actually be 4x single-link devices, which would be very interesting, but expensive. That might still be cheap though, especially if it was made at GF or something.
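Quick back-of-envelope on that bandwidth gap, just as a sanity check. The 16 Gbps GDDR6 rate and the 50-64 GB/s per-link figures are my own assumptions (ballpark of Zen 2's on-package links), not anything from the rumor:

```python
# Rough check on the "performs like 384-bit from a 256-bit bus" idea.
# Assumptions are mine: 16 Gbps GDDR6 signaling, and roughly 50-64 GB/s
# per IFOP-style link per direction (not a published spec).

GDDR6_GBPS_PER_PIN = 16  # assumed signaling rate per pin

def gddr6_bandwidth_gbs(bus_bits, gbps_per_pin=GDDR6_GBPS_PER_PIN):
    """Peak GDDR6 bandwidth in GB/s for a given bus width."""
    return bus_bits * gbps_per_pin / 8

bw_384 = gddr6_bandwidth_gbs(384)   # 768 GB/s
bw_256 = gddr6_bandwidth_gbs(256)   # 512 GB/s
gap = bw_384 - bw_256               # 256 GB/s the cache/links would have to cover

print(f"384-bit: {bw_384:.0f} GB/s, 256-bit: {bw_256:.0f} GB/s, gap: {gap:.0f} GB/s")
for link_gbs in (50, 64):           # assumed per-link rates
    print(f"at {link_gbs} GB/s per link: ~{gap / link_gbs:.1f} links to cover the gap")
```

At those assumed per-link rates you land at roughly 4 to 5 links, which is why 4 IFOP-style links seems like the right order of magnitude.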
This had me wondering how such a device could be reused with Epyc processors. For Epyc, the most sense would be as an additional chip that sits between the IO die and the CPU dies without really needing to change either one. If they could fit such a cache chip on either side of the Epyc IO die, they could have 2x 128 MB transparent L4 caches per Epyc package, provided the cache chip has 8 IFOP links rather than just 4: four connecting to the IO die and the other four to the CPU dies. The problem is, such a device would be quite large, possibly 200 to 250 square mm or so, which probably wouldn’t fit with the existing Epyc package layout.
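For what it's worth, a very rough area estimate lands in that range. All of these density figures are my assumptions: ~1 square mm per MB of 7nm SRAM including tags/control (roughly in line with Zen 2's L3 footprint), ~5 square mm per IFOP-style PHY, and ~30% overhead for fabric, power delivery and misc.

```python
# Very rough sanity check of the 200-250 square mm guess for a 128 MB cache die.
# All density figures below are assumptions, not published numbers.

MM2_PER_MB_SRAM = 1.0    # assumed 7nm SRAM density incl. tags/control
MM2_PER_IFOP_PHY = 5.0   # assumed per-link PHY area
OVERHEAD = 1.3           # assumed fabric/power/misc multiplier

def cache_die_area_mm2(cache_mb, num_links):
    """Estimated die area for a cache chip with the assumptions above."""
    return (cache_mb * MM2_PER_MB_SRAM + num_links * MM2_PER_IFOP_PHY) * OVERHEAD

print(f"128 MB, 4 links: ~{cache_die_area_mm2(128, 4):.0f} mm^2")
print(f"128 MB, 8 links: ~{cache_die_area_mm2(128, 8):.0f} mm^2")
```

That comes out around 190-220 square mm, so the 8-link version really would be a big die for what is mostly SRAM.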
This is wild speculation, but it led me to wonder if it could be a “pipe-cleaning” stacked device. With Zen 4, they would want to move to an active interposer for the IO die, with the CPU and possibly memory stacked on top. That is a big change to do all at once. Could this cache device be a precursor, with 8x IFOP in one layer and 128 MB of cache stacked on top? Maybe later the cache die stacks with the CPU dies. That would probably fit on an existing Epyc package with essentially the same layout. They could make some without cache and some with cache, and Intel wouldn’t even compete with the cheaper parts that have no L4.
This is continuing wild speculation, but if an RDNA GPU has 4 IFOP links, then they could technically connect two GPUs together directly, or with one of these caches in between, with something like 150 to 200 GB/s or so in each direction. There have been some Infinity Architecture slides showing CDNA GPUs with what looks like 6 links, connecting up to 8 GPUs to each other. That isn’t actually fully connected, but the slide may not be representative, or they may simply not support a fully connected topology with 8 GPUs. It may be possible to connect GPUs with IFOP on the same board. The dies used in current AMD MCMs are really just BGA packages, which is why they shouldn’t be called chiplets; chiplets should be reserved for devices on silicon interposers. For HPC, I could see them mounting 4 to 8 HBM GPUs very close to each other, with IFOP connecting adjacent GPUs and IFIS-style links for the longer runs. You would definitely need water cooling, but a lot of HPC has already gone to water cooling. IFIS links wouldn’t be quite as fast or power efficient as IFOP, but they would allow multiple GPUs on the same board with larger spacing between them.
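Quick check on the topology point: a full mesh of N GPUs needs N-1 links per GPU, so 6 links per GPU cannot fully connect 8 GPUs. The per-link bandwidth below is the same ~50 GB/s per direction assumption as above, not a spec.

```python
# Link-count and bandwidth sketch for the multi-GPU speculation above.

ASSUMED_GBS_PER_LINK = 50   # assumed per-link, per-direction bandwidth

def links_for_full_mesh(num_gpus):
    """Links each GPU needs for a direct connection to every other GPU."""
    return num_gpus - 1

for n in (4, 8):
    print(f"{n} GPUs fully connected: {links_for_full_mesh(n)} links per GPU")

print(f"4 links per GPU: ~{4 * ASSUMED_GBS_PER_LINK} GB/s each direction to neighbors")
```

So 4 GPUs could be fully connected with only 3 of the 4 links, while 8 GPUs would need 7 links each, which fits with the slides showing something less than a full mesh.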
If they are using it in their CDNA architecture, then it wouldn’t be much of a stretch for it to show up in consumer cards if they can figure out how to make good use of it. Multi-GPU works fine for some compute applications; I have seen cases where 2 GPUs, each with half the compute and half the memory bandwidth, perform almost the same as one large GPU. That doesn’t necessarily work well for rendering. Although, at those link speeds, they could do unified memory and some other things that might be interesting. AMD has fully virtualized memory for their GPUs, similar to CPU memory, which should facilitate sharing.
This brings me back around to wondering if the 128 MB cache rumor is just someone misunderstanding something. We seem to only have one source for it. Could it actually be a 128-bit infinity fabric connection (4x IFOP) rather than 128 MB of “infinity cache”? The whole infinity cache thing could just be made up, or it could refer to something completely different, like sharing memory across infinity fabric.
I think this has been my journey for this weekend. I have things to do.