Sounds reasonable, other than stating that the memory controller is only on the first chiplet. There is a memory controller in each chiplet, and each one serves memory access requests for a portion of the address space on behalf of all the chiplets (routed through the shared L3). What is exclusive to the first chiplet is the PCIe interface to the CPU, which handles CPU memory access requests.
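To make the interleaving concrete, here is a minimal sketch of how striping the physical address space across per-chiplet memory controllers could look. The chiplet count, stripe size, and names are all my assumptions, not any vendor's actual scheme:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical parameters: 4 chiplets, 256-byte interleave stripes.
constexpr uint64_t kNumChiplets     = 4;
constexpr uint64_t kInterleaveShift = 8;  // 2^8 = 256-byte stripes

// Route a physical address to the chiplet whose memory controller
// owns that stripe of the address space.
uint64_t OwningChiplet(uint64_t physAddr) {
    return (physAddr >> kInterleaveShift) % kNumChiplets;
}

int main() {
    // Consecutive stripes rotate across chiplets, so bandwidth from a
    // linear access pattern spreads over every memory controller.
    for (uint64_t addr : {0x0000ull, 0x0100ull, 0x0200ull, 0x0300ull, 0x0400ull}) {
        std::printf("addr 0x%04llx -> chiplet %llu\n",
                    (unsigned long long)addr,
                    (unsigned long long)OwningChiplet(addr));
    }
}
```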
Regarding checkerboard pixel processing, it is very similar to how current monolithic GPUs contain multiple rasterizers, each processing a subset of screen space. Large triangles go to multiple rasterizers; smaller ones might fit in a single one (with proper alignment).
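As an illustration of the checkerboard idea (tile size, grid layout, and names all hypothetical), mapping a pixel's tile to an owning rasterizer and broadcasting a triangle's bounding box might look roughly like this:

```cpp
#include <cstdint>
#include <cstdio>
#include <set>

// Hypothetical parameters: 32x32-pixel tiles checkerboarded across a
// 2x2 grid of rasterizers (e.g. two chiplets with two rasterizers each).
constexpr uint32_t kTileSize   = 32;
constexpr uint32_t kGridWidth  = 2;
constexpr uint32_t kGridHeight = 2;

// Map a pixel to the rasterizer that owns its screen-space tile.
uint32_t OwningRasterizer(uint32_t x, uint32_t y) {
    uint32_t tx = (x / kTileSize) % kGridWidth;
    uint32_t ty = (y / kTileSize) % kGridHeight;
    return ty * kGridWidth + tx;
}

// Which rasterizers does a triangle's screen-space bounding box touch?
std::set<uint32_t> RasterizersForBBox(uint32_t x0, uint32_t y0,
                                      uint32_t x1, uint32_t y1) {
    std::set<uint32_t> owners;
    for (uint32_t ty = y0 / kTileSize; ty <= y1 / kTileSize; ++ty)
        for (uint32_t tx = x0 / kTileSize; tx <= x1 / kTileSize; ++tx)
            owners.insert(OwningRasterizer(tx * kTileSize, ty * kTileSize));
    return owners;
}

int main() {
    // A large triangle spans all four rasterizers; a small, tile-aligned
    // one lands in exactly one.
    std::printf("large bbox touches %zu rasterizers\n",
                RasterizersForBBox(0, 0, 200, 200).size());
    std::printf("small bbox touches %zu rasterizers\n",
                RasterizersForBBox(4, 4, 20, 20).size());
}
```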
Regarding synchronization, the GPU pipeline is already mostly unsynchronized, with a lot of parallel work happening on the fly. The majority of synchronization happens per-pixel, where the ordering of pixel blending is important, but with checkerboard partitioning that synchronization happens within each chiplet (actually, within each of the multiple rasterizers per chiplet). When cross-pixel synchronization does need to happen (e.g. render-to-texture), some GPU commands might indeed stall, but asynchronous compute tasks might fill in the gap.
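Conceptually, that per-pixel ordering can be enforced with something like a per-tile reorder buffer: since every tile has exactly one owning rasterizer, no cross-chiplet handshake is needed for blending. This is just a software model of the idea, with made-up names, not how any real GPU implements it:

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// A fragment tagged with the API submission order of its source primitive.
struct Fragment {
    uint64_t primSeq;   // global primitive submission order
    uint32_t x, y;
    float    color[4];
};

struct SeqOrder {
    bool operator()(const Fragment& a, const Fragment& b) const {
        return a.primSeq > b.primSeq;  // min-heap: oldest primitive on top
    }
};

// Per-tile reorder buffer: fragments may arrive out of order from the
// shader cores, but blending is released strictly in primitive order.
// Because each tile belongs to exactly one rasterizer/chiplet, this
// ordering is enforced locally, without cross-chiplet synchronization.
class TileBlendQueue {
public:
    void Submit(const Fragment& f) { pending_.push(f); }

    // Release every fragment whose predecessors have all retired.
    void Drain(uint64_t retiredUpTo, std::vector<Fragment>& out) {
        while (!pending_.empty() && pending_.top().primSeq <= retiredUpTo) {
            out.push_back(pending_.top());
            pending_.pop();
        }
    }
private:
    std::priority_queue<Fragment, std::vector<Fragment>, SeqOrder> pending_;
};

int main() {
    TileBlendQueue q;
    q.Submit({2, 0, 0, {1, 0, 0, 1}});  // fragment from triangle #2 arrives first
    q.Submit({1, 0, 0, {0, 1, 0, 1}});
    std::vector<Fragment> ready;
    q.Drain(/*retiredUpTo=*/2, ready);  // released as #1 then #2
}
```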
I wonder how per-vertex (or per-primitive, or per-patch) processing will be distributed, though. Each triangle must (unless out-of-order rasterization is enabled) be rasterized in order, so that blending happens in order at the end of the pipeline. Maybe binning (where triangles are distributed among screen-space buckets) will make a comeback? Or there will be a different process to sort primitives from multiple chiplets before rasterization? They already must deal with this issue with multi-SA monolithic GPUs, so the shared L3 should help...
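If binning does come back, the key trick would be stamping each primitive with a global sequence number before distribution, so each screen-space bucket can restore API order locally. A rough sketch, with all sizes and names made up:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Triangle {
    uint64_t seq;           // API submission order, stamped before distribution
    uint32_t minX, minY;    // screen-space bounding box
    uint32_t maxX, maxY;
};

constexpr uint32_t kBinSize    = 64;  // hypothetical 64x64-pixel bins
constexpr uint32_t kBinsPerRow = 16;  // hypothetical 1024-pixel-wide target

// Scatter triangles (possibly shaded on different chiplets) into
// screen-space bins, then restore API order inside each bin so the
// downstream rasterizer/blender sees primitives in order.
std::vector<std::vector<Triangle>> BinAndSort(const std::vector<Triangle>& tris,
                                              uint32_t numBins) {
    std::vector<std::vector<Triangle>> bins(numBins);
    for (const Triangle& t : tris) {
        for (uint32_t by = t.minY / kBinSize; by <= t.maxY / kBinSize; ++by)
            for (uint32_t bx = t.minX / kBinSize; bx <= t.maxX / kBinSize; ++bx) {
                uint32_t bin = by * kBinsPerRow + bx;
                if (bin < numBins) bins[bin].push_back(t);
            }
    }
    // Per-bin sort by the global sequence number recovers submission
    // order without any cross-bin or cross-chiplet synchronization.
    for (auto& bin : bins)
        std::sort(bin.begin(), bin.end(),
                  [](const Triangle& a, const Triangle& b) { return a.seq < b.seq; });
    return bins;
}

int main() {
    std::vector<Triangle> tris = {
        {0, 10, 10, 20, 20},    // small: lands in one bin
        {1, 100, 10, 300, 50},  // wide: spans several bins
    };
    auto bins = BinAndSort(tris, kBinsPerRow * kBinsPerRow);
    std::printf("bin 0 holds %zu triangle(s)\n", bins[0].size());
}
```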