Question Speculation: RDNA3 + CDNA2 Architectures Thread



Gideon

Golden Member
Nov 27, 2007
1,714
3,938
136
With the Computex announcements of AMD using stacked SRAM on their CPUs, I'm wondering what implications this has for Infinity Cache in future AMD GPUs. It seems like they'll be able to scale the size of Infinity Cache considerably, or even use stacking to produce chips with a smaller die size without sacrificing cache size.
I think the rumors about 2x 80CU chiplets with stacked cache on top of them make a ton more sense for RDNA3 now; they could easily fit at least 256-512MB of L3 there.
 
Reactions: Joe NYC

GodisanAtheist

Diamond Member
Nov 16, 2006
7,070
7,492
136
I think the rumors about 2x 80CU chiplets with stacked cache on top of them make a ton more sense for RDNA3 now; they could easily fit at least 256-512MB of L3 there.

-Mother of god. If they can get that thing to scale, then the next couple of years are going to be a helluva thing, and we're going back to the old-school generational doubling all the old timers loved (hated?) again.

Good time to be into computer architecture, not a great time to actually try and buy one. There truly is no hell but the one we make for ourselves...
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,868
3,419
136
I think the rumors about 2x 80CU chiplets with stacked cache on top of them make a ton more sense for RDNA3 now; they could easily fit at least 256-512MB of L3 there.
You won't have cache on top, that would be terrible for thermals; you would have cache on the bottom and the GPU on top. Cache on top for the Zen 3 CCD only works because of the layout of the CCD.
 

Gideon

Golden Member
Nov 27, 2007
1,714
3,938
136
You won't have cache on top, that would be terrible for thermals; you would have cache on the bottom and the GPU on top. Cache on top for the Zen 3 CCD only works because of the layout of the CCD.
The patent had both versions, cache on the bottom and on top. I don't see any problem having it on top, as it only covers a small area of the compute chiplets themselves. It might just sit on top of the L3 in the compute chiplet, just as in Zen (and there is bound to be some L3 there, as it's slightly more efficient power-wise).

Here is the patent itself:
The relevant text starts at the very end of page 3, paragraphs [0025] and [0029].

Here is the top picture from the patent:

And the bottom one:

118 - the active bridge chiplet
106 - the compute chiplets
402 - a carrier wafer bonded to the chiplets

I do agree that having it at the bottom (FIG. 3) might make more sense memory-access-wise, but it certainly isn't a "free lunch": not only does the bridge chiplet now need TSVs (through-silicon vias), but you also need a ton of TDVs (through-dielectric vias) in the "filler" silicon to connect to the solder balls below, as explained in the patent.

 
Last edited:

Bigos

Member
Jun 2, 2019
138
322
136
Sounds reasonable, other than stating that the memory controller is only on the first chiplet. The memory controller is in each chiplet, and each one serves memory access requests for a portion of the address space from all of the chiplets (routed from the shared L3). What is in the first chiplet, though, is the PCIe interface with the CPU that handles CPU memory access requests.
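
A toy sketch of that kind of address-space striping across per-chiplet memory controllers (the two-chiplet count and the 4 KiB interleave granularity are purely illustrative assumptions, not anything AMD has disclosed):

Code:
# Purely illustrative: route a physical address to one of N chiplet memory
# controllers by interleaving the address space at 4 KiB granularity.
NUM_CHIPLETS = 2
INTERLEAVE_BYTES = 4096

def owning_chiplet(address: int) -> int:
    """Return which chiplet's memory controller services this address."""
    return (address // INTERLEAVE_BYTES) % NUM_CHIPLETS

# Consecutive 4 KiB blocks alternate between the two chiplets.
for addr in (0x0000, 0x1000, 0x2000, 0x3000):
    print(hex(addr), "-> chiplet", owning_chiplet(addr))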

Regarding checkerboard pixel processing, it is very similar to how current monolithic GPUs contain multiple rasterizers that each process a subset of the screen space. Large triangles go to multiple rasterizers; smaller ones might fit in a single one (with proper alignment).

Regarding synchronization, the GPU pipeline is already mostly unsynchronized, with a lot of parallel work happening on the fly. The majority of synchronization happens per-pixel, where the ordering of pixel blending is important, but with checkerboard partitioning that synchronization will happen within each chiplet (actually, within each of the multiple rasterizers in each chiplet). When cross-pixel synchronization needs to happen (e.g. render-to-texture), some GPU commands might indeed stall, but asynchronous compute tasks might fill in.

I wonder how per-vertex (or per-primitive, or per-patch) processing will be distributed, though. Each triangle must (unless out-of-order rasterization is enabled) be rasterized in order, so that blending happens in order at the end of the pipeline. Maybe binning (where triangles are distributed among buckets) will make a comeback? Or will there be a different process to sort primitives from multiple chiplets before rasterization? They already have to deal with this issue on multi-SA monolithic GPUs; the shared L3 should help...
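
To make the checkerboard idea concrete, here is a minimal sketch of how screen-space tiles could be mapped to chiplets (the tile size and the two-chiplet split are assumptions for illustration, not known RDNA3 details):

Code:
# Toy checkerboard assignment of screen-space tiles to two GPU chiplets.
# Tile size and chiplet count are illustrative assumptions.
TILE_PX = 64
NUM_CHIPLETS = 2

def chiplet_for_pixel(x: int, y: int) -> int:
    """Alternate tiles between chiplets in a checkerboard pattern."""
    tile_x, tile_y = x // TILE_PX, y // TILE_PX
    return (tile_x + tile_y) % NUM_CHIPLETS

# A large triangle covering many tiles gets split across both chiplets;
# a small triangle that fits inside one tile stays on a single chiplet.
print(chiplet_for_pixel(10, 10))   # tile (0, 0) -> chiplet 0
print(chiplet_for_pixel(70, 10))   # tile (1, 0) -> chiplet 1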
 

eek2121

Diamond Member
Aug 2, 2005
3,053
4,281
136
AMD needs gigabytes of cache. If they could manage to put 1+ GB of stacked cache on a GPU with 2TB/s of bandwidth, the bus width and type of memory won’t matter as much.

Also I hope they do 40CU chiplets rather than 80, because that is the key to driving down prices across all of the segments.
 

soresu

Platinum Member
Dec 19, 2014
2,972
2,201
136
Also I hope they do 40CU chiplets rather than 80, because that is the key to driving down prices across all of the segments.
More than that, it's cheaper to design the chips themselves.

Larger chips are more expensive to design as well as to fab - though modern ML-assisted, time-saving design tools may be changing that equation, or at least pushing the cost of design back by a node or two.

It also leaves fewer options open for SKU segmentation.
 
Reactions: Joe NYC

soresu

Platinum Member
Dec 19, 2014
2,972
2,201
136
AMD needs gigabytes of cache. If they could manage to put 1+ GB of stacked cache on a GPU with 2TB/s of bandwidth, the bus width and type of memory won’t matter as much.
I could see them tripling cache size with V-Cache as with Ryzen, but 1 GB+ is a bit beyond that, unless you mean 1GB+ on package, which definitely sounds possible for 160 CUs.

I wonder if the V-Cache might account for at least part of the extra 50% on top of the doubled CUs in the rumored 2.5x performance of RDNA3.

Or is that 25% x2 with the doubled CUs?

Certainly an extra 25% does sound quite achievable, and 50% seems possible too with this V-Cache.
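
Just to spell out the arithmetic behind that reading of the rumour (assuming the 2.5x really is doubled CUs times a per-CU uplift):

Code:
# If the rumored 2.5x uplift is doubled CUs times a per-CU improvement:
cu_scaling = 2.0                      # 80 CUs -> 160 CUs
per_cu_uplift = 2.5 / cu_scaling      # = 1.25, i.e. +25% per CU
print(f"per-CU uplift needed: {per_cu_uplift:.2f}x")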

If they put an NVMe/M.2 connector directly on the big card, they could reduce the latency of accessing an M.2 drive for DirectStorage too. In fact, with the whole emphasis on IO with the new consoles, I would be willing to bet on this being an obvious direction for max game performance and efficiency.
 

Kepler_L2

Senior member
Sep 6, 2020
474
1,927
106
Gigabyte range is too much for now. They won't use 5nm for cache dies (and it's not worth it anyway, since SRAM scaling on newer nodes is terrible), so we can roughly calculate how much die space it would take given what we know about Zen V-Cache.


36mm² for 64MB
72mm² for 128MB
144mm² for 256MB
288mm² for 512MB
576mm² for 1GB

I think two 250mm²-ish GPU chiplets plus a 144 to 288mm² die for Infinity Cache is doable for flagship products.
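
A quick sketch of that back-of-the-envelope math, assuming the same ~36mm² per 64MB density as Zen V-Cache and ignoring any TSV/periphery overhead:

Code:
# Rough SRAM die-area estimate at Zen V-Cache density (~36 mm^2 per 64 MB on a
# 7nm-class node); real dies would add TSV and periphery overhead.
MM2_PER_64MB = 36

def cache_die_area_mm2(cache_mb: int) -> float:
    return cache_mb / 64 * MM2_PER_64MB

for mb in (64, 128, 256, 512, 1024):
    print(f"{mb:>4} MB -> {cache_die_area_mm2(mb):.0f} mm^2")
# 256 MB comes out to ~144 mm^2 and 512 MB to ~288 mm^2, hence the range above.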
 

Gideon

Golden Member
Nov 27, 2007
1,714
3,938
136

36mm² for 64MB
72mm² for 128MB
144mm² for 256MB
288mm² for 512MB
576mm² for 1GB

I think two 250mm²-ish GPU chiplets plus a 144 to 288mm² die for Infinity Cache is doable for flagship products.

This seems reasonable. As 7nm-on-5nm doesn't seem to be a thing for TSMC SoIC (right?), it's either 5nm-on-5nm at the very end of 2022 or (hopefully) 6nm-on-6nm a couple of quarters sooner.
 

soresu

Platinum Member
Dec 19, 2014
2,972
2,201
136
What are the chances of seeing RDNA 3 in Phoenix APUs (7000 series)?
Comparatively, the bring-up for RDNA2 in APUs has been slow, at least for Rembrandt anyway, which probably won't see the light of day until CES 2022 next January.

So it wouldn't be a huge surprise to see RDNA3 in Phoenix, especially if it slips a bit from the current yearly release cadence of APUs.

Even then, by the time CES 2023 rolls around, N5 will be very mature and N5P somewhat mature too, with Apple more than likely moving on to N3 - so a 5nm monolithic mainstream AMD APU in early 2023 is definitely possible.
 
Reactions: Tlh97 and Saylick

soresu

Platinum Member
Dec 19, 2014
2,972
2,201
136
probably would be more expensive.

I have a feeling Global Foundries is going to be fabbing these SRAM caches.
More expensive, yes, but not as expensive as HBM2 due to a narrower 512-bit bus per stack, if I have read the rumours correctly.

This may allow for simpler organic interposers rather than silicon ones, which could potentially be cheaper, as well as smaller memory controllers on the GPUs themselves.
 
Mar 11, 2004
23,182
5,646
146
Remind me again why HBM 3 isn't the way to go?

Glad I'm not the only one. Not only that, but I think HBM3 is 7nm; meaning, wouldn't it be possible to integrate it into, say, a 7nm I/O die, removing the added expense and complexity of the interposer? Also, why couldn't they do with HBM what they're doing with this cache?
 

soresu

Platinum Member
Dec 19, 2014
2,972
2,201
136
Not only that, but I think HBM3 is 7nm
The problem with that is that DRAM below 10nm is a sizable roadblock, due to limitations of the current capacitor-laden design and its scalability at smaller process geometries.

At least one capacitor-less DRAM device design has been put forward, but whether it will be ready in time for HBM3?

I have my doubts.

Happy to be proven wrong though.
 

Krteq

Senior member
May 22, 2015
993
672
136
Some Aldebaran bits from recent Linux Kernel commit => 2 dies + 128GB HBM2(e) confirmed

Code:
On newer heterogeneous systems from AMD with GPU nodes connected via
xGMI links to the CPUs, the GPU dies are interfaced with HBM2 memory.

This patchset applies on top of the following series by Yazen Ghannam
AMD MCA Address Translation Updates
[https://patchwork.kernel.org/project/linux-edac/list/?series=505989]

This patchset does the following
1. Add support for northbridges on Aldebaran
   * x86/amd_nb: Add Aldebaran device to PCI IDs
   * x86/amd_nb: Add support for northbridges on Aldebaran
2. Add HBM memory type in EDAC
   * EDAC/mc: Add new HBM2 memory type
3. Modifies the amd64_edac module to
   a. Handle the UMCs on the noncpu nodes,
   * EDAC/mce_amd: extract node id from InstanceHi in IPID
   b. Enumerate HBM memory and add address translation
   * EDAC/amd64: Enumerate memory on noncpu nodes
   c. Address translation on Data Fabric version 3.5.
   * EDAC/amd64: Add address translation support for DF3.5
   * EDAC/amd64: Add fixed UMC to CS mapping


Aldebaran has 2 Dies (enumerated as a MCx, x= 8 ~ 15)
  Each Die has 4 UMCs (enumerated as csrowx, x=0~3)
  Each die has 2 root ports, with 4 misc port for each root.
  Each UMC manages 8 UMC channels each connected to 2GB of HBM memory.

Muralidhara M K (3):
  x86/amd_nb: Add Aldebaran device to PCI IDs
  x86/amd_nb: Add support for northbridges on Aldebaran
  EDAC/amd64: Add address translation support for DF3.5

Naveen Krishna Chatradhi (3):
  EDAC/mc: Add new HBM2 memory type
  EDAC/mce_amd: extract node id from InstanceHi in IPID
  EDAC/amd64: Enumerate memory on noncpu nodes

Yazen Ghannam (1):
  EDAC/amd64: Add fixed UMC to CS mapping
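
The 128GB figure falls straight out of that topology description (a quick sanity check using only the numbers in the commit text):

Code:
# 2 dies x 4 UMCs per die x 8 channels per UMC x 2 GB per channel
dies, umcs_per_die, channels_per_umc, gb_per_channel = 2, 4, 8, 2
print(dies * umcs_per_die * channels_per_umc * gb_per_channel, "GB")  # 128 GB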

 

Shmee

Memory & Storage, Graphics Cards Mod Elite Member
Super Moderator
Sep 13, 2008
7,548
2,547
146
Some Aldebaran bits from recent Linux Kernel commit => 2 dies + 128GB HBM2(e) confirmed

Haha, I read that as Alderaan bits. RIP Alderaan. I bet someone will pair this with a Hitachi Deathstar.
 