I mentioned this in the CDNA3 thread: from the latest official disclosures, MI300 carries Zen 4 features, Unified Memory and CXL 2.0 (IF 4.0 features).

It isn't uncommon for HPC machines to have 512 GB to 1 TB of system memory per GPU. The latency of CXL is possibly not an issue with all of the SRAM and HBM in the system, but I would expect it to have memory expansion somewhere / somehow.
Indeed, MI300 should not be using IFOP at all.

@jamescox
Aside from your MI300 layout suggestion, for which I already expressed a different opinion in this thread as well as the MI300 thread, this is exactly my line of thinking as well.
Why not use Zen 4c in MI300 for the reasons you explained? Why not get rid of IFoP in the process? Why not switch interconnects for Bergamo already and re-use (variations of) that CCD as the small cores alongside Zen 5?
Maybe I am terribly wrong - but one can at least dream 😉
I think there was some discussion of deliberately adding extra TSVs for greater thermal conductivity, since copper has higher thermal conductivity than silicon. This may make thermal expansion issues worse, though.

Wouldn't the TSVs between the dies keep the temperature differential relatively small? It might be worth adding even dummy TSVs just to keep the temperatures in sync.
No, it doesn't; that would kill latency. And the excellent low latency of the L3$ despite its size is one of Zen's major advantages.
it doesn't sound reasonable at all

I saw some Geekbench scores for M2 and some for a possible Genoa. Single-core still seems to be much higher for Apple's M2.
This is the M2 cache structure (according to Wikipedia):

- L1 cache: performance cores 192+128 KB per core; efficiency cores 128+64 KB per core
- L2 cache: performance cores 16 MB (M2) or 32 MB (M2 Pro and M2 Max); efficiency cores 4 MB
- Last-level cache: 8 MB (M2), 24 MB (M2 Pro), 48 MB (M2 Max)
Not sure if the 16 MB L2 is unified, though? If Zen 5 implements similarly massive L2 caches, then pushing the L3 farther out, to the IO die or memory side, seems reasonable.
it doesn't sound reasonable at all
M2 clocks way lower and is a much wider core in all regards.
Its L1D is massive to reduce pressure on the L2, and that has its own trade-offs that aren't very good for many server workloads / high-IOPS small reads and writes.
Next you need to explain how you are going to handle cache coherency in this model out to 8+ CCXs, with CCXs attached to different nodes on the IOD. How is your LLC going to attach, and how will it actually work?
How many loads/stores per cycle do you expect from this L2? How many writes per cycle out of the L2 to whatever sits above it in the hierarchy?
Having only 16.25 MB of cache across 8 cores vs 40.25 MB means far more read and write pressure off the CCD.
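The 16.25 MB vs 40.25 MB figures follow from simple addition over the per-core caches (a sketch assuming 32 KB of L1D per core, as on Zen 4):

```python
def ccd_cache_mb(cores, l1d_kb, l2_mb, l3_mb):
    # Total L1D + L2 + L3 capacity on a CCD, in MB
    return cores * l1d_kb / 1024 + l2_mb + l3_mb

# Zen 4 CCD today: 8 cores x 32 KB L1D, 8 x 1 MB private L2, 32 MB shared L3
print(ccd_cache_mb(8, 32, 8 * 1, 32))  # 40.25
# Rumoured layout: one 16 MB shared L2, no on-die L3
print(ccd_cache_mb(8, 32, 16, 0))      # 16.25
```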
And just remember that Zen 4 has the same amount of decode/execute and load/store resources as Skylake. These cores are not in the same class.
Correlation is not causation. A big, fast, shared L2 sounds amazing for a single-threaded app, but sustained high performance across all cores means very high read/write requirements, and it needs to run at a high clock rate.
You can also go the other way, like IBM's z15: really big (32 MB per L2), slow private L2s that together act as a virtual L3 for each other, which reduces pressure on the L3 because of the high L2 hit rate.
The problem is page sizes: 4 KB page, 32 KB low-latency L1D; 16 KB page size, 128 KB low-latency L1D.

IOPS in regard to L2/L3 caches?
You never actually addressed anything and missed the major point, which was scaling performance and sustaining high-throughput, low-latency access across your memory subsystem. Apple has a very large L2 but only has 4 cores connected to it (3 MB per core on M1); 8 cores would be a big step back at 12 MB of shared L2, and if you make the L2 larger you pay with more latency.

There have been a lot of rumors saying that Zen 5 will have a large, shared L2, but they could always be completely wrong. We have had large-L2 chips in the past, and many applications do very well with a large L2.
The old Core 2 Duo processors with 6 MB L2 still performed very well against later processors with a tiny L2 and larger L3. Many server applications are not very cacheable anyway.
Zen 5 is what, 3 or 4 nm process tech? I haven't even paid attention. Some things may become possible that were not before, just due to the density. Interconnect paths will be reduced significantly. Making the L2 large and fast (enough) is a possibility. Also, there is no guarantee that Zen 5 will be like Zen 4. We do not know; it is a new design. It may be more like M2. M2 isn't really that low-clocked compared to past cores. Increasing IPC significantly while keeping the core the same width as Zen 4 / Skylake is likely not possible, so I am not sure we are going to get massive clock speed increases. It may just make more sense to throw more hardware at the problem at a lower clock.
Also, a large, shared L2 does not rule out on-package L3. They could still have a smaller L3 on the CPU die, but given the SRAM scaling, it may make more sense to move it off. That would be an argument for stacking it over or under, or moving it to an adjacent die. Even if going off-die increases latency, that may not really be a problem if it is stacked or uses some RDL tech like the connection between RDNA3 and the MCDs. There is very little latency penalty for such connections (near zero for SoIC), and the bandwidth is an order of magnitude higher than SerDes-based GMI (multiple orders of magnitude for SoIC).

For Epyc, why would cache coherency with part of the hierarchy moved to the IO-die side be any different? There is already presumably a lot of traffic to the IO die to maintain cache coherency. The labeled IO-die diagrams (if correct) show a rather large area dedicated to tag directories, so it may actually be easier to maintain coherency. But it has been a long time since my computer architecture classes, so I don't quite remember exactly how directory-based coherency works; I will have to do some reading. The Epyc IO die can be split into separate NUMA nodes, but it can also be configured as a single NUMA node (in fact, I believe that is the default), and memory addresses will be interleaved across all 8 (or 12) memory channels in NPS1 mode.
Very good point. That can be expanded to the hardware manufacturing ecosystem as well. It seems to me that AMD already proceeds with flexibility and different alternate plans in mind. So e.g. Zen 3 was planned with X3D cache from the start, but mass-production feasibility only aligned significantly later.

The odds of the established software ecosystem not aligning with radical architectural changes and instead working against AMD are quite high. For an architecture which is meant to be their bread and butter, you can bet they would be very careful.
I still don't know what you are trying to get at here. The virtual memory page size is often 4 KB, but can be, and probably should be, larger. I have noticed that most applications perform better with 2 MB pages, probably due to less stress on the TLB; 4 KB is very small when you have 1 TB of memory. The Linux kernel doesn't seem to support 2 MB pages that well yet, though. AFAIK, the cache line size for these processors is 64 bytes, which is quite small even compared to a 4096-byte page. I believe a single Zen 3 core cannot actually get low-latency access to the whole 32 MB L3, because the TLB can't cache sufficient page mappings.

The problem is page sizes: 4 KB page, 32 KB low-latency L1D; 16 KB page size, 128 KB low-latency L1D.
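The TLB-coverage point is easy to sanity-check with arithmetic (a sketch assuming a 2048-entry L2 DTLB, roughly Zen 3's size; the function is mine):

```python
def tlb_coverage_mb(entries, page_kb):
    # Memory reachable without a page-table walk, in MB
    return entries * page_kb / 1024

# With 4 KB pages, 2048 entries cover only 8 MB: far less than a 32 MB L3.
print(tlb_coverage_mb(2048, 4))     # 8.0
# With 2 MB pages, the same TLB covers 4 GB.
print(tlb_coverage_mb(2048, 2048))  # 4096.0
```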
You can (and AMD does) merge multiple 4 KB pages into a single larger page, but it's not always that simple.
Intel went to a 48 KB L1D at a 4 KB page size and paid the price in extra latency.
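The coupling of page size and L1D size mentioned above comes from virtually-indexed, physically-tagged (VIPT) caches: to stay alias-free, the cache can hold at most page size times associativity. A quick check (the function name is mine; the associativities are the commonly reported ones):

```python
def max_vipt_l1_kb(page_kb, ways):
    # Largest alias-free VIPT L1 for a given page size and associativity
    return page_kb * ways

assert max_vipt_l1_kb(4, 8) == 32     # Zen: 4 KB pages, 8-way  -> 32 KB
assert max_vipt_l1_kb(16, 8) == 128   # Apple: 16 KB pages, 8-way -> 128 KB
assert max_vipt_l1_kb(4, 12) == 48    # Sunny Cove: going 12-way buys 48 KB
```

Going 12-way instead of growing the page size is how Intel got to 48 KB, and the higher associativity is part of the latency cost.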
It also doesn't matter how fancy the packaging is: if you are, on average, going further to retrieve data (which is what would happen with only 16 MB of cache on a CCD), you are going to lose big time. The thing to remember is that this isn't the year 2000 anymore; as these nodes shrink, wire resistance goes through the roof, and you want short, fat wires.
It's interesting to look at the 7800X3D: it has a TDP of 120 W vs 105 W for the 7700X, at a 300 MHz lower base clock. That's roughly a 22% increase in power per clock. So it's probably actually going to be less efficient than the 7700X, assuming a 15-20% IPC jump.
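The ~22% figure can be reproduced from the spec-sheet numbers (a rough sketch: it assumes the announced base clocks of 4.5 GHz for the 7700X and 4.2 GHz for the 7800X3D, and treats TDP as a proxy for actual power draw):

```python
def power_per_ghz(tdp_w, base_ghz):
    # Watts of TDP per GHz of base clock
    return tdp_w / base_ghz

x7700 = power_per_ghz(105, 4.5)  # 7700X: 105 W at 4.5 GHz base
x3d = power_per_ghz(120, 4.2)    # 7800X3D: 120 W at 4.2 GHz base
increase = x3d / x7700 - 1
print(f"power per clock up ~{increase:.0%}")  # ~22%
```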
But on the last bit: the tags are in the L3s currently. You can still have a directory above that, but the directory only needs to track down to the CCX. So now you are either wasting your L2 on tags or you need way more space on the IOD for them, and updating the state of a cache line will cost a whole lot more power. Also, if you have an L3 on the IOD, it is going to be dog slow, because the IOD is clocked low and there are multiple physically large hops. You could implement memory-controller-side caches, but then you have 4 separate blocks of memory controllers. Throughput per clock across the IOD would need to increase significantly, which would be a significant power cost.
If there is a large shared L2, I am betting it will be more like IBM's virtual cache and less like Apple's L2, and still around the same size as the current L2 + L3. Everyone, from phones to desktop to server to big iron, is adding more cache per core, and more cache closer to the core, not less cache further away.
The problem is that lots of workloads, especially server workloads, work on small, non-contiguous bits of data, like DB/disk reads and writes. So if your smallest page is 16 KB, that is a whole lot more wasted cache for many critical workloads today. It's that backward-compatibility trade-off that can bite you; Apple can just force it in its closed ecosystem, and it's happy days.
So now you're back to the whole cache-coherency problem: how are you going to scale those NUMA states over all these small-L2 CCXs? But also, that's just crap for server/VM workloads in general, which will use thread counts > 8. And what's the latency of that 16 MB shared L2? If your L1D is staying at 32 KB, that's a lot of L1 misses that are now a lot slower than today.

As I said someplace previously, I was thinking more along the lines of 2 or 4 cores sharing a large L2. Rumors have said all 8, though, and I only discussed the 16 MB number because that is what Apple is using; I could believe 16 MB per 4 cores or something like that as well. You also still seem to be assuming that everything else will be held constant. I doubt that is the case; I expect Zen 5 may be massively different from Zen 4. Probably not quite the same as the jump from Excavator to Zen, but a lot bigger than the Zen 3 to Zen 4 changes. It is kind of pointless to discuss, I guess. AMD will have register-transfer-level simulators to explore all of these options, so they will know what is best, but there are always trade-offs. Given how much money Apple has to spend and how well their cores seem to do, I expect that they went in the right direction.
No, it isn't. They have been using the same packaging technology they have been using since like 2010 (Magny-Cours). AMD has yet to really stake their primary revenue streams on the modern wave of packaging tech.

"Fancy" packaging is responsible for getting AMD to where they are today.
You're now saying things I didn't say. What I said was that it doesn't matter what technology you use: if you are, on average, going physically further away from the core for your data, you are going to be burning more power and it will be slower. We can see this right now, today: AMD pays a main-memory latency cost on desktop compared to Intel. They get away with it more in server because Intel's fabrics tend to be clocked quite low.

If they were trying to compete using a monolithic die, Intel would probably still be a lot more dominant, even with all of the mistakes they have made. If you do not recognize that SoIC and other stacking technologies represent a disruptive change, then I don't know what to say. They can easily have full-speed caches off-die; that seems to be the case with MI300. Given the thermal constraints, it may be better to have cache under areas that are not the compute portion, so they may have cache under the on-die cache rather than under the logic areas.
I'm not really arguing with any of this except for two things.

This is from TechPowerUp, showing the supposed die area dedicated to directory-based cache coherency. They have it labeled "Tag Directories", but these are not L3 tags. If you put cache on the "CCD IFOP" side, then it looks the same from a cache-coherency perspective, just a lot faster (for cache coherency) since it is on-chip. An RDNA3/MCD-style fan-out link is close to 900 GB/s (vs 63 GB/s for PCIe 5 x16 speeds) with a very minimal PHY, so very little latency. If it is stacked with some other tech, then the latency would be even lower and the bandwidth higher. SoIC would have pretty much no latency penalty compared to on-die. Even HBM-style links are very low latency; the latency of HBM comes from the DRAM array itself, not the interface.
View attachment 76712
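The 63 GB/s figure is just the PCIe 5.0 x16 arithmetic: 32 GT/s per lane across 16 lanes with 128b/130b line encoding (a back-of-the-envelope sketch; the function name is mine):

```python
def pcie_gb_per_s(gt_per_s, lanes, enc_num=128, enc_den=130):
    # Usable one-direction bandwidth in GB/s after line-encoding overhead
    return gt_per_s * lanes * enc_num / enc_den / 8

print(round(pcie_gb_per_s(32, 16), 1))  # PCIe 5.0 x16 -> 63.0 GB/s
```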
Indeed, the L3 actually is the CCM (Cache-Coherent Master) and is responsible for snooping and maintaining coherency across all CCMs.

One: those tag directories are like the HT Assist directories of K10 / Bulldozer. This was 100% the case in Zen 1; an AMD engineer told me so. I was asking him about the additional states they now have (7); without going into details, it was fundamentally the link below with a few additional edge cases.
https://www.cs.columbia.edu/~junfeng/11sp-w4118/lectures/amd.pdf

The L3 in the CCX is the thing tracking the exact state of a cache line down to its cores, and it answers probes coming from the rest of the fabric.
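A toy sketch of that two-level split (my own illustration, not AMD's actual protocol): the fabric-side directory only records which CCXs might hold a line, so probes go only to those CCXs, while exact per-core state stays inside each CCX's L3.

```python
class FabricDirectory:
    # Tracks, per cache-line address, the set of CCX ids that may hold the line.
    def __init__(self):
        self.sharers = {}

    def read(self, addr, ccx):
        # A reader only probes CCXs the directory already lists for this line.
        probes = self.sharers.get(addr, set()) - {ccx}
        self.sharers.setdefault(addr, set()).add(ccx)
        return probes

    def write(self, addr, ccx):
        # A writer must invalidate every other CCX listed for this line.
        victims = self.sharers.get(addr, set()) - {ccx}
        self.sharers[addr] = {ccx}
        return victims

d = FabricDirectory()
d.read(0x1000, 0)         # CCX 0 fills the line; nothing to probe
print(d.read(0x1000, 1))  # CCX 1 reads; must probe CCX 0 -> {0}
print(d.write(0x1000, 0)) # CCX 0 writes; CCX 1 gets invalidated -> {1}
```

The point of CCX-granular tracking is that the directory stays small: one sharer set per line instead of per-core state, at the cost of an extra probe into the target CCX's L3 to resolve the exact owner.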
That's the optimization point we've chosen, and as we continue to go forward, getting more cores, and getting more cores in a shared-L3 environment, we'll still try to manage that latency so that when there are lower thread counts in the system, you're still getting good latency out of that L3. Then the L2: if your L2 is bigger, then you can cut back some on your L3 as well.
We do see core counts growing, and we will continue to increase the number of cores in our core complex that are shared under an L3. As you point out, communicating through that has both latency problems and coherency problems, but that's what architecture is, and that's what we signed up for. It's what we live for: solving those problems. So I'll just say that the team is already looking at what it takes to grow to a complex far beyond where we are today, and how to deliver that in the future.
Okay, so now the layout of MI300 gets clearer:
That's where I thought the CPUs were, but it still doesn't look like 24 cores to me, unless they are shipping with more and disabling some on every die for yield.
Can you not place 16 MB simple images? Not all of us have fast net access.
I am going out on a limb: back at CES, Lisa was talking about 9 5nm dies. So I got it all right already on Jan 5th, before later chiming in on the theory of @Vattila 😂

Just the picture - annotations by AMD.
No, it looks like a 16c CCX and two 4c CCX. The big questions:
View attachment 76892
- Are these really just 3 CCX on a custom die or 2 different types of CCD connected?
- Is the 16c CCD essentially Bergamo?
- Will Bergamo get rid of IFoP just as MI300 does?
- Will this be the little to Zen5's BIG?
You have a point. Could you please post the overlay with some approximate numbers? What exactly did you measure? Only the 2x2 block, or the surroundings as well? What about the size of the 4x4 structure?

You can clearly see eight core-like structures in the "top and bottom" CPU dies/sections.
The die size of those top and bottom sections also looks way too large to only be 4c. By overlaying the rendered version over an actual package shot, I get ~97mm2 using the reported interposer size of 2750mm2. Relative to an HBM die (I can't find the size of an HBM3 die, so I used the 86.25mm2 of HBM2E as an approximation), I get ~85mm2.
Obviously there are error bars here, and also scale issues with the rendered version (as with most/all AMD renderings), so I wouldn't be surprised if that number moved closer to the 70mm2 of a normal Zen 4 CCD. But it's not going to move enough that it starts looking like a 4c CCD.
@Hitman928
Do you know the annotated picture by Locuza?
If you take the width of one base die, 12.75mm, as a reference, and keep in mind that even the 4x4 structure does not go right to the edges, that would give you around 10 x 10mm for the 16 cores. So ~100mm2 for the cores; and as this is a stylized render, that might already include the L3. Does not sound too bad.
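The overlay method used in this exchange boils down to a pixel-area ratio against a feature of known physical size. A sketch with invented pixel counts (only the 2750mm2 interposer figure comes from the discussion above):

```python
def scale_area_mm2(region_px, reference_px, reference_mm2):
    # Scale a region's measured pixel area by a reference of known size
    return region_px * reference_mm2 / reference_px

# e.g. a CPU section covering 3.5% of the interposer's pixel area
print(scale_area_mm2(35_000, 1_000_000, 2750))  # 96.25
```

The error bars come from how accurately the reference feature's edges can be measured, and from any scale distortion in the render itself.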