I mentioned this in the CDNA3 thread: from the latest official disclosures, MI300 carries Zen 4 features, Unified Memory and CXL 2.0 (IF 4.0 features).

It isn't uncommon for HPC machines to have 512 GB to 1 TB of system memory per GPU. The latency of CXL is possibly not an issue with all of the SRAM and HBM in the system, but I would expect it to have memory expansion somewhere / somehow.
Indeed, MI300 should not be using IFOP at all.

@jamescox
Aside from your MI300 layout suggestion, for which I already expressed a different opinion in this thread as well as the MI300 thread, this is exactly my line of thinking as well.
Why not use Zen 4c in MI300 for the reasons you explained? Why not get rid of IFoP in the process? Why not switch interconnects for Bergamo already and re-use (variations of) that CCD as the small cores alongside Zen 5?
Maybe I am terribly wrong - but one can at least dream 😉
I think there was some discussion of deliberately adding extra TSVs for greater thermal conductivity, since copper has higher thermal conductivity than silicon. This may make thermal expansion issues worse, though.

Wouldn't the TSVs between the dies keep the temperature differential relatively small? It might be worth adding even dummy TSVs just to keep the temperatures in sync.
No, it doesn't; that would kill latency. And the excellent low latency of the L3$ despite its size is one of Zen's major advantages.
it doesn't sound reasonable at all

I saw some Geekbench scores for M2 and some for a possible Genoa. Single-core still seems to be much higher for Apple's M2.
This is the M2 cache structure (according to Wikipedia):

- L1 cache: performance cores 192+128 KB per core; efficiency cores 128+64 KB per core
- L2 cache: performance cores 16 MB (M2) or 32 MB (M2 Pro and M2 Max); efficiency cores 4 MB
- Last-level cache: 8 MB (M2), 24 MB (M2 Pro), 48 MB (M2 Max)
Not sure if the 16 MB L2 is unified, though? If Zen 5 implements similarly massive L2 caches, then pushing the L3 farther out, to the IO die or memory side, seems reasonable.
it doesn't sound reasonable at all
M2 clocks way lower and is a much wider core in all regards.
Its L1D is massive to reduce pressure on the L2, and that has its own trade-offs that aren't very good for many server workloads / high-IOPS small reads and writes.
Next you need to explain how you are going to handle cache coherency in this model out to 8+ CCXs, with CCXs attached to different nodes on the IOD. How is your LLC going to attach, and how will it actually work?
How many loads/stores per cycle do you expect from this L2? How many writes per cycle out of the L2 to whatever sits above it in the hierarchy?
Having only 16.25 MB of cache across 8 cores vs 40.25 MB means far more read and write pressure off the CCD.
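The 16.25 MB vs 40.25 MB figures follow from simple addition over the per-core caches (a sketch assuming 32 KB of L1D per core, as on Zen 4):

```python
def ccd_cache_mb(cores, l1d_kb, l2_mb, l3_mb):
    # Total L1D + L2 + L3 capacity on a CCD, in MB
    return cores * l1d_kb / 1024 + l2_mb + l3_mb

# Zen 4 CCD today: 8 cores x 32 KB L1D, 8 x 1 MB private L2, 32 MB shared L3
print(ccd_cache_mb(8, 32, 8 * 1, 32))  # 40.25
# Rumoured layout: one 16 MB shared L2, no on-die L3
print(ccd_cache_mb(8, 32, 16, 0))      # 16.25
```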
And just remember that Zen 4 has the same amount of decode/execute and load/store resources as Skylake. These cores are not in the same class.
Correlation is not causation. A big, fast, shared L2 sounds amazing for a single-threaded app, but sustained high performance across all cores means very high read/write requirements, and it needs to run at a high clock rate.
You can also go the other way, like IBM's z15: really big (32 MB per L2), slow private L2s that together act as a virtual L3 for each other, which reduces pressure on the L3 because of the high L2 hit rate.
The problem is page sizes: 4 KB page, 32 KB low-latency L1D; 16 KB page size, 128 KB low-latency L1D.

IOPS in regard to L2/L3 caches?
You never actually addressed anything and missed the major point, which was scaling performance and sustaining high-throughput, low-latency access across your memory subsystem. Apple has a very large L2 but only has 4 cores connected to it (3 MB per core on M1); 8 cores would be a big step back at 12 MB of shared L2, and if you make the L2 larger you pay with more latency.

There have been a lot of rumors saying that Zen 5 will have a large, shared L2, but they could always be completely wrong. We have had large-L2 chips in the past, and many applications do very well with a large L2.
The old Core 2 Duo processors with 6 MB L2 still performed very well against later processors with a tiny L2 and larger L3. Many server applications are not very cacheable anyway.
Zen 5 is what, 3 or 4 nm process tech? I haven't even paid attention. Some things may become possible that were not before, just due to the density. Interconnect paths will be reduced significantly. Making the L2 large and fast (enough) is a possibility. Also, there is no guarantee that Zen 5 will be like Zen 4. We do not know; it is a new design. It may be more like M2. M2 isn't really that low-clocked compared to past cores. Increasing IPC significantly while keeping the core the same width as Zen 4 / Skylake is likely not possible, so I am not sure we are going to get massive clock speed increases. It may just make more sense to throw more hardware at the problem at a lower clock.
Also, a large, shared L2 does not rule out on-package L3. They could still have a smaller L3 on the CPU die, but given the SRAM scaling, it may make more sense to move it off. That would be an argument for stacking it over or under, or moving it to an adjacent die. Even if going off-die increases latency, that may not really be a problem if it is stacked or uses some RDL tech like the connection between RDNA3 and the MCDs. There is very little latency penalty for such connections (near zero for SoIC), and the bandwidth is an order of magnitude higher than SerDes-based GMI (multiple orders of magnitude for SoIC).

For Epyc, why would cache coherency with part of the hierarchy moved to the IO-die side be any different? There is already presumably a lot of traffic to the IO die to maintain cache coherency. The labeled IO-die diagrams (if correct) show a rather large area dedicated to tag directories, so it may actually be easier to maintain coherency. But it has been a long time since my computer architecture classes, so I don't quite remember exactly how directory-based coherency works; I will have to do some reading. The Epyc IO die can be split into separate NUMA nodes, but it can also be configured as a single NUMA node (in fact, I believe that is the default), and memory addresses will be interleaved across all 8 (or 12) memory channels in NPS1 mode.
Very good point. That can be expanded to the hardware manufacturing ecosystem as well. It seems to me that AMD already proceeds with flexibility and different alternate plans in mind. So e.g. Zen 3 was planned with X3D cache from the start, but mass-production feasibility only aligned significantly later.

The odds of the established software ecosystem not aligning with radical architectural changes and instead working against AMD are quite high. For an architecture which is meant to be their bread and butter, you can bet they would be very careful.
I still don't know what you are trying to get at here. The virtual memory page size is often 4 KB, but can be, and probably should be, larger. I have noticed that most applications perform better with 2 MB pages, probably due to less stress on the TLB; 4 KB is very small when you have 1 TB of memory. The Linux kernel doesn't seem to support 2 MB pages that well yet, though. AFAIK, the cache line size for these processors is 64 bytes, which is quite small even compared to a 4096-byte page. I believe a single Zen 3 core cannot actually get low-latency access to the whole 32 MB L3, because the TLB can't cache sufficient page mappings.

The problem is page sizes: 4 KB page, 32 KB low-latency L1D; 16 KB page size, 128 KB low-latency L1D.
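The TLB-coverage point is easy to sanity-check with arithmetic (a sketch assuming a 2048-entry L2 DTLB, roughly Zen 3's size; the function is mine):

```python
def tlb_coverage_mb(entries, page_kb):
    # Memory reachable without a page-table walk, in MB
    return entries * page_kb / 1024

# With 4 KB pages, 2048 entries cover only 8 MB: far less than a 32 MB L3.
print(tlb_coverage_mb(2048, 4))     # 8.0
# With 2 MB pages, the same TLB covers 4 GB.
print(tlb_coverage_mb(2048, 2048))  # 4096.0
```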
You can (and AMD does) merge multiple 4 KB pages into a single larger page, but it's not always that simple.
Intel went to a 48 KB L1D at a 4 KB page size and paid the price in extra latency.
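The coupling of page size and L1D size mentioned above comes from virtually-indexed, physically-tagged (VIPT) caches: to stay alias-free, the cache can hold at most page size times associativity. A quick check (the function name is mine; the associativities are the commonly reported ones):

```python
def max_vipt_l1_kb(page_kb, ways):
    # Largest alias-free VIPT L1 for a given page size and associativity
    return page_kb * ways

assert max_vipt_l1_kb(4, 8) == 32     # Zen: 4 KB pages, 8-way  -> 32 KB
assert max_vipt_l1_kb(16, 8) == 128   # Apple: 16 KB pages, 8-way -> 128 KB
assert max_vipt_l1_kb(4, 12) == 48    # Sunny Cove: going 12-way buys 48 KB
```

Going 12-way instead of growing the page size is how Intel got to 48 KB, and the higher associativity is part of the latency cost.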
It also doesn't matter how fancy the packaging is: if you are, on average, going further to retrieve data (which is what would happen with only 16 MB of cache on a CCD), you are going to lose big time. The thing to remember is that this isn't the year 2000 anymore; as these nodes shrink, wire resistance goes through the roof, and you want short, fat wires.
It's interesting to look at the 7800X3D: it has a TDP of 120 W vs 105 W for the 7700X, at a 300 MHz lower base clock. That's roughly a 22% increase in power per clock. So it's probably actually going to be less efficient than the 7700X, assuming a 15-20% IPC jump.
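The ~22% figure can be reproduced from the spec-sheet numbers (a rough sketch: it assumes the announced base clocks of 4.5 GHz for the 7700X and 4.2 GHz for the 7800X3D, and treats TDP as a proxy for actual power draw):

```python
def power_per_ghz(tdp_w, base_ghz):
    # Watts of TDP per GHz of base clock
    return tdp_w / base_ghz

x7700 = power_per_ghz(105, 4.5)  # 7700X: 105 W at 4.5 GHz base
x3d = power_per_ghz(120, 4.2)    # 7800X3D: 120 W at 4.2 GHz base
increase = x3d / x7700 - 1
print(f"power per clock up ~{increase:.0%}")  # ~22%
```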
But on the last bit: the tags are in the L3s currently. You can still have a directory above that, but the directory only needs to track down to the CCX. So now you are either wasting your L2 on tags or you need way more space on the IOD for them, and updating the state of a cache line will cost a whole lot more power. Also, if you have an L3 on the IOD, it is going to be dog slow, because the IOD is clocked low and there are multiple physically large hops. You could implement memory-controller-side caches, but then you have 4 separate blocks of memory controllers. Throughput per clock across the IOD would need to increase significantly, which would be a significant power cost.
If there is a large shared L2, I am betting it will be more like IBM's virtual cache and less like Apple's L2, and still around the same size as the current L2 + L3. Everyone, from phones to desktop to server to big iron, is adding more cache per core, and more cache closer to the core, not less cache further away.
The problem is that lots of workloads, especially server workloads, work on small, non-contiguous bits of data, like DB/disk reads and writes. So if your smallest page is 16 KB, that is a whole lot more wasted cache for many critical workloads today. It's that backward-compatibility trade-off that can bite you; Apple can just force it in its closed ecosystem, and it's happy days.
So now you're back to the whole cache-coherency problem: how are you going to scale those NUMA states over all these small-L2 CCXs? But also, that's just crap for server/VM workloads in general, which will use thread counts > 8. And what's the latency of that 16 MB shared L2? If your L1D is staying at 32 KB, that's a lot of L1 misses that are now a lot slower than today.

As I said someplace previously, I was thinking more along the lines of 2 or 4 cores sharing a large L2. Rumors have said all 8, though, and I only discussed the 16 MB number because that is what Apple is using; I could believe 16 MB per 4 cores or something like that as well. You also still seem to be assuming that everything else will be held constant. I doubt that is the case; I expect Zen 5 may be massively different from Zen 4. Probably not quite the same as the jump from Excavator to Zen, but a lot bigger than the Zen 3 to Zen 4 changes. It is kind of pointless to discuss, I guess. AMD will have register-transfer-level simulators to explore all of these options, so they will know what is best, but there are always trade-offs. Given how much money Apple has to spend and how well their cores seem to do, I expect that they went in the right direction.
No, it isn't. They have been using the same packaging technology they have been using since like 2010 (Magny-Cours). AMD has yet to really stake their primary revenue streams on the modern wave of packaging tech.

"Fancy" packaging is responsible for getting AMD to where they are today.
You're now saying things I didn't say. What I said was that it doesn't matter what technology you use: if you are, on average, going physically further away from the core for your data, you are going to be burning more power and it will be slower. We can see this right now, today: AMD pays a main-memory latency cost on desktop compared to Intel. They get away with it more in server because Intel's fabrics tend to be clocked quite low.

If they were trying to compete using a monolithic die, Intel would probably still be a lot more dominant, even with all of the mistakes they have made. If you do not recognize that SoIC and other stacking technologies represent a disruptive change, then I don't know what to say. They can easily have full-speed caches off-die; that seems to be the case with MI300. Given the thermal constraints, it may be better to have cache under areas that are not the compute portion, so they may have cache under the on-die cache rather than under the logic areas.
I'm not really arguing with any of this except for two things.

This is from TechPowerUp, showing the supposed die area dedicated to directory-based cache coherency. They have it labeled "Tag Directories", but these are not L3 tags. If you put cache on the "CCD IFOP" side, then it looks the same from a cache-coherency perspective, just a lot faster (for cache coherency) since it is on-chip. An RDNA3/MCD-style fan-out link is close to 900 GB/s (vs 63 GB/s for PCIe 5 x16 speeds) with a very minimal PHY, so very little latency. If it is stacked with some other tech, then the latency would be even lower and the bandwidth higher. SoIC would have pretty much no latency penalty compared to on-die. Even HBM-style links are very low latency; the latency of HBM comes from the DRAM array itself, not the interface.
View attachment 76712
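The 63 GB/s figure is just the PCIe 5.0 x16 arithmetic: 32 GT/s per lane across 16 lanes with 128b/130b line encoding (a back-of-the-envelope sketch; the function name is mine):

```python
def pcie_gb_per_s(gt_per_s, lanes, enc_num=128, enc_den=130):
    # Usable one-direction bandwidth in GB/s after line-encoding overhead
    return gt_per_s * lanes * enc_num / enc_den / 8

print(round(pcie_gb_per_s(32, 16), 1))  # PCIe 5.0 x16 -> 63.0 GB/s
```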
Indeed, the L3 actually is the CCM (Cache-Coherent Master) and is responsible for snooping and maintaining coherency across all CCMs.

One: those tag directories are like the HT Assist directories of K10 / Bulldozer. This was 100% the case in Zen 1; an AMD engineer told me so. I was asking him about the additional states they now have (7); without going into details, it was fundamentally the link below with a few additional edge cases.
https://www.cs.columbia.edu/~junfeng/11sp-w4118/lectures/amd.pdf

The L3 in the CCX is the thing tracking the exact state of a cache line down to its cores, and it answers probes coming from the rest of the fabric.
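A toy sketch of that two-level split (my own illustration, not AMD's actual protocol): the fabric-side directory only records which CCXs might hold a line, so probes go only to those CCXs, while exact per-core state stays inside each CCX's L3.

```python
class FabricDirectory:
    # Tracks, per cache-line address, the set of CCX ids that may hold the line.
    def __init__(self):
        self.sharers = {}

    def read(self, addr, ccx):
        # A reader only probes CCXs the directory already lists for this line.
        probes = self.sharers.get(addr, set()) - {ccx}
        self.sharers.setdefault(addr, set()).add(ccx)
        return probes

    def write(self, addr, ccx):
        # A writer must invalidate every other CCX listed for this line.
        victims = self.sharers.get(addr, set()) - {ccx}
        self.sharers[addr] = {ccx}
        return victims

d = FabricDirectory()
d.read(0x1000, 0)         # CCX 0 fills the line; nothing to probe
print(d.read(0x1000, 1))  # CCX 1 reads; must probe CCX 0 -> {0}
print(d.write(0x1000, 0)) # CCX 0 writes; CCX 1 gets invalidated -> {1}
```

The point of CCX-granular tracking is that the directory stays small: one sharer set per line instead of per-core state, at the cost of an extra probe into the target CCX's L3 to resolve the exact owner.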
That's the optimization point we've chosen, and as we continue to go forward, getting more cores, and getting more cores in a shared-L3 environment, we'll still try to manage that latency so that when there are lower thread counts in the system, you're still getting good latency out of that L3. Then the L2: if your L2 is bigger, then you can cut back some on your L3 as well.
We do see core counts growing, and we will continue to increase the number of cores in our core complex that are shared under an L3. As you point out, communicating through that has both latency problems and coherency problems, but that's what architecture is, and that's what we signed up for. It's what we live for: solving those problems. So I'll just say that the team is already looking at what it takes to grow to a complex far beyond where we are today, and how to deliver that in the future.
Okay, so now the layout of MI300 gets clearer:
That's where I thought the CPUs were, but it still doesn't look like 24 cores to me, unless they are shipping with more and disabling some on every die for yield.
Can you not place 16 MB simple images? Not all of us have fast net access.
I am going out on a limb: back at CES, Lisa was talking about 9 5nm dies. So I got it all right already on Jan 5th, before later chiming in on the theory of @Vattila 😂

Just the picture - annotations by AMD.
No, it looks like a 16c CCX and two 4c CCX. The big questions:
View attachment 76892
- Are these really just 3 CCX on a custom die or 2 different types of CCD connected?
- Is the 16c CCD essentially Bergamo?
- Will Bergamo get rid of IFoP just as MI300 does?
- Will this be the little to Zen5's BIG?
You have a point. Could you please post the overlay with some approximate numbers? What exactly did you measure? Only the 2x2 block, or the surroundings as well? What about the size of the 4x4 structure?

You can clearly see eight core-like structures in the "top and bottom" CPU dies/sections.
The die size of those top and bottom sections also looks way too large to only be 4c. By overlaying the rendered version over an actual package shot, I get ~97mm2 using the reported interposer size of 2750mm2. Relative to an HBM die (I can't find the size of an HBM3 die, so I used the 86.25mm2 of HBM2E as an approximation), I get ~85mm2.
Obviously there are error bars here, and also scale issues with the rendered version (as with most/all AMD renderings), so I wouldn't be surprised if that number moved closer to the 70mm2 of a normal Zen 4 CCD. But it's not going to move enough that it starts looking like a 4c CCD.
@Hitman928
Do you know the annotated picture by Locuza?
If you take the width of one base die, 12.75mm, as a reference, and keep in mind that even the 4x4 structure does not go right to the edges, that would give you around 10 x 10mm for the 16 cores. So ~100mm2 for the cores; and as this is a stylized render, that might already include the L3. Does not sound too bad.
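The overlay method used in this exchange boils down to a pixel-area ratio against a feature of known physical size. A sketch with invented pixel counts (only the 2750mm2 interposer figure comes from the discussion above):

```python
def scale_area_mm2(region_px, reference_px, reference_mm2):
    # Scale a region's measured pixel area by a reference of known size
    return region_px * reference_mm2 / reference_px

# e.g. a CPU section covering 3.5% of the interposer's pixel area
print(scale_area_mm2(35_000, 1_000_000, 2750))  # 96.25
```

The error bars come from how accurately the reference feature's edges can be measured, and from any scale distortion in the render itself.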