Very impressive results, even more so considering the atrocious memory latency. I took the liberty of comparing the results to a 4 GHz CFL with tight memory timings:
http://browser.geekbench.com/v4/cpu/compare/13241660?baseline=13277187
The fact that Ryzen operates so well with latency twice as high is incredible.
Why does this show 8 MB for L3 cache when it should be 16 MB per CCX? If it does have 16 MB of L3 per CCX, that is huge by itself. A 16-core part would have 64 MB of L3 cache. The largest L3 on a single-die Intel Xeon is only 38.5 MB. Even if you add in the L2 caches (Xeon has the larger L2 at the moment), the 16-core part would still have more on-package cache than Xeon processors costing around $10,000 or more. The dual-die 56-core Xeon gets up to 77 MB of L3, but Epyc 2 could have as much as 256 MB (16 CCX * 16 MB each).
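For anyone who wants to check the arithmetic, here is a rough sketch of those totals. It uses only the figures quoted above; the 16 MB per CCX is the speculated number, the 4 cores per CCX is my assumption, and the names are just for illustration.

```python
# Back-of-the-envelope L3 totals using only figures quoted in the post.
# ASSUMPTION: 16 MB of L3 per CCX and 4 cores per CCX (speculative for Zen 2).
L3_PER_CCX_MB = 16
CORES_PER_CCX = 4


def total_l3_mb(ccx_count, l3_per_ccx_mb=L3_PER_CCX_MB):
    """Total L3 across all CCXs, in MB."""
    return ccx_count * l3_per_ccx_mb


ccx_in_16_core = 16 // CORES_PER_CCX       # CCXs needed for 16 cores
l3_16_core = total_l3_mb(ccx_in_16_core)   # 16-core desktop part
l3_epyc2 = total_l3_mb(16)                 # 16 CCXs, as in the post
xeon_single_die = 38.5                     # MB, figure quoted in the post
xeon_dual_die = 2 * xeon_single_die        # dual-die 56-core part

print(f"16-core part:    {l3_16_core} MB")
print(f"Epyc 2 (16 CCX): {l3_epyc2} MB")
print(f"Xeon single die: {xeon_single_die} MB")
print(f"Xeon dual die:   {xeon_dual_die} MB")
```

If the per-CCX figure really is 8 MB as the listing shows, passing `l3_per_ccx_mb=8` halves the first two totals.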
I haven’t read this whole 120+ page thread, so I don’t know if this has been covered. I don’t expect the raw memory latency to increase that much over Zen 1. The actual average latency may go down significantly due to the increased cache size and other improvements. The CCX to CCX (remote L3) latency will probably be much lower. It will probably not be as low in latency or as high in bandwidth as Intel’s L3 mesh network, but the mesh network seems to burn a lot of power unnecessarily, at least for most applications. The cache latency within the CCX is much lower than Intel’s, but this requires the OS and applications to take advantage of that.
I expect relatively low latency, even with separate chips, for several reasons. One is that there is no memory clock on the cpu chiplet. There would be no reason to have any kind of clock boundary. For Zen 1, on-die CCX to CCX communication has to go through a 10 or 11 port infinity fabric switch at memory clock. With Zen 2, the cpu chiplet switch will most likely just be a 3 port switch (or something else) that operates at core clock. The average latency improvements, along with possibly significantly lower latency between the CCXs, may make an 8-core single-chiplet Ryzen 3000 much better for games. The chiplet to chiplet latency should be much lower also, so comparisons between 16-core ThreadRipper and 16-core Ryzen 3000 will be interesting.
Due to the increased infinity fabric bandwidth, I would expect the on-die switch to need to be widened to double the width of the first generation. Also, the clock speed of the infinity fabric is much higher, so that will reduce latency between multiple cpu chiplets and memory. The double-width paths will be able to transfer a cache line a lot faster. There are also prefetch improvements that will reduce average latency.
I expect that it also has improvements in bandwidth efficiency. Zen 1 appears to have two separate ports for the two memory controller channels. This may not be the case for Zen 2, since the width of the internal switch may have doubled. They may combine the two channels into one port on the switch. What took 2 memory clocks before may only take one now. The switch can be optimized for desktop use since it doesn’t need anywhere near the number of ports as Zen 1 or the Epyc IO die. It may only have 5 or 6 ports on the switch compared to the 10 to 11 on Zen 1. Also, the IFIS do not need to support connection to other cpu chiplets. They can be optimized for IO only.
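To put rough numbers on the "two clocks versus one" point, here is a minimal sketch of how path width maps to transfer cycles for one cache line. The 64-byte line is standard for x86; the 32-byte and 64-byte path widths and the example clock are purely my assumptions for illustration, not confirmed figures for either generation.

```python
import math

CACHE_LINE_BYTES = 64  # x86 cache line size


def transfer_cycles(path_width_bytes, line_bytes=CACHE_LINE_BYTES):
    """Clock cycles needed to move one cache line over a path of this width."""
    return math.ceil(line_bytes / path_width_bytes)


def transfer_time_ns(path_width_bytes, clock_mhz, line_bytes=CACHE_LINE_BYTES):
    """Transfer time in nanoseconds at the given fabric clock."""
    return transfer_cycles(path_width_bytes, line_bytes) * 1000.0 / clock_mhz


# HYPOTHETICAL widths: 32 bytes/clock (narrow) vs 64 bytes/clock (doubled).
narrow_cycles = transfer_cycles(32)
wide_cycles = transfer_cycles(64)

# HYPOTHETICAL fabric clock of 1600 MHz, chosen only as an example.
print(f"narrow path: {narrow_cycles} cycles, {transfer_time_ns(32, 1600):.3f} ns")
print(f"wide path:   {wide_cycles} cycles, {transfer_time_ns(64, 1600):.3f} ns")
```

This only counts the serialization time for the line itself. Switch traversal, queuing and the DRAM access are on top of it, so the real-world latency gain would be smaller than the 2:1 ratio suggests.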
I kind of expect that a ThreadRipper update will come much later, if at all. There is a good possibility that Zen 2 with 16 cores, much higher memory clocks with better bandwidth utilization, and AVX256 will outperform 16-core ThreadRipper in most cases. The IO differences are not anywhere near as important as people think. There aren’t going to be that many PCI Express 4.0 cards available right away, and a lot of them just don’t need that much speed. More expensive boards can be made with PCI Express switch chips to supply probably as many slots as is really reasonable on an ATX board. The new chipset will have more bandwidth also. The doubled bandwidth of PCI Express 4.0 should bring it close to current ThreadRipper IO capabilities.
This leaves current ThreadRipper owners without a cpu upgrade for the platform. They may eventually make a new ThreadRipper based on the Epyc IO die, but that would be a low priority compared to Rome server processors. Also, it is unclear whether the Epyc IO die would allow good performance in games. It is also unclear whether they would have enough defective IO dies to make it reasonable to use them in ThreadRipper. They do not want to be wasting fully functional Epyc IO dies on ThreadRipper. Those needing workstation performance may need to go with Epyc, but since it may not do well in games, it may not fit as a HEDT processor.