So, the implication there is that Windows servers (which, let's face it, if you are buying an EPYC and want to use Windows, is what you will run if you actually go bare metal) are the only major stumbling block for this?
These standards aren't built in a vacuum; AMD can push things forward if need be. It's an artificial limitation that can be removed at any time.
Yeah, as long as Linux or the various hypervisors don't have a problem with 48+ cores, it shouldn't be an issue.
I mean, multiple leaks have already pointed to Starship having 48 cores, even before Zen released, if memory serves. The leaks were often accompanied by information about first-gen Zen that ended up being true, so there is high credibility to them.

Standards, specifically common industry standards, would need to be revised... this takes time.
Look, fellas, I'm not saying higher core counts aren't ever coming... I'm just saying don't hold your breath for the time being.
The promise is that you can buy a motherboard with current EPYC CPU, and upgrade it to an EPYC 2 CPU with 48 cores some time down the line. So, 8 channels and 48 cores.
I still do not see any point in the "connection topology" argument for 4-core CCX. Cores do not talk to other cores, they talk to L3 slices.
And? Maybe you should go look at how Zen/Zeppelin/EPYC work, because right now each processor group only has 16 logical threads in it. An EPYC two-socket system has 8 NUMA nodes, 192/8 = 24, and 24 < 64...

No, the point wasn't that Linux or whatever has a problem with more than 48 cores... for that matter, neither does Windows (10, Server 2016, or again, whatever). The problem is the Windows restriction (which may well also be present in some fashion on other OSes, as I believe this is based on an APIC standard) of no more than 64 logical processors per processor group. If you're not familiar with how NUMA really works (translation: you've not operated a NUMA system and don't truly understand how they function), we can stop here, as I've likely lost you.
How does Knights Landing work in Windows Server 2012 R2, then? It shows 64 cores and 256 threads.

AMD will *not* release a 48-core CPU for the same reason Intel won't release a 48-core CPU: currently, Windows has an inherent limitation on processor-group size of at most 64 logical processors. This is based on the current x2APIC spec and is not possible to modify without a coordinated industry change.
And? Maybe you should go look at how Zen/Zeppelin/EPYC work, because right now each processor group only has 16 logical threads in it. An EPYC two-socket system has 8 NUMA nodes, 192/8 = 24, and 24 < 64...
edit: I derped — 48 cores 1P, 96 cores 2P; 96 threads 1P, 192 threads 2P.
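A quick back-of-the-envelope check of the numbers above (a sketch only; the 64-logical-processor-per-group cap and the 8-NUMA-node layout are as described in this thread, and the 48-core figure is the rumoured part, not a shipping product):

```python
GROUP_LIMIT = 64  # Windows' max logical processors per processor group

def threads_per_numa_node(cores_per_socket, sockets, smt, numa_nodes):
    """Logical processors landing in each NUMA node, assuming an even split."""
    return cores_per_socket * sockets * smt // numa_nodes

# Rumoured 48-core part in a 2P system with SMT2, split across 8 NUMA nodes:
per_node = threads_per_numa_node(cores_per_socket=48, sockets=2, smt=2, numa_nodes=8)
print(per_node)                 # 192 threads / 8 nodes = 24
print(per_node <= GROUP_LIMIT)  # True: each node fits in one processor group
```

The same formula with today's 32-core EPYC gives 16 threads per node, matching the "16 logical threads per processor group" figure quoted earlier in the thread.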
Just as idle speculation, I'm wondering if SMT2 or SMT4 is in the cards, à la IBM's implementation.
This conversation is just really dumb, and this isn't a good point for so many reasons.

Correct. And if I disable NUMA, then what happens...? (I know the answer but will ask it here for fun.)
This conversation is just really dumb, and this isn't a good point for so many reasons.
1. Why would you want to kill performance?
2. If we just turn off NUMA, we can already exceed 64 threads with Intel v4 4S systems.
3. Even in 2012 (can't find newer data) bare metal was only 30% of the market, and generally you keep vCPUs per guest low (around 4 vCPUs) so you don't run into scheduling contention (before you run into memory or I/O limits).
4. AMD have already said they are producing a 48-core server part; because of the architecture, thread count is going to be 4*X*4*2*2 or X*2*4*2*2. 48 cores is really the smallest logical increase they can make.
5. Consider that if they don't make a bigger die for 7nm (relative to Zeppelin), at ~100 mm² they could be pad-limited.
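The factorisation in point 4 can be sketched as follows (my reading of the 4*X*4*2*2 form, taking X as CCXs per die; the per-die CCX counts below are illustrative, not confirmed):

```python
# Thread count as a product of topology factors: cores per CCX,
# CCXs per die (X, the free parameter), dies per socket, sockets, SMT.
def thread_count(ccx_per_die, cores_per_ccx=4, dies_per_socket=4,
                 sockets=2, smt=2):
    return cores_per_ccx * ccx_per_die * dies_per_socket * sockets * smt

print(thread_count(ccx_per_die=2))  # 128: today's Zeppelin-based 32-core 2P
print(thread_count(ccx_per_die=3))  # 192: the rumoured 48-core part, 2P
```

With everything else held fixed, bumping X by one adds 32 cores across a 2P system, which is why 48 cores per socket is the smallest step up from 32.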
You are wrong; stop trying to pretend there is some magical reason as to why you are right.
I love how you just ignore 85% of my post pointing out how ridiculous this line you're taking is.

Why don't you ask AMD? Content creation mode essentially disables NUMA. The reason is clear: non-NUMA-aware apps will not cross a NUMA boundary during execution.
Again, there is no processor-count limitation with respect to NUMA; there is a Windows restriction of no more than 64 threads per processor group.
Have you played with a NUMA system? Threadripper counts....
Would be highly interesting to see. The idea of a core that is wastefully wide for single-threaded work, yet still no slower and potentially even faster than an SMT2 core, built to handle four threads, has always intrigued me.

I don't think there is any reliable information out in the open suggesting this, but it sure would be cool. And it would fit nicely with the design philosophy and topology outlined in my original post, in which everything is grouped by four: 4 threads constituting a powerful wide core, optimally interconnected in a CCX of four cores, stamped out on a die with 4 CCXs optimally interconnected, in a socket of 4 optimally interconnected dies, in a system of 4 optimally interconnected sockets. Balancing everything around this simple and efficient grouping by four feels very zen.
In particular, SMT4 would allow AMD to design a very wide core achieving high competitive single-thread IPC, while ensuring it will be well utilised for multi-threaded workloads in which each thread has a low IPC (e.g. due to memory access stalls).
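The utilisation argument can be illustrated with a toy model (entirely my own construction, not from any leak: a core retires at most `width` instructions per cycle, while memory stalls cap each individual thread at `thread_ipc`):

```python
def core_ipc(width, threads, thread_ipc):
    # Aggregate IPC: extra SMT threads fill the issue slots a stalled
    # thread leaves idle, up to the core's total width.
    return min(width, threads * thread_ipc)

WIDTH = 6  # hypothetical issue width of a future wide core
for n in (1, 2, 4):
    print(n, core_ipc(WIDTH, n, thread_ipc=1.5))
# With stall-bound threads at 1.5 IPC each, SMT2 fills only half the
# 6-wide core, while SMT4 saturates it (4 * 1.5 = 6.0).
```

The point is exactly the one made above: a single low-IPC thread wastes a wide core, and four of them together do not.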
Predictions

By the way, this is how I envisioned 64-core EPYC:
I am trying to reconcile this with the latest rumour about a chiplet and interposer design. It might be done as follows.
Imagine that this hypothetical 64-core EPYC chip will still use 4 dies (gray) mounted and interconnected as shown (and like before) on an organic package substrate. However, these dies are not monolithic. Instead they are relatively small 28nm active interposers with all the uncore logic, as well as the interconnect for the CCXs (green), which are tiny 7nm dies mounted on the 28nm interposer.
I presume they would need some filling material around the 7nm dies, so that the uncore parts of the interposer can make contact with the heat-spreader, for stability and efficient cooling. But this is a general problem for 2.5D designs, I guess.
The nice thing about this design is that the 7nm yield would be great, since a single-CCX die without uncore logic would be tiny. Also, they could test for known-good-die on three levels, before testing the whole package: the 7nm CCX dies and the 28nm interposers before and after mounting the CCXs. This should reduce the test failure rate for the whole package.
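The yield claim can be quantified with a simple Poisson defect model (my assumption, not anything from the thread; the defect density and die areas below are made-up illustrative numbers):

```python
import math

def die_yield(area_mm2, defects_per_mm2):
    # Poisson model: probability that a die of the given area has zero defects.
    return math.exp(-defects_per_mm2 * area_mm2)

D0 = 0.005  # hypothetical early-7nm defect density, defects per mm^2
ccx = die_yield(40, D0)    # tiny single-CCX die without uncore (guessed area)
mono = die_yield(200, D0)  # hypothetical monolithic die for comparison
print(f"CCX die: {ccx:.2f}, monolithic: {mono:.2f}")  # ~0.82 vs ~0.37
```

Under this model, shrinking the 7nm die to a bare CCX roughly doubles the per-die yield at the same defect density, and the multi-level known-good-die testing described above compounds that advantage at the package level.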
For the consumer chips they would just mount fewer CCXs, or fuse off dysfunctional ones that failed testing.
Could it be done? If not, where did I go wrong?
No uncore but more L3 cache
So, have 5 chips on one package, four with IF links to the 5th, and the 5th handling all the I/O between the package and the rest of the system.
Sorry, I meant no separate uncore and/or L3 die; every chiplet is a complete CPU. Edited now.

Where does the uncore logic go? On a separate chiplet in the middle of a 3x3 layout on the EPYC interposer?
100mm^2 max is my guess
As I had mentioned in another thread, moving to 7nm at the earliest opportunity, when yield rates will be at their lowest, implies the smallest die needed. This is my fundamental starting assumption; 100mm^2 max is my guess, and this excludes a 16-core unit.