Speculation: The CCX in Zen 2


jpiniero

Lifer
Oct 1, 2010
14,838
5,456
136
Yeah, as long as Linux or the various hypervisors don't have a problem with 48+ cores, it shouldn't be an issue.
 

CatMerc

Golden Member
Jul 16, 2016
1,114
1,153
136
These standards aren't built in a vacuum; AMD can push things forward if need be. It's an artificial limitation that can be removed at any time.
 

kjboughton

Senior member
Dec 19, 2007
330
118
116
So, the implication there is that Windows Server (which, let's face it, is what you will run if you are buying an EPYC, want to use Windows, and actually go bare metal) is the only major stumbling block for this?

There may be others; this is one that I have pointed out and a good reason I believe you will *not* see CPUs with more than 32 real cores anytime soon.

BTW, I've run Server 2016 on my rig and actually prefer the scheduling done by Windows 10. It feels and operates more like a workstation than a server (surprise). Given that Server 2016 and Win 10 share the same kernel code base and can both manage these CPUs properly, I would go with an EPYC workstation using Win 10 over Server 2016, knowing what I know.
 

kjboughton

Senior member
Dec 19, 2007
330
118
116
These standards aren't built in a vacuum; AMD can push things forward if need be. It's an artificial limitation that can be removed at any time.

Standards, specifically common industry standards, would need to be revised... this takes time.

Look, fellas, I'm not saying higher core counts aren't ever coming... I'm just saying don't hold your breath for the time being.
 

kjboughton

Senior member
Dec 19, 2007
330
118
116
Yeah, as long as Linux or the various hypervisors don't have a problem with 48+ cores, it shouldn't be an issue.

No, the point wasn't whether Linux or whatever has a problem with more than 48 cores... for that matter, neither does Windows (10 or Server 2016 or, again, whatever). The problem is the Windows restriction (which very well may also be present in some fashion on other OSes, as I believe this is based on an APIC standard) of no more than 64 logical processors per processor group. If you're not familiar with how NUMA really works (translation: you've not operated a NUMA system and don't truly understand how they function), we can stop here, as I've likely lost you.
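
For concreteness, a minimal sketch (illustrative only, using documented Windows 7+ APIs) of where that ceiling surfaces: each processor group reports at most 64 logical processors, because the per-group affinity mask is a single KAFFINITY word (64 bits on x64).

```c
#define _WIN32_WINNT 0x0601  /* Windows 7+, needed for group APIs */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    WORD groups = GetActiveProcessorGroupCount();
    printf("Active processor groups: %u\n", groups);

    for (WORD g = 0; g < groups; g++) {
        /* Never more than 64 per group, however big the machine is. */
        printf("  group %u: %lu logical processors\n",
               g, GetActiveProcessorCount(g));
    }

    printf("Total logical processors: %lu\n",
           GetActiveProcessorCount(ALL_PROCESSOR_GROUPS));
    return 0;
}
```

On a hypothetical 2P 48-core SMT2 system (192 logical processors), Windows would have to expose at least 192/64 = 3 groups.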
 

CatMerc

Golden Member
Jul 16, 2016
1,114
1,153
136
Standards, specifically common industry standards, would need to be revised... this takes time.

Look, fellas, I'm not saying higher core counts aren't ever coming... I'm just saying don't hold your breath for the time being.
I mean, multiple leaks have already pointed to Starship having 48 cores, even before Zen launched, if memory serves. The leaks were often accompanied by information about first-gen Zen that ended up being true, so there is high credibility to them.
The likelihood of it being true is very high, which means it's less a question of if and more of how.
 

Vattila

Senior member
Oct 22, 2004
805
1,394
136
The promise is that you can buy a motherboard with a current EPYC CPU and upgrade it to an EPYC 2 CPU with 48 cores sometime down the line. So, 8 channels and 48 cores.

It is not inconceivable that they will introduce a new platform before the end of SP3 — either while keeping compatibility with SP3, or even by producing two variants of the CPUs, differing only in the packaging/socket and how many memory controllers are enabled.

I still do not see any point in the "connection topology" argument for a 4-core CCX. Cores do not talk to other cores; they talk to L3 slices.

Your line of argument makes no sense to me. My argument is simply that direct connections between 4 cores are the optimal topology (lowest latency). Your argument seems to be that the cores in the current CCX are not directly connected (which, unbelievably to me, and contrary to presentations by AMD, means they must then have a suboptimal topology), and that this claimed fact then makes topology irrelevant. That's strange reasoning to me.

Further, you argue that AMD can just as well throw in two more cores in the current scheme, i.e., as you seem to put it, with the L3 slices (controllers) acting as independent routers between cores. My point is: that is suboptimal. It seems clear that you can have an interconnection scheme (topology) between 4 cores that beats that.

I think AMD will focus on making a CCX of 4 cores optimal and well balanced. The trade-off is the inter-CCX latency penalty already at 5 cores. But I don't think shifting this penalty to a mere 7 cores matters much, especially since you have to increase latency within the CCX to do so (by going to a suboptimal topology that can interconnect more than 4 cores). Better to have 4 cores optimally connected, I suspect.
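
To put numbers on the topology argument, a back-of-envelope sketch (my own arithmetic, not AMD data): a full point-to-point mesh of n cores needs n(n-1)/2 links, so the wiring more than doubles going from 4 cores (6 links) to 6 cores (15 links).

```c
#include <stdio.h>

/* Links needed for a full point-to-point mesh of n cores: n(n-1)/2. */
int main(void)
{
    for (int n = 2; n <= 8; n++)
        printf("%d cores fully connected: %2d links\n",
               n, n * (n - 1) / 2);
    return 0;
}
```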
 
Last edited:
Reactions: cbn

scannall

Golden Member
Jan 1, 2012
1,948
1,640
136
Just idle speculation, but I'm wondering if SMT2 or SMT4 is in the cards, à la IBM's implementation.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,864
3,418
136
No, the point wasn't whether Linux or whatever has a problem with more than 48 cores... for that matter, neither does Windows (10 or Server 2016 or, again, whatever). The problem is the Windows restriction (which very well may also be present in some fashion on other OSes, as I believe this is based on an APIC standard) of no more than 64 logical processors per processor group. If you're not familiar with how NUMA really works (translation: you've not operated a NUMA system and don't truly understand how they function), we can stop here, as I've likely lost you.
And? Maybe you should go look at how Zen/Zeppelin/EPYC work, because right now each processor group only has 16 logical threads in it. An EPYC two-socket system has 8 NUMA nodes: 192/8 = 24, and 24 < 64...

edit: I derped; 48 cores per socket, so 96 cores 2P, 96 threads 1P, 192 threads 2P.
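
Spelled out, the corrected arithmetic (a sketch built on the rumoured figures in this thread, not official specs):

```c
#include <stdio.h>

int main(void)
{
    int cores_per_socket = 48;  /* rumoured Starship/Zen 2 figure */
    int smt              = 2;   /* threads per core               */
    int sockets          = 2;   /* 2P system                      */
    int numa_nodes       = 8;   /* 4 dies per socket x 2 sockets  */

    int threads_2p = cores_per_socket * smt * sockets;  /* 192 */
    printf("2P threads: %d\n", threads_2p);
    printf("threads per NUMA node: %d (< 64 per processor group)\n",
           threads_2p / numa_nodes);                    /* 24  */
    return 0;
}
```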
 

moinmoin

Diamond Member
Jun 1, 2017
4,994
7,765
136
AMD will *not* release a 48-core CPU for the same reason Intel won't release a 48-core CPU: currently, Windows has an inherent limitation on processor group size of at most 64 logical processors. This is based on the current x2APIC spec and is not possible to modify without a coordinated industry change.
How does Knights Landing work in Windows Server 2012 R2 then? It shows 64 cores and 256 threads.
https://www.servethehome.com/intel-xeon-phi-x200-knights-landing-boots-windows/
 

kjboughton

Senior member
Dec 19, 2007
330
118
116
And? Maybe you should go look at how Zen/Zeppelin/EPYC work, because right now each processor group only has 16 logical threads in it. An EPYC two-socket system has 8 NUMA nodes: 192/8 = 24, and 24 < 64...

edit: I derped; 48 cores per socket, so 96 cores 2P, 96 threads 1P, 192 threads 2P.

Correct. And if I disable NUMA, then what happens...? (I know the answer but will ask it here for fun.)
 

LightningZ71

Golden Member
Mar 10, 2017
1,661
1,945
136
Not touching your NUMA question, because when that is disabled on high-P systems, bad things happen. Looking through the Windows documentation on handling processor groups and systems with more than 64 logical processors, it appears that Windows can (and will, if not regulated) arbitrarily assign processor groups to clusters of up to 64 logical processors. This is done with locality in mind, but LOCALITY IS NOT REQUIRED. Meaning that if a physical processor has more than 64 logical processors, more than one processor group will be assigned to it to account for all of the logical processors present.

As long as programs are coded properly to handle NUMA setups with more than 64 logical processors (which the big, actively developed stuff mostly is), it will operate in a logically complete fashion. Speed optimization may not be ideal, however, as the processor will be partitioned in a less than optimal manner. There appears to be at least some ability to influence how the processor groups are defined (I haven't read that far yet), which should be able to mitigate the problems here; see the sketch below.
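
For illustration, a hedged sketch of what "group-aware" code looks like (documented Windows 7+ APIs; the pinning policy is my own example, error handling trimmed): a thread runs in exactly one processor group at a time, so software that wants to reach beyond the first 64 logical processors must set a GROUP_AFFINITY per worker explicitly.

```c
#define _WIN32_WINNT 0x0601  /* Windows 7+, needed for group APIs */
#include <windows.h>
#include <stdio.h>

static void pin_current_thread_to_group(WORD group)
{
    GROUP_AFFINITY ga = {0};
    DWORD count = GetActiveProcessorCount(group);

    ga.Group = group;
    /* Allow every logical processor in the group (max 64 on x64). */
    ga.Mask = (count >= 64) ? ~(KAFFINITY)0
                            : (((KAFFINITY)1 << count) - 1);

    if (!SetThreadGroupAffinity(GetCurrentThread(), &ga, NULL))
        fprintf(stderr, "SetThreadGroupAffinity failed: %lu\n",
                GetLastError());
}

int main(void)
{
    WORD groups = GetActiveProcessorGroupCount();
    for (WORD g = 0; g < groups; g++)
        pin_current_thread_to_group(g);  /* demo: hop across groups */
    return 0;
}
```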

This is not just abstract. Multiple sites already have demo Intel Xeon Phi systems with 64 cores and 256 threads booting into Windows Server 2012 and executing benchmarks that show all threads active with solid performance. A 48-core processor with 96 threads is absolutely usable on Windows Server. It's just a matter of MS making it official (and we know they are working on supporting Xeon Phi natively). I have no doubt that if AMD wants to release a 48-core EPYC processor, MS will have it supported, especially with their per-core licensing model in place.
 

Vattila

Senior member
Oct 22, 2004
805
1,394
136
Just idle speculation, but I'm wondering if SMT2 or SMT4 is in the cards, à la IBM's implementation.

I don't think there is any reliable information out in the open suggesting this, but it sure would be cool. And it would fit nicely with the design philosophy and topology outlined in my original post, in which everything is grouped by four: 4 threads constituting a powerful wide core, optimally interconnected in a CCX of four cores, stamped out on a die with 4 CCXs optimally interconnected, in a socket of 4 optimally interconnected dies, in a system of 4 optimally interconnected sockets. Balancing everything around the simple and efficient grouping by four feels very zen.

In particular, SMT4 would allow AMD to design a very wide core achieving highly competitive single-thread IPC, while ensuring it will be well utilised for multi-threaded workloads in which each thread has a low IPC (e.g. due to memory access stalls).
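
For fun, the arithmetic of that grouping (purely an illustration of the hypothesis above, not a product spec):

```c
#include <stdio.h>

/* "Grouped by four" at every level, per the hypothesis above. */
int main(void)
{
    int threads_per_core = 4;  /* SMT4 */
    int cores_per_ccx    = 4;
    int ccx_per_die      = 4;
    int dies_per_socket  = 4;
    int sockets          = 4;

    printf("threads in a fully grouped-by-four system: %d\n",
           threads_per_core * cores_per_ccx * ccx_per_die *
           dies_per_socket * sockets);  /* 4^5 = 1024 */
    return 0;
}
```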
 
Last edited:

itsmydamnation

Platinum Member
Feb 6, 2011
2,864
3,418
136
Correct. And if I disable NUMA, then what happens...? (I know the answer but will ask it here for fun.)
This conversation is just really dumb, and this isn't a good point, for so many reasons.

1. Why would you want to kill performance?
2. If we just turn off NUMA, we can already exceed 64 threads with Intel v4 4S systems.
3. Even in 2012 (can't find newer data) bare metal was only 30% of the market; generally you keep vCPUs per guest low (around 4 vCPUs) so you don't run into scheduling contention (before you run into memory or I/O limits).
4. AMD have already said they are producing a 48-core server part; because of the architecture, the thread count is going to be 4*X*4*2*2 or X*2*4*2*2. 48 cores is really the smallest logical increase they can make.
5. Consider that if they don't make a bigger die for 7nm (relative to Zeppelin), at ~100 mm² they could be pad-limited.

You are wrong; stop trying to pretend there is some magical reason as to why you are right.
 

kjboughton

Senior member
Dec 19, 2007
330
118
116
This conversation is just really dumb, and this isn't a good point, for so many reasons.

1. Why would you want to kill performance?
2. If we just turn off NUMA, we can already exceed 64 threads with Intel v4 4S systems.
3. Even in 2012 (can't find newer data) bare metal was only 30% of the market; generally you keep vCPUs per guest low (around 4 vCPUs) so you don't run into scheduling contention (before you run into memory or I/O limits).
4. AMD have already said they are producing a 48-core server part; because of the architecture, the thread count is going to be 4*X*4*2*2 or X*2*4*2*2. 48 cores is really the smallest logical increase they can make.
5. Consider that if they don't make a bigger die for 7nm (relative to Zeppelin), at ~100 mm² they could be pad-limited.

You are wrong; stop trying to pretend there is some magical reason as to why you are right.

Why don't you ask AMD? Content creation mode essentially disables NUMA. The reason is clear: non-NUMA-aware apps will not cross a NUMA boundary during execution.

Again, there is no processor-count limitation with respect to NUMA; there is a Windows restriction of no more than 64 logical processors per processor group.

Have you played with a NUMA system? Threadripper counts...
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,864
3,418
136
Why don't you ask AMD? Content creation mode essentially disables NUMA. The reason is clear: non-NUMA-aware apps will not cross a NUMA boundary during execution.

Again, there is no processor-count limitation with respect to NUMA; there is a Windows restriction of no more than 64 logical processors per processor group.

Have you played with a NUMA system? Threadripper counts...
I love how you just ignore the 85% of my post that points out how ridiculous the line you're taking is.

I see now you're the 1% of the market that is different from everyone else but thinks the rest of the market will care about your requirements.

Yes, I spend all day designing/provisioning converged datacentre infrastructure (compute, network, storage).

How about we make this simple: care to make a wager on whether 7nm Zen 2 is 48 cores or more?
 
Reactions: Ajay

CatMerc

Golden Member
Jul 16, 2016
1,114
1,153
136
I don't think there is any reliable information out in the open suggesting this, but it sure would be cool. And it would fit nicely with the design philosophy and topology outlined in my original post, in which everything is grouped by four: 4 threads constituting a powerful wide core, optimally interconnected in a CCX of four cores, stamped out on a die with 4 CCXs optimally interconnected, in a socket of 4 optimally interconnected dies, in a system of 4 optimally interconnected sockets. Balancing everything around the simple and efficient grouping by four feels very zen.

In particular, SMT4 would allow AMD to design a very wide core achieving highly competitive single-thread IPC, while ensuring it will be well utilised for multi-threaded workloads in which each thread has a low IPC (e.g. due to memory access stalls).
It would be highly interesting to see. The idea of a core that is wastefully wide for single-threaded work, but still no slower (and potentially even faster) than an SMT2 core, built to handle four threads, has always intrigued me.
 
Reactions: Vattila

Vattila

Senior member
Oct 22, 2004
805
1,394
136
By the way, this is how I envisioned 64-core EPYC:



I am trying to reconcile this with the latest rumour about a chiplet and interposer design. It might be done as follows.

Imagine that this hypothetical 64-core EPYC chip will still use 4 dies (gray) mounted and interconnected as shown (and like before) on an organic package substrate. However, these dies are not monolithic. Instead they are relatively small 28nm active interposers with all the uncore logic, as well as the interconnect for the CCXs (green), which are tiny 7nm dies mounted on the 28nm interposer.

The nice thing about this design is that the 7nm yield would be great, since a single-CCX die without uncore logic would be tiny. Also, they could test for known-good die at three levels before testing the whole package (i.e. test the 7nm CCX dies before mounting, and the 28nm interposers before and after mounting the CCXs). This should reduce the test failure rate for the whole package.

For the consumer chips they would just mount fewer CCXs, or fuse off dysfunctional ones that failed testing.

Could it be done? If not, where did I go wrong?
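
As a sanity check on the yield intuition above, a hedged sketch using the textbook Poisson yield model Y = exp(-A·D0); the defect density here is an assumed, illustrative figure, not a foundry number:

```c
#include <math.h>
#include <stdio.h>

/* Poisson yield model: Y = exp(-A * D0). D0 is an assumed,
   illustrative defect density, not a foundry figure. */
int main(void)
{
    const double d0 = 0.2;  /* defects per cm^2 (assumed) */
    const double areas_mm2[] = { 25.0, 75.0, 200.0, 450.0 };

    for (int i = 0; i < 4; i++) {
        double a_cm2 = areas_mm2[i] / 100.0;  /* mm^2 -> cm^2 */
        printf("die %6.1f mm^2: yield ~ %5.1f%%\n",
               areas_mm2[i], 100.0 * exp(-d0 * a_cm2));
    }
    return 0;
}
```

Under this assumed D0, a ~25 mm² chiplet yields around 95% while a ~450 mm² monolithic die drops near 40%, which is the whole appeal of tiny CCX dies.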
 
Last edited:
Reactions: Schmide

maddie

Diamond Member
Jul 18, 2010
4,787
4,771
136
By the way, this is how I envisioned 64-core EPYC:



I am trying to reconcile this with the latest rumour about a chiplet and interposer design. It might be done as follows.

Imagine that this hypothetical 64-core EPYC chip will still use 4 dies (gray) mounted and interconnected as shown (and like before) on an organic package substrate. However, these dies are not monolithic. Instead they are relatively small 28nm active interposers with all the uncore logic, as well as the interconnect for the CCXs (green), which are tiny 7nm dies mounted on the 28nm interposer.

I presume they would need some filling material around the 7nm dies, so that the uncore parts of the interposer can make contact with the heat-spreader, for stability and efficient cooling. But this is a general problem for 2.5D designs, I guess.

The nice thing about this design is that the 7nm yield would be great, since a single-CCX die without uncore logic would be tiny. Also, they could test for known-good die at three levels before testing the whole package: the 7nm CCX dies, and the 28nm interposers before and after mounting the CCXs. This should reduce the test failure rate for the whole package.

For the consumer chips they would just mount fewer CCXs, or fuse off dysfunctional ones that failed testing.

Could it be done? If not, where did I go wrong?
Predictions:

Stays with 2 × 4-core CCX = 8-core basic unit, as exists today.
No separate uncore, but more L3 cache.
Fabric speed increases to accommodate AM4 memory limitations.
Improved layout and de-bottlenecking = greater IPC + increased clocks.
More than 4 basic units for higher-count EPYC CPUs = 8 × 8C dies [64 cores].
EPYC on a passive interposer of ~900 mm² [minimal cost increase].
EPYC dies fabbed on a TSMC process for absolute efficiency.
Ryzen 3xxx fabbed on a GloFo process for higher clock speeds.
 
Last edited:

LightningZ71

Golden Member
Mar 10, 2017
1,661
1,945
136
Increasing the number of basic units for EPYC would require additional IFOP connections on the chips and more reservation spots/switching targets in the internal IF uncore of the chips themselves. If you're going through all of that, it's likely no more complex to add two more CCXs to the existing floorplan. While that will definitely up the transistor count, it shouldn't push the die area beyond the existing 14/12nm footprint.

Interestingly, though, if they wanted to, they could keep roughly the same basic layout of the individual chips at 7nm, but move the DRAM and I/O controllers off the chip and onto a specific I/O chip, leaving the rest to be essentially CCX-and-IF chips on the same EMIB/MCM package. So, have 5 chips on one package, four with IF links to the 5th, and the 5th handling all the I/O between the package and the rest of the system. This way, they can change out DRAM controllers, PCI controllers, etc. without having to redo the whole chip, or update the package for different applications in isolation from the cores. An EMIB/MCM package could allow them to run the IF links between the chips at speeds similar to what they do internally in the chips today.

Consumer chips could be a mix of 2 to 4 7nm chiplets plus an I/O chiplet, and maybe contain an iGPU chiplet as well on dual-CPU-chiplet packages. At 7nm, but maintaining the existing AM4 socket, they'll have plenty of package size to play with for things like that. It would even be possible to integrate an HBM package in there as well. On a desktop product, cooling a package with two CPU chiplets, an iGPU chiplet, an HBM stack and an I/O chiplet would not be unreasonable. With low enough voltage and frequency targets, it could even work on mobile. Intel is already there with KL-G; their pricing on that product is indicative of its uniqueness in the market, and not entirely a product of production cost.
 
Reactions: Gideon and Vattila

Vattila

Senior member
Oct 22, 2004
805
1,394
136
So, have 5 chips on one package, four with IF links to the 5th, and the 5th handling all the I/O between the package and the rest of the system.

This is the AdoredTV 5-chiplet hypothesis.

In my design, the 5th chip is the active interposer itself. I guess a ~200 mm² interposer on the tried-and-tested 28nm process should be able to house all the uncore logic, with four 7nm CCX chiplets mounted on top. The size of a CCX is 45.5 mm² on 14LPP, and with over 2× density on the 7LP process, it should be down to half that size, allowing some additional transistor budget for core improvements and larger caches. So these CCX chiplets would be tiny and hence should yield well.
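
A rough area check on those figures (the 45.5 mm² and ~2× density numbers are from the text above; the ~200 mm² interposer is the stated guess; the rest is simple arithmetic, not a die measurement):

```c
#include <stdio.h>

/* Area check: 45.5 mm^2 CCX on 14LPP and ~2x density at 7LP are the
   post's figures; the rest is simple arithmetic. */
int main(void)
{
    double ccx_14lpp  = 45.5;             /* mm^2                */
    double ccx_7nm    = ccx_14lpp / 2.0;  /* ~2x density scaling */
    double four_ccx   = 4.0 * ccx_7nm;
    double interposer = 200.0;            /* assumed ~200 mm^2   */

    printf("7nm CCX: ~%.1f mm^2 each; four of them cover ~%.0f mm^2\n"
           "of a ~%.0f mm^2 active interposer's top surface\n",
           ccx_7nm, four_ccx, interposer);
    return 0;
}
```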

Then put four of these interposers in an MCM for 64-core EPYC and Threadripper, two for low-core-count Threadripper, and one in Ryzen 3000 — just as before.
 
Last edited:

maddie

Diamond Member
Jul 18, 2010
4,787
4,771
136
Where does the uncore logic go? On a separate chiplet in the middle of a 3x3 layout on the EPYC interposer?
Sorry, I meant no separate uncore and/or L3 die; every chiplet is a complete CPU. Edited now.

As I had mentioned in another thread, moving to 7nm at the earliest opportunity, when yield rates will be at their lowest, implies the smallest die needed. This is my fundamental starting assumption; 100 mm² max is my guess, and that rules out a 16-core unit.

Seems like AMD can actually outbid the phone makers for these wafers on the latest node.
 

jpiniero

Lifer
Oct 1, 2010
14,838
5,456
136
As I had mentioned in another thread, moving to 7nm at the earliest opportunity, when yield rates will be at their lowest, implies the smallest die needed. This is my fundamental starting assumption; 100 mm² max is my guess, and that rules out a 16-core unit.

Yield is going to suck, yes. But that's where Ryzen and Threadripper come in. I mean, you won't see a 16-core Ryzen next year, and they could even do what Intel did and introduce a 12-core R9 while keeping the core counts of R7 and R5 similar.
 