Discussion Intel current and future Lakes & Rapids thread

eek2121 · Mar 14, 2021

uzzi38 said:
That is a lot less than you think it is.

No, it isn’t. They need more capacity. Yields aren’t the issue.

uzzi38 · Mar 14, 2021

eek2121 said:
No, it isn’t. They need more capacity. Yields aren’t the issue.

Based on what? You have to remember, ICL-SP isn't quite using the same process as TGL.

Anyway, Ian's napkin maths should help illustrate how small a number we're talking about:

Update: Just some back of the napkin math here. Intel's middle die-size HCC on Skylake Xeon Scalable was ~480 mm², with a production rate of 108 per wafer, assuming perfect yield. Let's work on the assumption that this would be a good die size for Ice Lake Xeon (more cores, denser 10nm process). If Intel's defect rate for 10nm was as good as TSMC's N7, for which we know the latter to be a rate of 0.09 defects per cm², the actual yield would be 71 dies per wafer, which equates to 66%. If Intel was extracting 71 dies per wafer, then 115,000 dies would be around 1620 wafers. We don't know at this point if Intel is absorbing some of those defects by having more physical cores on the die than will be offered, and this doesn't take into account reduced die count configurations (e.g. a 20 core part from a 28-core die). But it's an interesting number.
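
The figures in the quote can be reproduced with a few lines of arithmetic. The quote does not say which yield model was used; this sketch assumes Murphy's model, which happens to land on the quoted 66% and 71 dies (a plain Poisson model gives roughly 65% and 70 dies instead). All function and variable names are illustrative.

```python
import math

def murphy_yield(die_area_cm2: float, defects_per_cm2: float) -> float:
    """Murphy's yield model: ((1 - e^(-A*D)) / (A*D))^2."""
    ad = die_area_cm2 * defects_per_cm2
    return ((1.0 - math.exp(-ad)) / ad) ** 2

def poisson_yield(die_area_cm2: float, defects_per_cm2: float) -> float:
    """Poisson yield model: e^(-A*D), shown only for comparison."""
    return math.exp(-die_area_cm2 * defects_per_cm2)

# Inputs taken from the quoted napkin math.
DIE_AREA_CM2 = 4.80          # ~480 mm^2
DEFECT_DENSITY = 0.09        # defects per cm^2
GROSS_DIES_PER_WAFER = 108   # assuming perfect yield
UNITS_SHIPPED = 115_000

# Assumption: Murphy's model; the source does not name its model.
yield_fraction = murphy_yield(DIE_AREA_CM2, DEFECT_DENSITY)
good_dies = round(GROSS_DIES_PER_WAFER * yield_fraction)
wafers_needed = math.ceil(UNITS_SHIPPED / good_dies)

print(f"yield: {yield_fraction:.1%}")
print(f"good dies per wafer: {good_dies}")
print(f"wafers for {UNITS_SHIPPED} dies: {wafers_needed}")
```

Like the quote, this ignores die harvesting (selling partially defective dies as lower core count parts), so the real wafer count could be lower.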

Dayman1225 · Mar 14, 2021

eek2121 said:
What is not serious about it? Ice Lake SP contains up to 32 cores. They’ve shipped more than 100,000 units according to AT. If they were having yield issues they wouldn’t be shipping large dies at all.

ICL-SP is up to 40c now

DrMrLordX · Mar 14, 2021

Dayman1225 said:
ICL-SP is up to 40c now

Thought it topped out at 38c?

Anyway, 100k units after all this time is not impressive, at all. Who knows how long it took them to bin that many dice just to get product out the door? What's really disappointing to me is that we haven't seen a proper Milan vs. Ice Lake-SP showdown anywhere. Unsurprising since Intel is taking forever with Ice Lake-SP and AMD is selling all their Milan chips to hyperscalers.

NTMBK · Mar 14, 2021

eek2121 said:
What is not serious about it? Ice Lake SP contains up to 32 cores. They’ve shipped more than 100,000 units according to AT. If they were having yield issues they wouldn’t be shipping large dies at all.

Here's a graph from back in 2017:

The Booming Server Market In The Wake Of Skylake (www.nextplatform.com)

2.5 million server sales per quarter. Ignoring the fact that a lot of those will be >1 socket, and ignoring the fact that the market is even bigger now than in 2017, Intel has shipped enough chips to serve 4% of one quarter's market. 100,000 is not a huge number in this context.
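
As a quick check of that percentage, here is the same arithmetic as a sketch. The two-socket case is a hypothetical illustration of the ">1 socket" caveat, not a figure from the article.

```python
UNITS_SHIPPED = 100_000            # Ice Lake-SP units, per the thread
SERVERS_PER_QUARTER = 2_500_000    # 2017 figure cited above

# One CPU per server: the most generous case for Intel.
share_single_socket = UNITS_SHIPPED / SERVERS_PER_QUARTER

# Hypothetical: if every server were dual-socket, the same chips cover half as many.
share_dual_socket = UNITS_SHIPPED / (SERVERS_PER_QUARTER * 2)

print(f"single-socket share of one quarter: {share_single_socket:.0%}")
print(f"dual-socket share of one quarter:   {share_dual_socket:.0%}")
```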

Don't forget, Intel was holding up what it claimed was a working Ice Lake SP chip all the way back in 2018. https://www.anandtech.com/show/13699/intel-architecture-day-2018-core-future-hybrid-x86/8

mikk · Mar 14, 2021

How many of the sold units were based on Intel's popular mainstream platform LGA1150 or LGA1151, and how many were based on Haswell-EP or Skylake-EP?

jpiniero · Mar 14, 2021

The Ice Lake-SP sales could be over multiple quarters too.

mikk said:
How many of the sold units were based on Intel's popular mainstream platform LGA1150 or LGA1151

Tough to say but it's popular enough that Dell has several models. But I would say it's still not big.

and how many were based on Haswell-EP or Skylake-EP?

You mean Broadwell-EP instead of Skylake, and close to zero. Smeltdown took care of that.

NTMBK · Mar 15, 2021

mikk said:
How many of the sold units were based on Intel's popular mainstream platform LGA1150 or LGA1151, and how many were based on Haswell-EP or Skylake-EP?

That's a fair point: a small chunk will be low-end servers built on the mainstream platform. But guess what? Those aren't on 10nm yet either.

lobz · Mar 15, 2021

jpiniero said:
Looks legit. Kinda crappy that the 15 W i7 only gets 2 Big even if it gets 8 small. Guess they are pushing the 28 W.

That might not even be able to challenge Renoir-U in MT workloads, if I'm not missing anything, certainly not Cezanne-U. Those guys are absolute beasts for 15W laptop chips, when it comes to performance in a lot of actual work related software.

coercitiv · Mar 15, 2021

lobz said:
That might not even be able to challenge Renoir-U in MT workloads, if I'm not missing anything, certainly not Cezanne-U. Those guys are absolute beasts for 15W laptop chips, when it comes to performance in a lot of actual work related software.

In ideal conditions they will likely surpass Cezanne, but in a strange way: either workloads with strong emphasis on ST perf, or workloads with strong emphasis on MT perf (10+ threads, great MT scaling). Anything in between will likely run better or more consistently on an 8+0 chip.

That's the problem with the hybrid setup: you only need 4 small cores to get the power saving benefits. Once you aim for the MT performance benefit you need lots of these little buggers, and they eat into the big core budget. To make things even harder: area limitation gets compounded with the iGPU focus, since Intel wants to keep graphics emphasis and allocate same relative real estate for the CPU as they did on TGL 4+0.

Gracemont needs to be fast and devilishly efficient to get such premium real-estate on the die.

Shivansps · Mar 15, 2021

The idea of 2 big cores + 4-8 small cores is not bad as a power-efficient, all-purpose gaming CPU, as gaming only needs around 2 strong cores. It is going to lose vs Renoir in MT workloads, but if the heat and power are lower it is still a very interesting idea.
It may also give the iGPU more TDP budget.

lobz · Mar 15, 2021

coercitiv said:
In ideal conditions they will likely surpass Cezanne, but in a strange way: either workloads with strong emphasis on ST perf, or workloads with strong emphasis on MT perf (10+ threads, great MT scaling). Anything in between will likely run better or more consistently on an 8+0 chip.

That's the problem with the hybrid setup: you only need 4 small cores to get the power saving benefits. Once you aim for the MT performance benefit you need lots of these little buggers, and they eat into the big core budget. To make things even harder: area limitation gets compounded with the iGPU focus, since Intel wants to keep graphics emphasis and allocate same relative real estate for the CPU as they did on TGL 4+0.

Gracemont needs to be fast and devilishly efficient to get such premium real-estate on the die.

Umm... you know that the little cores do *not* have HT, right? What I mean with that question: I just can't see 2 really strong cores together with 8 'maybe Skylake' cores at God knows what freq, even with perfect scaling and scheduling (don't forget, we're still talking about Microsoft here), touching 8 full fledged Zen 3 cores in an MT heavy workload.

Shivansps · Mar 15, 2021

lobz said:
Umm... you know that the little cores do *not* have HT, right? What I mean with that question: I just can't see 2 really strong cores together with 8 'maybe Skylake' cores at God knows what freq, even with perfect scaling and scheduling (don't forget, we're still talking about Microsoft here), touching 8 full fledged Zen 3 cores in an MT heavy workload.

The question is: is that even necessary? Games and most other software don't need that; you are only going to see a difference in benchmarks and heavy productivity apps. At the end of the day, what matters for mobile is heat and power vs overall perf. So we really need to see that first.

I can compare this to Bay Trail vs Kabini in mobile: Kabini had the CPU+IGP performance, but IGP perf was cut short due to bad decisions like ST, and in the end Bay Trail's low power and TDP were great and allowed x86 into tablets for the first time with an actually good product. To get rid of Kabini, AMD had to repurpose it as cheap desktop CPUs.

So AMD will likely keep the CPU lead in mobile vs hybrid CPUs; power usage and TDP are the question, and so is the iGPU, because Vega has overstayed its welcome.

lobz · Mar 15, 2021

Shivansps said:
The question is: is that even necessary? Games and most other software don't need that; you are only going to see a difference in benchmarks and heavy productivity apps. At the end of the day, what matters for mobile is heat and power vs overall perf. So we really need to see that first.

I can compare this to Bay Trail vs Kabini in mobile: Kabini had the CPU+IGP performance, but IGP perf was cut short due to bad decisions like ST, and in the end Bay Trail's low power and TDP were great and allowed x86 into tablets for the first time with an actually good product. To get rid of Kabini, AMD had to repurpose it as cheap desktop CPUs.

So AMD will likely keep the CPU lead in mobile vs hybrid CPUs; power usage and TDP are the question, and so is the iGPU, because Vega has overstayed its welcome.

Don't ever let the fact that I was specifically talking about ADL vs RNR or CZN in heavy productivity apps bother you even for a second.

coercitiv · Mar 15, 2021

lobz said:
Umm... you know that the little cores do *not* have HT, right? What I mean with that question: I just can't see 2 really strong cores together with 8 'maybe Skylake' cores at God knows what freq, even with perfect scaling and scheduling (don't forget, we're still talking about Microsoft here), touching 8 full fledged Zen 3 cores in an MT heavy workload.

I know they don't have HT. As I said, you need the opposite ends of the spectrum for the hybrid to win. It's weaker between 4 and 8 threads, but picks up steam between 8 and 12 threads.

Time for "napkin graph". For the sake of convention consider ADL = 1.4x SKL, Zen3 = 1.25x SKL, Gracemont = 1x SKL, HT = +20% no matter the architecture. AFAIK ADL will have HT enabled on the big cores. Here's what performance would look like assuming clocks are the same on all cores.

After this you need to consider:

max clocks on Gracemont, lower max clocks may further exacerbate the loss between 4-8 threads
likely efficiency advantage from the small cores that may allow higher clocks on the hybrid ADL, pushing for a win in the entire 10-16 thread spectrum
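
The napkin model above can be sketched in a few lines for the ADL 2+8 vs. Cezanne 8+0 case. The per-core ratios and the +20% HT uplift come from the post; the fill order (all physical cores first, fastest first, then HT siblings) and every name in the code are assumptions made for illustration, not something the post specifies.

```python
HT_UPLIFT = 0.20  # extra throughput from a core's second thread, per the post

def thread_slots(core_groups):
    """Per-thread throughput contributions, in the order threads are placed.

    core_groups: list of (relative_perf, core_count, has_ht).
    Assumption: threads fill every physical core (fastest first)
    before any HT sibling is used.
    """
    primary, siblings = [], []
    for perf, count, has_ht in core_groups:
        primary += [perf] * count
        if has_ht:
            siblings += [perf * HT_UPLIFT] * count
    return sorted(primary, reverse=True) + sorted(siblings, reverse=True)

def throughput(threads, core_groups):
    """Total relative throughput at a given thread count (equal clocks assumed)."""
    return sum(thread_slots(core_groups)[:threads])

# Relative to Skylake = 1.0, as in the post.
ADL_2P8E = [(1.4, 2, True), (1.0, 8, False)]   # 2 big cores with HT + 8 Gracemont
CEZANNE_8C = [(1.25, 8, True)]                 # 8 Zen 3 cores with SMT

for n in range(1, 17):
    adl = throughput(n, ADL_2P8E)
    czn = throughput(n, CEZANNE_8C)
    print(f"{n:2d} threads: ADL 2+8 = {adl:5.2f}   Cezanne 8+0 = {czn:5.2f}")
```

Under these assumptions the hybrid leads at 1 to 3 threads and again at 10 to 13 threads, and trails from 4 to 9 threads and from 14 threads up. A different fill order or unequal clocks would move those crossover points.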

lobz · Mar 15, 2021

coercitiv said:
I know they don't have HT. As I said, you need the opposite ends of the spectrum for the hybrid to win. It's weaker between 4 and 8 threads, but picks up steam between 8 and 12 threads.

Time for "napkin graph". For the sake of convention consider ADL = 1.4x SKL, Zen3 = 1.25x SKL, Gracemont = 1x SKL, HT = +20% no matter the architecture. AFAIK ADL will have HT enabled on the big cores. Here's what performance would look like assuming clocks are the same on all cores.

View attachment 41101

After this you need to consider:

max clocks on Gracemont, lower max clocks may further exacerbate the loss between 4-8 threads

likely efficiency advantage from the small cores that may allow higher clocks on the hybrid ADL, pushing for a win in the entire 10-16 thread spectrum

This napkin math requires not just actual clock parity as you mentioned, but also 100% efficient and perfectly managed Windows scheduling between cores and threads and in-process tasks and such. Good luck with that! To Intel, I mean

Hulk · Mar 15, 2021

coercitiv said:
I know they don't have HT. As I said, you need the opposite ends of the spectrum for the hybrid to win. It's weaker between 4 and 8 threads, but picks up steam between 8 and 12 threads.

Time for "napkin graph". For the sake of convention consider ADL = 1.4x SKL, Zen3 = 1.25x SKL, Gracemont = 1x SKL, HT = +20% no matter the architecture. AFAIK ADL will have HT enabled on the big cores. Here's what performance would look like assuming clocks are the same on all cores.

View attachment 41101

After this you need to consider:

max clocks on Gracemont, lower max clocks may further exacerbate the loss between 4-8 threads

likely efficiency advantage from the small cores that may allow higher clocks on the hybrid ADL, pushing for a win in the entire 10-16 thread spectrum

What, if any, is the effect on overall performance for an application should some of the threads require a lesser amount of compute than other threads? Let me try to communicate my question clearly in a hypothetical simplified case.

Imagine an application that spawns three threads. Two are compute "light" and one is compute "heavy." They are dependent so they must be run more or less simultaneously.

Now imagine a CPU with two Big cores running these threads. One Big core executes the compute-heavy thread while the other Big core executes the two compute-light threads. Of course this Big core would be constantly switching context.

Now imagine a CPU with one Big core and two Little. You see where I'm going. Big core is assigned to heavy compute thread while two Little cores are each assigned one light compute thread and they have a happily running Big/Little family running this application.

So now for my question, worded as precisely as I can manage:

I realize that my hypothetical example may be quite far-fetched when it comes to reality, but could a Big/Little strategy be a better "fit" for some applications? Meaning, if an application has varied compute loads across threads, could the Big/Little cores be assigned optimally to reduce context switching and equal or beat the performance of a number of Big cores with greater theoretical total compute?
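
A toy model of the hypothetical above, as a rough sketch only. It reuses the napkin ratios from earlier in the thread (big core = 1.4, little core = 1.0); the work units and the `switch_overhead` fraction are made-up parameters, and thread dependencies and synchronization stalls are ignored.

```python
BIG, LITTLE = 1.4, 1.0  # relative per-core speed (napkin ratios from the thread)

def finish_time_two_big(heavy, light, switch_overhead):
    """2 Big cores: one runs the heavy thread, the other time-slices both light threads.

    switch_overhead is a hypothetical fractional penalty for the context switching.
    """
    heavy_time = heavy / BIG
    light_time = (2 * light) / BIG * (1 + switch_overhead)
    return max(heavy_time, light_time)

def finish_time_big_plus_two_little(heavy, light):
    """1 Big + 2 Little: every thread gets its own core, so no time-slicing."""
    return max(heavy / BIG, light / LITTLE)

# Light threads are substantial: the shared Big core becomes the bottleneck.
print(finish_time_two_big(10, 6, 0.05), finish_time_big_plus_two_little(10, 6))

# Light threads are tiny: the heavy thread dominates and both layouts tie.
print(finish_time_two_big(10, 2, 0.05), finish_time_big_plus_two_little(10, 2))
```

In this simplified picture the Big/Little layout only pulls ahead when the light threads together would keep a shared Big core busier than the heavy thread keeps its own core; otherwise the heavy thread sets the finish time either way.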

ondma · Mar 15, 2021

coercitiv said:
I know they don't have HT. As I said, you need the opposite ends of the spectrum for the hybrid to win. It's weaker between 4 and 8 threads, but picks up steam between 8 and 12 threads.

Time for "napkin graph". For the sake of convention consider ADL = 1.4x SKL, Zen3 = 1.25x SKL, Gracemont = 1x SKL, HT = +20% no matter the architecture. AFAIK ADL will have HT enabled on the big cores. Here's what performance would look like assuming clocks are the same on all cores.

View attachment 41101

After this you need to consider:

max clocks on Gracemont, lower max clocks may further exacerbate the loss between 4-8 threads

likely efficiency advantage from the small cores that may allow higher clocks on the hybrid ADL, pushing for a win in the entire 10-16 thread spectrum

Would not the graph be different for each AL configuration? If AL in fact has higher IPC than Zen, and similar clocks, then 8+0 (or 8+X) should be faster up to a certain number of threads (8+x, depending on hyperthreading efficiency), then dropping off rapidly as Zen increases in core count while AL adds only small cores.

Edit: again, the problem is Intel is late here. If Zen 4 has a good IPC increase, it should come close to AL IPC. In that case, AL has no real advantage on the desktop. At best, it will be similar to Zen4 in lightly threaded loads and drop off rapidly past 8 cores, as AL adds only small cores, while Zen continues to add big cores.

lobz · Mar 15, 2021

To anyone thinking of me as an AMD fanboy, let me be clear: I never thought Rocket Lake would be as good as Geekbench led many of you to believe, and I was probably right. BUT I think ADL, or rather Golden Cove, is a beast and it could have fended off every single competitive attempt from AMD, had it been released as originally planned.

I'm really skeptical about the viability of big.SMALL in a desktop environment, but it's up to Intel and Apple to prove me wrong in a practical sense (which means viability beyond benchmarks).

lobz · Mar 15, 2021

Hulk said:
What, if any, is the effect on overall performance for an application should some of the threads require a lesser amount of compute than other threads? Let me try to communicate my question clearly in a hypothetical simplified case.

Imagine an application that spawns three threads. Two are compute "light" and one is compute "heavy." They are dependent so they must be run more or less simultaneously.

Now imagine a CPU with two Big cores running these threads. One Big core executes the compute-heavy thread while the other Big core executes the two compute-light threads. Of course this Big core would be constantly switching context.

Now imagine a CPU with one Big core and two Little. You see where I'm going. Big core is assigned to heavy compute thread while two Little cores are each assigned one light compute thread and they have a happily running Big/Little family running this application.

So now for my question, worded as precisely as I can manage:

I realize that my hypothetical example may be quite far-fetched when it comes to reality, but could a Big/Little strategy be a better "fit" for some applications? Meaning, if an application has varied compute loads across threads, could the Big/Little cores be assigned optimally to reduce context switching and equal or beat the performance of a number of Big cores with greater theoretical total compute?

This again brings up the problem of scheduling. Apple has an easy time getting away with home runs despite the extreme complexity of scheduling, as they control their OS and their whole ecosystem. Ironically, this fact also excludes me as a potential customer, because in the past 2 decades, every single attempt I've made to 'like' or even 'get accustomed to' using Apple products has made me hate myself very quickly.

Intel doesn't have this luxury, and even if MS were EXTREMELY willing, the strengths of Windows simply lie elsewhere, and it will be a nightmare for both companies to work on that.

VirtualLarry · Mar 15, 2021

Yeah, I have my extreme doubts about big.LITTLE working out on the Windows side as well. We've seen how (un-)responsive MS has been about "little" scheduler changes for AMD CPUs (Bulldozer/Piledriver and, later, Zen's SMT). I expect that they will react a bit quicker to Intel's technology, but still, I question how well this technology will "integrate" with the Windows application ecosystem, especially games.

Maybe one thing that Windows 10 could do, is limit scheduling of "apps" (the new "Store" apps), to the "little" cores, and leave the "big" cores completely free, for system interrupts, or Win32/64 native x64 applications that might be compute-heavy (games).

Or maybe they can do things like the way "Optimus" does on laptops, with the GPU allocation, except this time, they would allocate either "big" cores or "little" cores, according to application listings / profiles.

lobz · Mar 15, 2021

Dayman1225 said:
ICL-SP is up to 40c now

Yes, and as 32C ICL seems to be actually competitive with 32C Rome (finally), 40C ICL should have no problem competing with 64C Rome. Or Milan. Well, I mean...

lobz · Mar 15, 2021

VirtualLarry said:
Maybe one thing that Windows 10 could do, is limit scheduling of "apps" (the new "Store" apps), to the "little" cores, and leave the "big" cores completely free, for system interrupts, or Win32/64 native x64 applications that might be compute-heavy (games).

Or maybe they can do things like the way "Optimus" does on laptops, with the GPU allocation, except this time, they would allocate either "big" cores or "little" cores, according to application listings / profiles.

Hah, not bad, not bad at all! I wonder if MS would be one of the only BIG-BIG companies where brilliantly simple ideas coming from way under (sorry for describing you as way under, but you don't strike me as a prime executive preoccupied with proving his worth at ANY cost) can convince decision makers to act on them! OK I admit, I don't actually wonder! 🤣

coercitiv · Mar 15, 2021

ondma said:
Would not the graph be different for each AL configuration? If AL in fact has higher IPC than Zen, and similar clocks, then 8+0 (or 8+X) should be faster up to a certain number of threads (8+x, depending on hyperthreading efficiency), then dropping off rapidly as Zen increases in core count while AL adds only small cores.

We were talking specifically ADL 2+8 vs. Cezanne 8+0 in the 15W TDP range. (power limited, clocks going down fast in MT loads)

lobz said:
This napkin math requires not just actual clock parity as you mentioned, but also 100% efficient and perfectly managed Windows scheduling between cores and threads and in-process tasks and such. Good luck with that! To Intel, I mean

I did mention earlier that it would surpass Cezanne's throughput in ideal conditions only. Running 10+ concurrent threads with high MT scaling is rather hard to do in typical consumer loads, and that's before we get to max clocks and the Windows scheduler.

Personally I have said it before: I'm not expecting much from Intel's hybrid approach, and the main reason I'm not openly against it is allowing their small core team to shine for a change. Long term having a strong "little" core may prove very good for them. (if only to correct any bad design choices made by the "big" core team)

Hulk · Mar 15, 2021

About that thread scheduling complexity for Big/Little...

If you have a certain number of Big/Little cores it *seems* (I'm not an expert, admittedly) like the goal is to run all the various CPUs with as little context switching as possible. If a core working on threads is switching like mad then it should be allocated less compute-intensive threads (edited: I incorrectly posted "more" initially, sorry for the confusion). If the scheduler has an idea of the compute available on each core, it should be able to optimize the core-to-thread assignments by finding the combination that results in the least context switching. Yeah, I know, easier said than done, but this seems like an ideal problem for machine learning, as your rig would "learn" the best way to operate based on your usage patterns.
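
To make "finding the combination that results in the least context switching" concrete, here is a brute-force sketch rather than anything learned. It assumes the scheduler already knows each thread's work and each core's speed, which a real scheduler would have to estimate; the cost function and all names are hypothetical.

```python
from itertools import product

def best_assignment(thread_work, core_speed):
    """Try every thread-to-core mapping and return the cheapest one.

    Cost is the tuple (shared_slots, finish_time): first minimize how many
    threads must share a core (a crude proxy for context switching), then
    minimize the time until the slowest core finishes.
    """
    cores = range(len(core_speed))

    def cost(assignment):
        load = [0.0] * len(core_speed)
        count = [0] * len(core_speed)
        for work, core in zip(thread_work, assignment):
            load[core] += work
            count[core] += 1
        shared_slots = sum(c - 1 for c in count if c > 1)
        finish_time = max(l / s for l, s in zip(load, core_speed))
        return (shared_slots, finish_time)

    return min(product(cores, repeat=len(thread_work)), key=cost)

# One heavy and two light threads on 1 Big (1.4) + 2 Little (1.0) cores.
assignment = best_assignment([8.0, 2.0, 2.0], [1.4, 1.0, 1.0])
print(assignment)  # element i is the core index chosen for thread i
```

The search is exponential in the number of threads, so it only illustrates the objective; a practical scheduler would need heuristics, or the learned policy suggested above.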
