Discussion Intel Meteor, Arrow, Lunar & Panther Lakes Discussion Threads

Page 579

Tigerick

Senior member
Apr 1, 2022
702
632
106






With Hot Chips 34 starting this week, Intel will unveil technical information about the upcoming Meteor Lake (MTL) and Arrow Lake (ARL), the next-generation platforms after Raptor Lake. Both MTL and ARL represent a new direction in which Intel moves to multiple chiplets combined into one SoC platform.

MTL also introduces a new compute tile based on the Intel 4 process, which uses EUV lithography, a first for Intel. Intel expects to ship the MTL mobile SoC in 2023.

ARL will come after MTL, so Intel should be shipping it in 2024, according to Intel's roadmap. The ARL compute tile will be manufactured on the Intel 20A process, Intel's first to use GAA transistors, which it calls RibbonFET.



Comparison of Intel's upcoming U-series CPUs: Core Ultra 100U, Lunar Lake and Panther Lake

| Model | Code Name | Date | TDP | Node | Tiles | Main Tile | CPU | LP E-Core | LLC | GPU | Xe-cores |
| Core Ultra 100U | Meteor Lake | Q4 2023 | 15 - 57 W | Intel 4 + N5 + N6 | 4 | tCPU | 2P + 8E | 2 | 12 MB | Intel Graphics | 4 |
| ? | Lunar Lake | Q4 2024 | 17 - 30 W | N3B + N6 | 2 | CPU + GPU & IMC | 4P + 4E | 0 | 12 MB | Arc | 8 |
| ? | Panther Lake | Q1 2026 ? | ? | Intel 18A + N3E | 3 | CPU + MC | 4P + 8E | 4 | ? | Arc | 12 |



Comparison of the die size of each tile of Meteor Lake, Arrow Lake, Lunar Lake and Panther Lake

|  | Meteor Lake | Arrow Lake (N3B) | Lunar Lake | Panther Lake |
| Platform | Mobile H/U only | Desktop & Mobile H/HX | Mobile U only | Mobile H |
| Process Node | Intel 4 | TSMC N3B | TSMC N3B | Intel 18A |
| Date | Q4 2023 | Desktop: Q4 2024, H/HX: Q1 2025 | Q4 2024 | Q1 2026 ? |
| Full Die | 6P + 8E | 8P + 16E | 4P + 4E | 4P + 8E |
| LLC | 24 MB | 36 MB ? | 12 MB | ? |
| tCPU (mm²) | 66.48 | | | |
| tGPU (mm²) | 44.45 | | | |
| SoC (mm²) | 96.77 | | | |
| IOE (mm²) | 44.45 | | | |
| Total (mm²) | 252.15 | | | |



Intel Core Ultra 100 - Meteor Lake



As mentioned by Tom's Hardware, TSMC will manufacture the I/O, SoC, and GPU tiles. That means Intel will manufacture only the CPU and Foveros tiles. (Notably, Intel calls the I/O tile an 'I/O Expander,' hence the IOE moniker.)



 

Attachments

  • PantherLake.png
  • LNL.png

MS_AT

Senior member
Jul 15, 2024
365
798
96
I think the fact that the combined gain is a meagre 9% is a testament to whatever they did not working, because if they had boosted FP in a general way like on Skymont, the FP portion would have done a lot better.

Not getting good gains on perf/clock is what I count as important. The P-core team has forever talked a lot about what's a bottleneck or whatever, but aside from Pentium M, Core 2, and Sandy Bridge, it was usually disappointing: 10% for Haswell, 10% for Skylake, in an era when it was far easier to get big gains from process. What the hell were they doing?

Skymont is capable of legacy and FMA execution on all 4 ports. Even Lion Cove only does FMA on 2 pipes while the other 2 are FP Add, so the gains won't be universal.
While it's commendable that Skymont can do FMA on all 4 ports, in reality it is done only to match P-core throughput. I am not sure what exactly you mean by legacy instructions, but x86-64 did not have SIMD FMA instructions before the FMA extension, which was introduced at the same time as AVX. Because of that, compilers will often assume FMA is present when you request AVX instruction sets, since IIRC at the time there was no CPU that supported FMA but not AVX. Why is this important? Because this happened over 10 years ago. So we either have software on the market that effectively supports FMA and AVX together, which means it will almost exclusively use the 256-bit version of FMA to match AVX, or we have SSE-only software that does not make use of FMA at all.

In the first case, Skymont will use all of its execution units just to keep up with Raptor Cove, but Raptor can also do an addition at the same time, and Lion Cove can follow up with yet another addition operation, while Skymont is fully occupied. In the second case, FMA doesn't matter.

If we consider scalar FP operations, then yes, Skymont will have an advantage over all P cores before Lion Cove, which will be able to match it on mixed instruction streams [in my experience it's hard to encounter FMA-only code], but due to different design goals it will lose to Lion Cove in absolute performance because of clock differences.
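To make the AVX/FMA pairing above concrete, here is a minimal sketch (my own illustration, not from the thread; the file name and build flags are just examples): a plain C loop that a compiler may contract into FMAs when FMA is enabled, next to the explicit 256-bit _mm256_fmadd_ps form that AVX-era SIMD code typically uses.

```c
/* fma_demo.c - illustration of the AVX/FMA pairing discussed above.
 * Example build: gcc -O3 -mavx2 -mfma fma_demo.c -o fma_demo
 * With FMA enabled the hot loops can use fused multiply-adds; on SSE-only
 * targets they fall back to separate multiplies and adds (no FMA at all).
 */
#include <immintrin.h>
#include <stdio.h>

/* Plain C dot product: the compiler may contract a*b + acc into an FMA
 * when the FMA target feature and FP contraction are allowed. */
static float dot_scalar(const float *a, const float *b, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Explicit 256-bit FMA intrinsics, the form most AVX-era SIMD code uses. */
static float dot_fma256(const float *a, const float *b, int n)
{
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);   /* acc = va * vb + acc */
    }
    float tmp[8], sum = 0.0f;
    _mm256_storeu_ps(tmp, acc);
    for (int i = 0; i < 8; i++)
        sum += tmp[i];
    return sum;
}

int main(void)
{
    float a[16], b[16];
    for (int i = 0; i < 16; i++) { a[i] = (float)i; b[i] = 2.0f; }
    printf("scalar: %.1f  fma256: %.1f\n",
           dot_scalar(a, b, 16), dot_fma256(a, b, 16));
    return 0;
}
```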
 

Hulk

Diamond Member
Oct 9, 1999
4,701
2,863
136
With Raptor Lake, Intel held a 24 vs 16 "Full Core" advantage which, in theory, should have put that processor ahead of Zen 5 in terms of multi-threaded performance. Instead, it took a beating:


Based on your well thought out calculations, where do you see Arrow Lake performing in these multi-thread heavy workloads/benchmarks relative to Zen 5 and Raptor Lake?
I don't think we're supposed to get into AMD vs. Intel here but I don't see Raptor Lake in that review you posted.
 

DavidC1

Golden Member
Dec 29, 2023
1,211
1,932
96
Okay so Lion Cove architecture is closer to Skymont than Raptor Cove? I'm not arguing the point, just trying to understand.
No way.

-Instruction length data stored in L1i $
-Clustered decode
-OD-ILD, which decodes instruction length on the fly

-Very wide Retire, which is not just an attempt to widen it for its own sake but is done because it also allows other structures to be shrunk. An ideological difference of carefully adding and taking out as needed for area/power efficiency, unlike the P cores.
-More Store ALUs than Load ALUs. My guess is this benefits the clustered decode.
-Fast-path hardware, which can be power hungry, versus Nanocode, which adds a dedicated microcode ROM to each decode cluster. A microcoded instruction won't suddenly become 10x faster, but now it won't block the other decoders and it can be parallelized. If you were OK with having an instruction that's dog slow, you don't suddenly need it to be 10x faster than everything else. This is also a careful balance, unlike the brute-force approach of the P cores.
-The E cores go for many simple units, each made for a specific task, over a few powerful units that can do more, as on the P cores.

Can anyone tell me which part of Lion Cove, or any P core since Sandy Bridge for that matter, was anything more than just add, add, add? Bigger units, more units, that has been the case since Haswell. Even Sandy Bridge had a core 50% larger than its predecessor. The saving grace for that chip was that there were innovations and it wasn't just pure expansion.

The E core team also doesn't shy away from changing things drastically. From Goldmont to Tremont, it had the L2 predecode cache. In Gracemont they took it out entirely and replaced it with the OD-ILD.

Since the L2 predecode cache was a rather large 128KB SRAM, taking it out was an efficient choice, and the OD-ILD isn't limited by a low hit rate on large data sets, meaning it performs better. It's quite amazing to me that they took out an entire feature and added a new one while delivering a substantial performance gain overall.
 

Hulk

Diamond Member
Oct 9, 1999
4,701
2,863
136
Can anyone tell me which part of Lion Cove, or any P core since Sandy Bridge for that matter, was anything more than just add, add, add? Bigger units, more units, that has been the case since Haswell. Even Sandy Bridge had a core 50% larger than its predecessor. The saving grace for that chip was that there were innovations and it wasn't just pure expansion.
This is what I had thought. It's been wider on the front, then wider on the back, a little "smarter" here and there.
But I know most around here are very keen on this stuff, so I ask a lot of questions but I don't push back.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
26,389
15,513
136
This is what I had thought. It's been wider on the front, then wider on the back, a little "smarter" here and there.
But I know most around here are very keen on this stuff, so I ask a lot of questions but I don't push back.
All I have to add is that until it's released and benchmarked, we will not really know how good it is.
 
Reactions: Tlh97 and Hulk

DavidC1

Golden Member
Dec 29, 2023
1,211
1,932
96
The main difference in Skymont is clustered decoding without UOP cache. But this solution was used in Skymont not because it was better, but because it saved logic and complexity.
This is false. It saves a considerable amount of power and area in the decode section, which can be used to boost things elsewhere. Plus,

This is from the Intel x86 optimization manual:
This overall approach to x86 instruction decoding provides a clear path forward to very wide designs without needing to cache post-decoded instructions.
For Gracemont and Skymont there is a load balancer, so you don't need to add branch instructions: if a branch doesn't occur for a long time, the hardware inserts fake branches so the decode clusters can continue to work in parallel.

The description of Skymont's decoder says it is even capable of handling loops, so it doesn't need a loop buffer. Since x86 code encounters a branch roughly every 6 instructions, and it is easier to keep a 3-wide decoder filled than a wider one, it will often end up being better than the traditional approach.
 

DavidC1

Golden Member
Dec 29, 2023
1,211
1,932
96
For those less knowledgeable, what is an OD-ILD?
It is a feature introduced on Gracemont. It stands for On-Demand Instruction Length Decoder. It does exactly what it sounds like: it determines instruction lengths on demand, since x86 instructions are variable length.

Here's a comment on RWT about Clustered Decode, and why it's not just a "cheap" feature:
Note 1: The above is missing the main point of having _multiple_ N-wide decoders: the average basic block (=BB) size is approximately 5 x86 instructions (4 linear instructions + 1 BB-terminating branch instruction) and if we assume that 50% of those terminating branches are at run-time resolved as TAKEN branches then it implies that _all_ x86 CPUs in the near future will be _required_ to have _multiple_ N-wide decoders.
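To spell out the arithmetic in that comment, here is a tiny back-of-envelope sketch of my own (the 5-instruction basic block and the 50% taken rate come from the quote; the decoder widths are just illustrative): it computes the expected straight-line run between taken branches and compares a single fetch window with two 3-wide clusters working on different basic blocks.

```c
/* decode_model.c - back-of-envelope model for the basic-block numbers in
 * the RWT comment above. Purely illustrative; not a model of any real
 * front end.
 */
#include <stdio.h>

int main(void)
{
    const double bb_len  = 5.0;  /* avg. instructions per basic block    */
    const double p_taken = 0.5;  /* fraction of BB-ending branches taken */

    /* A fetch run ends at the first taken branch, so it spans a geometric
     * number of basic blocks with mean 1/p_taken. */
    double bbs_per_run   = 1.0 / p_taken;          /* = 2 basic blocks   */
    double insts_per_run = bb_len * bbs_per_run;   /* = 10 instructions  */

    printf("expected instructions between taken branches: %.1f\n",
           insts_per_run);

    /* A decoder that can only follow one sequential run per cycle is hard
     * to keep busy much beyond that; two 3-wide clusters each decoding a
     * different run can keep 6 slots per cycle occupied without a single
     * ultra-wide decoder. */
    printf("two 3-wide clusters: %.0f decode slots per cycle\n", 2 * 3.0);
    return 0;
}
```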
 

511

Golden Member
Jul 12, 2024
1,038
896
106
Absolutely.

I do admit that I had forgotten that Netburst had only a 1-wide decoder. Pentium III and "Core 2" were absolutely better designs.

Reading through the litany of architectural changes in Lion Cove, it certainly appears that this core should be a rocket ship ..... but as with many designs (and I have quite a few under my belt), "It sure looked good on the whiteboard". Actually, I am of the belief that Lion Cove (and its sister designs) will grow into a very successful core design for Intel ...... in a couple of years. Unlike the disasters that were Netburst and Bulldozer, I don't see anything fundamentally miscalculated here, only a need for both process and design optimization.

Unfortunately, these things take time. Based on the information that we have at this time, Intel will not be "back on top" again until 18A and some design tweaks come about (a couple of years I think).

For those that think me "Anti-Intel", I am absolutely not. No sane person in the world would wish for anything other than strong competition in the market. Furthermore, and on a more personal note, I happen to be a US vet and a long time CPU architecture buff. I want a strong US IP for my country, and Intel is it.

Opinion: Intel needs to fire a bunch of business majors and focus on their product strategy (vs figuring out how to better leverage their monopoly position to maximize their profit). It is my opinion that Intel stagnated under a bunch of tight neck ties and desperately needs an engineering kick in the a$$. In their engineering lethargy, Intel have allowed TSMC to flank them severely. AMD simply hit a great combination of design and lithography advances available and executed on it. In theory, having a vertically integrated design and foundry process SHOULD have allowed Intel to dominate the industry indefinitely. It is only in Intel's horrendous lack of forward vision that AMD and TSMC have unseated them. I believe Pat G can put the company back on track ........ if he gets enough time. It's really hard to work your way through an army of pencil necks. [/end rant]
Definitely. The fab leadership was lost because some idiots didn't finance it enough, and it dragged design down with it because of how interdependent they were. Pat G fixed the nodes, something previous CEOs were incapable of.
 
Reactions: OneEng2

reb0rn

Senior member
Dec 31, 2009
240
73
101
The first rule of fight club/CPU forum is attack the post not the poster. No personal attacks.
Let's just summarize by saying that Granite Rapids is in serious trouble.

@OneEng2

@Hulk was joking about the unreleased/unfinished Tejas which was the Netburst CPU that was supposed to reach 10 GHz.
Many of you here just spread FUD about AMD. No one cares about some stupid half-baked benchmark done by Tom's about an AMD server CPU here in an Intel topic, nor does anyone care about people spreading hype over FUD, like someone who looks at one core's speed on a 192-core server and hypes its single-thread speed.
 

OneEng2

Senior member
Sep 19, 2022
259
358
106
Let's just summarize by saying that Granite Rapids is in serious trouble.

@OneEng2

@Hulk was joking about the unreleased/unfinished Tejas which was the Netburst CPU that was supposed to reach 10 GHz.
OMG! I totally missed this most excellent jest! I was somehow thinking that the mighty @Hulk had some LN-cooled Intel super rig back in the day. You are so right, that was quite a miss. Turns out transistors make a lot of heat when you switch them that fast. Who knew.

@Hulk: Please forgive my stupidity. I won't underestimate you again.
I don't think we're supposed to get into AMD vs. Intel here but I don't see Raptor Lake in that review you posted.
... and you are again quite correct. I missed that as well (I just turned 59 this week, so I'm going to chalk it up to my newly found "old age").
Many of you here just spread FUD about AMD. No one cares about some stupid half-baked benchmark done by Tom's about an AMD server CPU here in an Intel topic, nor does anyone care about people spreading hype over FUD, like someone who looks at one core's speed on a 192-core server and hypes its single-thread speed.

How about the Phoronix reviews involving Granite Rapids? There are at least two of them, one from when it launched (versus previous-gen server parts) and one more showcasing Granite Rapids versus the current competition. I haven't even looked at the Tom's review yet.
@reb0rn
Would love to see benchmarks that depict GNR in a good light against Turin if you have them. It would certainly be good news for Intel if this were the case.
 

Hulk

Diamond Member
Oct 9, 1999
4,701
2,863
136
OMG! I totally missed this most excellent jest! I was somehow thinking that the mighty @Hulk had some LN-cooled Intel super rig back in the day. You are so right, that was quite a miss. Turns out transistors make a lot of heat when you switch them that fast. Who knew.

@Hulk: Please forgive my stupidity. I won't underestimate you again.

... and you are again quite correct. I missed that as well (I just turned 59 this week, so I'm going to chalk it up to my newly found "old age").



@reb0rn
Would love to see benchmarks that depict GNR in a good light against Turin if you have them. It would certainly be good news for Intel if this were the case.
@OneEng2,
No worries! I didn't respond because I was pretty sure you didn't scroll down and see the punchline.
 

DavidC1

Golden Member
Dec 29, 2023
1,211
1,932
96
We all know the old line about the broken clock being right twice a day or the blind squirrel sometimes stumbling on a nut. Not MLID.
He's expecting 30-40% ST and 15-20% ST for Pantherlake desktop over Arrowlake.

I don't think Pantherlake desktop even exists? And Pantherlake is supposed to be basically an 18A shrink?

I admit he got a few things right, but those numbers are purely made up.
 

sgs_x86

Junior Member
Dec 20, 2020
13
26
61
Did some further CB R24 MT testing. As you add P cores strange things happen. I ran these tests at 5GHz P, 4GHz E, just to make sure there is no throttling or other "at the limit" behavior.

Anyway, assuming Raptor Cove scores 22.6 points/GHz here are the scores as you add P cores to 16 E cores during the render.
1P+16E - 15.2 points/GHz for E's
2P+16E - 15.4
4P+16E - 14.7
6P+16E - 14.0
8P+16E - 13.1

Other than the increase from 1 to 2 P's, the IPC of the E's decreases as P's are added. Anybody have any reasoning for this behavior?

It is of course possible that the IPC of the P's is also (or only) changing, but I have found P IPC to be relatively stable when testing various numbers of P's.

It's a rabbit hole not worth spending too much time on, but I have a hard time leaving it alone...
I could be horribly wrong, so please correct me. When 1 P-core is enabled, the E-cores get to use most of the shared L3 cache. As more P-cores are enabled, the E-cores' share of the L3 shrinks, and so does their IPC. We know that E-cores love cache.
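A crude way to sanity-check that hypothesis (my own sketch; it assumes the 36 MB shared L3 of the 8P+16E Raptor Lake die and purely proportional capacity sharing by active thread, which real LLC allocation certainly is not):

```c
/* l3_share.c - naive capacity-sharing model for the hypothesis above.
 * Assumptions: 36 MB shared L3 (8P+16E Raptor Lake die), HT-enabled P
 * cores, and every active thread competing equally for L3 capacity.
 */
#include <stdio.h>

int main(void)
{
    const double l3_mb   = 36.0;
    const int    e_cores = 16;
    const int    p_counts[] = { 1, 2, 4, 6, 8 };

    for (int i = 0; i < 5; i++) {
        int p = p_counts[i];
        int threads = 2 * p + e_cores;              /* P threads (HT) + E threads */
        double e_share = l3_mb * e_cores / threads; /* E cores' slice of the L3   */
        printf("%dP+16E: ~%4.1f MB of L3 effectively backing the E cores\n",
               p, e_share);
    }
    return 0;
}
```

Even this naive split drops from roughly 32 MB at 1P+16E to 18 MB at 8P+16E, which at least points in the same direction as the measured points/GHz; the small bump from 1P to 2P obviously isn't captured by anything this crude.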
 
Reactions: Tlh97 and Hulk

Josh128

Senior member
Oct 14, 2022
511
865
106
Honestly, other than the perf uplifts, he got the launch timeframe right. Was he right about the iGPU? As for the 8+32 thing, it doesn't look like that will happen, but the fact that Intel is using "285K" as its initial top SKU seems to indicate that a "295K" is either planned or was planned. A completely new and larger die just for a halo SKU with more MT seems like a giant waste of engineering resources and silicon, though, if the 285K already mostly beats the 9950X in MT.

So there actually may have been something to that at some point, or still could be.
 

511

Golden Member
Jul 12, 2024
1,038
896
106
Honestly, other than the perf uplifts, he got the launch timeframe right. Was he right about the iGPU? As for the 8+32 thing, it doesn't look like that will happen, but the fact that Intel is using "285K" as its initial top SKU seems to indicate that a "295K" is either planned or was planned. A completely new and larger die just for a halo SKU with more MT seems like a giant waste of engineering resources and silicon, though, if the 285K already mostly beats the 9950X in MT.
Agreed, but missing an estimate by a few percent and making ridiculous claims are two different things.
 
Reactions: lightmanek

Hulk

Diamond Member
Oct 9, 1999
4,701
2,863
136
I think it is just one of the core negatives of the ringbus. The more stops it has, the worse it performs.
Could be. But the P's perform about the same as you add them; they have a much larger L2 per core, and that may be enough for CB, while the E's might need more and have to go out to the ring more often.

You can make a CB "calculator" but you need a few versions to take into account the varying performance of the E's depending on the P+E configuration.
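As a starting point for such a calculator, here is a toy sketch of my own (not Hulk's spreadsheet) that plugs in the points/GHz figures posted earlier in this thread and assumes linear scaling with clock; any untested P-count just falls back to a guessed value.

```c
/* cb_calc.c - toy Cinebench R24 MT "calculator" along the lines described
 * above, using the points/GHz figures from this thread (22.6 for Raptor
 * Cove; E-core efficiency depends on how many P cores are active). HT is
 * folded into the per-P-core figure; real scaling is messier than this.
 */
#include <stdio.h>

/* E-core points/GHz measured with 16 E cores and n P cores active */
static double e_pts_per_ghz(int p_cores)
{
    switch (p_cores) {
    case 1:  return 15.2;
    case 2:  return 15.4;
    case 4:  return 14.7;
    case 6:  return 14.0;
    case 8:  return 13.1;
    default: return 14.0;   /* guessed fallback for untested configs */
    }
}

static double estimate_mt_score(int p, double p_ghz, int e, double e_ghz)
{
    const double p_pts_per_ghz = 22.6;   /* Raptor Cove, per the thread */
    return p * p_pts_per_ghz * p_ghz + e * e_pts_per_ghz(p) * e_ghz;
}

int main(void)
{
    /* The configuration tested above: 8P @ 5.0 GHz + 16E @ 4.0 GHz */
    printf("8P+16E estimate: %.0f pts\n", estimate_mt_score(8, 5.0, 16, 4.0));
    printf("6P+16E estimate: %.0f pts\n", estimate_mt_score(6, 5.0, 16, 4.0));
    return 0;
}
```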
 