Discussion Intel Meteor, Arrow, Lunar & Panther Lakes Discussion Threads


Tigerick

Senior member
Apr 1, 2022
686
576
106

As Hot Chips 34 starts this week, Intel will unveil technical details of the upcoming Meteor Lake (MTL) and Arrow Lake (ARL), the next-generation platforms after Raptor Lake. Both MTL and ARL represent a new direction for Intel: multiple chiplets combined into one SoC platform.

MTL also introduces a new compute tile built on the Intel 4 process, Intel's first node to use EUV lithography. Intel expects to ship MTL mobile SoCs in 2023.

ARL will come after MTL, so Intel should be shipping it in 2024; that is what Intel's roadmap tells us. The ARL compute tile will be manufactured on the Intel 20A process, Intel's first to use GAA transistors, which it calls RibbonFET.



Comparison of Intel's upcoming U-series CPUs: Core Ultra 100U, Lunar Lake and Panther Lake

Model | Code Name | Date | TDP | Node | Tiles | Main Tile | CPU | LP E-Cores | LLC | GPU | Xe-cores
Core Ultra 100U | Meteor Lake | Q4 2023 | 15-57 W | Intel 4 + N5 + N6 | 4 | tCPU | 2P + 8E | 2 | 12 MB | Intel Graphics | 4
? | Lunar Lake | Q4 2024 | 17-30 W | N3B + N6 | 2 | CPU + GPU & IMC | 4P + 4E | 0 | 8 MB | Arc | 8
? | Panther Lake | Q1 2026 ? | ? | Intel 18A + N3E | 3 | CPU + MC | 4P + 8E | 4 | ? | Arc | 12



Comparison of die sizes of each tile of Meteor Lake, Arrow Lake, Lunar Lake and Panther Lake

 | Meteor Lake | Arrow Lake (20A) | Arrow Lake (N3B) | Arrow Lake Refresh (N3B) | Lunar Lake | Panther Lake
Platform | Mobile H/U only | Desktop only | Desktop & Mobile H/HX | Desktop only | Mobile U only | Mobile H
Process Node | Intel 4 | Intel 20A | TSMC N3B | TSMC N3B | TSMC N3B | Intel 18A
Date | Q4 2023 | Q1 2025 ? | Desktop Q4 2024, H/HX Q1 2025 | Q4 2025 ? | Q4 2024 | Q1 2026 ?
Full Die | 6P + 8E | 6P + 8E ? | 8P + 16E | 8P + 32E | 4P + 4E | 4P + 8E
LLC | 24 MB | 24 MB ? | 36 MB ? | ? | 8 MB | ?
tCPU (mm²) | 66.48 | | | | |
tGPU (mm²) | 44.45 | | | | |
SoC (mm²) | 96.77 | | | | |
IOE (mm²) | 44.45 | | | | |
Total (mm²) | 252.15 | | | | |



Intel Core Ultra 100 - Meteor Lake



As reported by Tom's Hardware, TSMC will manufacture the I/O, SoC, and GPU tiles. That means Intel will manufacture only the CPU tile and the Foveros base tile. (Notably, Intel calls the I/O tile an 'I/O Expander,' hence the IOE moniker.)


TwistedAndy

Member
May 23, 2024
98
68
46
Apple is boring because their cores are already good, to the point where Apple’s P-core is more efficient and more powerful clock for clock than Skymont and Lion Cove.

Apple M1 was really good, but if we compare M4 to M1, the difference is only about 8% at the same clock in Geekbench 5 (149.3% * 3.19 / 4.4 - 100% = 8.2%, link to results). If we use SPEC 2017, the IPC difference is about the same (~11%).
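To make the clock normalization explicit, here's the same arithmetic as a quick Python sketch (the score ratio and clocks are the figures quoted above):

```python
# Clock-normalized (per-GHz) comparison of M4 vs M1 in Geekbench 5,
# using the figures quoted above.
score_ratio = 1.493      # M4 score / M1 score
m1_clock_ghz = 3.19
m4_clock_ghz = 4.40

ipc_gain = score_ratio * m1_clock_ghz / m4_clock_ghz - 1.0
print(f"Per-clock gain of M4 over M1: {ipc_gain:.1%}")   # -> 8.2%
```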

If we consider raw performance and power consumption, the P-core in M4 consumes more than twice as much power as the one in M1 in SPEC 2017 (7.21 W vs 3.43 W for INT and 8.95 W vs 3.92 W for FP) while being 50% faster. That's a pretty good tradeoff, but we are seeing a huge power consumption increase even on the newer node (N3E vs N5). In the case of M4 Max, Apple has to throttle clocks to M3 Max levels to fit into the package power budget.

As for power efficiency, it's the most misleading metric out there because the performance/power curve is not linear. There's a point of maximum efficiency, but device manufacturers often push the power limits higher to achieve a slight performance increase. Technically, if Apple decided to run M4 at 4 GHz instead of 4.4 GHz, the power consumption would nearly match M3 (5-6 W), but the performance would be only 7-8% higher in SPEC 2017.
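As a toy illustration of that curve: assume performance scales linearly with clock and dynamic power roughly as f * V^2 with voltage tracking frequency near the top, i.e. P ~ f^3. The 9 W anchor approximates the M4 SPECfp figure above; everything else is illustrative, not measured:

```python
# Toy DVFS model: perf ~ f, dynamic power ~ f * V^2 with V ~ f near
# the top of the curve, so P ~ f^3. The 9 W anchor approximates the
# M4 SPECfp figure above; the rest is illustrative, not measured.
def power_w(f_ghz, f_ref=4.4, p_ref=9.0):
    return p_ref * (f_ghz / f_ref) ** 3

for f in (4.4, 4.0, 3.5):
    perf = f / 4.4                    # performance relative to 4.4 GHz
    p = power_w(f)
    print(f"{f:.1f} GHz: {perf:6.1%} perf, {p:4.1f} W, {perf / p:.3f} perf/W")
```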

As for Skymont, the "sweet spot" will be around 3-5 W at 4.0-4.6 GHz. I don't think it makes sense for Intel to push it even higher, but who knows?

Skymont is great because Crestmont sucked and it can no longer be considered an Atom core. We have yet to see how Lunar Lake performs in benchmarks and applications.

Crestmont is a further improvement to Gracemont, which was a huge step compared to Tremont. Skymont is another huge step forward.

The P-core team needs to take more ideas from Stevens's E-core team... to make the P-core even more area efficient, like Skymont.

Actually, they do take some ideas from the E-cores (split INT/FP scheduler, wider retire, etc.), but it's very hard to refactor a large core. Also, Intel may decide to test some ideas on E-cores first before bringing them to P-cores.

Nothingness

Platinum Member
Jul 3, 2013
2,717
1,347
136
From an architectural standpoint, Skymont was built as a P-core and is not so different from Apple's P-cores or ARM Cortex-X2.

It can decode 9 instructions per cycle using three non-blocking decoding clusters. If there's a complex instruction requiring microcode reading, the other decoding clusters are not blocked. Intel calls it "nanocode," and on paper, it looks nice.
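To see what that non-blocking property buys, here's a toy cycle model of three 3-wide decode clusters, where a microcoded op stalls only its own cluster. The cluster count and width match the description above; the 4-cycle microcode stall and the instruction mix are invented numbers, purely for illustration:

```python
# Toy model of clustered decode: three 3-wide clusters. A microcoded
# op stalls only its own cluster (non-blocking), versus a hypothetical
# front end where it stalls everything. The 4-cycle stall and the
# instruction mix are invented numbers.
from collections import deque

def decode_cycles(stream, clusters=3, width=3, stall=4, blocking=False):
    q, busy, cycles = deque(stream), [0] * clusters, 0
    while q:
        cycles += 1
        if blocking and any(busy):            # whole front end waits
            busy = [max(0, b - 1) for b in busy]
            continue
        for c in range(clusters):
            if busy[c]:                       # only this cluster waits
                busy[c] -= 1
                continue
            for _ in range(width):            # up to 3 decodes per cluster
                if not q:
                    break
                if q.popleft() == "ucode":    # microcoded instruction
                    busy[c] = stall
                    break
    return cycles

stream = (["simple"] * 19 + ["ucode"]) * 5    # 100 ops, 5 microcoded
print("non-blocking clusters:", decode_cycles(stream), "cycles")
print("blocking front end:   ", decode_cycles(stream, blocking=True), "cycles")
```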

Apple P-cores can decode from 8 (M1) to 10 (M4) instructions per cycle.

The backend was built to handle 8 micro-ops per cycle. It's the same amount as P-cores in Apple M1 and M2.

Also, Intel decided to increase the retirement capability to 16 micro-ops per cycle in Skymont. It allowed Intel to use various buffers, queues, and register files more efficiently and avoid increasing their size too much. For comparison, the P-cores in Apple M4 can retire 10 micro-ops per cycle.
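One way to see why a fast retire stage lets the buffers stay small is Little's law: the entries a structure needs are roughly the sustained micro-ops per cycle times the cycles each op stays resident. A sketch with invented numbers:

```python
# Little's law applied to backend buffers: required entries
# ~= throughput (uops/cycle) * residence time (cycles).
# All numbers are invented, purely to show the direction of the effect.
def entries_needed(uops_per_cycle, resident_cycles):
    return uops_per_cycle * resident_cycles

# If a beefier retire stage trims, say, 5 cycles from how long a
# completed op lingers in the ROB, the same 8-wide backend sustains
# the same throughput with a smaller ROB:
print(entries_needed(8, 40))   # 320 entries
print(entries_needed(8, 35))   # 280 entries with faster retirement
```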

Another interesting part is the number of execution ports. Skymont has 26 of them, including 8 ALU ports, 4 128-bit FP ports, 3 load / 4 store AGUs, 3 load ports, and 2 store ports.

For comparison, the P-core of Apple M4 allegedly has 8 ALU ports, 4 FP ports, 1 port for FMA, 3 load / 2 store AGUs, 3 load ports, and 2 store ports.

Yes, we can't directly compare architectures just by the width of the decoder, execution width, buffer sizes, and the number of ports, but it can give us a rough picture of the capabilities.

If all the other parts of Skymont are balanced well, we could get iso-clock performance similar to the P-cores in Apple M2.
Let's see when Skymont is out. As you wrote, width isn't the whole story, and many "details" could make the width useless. But yeah, from what we know it looks solid.

FWIW as far as I know Apple P cores can issue 4 128-bit operations per cycle including FMA.

TwistedAndy

Member
May 23, 2024
98
68
46
FWIW as far as I know Apple P cores can issue 4 128-bit operations per cycle including FMA.

Yes, M-series chips have four 128-bit FP ports, but I'm not sure if FMA is supported on all of them or just one.

That's the same number as we have in Skymont (with FMA support on all of them).

As for Lion Cove, it also has four FP ports, but they are wider (256-bit).

Nothingness

Platinum Member
Jul 3, 2013
2,717
1,347
136
Yes, M-series chips have four 128-bit FP ports, but I'm not sure if FMA is supported on all of them or just one.
Proof here that there are 4 FMA: https://scalable.uni-jena.de/opt/sme/micro.html
M4 runs at 111 GFLOPS FP32; that is close to the theoretical 4.4 GHz * 2 ops per FMA * 4 FP32 lanes * 4 FMA pipes ≈ 141 GFLOPS.

As for Lion Cove, it also has four FP ports, but they are wider (256-bit).
But only two of them do FMA, so that's the same FMA bandwidth as M1 and later.
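Both claims are easy to sanity-check with back-of-the-envelope peak-FLOPS arithmetic (the 111 GFLOPS measurement is from the link above; the pipe widths and counts are as discussed):

```python
# Peak FP32 FMA rate = clock * FMA pipes * FP32 lanes per pipe * 2 FLOPs.
def peak_gflops(clock_ghz, fma_pipes, pipe_bits):
    return clock_ghz * fma_pipes * (pipe_bits // 32) * 2

m4 = peak_gflops(4.4, fma_pipes=4, pipe_bits=128)
print(f"M4 theoretical: {m4:.1f} GFLOPS")          # 140.8
print(f"Measured 111 GFLOPS = {111 / m4:.0%} of peak")

# Per-cycle FMA bandwidth: 2 x 256-bit (Lion Cove) = 4 x 128-bit (M1+).
print(peak_gflops(1, 2, 256), "==", peak_gflops(1, 4, 128), "FP32 FLOPs/cycle")
```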

SiliconFly

Golden Member
Mar 10, 2023
1,179
604
96
intel4 - - - - - n5
intel3 - - - - - n4
intel20A - - - n3
intel18A - - - n2

Actually, 20A is expected to be slightly better than N3, and 18A slightly better than N2.

Meteor Lake = RTX 2050
Lunar Lake = RTX 3050
Panther Lake = RTX 4050?
Panther Lake = RTX 4050 sounds a bit too high. Maybe around a 3050 Ti or a 3060 is my guess.

SiliconFly

Golden Member
Mar 10, 2023
1,179
604
96
The P-core team needs to take more ideas from Stevens's E-core team... to make the P-core even more area efficient, like Skymont.
Well, looking at the kind of marginal progress the P-core team is making gen over gen, I think it has reached some sort of evolutionary dead end. There isn't much of a future for the P-cores if they don't deliver something awesome soon.

Crestmont is better than Bergamo. So you are wrong lol
Crestmont? Oh no. It's definitely not bad. But not awesome in any sense.

SiliconFly

Golden Member
Mar 10, 2023
1,179
604
96
MTL doesn't come even close to the RTX 2050 in performance. It's between a GTX 1050 and a GTX 1650 Mobile (very game dependent).

Panther Lake will have 50% more resources, will be based on the Celestial uArch, and will be clocked at <=3 GHz. So I expect another 2x jump and performance around a GTX 1660 / RTX 2060 Mobile.
With ray tracing & XeSS on, I think it provides better frame rates than a GTX 1650 Mobile, which doesn't support hardware ray tracing. XeSS/DLSS/FSR should be factored into comparisons whenever possible, as it's the norm now.

P-cores will probably become great again with rentable units, but we'll see.
Assuming Rentable Units is happening and will be part of a future Intel product, I don't think it'll be part of the P-core. When multiple P-cores are running in parallel to execute a single RU task, it'll end up using way too much power. E-cores fit best.

Magio

Junior Member
May 13, 2024
8
9
36
Actually, 20A is expected to be slightly better than N3, and 18A slightly better than N2.

It depends on which aspect of the process we're looking at. Both 20A and 18A have GAAFET and backside power delivery, while only N2 has GAAFET and neither N3 nor N2 has BSPD; but transistor density is expected to remain better on the TSMC side.

Backside power and GAAFET are big innovations, but it would be reasonable to expect the first processes and chipmakers shipping products with them not to fully exploit their potential right out of the gate, so TSMC's (likely) superior density on N2 might be worth more than Intel's BSPD on 18A, for example.

SiliconFly

Golden Member
Mar 10, 2023
1,179
604
96
It depends on which aspect of the process we're looking at. Both 20A and 18A have GAAFET and backside power delivery, while only N2 has GAAFET and neither N3 nor N2 has BSPD; but transistor density is expected to remain better on the TSMC side.

Backside power and GAAFET are big innovations, but it would be reasonable to expect the first processes and chipmakers shipping products with them not to fully exploit their potential right out of the gate, so TSMC's (likely) superior density on N2 might be worth more than Intel's BSPD on 18A, for example.
Actually, BSPD gives Intel's GAAFETs more freedom to achieve better performance than the competition. Pat was singing songs about the "beauty" of Intel RibbonFET's performance against the competition on more than one occasion (the Mona Lisa thing). And the density difference is also not that large for equivalent libraries. 20A & 18A will make a massive difference if they can hit some decent volume.

FlameTail

Diamond Member
Dec 15, 2021
3,122
1,786
106
Panther Lake = RTX 4050 sounds a bit too high. Maybe around a 3050 Ti or a 3060 is my guess.
Why are iGPUs so pathetic? (iGPUs from all vendors, not only Intel.) In my humble opinion, the iGPU of the current generation should match the RTX xx50 of the previous generation. Is that too much to ask?

TwistedAndy

Member
May 23, 2024
98
68
46
Why are iGPUs so pathetic? (iGPUs from all vendors, not only Intel.) In my humble opinion, the iGPU of the current generation should match the RTX xx50 of the previous generation. Is that too much to ask?

There are many reasons. The most obvious one is the memory bandwidth.

A CPU by itself does not require huge memory bandwidth; it mostly just needs low latency.

For example, even in the Apple M3 Max, a single P-core can utilize ~120 GB/s, and all the cores together can use nearly 240 GB/s. It's a limitation of the fabric, and that's fine.

On the other hand, GPUs are not so sensitive to latency, but they need a lot of bandwidth. When we want to combine a CPU with a powerful GPU, we need to increase the number of memory channels. Even Nvidia RTX xx50 performance requires at least a 256-bit memory bus, which does not make sense for most desktops and laptops.

For example, Strix Halo, with a pretty promising GPU, is expected to have a 256-bit bus with soldered memory.
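The bandwidth math is simple enough to sketch (rounded transfer rates; the GDDR6 line is a typical xx50-class configuration, not a specific SKU):

```python
# Peak DRAM bandwidth in GB/s = (bus width in bits / 8) * MT/s / 1000.
def gb_per_s(bus_bits, mt_per_s):
    return bus_bits / 8 * mt_per_s / 1000

print(f"LPDDR5X-8533, 128-bit:  {gb_per_s(128, 8533):.1f} GB/s")   # thin-laptop iGPU
print(f"LPDDR5X-8533, 256-bit:  {gb_per_s(256, 8533):.1f} GB/s")   # Strix Halo class
print(f"GDDR6 18 GT/s, 128-bit: {gb_per_s(128, 18000):.1f} GB/s")  # xx50-class card
```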

Another problem is cost and flexibility. There are not many customers willing to pay much more to have an xx50-class GPU on their CPU, because they plan to use something more powerful (xx70, xx80, or xx90) from another vendor.

SiliconFly

Golden Member
Mar 10, 2023
1,179
604
96
Why are iGPUs so pathetic? (iGPUs from all vendors, not only Intel.) In my humble opinion, the iGPU of the current generation should match the RTX xx50 of the previous generation. Is that too much to ask?
Everyone would love that. But I think it makes the die super expensive, & only a few would actually use it fully.

SiliconFly

Golden Member
Mar 10, 2023
1,179
604
96
... from another vendor.
In mobile clients at least, I don't think people would really care that much about other vendors as long as the iGPU or tGPU is performant and feature-complete. The problem with today's iGPUs is that they're all just too weak. Hope this gets sorted out in the future.

Ghostsonplanets

Senior member
Mar 1, 2024
529
926
96
Why are iGPUs so pathetic? (iGPUs from all vendors, not only Intel.) In my humble opinion, the iGPU of the current generation should match the RTX xx50 of the previous generation. Is that too much to ask?
Yes? You want a <200 mm² SoC to match a dedicated ~140 mm² graphics die fabbed on a modern node with a modern uArch? One that is also very efficient, achieves high clocks, and is fed by dedicated memory with high amounts of bandwidth?

The SoC iGPU needs to share a limited power budget, memory, and bandwidth with the other parts of the SoC. The fact that we have such big and reasonably fast iGPUs is already quite a feat.

To match a modern dedicated dGPU, you need to dedicate huge parts of your SoC to graphics IP and make the memory subsystem wider, which is what Strix Halo is doing.

But that's not a trade-off chipmakers will make, because the vast majority of clients don't need such high-speed graphics, and the ones who do are better served by a dGPU.
So you're only hitting a niche of budget gamers or the newfound niche of local LLM training users. That's a very small subset of users, and they're best catered to (if such a niche is even worth appealing to) with a dedicated, expensive specialty SKU.

Therefore, iGPUs will basically become faster and wider by riding the logic and power improvements of newer nodes and the bandwidth increases of newer memory standards, and/or when necessity arises, such as Intel and AMD widening their iGPUs to take advantage of local low-power AI workloads.

If we look at modern upcoming iGPUs, Strix Point is already using 16 CUs / 1024 ALUs @ 2.9 GHz. Panther Lake H is widening the GPU from MTL-H's 8 Xe cores / 1024 ALUs @ 2.3 GHz to 12 Xe3 cores / 1536 ALUs @ <=3 GHz.

So in PTL-H's case, you're already seeing a modern iGPU matching the ALU count of the Switch 2 and Series S GPUs, which are dedicated consoles, but at much higher clock speeds, with generous amounts of cache, and with a faster LPDDR5X standard than the Switch 2 (PTL-H at 8533 vs Switch 2 at 7500 MT/s). That should be good enough to surpass Switch 2 and Series S performance, and a laptop with PTL-H would easily be able to run any current-gen game, released or yet to be released. That's good enough imo.

Doug S

Platinum Member
Feb 8, 2020
2,467
4,024
136
10nm was delayed by at least two years, and that's not counting the additional delays that Intel suffered even launching the aborted 10nm that went into Cannon Lake. In reality the delay was more like 3-4 years before Intel even managed Ice Lake! There's simply no comparison.

It was worse than that. Intel roadmaps in 2013 showed 10nm chips being delivered in 2015, which later slipped to 2016. Then in summer 2015 they were pushed back to 2017, the start of a long line of delays until Intel finally shipped 10nm for real in Q4 2019. So it was delayed by at least four years, maybe a bit longer depending on exactly when in 2015 they originally roadmapped 10nm shipments.

To say it is nothing like TSMC's 3nm delays is a massive understatement.

SiliconFly

Golden Member
Mar 10, 2023
1,179
604
96
It was worse than that. Intel roadmaps in 2013 showed 10nm chips being delivered in 2015, which later slipped to 2016. Then in summer 2015 they were pushed back to 2017, the start of a long line of delays until Intel finally shipped 10nm for real in Q4 2019. So it was delayed by at least four years, maybe a bit longer depending on exactly when in 2015 they originally roadmapped 10nm shipments.

To say it is nothing like TSMC's 3nm delays is a massive understatement.
Nope. 2016's 14nm Kaby Lake was actually supposed to be the first 10nm tick; they missed it due to delays. Their 2018 10nm Cannon Lake was a disaster due to poor yields. They had their first 10nm success with the release of Ice Lake (Sunny Cove) only in 2019. Intel 10nm was delayed by exactly 3 years.

And then they missed the next tick, which should technically have happened in 2021 but happened only in 2023 (with MTL). Another two-year delay.

They've lost a total of 5 years, putting them well behind TSMC. Now they've picked up steam, almost caught up, & are expected to get well ahead of the competition next year. Process leadership.

poke01

Golden Member
Mar 8, 2022
1,335
1,503
106
If we consider raw performance and power consumption, the P-core in M4 consumes more than twice as much power as the one in M1 in SPEC 2017 (7.21 W vs 3.43 W for INT and 8.95 W vs 3.92 W for FP) while being 50% faster. That's a pretty good tradeoff, but we are seeing a huge power consumption increase even on the newer node (N3E vs N5). In the case of M4 Max, Apple has to throttle clocks to M3 Max levels to fit into the package power budget.
It’s because Apple, for the first time ever, is using HP libraries for their CPU1. This is FinFlex at work.

“TechInsights' analysis also revealed a hybrid library approach: UHD libraries for GPU and CPU2, and a new high-performance library for CPU1. This design optimizes for various computational demands within a unified architecture.”


This is also what Intel does to achieve high clocks, but it burns through much, much more power on 10nm.

Hulk

Diamond Member
Oct 9, 1999
4,366
2,232
136
Doesn't an increase in IPC also have to rely on the parallelism that can be extracted from the code? Wouldn't that put a limit on IPC, or at the very least make it a curve of diminishing returns that approaches a limit?
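One classic way to put a number on that limit: treat the code as a dependency graph; with infinite execution width and perfect prediction, the best possible IPC is the instruction count divided by the longest dependency chain. A minimal sketch with an invented four-instruction example:

```python
# Dataflow limit on IPC: with infinite width and perfect prediction,
# best-case IPC = instruction count / longest dependency chain.
# The 4-instruction example is invented for illustration.
def ipc_ceiling(deps):
    """deps[i] lists the earlier instructions whose results i reads."""
    depth = []
    for d in deps:
        depth.append(1 + max((depth[j] for j in d), default=0))
    return len(deps) / max(depth)

# t0 = a+b; t1 = c+d (independent); t2 = t0*t1; t3 = t2+e
print(round(ipc_ceiling([[], [], [0, 1], [2]]), 2))   # 4 ops / 3 levels = 1.33
```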

Perhaps the "battleground" for performance will eventually have to shift from hardware to software?

For example, if you have two video editors that are essentially equal in all things except performance, and one runs twice as fast as the other on the same hardware, then that is a "problem" for the software developers. It seems like hardware and software development must work hand in hand.

How much are you waiting on your computer these days?
I never wait for the following apps:
Chrome
Thunderbird
MS Office apps
Corel Draw

I am slowed down in my workflow by the following apps:
Vegas Video 21
Presonus Studio One 6
Photoshop
Topaz Photo AI (a little)
Topaz Video AI (a LOT)
DxO PureRaw 3

Full disclosure, this is on my 14900K. On my Surface Laptop 2 many of these apps have me waiting quite a bit and some are unusable (Topaz AI).

But that is 8 cores of Raptor Cove at 5.5 GHz vs 4 Skylake cores at less than 3 GHz. Lunar Lake can't come fast enough for me.

DavidC1

Senior member
Dec 29, 2023
344
545
96
I'm not sure where those numbers for pipeline stages are coming from. I have huge doubts about the Apple M2's 9-cycle latency.

As for the memory subsystem latency, it's pretty similar for Apple, Intel, AMD, and others. L1 cache usually has a 4-5 cycle latency; L2, 16-20 cycles.

In general, there's no simple answer for how to achieve better performance. The number of pipeline stages does not matter that much. Yes, I remember those debates around Pentium 3, NetBurst, etc., but CPUs are much more complex now.
This is nonsense. The focus on clocks is what kills efficiency on the P-cores (aside from horrible execution). The Pentium 4 amply demonstrated that added pipeline stages need more transistors than originally expected.

By aiming for lower clocks you can have fewer branch mispredicts and caches with lower cycle latency. And you need fewer transistors, meaning better efficiency.

The memory subsystem is vastly superior due to better engineering and the focus on lower clocks. It's 192 KB + 128 KB of L1 for the A12 and successors, with 3-cycle latency. It completely blows the competition away. Its massive, low-latency caches are another reason why it's so power efficient.*

You should actually read proper articles describing fundamental CPU architecture rather than just guessing. And the claim that there is no simple answer is laughable - one company has clearly lapped the others for many years now, a lead so big that even after 4 years of stagnation it's still among the top. The designers clearly knew what they were doing. Do you think these guys just close their eyes and randomly decide by ballot what to improve?

*Having data accesses as close to the core as possible is what saves power. It is that simple. Apple is just executing on common-sense logic. Engineers have long said SRAM is the lowest power per bit.

This is the high-level recipe for the best architecture:
- An 8-10 stage pipeline, no more: it cuts area and transistors, and improves performance by lowering branch mispredict penalties (see the sketch below).
- Lower clocks, which will increase over time with better process nodes.
- Lower clocks allow building large caches with relatively low latency.
- All the decisions above also serve to lower power consumption. Large SRAM reduces pJ/bit and thus requires less power per unit of compute.
- Pair all of it with excellent management and brilliant engineers.
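On the mispredict point, a toy penalty model shows the trend (branch mix, mispredict rate, and base IPC are invented numbers; only the direction of the effect is the point):

```python
# Toy model: each mispredicted branch costs a refill roughly equal to
# the pipeline depth. Effective IPC = base / (1 + base * flush cycles
# per instruction). Branch mix, mispredict rate, and base IPC are
# invented, illustrative numbers.
def effective_ipc(base_ipc, depth, branch_frac=0.2, mispredict_rate=0.03):
    flush_per_inst = branch_frac * mispredict_rate * depth
    return base_ipc / (1 + base_ipc * flush_per_inst)

for depth in (9, 14, 20):
    print(f"{depth:2d}-stage pipeline: {effective_ipc(5.0, depth):.2f} effective IPC")
```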

It is Apple that has had all of this for the longest time; that's why they are successful. It has nothing to do with being a fanboy or whatever - that's just stupid bias. It's merely recognizing good work where it is.