I found a really great talk about the Lion Cove core changes and had a go at transcribing it, well kind of. Part transcription, part notes. A couple spots with (?) that I couldn't follow. Let me know if you can figure those parts out. Good info here that I didn't see published elsewhere.
Hyperthreading
HT can add 30% MT performance for only a 20% increase in power and a 10% increase in die area.
Compared to a core that includes HT, the same core without HT can use 15% less power for the same ST performance and be 10% smaller in area.
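To put those numbers together, here's a quick back-of-the-envelope calculation (my own sketch, not from the talk) of what the quoted figures imply for perf/watt and perf/area:

```python
# Back-of-the-envelope math for the HT tradeoff figures quoted above.
# All values are relative to the same core without HT (baseline = 1.0).

ht_mt_perf = 1.30   # HT adds ~30% MT performance...
ht_power   = 1.20   # ...for ~20% more power...
ht_area    = 1.10   # ...and ~10% more die area.

mt_perf_per_watt = ht_mt_perf / ht_power   # MT perf/watt gain from HT
mt_perf_per_area = ht_mt_perf / ht_area    # MT perf/area gain from HT

# Removing HT instead: same ST performance at 15% less power.
st_perf_per_watt_no_ht = 1.0 / 0.85

print(f"HT MT perf/watt: {mt_perf_per_watt:.3f}x")           # ~1.083x
print(f"HT MT perf/area: {mt_perf_per_area:.3f}x")           # ~1.182x
print(f"No-HT ST perf/watt: {st_perf_per_watt_no_ht:.3f}x")  # ~1.176x
```

So HT buys MT throughput at only a modest perf/watt gain, while dropping it buys a similar magnitude of ST efficiency — which is the talk's argument for letting E cores carry MT instead.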
20 years ago, when software threads often outnumbered core count and prior to the advent of E cores or processors with high core count, HT was a good way to increase MT performance. But now with E cores and P cores, each optimized by area and power to excel at ST and MT performance, respectively, HT isn’t the best way to increase overall performance on the desktop. In addition, in most desktop applications there are sufficient P and E cores to handle all of the threads created by the software.
HT is still useful in server situations where the software thread count is extremely high and all available compute cores can be utilized. But for Lunar Lake the idea was to remove any transistors that don't increase the goodness of the CPU, and MT is better handled by E cores than by HT.
Smarter Thermal Management and 16.67MHz Bins
While die shrinks have allowed more and more transistors to be packed into smaller areas, along with this advancement comes an increase in thermal density that must be dealt with. Historically, thermal issues in the core were dealt with by throttling, i.e. reducing frequency. Since these "guard band" settings were static and were determined in the lab pre-launch, they had to be set conservatively to protect the core under the worst environmental conditions and heaviest compute loads. This invariably leaves some performance on the table under most "normal" operating conditions.
Intel is now using a novel approach to thermal management: a network-based self-tuning controller that adapts to the real-time operating conditions of the actual workload being run and takes into account the environmental conditions and the thermal solution being used. The controller is thereby able to exploit all of the available thermal headroom, with tighter frequency control to maximize performance. Efficiency is also increased, as the frequency bins have been reduced from 100MHz steps to 16.67MHz steps, again allowing maximum performance and efficiency from the core.
For example, if conditions permit operation at 3.05GHz but not 3.1GHz, the old system would have to hold frequency at 3GHz, while the new system can operate at close to 3.05GHz. The overall performance benefit is approximately 2%.
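As a toy illustration of the binning math (mine, not from the talk; exact fractions are used because 16.67MHz is really 100/6 MHz and float rounding would spoil the result):

```python
from fractions import Fraction

def highest_bin(f_cap_mhz, bin_mhz):
    """Largest multiple of bin_mhz that does not exceed the cap."""
    return (f_cap_mhz // bin_mhz) * bin_mhz

f_cap = Fraction(3050)  # conditions permit 3.05 GHz but not 3.1 GHz

old = highest_bin(f_cap, Fraction(100))     # 100 MHz bins   -> 3000 MHz
new = highest_bin(f_cap, Fraction(100, 6))  # 16.67 MHz bins -> 3050 MHz

print(f"old: {float(old)} MHz, new: {float(new)} MHz, "
      f"gain: {float(new / old - 1):.2%}")  # ~1.67% in this example
```

The gain obviously depends on where the cap falls relative to the old 100MHz grid, which is why the talk quotes ~2% as an overall average rather than a fixed number.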
Front End Improvements
The front end of a CPU is responsible for fetching x86 instructions and decoding them into micro-operations. In order to adequately supply the Out-Of-Order part of the core, the front end must be able to accurately determine the correct code blocks from which to fetch instructions. Lion Cove fundamentally changes the branch prediction scheme, significantly widening the prediction block, up to 8x the previous generation, without sacrificing prediction accuracy. This has two important benefits. First, it allows the BPU (Branch Prediction Unit) to run ahead and prefetch code lines into the instruction cache, alleviating possible instruction cache misses. In this context, instruction cache request bandwidth towards the L2 was increased 3x in Lion Cove to capitalize on the BPU running ahead.
Second, wider prediction blocks allow an increase in instruction fetch bandwidth, and indeed fetch bandwidth was doubled from 64 bytes per cycle to 128 bytes per cycle while decode bandwidth was increased from 6 to 8 instructions per cycle. Decoded instructions are steered towards the micro-op queue and are also built into the micro-op cache. Since code lines are often reused, the micro-op cache allows for efficient, low-latency, high-bandwidth supply of previously decoded micro-ops to the OoO engine without having to power up the fetch and decode pipeline.
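A rough sanity check (my own numbers for average instruction length, which the talk doesn't give) on why fetch had to double to feed the wider decoder:

```python
# With 8-wide decode, fetch must deliver enough bytes per cycle to cover
# 8 x86 instructions. Assuming an average instruction length of ~4 bytes
# (a common rule of thumb for x86 code, not a figure from the talk):

fetch_bytes_per_cycle   = 128   # Lion Cove (was 64 in Redwood Cove)
decode_instrs_per_cycle = 8     # Lion Cove (was 6)
avg_instr_bytes         = 4.0   # assumed average x86 instruction length

instrs_fetchable = fetch_bytes_per_cycle / avg_instr_bytes  # 32 per cycle
print(instrs_fetchable, instrs_fetchable >= decode_instrs_per_cycle)
```

The headroom matters because fetch lines are rarely fully useful: taken branches and misaligned blocks waste bytes, so sustained fetch needs to comfortably exceed decode width.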
The micro-op cache grew from 4,000 micro-ops in Redwood Cove to 5,250 in Lion Cove, and its read bandwidth was increased to supply 12 micro-ops per cycle versus 8 for Redwood Cove. Finally, the micro-op queue grew from 144 to 192 entries, facilitating the service of longer or larger code loops in a power-efficient manner. The OoO engine is responsible for scheduling micro-ops for execution in a manner which maximizes parallelism, thus increasing IPC.
Prior generation P cores employed a monolithic scheduling scheme where a single scheduler was tasked with determining the data readiness of all micro-op types and scheduling and dispatching them across all execution ports. This scheme was exceedingly hard to scale and incurred significant hardware overhead.
Lion Cove solves this by splitting the OoO engine into two domains: integer, which also holds the address generation units for memory operations, and vector. These two domains now have independent renaming structures, catering to optimized bandwidth, and independent schedulers, catering to optimized port utilization (?). This allows future expansion of each of these domains independently of the other and provides opportunities on workloads that exercise only one of these domains.
Lion Cove increases the allocation/rename bandwidth from 6 to 8 micro-ops per cycle, and the OoO depth, or instruction window, was increased from 512 to 576 micro-ops. In addition, the physical register files were enlarged appropriately versus the prior generation. Lion Cove retires 12 micro-ops per cycle versus 8 previously.
Execution
Lion Cove increases the total number of execution ports from 12 to 18. On the integer side, 6 ALUs are complemented by 3 shift units and 3 64-bit multipliers operating at 3 cycles latency and 1 cycle throughput; 3 branches can be resolved in parallel per cycle. On the vector side, Lion Cove has four 256-bit ALUs plus two 256-bit FMAs operating at 4 cycles latency, and two 256-bit floating point dividers with significantly improved latency and throughput for both single and double precision operations versus the prior generation.
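A small sketch (my own, not from the talk) of what "3 cycles latency, 1 cycle throughput" means in practice for those 64-bit multipliers: a dependent chain pays full latency per multiply, while independent multiplies pipeline across the three units.

```python
import math

MUL_LATENCY = 3   # cycles before a multiply's result is ready
MUL_UNITS   = 3   # three 64-bit multipliers, each accepting 1 op/cycle

def dependent_chain_cycles(n):
    """n multiplies, each consuming the previous result: no overlap."""
    return n * MUL_LATENCY

def independent_cycles(n):
    """n independent multiplies: 3 can start per cycle, then the
    pipeline drains for the remaining latency."""
    return math.ceil(n / MUL_UNITS) + (MUL_LATENCY - 1)

print(dependent_chain_cycles(12))  # 36 cycles
print(independent_cycles(12))      # 6 cycles
```

This is why compilers and OoO hardware work so hard to expose independent work: the same 12 multiplies cost 6x fewer cycles when they don't depend on each other.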
Crypto acceleration hardware for AES, SHA, and SM3/SM4 resides in the vector stack.
Memory Subsystem
The memory subsystem is a key part of a performant microarchitecture, and at the heart of the core's memory subsystem are the data caches. Caches are all about striking the perfect balance between bandwidth, latency, and capacity given a certain area and power budget. Lion Cove significantly re-architects the core's memory subsystem to allow for sustained high bandwidth with low average latency, while still keeping built-in scalability and flexibility to increase cache capacity.
The first-level data cache was completely redesigned to allow full operation at 4 cycles latency vs 5 cycles in the previous generation. Lion Cove also introduces a new three-level core cache hierarchy by inserting an intermediate 192KB cache between the 1st and 2nd level caches.
This has two key benefits. First and foremost, it decreases the average load-to-use latency seen by the core, which increases IPC. Second, it allows the L2 cache capacity to grow, keeping a larger portion of the data set closer inside the core without paying the IPC penalty of the added L2 cache latency; indeed, the L2 grows to 2.5MB on Lunar Lake and 3MB on Arrow Lake. Along with several other L2 controller optimizations, as well as an increase in L1 fill buffers to 24 and L2 miss queue entries (?) to 80, Lion Cove shows a significant improvement in its capacity to consume external bandwidth, and this, as you know, is key to running performant AI workloads.
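To see why the extra level lowers average latency, here's a generic average-memory-access-time style sketch. Only the 4-cycle L1 latency and the existence of the 192KB mid-level cache come from the talk; the hit rates and the other latencies below are placeholder assumptions of mine:

```python
def avg_latency(levels):
    """levels: (hit_rate, latency_cycles) pairs, last level with hit_rate 1.0.
    Returns expected cycles per load, assuming each load is serviced by
    the first level it hits."""
    expected, p_reach = 0.0, 1.0
    for hit_rate, latency in levels:
        expected += p_reach * hit_rate * latency
        p_reach *= (1.0 - hit_rate)
    return expected

# Two-level hierarchy (placeholder hit rates and L2 latency):
two_level = avg_latency([(0.90, 5), (1.00, 16)])

# Three-level: faster 4-cycle L1 plus a mid-level cache in between:
three_level = avg_latency([(0.90, 4), (0.60, 9), (1.00, 16)])

print(f"{two_level:.2f} vs {three_level:.2f} cycles per load")
```

With these made-up rates, the combination of the faster L1 and the intermediate level cuts the average load latency by over a cycle; the same structure also shows why a bigger, slower L2 becomes tolerable once the mid-level cache absorbs most L1 misses.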
Among other memory subsystem enhancements, the first-level DTLB (Data Translation Lookaside Buffer) was increased to support coverage for 128 pages vs 96 pages previously, and in order to improve load execution in the shadow of older stores, Lion Cove adds a 3rd store address generation unit. It employs a new fine-grain memory disambiguation algorithm to safely (?) conflicts (audio dropout) and enhances the store-to-load forwarding scheme to allow a young load to collect and stitch data from any number of older pending resolved stores as well as from the data cache.
Lion Cove drives a significant double-digit IPC improvement over a wide spectrum of workloads. Having optimized for lower TDPs (Thermal Design Power) on Lunar Lake, Lion Cove delivers more than an 18% improvement in performance at power (PnP) at that low TDP.