Discussion Intel Meteor, Arrow, Lunar & Panther Lakes Discussion Threads


Tigerick

Senior member
Apr 1, 2022
694
600
106






With Hot Chips 34 starting this week, Intel will unveil technical details of the upcoming Meteor Lake (MTL) and Arrow Lake (ARL), the next-generation platforms after Raptor Lake. Both MTL and ARL represent a new direction in which Intel moves to multiple chiplets combined into one SoC platform.

MTL also introduces a new compute tile based on the Intel 4 process, which uses EUV lithography, a first for Intel. Intel expects to ship MTL mobile SoCs in 2023.

ARL will come after MTL, so Intel should be shipping it in 2024; that is what Intel's roadmap is telling us. The ARL compute tile will be manufactured on the Intel 20A process, the first from Intel to use GAA transistors, called RibbonFET.



Comparison of Intel's upcoming U-series CPUs: Core Ultra 100U, Lunar Lake and Panther Lake

| Model | Code Name | Date | TDP | Node | Tiles | Main Tile | CPU | LP E-Cores | LLC | GPU | Xe-cores |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Core Ultra 100U | Meteor Lake | Q4 2023 | 15 - 57 W | Intel 4 + N5 + N6 | 4 | tCPU | 2P + 8E | 2 | 12 MB | Intel Graphics | 4 |
| ? | Lunar Lake | Q4 2024 | 17 - 30 W | N3B + N6 | 2 | CPU + GPU & IMC | 4P + 4E | 0 | 12 MB | Arc | 8 |
| ? | Panther Lake | Q1 2026 ? | ? | Intel 18A + N3E | 3 | CPU + MC | 4P + 8E | 4 | ? | Arc | 12 |



Comparison of die sizes (mm²) of each tile of Meteor Lake, Arrow Lake, Lunar Lake and Panther Lake

| | Meteor Lake | Arrow Lake (20A) | Arrow Lake (N3B) | Lunar Lake | Panther Lake |
|---|---|---|---|---|---|
| Platform | Mobile H/U only | Desktop only | Desktop & Mobile H&HX | Mobile U only | Mobile H |
| Process Node | Intel 4 | Intel 20A | TSMC N3B | TSMC N3B | Intel 18A |
| Date | Q4 2023 | Q1 2025 ? | Desktop: Q4 2024; H&HX: Q1 2025 | Q4 2024 | Q1 2026 ? |
| Full Die | 6P + 8E | 6P + 8E ? | 8P + 16E | 4P + 4E | 4P + 8E |
| LLC | 24 MB | 24 MB ? | 36 MB ? | 12 MB | ? |
| tCPU (mm²) | 66.48 | | | | |
| tGPU (mm²) | 44.45 | | | | |
| SoC (mm²) | 96.77 | | | | |
| IOE (mm²) | 44.45 | | | | |
| Total (mm²) | 252.15 | | | | |



Intel Core Ultra 100 - Meteor Lake



As mentioned by Tom's Hardware, TSMC will manufacture the I/O, SoC, and GPU tiles. That means Intel will manufacture only the CPU and Foveros tiles. (Notably, Intel calls the I/O tile an 'I/O Expander,' hence the IOE moniker.)



 


Nothingness

Diamond Member
Jul 3, 2013
3,031
1,971
136
So, AVX512 is more of a niche.
On this I partially agree. But when it's useful, it can bring more speedups than two or three generations of Intel CPU microarchitectures.

I remember people used to say the same about AVX2 (especially during the 10 years when braindead Intel segment marketing made sure it wasn't available on all of their CPUs). Of course, as vectors get wider, their use becomes more difficult and looks more and more niche at first sight. But then you see good programmers making use of these extensions to benefit more and more applications.

I'm curious to see what AVX10.2 will bring to the table with its improved ISA and likely 256-bit first implementation.
 

DrMrLordX

Lifer
Apr 27, 2000
22,000
11,561
136
So, AVX512 is more of a niche.
It's more like this (and I'm grossly oversimplifying things here):

Let's say you're dealing with fp32 data. How many operations can you squeeze out of your code where you can carry out addition and/or multiplication operations on sixteen of those data points at once without any dependencies? Can you do this repeatedly without necessarily having to do something else in the same thread, other than something simple like iterating a loop counter? If the answer to those questions is "a lot" and "yes", then if you code for it, you can potentially make use of AVX512! Of course AVX512 isn't just AVX256/AVX2 on steroids. It does have other functions.

Regardless, you need to know what you're doing to actually leverage SIMD like AVX512. Even if you take the easy route and rely on autovectorization, your code has to be set up so the compiler or runtime engine can figure out how to align all the code properly and produce what you want. And that is (or can be) niche.

There's probably a fair amount of code out there that could be rewritten to make better use of AVX2 and even AVX512. The question is, who's going to go to all that effort?
 

MS_AT

Senior member
Jul 15, 2024
209
497
96
Looks like AVX512 is not useful for general consumers after all!
I am afraid it's more useful than Cinebench itself is.

You're confusing the usefulness of AVX512 with its historical availability. Software is usually written for the lowest common denominator, and since AVX512 hardware wasn't generally available, nobody was writing consumer software capable of using it. What's more, the Skylake-X implementation suffered from instruction licence levels (frequency throttling), so sprinkling bits of AVX512 here and there carried a performance penalty for code that used it only sparingly. Intel fixed that with Ice Lake, and AMD never had the problem to begin with. But this initial situation gave AVX512 the reputation of being an HPC niche.

The reason Intel dropped AVX512 with Alder Lake was the shift to E-cores, plus an initial strategic mistake. The mistake: even though AVX512 instructions apply to 128b, 256b and 512b register widths, the 512b width is mandated while the others are optional. Adding 512b support while keeping a 128b execution width on the E-cores (to save die area) would have led to terrible performance, for reasons beyond just quadruple pumping.

And Intel doesn't consider AVX512 niche or useless, as evidenced by their push for AVX10, which will be available on consumer hardware and will make all the fancy AVX512 instructions available on targets supporting at least a 256b execution width. That is much easier to implement on E-cores (as double-pumped 128b units) than full 512b AVX512 would be. If anything, you could think of the 512b execution width as the niche.

And why is AVX512 useful? Because it makes it easier to work with vectors, thanks to the new predicate instructions and masking in general; I would argue it's easier to add AVX512 support to your program than AVX2 support. Why aren't people doing that? Once again, there was not enough hardware on the market capable of using it to make it worthwhile. Even with runtime dispatch based on detected CPU features, it adds validation costs etc., so you need to make sure you can get a return on this investment.

An additional benefit of SIMD is that it lowers the burden on the decoders, as you are able to do more work with fewer instructions. Zen 4 showed that nicely, and this is more important for x64 than for ARM, where decode is relatively easier to implement.

And the compatibility moat of x64 would be that much harder for ARM to overcome if Intel had adopted the Nvidia approach [the same features are supported on consumer and enterprise products, but some are slower on consumer hardware; you can still use consumer hardware for development. That was not possible with AVX512, which was Xeon-only at the beginning].
 

Hulk

Diamond Member
Oct 9, 1999
4,455
2,373
136
The reason Intel dropped AVX512 with Alder Lake was the shift to E-cores, plus an initial strategic mistake. The mistake: even though AVX512 instructions apply to 128b, 256b and 512b register widths, the 512b width is mandated while the others are optional. Adding 512b support while keeping a 128b execution width on the E-cores (to save die area) would have led to terrible performance, for reasons beyond just quadruple pumping.
Great post. Can you clarify this paragraph a bit more? Was the mistake that E cores didn't have AVX512 or that AVX512 was not correctly developed upon inception?
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,911
3,523
136
Great post. Can you clarify this paragraph a bit more? Was the mistake that E cores didn't have AVX512 or that AVX512 was not correctly developed upon inception?
It's that AVX512 deliberately made design choices that make splitting 512b vectors into smaller vectors hard: operations that cross "lanes", i.e. the major binary boundaries (128b/256b). Those operations are useful, but they are a detriment to overall flexibility.
 

Aeonsim

Junior Member
May 10, 2020
10
27
91
AVX512 isn't useless for the standard user. Plenty of basic functions, such as parsing JSON (lots of websites), text processing, sorting, and base64 encoding/decoding, can benefit substantially (30-60% faster) when rewritten to use AVX512 operations. The problem is simply the lack of penetration in the CPU market due to Intel's confused approach to AVX512. This has meant there has been little incentive for the massive amount of legacy code and libraries to switch to using these instructions.
 

alcoholbob

Diamond Member
May 24, 2005
6,303
348
126
Great post. Can you clarify this paragraph a bit more? Was the mistake that E cores didn't have AVX512 or that AVX512 was not correctly developed upon inception?

Probably a feature that requires going into the BIOS to turn off the E-cores just ended up being something Intel didn't want to deal with and have to validate. Or they were going to have to do something stupid like AMD having users open the Xbox Game Bar...
 

Wolverine2349

Senior member
Oct 9, 2022
371
112
76
No idea if that's a bug or not.


Well, interesting whether it's true or not.

I have heard that DDR5 cannot do Gear 1 (1:1) on either AMD or Intel?

Though on Zen 4, isn't DDR5 actually Gear 1 with the UCLK (memory controller clock)? I mean, I have a 7800X3D and my UCLK is at 3000 MHz while my RAM is at 3000 MHz (6000 MT/s for DDR5). Or is that not Gear 1 because the Infinity Fabric is not 1:1?

I mean, Intel with Raptor Lake is still similar to AMD in latency with tuned 6000 RAM, despite needing to run in Gear 2?

Maybe, since the IMC in Arrow Lake is no longer part of the ring bus, Gear 1 is possible? Or is it true that even AMD is not really Gear 1, or can you not say that because it works so differently than Intel?

Is the Infinity Fabric on Zen 3, 4 and 5 like the ring bus on Intel 12th to 14th Gen? Or is the IF more like the IMC?

Because if AMD is truly Gear 1 on Zen 4 and 5 with DDR5-6000 (3000 MHz), Intel must just have much better latency, given they can equal AMD's latency in ns with half the memory controller clock.
 

alcoholbob

Diamond Member
May 24, 2005
6,303
348
126
Intel's ring bus system has way lower latency than AMD's, whose chiplet design adds latency. That is more than enough to offset the fact that AMD can run RAM in Gear 1 while Intel is forced to run in Gear 2 with DDR5.

This may also explain why Intel seems to benefit less (in gaming, anyway) from lower RAM latency than from just brute-forcing more bandwidth, whereas AMD seems to benefit more from lower first-word latency in their RAM.
 
Reactions: dr1337

Wolverine2349

Senior member
Oct 9, 2022
371
112
76
Intel's ring bus system has way lower latency than AMD's, whose chiplet design adds latency. That is more than enough to offset the fact that AMD can run RAM in Gear 1 while Intel is forced to run in Gear 2 with DDR5.

This may also explain why Intel seems to benefit less (in gaming, anyway) from lower RAM latency than from just brute-forcing more bandwidth, whereas AMD seems to benefit more from lower first-word latency in their RAM.

Is Arrow Lake still going to use Intel's ring bus system, or something else?

Is AMD's FCLK/Infinity Fabric like Intel's ring bus, or not?
 

SiliconFly

Golden Member
Mar 10, 2023
1,467
827
96
GPU acceleration either looks like crap or has a much larger file size. It's fine for streaming. For archiving, nothing beats CPU transcoding.
On Intel platforms, most encoders/decoders typically use the built-in Quick Sync accelerators for transcoding. It's CPU based (not GPU based).

Encoders/decoders that use Quick Sync don't need AVX512. Only those that don't use Quick Sync (very few) may benefit from AVX512 acceleration if it's coded properly with AVX512 supported.

Imho, this is one of the primary reasons AVX512 is dead in the water.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,955
4,484
136
On Intel platforms, most encoders/decoders typically use the built-in Quick Sync accelerators for transcoding. It's CPU based (not GPU based).

Encoders/decoders that use Quick Sync don't need AVX512. Only those that don't use Quick Sync may benefit from AVX512 acceleration if it's coded properly with AVX512 supported (which is very few).

Imho, this is one of the primary reasons AVX512 is dead in the water I believe.

Quick Sync is great; I loved it on my Ivy Bridge. It still resulted in a larger file size, though. Whether that matters, well, that's up to the user to decide.
 

hemedans

Senior member
Jan 31, 2015
223
113
116
On Intel platforms, most encoders/decoders typically use the built-in Quick Sync accelerators for transcoding. It's CPU based (not GPU based).

Encoders/decoders that use Quick Sync don't need AVX512. Only those that don't use Quick Sync (very few) may benefit from AVX512 acceleration if it's coded properly with AVX512 supported.

Imho, this is one of the primary reasons AVX512 is dead in the water I believe.
Quick Sync is GPU based.
 
Reactions: Thunder 57