I wonder why/how two instruction decoders are preferable to a micro-op cache in a design focused on efficiency. They state a uop cache is not deemed necessary without performant AVX? I guess it's actually not an either/or, and the two decoders can be seen as a more flexible split of one bigger decoder; see the later quote: "Predecode info cached to avoid variable length decoding. 2×3 config reduces length decoding power/area when it’s needed"
Micro-op cache makes sense in big core designs where it needs to perform high per clock and clock close to 5GHz. The die and power overhead of the uop cache is small in light of the gigantic uarch.
But note that in E-core designs the tables turn. Uop caches store decoded instructions, so a 4K-entry uop cache actually occupies a multiple of the storage that 4K instructions would take in the instruction cache. There is also a pipeline penalty on a miss.
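To put rough numbers on that storage multiple, here is a back-of-the-envelope sketch. Both byte sizes are illustrative assumptions (average x86 instruction length, bytes per decoded uop entry), not Gracemont or any specific core's real figures:

```python
# Illustrative only: both per-entry sizes are assumptions, not real uarch data.
AVG_X86_INSN_BYTES = 4   # assumed average x86 instruction length
UOP_ENTRY_BYTES = 8      # assumed storage per decoded uop entry

def cache_bytes(entries, bytes_per_entry):
    """Raw data-array size for a cache holding `entries` items."""
    return entries * bytes_per_entry

# 4K instructions held as raw bytes vs. held as decoded uops:
icache_equiv = cache_bytes(4096, AVG_X86_INSN_BYTES)  # 16 KiB of raw instructions
uop_cache = cache_bytes(4096, UOP_ENTRY_BYTES)        # 32 KiB of decoded uops

print(uop_cache // icache_equiv)  # 2 -- decoded storage is a multiple of raw size
```

Under these assumptions the decoded form costs 2x the SRAM; the real ratio depends on the actual uop entry width, but the direction of the trade-off is the point.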
Intel went from 16 stages in Nehalem to 14-19 stages in Sandy Bridge, where the worst case scenario adds 2 stages and best case scenario allows skipping 2. In this case Intel is essentially using the uop cache to increase clock speed using more pipeline stages while minimizing the perf/clock penalty.
Adding pipeline stages adds quite a bit of complexity in reality, in addition to noticeably reducing per-clock performance. And for the uop cache hit rate to be high, the cache has to be large. So by avoiding the uop cache they avoid both the miss penalty and the comparatively large area it needs. It's a multi-faceted design decision that takes all parameters into account.
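The hit-rate dependence can be sketched with a simple expected-depth calculation using the Sandy Bridge best/worst-case stage counts quoted above (the hit rates themselves are made up for illustration):

```python
def expected_depth(hit_rate, hit_depth=14, miss_depth=19):
    """Average effective frontend pipeline depth given a uop-cache hit rate.
    Depths follow the Sandy Bridge best/worst cases quoted above."""
    return hit_rate * hit_depth + (1 - hit_rate) * miss_depth

# With a high hit rate the average approaches the short uop-cache path:
print(round(expected_depth(0.8), 2))  # 15.0
# Below a 60% hit rate the average is worse than Nehalem's fixed 16 stages:
print(round(expected_depth(0.5), 2))  # 16.5
```

The break-even against Nehalem's 16 stages lands at a 60% hit rate in this model, which is why a small uop cache (with a low hit rate) would be worse than no uop cache at all.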
Gracemont should be a 13-stage pipeline, which means that compared to the big cores it has a rather significant advantage: branch mispredictions cost fewer cycles, which in turn means more efficiency.
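A rough way to see that advantage, assuming the misprediction penalty roughly tracks the frontend refill depth (the 5 MPKI misprediction rate here is an arbitrary illustrative number):

```python
def mispredict_cycles_per_1k(mpki, refill_depth):
    """Cycles lost to branch mispredictions per 1000 instructions,
    assuming the penalty roughly equals the pipeline refill depth."""
    return mpki * refill_depth

# Same workload (assumed 5 mispredicts per 1K instructions), different depths:
print(mispredict_cycles_per_1k(5, 13))  # 65 -- shorter 13-stage E-core pipeline
print(mispredict_cycles_per_1k(5, 19))  # 95 -- deeper big-core worst case
```

Same misprediction count, ~30% fewer wasted cycles purely from the shorter pipeline; wasted cycles are wasted energy, hence the efficiency win.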
Intel went for the clustered decode approach because straight up widening the decoder from 3 to even 4 wide increases complexity quadratically, and some argue even exponentially. So going from 3 to 4 wide may mean a 60% increase in decode area and power.
The clustered decode means it can essentially double the issue rate without paying that area penalty. Intel claims close to linear scaling in area/power. And based on that article, Gracemont can reliably hit a 5-wide issue rate, which is quite fantastic.
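Comparing the two scaling models makes the argument concrete. This sketch assumes a simple quadratic cost model for a monolithic decoder (the superlinear case argued above) versus linear replication for clusters; the exact exponent is a modeling assumption, not an Intel figure:

```python
def monolithic_cost(width, base_width=3, base_cost=1.0):
    """Assumed quadratic scaling model for widening one decoder."""
    return base_cost * (width / base_width) ** 2

def clustered_cost(clusters, base_cost=1.0):
    """Near-linear model: replicate a fixed 3-wide cluster."""
    return base_cost * clusters

print(round(monolithic_cost(4), 2))  # 1.78 -- even 3->4 wide costs ~78% here
print(round(monolithic_cost(6), 2))  # 4.0  -- a monolithic 6-wide costs 4x
print(round(clustered_cost(2), 2))   # 2.0  -- two 3-wide clusters cost ~2x
```

Note the quoted 60% for going 3-to-4 wide sits between this model's linear (+33%) and quadratic (+78%) cases, but either way two 3-wide clusters deliver 6-wide peak decode at half the cost the quadratic model assigns to a monolithic 6-wide decoder.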