Discussion Intel current and future Lakes & Rapids thread

Page 393

HurleyBird

Platinum Member
Apr 22, 2003
2,725
1,342
136
Not sure how schematic it is vs reality, but the fact they added 8 Gracemont cores rather than 4 Golden cores (guess estimating) for that area hints the contrary. They can't be that large or it would be pointless.

I don't think that the Gracemont cores are that large, but even if they were I wouldn't necessarily agree with your assertion. Whether or not they would be "pointless" versus adding additional GC cores would come down to the perf/watt difference. Besides lower power consumption for small tasks, power density also eventually becomes a limiting factor.
 

dullard

Elite Member
May 21, 2001
25,203
3,617
126
Dell now has Rocket Lake up on their website. April 19 delivery date for most models (a week or so earlier for the Alienware Aurora R12). At least for now, the price is a lot higher for their typical i5 model. Instead of the $500 range for an i5, they are starting at $750. But that will change when they update the Inspiron model.
 

dullard

Elite Member
May 21, 2001
25,203
3,617
126
Dell needs to fix their Rocket Lake text (Rocket Lake tops out at 8 cores / 16 threads):
Unleash the power to create: With Intel®’s latest 11th gen, up to 10 core, 20 threads i9 processor you can keep at your most demanding workloads for even longer.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
They have a real small core architecture which is a much better fit, they don't need Skylake.

Someone gets it.

Gracemont is a continuation of the dual-cluster decode introduced with Tremont.

The -mont line substantially diverges from -core.
-The small core decoders directly use most of the x86 instructions rather than changing them to internal instructions - continuation of what the original Bonnell Atom did.
-Sunny Cove increases the L1 data cache, while in Gracemont it'll increase the L1 instruction cache. Likely the L1I increase helps with the doubled decode.
-Starting from Tremont, it uses dual cluster decode, which according to the chief architect saves space compared to a uop cache.
-Starting from Goldmont it also has a predecode cache. A massive 64KB on Goldmont Plus.

I expect Gracemont cores to be substantially bigger than Tremont, but they'll still be barely over 1mm2. The Tremont cores are like 0.7mm2. Sunny Cove core is 5-6x the size of Tremont, not 3-4x.

It may perform like Skylake but it'll use a fraction of the space and much lower power.

It's not the addition of AVX-512 that makes the Core cores bloated. It's the ridiculous focus on clock speeds that's the issue. Xeon Phi shows the AVX-512 units can be far, far smaller.*

*Of course there might be the factor that the Core design itself might be bloated and inefficient in general.
 

dmens

Platinum Member
Mar 18, 2005
2,271
917
136
The small core decoders directly use most of the x86 instructions rather than changing them to internal instructions - continuation of what the original Bonnell Atom did.

Nope, every Atom has its own internal uop mapping just like the big cores.

It may perform like Skylake

Fat chance, it will not be even remotely close.

It's not the addition of AVX-512 that makes the Core cores bloated. It's the ridiculous focus on clock speeds that's the issue.

The main source of bloat is the area resources required to implement depth of execution speculation. If the CPU pipeline is designed appropriately for the target clock frequency there is no reason for an extraordinary area cost. There is no free lunch, you are not going to get performance without spending area... therefore all this talk of big core level performance from atoms is pure fantasy.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
By the way Rocketlake is trash. Good thing they got Tigerlake out rather than Rocketlake-U. It would have been a laughingstock of the industry.

Relaxing of the L3 and other latencies is the cost of trying to backport a 10nm design to an older, inefficient 14nm process.

ADL topping out at a maximum 20% ST gain? The average IPC gain is thus possibly significantly lower. I expected the 20% figure to be an average IPC gain.

I don't see what's the big deal here. Did anyone really expect higher clocks? They only reach 5.3GHz by taking away any overclocking headroom. If they back down clocks a bit that's a good thing since it'll get power and thermals under control.

Also the new microcode shows 18.5% in SpecInt and 21% in SpecFP.

MLiD actually claimed 2x MT with Alderlake mobile, not desktop. He personally expects Alderlake 8+8 to be like a 12-13 core Rocketlake part.

Gracemont is roughly 1/3rd of Golden Cove considering:
-2/3rd clocks
-2/3rd perf/clock
-20% loss without HT
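
As a rough check of that 1/3 figure, here is a minimal sketch. It assumes the three factors are independent and simply multiply, and the 8+8 extrapolation and variable names are illustrative assumptions, not measured data:

```python
# Rough per-core throughput of Gracemont relative to a Golden Cove core with HT.
clock_ratio = 2 / 3           # Gracemont clocks vs Golden Cove
perf_per_clock_ratio = 2 / 3  # Gracemont perf/clock vs Golden Cove
no_ht_factor = 0.8            # 20% loss from lacking HT

gracemont_vs_golden_cove = clock_ratio * perf_per_clock_ratio * no_ht_factor

# Assumed extrapolation: an 8+8 part expressed in Golden Cove core equivalents,
# ignoring power limits, scheduling and memory effects.
equiv_cores_8_plus_8 = 8 + 8 * gracemont_vs_golden_cove

print(f"Gracemont ~ {gracemont_vs_golden_cove:.2f} of a Golden Cove core")
print(f"8+8 ~ {equiv_cores_8_plus_8:.1f} Golden Cove equivalents")
```

Under those assumptions the product lands a little above 1/3 (about 0.36), so 8 Gracemont cores would add slightly less than 3 Golden Cove cores' worth of MT throughput.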
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Nope, every Atom has its own internal uop mapping just like the big cores.


Fat chance, it will not be even remotely close.

You want a bet or something? Tremont is already at Ivy Bridge levels. Another 30% gets us to Skylake.

Do you know what makes that possible? Because ARM cores can do it.

Nope. Let's not even get that far. Their own "little" core team is owning them.

There is no free lunch, you are not going to get performance without spending area... therefore all this talk of big core level performance from atoms is pure fantasy.

Oh right, cause there's no such thing as optimization. That's why since Zen they beat Intel to a pulp in perf/mm2. And that's why M1 almost beats it in overall performance with again, a fraction of the size.

If it was just about the ISA, why does the GPU in M1 perform so fantastic in perf/mm2?

Because there are other factors such as: motivation, execution, optimism about the future and the project you are working on, which are all human factors not "ISA".

Also speaking of competition, I believe the rumors of Zen 4 being 30% faster per clock. With Genoa offering 50% more cores, their performance target must be 2x over Milan. They'll need it against Ampere Altra.
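
The 2x figure follows from multiplying the two rumored gains. This is a sketch only, assuming unchanged clocks and perfect scaling with core count (both assumptions, not claims from the rumors):

```python
# Combine the rumored Zen 4 per-clock gain with Genoa's core count increase.
per_clock_gain = 1.30   # rumored 30% faster per clock
core_count_gain = 1.50  # 50% more cores than Milan

combined_gain = per_clock_gain * core_count_gain
print(f"Genoa vs Milan throughput ~ {combined_gain:.2f}x")
```

That gives roughly 1.95x, so any clock or efficiency gain on top would bring it to 2x.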
 

dmens

Platinum Member
Mar 18, 2005
2,271
917
136

LOL, that is micro-op fusion. It is when two uops are fused coming out of the fast decoder for rename and dispatch. It has absolutely nothing to do with executing x86 instructions directly since there is still an internal uop mapping, and the instruction still has to go through the fast decoder just like everything else. By the way, the big cores did that exact same thing, before the first atom was even a design concept. See this quote: "With the Pentium M Intel began fusing certain micro-ops." Merom (Core 2 Duo) derived directly from Pentium M Yonah and carried the concept over.

You want a bet or something? Tremont is already at Ivy Bridge levels. Another 30% gets us to Skylake.

Sure, any time. I love how people just throw around double digit gains like nothing. Here, ivybridge still 30% faster than Tremont in GB5:


They won't spend the area/power to increase small core perf because it will blow through the budgets.

Do you know what makes that possible? Because ARM cores can do it.

Well yes, ARM has several crucial advantages over x86 after all. Just because ARM can do it, does not mean Atom can.

Oh right, cause there's no such thing as optimization. That's why since Zen they beat Intel to a pulp in perf/mm2. And that's why M1 almost beats it in overall performance with again, a fraction of the size.

What are you talking about? Firestorm cores are huge. Zen 3 CPU cores are 5mm2 apiece. We are not talking about massive size differentials here. You are right that optimization can make a huge difference. But it won't turn a 1mm2 core into a 5mm2 core.

If it was just about the ISA, why does the GPU in M1 perform so fantastic in perf/mm2?

Because there are other factors such as: motivation, execution, optimism about the future and the project you are working at, which are all human factors not "ISA".

Sorry but no amount of intangibles will turn an atom size core into a fat core in performance. Not gonna happen. The M1 GPU is also huge by the way. It just also happens to be a really well designed, highly optimized GPU with extremely tight-knit API support. Software support is more important to GPU than CPU so I don't know why you would drag GPU into this, but whatever.
 

RTX2080

Senior member
Jul 2, 2018
322
511
136
The title by Gamersnexus is a bit ruthless:


In real-world pure CPU workloads like Blender and Adobe Premiere, the 11700K has well under a 10% advantage over the 10700K, which looks worse than synthetic tests like Cinebench/CPU-Z, where the gap can be ~15-20%.
 

jpiniero

Lifer
Oct 1, 2010
14,832
5,444
136
The main source of bloat is the area resources required to implement depth of execution speculation. If the CPU pipeline is designed appropriately for the target clock frequency there is no reason for an extraordinary area cost. There is no free lunch, you are not going to get performance without spending area... therefore all this talk of big core level performance from atoms is pure fantasy.

Talking about IPC more than actual pure performance. Obviously Gracemont isn't going to clock to 5 GHz. Surely there's some amount of density gain they could realize even if it lowers max frequency.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
-The small core decoders directly use most of the x86 instructions rather than changing them to internal instructions - continuation of what the original Bonnell Atom did.
-Sunny Cove increases the L1 data cache, while in Gracemont it'll increase the L1 instruction cache. Likely the L1I increase helps with the doubled decode.
-Starting from Tremont, it uses dual cluster decode, which according to the chief architect saves space compared to a uop cache.
-Starting from Goldmont it also has a predecode cache. A massive 64KB on Goldmont Plus.

Increasing L1i is a must to improve decode rate. Same with the predecode L2 cache, a must for such an architecture.
A uop cache (as done by Sandy Bridge+/Zen, not P4), while complex, saves energy by not having to decode over and over again; at some point those savings, plus not needing those massive decoder supporting structures, win out.

I feel like a big part of Core bloat comes from the sizing of various buffers: ROB, int PRF, FP PRF, uop cache, branch prediction unit buffers, multi-level I/D TLBs, TLBs that support various page sizes, store and load queues.
Atoms used to cut corners in all those structures, making great use of the diminishing returns of their sizing. For example, large pages in the TLB? Only added in Goldmont? 3 decoders? Also a recent addition.

Combine these cuts with obvious cuts to execution resources, vector size support, cache sizes, ports and path widths => tiny die area is the end result.

I expect Gracemont cores to be substantially bigger than Tremont, but they'll still be barely over 1mm2. The Tremont cores are like 0.7mm2. Sunny Cove core is 5-6x the size of Tremont, not 3-4x.

Atom is more of a "dial the transistor budget to hit the performance targets they need" product. It's not like it was a secret to them in 2014 that having an iTLB with large page support would help performance, or that designing such a TLB was beyond their capabilities. It was a conscious decision to forgo better performance for die size savings.

What I don't share with you is the optimism about performance targets. Honestly, that hybrid 1+4 CPU was a disaster, and that is already an understatement. And it used those Tremont cores that have already dialed structure sizes up on 10nm a lot. But look at its disastrous performance compared to other mobile CPUs in, for example, Cinebench R15 or R20.

So Intel's Alder Lake 2x MT performance is very likely a pipe dream on 10nm, probably based on a comparison with some hilariously wattage-constrained mobile CPU.

The desktop reality will be ~11 Golden Coves in Cinebench and waaaaaaay behind AMD 16-core CPUs. And since it will struggle in the poster children of linear scaling, MT loads that barely touch memory, the best way to use it will be disabling those Atom clusters and not having to deal with scheduling problems.
 

ondma

Platinum Member
Mar 18, 2018
2,768
1,350
136
Increasing L1i is a must to improve decode rate. Same with the predecode L2 cache, a must for such an architecture.
A uop cache (as done by Sandy Bridge+/Zen, not P4), while complex, saves energy by not having to decode over and over again; at some point those savings, plus not needing those massive decoder supporting structures, win out.

I feel like a big part of Core bloat comes from the sizing of various buffers: ROB, int PRF, FP PRF, uop cache, branch prediction unit buffers, multi-level I/D TLBs, TLBs that support various page sizes, store and load queues.
Atoms used to cut corners in all those structures, making great use of the diminishing returns of their sizing. For example, large pages in the TLB? Only added in Goldmont? 3 decoders? Also a recent addition.

Combine these cuts with obvious cuts to execution resources, vector size support, cache sizes, ports and path widths => tiny die area is the end result.

Atom is more of a "dial the transistor budget to hit the performance targets they need" product. It's not like it was a secret to them in 2014 that having an iTLB with large page support would help performance, or that designing such a TLB was beyond their capabilities. It was a conscious decision to forgo better performance for die size savings.

What I don't share with you is the optimism about performance targets. Honestly, that hybrid 1+4 CPU was a disaster, and that is already an understatement. And it used those Tremont cores that have already dialed structure sizes up on 10nm a lot. But look at its disastrous performance compared to other mobile CPUs in, for example, Cinebench R15 or R20.

So Intel's Alder Lake 2x MT performance is very likely a pipe dream on 10nm, probably based on a comparison with some hilariously wattage-constrained mobile CPU.

The desktop reality will be ~11 Golden Coves in Cinebench and waaaaaaay behind AMD 16-core CPUs. And since it will struggle in the poster children of linear scaling, MT loads that barely touch memory, the best way to use it will be disabling those Atom clusters and not having to deal with scheduling problems.

Obviously, 8+8 will not compete with 16 real Zen cores. I think the real target is 12-core Zen performance. Hopefully, Intel will price Alder Lake to somewhat reflect this.
 
Reactions: Hulk

jpiniero

Lifer
Oct 1, 2010
14,832
5,444
136
Ian's article says Meteor Lake is using Foveros. Article also says they are talking about fabbing products (including CPU tiles) at external foundries as well as 7 nm.

Plus they are going to try the foundry system again. Good luck with that.
 

mikk

Diamond Member
May 15, 2012
4,172
2,210
136
Alder Lake for desktop followed by mobile, as expected. They have shipped 40 million Tigerlake chips so far.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
What I don't share with you is the optimism about performance targets. Honestly, that hybrid 1+4 CPU was a disaster, and that is already an understatement. And it used those Tremont cores that have already dialed structure sizes up on 10nm a lot. But look at its disastrous performance compared to other mobile CPUs in, for example, Cinebench R15 or R20.

It's Lakefield that's a disaster. Even the Sunny Cove core performance sucks on that chip. 110 on Cinebench R15 ST best case, when Icelake can get 180.

The 6W N6000 gets the same ST score as the Sunny Cove core in Lakefield. And it gets almost 800 points in MT, which compares well to the 900 points some low end Icelake i5 laptops get: https://www.notebookcheck.net/Acer-...Also-a-good-subnotebook-with-i5.468200.0.html

That's a 50%+ gain over Goldmont Plus based N5000 which corresponds to the 30% perf/clock + clock speed gains.
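
To see how that splits, here is a small sketch that backs out the clock contribution. It assumes the total gain is exactly the product of the perf/clock gain and the clock gain, and takes "50%+" at its lower bound:

```python
# Decompose the quoted MT gain (N6000 vs N5000) into perf/clock and clock parts.
total_gain = 1.50           # "50%+" gain, taken at its lower bound
perf_per_clock_gain = 1.30  # 30% perf/clock gain

implied_clock_gain = total_gain / perf_per_clock_gain
print(f"implied sustained clock gain ~ {implied_clock_gain - 1:.0%}")
```

Under that assumption the sustained clock gain works out to roughly 15% or more.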

Sure, any time. I love how people just throw around double digit gains like nothing. Here, ivybridge still 30% faster than Tremont in GB5:

Ivy Bridge also clocks 44% higher in that comparison. In Geekbench it's actually at Haswell levels.
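
Normalizing for clocks, as a sketch that assumes the Geekbench score scales linearly with frequency (a simplification) and uses only the two percentages quoted here:

```python
# Per-clock comparison from the quoted GB5 result.
ivy_bridge_score_ratio = 1.30  # Ivy Bridge score relative to Tremont
ivy_bridge_clock_ratio = 1.44  # Ivy Bridge clock relative to Tremont

# Tremont's per-clock result relative to Ivy Bridge's.
tremont_per_clock_vs_ivy = ivy_bridge_clock_ratio / ivy_bridge_score_ratio
print(f"Tremont per clock ~ {tremont_per_clock_vs_ivy:.2f}x Ivy Bridge")
```

On those numbers Tremont comes out roughly 11% ahead of Ivy Bridge per clock in that comparison.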

What are you talking about? Firestorm cores are huge. Zen 3 CPU cores are 5mm2 apiece. We are not talking about massive size differentials here. You are right that optimization can make a huge difference. But it won't turn a 1mm2 core into a 5mm2 core.

Zen 3 is only 3.2mm2. Sunny Cove without L2 and FIVR is 4.4mm2, not to mention it's a worse performing core.

We know on the GPU side ARM chips have a far better perf/mm2. Icelake level GPU performance is achieved at less than half the size with the ARM chips.

And it used those Tremont cores that have already dialed structure sizes up on 10nm a lot.

If you call 0.7-0.8mm2 a lot sure.

Talking about IPC more than actual pure performance. Obviously Gracemont isn't going to clock to 5 GHz. Surely there's some amount of density gain they could realize even if it lowers max frequency.

Exactly. Even AMD, a company that seriously struggled, does far better. Compared to Zen cores, Intel chips need 50% or more area to achieve similar performance.
 

dullard

Elite Member
May 21, 2001
25,203
3,617
126
What I don't share with you is the optimism about performance targets. Honestly, that hybrid 1+4 CPU was a disaster, and that is already an understatement. And it used those Tremont cores that have already dialed structure sizes up on 10nm a lot. But look at its disastrous performance compared to other mobile CPUs in, for example, Cinebench R15 or R20.

Lakefield had problems, yes. But complaining about rendering performance on a 7 W device is probably the worst possible argument you could make against Lakefield. Honestly, who buys a laptop with low performance and long battery life with the intention of fast image rendering? That would be like a high-end restaurant buying their ingredients from McDonald's down the street, doing poorly, then complaining that therefore the McDonald's food could not possibly be successful.
 

gdansk

Platinum Member
Feb 8, 2011
2,489
3,379
136
Plus they are going to try the foundry system again. Good luck with that.

It will be a success based on how much effort they put into it. Intel's 14nm would be a massive improvement for many designs. The majority of TSMC revenue comes from 16nm and older processes. I think Intel's soon-to-be-spare 14nm could compete well in that space if they help potential customers get their designs working on the process.
 

RTX

Member
Nov 5, 2020
90
40
61
2.8GHz 80W W-1390
1.5GHz 35W W-1390T
2.9GHz 80W W-1370
3.3GHz 80W W-1350
4.0GHz 125W W-1350P

 

jpiniero

Lifer
Oct 1, 2010
14,832
5,444
136
Might be some typos. The site suggests that Rocket Lake W will work on Z490 which isn't the case for Comet Lake W.
 