Question Discussion over P-core and E-core vs AMDs regular vs C-core

Hulk · Feb 12, 2024

BorisTheBlade82 said:
Sadly, that slide is misleading. It already takes into account, that for the same MT performance more Zen4c cores are used and that they clock lower than a comparable Zen4 SKU. Looking at the V/f curve, Zen4c is quite disappointing IMHO.

I fully expect AMD to specialize their two cores more in the coming generations, where the c-cores show real efficiency improvements in terms of Work per Joule. This might only be the case for Integer.
All in all, in the client space the heterogenous approach makes sense to me due to Amdahl's law. Server space is different, because there it does not apply in the same way. There, many smaller cores might become the standard while big cores might only be used for special workloads and applications that get licensed by core count usage.
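
To put rough numbers on the Amdahl's law point in the quote above, here is a minimal sketch. The parallel fractions (0.70 for a client-like app, 0.99 for a server-like load) are purely illustrative assumptions, not measurements of any real workload.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup over a single core when only `parallel_fraction` of the work scales with cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Hypothetical parallel fractions, chosen only to contrast the two cases.
for label, p in (("client-like, p=0.70", 0.70), ("server-like, p=0.99", 0.99)):
    for n in (8, 32, 128):
        print(f"{label}: {n:>3} cores -> {amdahl_speedup(p, n):.1f}x")
```

With p = 0.70 the speedup can never exceed 1/0.30, about 3.3x, no matter how many cores are added, so a couple of fast cores still matter. With p = 0.99 the curve keeps climbing well past 100 cores, which is the case for many small cores in servers.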

These types of promotional charts are generally terrible but this one sets a new low in my opinion. I mean we expect a lack of numerical values, no scale, and other missing parameters, but this one really shows nothing except efficiency is "better" and frequency is "lower." Oh yeah, and size is "smaller." Absolutely no useful information like what is the difference in frequency? How does efficiency compare at various iso power and iso frequency points?

So according to this chart Zen 4C is half the area of Zen 4, is that correct?

Frequency for 4C should be about 35% lower, correct?

Power efficiency about 50% better at some frequency/workload, correct?

If 4C is half the area, and 50% more efficient then that is quite impressive so I would expect AMD to just come out and state those numbers.
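
For what it's worth, the guesses above can be turned into a quick perf-per-area estimate. This is only a sketch: the area and frequency ratios are the read-offs from this post, not AMD-published figures, and equal per-clock throughput between the two cores is an assumption.

```python
# Guesses read off the slide in the post above, not published figures.
AREA_RATIO = 0.50       # Zen 4C core area relative to Zen 4
FREQ_RATIO = 1 - 0.35   # Zen 4C peak clock relative to Zen 4
IPC_RATIO = 1.00        # assumption: same work per clock on both cores

def relative_perf_per_area(ipc_ratio: float, freq_ratio: float, area_ratio: float) -> float:
    """Per-core throughput (IPC x clock) divided by core area, relative to the baseline core."""
    return ipc_ratio * freq_ratio / area_ratio

print(f"{relative_perf_per_area(IPC_RATIO, FREQ_RATIO, AREA_RATIO):.2f}x peak throughput per unit area")
```

Under those assumptions the dense core delivers roughly 1.3x the peak throughput per unit of area, before any efficiency difference is counted.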

Abwx · Feb 12, 2024

BorisTheBlade82 said:
Sadly, that slide is misleading. It already takes into account, that for the same MT performance more Zen4c cores are used and that they clock lower than a comparable Zen4 SKU. Looking at the V/f curve, Zen4c is quite disappointing IMHO.

I fully expect AMD to specialize their two cores more in the coming generations, where the c-cores show real efficiency improvements in terms of Work per Joule. This might only be the case for Integer.
All in all, in the client space the heterogenous approach makes sense to me due to Amdahl's law. Server space is different, because there it does not apply in the same way. There, many smaller cores might become the standard while big cores might only be used for special workloads and applications that get licensed by core count usage.

You didn't read the thing accurately; it can't clock as high but has higher efficiency at the same frequency.

jpiniero · Feb 12, 2024

Abwx said:
You didn't read the thing accurately; it can't clock as high but has higher efficiency at the same frequency.

That's not what the chart says. Breakeven point is at 17.5 W although it's close enough that it's well worth the area savings.

Abwx · Feb 12, 2024

jpiniero said:
That's not what the chart says. Breakeven point is at 17.5 W although it's close enough that it's well worth the area savings.

PHX2 is a compromise because mixing Zen 4 and Zen 4C yields something that is not optimal. For instance, the voltage must be aligned with Zen 4 even if Zen 4C would need a slightly lower voltage.

For an accurate comparison one has to check by disabling the 2 regular cores. That's what Phoronix wanted to do, but so far he has said that the regular cores can't be disabled.

BorisTheBlade82 · Feb 12, 2024

Abwx said:
You didn't read the thing accurately; it can't clock as high but has higher efficiency at the same frequency.

As I said before, the V/f curve shows no difference to write home about - but that does not tell the whole story. The comparison of PHX2 vs. 6c Zen4 is almost in the margin of error. So far I have not seen any efficiency comparisons of Bergamo vs. Genoa with equal core count and frequency. If you have any, I would be very thankful. The Phoronix review only showed what everybody knows: More cores at the same TDP give better efficiency due to clocking slower.

Compare this to Apple: their small cores are significantly more efficient in terms of work per Joule. That is what I had hoped for from Zen4c as well, and that is what I hope for in future generations.

Abwx · Feb 12, 2024

BorisTheBlade82 said:
The comparison of PHX2 vs. 6c Zen4 is almost in the margin of error.

You're looking at it the wrong way. The comparison should be done at the same throughput, that is, at the same frequency, so look at the horizontal delta between the two chips: PHX1 needs 12 W to match PHX2 at 10 W. 20% better perf/Watt at the same throughput is hardly in the margin of error, and that's with two regular Zen 4 cores bringing down the chip's efficiency.

Saylick · Feb 12, 2024

Aptly timed C&C article on Zen 4 vs Zen 4c cores with their review of the ASUS ROG Ally:

AMD’s Mild Hybrid Strategy: Ryzen Z1 in ASUS’s ROG Ally

Editor’s Note: ASUS sent us the ROG Ally sample – our first review sample from a company – in order to test the Ryzen Z1 SOC inside the device.

chipsandcheese.com

BorisTheBlade82 · Feb 13, 2024

Abwx said:
You're looking at it the wrong way. The comparison should be done at the same throughput, that is, at the same frequency, so look at the horizontal delta between the two chips: PHX1 needs 12 W to match PHX2 at 10 W. 20% better perf/Watt at the same throughput is hardly in the margin of error, and that's with two regular Zen 4 cores bringing down the chip's efficiency.

Oh well, I had several long and close looks at this slide.
And sadly, the ISO-throughput comparison does not show 20%, but more like 13-15%. At least, the point where the horizontal line I drew crosses the Zen4 curve looks more like 11.4 W to me.

But I really do not want to be nit-picky. Yes, Zen4c seems a bit more power efficient, you are right in that. So let's just say, that I am a little disappointed because I had expected more.
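
The two read-offs being argued over can be checked with a few lines. A small sketch, where the only inputs are the wattages quoted in the posts above (10 W for PHX2, and 12 W or about 11.4 W for PHX1 at the same throughput); the function name is just for illustration.

```python
def perf_per_watt_gain(baseline_watts: float, candidate_watts: float) -> float:
    """Fractional perf/W advantage of the candidate when both chips deliver the same throughput."""
    return baseline_watts / candidate_watts - 1.0

PHX2_WATTS = 10.0
for phx1_watts in (12.0, 11.4):  # the two competing read-offs of the Zen 4 curve
    gain = perf_per_watt_gain(phx1_watts, PHX2_WATTS)
    print(f"PHX1 at {phx1_watts} W vs PHX2 at {PHX2_WATTS} W: {gain:.0%} better perf/W")
```

So the disagreement is really about where the horizontal line crosses the Zen 4 curve: 12 W gives 20%, 11.4 W gives 14%.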

FlameTail · Feb 13, 2024

HUGE: Intel-P, Apple-P, AMD Zen, Oryon Phoenix
BIG: Cortex X, AMD Zen C
Medium: Cortex A7xx, Apple-E, Intel-E
little: Cortex A5xx

How accurate is this classification?

marees · Nov 10, 2024

According to YouTuber Moore's Law Is Dead, Intel could already be working on the (future) abolition of the E-cores. The plan is to achieve a "Unified Core" architecture for the successors to "Nova Lake", nominally combining the E-cores with the P-cores. This could mean that there are no more E-cores at all, but it could also mean feature equality, as exists between Zen5 and Zen5c. According to the YouTuber, it is still unclear whether this will come directly after Nova Lake or only with the generation after that, for which the code names are still missing. In general, everything after Nova Lake is still very much up in the air at Intel and not yet finally decided, so changes are still possible at any time. Changes that have already been made are said to be the abandonment of the Royal Core project and thus also the dropping of the code names "Royal Core" and "Cobra Core" used for it. As usual, this should be taken with a pinch of salt, especially since long-term statements on roadmap projects are never a sure thing.

Intel Core Roadmap Update (October 2024)
(2025) Panther Lake = 18A / Cougar Cove P-Core (+5-13% IPC) / Darkmont E-Core
(2026) Nova Lake = 18A-P / Coyote Cove P-Core (+9-18% IPC) / Arctic Wolf E-Core
(2027+) ????? Lake = 14A-P / Griffin Cove P-Core (+10-20% IPC) / ????? E-Core
Griffin Cove could be a "Unified Core" that eliminates E-Cores from the roadmap, and at a minimum Griffin-Next is expected to be a "Unified" architecture. The "Unified Core" may get some of the architectural features Beast Lake was supposed to get, but which features it will get are not yet decided. In general, I have been warned that everything after Nova Lake is NOT finalized, and likely to have political battles fought over their final design choices. Cobra Core was canceled when Beast Lake was cancelled. It was "Royal Core 2.0"
Source: Moore's Law Is Dead @ YouTube on October 19, 2024

News of November 2/3, 2024 | 3DCenter.org

The data from the latest Steam Hardware Survey is currently being widely reported in the media; for October 2024 it shows a marked increase specifically in nVidia's 60-class models. However, even the first figures look unusual, because

www-3dcenter-org.translate.goog

Intel Griffin Cove CPUs in 2027+ with 'Unified Core' teased: E-Cores eliminated from desktop

Intel's next-gen Panther Lake (Intel 18A) drops in 2025, Nova Lake (Intel 18A-P) in 2026, next-gen CPU on Intel 14A-P with Griffin Cove P-Cores teased.

www.tweaktown.com

LightningZ71 · Nov 10, 2024

I don't really think that even Intel really wanted to do what they did with Alder Lake. It was certainly a reaction to competition with what they had in the tool box. They couldn't hit the performance/density ratio that AMD was getting on N7/N6/N5 with Intel 7 and I4/3 was late.

What will be interesting is whether they do a full divergence between client and server, or just make them mildly altered copies with added function units. As for client with homogeneous logic cores, they can still use FinFlex (or similar techs), targeted relaxed logic density and enlarged caches to have "Prime" cores as well as high-density throughput ones. There is an argument to be made that Arrow Lake might be a better product with 8 of its E-core clusters, where two of them were laid out as above and able to get to around 5 GHz. The ring bus would have half the stops that it has and could run at higher speeds. There would even be more die space for more L3 cache all around.

OneEng2 · Nov 10, 2024

Markfw said:
First my take. Intel needed a way to be competitive with AMD in multicore benchmarks and performance scenarios but were limited by space and power due to foundry limitations, so they created E-cores.

AMD wanted more cores with less power, so they reduced the speed to allow for more cores in a small space, and a few other changes, but essentially kept all the same capabilities, like avx-512.

My opinion is that AMDs solution is much easier to implement and much more sane and capable in todays world. Please give your thoughts, and this discussion is more about methods than companies.

Wow. I am really late to this thread, but I love the subject.

In order to achieve higher PERFORMANCE (IPC x Clock) you need a design that has a longer pipeline (to avoid misalignment at the end of each stage in ILP execution in superscalar designs) which makes the core bigger. Additionally, the type of transistor you use to achieve high clock speeds is different from the type you use to achieve power efficiency. Many of us remember the good ole Netburst years where Intel got out over their skis trying to get the clock speeds up at the expense of IPC and Power ... so there are definitely limits to this approach (and I think we have pretty much settled in now).

Ideally, you would have a longer, more complex P core that did this while having a less deeply pipelined core for the E core. I agree with others here that keeping the instructions compatible should DEFINITELY have been a design decision at Intel.

I think the above logic is how Intel got where it is today..... and I think they are wrong.

First, in order to compete in DC, you have to have high performance per CORE. Many DC software packages are licensed PER CORE, not per thread, and not within some performance metric.... per CORE. These software licenses are usually recurring on a yearly basis, so the pain never goes away. This makes the cost of the hardware a minor portion of the total cost of ownership.
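
A toy calculation makes the licensing argument concrete. Everything here is hypothetical: the throughput target, the per-core performance figures and the license price are made-up numbers for illustration, not vendor pricing.

```python
import math

def cores_and_license_bill(target_perf: float, perf_per_core: float, license_per_core: float):
    """Cores needed to reach a throughput target, and the yearly per-core license bill for them."""
    cores = math.ceil(target_perf / perf_per_core)
    return cores, cores * license_per_core

# All numbers below are hypothetical.
TARGET_PERF = 1000.0        # arbitrary throughput units the deployment must deliver
LICENSE_PER_CORE = 2000.0   # yearly license fee per core, arbitrary currency

for name, perf_per_core in (("fast core", 10.0), ("dense core", 6.0)):
    cores, bill = cores_and_license_bill(TARGET_PERF, perf_per_core, LICENSE_PER_CORE)
    print(f"{name}: {cores} cores, {bill:,.0f} per year in licenses")
```

With these made-up inputs the slower core needs 167 licenses instead of 100, and that gap recurs every year, which is the sense in which hardware cost becomes a minor part of the total.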

In a DC product, performance is limited by the power you can draw from a single socket. Turin's socket power limit is 700W, I believe. This brings up the 2nd big design decision that must be made. In DC, performance per Watt is very important since the design will ultimately be power bound (you can always just tack on more cores in a socket).

More advanced processors are superscalar (can execute many instructions in parallel). To maximize peak performance, the number of execution units must be much bigger than the average need for execution units. This leaves a lot of performance in a MT load just sitting around. To maximize performance per Core, and performance per watt, and performance per area (the latter being a lesser concern in DC), you need SMT .... but SMT greatly complicates the core design from the load to the dispatch and everything in between.... but it pays great dividends for the reasons I mentioned above AND the fact that in MT loads, SMT gets you 40% (for AMD) more performance for ~15% more transistors, and a trivial amount of power (it is nearly free in power I think).
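
Taking the figures in the paragraph above at face value (about 40% more MT performance for about 15% more transistors, both the poster's estimates), the return on those transistors works out as follows. A minimal sketch; the function name is only illustrative.

```python
def perf_per_transistor_gain(perf_gain: float, transistor_cost: float) -> float:
    """Relative change in performance per transistor when a feature adds
    `perf_gain` performance for `transistor_cost` extra transistors (both as fractions)."""
    return (1.0 + perf_gain) / (1.0 + transistor_cost) - 1.0

# 40% and 15% are the estimates quoted in the post, not measured values.
print(f"{perf_per_transistor_gain(0.40, 0.15):.1%} more MT performance per transistor")
```

That is roughly a 22% improvement in MT performance per transistor, if the quoted estimates hold.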

Now, I would argue, that if you have all these features in your E Core (which from my above arguments, you really do need), is it really an E core.... or is it just a thinned down P core? This is what AMD has implemented.

Now, I do believe that there are some loads where a max core count, with each core having lower peak performance, makes sense .... thus Turin D. In these cases (where you can avoid the huge cost of annual per core licensing) you can consider a much smaller and simpler core design ..... then just pack a metric crap ton of them on a socket ..... BUT, you are still limited by the socket power, so these cores MUST still be very high performance per watt or it doesn't make sense.

Everyone here marvels at Skymont's performance per area; however, I am really curious to see how its performance per watt compares to Zen 5c ... because in DC, who cares how big the die is? They are more than willing to pay for it.
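
The power-bound argument can be sketched with a deliberately crude model. Assumptions, all of them simplifications: perf/W is treated as a constant per core design (in reality it improves as clocks drop), the 700 W limit is the figure the post recalls for Turin, and the core counts, per-core wattage and perf/W values are hypothetical.

```python
def socket_throughput(cores: int, watts_per_core: float, perf_per_watt: float,
                      socket_limit_watts: float) -> float:
    """Throughput of a socket whose total core power is capped at the socket limit.
    Crude model: perf/W is constant, so throughput is simply power actually drawn x perf/W."""
    power_drawn = min(cores * watts_per_core, socket_limit_watts)
    return power_drawn * perf_per_watt

SOCKET_LIMIT_WATTS = 700.0  # figure recalled in the post above
for cores in (64, 128, 256):  # hypothetical core counts at a hypothetical 5 W per core
    for perf_per_watt in (1.0, 1.3):  # hypothetical efficiency of two core designs
        perf = socket_throughput(cores, 5.0, perf_per_watt, SOCKET_LIMIT_WATTS)
        print(f"{cores:>3} cores, perf/W {perf_per_watt}: {perf:.0f} units")
```

Once the cap is hit, adding cores stops adding throughput in this model, and only the perf/W of the core design moves the result.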

Finally, why have I fixated on DC? The simple answer is that this is where the highest growth and highest margins are (product management 101).

Can you just have different design teams for each market segment for processors? Sure, if you are the US government and have no need of profit. If you are in free market competition, it isn't enough to have the best product in every space, if the price of the product exceeds the market price (and you thus go bankrupt).

There you are. My 20 cents (I wrote too much to consider it 2 cents worth).

NostaSeronx · Nov 10, 2024

OneEng2 said:
In order to achieve higher PERFORMANCE (IPC x Clock) you need a design that has a longer pipeline (to avoid misalignment at the end of each stage in ILP execution in superscalar designs) which makes the core bigger. Additionally, the type of transistor you use to achieve high clock speeds is different from the type you use to achieve power efficiency. Many of us remember the good ole Netburst years where Intel got out over their skis trying to get the clock speeds up at the expense of IPC and Power ... so there are definitely limits to this approach (and I think we have pretty much settled in now).

The increased pipeline was to reduce work time per stage. The reduction also reduced the size of the core given the same frequency.
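
The "less work per stage" point can be illustrated with the usual textbook cycle-time model. This is a sketch under stated assumptions: the total logic delay and the per-stage latch overhead are hypothetical values, the logic is assumed to split evenly across stages, and the model says nothing about the core-size claim.

```python
def max_clock_ghz(logic_delay_ns: float, stages: int, latch_overhead_ns: float) -> float:
    """Clock ceiling when the total logic delay is split evenly over `stages`,
    with each stage also paying a fixed latch/skew overhead."""
    cycle_ns = logic_delay_ns / stages + latch_overhead_ns
    return 1.0 / cycle_ns

# Hypothetical delays, chosen only to show the shape of the curve.
LOGIC_DELAY_NS = 4.0
LATCH_OVERHEAD_NS = 0.05
for stages in (10, 20, 40):
    print(f"{stages} stages -> {max_clock_ghz(LOGIC_DELAY_NS, stages, LATCH_OVERHEAD_NS):.2f} GHz")
```

Doubling the stage count from 10 to 20 nearly doubles the clock ceiling here, but going from 20 to 40 gains less, because the fixed per-stage overhead starts to dominate the cycle time.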

Not sure about Pentium 4, but other same-era "speed-demon" processor cores focused on slow transistors rather than fast transistors. Pretty much every design reduced the amount of HP/LVT transistors in favor of large amounts of LL/LP/RVT transistors.

Micro Magic, Inc in July 2023 did this: "Now Micro Magic announces a Quad-Core multiprocessor capable of 46,000 CoreMarks at 5GHz."
[It intends to offer the RISC-V core to customers using an IP licensing model.]

The above is likely using ultra-fast/fast transistors purely to get into the NTC/ULV region: 0.4V @ 1.5 GHz and 0.35V @ 1 GHz. The scaling efficiency also drops hard, to the point of taking 0.7V to get to 4+ GHz-ish.

So, speed-demon/wire-speed style processors will come out again. Rather than being premium-target they are likely to be bottom-target instead. Where large OoO cores are a premium and small UHF cores aren't.

~~~~
Pentium 4: 400 MT/s to 1066 MT/s, ~150 MB/s SATA HDD
....
LPDDR5x : 9200 MT/s and LPDDR6 : 14400 MT/s, 7877 -> 14,000+ MB/s NVMe NAND. We aren't even in the same era where the external factor is a problem.

Pentium 4 vs III and K8+K9(____hounds)
ROB-like structure: 126-entry, 40-entry, 72-entry
Scheduler-like structure: 76+41/64+32+32~entries, 20-entry, 8+8+8/36~entries
Register file structure: 128+128, no need for III, 44+120~entries

The general issue for older speed-demons is that they were stuck with large OoO structures, as the bottleneck for them was RAM+Storage bandwidth.
BP: 8KB for P4, 4KB for K7/K8, where a modern processor can use 0.25KB for the same misprediction rate as those processors.

As wire density increases and some complexity becomes free to use, there is no real limit for speed-demons in the modern era.

GTracing · Nov 11, 2024

My take is somewhere in the middle.

Having two separate architectures requires more engineering resources. As outsiders we don't have a lot of insight into how much time and effort Intel spends on their e-cores vs how much AMD spends on their dense cores. Intel's strategy definitely requires more, but how much? I can't say. Though Arrow Lake might not have had as many issues if fewer people were working on Skymont.

I think the best solution is an architecture that can remove or shrink different blocks in the core. This doesn't increase energy efficiency as much since there's some things you can't change this way (pipeline length for one), but it's a good middle ground between area efficiency and engineering time.

AMD has a bit of this with the shrunk SIMD pipelines in Strix Point. I bet that AMD will have more of that as time goes on.

Likewise, I don't see Intel maintaining completely separate designs if/when they come out with a unified core. As much as I like Skymont, I think its inclusion on desktop CPUs is a crutch to make up for Lion Cove's horrible area efficiency. If you have a good p-core, then e-cores don't make sense in high performance desktop processors.

I think separate architectures make more sense for Apple, ARM, and Qualcomm since they have chips in phones and are targeting higher power efficiency.

igor_kavinski · Nov 11, 2024

GTracing said:
Having two separate architectures requires more engineering resources.

It also gives the company more flexibility in terms of product offerings. Intel's mistake was shoehorning E-cores into a design meant to compete against AMD's powerful P-cores. It's a mistake they refused to admit, just like when they refused to admit their Pentium 4 mistake and kept churning out space heaters for years. A 32 or even 40 core Gracemont core based SKU meant for the MT workload crowd would've been heralded as a game changer for creatives and professionals. Intel has tons of interesting technology but they can't for the life of them figure out how best to utilize it. They get behind braindead ideas due to the political influence of some high ranking people in their midst and end up betting the whole company on years of dumb execution.

Meteor Late · Nov 11, 2024

GTracing said:
My take is somewhere in the middle.

Having two separate architectures requires more engineering resources. As outsiders we don't have a lot of insight into how much time and effort Intel spends on their e-cores vs how much AMD spends on their dense cores. Intel's strategy definitely requires more, but how much? I can't say. Though Arrow Lake might not have had as many issues if fewer people were working on Skymont.

I think the best solution is an architecture that can remove or shrink different blocks in the core. This doesn't increase energy efficiency as much since there's some things you can't change this way (pipeline length for one), but it's a good middle ground between area efficiency and engineering time.

AMD has a bit of this with the shrunk SIMD pipelines in Strix Point. I bet that AMD will have more of that as time goes on.

Likewise, I don't see Intel maintaining completely separate designs if/when they come out with a unified core. As much as I like Skymont, I think its inclusion on desktop CPUs is a crutch to make up for Lion Cove's horrible area efficiency. If you have a good p-core, then e-cores don't make sense in high performance desktop processors.

I think separate architectures make more sense for Apple, ARM, and Qualcomm since they have chips in phones and are targeting higher power efficiency.

This is a misconception. For example, Oryon M (smaller core) is clearly less efficient than Oryon L, except maybe at the lowest end of the curve.
Really, smaller cores are about performance per area, used mainly to boost MT performance for cheap. Qualcomm is not doing anything different than what Intel is doing with Skymont with their M core.

GTracing · Nov 11, 2024

Meteor Late said:
This is a misconception. For example, Oryon M (smaller core) is clearly less efficient than Oryon L, except maybe at the lowest end of the curve.
Really, smaller cores are about performance per area, used mainly to boost MT performance for cheap.

What misconception did I make?

Meteor Late · Nov 11, 2024

GTracing said:
What misconception did I make?

That in phones they are targeting higher power efficiency with the smaller cores.

GTracing · Nov 11, 2024

Meteor Late said:
That in phones they are targeting higher power efficiency with the smaller cores.

Apple's and ARM's little cores are more power efficient than their big cores, especially Apple's.

The Apple A15 SoC Performance Review: Faster & More Efficient

www.anandtech.com

alcoholbob · Nov 11, 2024

In multicore workloads (14+ threads), when E-cores are overclocked to 5 GHz on Arrow Lake, the all-core P-core clock speed drops down to around the same frequency. It appears the P-cores aren't much faster than the E-cores on a per-clock basis in these workloads, so my guess is that if Intel can clock up the E-cores, they could in theory get rid of the P-cores pretty soon if they wanted to.

Even in gaming when users set their 285Ks to 1+16, it seems pretty competitive with stock (8+16 configs) in fps.

FlameTail · Nov 11, 2024

GTracing said:
Apple's and ARM's little cores are more power efficient than their big cores, especially Apple's.

*energy-efficient

OneEng2 · Nov 13, 2024

511 said:
Lion Cove is the biggest waste of sand in ARL silicon

So you believe Arrow Lake would be okay without Lion Cove?

Hulk said:
But then again, if ARL was 4+24 the ST performance would be basically the same, lightly threaded apps would have performed basically the same because of the narrow gap performance-wise between Lion Cove and Skymont, but Intel would have had a massive feather in their cap because MT performance in apps that could have utilized all threads would have been through the roof. Over 2800 CB R24 MT.

Go for a win (MT) and a loss (ST) rather than a tie (MT) and loss (ST). They could have also said they focused on Skymont and MT performance because that is where software is heading. They will work on ST in the next iteration. It would at least have made some sense.

Did they win in MT? On the desktop and laptop maybe, but in DC?

Meteor Late said:
This is a misconception. For example, Oryon M (smaller core) is clearly less efficient than Oryon L, except maybe at the lowest end of the curve.
Really, smaller cores are about performance per area, used mainly to boost MT performance for cheap. Qualcomm is not doing anything different than what Intel is doing with Skymont with their M core.

Adding a higher count of lower-performance cores in DC doesn't work at all for software licensing reasons. The die space is cheap in comparison.

Hulk · Nov 13, 2024

OneEng2 said:
So you believe Arrow Lake would be okay without Lion Cove?

Did they win in MT? On the desktop and laptop maybe, but in DC?

Adding a higher count of lower-performance cores in DC doesn't work at all for software licensing reasons. The die space is cheap in comparison.

I don't know; MT is very close depending on the app. It also seems like Zen 5 has more frequency headroom. Like I said, it's close. Zen 5 is a master stroke.

Hulk · Nov 13, 2024

Are these statements accurate regarding throughput, or what many here call "IPC" in overall performance?

Lion Cove is slightly better than Zen 5 in integer.
Zen 5 is significantly better than Lion Cove in floating point.

Skymont and Raptor Cove are about equal in integer and significantly better than Gracemont.
Skymont and Raptor Cove are about equal in floating point and massively better than Gracemont.

Zen 5 is a little better than Skymont in integer.
Zen 5 is significantly better than Skymont in floating point.

OneEng2 · Nov 13, 2024

Hulk said:
I don't know; MT is very close depending on the app. It also seems like Zen 5 has more frequency headroom. Like I said, it's close. Zen 5 is a master stroke.

The processor overall is as you outlined, but a single core of Zen5 is much higher performance in MT than a single Skymont or Lion Cove core. This is very important in DC.

Hulk said:
Are these statements accurate regarding throughput, or what many here call "IPC" in overall performance?

Lion Cove is slightly better than Zen 5 in integer.
Zen 5 is significantly better than Lion Cove in floating point.

Skymont and Raptor Cove are about equal in integer and significantly better than Gracemont.
Skymont and Raptor Cove are about equal in floating point and massively better than Gracemont.

Zen 5 is a little better than Skymont in integer.
Zen 5 is significantly better than Skymont in floating point.

I think so, but I don't believe IPC is as comprehensive as performance per core. In MT for DC I believe this is the most important metric.

In laptop and desktop where die cost is very important, performance per area seems like a better metric.

I am not sure how important IPC is compared to these metrics.
