Info TOP 20 of the World's Most Powerful CPU Cores - IPC/PPC comparison

Richie Rich · May 15, 2020

Added cores:

A53 - little core used in some low-end smartphones in 8-core config (Snapdragon 450)
A55 - used as little core in every modern Android SoC
A72 - "high" end Cortex core used in Snapdragon 625 or Raspberry Pi 4
A73 - "high" end Cortex core
A75 - "high" end Cortex core
Bulldozer - infamous AMD core

Geekbench 5.1 PPC chart 6/23/2020:

Pos	Man	CPU	Core	Year	ISA	GB5 Score	GHz	PPC (score/GHz)	Relative to 9900K	Relative to Zen3
1	Nuvia	(Est.)	Phoenix (Est.)	2021	ARMv9.0	2001	3.00	667.00	241.0%	194.1%
2	Apple	A15 (est.)	(Est.)	2021	ARMv9.0	1925	3.00	641.70	231.8%	186.8%
3	Apple	A14 (est.)	Firestorm	2020	ARMv8.6	1562	2.80	558.00	201.6%	162.4%
4	Apple	A13	Lightning	2019	ARMv8.4	1332	2.65	502.64	181.6%	146.3%
5	Apple	A12	Vortex	2018	ARMv8.3	1116	2.53	441.11	159.4%	128.4%
6	ARM Cortex	V1 (est.)	Zeus	2020	ARMv8.6	1287	3.00	428.87	154.9%	124.8%
7	ARM Cortex	N2 (est.)	Perseus	2021	ARMv9.0	1201	3.00	400.28	144.6%	116.5%
8	Apple	A11	Monsoon	2017	ARMv8.2	933	2.39	390.38	141.0%	113.6%
9	Intel	(Est.)	Golden Cove (Est.)	2021	x86-64	1780	4.60	386.98	139.8%	112.6%
10	ARM Cortex	X1	Hera	2020	ARMv8.2	1115	3.00	371.69	134.3%	108.2%
11	AMD	5900X (Est.)	Zen 3 (Est.)	2020	x86-64	1683	4.90	343.57	124.1%	100.0%
12	Apple	A10	Hurricane	2016	ARMv8.1	770	2.34	329.06	118.9%	95.8%
13	Intel	1065G7	Icelake	2019	x86-64	1252	3.90	321.03	116.0%	93.4%
14	ARM Cortex	A78	Hercules	2020	ARMv8.2	918	3.00	305.93	110.5%	89.0%
15	Apple	A9	Twister	2015	ARMv8.0	564	1.85	304.86	110.1%	88.7%
16	AMD	3950X	Zen 2	2019	x86-64	1317	4.60	286.30	103.4%	83.3%
17	ARM Cortex	A77	Deimos	2019	ARMv8.2	812	2.84	285.92	103.3%	83.2%
18	Intel	9900K	Coffee LakeR	2018	x86-64	1384	5.00	276.80	100.0%	80.6%
19	Intel	10900K	Comet Lake	2020	x86-64	1465	5.30	276.42	99.9%	80.5%
20	Intel	6700K	Skylake	2015	x86-64	1032	4.00	258.00	93.2%	75.1%
21	ARM Cortex	A76	Enyo	2018	ARMv8.2	720	2.84	253.52	91.6%	73.8%
22	Intel	4770K	Haswell	2013	x86-64	966	3.90	247.69	89.5%	72.1%
23	AMD	1800X	Zen 1	2017	x86-64	935	3.90	239.74	86.6%	69.8%
24	Apple	A13	Thunder	2019	ARMv8.4	400	1.73	231.25	83.5%	67.3%
25	Apple	A8	Typhoon	2014	ARMv8.0	323	1.40	230.71	83.4%	67.2%
26	Intel	3770K	Ivy Bridge	2012	x86-64	764	3.50	218.29	78.9%	63.5%
27	Apple	A7	Cyclone	2013	ARMv8.0	270	1.30	207.69	75.0%	60.5%
28	Intel	2700K	Sandy Bridge	2011	x86-64	723	3.50	206.57	74.6%	60.1%
29	ARM Cortex	A75	Prometheus	2017	ARMv8.2	505	2.80	180.36	65.2%	52.5%
30	ARM Cortex	A73	Artemis	2016	ARMv8.0	380	2.45	155.10	56.0%	45.1%
31	ARM Cortex	A72	Maya	2015	ARMv8.0	259	1.80	143.89	52.0%	41.9%
32	Intel	E6600	Core2	2006	x86-64	338	2.40	140.83	50.9%	41.0%
33	AMD	FX-8350	BD	2011	x86-64	566	4.20	134.76	48.7%	39.2%
34	AMD	Phenom 965 BE	K10.5	2006	x86-64	496	3.70	134.05	48.4%	39.0%
35	ARM Cortex	A57 (est.)	Atlas	0	ARMv8.0	222	1.80	123.33	44.6%	35.9%
36	ARM Cortex	A15 (est.)	Eagle	0	ARMv7 32-bit	188	1.80	104.65	37.8%	30.5%
37	AMD	Athlon 64 X2 3800+	K8	2005	x86-64	207	2.00	103.50	37.4%	30.1%
38	ARM Cortex	A17 (est.)		0	ARMv7 32-bit	182	1.80	100.91	36.5%	29.4%
39	ARM Cortex	A55	Ananke	2017	ARMv8.2	155	1.60	96.88	35.0%	28.2%
40	ARM Cortex	A53	Apollo	2012	ARMv8.0	148	1.80	82.22	29.7%	23.9%
41	Intel	Pentium D	P4	2005	x86-64	228	3.40	67.06	24.2%	19.5%
42	ARM Cortex	A7 (est.)	Kingfisher	0	ARMv7 32-bit	101	1.80	56.06	20.3%	16.3%

TOP 10 - Performance Per Area comparison at ISO-clock (PPA/GHz)

Copied from locked thread. They try to avoid people to see this comparison how x86 is so bad.[/B]

Pos	Man	CPU	Core	Core Area mm2	Year	ISA	SPEC PPA/Ghz	Relative
1	ARM Cortex	A78	Hercules	1.33	2020	ARMv8	9.41	100.0%
2	ARM Cortex	A77	Deimos	1.40	2019	ARMv8	8.36	88.8%
3	ARM Cortex	A76	Enyo	1.20	2018	ARMv8	7.82	83.1%
4	ARM Cortex	X1	Hera	2.11	2020	ARMv8	7.24	76.9%
5	Apple	A12	Vortex	4.03	2018	ARMv8	4.44	47.2%
6	Apple	A13	Lightning	4.53	2019	ARMv8	4.40	46.7%
7	AMD	3950X	Zen 2	3.60	2019	x86-64	3.02	32.1%

It's impressive how fast are evolving the generic Cortex cores:

A72 (2015) which can be found in most SBC has 1/3 of IPC of new Cortex X1 - They trippled IPC in just 5 years.
A73 and A75 (2017) which is inside majority of Android smart phones today has 1/2 IPC of new Cortex X1 - They doubled IPC in 3 years.

Comparison how x86 vs. Cortex cores:

A75 (2017) compared to Zen1 (2017) is loosing massive -34% PPC to x86. As expected.
A77 (2019) compared to Zen2 (2018) closed the gap and is equal in PPC. Surprising. Cortex cores caught x86 cores.
X1 (2020) is another +30% IPC over A77. Zen3 need to bring 30% IPC jump to stay on par with X1.

Comparison to Apple cores:

AMD's Zen2 core is slower than Apple's A9 from 2015.... so AMD is 4 years behind Apple
Intel's Sunny Cove core in Ice Lake is slower than Apple's A10 from 2016... so Intel is 3 years behind Apple
Cortex A77 core is slower than Apple's A9 from 2015.... but
New Cortex X1 core is slower than Apple's A11 from 2017 so ARM LLC is 3 years behind Apple and getting closer

GeekBench5.1 comparison from 6/22/2020:

added Cortex X1 and A78 performance projections from Andrei here
2020 awaiting new Apple A14 Firestorm core and Zen3 core

Updated:

EDIT:
Please note to stop endless discussion about PPC frequency scaling: To have fair and clean comparison I will use only the top (high clocked) version from each core as representation for top performance.

Doug S · Jul 1, 2020

Carfax83 said:
Serious question. Do you know of a single high core count CPU (8 cores and greater) that has had a massive L2 cache attached to it without a L3?

Why is that relevant? Apple has an L3, so I'm unclear why you are looking for an example without one. When you ask "do you know of a single high core count CPU (8 cores and greater)" without even mentioning any qualification on cache sizes you have already limited discussion to a handful of designs, all within the recent past, because it is only quite recently that it has been possible to fit 8 high performance cores on a single chip.

But you can look at something like POWER6 which had only two cores per chip but was designed to scale up to 64 cores in a single system. It had large caches like Apple's - 64K/64K L1 and 4 MB L2 per core, as well as a 32 MB L3 shared between the two cores. So I guess it fits your strange qualification of "no L3" as there is no global cache shared among the up to 32 individual dual core CPUs. They did add a global L4 cache made from eDRAM on a future iteration I believe.

JoeRambo · Jul 1, 2020

Doug S said:
I'm not claiming you can cut and paste additional cores in, but adding cores is a pretty well understand problem. The hard part is going from one to two, once you've accomplished that adding a third or a 33rd core is a lot easier.

That is not the case unfortunately. Two cores sharing a blob of L2 can use it for coherency and be done with it, but You just can't keep scaling this architecture without nasty tradeoffs. Sooner or later you need to move to private L2 per core + keep tags what's in said core + caches complex somehow and connect it further to the chip.
Then there is a question of inclusivity and so on. For example with 2-3 cores with 1-2MB of L2 each, you can probably get away with 8-16MB of L3 and use inclusive L3 to help with coherency. Can't really scale it much further, as L3 runs out of space, and even when ignoring size constraints, cores start evicting each others data.

Apple is of course much further advanced in multi core integration, cause they have Big.little with different core clusters and also system cache ( L3ish ) that also deals with GPU. But claiming that just because they have L1, L2, L3 caches in their chip, they are ready to scale to desktop and server is not correct. P4 EE also had 2MB of L3 back in 2000's and came from Xeon origins with coherency and snooping sorted out by FSB and memory. We all know how it performed versus proper chips

Richie Rich · Jul 1, 2020

Carfax83 said:
You keep forgetting that clock frequency is the other half of the equation for CPU performance. Intel and AMD obviously see the value in designing a more balanced architecture than just focusing on width like the ARM CPUs do.

x86 cores are not more balanced. It's high frequency focused while sacrificing a lot of IPC (P4 like).
Apple's cores are wider and slower just like K8. Which design won? That better one.

But you are right. ARM mobile SoC are not balanced either. We know that ARM Cortex A72 manufactured on high performance process version was able to manage 4.2 GHz (from 1.5 GHz in Raspberry Pi 4). As soon as somebody will manufacture ARM cores on HP process and clock it to 4.5 GHz at 5nm and with this huge IPC advantage will destroy x86.

What is more easier?

- manufacture high IPC core at HP process and clock it higher, or
- rewrite libraries, invent new techniques allowing higher IPC, redesign pipeline and whole core?

ARM and Apple did the hard work already because it's necessary for ultra-low power smart phone market. Intel and AMD has the hard work in front of them still. There no doubt x86 will loose majority in servers in 2 years.

The open question is: Will Intel and AMD be able to adapt before they go to bankruptcy?

DrMrLordX · Jul 1, 2020

Doug S said:
Why is that relevant? Apple has an L3

Well they have an SLC which is the same but different. Also if I recall correctly, A13 (or at least A12) uses part of the shared L2 to maintain coherency between core clusters. Not the SLC.

Richie Rich · Jul 1, 2020

Don't you want wait two year for DDR5 in desktop and you want DDR5 in desktop now? You have to buy Chinese OnePlus 8, connect via USB-C -> HDMI cable to monitor and KB&M and you are ready to go. 4x Cortex A77 will beat most laptop ultra low power x86 CPUs at 15W TDP as a bonus

The OnePlus 8, OnePlus 8 Pro Review: Becoming The Flagship

www.anandtech.com

x86 and desktops are lightyears retarded that even Chinese phone makers are two years ahead. That's embarrassing.

DrMrLordX · Jul 1, 2020

Richie Rich said:
x86 and desktops are lightyears retarded that even Chinese phone makers are two years ahead. That's embarrassing.

. . . that's LPDDR5.

insertcarehere · Jul 1, 2020

Carfax83 said:
I agree, but it must be said that Intel's failed 10nm process didn't do x86 any favors. If Intel had been successful with 10nm, one might wonder whether we would even be having this conversation.

AMD is using the same fabs and (functionally) nodes as Apple and Zen 2 still requires lots of frequency to match Apple's core in performance, I fail to see how Intel not botching 10nm suddenly means Sunny Cove doesn't suffer the same fate.

name99 · Jul 1, 2020

Carfax83 said:
You keep forgetting that clock frequency is the other half of the equation for CPU performance. Intel and AMD obviously see the value in designing a more balanced architecture than just focusing on width like the ARM CPUs do.

And how is that working out for them?
Width (and more generally brainiac design) is more sustainable going forward. This has been clear since at least when Apple started down this path (which is why, from day one practically, Apple went all-in on width).
The speed demons can only maintain those frequencies for brief spurts, and they're having to dial back the frequencies or, best case, hold them constant, as processes get smaller. How is that a sensible horse on which to bet?

The issue is not "Apple/ARM is unbalanced and x86 has the balance right"; the issue is "Apple/ARM have the balance correct for the world of tomorrow; meanwhile x86 has optimized for today but they have nowhere to go on the frequency front"...
(And, BTW, that obsession with frequency means x86 lost mobile, then tablets and IoT, and now they're starting to lose the stack above mobile -- both on desktop starting with Apple, and in data warehouse, where those 5GHz frequencies are already meaningless because they generate so much heat.)

name99 · Jul 1, 2020

Carfax83 said:
Serious question. Do you know of a single high core count CPU (8 cores and greater) that has had a massive L2 cache attached to it without a L3?

As I said before, I'm not a chip architect and I don't even work in the tech industry. But I follow the industry fairly closely and I can't recall a single high core count CPU using that cache hierarchy.

Just because something is possible, doesn't mean it makes sense or should be done. I don't doubt that a high core count CPU with an enormous L2 cache and no L3 could be designed, but would it be as effective as the multilevel cache systems that Intel, AMD and IBM use?

I don't think so.

I'm not sure where you are going with this. You know that the A13 has a 16MB System Cache, right? It's called a system cache because a substantial part of its role is communication between different units (ie GPU, NPU, ISP, CPUs, ...) but it's performing essentially the same role as an L3 would perform if, say, you replaced each of the GPU, NPU, ISP, ... with a CPU block consisting of, let's say, 4 CPUs connected to an L3.

In other words, that functionality is ALREADY THERE. Why are you then so convinced that imposing it will (in some vague and unspecified way) cripple performance?

name99 · Jul 1, 2020

JoeRambo said:
That is not the case unfortunately. Two cores sharing a blob of L2 can use it for coherency and be done with it, but You just can't keep scaling this architecture without nasty tradeoffs. Sooner or later you need to move to private L2 per core + keep tags what's in said core + caches complex somehow and connect it further to the chip.
Then there is a question of inclusivity and so on. For example with 2-3 cores with 1-2MB of L2 each, you can probably get away with 8-16MB of L3 and use inclusive L3 to help with coherency. Can't really scale it much further, as L3 runs out of space, and even when ignoring size constraints, cores start evicting each others data.

Apple is of course much further advanced in multi core integration, cause they have Big.little with different core clusters and also system cache ( L3ish ) that also deals with GPU. But claiming that just because they have L1, L2, L3 caches in their chip, they are ready to scale to desktop and server is not correct. P4 EE also had 2MB of L3 back in 2000's and came from Xeon origins with coherency and snooping sorted out by FSB and memory. We all know how it performed versus proper chips

Some questions to ask yourself:
- What do you think Apple MEANS when they keep talking about "Unified Memory Architecture"?
- How do the various devices that participate in this Unified Memory Architecture, like ISP, GPU, NPU, media blocks, maintain coherence today?
- How is that problem any different from the problem you assert that they still have to solve?

Like so many x86 folk, you don't know computing, all you know is x86, and you assume the x86 way is the only way of doing things. Things like inclusivity are not a fact about computing, they are a particular design choice made by INTEL (not even AMD). And not an especially good design choice (like so many of Intel's choices, it was a cheapo choice that looks good in the short run but boxes you in for the long run).
Apple's cache hierarchy today does not care about strict inclusivity/exclusivity, and that will probably stay the case going forward -- once you've done the work to get that generality correct, why give it up?

Doug S · Jul 1, 2020

name99 said:
And how is that working out for them?
Width (and more generally brainiac design) is more sustainable going forward. This has been clear since at least when Apple started down this path (which is why, from day one practically, Apple went all-in on width).
The speed demons can only maintain those frequencies for brief spurts, and they're having to dial back the frequencies or, best case, hold them constant, as processes get smaller. How is that a sensible horse on which to bet?

I think it has been clear a lot longer than that. The first Alpha chips were the original "speed demons" that caused the coin to be termed, hitting 200 MHz when most RISCs as well as Intel's CPUs clocked only a third as fast. But we can look at their evolution, and see they became wider a lot faster than the clock was increased because they quickly realized that was the path to greater performance. The never-released 21464 was to be 8 way superscalar, wider than anything even today.

It was shown again by Intel itself with the P4, where they intended to pursue frequency to 10 GHz, but were forced to throw in the towel on that path and start over with their "mobile" architecture (which in turn directly traced its lineage to the Pentium Pro) they had been using on laptops since the P4 proved unsuitable there before it flamed out on the desktop/server.

LightningZ71 · Jul 1, 2020

Just like going for higher and higher clocks has diminishing returns, so does going wider and wider. Eventually, you get to a point where you've gone so wide that you've reached a point where it would have been better to stay at a slightly simpler design and just go for two separate cores instead. I've seen theoretical design work that shows that a fully provisioned core that provides for the most effective SMT4 implementation is over twice as many transistors as two separate cores that do a great job for SMT2 and can reach higher frequencies. If you're primary focus is on thermal and energy efficiency, then you have a hard cap on your clocks in ANY lithography node (though, it varies from node to node). If you have a hard frequency cap, then you're only choices are to go wider per core, or go with more cores. Eventually, you get to a point that the OS can no longer handle the overhead of managing higher and higher numbers of cores in an efficient manner, and you get to a cap on cores. You also get to a point that adding more and more cores makes the internal wiring of the CPU overly complicated and inefficient, wasting too much die space and energy/heat in the process, so you hit a ceiling there, and that applies on a varying basis to each node. The other issue with going wider and wider on an individual core is the front end. The more execution units that you install, the more complicated you have to make the front end to managing choosing and dispatching to each path. At SMT4, this is not trivial and requires a lot of deep buffering. Having a lower frequency makes it somewhat easier to keep the different paths filled with instructions as it gives more time to the front end to sort things, but, there's still a limit.

In the end, everyone went to the path that made the most sense for what they were working with. x86 is a mess of an instruction set. It takes a lot to make it perform well. No one is arguing that point. ARM is, by its very nature, somewhat simpler and easier to manage in instruction decode/dispatch. Going wider on x86 won't net you the same gains that going wider on ARM does. It also has significant costs above and beyond what it does on ARM just due to the front end differences.

Richie Rich · Jul 1, 2020

LightningZ71 said:
Just like going for higher and higher clocks has diminishing returns, so does going wider and wider. Eventually, you get to a point where you've gone so wide that you've reached a point where it would have been better to stay at a slightly simpler design and just go for two separate cores instead. I've seen theoretical design work that shows that a fully provisioned core that provides for the most effective SMT4 implementation is over twice as many transistors as two separate cores that do a great job for SMT2 and can reach higher frequencies. If you're primary focus is on thermal and energy efficiency, then you have a hard cap on your clocks in ANY lithography node (though, it varies from node to node). If you have a hard frequency cap, then you're only choices are to go wider per core, or go with more cores. Eventually, you get to a point that the OS can no longer handle the overhead of managing higher and higher numbers of cores in an efficient manner, and you get to a cap on cores. You also get to a point that adding more and more cores makes the internal wiring of the CPU overly complicated and inefficient, wasting too much die space and energy/heat in the process, so you hit a ceiling there, and that applies on a varying basis to each node. The other issue with going wider and wider on an individual core is the front end. The more execution units that you install, the more complicated you have to make the front end to managing choosing and dispatching to each path. At SMT4, this is not trivial and requires a lot of deep buffering. Having a lower frequency makes it somewhat easier to keep the different paths filled with instructions as it gives more time to the front end to sort things, but, there's still a limit.

In the end, everyone went to the path that made the most sense for what they were working with. x86 is a mess of an instruction set. It takes a lot to make it perform well. No one is arguing that point. ARM is, by its very nature, somewhat simpler and easier to manage in instruction decode/dispatch. Going wider on x86 won't net you the same gains that going wider on ARM does. It also has significant costs above and beyond what it does on ARM just due to the front end differences.

You are way too pessimistic. We didn't reach any cap/wall you mention here. Even the frequency cap is limit only for x86 world. ARM chips goes faster and faster with every new node. But x86 is stuck at 5GHz since Sandy Brigde. And the true reason is less smart design of those x86 CPUs. ARM designer has to think twice before spending any additional transistors.

A77 ........... +20% IPC in SPECint2006 ......... +17% transistors
A78 ........... +7% IPC ........................................ - 5% transistors
Ice Lake .... +18% IPC ................................... + 38% transistors

Bad and inefficient design. Nothing else.

coercitiv · Jul 1, 2020

Richie Rich said:
Even the frequency cap is limit only for x86 world.

Here we go.

LightningZ71 · Jul 1, 2020

Wow Richie... Just wow. The frequency cap of any given node (for each architecture) is related to the maximum stable operating frequency of the slowest critical path. Its is instruction set agnostic. However, I will grant you that having a front end that has to decode CICS instructions into essentially a bunch of RISC instructions to then send the final micro-ops into the various execution units serves as a complexity barrier, creating complex critical paths of its own that can put a cap on frequencies. So, yes, x86, being more complex, can effectively limit frequency on a given node, but, it is not, itself, the cause of that cap.

DrMrLordX · Jul 1, 2020

name99 said:
And how is that working out for them?

For AMD? Pretty well, actually.

Doug S · Jul 1, 2020

LightningZ71 said:
Wow Richie... Just wow. The frequency cap of any given node (for each architecture) is related to the maximum stable operating frequency of the slowest critical path. Its is instruction set agnostic. However, I will grant you that having a front end that has to decode CICS instructions into essentially a bunch of RISC instructions to then send the final micro-ops into the various execution units serves as a complexity barrier, creating complex critical paths of its own that can put a cap on frequencies. So, yes, x86, being more complex, can effectively limit frequency on a given node, but, it is not, itself, the cause of that cap.

The CISC decode wouldn't enter the critical path for frequency, it just makes the pipeline a bit longer. That can sap performance by increasing branch misprediction penalty and stuff like that, but it won't cap your frequency. It is power that does that.

But you're 100% right, there's nothing special about ARM or RISC in general that allows it to hit a higher clock rate than CISC. Richie must be reading some 30 year old Hennessey and Patterson or something - that may have been true long ago when the transistor budget was several orders of magnitude smaller but it is not true today nor has been true anytime in the past couple decades.

Doug S · Jul 1, 2020

Ignore this....is there no way to delete an accidental post?

myocardia · Jul 1, 2020

Doug S said:
Ignore this....is there no way to delete an accidental post?

No, there isn't. We just type "double post" in the second one when that happens.

name99 · Jul 2, 2020

LightningZ71 said:
Just like going for higher and higher clocks has diminishing returns, so does going wider and wider. Eventually, you get to a point where you've gone so wide that you've reached a point where it would have been better to stay at a slightly simpler design and just go for two separate cores instead. I've seen theoretical design work that shows that a fully provisioned core that provides for the most effective SMT4 implementation is over twice as many transistors as two separate cores that do a great job for SMT2 and can reach higher frequencies. If you're primary focus is on thermal and energy efficiency, then you have a hard cap on your clocks in ANY lithography node (though, it varies from node to node). If you have a hard frequency cap, then you're only choices are to go wider per core, or go with more cores. Eventually, you get to a point that the OS can no longer handle the overhead of managing higher and higher numbers of cores in an efficient manner, and you get to a cap on cores. You also get to a point that adding more and more cores makes the internal wiring of the CPU overly complicated and inefficient, wasting too much die space and energy/heat in the process, so you hit a ceiling there, and that applies on a varying basis to each node. The other issue with going wider and wider on an individual core is the front end. The more execution units that you install, the more complicated you have to make the front end to managing choosing and dispatching to each path. At SMT4, this is not trivial and requires a lot of deep buffering. Having a lower frequency makes it somewhat easier to keep the different paths filled with instructions as it gives more time to the front end to sort things, but, there's still a limit.

In the end, everyone went to the path that made the most sense for what they were working with. x86 is a mess of an instruction set. It takes a lot to make it perform well. No one is arguing that point. ARM is, by its very nature, somewhat simpler and easier to manage in instruction decode/dispatch. Going wider on x86 won't net you the same gains that going wider on ARM does. It also has significant costs above and beyond what it does on ARM just due to the front end differences.

What does SMT performance have to do with single-threaded performance, which is what we are discussing? All your anecdote about SMT4 says is that SMT (even SMT2) is a stupid idea -- which I have been saying for years!
(There IS a very specific way to implement something that superficially looks like SMT, but which avoids the flaws of standard SMT, but that's for another thread.)

It is certainly true that extracting more performance from going wider requires exponentially more transistors. However since every process advance (for those of us still ON process advances) provides exponentially more transistors, this is not exactly a problem.

Will it last forever? No, of course not. No-one is claiming that.
What's being claimed is that RIGHT NOW, and for the foreseeable future, going wider is a better bet than higher frequency.

As for Intel, you are letting them off way too easy. Intel's failures are the result of a deliberate choice to prioritize finance and marketing over daring engineering. A strategy that works great if your goal is to make lots of money today -- and not so great if your goal is to keep the company relevant for the next twenty years...

FOR EXAMPLE Intel had the chance, when they decided to create a mobile chip, to
- go with full x86
- go with a simplified x86; call it x86v8! Something that would make the job of porting compilers, libraries, OS's easy, but not be binary compatible
- go with a start from scratch modern-design ISA
They CHOSE the first option. No-one force them to. There was no body of existing code that they had to support. They could have done anything, but they chose the cheap easy option. Plenty of us at the time said that was stupid, said it would fail in exactly the ways it did fail.
This was not "who could possibly have predicted?", it was very stupid very greedy management making bad decisions.

Carfax83 · Jul 2, 2020

Doug S said:
Why is that relevant? Apple has an L3, so I'm unclear why you are looking for an example without one. When you ask "do you know of a single high core count CPU (8 cores and greater)" without even mentioning any qualification on cache sizes you have already limited discussion to a handful of designs, all within the recent past, because it is only quite recently that it has been possible to fit 8 high performance cores on a single chip.

We just spent the last two pages discussing the merits and demerits of large L1 and L2 caches in high core count CPUs. I used the A13 as an example of a low core count CPU with a cache hierarchy that's optimized for single threaded workloads. I had no idea that Apple also used an SLC as a sort of L3 cache. I wish you would have told me that earlier, as it would have saved me a lot of writing!

But you can look at something like POWER6 which had only two cores per chip but was designed to scale up to 64 cores in a single system. It had large caches like Apple's - 64K/64K L1 and 4 MB L2 per core, as well as a 32 MB L3 shared between the two cores. So I guess it fits your strange qualification of "no L3" as there is no global cache shared among the up to 32 individual dual core CPUs. They did add a global L4 cache made from eDRAM on a future iteration I believe.

That was my point. I have never heard of a high core count CPU with just an L1 and L2 cache. They always add an L3/L4 or something.

I didn't know that the A series used a pseudo L3 cache, so I assumed you believed that Apple could design a high core count CPU with very large L1 and L2 caches and nothing else and it would be highly performant in multithreaded workloads.

DrMrLordX · Jul 2, 2020

name99 said:
As for Intel, you are letting them off way too easy. Intel's failures are the result of a deliberate choice to prioritize finance and marketing over daring engineering.

I don't know if IA64 qualified as "daring engineering", but boy they sure threw a lot of resources into Itanium. Once bitten, twice shy.

Carfax83 · Jul 2, 2020

insertcarehere said:
AMD is using the same fabs and (functionally) nodes as Apple and Zen 2 still requires lots of frequency to match Apple's core in performance, I fail to see how Intel not botching 10nm suddenly means Sunny Cove doesn't suffer the same fate.

AMD was years behind Intel in terms of IPC. People often forget that Zen 1 had a 40% increase in IPC compared to their previous Excavator core, and even that massive leap didn't close the gap with Intel. It took Zen 2 to more or less close the IPC gap (against an Intel core almost 5 years old at this point), though Intel still has an advantage due to higher frequency and a lower latency architecture in certain workloads while Zen 2 has the advantage in compute bound workloads.

Zen 3 should be the tie breaker, so that will be a much better comparison point for what x86 can deliver.....in the near future that is.

As for Intel, if they hadn't botched the 10nm process, we wouldn't be on Sunny Cove cores right now. We would be on Golden Cove or something better, and Golden Cove is reputed to have another 25% or so IPC increase from Willow Cove, which means 50% from Skylake which is what is currently being compared against the Apple cores.

Kocicak · Jul 2, 2020

IBM Power processors???

Carfax83 · Jul 2, 2020

name99 said:
And how is that working out for them?

I'd say it worked rather well, even for Intel. Did you read the 10900K reviews? I was surprised at how well it performed against the 3900x, even though it uses a microarchitecture from nearly 5 years ago.

The sheer frequency advantage allows Comet Lake to be very competitive against Zen 2.

Width (and more generally brainiac design) is more sustainable going forward. This has been clear since at least when Apple started down this path (which is why, from day one practically, Apple went all-in on width).

Well you might be happy to know that Intel (and probably AMD) agree with you, because Golden Cove is purported to have a 25% increase in IPC over Willow Cove, and the next core will probably be a much bigger jump than anything we saw during the Skylake era.

And if they can hit close to 5ghz frequencies with those cores, then that performance is going to be amazing!

The speed demons can only maintain those frequencies for brief spurts, and they're having to dial back the frequencies or, best case, hold them constant, as processes get smaller. How is that a sensible horse on which to bet?

Isn't CPU performance IPC X Frequency? So obviously frequency isn't something to be sneered at.

Info TOP 20 of the World's Most Powerful CPU Cores - IPC/PPC comparison

Senior member

Platinum Member

Golden Member

Senior member

Lifer

Senior member

Lifer

Senior member

Senior member

Senior member

Senior member

Platinum Member

Golden Member

Senior member

Diamond Member

Golden Member

Lifer

Platinum Member

Platinum Member

Diamond Member

Senior member

Diamond Member

Lifer

Diamond Member

Senior member

Diamond Member