Question Cortex X2,A710 A510 LESS EFFICIENT than Cortex X1,A78,A55 !!??

FlameTail · Nov 8, 2022

In this Geekerwan video, he comes to the conclusion that the newer Cortex X2, A710 and A510 are less efficient than their predecessor X1, A78 and A55 respectively.

Please do watch the video and comment what you think.

I personally find this weird, as it also goes against ARM's performance/efficiency claims, but it's not entirely unbelievable.

Maybe this is why Google decided to stick with X1 and A78 cores for the Tensor G2?

Edit: Skip to 13:10 in the above video, if you do not want to watch the whole video. There he shows some excellent Performance-Power graphs, which is the basis for this post. (in fact the subject of this incredible and well-crafted video is all about Performance-Power curves)

FlameTail · Nov 8, 2022

FlameTail said:
I personally find this weird, as it also goes against ARM's performance/efficiency claims, but it's not entirely unbelievable.

For one-

1. They upgraded the ISA to ARMv9 so there might be some optimisation issues (i guess?)

2. The slides ARM presented during their announcement of Cortex X2/A710/A510 involoved comparing a CPU with the new cores but with 8 MB L3 whereas the old CPU which they were comparing had only 4 MB L3. So their was definitely something finicky about their performance claims.

SteinFG · Nov 8, 2022

My guess is that A710 is a redesign of arm core, and made this way in order to iterate on it in the future, while A78 is a core that was refined multiple times, and there's nothing to refine.
If that is the case, after A710 we'll see good generational improvements. And I really hope we'll see good generational improevments from ARM.

soresu · Nov 10, 2022

SteinFG said:
My guess is that A710 is a redesign of arm core, and made this way in order to iterate on it in the future, while A78 is a core that was refined multiple times, and there's nothing to refine.
If that is the case, after A710 we'll see good generational improvements. And I really hope we'll see good generational improevments from ARM.

No - both X2 and A710 are a continuance of the A76 Austin lineage.

I'm less certain of the design lineage of the X3 and A715, but the 2023 super/big cores should definitely be Sophia Antipolis (A9/A17/A73/A75) projects and a generally ground up design effort.

They are codenamed Hunter (A720 ?) and Hunter-ELP (X4).

X3 finally saw the 1st 6 wide ARM Ltd CPU core, but as the Sophia team has typically been more conservative with core width it will be interesting to see where they go with X4, and even more so with A720 as the big cores have been 4 wide since A76 now.

FlameTail · Nov 10, 2022

So what's the significance of all these different teams ? I would like to learn more about them.

soresu · Nov 10, 2022

FlameTail said:
So what's the significance of all these different teams ? I would like to learn more about them.

Austin (Texas) team made A15/A57/A72 in their first round - then from a ground up redesign they made A76/A77/A78/X1/A710/X2 and possibly A715/X3 too for their second round.

Sophia Antipolis (France) made A9/A12/A17/A73/A75 in their first round.

Cambridge (UK) made A7/A53/A55/A510.

I think Neoverse E1/A65 was a collaboration between the Cambridge team and a different American team in Chandler (Arizona) which I haven't heard much about otherwise.

I haven't got a clue who does the Neoverse specific uncore work, albeit as most of it is based on the Austin teams output so far I'd warrant that Austin is also where that is done.

The next ground up redesign is expect from Sophia again, but it was originally expected for the Matterhorn (A710/X2) generation while we are now looking at it being A720/X4, so clearly something is up 🤔

All that being said ARM Ltd seems to use a very modular process where a piece of one design (branch predictor etc) finished by one team can be used in a different teams µArch which maximises the efficiency of the overall design effort.

(if only the business side of ARM worked in such harmony at the moment😅)

The red spirit · Nov 10, 2022

I just wonder what effect does RAM and storage have on performance. Particularly memory can sway results. Also Tensor chip didn't seem to have max perf/watt as goal and it focused a lot on NPU and perhaps photo processing. Neither of which is your typical benchmark task. In the end Android performance could also be heavily affected by scheduler of CPU and GPU. So it makes me wonder how great their variable elimination was.

FlameTail · Nov 12, 2022

https://www.reddit.com/r/Android/comments/ro7boy

Golden Reviewer's figures also agree that the new A510 is worse than A55.

I think we can safely say their was no testing anomaly with the Geekerwan power curves

soresu · Nov 15, 2022

FlameTail said:
https://www.reddit.com/r/Android/comments/ro7boy

Golden Reviewer's figures also agree that the new A510 is worse than A55.

I think we can safely say their was no testing anomaly with the Geekerwan power curves

This could possibly explain the delay with Sophia's nu cores as they are supposed to be ISA compatible with the nu low power core Hayes.

ie it may be that a redesign of one core, or an early bring up of its successor happened to course correct for the A510 snafu.

What a botch job, this will handicap v9-A rollout for years unless they really cut IP licensing costs for Hayes.

FlameTail · Nov 16, 2022

But why is this happening though ? The newer cores should be more efficient, as ARM themselves claimed.

What is going on here? Does anyone have an explanation? All the more do I miss Andrei.

It doesn't seem to be a node issue or vendor implementation problem (as can be seen in the Geekerwan video) because both the Dimensity 9000 and SD 8g1's individual cores exhibit worse efficiency than their predecessors, which are using an older core uArch and on an older node!

FlameTail · Nov 16, 2022

Especially since ARM claimed a 30% efficiency boost for the Cortex A710, but that has hardly materialised.

I think we can agree the A510 cores a bit of a dud, as Andrei himself noted in his review.

FlameTail · Nov 16, 2022

And even the performance gains seem nutty.

SD 888 scores ~3600 points in Geekbench multi core.
SD 8g1 scores ~3800 points in Geek bench multi.

That's a measly ~6% improvement despite the 8g1 having newer and improved cores, 50% more L3 cache, and higher clock speeds across the board.

FlameTail · Nov 19, 2022

I just had a thought, could the increased caches be the reason for the decreased efficiency?

Geekerwan compares the per core power curves of X1/A78/A55 against the X2/A710/A510 in SD 888 vs SD 8g1, and Dimensity 8100 vs Dimensity 9000.

Both newer chips (9000, 8g1) have more cache compared to their predecessors. Could this be the reason for the decreased efficiency?

But, Geekbench isn't a cache intensive test suite is it? And anyway, could increased caches cause such a huge increase in power consumption?
I mean, the newer chips with more cache, are using newer, more efficient nodes as well.

But then again I am no expert. Can anyone provide their take on this?

BorisTheBlade82 · Nov 19, 2022

@FlameTail
Generally, Caches have very low power consumption - compared to logic. Moreover, a larger cache should make a core more efficient as this reduces more expensive RAM reads. IIRC this is one of the main reasons Apple uses a LLC for CPU/GPU.

Doug S · Nov 19, 2022

BorisTheBlade82 said:
@FlameTail
Generally, Caches have very low power consumption - compared to logic. Moreover, a larger cache should make a core more efficient as this reduces more expensive RAM reads. IIRC this is one of the main reasons Apple uses a LLC for CPU/GPU.

Apple uses an LLC mainly to unify their memory hierarchy. They want the GPU to be able to snoop the CPU's cache so you don't have to explicitly flush caches to have the latest data visible on the GPU, NPU etc. Basically allowing the programmer to treat a GPU or NPU core as another CPU core with a really wacky instruction set.

They could get the same efficiency by dedicating that LLC as an L3 for the CPUs - and get better latency out of it too. The SIZE of the LLC is for efficiency, but the fact it is an LLC that covers more than just the CPU rather than just an L3 is not about efficiency.

FlameTail · Nov 19, 2022

Doug S said:
Apple uses an LLC mainly to unify their memory hierarchy. They want the GPU to be able to snoop the CPU's cache so you don't have to explicitly flush caches to have the latest data visible on the GPU, NPU etc. Basically allowing the programmer to treat a GPU or NPU core as another CPU core with a really wacky instruction set.

They could get the same efficiency by dedicating that LLC as an L3 for the CPUs - and get better latency out of it too. The SIZE of the LLC is for efficiency, but the fact it is an LLC that covers more than just the CPU rather than just an L3 is not about efficiency.

Tbh Apple uses such huge L2 caches in their CPUs that i don't think an L3 is really necessary. The A16 bionic has whopping 16 MB of L2.

To put that into perspective, That's more L2 cache than the L3 and SLC COMBINED of today's flagship SoCs. (eg: Dimensity 9000; 8 MB L3, 6 MB SLC)

Doug S · Nov 20, 2022

FlameTail said:
Tbh Apple uses such huge L2 caches in their CPUs that i don't think an L3 is really necessary. The A16 bionic has whopping 16 MB of L2.

To put that into perspective, That's more L2 cache than the L3 and SLC COMBINED of today's flagship SoCs. (eg: Dimensity 9000; 8 MB L3, 6 MB SLC)

If there was no SLC they would still need a higher caching layer to manage data moving between the L2s of the big and little cores. Forcing that sharing to be done via flushes to DRAM would make switching tasks from big to little or vice versa a lot more expensive.

FlameTail · Nov 21, 2022

So ig Google was *smart* to not put the X2/A710/A510 in the Tensor G2.

I hope they skip the X2/A710 generation and directly go to X3/A715 for Tensor G3.

FlameTail · Nov 21, 2022

BorisTheBlade82 said:
@FlameTail
Generally, Caches have very low power consumption - compared to logic. Moreover, a larger cache should make a core more efficient as this reduces more expensive RAM reads. IIRC this is one of the main reasons Apple uses a LLC for CPU/GPU.

Can you provide a rough estimation of how much power 1 MB of cache consumes?

Someone on another thread claimed that the 64 MB 3D-V cache on AMD chips consumes ~4W. If i do the math, 1 MB consumes a miniscule 0.0625W. Not sure about the accuracy though. And I do know the power depends on the process node as well.

BorisTheBlade82 · Nov 22, 2022

@FlameTail
Sorry, I have no specific numbers. But these sound reasonable. I would assume the V-cache to even be above average because of the need for TSVs.
Just for perspective: A Rembrandt APU with 16 MByte L3 and 8 cores can be run at around 2 GHz with 12w total package power.

SpudLobby · Dec 1, 2022

BorisTheBlade82 said:
@FlameTail
Sorry, I have no specific numbers. But these sound reasonable. I would assume the V-cache to even be above average because of the need for TSVs.
Just for perspective: A Rembrandt APU with 16 MByte L3 and 8 cores can be run at around 2 GHz with 12w total package power.

Is this in context of a CPU workload with all the cores active? Wondering where you got this. I could believe that at those frequencies though - checks out.

At any rate, more cache would only hurt static/idle power and even then not very much, even with TSV cache using some older nodes. Realistically, a Risen APU with e.g. 48MB of L3 via TSV would improve the dynamic power draw due to improving L3 hit rates and keeping data movement across the package to DRAM lower.

Thala · Dec 2, 2022

SpudLobby said:
At any rate, more cache would only hurt static/idle power and even then not very much, even with TSV cache using some older nodes. Realistically, a Risen APU with e.g. 48MB of L3 via TSV would improve the dynamic power draw due to improving L3 hit rates and keeping data movement across the package to DRAM lower.

Static power of SRAM is highly dependent on the minimum cycle time. So lower clocked LLC has much less leakage per MByte than say L1$ - we are talking factors here.
In general you save power when avoiding expensive DRAM access.

BorisTheBlade82 · Dec 3, 2022

SpudLobby said:
Is this in context of a CPU workload with all the cores active? Wondering where you got this. I could believe that at those frequencies though - checks out.

Took me a couple of days - here is a screenshot of my own Renoir (4700U) locked at 12w cTDP. It achieves around 1.8Ghz on all 8 cores (without SMT, if this means something to you).

Question Cortex X2,A710 A510 LESS EFFICIENT than Cortex X1,A78,A55 !!??

Diamond Member

Diamond Member

Senior member

Platinum Member

Diamond Member

Platinum Member

Member

Diamond Member

Platinum Member

Diamond Member

Diamond Member

Attachments

Diamond Member

Diamond Member

Senior member

Platinum Member

Diamond Member

Platinum Member

Diamond Member

Diamond Member

Senior member

Senior member

Golden Member

Senior member

Attachments