Discussion Future ARM Cortex + Neoverse µArchs Discussion

soresu · Jan 26, 2025

adroc_thurston said:
Well that's pointless.
If you want an OoO little, you pick the mainline cortex-A.

I think they mean a narrower 2-3 wide OoO design like the earlier A7x cores, but on a modern fab node with all the modern v9.x-A ISA trimmings for big.LITTLE consistency.

An A73-75 level CPU core on 5nm or smaller would be very efficient and most importantly low area.

Anything wider and you might as well just jump up to the latest A7xx as you said.

soresu · Jan 26, 2025

NostaSeronx said:
There is LSC/FC with Restricted Out-of-Order capability...
View attachment 115540
+ FSC: https://dl.acm.org/doi/fullHtml/10.1145/3499424
So, there are alternatives to pure Out-of-Order and pure In-Order. I see rOoO/sOoO taking over in-place of InO/OoO.

Cortex-A600 can't come out sooner.

In concept reminds me a bit of the forward and deferred rendering technique improvements in computer gfx.

That being said, it needs full references and expanded acronym names!

Best practice is to use acronyms only after you have established their meaning.

It saves having to explain things later.

LSC = Load Slice Core (PDF paper link)

FC = Freeway Core (PDF paper link)

FSC = Forward Slice Core (PDF paper link, web article linked in above post)

Food for thought though: these papers are mostly academic musings on µArch design, devoid of the decades' worth of large engineering team investment in refining ideas that in-order and out-of-order designs have had across the IT industry since their invention.

What on the face of it may seem to be revolutionary gains seen in these papers could easily turn out to be most of the low hanging fruit the baseline architectures have to offer.

Very interesting nonetheless - hope none of it turns out like clustered multithreading 😅

Edit: FC might actually stand for Freeflow Core instead, a different µArch proposal to Freeway.

2 papers published in the same year with very similar proposal names....

One paper literally references the other and thus has no excuse 😒

DZero · Jan 26, 2025

adroc_thurston said:
Well that's pointless.
If you want an OoO little, you pick the mainline cortex-A.

Consumption / performance ratio. Cortex-A is really wide but has high power consumption. Also, the cost of the licences is not that low.

There is space for an A6XX line. Just copy what Huawei did with their small Taishan.

adroc_thurston · Jan 26, 2025

DZero said:
Consumption / performance ratio

It's really, really good from the A720 onwards.

DZero said:
Cortex-A is really wide but has high power consumption. Also, the cost of the licences is not that low.

Cost-sensitive segments will always stick to A5x since poverty is poor.

DZero · Jan 26, 2025

adroc_thurston said:
It's really, really good from the A720 onwards.

Cost-sensitive segments will always stick to A5x since poverty is poor.

Well... Huawei thinks otherwise, and if MediaTek or Qualcomm saw a cheaper core that is out of order but at the price of the A5x, they would jump on it instantly.

adroc_thurston · Jan 26, 2025

DZero said:
Huawei thinks otherwise

They are ughhhh. Cucked. Out of mainline core or ISA licensing.

DZero said:
if MediaTek or Qualcomm saw a cheaper core that is out of order but at the price of the A5x

Out of order but the same price as A5x? Oxymoron.

soresu · Jan 27, 2025

adroc_thurston said:
Out of order but the same price as A5x? Oxymoron.

Assuming for a second that they meant power/area (and not ARM Ltd's licensing costs), then it's not that far out of the realms of possibility at some point in the future.

Having looked further into the research Nosta brought up earlier, it actually seems possible to get within 10% of OoO perf for a given pipeline width, with a negligible increase in power draw and an acceptable increase in area and complexity over an InO µArch design (compared with going full OoO, that is).

The power increase over the 3.12 W in-order core was a mere tens of milliwatts for the most recent proposal, "Forward Slice Core", less than 1.1%. The area increase was rated on a scale of 1-5, with in-order being 1, OoO being 5 and Forward Slice Core being 2.
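
As a rough sanity check on those power figures, here is a minimal sketch. It treats the "less than 1.1%" as an upper bound on the increase over the 3.12 W in-order baseline; the constant and function names are my own and do not come from the paper.

```python
# Rough arithmetic on the power figures quoted in this post.
# Assumption: "less than 1.1%" is used as an upper bound on the increase.
INORDER_POWER_W = 3.12          # in-order baseline power from the post
MAX_INCREASE_FRACTION = 0.011   # 1.1% expressed as a fraction


def max_power_delta_mw(baseline_w: float, fraction: float) -> float:
    """Upper bound on the extra power, converted from watts to milliwatts."""
    return baseline_w * fraction * 1000.0


delta_mw = max_power_delta_mw(INORDER_POWER_W, MAX_INCREASE_FRACTION)
total_w = INORDER_POWER_W + delta_mw / 1000.0

print(f"upper bound on extra power: {delta_mw:.1f} mW")
print(f"upper bound on total power: {total_w:.3f} W")
```

That works out to at most about 34 mW on top of the baseline, which matches the "tens of milliwatts" description.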

Benefits of this design seemed to increase with pipeline width, though only 2-wide and 3-wide were given as examples, so it's anyone's guess as to whether the benefits would taper off or even reverse at the pipeline width of modern flagship OoO cores.

It seems like research in this direction has come a long way, iteratively improving the PPA of this design path over the last decade. I'm curious to see how it progresses further, both academically and commercially.

Nothingness · Jan 27, 2025

soresu said:
Assuming for a second that they meant power/area (and not ARM Ltd's licensing costs), then it's not that far out of the realms of possibility at some point in the future.

Having looked further into the research Nosta brought up earlier, it actually seems possible to get within 10% of OoO perf for a given pipeline width, with a negligible increase in power draw and an acceptable increase in area and complexity over an InO µArch design (compared with going full OoO, that is).

The power increase over the 3.12 W in-order core was a mere tens of milliwatts for the most recent proposal, "Forward Slice Core", less than 1.1%. The area increase was rated on a scale of 1-5, with in-order being 1, OoO being 5 and Forward Slice Core being 2.

Benefits of this design seemed to increase with pipeline width, though only 2-wide and 3-wide were given as examples, so it's anyone's guess as to whether the benefits would taper off or even reverse at the pipeline width of modern flagship OoO cores.

It seems like research in this direction has come a long way, iteratively improving the PPA of this design path over the last decade. I'm curious to see how it progresses further, both academically and commercially.

One has to be careful with this kind of academic result. For instance, a significant area in higher-performance cores is dedicated to branch prediction and data prefetching because they make a large difference in many realistic workloads (web browsing, for instance). Going from academia to industry has taught me one thing: until you have pushed your design close to something that looks like a releasable product, all performance results are to be taken with care.

I'm not trying to say the findings in the articles are not worth it, only that their results need more work.

DZero · Jan 27, 2025

Nothingness said:
One has to be careful with this kind of academic result. For instance, a significant area in higher-performance cores is dedicated to branch prediction and data prefetching because they make a large difference in many realistic workloads (web browsing, for instance). Going from academia to industry has taught me one thing: until you have pushed your design close to something that looks like a releasable product, all performance results are to be taken with care.

I'm not trying to say the findings in the articles are not worth it, only that their results need more work.

Indeed, but if Huawei managed to pull off a small out-of-order core with limited resources, what is stopping ARM from pulling one off, given that Qualcomm might improve their own cores too?

Also, Apple's "little" cores are OoO too.

Nothingness · Jan 27, 2025

DZero said:
Indeed, but if Huawei managed to pull off a small out-of-order core with limited resources, what is stopping ARM from pulling one off, given that Qualcomm might improve their own cores too?

Also, Apple's "little" cores are OoO too.

Perhaps Arm is more interested in pushing A7xx cores as little cores, leaving it to customers really interested in very small cores to do their own design?

DZero · Jan 27, 2025

Nothingness said:
Perhaps Arm is more interested in pushing A7xx cores as little cores, leaving it to customers really interested in very small cores to do their own design?

I want to think of that as an alternative, but if they are still releasing A5xx cores as the small ones, it seems that something is off.

Also, A5xx is a lost cause when it comes to being more performant; the gains are not as big as expected, so going for full efficiency should be the path.

Shivansps · Jan 27, 2025

GT 1030 on the Orion O6

They say it doesn't work in UEFI (unlike the RX 6400), but it works in Linux.

PCIe on this board seems decent enough.

Having an ARM consumer board that you can just plug a GPU into and it "just works" is something new.

DZero · Jan 27, 2025

adroc_thurston said:
They are ughhhh. Cucked. Out of mainline core or ISA licensing.

Out of order but the same price as A5x? Oxymoron.

Funny story: they jumped ship even though they had A510 cores, which suggests they had access to the X2 core but chose not to use it. Also, going down the custom path gives Huawei more control of their own; that is how Apple went custom and even ditched in-order cores really fast with the A7, and even before that it went out of order pretty fast. Even their watches use out-of-order cores.

Shivansps said:
GT 1030 on the Orion O6

They say it doesn't work in UEFI (unlike the RX 6400), but it works in Linux.

PCIe on this board seems decent enough.

Having an ARM consumer board that you can just plug a GPU into and it "just works" is something new.

Now that is insane. That is how they planned to release a system that just works out of the box. It is a big advancement from them.

soresu · Jan 27, 2025

Nothingness said:
One has to be careful with this kind of academic result. For instance, a significant area in higher-performance cores is dedicated to branch prediction and data prefetching because they make a large difference in many realistic workloads (web browsing, for instance). Going from academia to industry has taught me one thing: until you have pushed your design close to something that looks like a releasable product, all performance results are to be taken with care.

I'm not trying to say the findings in the articles are not worth it, only that their results need more work.

I already said as much in an earlier post.

A relatively short and low staffed academic design effort is not going to be comparable to a polished production µArch created by a team of dozens working at multiple levels to get something that is hopefully going to add profit to the bottom line of a company.

Shivansps · Jan 28, 2025

radxa@orion-o6:~$ glmark2-es2-wayland
=======================================================
    glmark2 2023.01
=======================================================
    OpenGL Information
    GL_VENDOR:      ARM
    GL_RENDERER:    Mali-G720-Immortalis
    GL_VERSION:     OpenGL ES 3.2 v1.r49p0-00eac0.b97811108d91b3a6cd0a9d90e51f9da5
    Surface Config: buf=32 r=8 g=8 b=8 a=8 depth=24 stencil=0 samples=0
    Surface Size:   800x600 windowed
=======================================================
[build] use-vbo=false: FPS: 9015 FrameTime: 0.111 ms
[build] use-vbo=true: FPS: 9245 FrameTime: 0.108 ms
[texture] texture-filter=nearest: FPS: 9097 FrameTime: 0.110 ms
[texture] texture-filter=linear: FPS: 9035 FrameTime: 0.111 ms
[texture] texture-filter=mipmap: FPS: 9915 FrameTime: 0.101 ms
[shading] shading=gouraud: FPS: 8299 FrameTime: 0.121 ms
[shading] shading=blinn-phong-inf: FPS: 7943 FrameTime: 0.126 ms
[shading] shading=phong: FPS: 8077 FrameTime: 0.124 ms
[shading] shading=cel: FPS: 8054 FrameTime: 0.124 ms
[bump] bump-render=high-poly: FPS: 4527 FrameTime: 0.221 ms
[bump] bump-render=normals: FPS: 9443 FrameTime: 0.106 ms
[bump] bump-render=height: FPS: 8805 FrameTime: 0.114 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 8780 FrameTime: 0.114 ms
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 6532 FrameTime: 0.153 ms
[pulsar] light=false:quads=5:texture=false: FPS: 8829 FrameTime: 0.113 ms
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 3839 FrameTime: 0.261 ms
[desktop] effect=shadow:windows=4: FPS: 7163 FrameTime: 0.140 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 1449 FrameTime: 0.690 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 1407 FrameTime: 0.711 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 2160 FrameTime: 0.463 ms
[ideas] speed=duration: FPS: 4056 FrameTime: 0.247 ms
[jellyfish] <default>: FPS: 6343 FrameTime: 0.158 ms
[terrain] <default>: FPS: 708 FrameTime: 1.413 ms
[shadow] <default>: FPS: 7009 FrameTime: 0.143 ms
[refract] <default>: FPS: 1615 FrameTime: 0.619 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 9377 FrameTime: 0.107 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 8472 FrameTime: 0.118 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 9597 FrameTime: 0.104 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 9585 FrameTime: 0.104 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 8515 FrameTime: 0.117 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 8545 FrameTime: 0.117 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 8429 FrameTime: 0.119 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 8387 FrameTime: 0.119 ms
=======================================================
                                  glmark2 Score: 7036
=======================================================

radxa@orion-o6:~$ glmark2-wayland
=======================================================
    glmark2 2023.01
=======================================================
    OpenGL Information
    GL_VENDOR:      Mesa
    GL_RENDERER:    zink (Mali-G720-Immortalis)
    GL_VERSION:     4.0 (Compatibility Profile) Mesa 23.0.4 (git-b87980692c)
    Surface Config: buf=32 r=8 g=8 b=8 a=8 depth=24 stencil=0 samples=0
    Surface Size:   800x600 windowed
=======================================================
[build] use-vbo=false: FPS: 3568 FrameTime: 0.280 ms
[build] use-vbo=true: FPS: 3743 FrameTime: 0.267 ms
[texture] texture-filter=nearest: FPS: 4106 FrameTime: 0.244 ms
[texture] texture-filter=linear: FPS: 4124 FrameTime: 0.243 ms
[texture] texture-filter=mipmap: FPS: 3921 FrameTime: 0.255 ms
[shading] shading=gouraud: FPS: 3219 FrameTime: 0.311 ms
[shading] shading=blinn-phong-inf: FPS: 2941 FrameTime: 0.340 ms
[shading] shading=phong: FPS: 3190 FrameTime: 0.313 ms
[shading] shading=cel: FPS: 3112 FrameTime: 0.321 ms
[bump] bump-render=high-poly: FPS: 2123 FrameTime: 0.471 ms
[bump] bump-render=normals: FPS: 3788 FrameTime: 0.264 ms
[bump] bump-render=height: FPS: 3112 FrameTime: 0.321 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 3868 FrameTime: 0.259 ms
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 2947 FrameTime: 0.339 ms
[pulsar] light=false:quads=5:texture=false: FPS: 3705 FrameTime: 0.270 ms
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 1416 FrameTime: 0.706 ms
[desktop] effect=shadow:windows=4: FPS: 2332 FrameTime: 0.429 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 1221 FrameTime: 0.820 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 1435 FrameTime: 0.697 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 1435 FrameTime: 0.697 ms
[ideas] speed=duration: FPS: 1341 FrameTime: 0.746 ms
[jellyfish] <default>: FPS: 1825 FrameTime: 0.548 ms
[terrain] <default>: FPS: 440 FrameTime: 2.275 ms
[shadow] <default>: FPS: 2321 FrameTime: 0.431 ms
[refract] <default>: FPS: 797 FrameTime: 1.256 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 3792 FrameTime: 0.264 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 3948 FrameTime: 0.253 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 3786 FrameTime: 0.264 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 4004 FrameTime: 0.250 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 4147 FrameTime: 0.241 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 4004 FrameTime: 0.250 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 3936 FrameTime: 0.254 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 3986 FrameTime: 0.251 ms
=======================================================
                                  glmark2 Score: 2957
=======================================================
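
For anyone wanting to compare the two runs scene by scene, a minimal Python sketch follows. The regex is my own guess at the entry format based on the logs above, not something glmark2 provides, and only three entries per run are copied in to keep it short. It uses `finditer`, so it should also cope with logs where the line breaks got lost.

```python
import re

# Assumed entry format: "[scene] options: FPS: <int> FrameTime: <float> ms".
ENTRY = re.compile(r"\[(\w+)\]\s+(.*?):\s+FPS:\s+(\d+)\s+FrameTime:\s+([\d.]+)\s+ms")


def parse_fps(log: str) -> dict[str, int]:
    """Map '[scene] options' to FPS for every entry found in the log text."""
    return {f"[{m.group(1)}] {m.group(2)}": int(m.group(3)) for m in ENTRY.finditer(log)}


# Three entries copied from each run above (not the full logs).
native_gles_log = (
    "[build] use-vbo=false: FPS: 9015 FrameTime: 0.111 ms\n"
    "[terrain] <default>: FPS: 708 FrameTime: 1.413 ms\n"
    "[refract] <default>: FPS: 1615 FrameTime: 0.619 ms\n"
)
zink_gl_log = (
    "[build] use-vbo=false: FPS: 3568 FrameTime: 0.280 ms\n"
    "[terrain] <default>: FPS: 440 FrameTime: 2.275 ms\n"
    "[refract] <default>: FPS: 797 FrameTime: 1.256 ms\n"
)

native = parse_fps(native_gles_log)
zink = parse_fps(zink_gl_log)

# Per-scene ratio of the native GLES run over the zink run.
ratios = {scene: native[scene] / zink[scene] for scene in native if scene in zink}
for scene, ratio in ratios.items():
    print(f"{scene}: native GLES is {ratio:.2f}x zink")

# Overall ratio from the two final scores reported above.
overall = 7036 / 2957
print(f"overall score ratio: {overall:.2f}x")
```

On the final scores the native GLES driver comes out roughly 2.4x ahead of zink, though the two runs go through different APIs, so this is only a coarse comparison.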

Looks like Jeff already got his; mine didn't ship before the 20th, so I'll have to wait a while.

mali-g720mc10-immortals-r49p0-00eac0-clinfo-clpeak.log (gist.github.com)

mali-g720mc10-immortals-r49p0-00eac0-vulkaninfo-vkpeak.log (gist.github.com)

OpenGL 4.0 support (via zink) also means it has geometry shader support, something the G610 lacks.

Nothingness · Jan 29, 2025

CNX posted a first round of benchmarks for the Radxa Orion O6

Radxa Orion O6 Review - Part 1: Unboxing, Debian 12 installation, and first benchmarks - CNX Software

Radxa Orion O6 review with an unboxing of the CIX P1 12-core Armv9 mini-ITX motherboard, Debian 12 installation, and a few benchmarks.

www.cnx-software.com

That looks good.

Shivansps · Jan 29, 2025

The CIX founder is ex-AMD

igor_kavinski · Jan 29, 2025

I wonder if shaking hands with Jim Keller might rub some of his magic off on me...

DZero · Jan 29, 2025

Shivansps said:
The CIX founder is ex-AMD

That explains a lot. It makes a lot of sense, given how they are trying to break into the market.

Nothingness said:
CNX posted a first round of benchmarks for the Radxa Orion O6

Radxa Orion O6 Review - Part 1: Unboxing, Debian 12 installation, and first benchmarks - CNX Software

Radxa Orion O6 review with an unboxing of the CIX P1 12-core Armv9 mini-ITX motherboard, Debian 12 installation, and a few benchmarks.

www.cnx-software.com

That looks good.

Damn. The Rock 5B, which uses the same processor as the Orange Pi 5, is not that far behind.

DZero · Jan 30, 2025

Makes me wonder: who needs to release the GPU drivers? If they end up being released for Windows, how will the performance improve?

Shivansps · Jan 30, 2025

ARM is the one that needs to make the GPU drivers. They already build Linux and Android drivers.

DZero · Jan 30, 2025

Shivansps said:
ARM is the one that needs to make the GPU drivers. They already build Linux and Android drivers.

Thanks. Makes me wonder: if ARM makes drivers for Windows (with a little help from Microsoft), how will it fare? Will it also be a benefit in terms of emulating Windows games on Android?

Shivansps · Jan 30, 2025

DZero said:
Thanks. Makes me wonder: if ARM makes drivers for Windows (with a little help from Microsoft), how will it fare? Will it also be a benefit in terms of emulating Windows games on Android?

No. Since this SoC already supports Vulkan on Linux and Android, it can already run all emulators as well as it is able to. A Windows driver would only matter when running native Windows.

I would not rule out some sort of "triple alliance" (ARM, CIX and MediaTek) to make a working Mali Windows driver.

Shivansps · Feb 1, 2025

https://browser.geekbench.com/v6/cpu/10224768
Looks like Ryzen 7 1700 level at those clocks. I kinda expected it to land more in the 5600G area. With better clocks it may get into 2700X territory.

Actually, it might score higher with the A520 cores disabled, especially if they take up space in the L3 cache.

DZero · Feb 1, 2025

Shivansps said:
https://browser.geekbench.com/v6/cpu/10224768
Looks like Ryzen 7 1700 level at those clocks. I kinda expected it to land more in the 5600G area. With better clocks it may get into 2700X territory.

Actually, it might score higher with the A520 cores disabled, especially if they take up space in the L3 cache.

This is why I expect an out-of-order small core. In-order ones are no longer useful beyond basic tasks.
