Discussion: ARM Cortex/Neoverse IP + SoCs (no custom cores)

Page 50

naukkis

Senior member
Jun 5, 2002
991
841
136
I think we already went through this. I'd really like to see VL-agnostic code and a comparison of SVE vs the R-V vector extension.

I've seen VL agnostic SVE code that doesn't need a single change for different VL. But I guess there are cases where that doesn't work (shuffles?) and I'd be interested in seeing how R-V handles that.

A vector ISA is a hardware abstraction layer. By definition everything just works - and that's not only theoretical, as working vector CPUs have been built since the '70s. Every op is scalar - SIMD packing is only done at the hardware level and isn't visible from outside (there might be side channels, though). SVE, on the other hand, has some instructions that scale, with hardware tail handling, while other instructions need different software handlers for different SIMD widths. It seems to be a nightmare beyond any other commercial architecture. Waiting to see when some hardware maker finally implements SVE over 128 bits, offers Linux support and Linus Torvalds tries to implement and verify algorithms to support it. I might be totally wrong and SVE is actually fine to code for - but as years go and nobody bothers to use it I suspect that everybody else is also seeing SVE problems and stay as far as possible from it.
 

Nothingness

Diamond Member
Jul 3, 2013
3,277
2,329
136
Waiting to see when some hardware maker finally implements SVE over 128 bits, offers Linux support and Linus Torvalds tries to implement and verify algorithms to support it. I might be totally wrong and SVE is actually fine to code for - but as years go and nobody bothers to use it I suspect that everybody else is also seeing SVE problems and stay as far as possible from it.
As far as Arm goes:
- The first Neoverse with SVE (V1) was 256-bit
- Fujitsu is 512-bit
- Apple M4 supports 512-bit in streaming SVE (though it's too slow to be really worth it in general)
- You can buy boards with 128-bit CPU
- SVE implementations exist in open source software (including support in Linux kernel)
- gcc/clang can generate VL agnostic SVE code.

I'll leave it to you to list the existing R-V vector implementations and what software makes use of them.

For the rest of your message, you seem to think SVE can only handle data streams whose length is a multiple of VL. That's wrong. Predicates and fault-suppressing loads (first-fault and non-fault loads) take care of that. I'll agree that VL can't exceed the hardware width of the SVE registers (currently 128, 256 or 512 bits depending on the CPU). But a 512-bit implementation will happily run software written for 128- or 256-bit vectors without changing a line of code for many vectorized loops.
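
To make the predicate point concrete, here is a toy model in plain Python rather than real SVE code. The names `whilelt` and `saxpy_vla` are made up for this sketch, and `lanes` stands in for the hardware vector length (4, 8 and 16 lanes would correspond to 128-, 256- and 512-bit registers if the elements are assumed to be 32 bits wide). It only shows the idea of a loop where a per-lane predicate covers the tail, so the same code handles an element count that is not a multiple of the vector length.

```python
# Toy model of a vector-length-agnostic (VLA) loop. Not SVE code:
# `lanes` plays the role of the hardware vector length, and the predicate
# list imitates what a WHILELT-style instruction would produce.

def whilelt(i, n, lanes):
    """Return a predicate: lane k is active while i + k < n."""
    return [i + k < n for k in range(lanes)]

def saxpy_vla(a, x, y, lanes):
    """Compute a*x + y one 'vector' at a time; inactive lanes are skipped."""
    n = len(x)
    out = list(y)
    i = 0
    while i < n:
        pred = whilelt(i, n, lanes)
        for k, active in enumerate(pred):
            if active:  # predication handles the tail, no scalar cleanup loop
                out[i + k] = a * x[i + k] + y[i + k]
        i += lanes  # the only place the vector length shows up
    return out

# 11 elements: not a multiple of 4, 8 or 16 lanes.
x = [float(v) for v in range(11)]
y = [1.0] * 11
results = {lanes: saxpy_vla(2.0, x, y, lanes) for lanes in (4, 8, 16)}
```

The loop body never hard-codes a width, so all three `lanes` values produce the same result. Permutes and shuffles are a different matter and are not modelled here.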
 

Nothingness

Diamond Member
Jul 3, 2013
3,277
2,329
136
LightningDust's axioms:

1. Any thread containing naukkis will inevitably degenerate into naukkis insisting that anything but length-abstracted vector machines is doomed and obsolete.
Which makes me wonder: is anyone still doing "clean and pure" vector CPUs? I mean except R-V vendors who are stuck in the 20th century (apologies for the obvious trolling).

2. Any thread containing LD will inevitably degenerate into LD insisting* that EPIC was Good Actually


* correctly
Code:
s/LD/SK/g
 

LightningDust

Member
Sep 3, 2024
40
67
51
Which makes me wonder: is anyone still doing "clean and pure" vector CPUs? I mean except R-V vendors who are stuck in the 20th century (apologies for the obvious trolling).

NEC, though they're exiting that market. Was always really about domestic-market seismic-analysis and weather codes that are hypersensitive to memory-bandwidth and have basically arbitrary amounts of DLP.

Code:
s/LD/SK/g

I could not possibly be the Queen of Blades. She got dragged into way too many idiot arguments about "x86 good in sekrit ways that don't show up on benchmarks." I'm just here for EPIC apologia. Important difference!
 

MS_AT

Senior member
Jul 15, 2024
556
1,168
96
but as years go and nobody bothers to use it
Have you checked whether it is being used in server deployments? Fujitsu for one will be using it. On mobile it is not surprising nobody is using it, since everything still supports Neon. Why support 2 codebases (since budget devices won't support SVE for some time yet) when one (Neon) will cover everything Android?
 

camel-cdr

Member
Feb 23, 2024
27
90
51
Why support 2 codebases (since budget devices won't support SVE for some time yet) when one (Neon) will cover everything Android?
It's worse than that: apart from the A64FX, every SVE implementation reuses the NEON ALUs for SVE, AFAIK.
So on the Neoverse V1, you can use four-issue 128-bit NEON or two-issue 256-bit SVE.
Which makes the gain from SVE minimal.
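
Taking those figures at face value (they are given as AFAIK, and issue width times register width is only a rough proxy for real throughput), the arithmetic behind "minimal" can be sketched like this. The helper name `bits_per_cycle` is made up for the illustration.

```python
# Peak vector bits processed per cycle = issue width * register width.
# The issue widths and register widths are the ones quoted in the post.

def bits_per_cycle(issue_width, vector_bits):
    return issue_width * vector_bits

neon = bits_per_cycle(4, 128)  # four-issue 128-bit NEON
sve = bits_per_cycle(2, 256)   # two-issue 256-bit SVE
print(neon, sve)
```

Both paths come out at the same peak width, so under this simple model any SVE advantage on such a core would have to come from things like predication or fewer instructions, not from raw datapath width.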
 

LightningDust

Member
Sep 3, 2024
40
67
51
Citation needed.

When IPF was invested in, performance was generally good to excellent, despite Montecito being delayed, Hondo being an abomination built out of PA-8800 bits, etc. (Merced is safely ignored, as it bears no resemblance to any other Itanium microarchitecture, but even it put up respectable SPEC numbers.)

IPF's reputational issues were always a matter of expectations - "64-bit evolution of x86", "industry-standard 64-bit merchant processor", and that stupid IDC projected-volume graph - rather than what the Fort Collins design group actually delivered, which was generally pretty decent.
 

Doug S

Diamond Member
Feb 8, 2020
3,083
5,316
136
When IPF was invested in, performance was generally good to excellent, despite Montecito being delayed, Hondo being an abomination built out of PA-8800 bits, etc. (Merced is safely ignored, as it bears no resemblance to any other Itanium microarchitecture, but even it put up respectable SPEC numbers.)

IPF's reputational issues were always a matter of expectations - "64-bit evolution of x86", "industry-standard 64-bit merchant processor", and that stupid IDC projected-volume graph - rather than what the Fort Collins design group actually delivered, which was generally pretty decent.

Performance was always well behind x86. Itanium failed because the whole idea that a magic compiler will find more ILP than a CPU can at runtime was massively flawed.

If you want to imply lack of investment did it in, what about the lack of investment in Alpha? If you devoted equal resources to both, IPF would have looked silly by comparison. The lack of investment was because its design roadmap showed it couldn't compete with x86, especially not after x86 went 64-bit (despite Intel doing everything they could to avoid or at least delay that and prop up Itanium), gained some breathing room on registers, and no longer needed any kludges to handle more than 2 GB.

Itanium was done in like the RISCs it (briefly) replaced, because a market niche can't compete with x86 so the latter got more investment and cheaper pricing. The same massive revenue advantage in Apple's ARM products (iPhone/iPad) was used to dethrone x86 and kick it out of the Mac.
 
Jul 27, 2020
23,540
16,535
146
Intel is no stranger to shooting itself in its own feet, both of them. They destroyed the performance potential of AVX-512 by not releasing it in all their consumer CPUs (yes, even Celerons should've had it, if just a cut-down version).

Similarly, they got greedy and tried to sell a 64-bit CPU to corporations, hoping they would somehow destroy IBM overnight and take control of the entire world with their 64-bit revolution, when a simple mass-market consumer CPU line-up with the IPF architecture could've generated developer interest and ensured tons of open source development and eventual success within a decade.

Heck, they could've put an Atom CPU alongside that IPF based CPU to provide backward compatibility. But simple solutions don't appeal to the morons deciding Intel's fate.
 

Nothingness

Diamond Member
Jul 3, 2013
3,277
2,329
136
Heck, they could've put an Atom CPU alongside that IPF based CPU to provide backward compatibility. But simple solutions don't appeal to the morons deciding Intel's fate.
For some reason I thought the first Itaniums had an x86 core in them, but I can't find any reference to that. Failing memory, likely.
 

LightningDust

Member
Sep 3, 2024
40
67
51
Performance was always well behind x86

SPEC disagrees. So do a bunch of real applications I've worked with.

Itanium failed because the whole idea that a magic compiler will find more ILP than a CPU can at runtime was massively flawed.

If you want to imply lack of investment did it in, what about the lack of investment in Alpha? If you devoted equal resources to both, IPF would have looked silly by comparison.

Shrug. The HP userbase was bigger, so their needs won. The Compaq kids got a VMS port as a consolation prize.

Itanium was done in like the RISCs it (briefly) replaced, because a market niche can't compete with x86 so the latter got more investment and cheaper pricing. The same massive revenue advantage in Apple's ARM products (iPhone/iPad) was used to dethrone x86 and kick it out of the Mac.

Yep. Power's still slouching along, but it isn't particularly performance-competitive with x86 anymore - especially for the price. Certainly I don't remember saying that RISC/UNIX was long-term commercially viable.

There were plans to, but in the end it used slowass software emulation.

On the contrary, three generations of Itanium had a hardware x86 block, an alternate frontend rather than a full x86 core (Merced, McKinley, and Madison.) It was very bad. IA-32 EL, the software emulator that came in after around 2004, was actually pretty good (perf comparable to a similarly-clocked Prescott, so not amazing, but acceptable for running Windows utility software, which was the main use case.)

Heck, they could've put an Atom CPU alongside that IPF based CPU to provide backward compatibility.

In 2001?
 

LightningDust

Member
Sep 3, 2024
40
67
51
SPEC disagrees. So do a bunch of real applications I've worked with.

Elaborating on this, I bring to you three pairs of processor results for spec06. All but one (Woodcrest SPECFP) are non-autoparallel to try to level the playing field.

Opteron 2222 (3GHz second-gen K8, 90nm):

14.7/16.1 int
14.8/16.0 FP


Xeon 5160 (3GHz Woodcrest, 65nm):

18.7/20.4 int
16.6/17.2 FP

https://spec.org/cpu2006/results/res2007q4/cpu2006-20071109-02479.html (autopar)

Itanium 9140 (1.66GHz Montvale, 90nm):

15.7/17.0 int
19.9/20.4 FP


I would call that competitive. Not dominant, but competitive. Bearing in mind that the Woodcrest has a full lithography generation advantage, that Montecito/Montvale are using a reheated version of a microarchitecture that was at that point five years old, and that Montecito/Montvale lost 10-15% of their originally announced clock speed due to the late-breaking Foxton erratum, I don't consider that to be a particularly poor showing for the EPIC concept.
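
For a rough sense of scale, the first number of each pair above can be expressed relative to the Xeon 5160. This is only arithmetic on the figures already listed. Reading each pair as base/peak is an assumption (the post does not label them), and the dictionary layout and the helper `relative_to` are made up for this sketch.

```python
# First-listed scores from the post (assumed to be the base results).
scores = {
    "Opteron 2222": {"int": 14.7, "fp": 14.8},
    "Xeon 5160": {"int": 18.7, "fp": 16.6},
    "Itanium 9140": {"int": 15.7, "fp": 19.9},
}

def relative_to(baseline, name, suite):
    """Score of `name` divided by the score of `baseline`, rounded to 2 places."""
    return round(scores[name][suite] / scores[baseline][suite], 2)

for name in scores:
    print(name,
          relative_to("Xeon 5160", name, "int"),
          relative_to("Xeon 5160", name, "fp"))
```

On these numbers the Itanium 9140 lands at about 0.84x the Woodcrest on int and about 1.2x on FP, with the Opteron 2222 at about 0.79x and 0.89x.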
 

soresu

Diamond Member
Dec 19, 2014
3,689
3,026
136
Intel is no stranger to shooting itself in its own feet, both of them. They destroyed the performance potential of AVX-512 by not releasing it in all their consumer CPUs (yes, even Celerons should've had it, if just a cut-down version).
IMHO this was a relic of the time when they had little/no competition from AMD and were just seeking to monetise their IP down to the dregs, even to the point of segmenting the high end and server ISA feature sets from that of the mainstream and value SKUs.

Not that I am saying it's AMD's fault - only that the lack of competition simply gave them the breathing room to be what Intel always has been.
 

soresu

Diamond Member
Dec 19, 2014
3,689
3,026
136
It's worse than that: apart from the A64FX, every SVE implementation reuses the NEON ALUs for SVE, AFAIK.
So on the Neoverse V1, you can use four-issue 128-bit NEON or two-issue 256-bit SVE.
Which makes the gain from SVE minimal.
This is unlikely to change for any v9.x-A based CPU as NEON is mandatory.

That being said, I suspect that new NEON instructions have been few and far between since SVE2 was announced, and the delta in instruction parity is only going to widen as successive increments of SVE2 become available.

v9.6-A already came with SVE2p2, its 2nd extension.
 

soresu

Diamond Member
Dec 19, 2014
3,689
3,026
136
Oh huh, look what I found when trawling the ARM developer documentation:

+sve2p1: Extends +sve2 to support a number of instructions that have been moved from SME, and includes changes to Streaming SVE mode instructions in SVE2.1 (FEAT_SVE2p1).

I wonder what this augurs for the future of SME/SME2 in ARM Ltd's own CPU µArchs.
 

naukkis

Senior member
Jun 5, 2002
991
841
136
LightningDust's axioms:

1. Any thread containing naukkis will inevitably degenerate into naukkis insisting that anything but length-abstracted vector machines is doomed and obsolete.

It's about vector length agnostic designs - there are full vector ISA implementations, and then there is ARM SVE, a scalable packed-SIMD arch. And ARM SVE doesn't actually work, because its vector register manipulation instructions aren't vector length agnostic but need software support for every supported vector length. And by definition SVE supports every vector length between 128 and 2048 bits in 128-bit steps. OK, later revisions backed off to only supporting power-of-two vector lengths, but still - that design is unusable as vector length agnostic. Arm did try, but after 10 years and practically zero support from both software and hardware makers, I think they should seriously reconsider and make SVE a fixed-length design for maybe 128-, 256- and 512-bit vectors, like AVX-512, to get an actually usable platform which developers could adopt without having nightmares about supporting something that is fundamentally broken.
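
To put numbers on the vector lengths mentioned above, here is a small enumeration. It only counts the lengths described in the post (128 to 2048 bits in 128-bit steps, then the power-of-two subset) and says nothing about which instructions are or are not length agnostic. The variable names are made up for the sketch.

```python
# Every multiple of 128 bits from 128 up to and including 2048.
all_lengths = list(range(128, 2048 + 1, 128))

# Power-of-two subset: a positive integer is a power of two
# exactly when it has a single bit set.
pow2_lengths = [vl for vl in all_lengths if vl & (vl - 1) == 0]

print(len(all_lengths), len(pow2_lengths), pow2_lengths)
```

That is 16 candidate lengths under the original rule against 5 under the power-of-two restriction, which is the size of the space any per-length software handling would have to cover.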
 

Doug S

Diamond Member
Feb 8, 2020
3,083
5,316
136
Oh huh, look what I found when trawling the ARM developer documentation:

+sve2p1: Extends +sve2 to support a number of instructions that have been moved from SME, and includes changes to Streaming SVE mode instructions in SVE2.1 (FEAT_SVE2p1).

I wonder what this augurs for the future of SME/SME2 in ARM Ltd's own CPU µArchs.

I wonder if those changes to SSVE were driven by Apple, to work around the issues that caused them to mostly ignore SSVE in M4/A18 as far as performance goes. Maynard can speak to this better since he suggested a few changes that would make the way Apple wants to do things work much better - and presumably get them to start caring about SSVE performance in future designs.
 

naukkis

Senior member
Jun 5, 2002
991
841
136
I wonder if those changes to SSVE were driven by Apple, to work around the issues that caused them to mostly ignore SSVE in M4/A18 as far as performance goes. Maynard can speak to this better since he suggested a few changes that would make the way Apple wants to do things work much better - and presumably get them to start caring about SSVE performance in future designs.

What Apple did - they took the vector length agnostic parts of SVE, added a few outer matrix registers to better support matrix math, and built a very simple, mostly in-order VPU around it. For general purpose performance they need those vector register manipulation instructions which ARM lacks - or they'd have to implement strong OoO capabilities - which pretty much nullifies the whole idea of a power-efficiency-optimized vector math unit. So I think Apple won't target general purpose SSVE performance in their designs anytime soon.
 

Nothingness

Diamond Member
Jul 3, 2013
3,277
2,329
136
For general purpose performance they need those vector register manipulation instructions which ARM lacks
What do you mean? All of what Apple implemented in their SME unit was fully architected by Arm and has been documented even before M4 release.
 

naukkis

Senior member
Jun 5, 2002
991
841
136
What do you mean? All of what Apple implemented in their SME unit was fully architected by Arm and has been documented even before M4 release.

It is an Apple design - first known as AMX. Arm added it to their instruction set later. It's a subset of SVE - you know your instruction set is great when customers don't want to implement it fully.
 