Discussion ARM Cortex/Neoverse IP + SoCs (no custom cores) Discussion

DrMrLordX · Feb 9, 2025

soresu said:
RK3588 was announced long before it actually went to fabs I think.

The specs they originally announced were different to what they later made.

Yup, there was a long wait for it since people were excited for an SBC SoC with something other than A72 on it. Took way too long to get to market, hence my comment about the O6 and RK3688.

Nothingness · Feb 10, 2025

soresu said:
A bit of reading seems to imply by SIMD's basic definition that is a vector, depending on what you classify to be a vector:

Row and column vectors - Wikipedia

en.wikipedia.org

For some a vector can only look like CDC and Cray vector implementations. Anyway the point is moot: the article under discussion clearly classifies both R-V vector and Arm SVE as vector extensions:

Vector instructions such as RISC-V vector extension [9] and ARMSVE [18] have recently been introduced in general-purpose CPUs [1,17]. A vector instruction processes multiple data elements in atime-division manner, achieving high performance with lower hardware cost.

They used RISC-V for their study for obvious reasons (the main being the availability of a cycle accurate model, Spike).

soresu · Feb 10, 2025

DrMrLordX said:
Yup, there was a long wait for it since people were excited for an SBC SoC with something other than A72 on it. Took way too long to get to market, hence my comment about the O6 and RK3688.

Ye, the Amlogic 'competitor' S928X took even longer to get to market and on a larger node to boot.

DZero · Feb 10, 2025

soresu said:
IIRC the GPU spec changed from one more contemporary to the A76 to the G610.

I might be misremembering things though.

RockChip have an annoying tendency to be ambiguous with specs of future SoCs sometimes, like the RK3688 mentions a v9.3-A CPU core, but according to latest rumours of X930 and Ax30 they are v9.4-A ISA instead 😒

Wait, there isn't v9.3 CPU core?

soresu · Feb 10, 2025

DZero said:
Wait, there isn't v9.3 CPU core?

Not sure at this point.

2025 IP ISA increment has not been confirmed, only that there will be an increment.

Nothingness · Feb 10, 2025

DZero said:
Wait, there isn't v9.3 CPU core?

IIRC v9.3 contains almost nothing of interest. In particular SME2 can be implemented with v9.2.

naukkis · Feb 10, 2025

soresu said:
A bit of reading seems to imply by SIMD's basic definition that is a vector, depending on what you classify to be a vector:

Row and column vectors - Wikipedia

en.wikipedia.org

Vector cpu vectors are loops of independent scalar operations which can take different paths. Direct not abstracted SIMD hardware uses packed vectors -without shuffle instructions hardware is identical to scalar but operate with packed vectors instead of scalar operations. OOO is so identical for SIMD hardware as scalar hardware - one op per vector. Vector ISA code instead have hundreds of possible ops per vector.

naukkis · Feb 10, 2025

MS_AT said:
Here is a link to hotchips slides for SXAurora Vector Engine Proccessor https://old.hotchips.org/hc30/2conf/2.14_NEC_vector_NEC_SXAurora_TSUBASA_HotChips30_finalb.pdf, Anandtech did a live blog on it here https://www.anandtech.com/show/13259/hot-chips-2018-nec-vector-processor-live-blog. The blog post mentions OoO, while the slides are using OoO scheduling. You can also find SX-ACE slides here https://old.hotchips.org/wp-content...e-epub/HC26.11.110-SX-ACE-MOMOSE-NEC-v004.pdf

From high level point of view they seem similar, SX-ACE hotchips slides don't mention OoO explicitly as far as I can tell. But still Aurora seems like an evolution of ACE so they thought that adding OoO scheduling is important.

My vector cpu knowledge might be from 80's but Nec designers in that ACE document mention that they consider their vector design OOO. It's just software based as hardware resolving memory dependencies in system where programmer/compiler has already packed massive mostly independent instructions into vectors seems like a way to wreck whole design. But maybe Intel EPIC would have eventually also develop into hardware-OOO machinery......

LightningDust · Feb 10, 2025

naukkis said:
But maybe Intel EPIC would have eventually also develop into hardware-OOO machinery......

Intel experimented with it. The final gen, Poulson, was also dynamically scheduled and could have been extended to full OoO relatively easily.

naukkis · Feb 10, 2025

LightningDust said:
Intel experimented with it. The final gen, Poulson, was also dynamically scheduled and could have been extended to full OoO relatively easily.

For EPIC it was only potential way to go forward. For vector cpu it ain't. All vector cpu's also have scalar side of cpu which can handle situations where vector side performs poorly. There have not been, yet, vector cpu where scalar side have packed SIMD support but RV might evolve to that direction.

LightningDust · Feb 10, 2025

naukkis said:
For EPIC it was only potential way to go forward.

It really wasn't.

soresu · Feb 10, 2025

naukkis said:
Vector cpu vectors are loops of independent scalar operations which can take different paths. Direct not abstracted SIMD hardware uses packed vectors -without shuffle instructions hardware is identical to scalar but operate with packed vectors instead of scalar operations. OOO is so identical for SIMD hardware as scalar hardware - one op per vector. Vector ISA code instead have hundreds of possible ops per vector.

Too many vectors.....

Sorry, had to be done 😆

Nothingness · Feb 11, 2025

naukkis said:
Vector ISA code instead have hundreds of possible ops per vector.

I invite you to show us some RISC-V vector code that demonstrates that it fits that "definition" which I don't understand.

soresu · Feb 11, 2025

Nothingness said:
I invite you to show us some RISC-V vector code that demonstrates that it fits that "definition" which I don't understand.

I can only assume that they mean instead of SIMD's "every problem must fit 4 hammers at once" take that 'vector ISA' code can do a lot more than 4 operations per instruction, or possibly any number of operations from 2 to whatever the limit is.

I can only assume that the limit is determined by the number of ALUs and their size.

So if you had a 128 bit ALU you could do 64 x 2 bit ops, or 32 x 4 bis, or 16 x 8 bits and so on and so on.

IIRC rapid packed math in GPUs required actually changing the SIMD ALUs so that you could actually get double rate FP16, where as prior to that you just got full rate FP32 with FP16 at the same speed despite the halved precision.

My guess would be that he is implying a true vector ISA is built to do all of these possible variations from the ground up.

DrMrLordX · Feb 14, 2025

On the subject of Neoverse:

Arm is launching its own chip this year with Meta as a customer | TechCrunch

Arm's foray into making its own chips internally will turn some of its existing customers into competitors.

techcrunch.com

Very interesting when taken in context of the apparent buyout of Ampere by SoftBank.

naukkis · Feb 15, 2025

soresu said:
I can only assume that they mean instead of SIMD's "every problem must fit 4 hammers at once" take that 'vector ISA' code can do a lot more than 4 operations per instruction, or possibly any number of operations from 2 to whatever the limit is.

Don't confuse vector isa to SIMD. Vector cpu binary language is to present loops of scalar instructions as vectors. RVV vector max length is 64KB. Vector cpu's don't have to have SIMD hardware at all, all code could be executed just fine with scalar execution units like it was done on first vector cpus. But that code presented as vectors is also executable with SIMD hardware - at any SIMD length with same binaries. It really seems that most people don't even understand whole basic idea behind vector cpus.

naukkis · Feb 15, 2025

Nothingness said:
I invite you to show us some RISC-V vector code that demonstrates that it fits that "definition" which I don't understand.

Vector cpus vectors are independent scalar ops. Vector cpu''s can usually chain those ops between execution units and if data addresses are known there's no need for hardware out-of-ordering. OOO will be needed when addressing is calculated dynamically on the fly and when doing so there's as many tracked ops as there's scalar ops in vectors as they are independent ops and not packed to solid packed vectors. With low-width SIMD hardware it's still possible to do hardware data tracking and ordering but as those NEC engineers noted, it's pretty impractical to track and rearrange wide alu count vector machine ops, like those 256 per cycle on that Nec design. Hardware limitation makes it possible to do either wide in-order execution units or much narrower OOO units.

naukkis · Feb 15, 2025

LightningDust said:
It really wasn't.

Hardware can easily detect data patterns from runtime execution which cannot predict in compile time. When it detects those and those are critical for code timing it's only logical to add out-of-order hardware to execute those in advance. Intel hardware designers did know what to do - but such a complex ISA implementation made hardware implementations too complex to be any competitive against simpler designs.

Nothingness · Feb 15, 2025

naukkis said:
Vector cpus vectors are independent scalar ops. Vector cpu''s can usually chain those ops between execution units and if data addresses are known there's no need for hardware out-of-ordering. OOO will be needed when addressing is calculated dynamically on the fly and when doing so there's as many tracked ops as there's scalar ops in vectors as they are independent ops and not packed to solid packed vectors. With low-width SIMD hardware it's still possible to do hardware data tracking and ordering but as those NEC engineers noted, it's pretty impractical to track and rearrange wide alu count vector machine ops, like those 256 per cycle on that Nec design. Hardware limitation makes it possible to do either wide in-order execution units or much narrower OOO units.

So can you exhibit R-V code that demonstrates that it is more of a vector ISA than SVE?

naukkis · Feb 15, 2025

Nothingness said:
So can you exhibit R-V code that demonstrates that it is more of a vector ISA than SVE?

I don't understand why? RVV is vector isa, pure and clean. SVE instead is scalable packed SIMD - which doesn't actually work beyond academic use cases. In vector isa underlying hardware is totally abstracted, code just works in any hardware if ISA and hardware are bug free. SVE instead is relying software support for different width SIMD hardware -that ain't newer worked and probably newer will. I really don't know why ARM wants to push that braindead solution which both software and hardware vendors don't want to use.

And about where that discussion started - no, hardware vendors don't want or should make OOO arm core with in-order SVE. SVE is OOO-friendly and and removing OOO would just make it slower, specially when running with 128bit vectors where SVE is performing well.

DZero · Feb 15, 2025

Heck... I just wanted an out of order small core and we have vectors on there T_T

LightningDust · Feb 15, 2025

naukkis said:
Hardware can easily detect data patterns from runtime execution which cannot predict in compile time. When it detects those and those are critical for code timing it's only logical to add out-of-order hardware to execute those in advance. Intel hardware designers did know what to do

I'm not saying out-of-order isn't useful. I'm saying EPIC performance could have been scaled fairly easily, and that Intel microarchitects had a clear idea of how they would go about it if there was going to be an extended IPF roadmap.

but such a complex ISA implementation made hardware implementations too complex to be any competitive against simpler designs.

On the contrary, IPF cores were fairly small; Itanium silicon was dominated by SRAM (and the small cores compared to, say, Power meant that IPF was able to bring LLC on-die almost a decade prior to IBM.) Additionally, Itanium was competitive against comparable RISC and x86 server processors as long as Intel and HP were continuing to seriously invest in it. Even after the Montecito fiasco, where a late-breaking erratum in novel power management features caused Montecito to be delayed by a year and to lose 15% of its projected clock speed, the resulting part was essentially performance-competitive at release. Itanium silicon only became uncompetitive when RISC/UNIX as a whole had started to decline. By that stage, there were non-technical considerations in play - I have an informed suspicion that Poulson, a massive improvement, was deliberately held back by at least a year so that the hilariously bad Tukwila could have a full sales lifecycle.

igor_kavinski · Feb 15, 2025

LightningDust said:
On the contrary, IPF cores were fairly small; Itanium silicon was dominated by SRAM (and the small cores compared to, say, Power meant that IPF was able to bring LLC on-die almost a decade prior to IBM.) Additionally, Itanium was competitive against comparable RISC and x86 server processors as long as Intel and HP were continuing to seriously invest in it. Even after the Montecito fiasco, where a late-breaking erratum in novel power management features caused Montecito to be delayed by a year and to lose 15% of its projected clock speed, the resulting part was essentially performance-competitive at release. Itanium silicon only became uncompetitive when RISC/UNIX as a whole had started to decline. By that stage, there were non-technical considerations in play - I have an informed suspicion that Poulson, a massive improvement, was deliberately held back by at least a year so that the hilariously bad Tukwila could have a full sales lifecycle.

Welcome back, Sarah Kerrigan

Nothingness · Feb 15, 2025

naukkis said:
I don't understand why? RVV is vector isa, pure and clean. SVE instead is scalable packed SIMD - which doesn't actually work beyond academic use cases. In vector isa underlying hardware is totally abstracted, code just works in any hardware if ISA and hardware are bug free. SVE instead is relying software support for different width SIMD hardware -that ain't newer worked and probably newer will. I really don't know why ARM wants to push that braindead solution which both software and hardware vendors don't want to use.

I think we already went through this. I'd really like to see code VL agnostic and make comparisons of SVE vs R-V vector extension.

I've seen VL agnostic SVE code that doesn't need a single change for different VL. But I guess there are cases where that doesn't work (shuffles?) and I'd be interested in seeing how R-V handles that.

LightningDust · Feb 15, 2025

igor_kavinski said:
Welcome back, Sarah Kerrigan

It couldn't be helped. IPF slander won't be tolerated while I draw breath.

Discussion ARM Cortex/Neoverse IP + SoCs (no custom cores) Discussion

Lifer

Diamond Member

Diamond Member

Senior member

Diamond Member

Diamond Member

Senior member

Senior member

Member

Senior member

Member

Diamond Member

Diamond Member

Diamond Member

Lifer

Senior member

Senior member

Senior member

Diamond Member

Senior member

Senior member

Member

Lifer

Diamond Member

Member