Question Are scalable vectors the ultimate solution to fixed width SIMD?

igor_kavinski · Jun 29, 2024

This thread is inspired by the following quote from http://www.portvapes.co.uk/?id=Latest-exam-1Z0-876-Dumps&exid=threads/qualcomm-snapdragon-thread.2616013/post-41241585

And for SIMD: the RV approach of making it a vector machine instead of hardwired SIMD makes hardware totally agnostic to SIMD register width.

@naukkis @SarahKerrigan @Nothingness

Is the scalable vector approach going to be a game changer when it's already possible to do it more or less in software with multiple supported targets? https://github.com/google/highway

adroc_thurston · Jun 29, 2024

igor_kavinski said:
Is the scalable vector approach going to be a game changer

no, SVE2 is either 128b for everyone or 512b for nVidia.

igor_kavinski · Jun 29, 2024

adroc_thurston said:
no, SVE2 is either 128b for everyone or 512b for nVidia.

https://eupilot.eu/wp-content/uploads/2022/11/RISC-V-VectorExtension-1-1.pdf

That highlighted part. Can RISC-V really do it and break all performance records?

igor_kavinski · Jun 29, 2024

So RISC-V is stupid? http://www.portvapes.co.uk/?id=Latest-exam-1Z0-876-Dumps&exid=threads/qualcomm-snapdragon-thread.2616013/post-41241618

Nothingness said:
naukkis said:
Jim Keller has been very loud in defending that RV approach.

Jim has been very vocal about any ISA not mattering for high perf CPU.
Anyway, no matter what he said, what do you expect from the CEO of a company doing RISC-V chips?

Hmmm...why would Jim Keller put his reputation on the line for RISC-V in such a public way? Or maybe his RISC-V designs are more sensible?

Nothingness · Jun 29, 2024

igor_kavinski said:
https://eupilot.eu/wp-content/uploads/2022/11/RISC-V-VectorExtension-1-1.pdf

View attachment 102087
That highlighted part. Can RISC-V really do it and break all performance records?

There's a typo in the table: SVE can go up to 2048 bits.

Regarding the 64K max of R-V, two things make this unrealistic: the hardware size, and the difficulty of extracting that much width from existing code. Wide vectors are only useful for very particular HPC workloads.

Anyway I can't go into more details as I don't know the RVV extension.

Nothingness · Jun 29, 2024

igor_kavinski said:
So RISC-V is stupid? http://www.portvapes.co.uk/?id=Latest-exam-1Z0-876-Dumps&exid=threads/qualcomm-snapdragon-thread.2616013/post-41241618

RISC-V is not stupid. It's just based on an archaic, academic vision of RISC that evolved into a marketing war machine.

igor_kavinski said:
Hmmm...why would Jim Keller put his reputation on the line for RISC-V in such a public way? Or maybe his RISC-V designs are more sensible?

This was my answer:

Nothingness said:
Jim has been very vocal about any ISA not mattering for high perf CPU.
Anyway, no matter what he said, what do you expect from the CEO of a company doing RISC-V chips?

I agree with Jim on that: when you are targeting high performance, the ISA stops mattering.
I doubt he ever said R-V was magic sauce that had a technical advantage. The only interesting point is that people can do whatever they want with the ISA, unlike with x86 or Arm.

igor_kavinski · Jun 29, 2024

@Nothingness , any thoughts on Google Highway?

Tuna-Fish · Jun 29, 2024

igor_kavinski said:
already possible to do it more or less in software with multiple supported targets? https://github.com/google/highway

Industry has consistently rejected multiple supported targets for a cool 30 years now.

Compiler vendors sometimes advertise that using their compiler, you can do multiple supported targets and not have to increase testing to test every path. The industry response to that has been more or less this.
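
For readers unfamiliar with the "multiple supported targets" approach: the library or compiler ships several implementations of the same kernel and one is picked at run time based on CPU features. The sketch below is a generic Python illustration, not Highway's actual API; the target names and the `detect_target` helper are hypothetical. It also hints at the testing concern raised above: every path has to be checked against a reference, not just the one your own machine happens to select.

```python
def sum_scalar(xs):
    """Reference implementation: plain scalar loop."""
    total = 0
    for x in xs:
        total += x
    return total


def sum_lanes(xs, width):
    """Toy model of a SIMD path: accumulate into `width` lanes, then reduce."""
    lanes = [0] * width
    for i, x in enumerate(xs):
        lanes[i % width] += x
    return sum(lanes)


# Hypothetical target table: name -> implementation of the same kernel.
TARGETS = {
    "scalar": sum_scalar,
    "simd128": lambda xs: sum_lanes(xs, 4),
    "simd512": lambda xs: sum_lanes(xs, 16),
}


def detect_target(supported):
    """Hypothetical feature check: pick the widest target the CPU supports."""
    for name in ("simd512", "simd128", "scalar"):
        if name in supported:
            return name
    raise ValueError("no usable target")


def dispatch_sum(xs, supported):
    """Run the kernel on whichever path the feature check selects."""
    return TARGETS[detect_target(supported)](xs)


def untested_paths(tested):
    """Paths that ship in the binary but were never exercised by the tests."""
    return sorted(set(TARGETS) - set(tested))


if __name__ == "__main__":
    data = list(range(100))
    print(dispatch_sum(data, {"scalar", "simd128"}))
    print(untested_paths(["simd128"]))
```

A developer whose machine only ever selects `simd128` would, in this model, ship two paths that were never run, which is the cost the post is pointing at.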

Nothingness · Jun 29, 2024

igor_kavinski said:
@Nothingness , any thoughts on Google Highway?

I never played with it (I tend to use inline assembly for SIMD stuff), but I heard good things about it on Realworldtech.

Link to an interesting article: Evaluation of C++ SIMD Libraries

igor_kavinski · Jun 29, 2024

Nothingness said:
Link to an interesting article: Evaluation of C++ SIMD Libraries

I had no idea that OpenMP was so outclassed and hopeless!

naukkis · Jun 29, 2024

Vector ISAs like RV or those Crays from the 1970s and 80s express vectors as loops of scalar instructions. Vectors can be any length; with RV there's that practical 64K limit. And those vectors can be executed on any kind of hardware, from scalar up to 64K-wide SIMD. Whether wider SIMD execution units give a performance uplift depends on loop length, but existing code can extract more performance from wider SIMD hardware if the parallelism is in the code.

I actually wonder why only RV uses a full vector ISA. It seems every other ISA vendor wants to stay on fixed-length SIMD for cheap hardware implementations instead of code reusability. SVE sucks as badly as any other fixed-length SIMD instruction set, or even more so with that braindead scheme to support variable SIMD hardware. No wonder nobody wants to use it instead of NEON.
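
The vector-length-agnostic loop described in this post can be illustrated with a toy model. The sketch below is plain Python, not RVV code: `setvl` is a hypothetical helper standing in for what RVV's `vsetvli` instruction does in the simple case (grant the smaller of the remaining element count and the hardware maximum), and `vlmax` models how many elements the hardware handles per vector instruction. The only point is that the same loop produces the same result on narrow and wide hardware; just the iteration count changes.

```python
def setvl(remaining, vlmax):
    """Hypothetical, simplified stand-in for RVV's vsetvli: grant up to vlmax elements."""
    return min(remaining, vlmax)


def vector_add(a, b, vlmax):
    """Strip-mined elementwise add. Returns (result, number of loop iterations)."""
    assert len(a) == len(b) and vlmax >= 1
    out = []
    i = 0
    iterations = 0
    while i < len(a):
        vl = setvl(len(a) - i, vlmax)  # the "hardware" decides the chunk size
        # One modeled vector instruction: process vl elements at once.
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
        iterations += 1
    return out, iterations


if __name__ == "__main__":
    a = list(range(10))
    b = [10 * x for x in a]
    for vlmax in (1, 4, 16):  # scalar-like, narrow, and wide hardware
        result, iters = vector_add(a, b, vlmax)
        print(vlmax, iters, result == [11 * x for x in a])
```

With 10 elements the loop runs 10, 3, and 1 times for a `vlmax` of 1, 4, and 16. That is also why the uplift from wider hardware depends on loop length: once one iteration covers the whole array, going wider gains nothing.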

igor_kavinski · Jun 29, 2024

naukkis said:
But CPU designs are about to reach such big OoO windows that separating address generation from the actual load/store will become beneficial for extreme-performance designs. RV is right there because it lacks that cheap implementation currently used.

Which CPU designs are you referring to? Zen 6? Apple M5? Upcoming ARM designs?

igor_kavinski · Jun 29, 2024

naukkis said:
cheap hardware implementations

You mean cheap in terms of transistor count?

naukkis · Jun 29, 2024

igor_kavinski said:
Which CPU designs are you referring to? Zen 6? Apple M5? Upcoming ARM designs?

Every high-performance design is heading toward a thousand-instruction window and beyond.

naukkis · Jun 29, 2024

igor_kavinski said:
You mean cheap in terms of transistor count?

Yeah, those simple predictable addressing modes can be handled pretty much with fixed-function logic. But the hardware needs a massive out-of-order window, heading toward a thousand instructions, to be able to pick those load instructions far enough ahead of the rest of the code that data loads won't stall execution. In the RV model it's also possible to just move the address calculations ahead of the load instructions to achieve the same effect and, if used wisely, possibly greatly outperform those fixed-function designs. No such hardware/software implementations are out yet, so this is of course only speculation about what may come. There might be some really well-performing RV designs in a few years.

igor_kavinski · Jun 29, 2024

naukkis said:
Every high-performance design is heading toward a thousand-instruction window and beyond.

How do the current designs look in this metric? Do you have instruction window numbers for Apple Mx, Snapdragon X Elite, Zen 4 and Lunar Lake?

Nothingness · Jun 29, 2024

naukkis said:
Every high-performance design is heading toward a thousand-instruction window and beyond.

Indeed and all high perf CPUs have uop split, uop fusion, large OoOE windows, etc.

That's why having poor addressing modes is silly. You gain exactly nothing (except larger code size and increased register pressure due to an extra reg needed for address computation) since your uarch is already very complex. There's nothing to gain by being as simplistic as RISC-V for high performance. I wonder why R-V has reg + imm addressing mode since you could compute that before doing the memory access 🙄

FlameTail · Jun 29, 2024

adroc_thurston said:
no, SVE2 is either 128b for everyone or 512b for nVidia.

Is that for Nvidia Grace?

naukkis · Jun 29, 2024

Nothingness said:
I wonder why R-V has reg + imm addressing mode since you could compute that before doing the memory access 🙄

From the CPU hardware's point of view that's a near pointer, operating in a 4KB range. The RV design is very well done; there's not much that could be done better.
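
As background for the "4KB range" remark: RISC-V loads and stores take a 12-bit signed immediate, so a single instruction reaches offsets from -2048 to +2047 around the base register. Larger offsets are typically built by splitting the value into an upper part (materialized with `lui` or `auipc`) and a 12-bit lower part. The sketch below is a Python model of that arithmetic only; the function names are hypothetical and it is not tied to any particular compiler's output.

```python
IMM_BITS = 12
IMM_MIN = -(1 << (IMM_BITS - 1))     # -2048
IMM_MAX = (1 << (IMM_BITS - 1)) - 1  # +2047


def fits_in_load_immediate(offset):
    """True if a single reg + imm load/store can encode this offset."""
    return IMM_MIN <= offset <= IMM_MAX


def split_offset(offset):
    """Split offset into (hi, lo) with offset == (hi << 12) + lo.

    Adding 0x800 before shifting rounds hi so that lo always lands in the
    signed 12-bit range, because the hardware sign-extends lo.
    """
    hi = (offset + 0x800) >> IMM_BITS
    lo = offset - (hi << IMM_BITS)
    return hi, lo


if __name__ == "__main__":
    for off in (0, 2047, 2048, 4096, 100000, -2049):
        hi, lo = split_offset(off)
        print(off, fits_in_load_immediate(off), hi, lo)
```

Any field of a structure that sits more than 2047 bytes past the base pointer falls outside the single-instruction case in this model and needs the upper part computed separately, which is the extra work the posts that follow argue about.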

Nothingness · Jun 29, 2024

FlameTail said:
Is that for Nvidia Grace?

NVIDIA Grace is using Neoverse-V2, so it's not 512-bit.

IIRC the only 512-bit SVE (not SVE2) implementation is Fujitsu A64FX.

Nothingness · Jun 29, 2024

naukkis said:
From the CPU hardware's point of view that's a near pointer, operating in a 4KB range. The RV design is very well done; there's not much that could be done better.

You still need a full adder for that, no matter what the range is.

naukkis · Jun 29, 2024

Nothingness said:
You still need a full adder for that, no matter what the range is.

The fast path only needs to calculate the lower address bits. Whether or not the ISA supports longer immediates, well-written code is optimized for pages. The 4KB immediate range forces coders and compilers to produce better code than they would if a larger operating range were allowed.

Nothingness · Jun 29, 2024

naukkis said:
The fast path only needs to calculate the lower address bits. Whether or not the ISA supports longer immediates, well-written code is optimized for pages. The 4KB immediate range forces coders and compilers to produce better code than they would if a larger operating range were allowed.

That's so naive I'm speechless.

Do you have any experience programming larger programs?

naukkis · Jun 29, 2024

Nothingness said:
That's so naive I'm speechless.

Do you have any experience programming larger programs?

Those general code optimization rules will stay as long as hardware is page-based. Optimal data access patterns are full pages, as those are easy to cache. Those indexed addressing modes are usually the worst for cache optimization, as scaling addresses will very easily result in running out of cache ways, though code optimization nowadays is pretty much the compiler's problem.

Nothingness · Jun 29, 2024

naukkis said:
Those general code optimization rules will stay as long as hardware is page-based. Optimal data access patterns are full pages, as those are easy to cache. Those indexed addressing modes are usually the worst for cache optimization, as scaling addresses will very easily result in running out of cache ways, though code optimization nowadays is pretty much the compiler's problem.

That's utter nonsense.

Will you do all memory allocations so that they're 4KB page aligned? And will you ensure all your data structure sizes will be <4KB? Can't wait to see how you achieve that in real life. If you have to do so because your ISA is limited, it's a dead-end.
