Question Are scalable vectors the ultimate solution to fixed width SIMD?


Jan Olšan

Senior member
Jan 12, 2017
314
407
136
Scalable vectors supposedly have upsides (write once, run forever), but those really aren't proven yet. I've already seen opinions that if you write and tune a certain algorithm now, it'll likely be optimal only for 128-bit and you may lose performance on future CPUs with wider units.
The big problem, of course, is that there are no CPUs with wider units at the moment. It's an unproven concept, at least in part.*

There are some known weaknesses, as there are algorithms you want to use SIMD for where not having a defined width for your operations makes them hard or complicated to handle.
Shuffle instructions, which are critical for more complicated algorithms beyond "just needs a lot of FMAs, CPU" tasks, are somewhat at odds with scalable-width SIMD instructions. This can probably be surmounted, at least sometimes, but it adds complexity. And again, it may lead to suboptimal performance.

I suspect that is something you can list as a disadvantage: suboptimal performance compared to the potential you could have with a fixed-width SIMD instruction set.

* Technically, you could write 128-bit SVE code and then go and test it on the very rare Fujitsu CPU or on an older processor with Neoverse V1 cores. That will only allow you to test the older, floating-point-focused part of the instruction set, not full SVE2.
Also, I think you really want to try a wider spread of widths, not just 128/256, to really see how it behaves (the Fujitsu with its 512-bit width would be better for this, but if SVE allows up to 2048 bits, that's still far from a perfect test of the more hardcore configs).
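
For concreteness, the "write once, run forever" style looks something like the minimal sketch below, using the Arm C Language Extensions (ACLE) for SVE. The saxpy kernel and function name are just an illustration, but the intrinsics are the standard ACLE ones, and nothing in the loop hard-codes the vector width:

#include <arm_sve.h>
#include <stdint.h>

/* Vector-length-agnostic saxpy: y[i] += a * x[i].
 * svcntw() returns how many 32-bit elements one hardware vector holds,
 * and the svwhilelt predicate handles the loop tail, so nothing here
 * depends on the actual vector width. */
void saxpy(float a, const float *x, float *y, int64_t n)
{
    for (int64_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32_s64(i, n);   /* active lanes for this pass */
        svfloat32_t vx = svld1_f32(pg, x + i);
        svfloat32_t vy = svld1_f32(pg, y + i);
        vy = svmla_n_f32_x(pg, vy, vx, a);       /* vy += vx * a */
        svst1_f32(pg, y + i, vy);
    }
}

The same binary then runs on a 128-bit Neoverse part and on the 512-bit Fujitsu A64FX; whether it runs equally well on both is exactly the open question above.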
 

camel-cdr

Junior Member
Feb 23, 2024
15
54
51
Scalable vectors supposedly have upsides (write once, run forever), but those really aren't proven yet. I've already seen opinions that if you write and tune a certain algorithm now, it'll likely be optimal only for 128-bit and you may lose performance on future CPUs with wider units.
The big problem, of course, is that there are no CPUs with wider units at the moment. It's an unproven concept, at least in part.*

There are some known weaknesses, as there are algorithms you want to use SIMD for where not having a defined width for your operations makes them hard or complicated to handle.
Shuffle instructions, which are critical for more complicated algorithms beyond "just needs a lot of FMAs, CPU" tasks, are somewhat at odds with scalable-width SIMD instructions. This can probably be surmounted, at least sometimes, but it adds complexity. And again, it may lead to suboptimal performance.

I suspect that is something you can list as a disadvantage: suboptimal performance compared to the potential you could have with a fixed-width SIMD instruction set.

* Technically, you could write 128-bit SVE code and then go and test it on the very rare Fujitsu CPU or on an older processor with Neoverse V1 cores. That will only allow you to test the older, floating-point-focused part of the instruction set, not full SVE2.
Also, I think you really want to try a wider spread of widths, not just 128/256, to really see how it behaves (the Fujitsu with its 512-bit width would be better for this, but if SVE allows up to 2048 bits, that's still far from a perfect test of the more hardcore configs).
I don't think there is a problem with the shuffle instructions in a scalable context for application-class processors.

However, it's unrealistic for them to scale much beyond a vector length of 512/1024, since the complexity of a fast, one-cycle-throughput shuffle implementation grows quadratically with vector length. It grows only linearly with vector length if you shuffle a fixed N elements at a time (in one uop). But fixed-width ISAs would have the same problem if they chose a larger SIMD width.

There is a problem, not due to the scalability, but rather due to the free-for-all nature of RISC-V (this doesn't apply as much to SVE, since Arm has the final word), which allows you to target your processor at vastly different applications that prioritize different instructions. For an AI accelerator, you wouldn't spend the silicon to make a permute fast, since you only use it in setup code and never in hot loops. Meanwhile, a fast permute is, as you said, very useful for general-purpose computation, and will be a must for application-class cores.

On the RISC-V side, there are currently two boards with RVV 1.0 support available, one with a 128-bit (XuanTie C908) and one with a 256-bit (SpacemiT X60) vector length. The devboards that have been concretely announced to release this year, or at the beginning of next year, also feature a variety of vector lengths: Milk-V Oasis (16 SiFive P670 cores with VLEN=128 and 4 SiFive X280 cores with VLEN=512 as an AI accelerator), and Andes Qilai (4 AX45MP cores without vector and one NX27V accelerator core with VLEN=512, reportedly quadruple issue with a 512-bit datapath 🤤).

If you look at open-source implementations, there is also Ara, with a vector length of up to 16K, and XiangShan, with 128-bit vectors but a high-performance out-of-order design. You can play with them using RTL simulation, but Ara still has some bugs, and the XiangShan vector implementation currently only works properly on their MinimalConfig, which is still multi-issue out-of-order, but scaled down quite a bit.

As I've written in my earlier comment, I think a lot of problems (if not most, and not just the "bunch of FMAs" kind) can benefit from and scale properly with larger vector lengths, but some will still require specialization, or can only take advantage of smaller vector lengths, due to the nature of the problem.
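
To make the shuffle discussion concrete, here is a minimal vector-length-agnostic sketch using the standard RVV v1.0 C intrinsics (the function name and the reversal kernel are just an illustration). vrgather.vv is the general permute whose hardware cost the scaling argument above is about, and nothing here assumes a particular VLEN:

#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse a buffer of 32-bit elements with vrgather.vv, without ever
 * naming the vector length: vsetvl picks it at run time. */
void reverse_u32(const uint32_t *src, uint32_t *dst, size_t n)
{
    for (size_t i = 0; i < n;) {
        size_t vl = __riscv_vsetvl_e32m1(n - i);
        vuint32m1_t v = __riscv_vle32_v_u32m1(src + i, vl);
        /* build the index vector vl-1, vl-2, ..., 0 */
        vuint32m1_t idx = __riscv_vid_v_u32m1(vl);
        idx = __riscv_vrsub_vx_u32m1(idx, (uint32_t)(vl - 1), vl);
        vuint32m1_t rev = __riscv_vrgather_vv_u32m1(v, idx, vl);
        /* the reversed chunk lands at the mirrored position in dst */
        __riscv_vse32_v_u32m1(dst + n - i - vl, rev, vl);
        i += vl;
    }
}

On a VLEN=128 core this handles four elements per iteration, on VLEN=512 sixteen; the source doesn't change, only the cost of the vrgather does.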
 

Schmide

Diamond Member
Mar 7, 2002
5,590
724
126
One caveat that is often glossed over in the scalable vector space: operations still happen within a lane (128 bits).

So if you do an interleave (unpack on x86, vzip on Arm), all operations happen within a lane.

Using AVX doubles to simplify ("|" represents a lane boundary):

vector 1 = ab | cd
vector 2 = ef | gh

unpacklow 1 2 = ae | cg
unpackhigh 1 2 = bf | dh

As you can see, the operations stay in their lanes. This requires one or more special cross-lane operations to readjust the data.

To linearize the data, a permute is applied.

After the unpack, the vectors are now

vector 1 = ae | cg
vector 2 = bf | dh

permute 1 2 (0x20) = ae | bf
permute 1 2 (0x31) = cg | dh

Now the interleave is linear with respect to memory.

The more lanes you have in a vector, the more extra steps are needed to readjust the operation.

So:

SSE has 1 lane, so no extra operations are needed.

AVX has 2 lanes, so 1 extra operation is needed (as above).

AVX-512 has 4 lanes, so 2 extra operations are needed.

A 1024-bit vector would require 4 extra operations.

Nothing is really free in vector land.
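
For reference, the two-step interleave above maps directly onto intrinsics. A minimal C sketch (the literal values 1..8 stand in for a..h; the intrinsics are the standard AVX ones from immintrin.h):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    /* vector 1 = ab | cd, vector 2 = ef | gh */
    __m256d v1 = _mm256_setr_pd(1.0, 2.0, 3.0, 4.0);
    __m256d v2 = _mm256_setr_pd(5.0, 6.0, 7.0, 8.0);

    /* in-lane unpacks: ae | cg and bf | dh */
    __m256d lo = _mm256_unpacklo_pd(v1, v2);
    __m256d hi = _mm256_unpackhi_pd(v1, v2);

    /* the extra cross-lane step: ae | bf and cg | dh */
    __m256d r0 = _mm256_permute2f128_pd(lo, hi, 0x20);
    __m256d r1 = _mm256_permute2f128_pd(lo, hi, 0x31);

    double out[8];
    _mm256_storeu_pd(out, r0);
    _mm256_storeu_pd(out + 4, r1);
    for (int i = 0; i < 8; i++)
        printf("%g ", out[i]);   /* 1 5 2 6 3 7 4 8: linear in memory */
    return 0;
}

Compile with -mavx; the unpacks are cheap in-lane operations, while _mm256_permute2f128_pd is the extra cross-lane step being counted above.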
 

soresu

Platinum Member
Dec 19, 2014
2,968
2,192
136
FEX is for running x86 on Arm, no?
We want the reverse here
Oh, misread that.

Hmmm, dynarmic can do that too, but I don't know if it works for more general use rather than just for console emulators.

Unfortunately it was essentially tied to the Yuzu code, so the OG repo for dynarmic got yeeted along with everything else; lots of repo clones exist tho.

Stupid Ninty forgot that OSS is a hydra - cut one head off and many more grow to replace it 🤘
 
Jul 27, 2020
17,965
11,709
116
Stupid Ninty forgot that OSS is a hydra - cut one head off and many more grow to replace it 🤘
They are small enough (relative to the other console giants) that someone should buy them out and open source their EVERYTHING!

By the way, now a new thought entered my mind: what about a future Nintendo console/handheld with RISC-V CPU using 512-bit scalable vectors?
 

Nothingness

Platinum Member
Jul 3, 2013
2,757
1,405
136
Hmmm, dynarmic can do that too, but I don't know if it works for more general use rather than just for console emulators.

Unfortunately it was essentially tied to the Yuzu code, so the OG repo for dynarmic got yeeted along with everything else; lots of repo clones exist tho.

Stupid Ninty forgot that OSS is a hydra - cut one head off and many more grow to replace it 🤘
QEMU is much more reliable as far as Arm architecture compliance goes, and its speed, though not the best, is good enough for running almost everything (last time I checked, it could run about 1 billion Arm instructions per second in syscall emulation mode). In my biased opinion, it's the best tool to experiment with the Arm architecture on x86.
 

SarahKerrigan

Senior member
Oct 12, 2014
604
1,469
136
They are small enough (relatively speaking to the other console giants) that someone should buy them out and open source their EVERYTHING!

Do you have seventy billion dollars sitting around?

By the way, now a new thought entered my mind: what about a future Nintendo console/handheld with RISC-V CPU using 512-bit scalable vectors?

Er, from what SoC vendor? Also why?
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,689
1,224
136
By the way, now a new thought entered my mind: what about a future Nintendo console/handheld with RISC-V CPU using 512-bit scalable vectors?
Maybe Tenstorrent (hopefully).

Why? AI-powered games!
Minus the 512-bit vector stuff. DAMO (Discovery, Adventure, Momentum and Outlook) Academy and MiHoYo are somewhat connected.

Dharma SoCs w/ C930+DXT(?) + OpenHarmony (RISC-V Large System) + MiHoYo should mean these games will be running on RISC-V/RVV eventually:
Honkai Impact 3rd
Genshin Impact
Honkai Star Rail
Zenless Zone Zero
Cross-IP PUBG/Fortnite-esque game
Honkai MMORPG

// This is relatively minor compared to the brain-computer interface and nuclear fusion investments.
// Mihoyo+Alibaba deep collaboration = "The friendship between miHoYo and Alibaba Cloud has lasted for eight years." - 2021 article
 

DrMrLordX

Lifer
Apr 27, 2000
21,806
11,161
136
No but a bunch of billionaires do. We gotta entice them!

There are better uses for $70 billion. Plus let Raja rest. He probably wants to be semi-retired. If not then I question his sanity. Now is a great time for him to just kick back, relax, and let a younger generation develop megalomaniacal tendencies er I mean uh innovate in the semiconductor industry.
 

naukkis

Senior member
Jun 5, 2002
782
636
136
One caveat that is often glossed over in the scalable vector space: operations still happen within a lane (128 bits).

So if you do an interleave (unpack on x86, vzip on Arm), all operations happen within a lane.

Using AVX doubles to simplify ("|" represents a lane boundary):

vector 1 = ab | cd
vector 2 = ef | gh

unpacklow 1 2 = ae | cg
unpackhigh 1 2 = bf | dh

As you can see, the operations stay in their lanes. This requires one or more special cross-lane operations to readjust the data.

To linearize the data, a permute is applied.

After the unpack, the vectors are now

vector 1 = ae | cg
vector 2 = bf | dh

permute 1 2 (0x20) = ae | bf
permute 1 2 (0x31) = cg | df

Now the interleave is linear with respect to memory.

The more lanes you have in a vector, the more extra steps are needed to readjust the operation.

So:

SSE has 1 lane, so no extra operations are needed.

AVX has 2 lanes, so 1 extra operation is needed (as above).

AVX-512 has 4 lanes, so 2 extra operations are needed.

A 1024-bit vector would require 4 extra operations.

Nothing is really free in vector land.

RVV is a vector ISA, not SIMD. You are talking about a SIMD architecture; a vector ISA is totally different. Vector-ISA vectors aren't fixed-size but variable, so your example cannot be turned 1:1 into a vector example. BTW, the second permute seems to have a typo; you probably meant dh instead of df. But if a vector ISA needs to make changes to vector registers, it either compresses masked data into a different vector or gathers elements with an index vector. RVV actually has a pretty RISC view of doing those kinds of permutations: traditional vector CPUs have compress/expand and gather/scatter to vector registers, but RVV only has compress and gather. So there's no easy way to program by shuffling vectors back and forth; the only option is to generate code that targets the high-performance path.
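
As an illustration of the compress-only model, here is a minimal stream-compaction sketch, assuming the standard RVV v1.0 C intrinsics (the function name is hypothetical). vcompress.vm packs the lanes selected by the mask to the front of the register, and since there is no register-to-register expand, the inverse direction has no equally direct instruction:

#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

/* Stream compaction: copy the nonzero elements of src to dst, packed,
 * and return how many were kept. vcompress.vm moves the masked lanes
 * to the front of the register; counting the mask bits tells us how
 * many elements the store should write. */
size_t keep_nonzero(const uint32_t *src, uint32_t *dst, size_t n)
{
    size_t kept = 0;
    for (size_t i = 0; i < n;) {
        size_t vl = __riscv_vsetvl_e32m1(n - i);
        vuint32m1_t v = __riscv_vle32_v_u32m1(src + i, vl);
        vbool32_t m = __riscv_vmsne_vx_u32m1_b32(v, 0, vl);    /* nonzero lanes */
        vuint32m1_t packed = __riscv_vcompress_vm_u32m1(v, m, vl);
        size_t cnt = __riscv_vcpop_m_b32(m, vl);               /* lanes kept */
        __riscv_vse32_v_u32m1(dst + kept, packed, cnt);
        kept += cnt;
        i += vl;
    }
    return kept;
}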
 