Question Are scalable vectors the ultimate solution to fixed width SIMD?


Jan Olšan

Senior member
Jan 12, 2017
314
407
136
Scalable vectors supposedly have upsides (write once, run forever), but those really aren't proven yet. I've already seen opinions that if you write and tune a certain algorithm now, it'll likely be optimal only for 128-bit and you may lose performance on future CPUs with wider units.
The big problem, of course, is that there are no CPUs with wider units at the moment. It's an unproven concept, at least in part.*

There are some known weaknesses, as there are algorithms you want to use SIMD for where not having a defined width for your operations makes them hard or complicated to handle.
Shuffle instructions, which are critical for more complicated algorithms beyond "just needs a lot of FMAs, CPU" tasks, are somewhat at odds with scalable-width SIMD instructions. This can probably be surmounted, at least sometimes, but it adds complexity. And again, it may lead to suboptimal performance.

I suspect that is something you can list as a disadvantage: suboptimal performance compared to the potential you could have with a fixed-width SIMD instruction set.

* Technically, you could write 128-bit SVE code and then go and test it on the very rare Fujitsu CPU or on an older processor with Neoverse V1 cores. That will only allow you to test the older, floating-point-focused part of the instruction set, not full SVE2.
Also, I think you really want to try a wider spread of widths, not just 128/256, to really see how it behaves (the Fujitsu with its 512-bit width would be better for this, but if SVE allows up to 2048 bits, that's still far from a perfect test of the more hardcore configs).
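
For concreteness, the "write once, run forever" style looks something like the minimal sketch below, using the Arm C Language Extensions (ACLE) for SVE. The saxpy kernel and function name are just an illustration, but the intrinsics are the standard ACLE ones, and nothing in the loop hard-codes the vector width:

#include <arm_sve.h>
#include <stdint.h>

/* Vector-length-agnostic saxpy: y[i] += a * x[i].
 * svcntw() returns how many 32-bit elements one hardware vector holds,
 * and the svwhilelt predicate handles the loop tail, so nothing here
 * depends on the actual vector width. */
void saxpy(float a, const float *x, float *y, int64_t n)
{
    for (int64_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32_s64(i, n);   /* active lanes for this pass */
        svfloat32_t vx = svld1_f32(pg, x + i);
        svfloat32_t vy = svld1_f32(pg, y + i);
        vy = svmla_n_f32_x(pg, vy, vx, a);       /* vy += vx * a */
        svst1_f32(pg, y + i, vy);
    }
}

The same binary then runs on a 128-bit Neoverse part and on the 512-bit Fujitsu A64FX; whether it runs equally well on both is exactly the open question above.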
 

camel-cdr

Junior Member
Feb 23, 2024
15
54
51
Scalable vectors supposedly have upsides (write once, run forever), but those really aren't proven yet. I've already seen opinions that if you write and tune a certain algorithm now, it'll likely be optimal only for 128-bit and you may lose performance on future CPUs with wider units.
The big problem, of course, is that there are no CPUs with wider units at the moment. It's an unproven concept, at least in part.*

There are some known weaknesses, as there are algorithms you want to use SIMD for where not having a defined width for your operations makes them hard or complicated to handle.
Shuffle instructions, which are critical for more complicated algorithms beyond "just needs a lot of FMAs, CPU" tasks, are somewhat at odds with scalable-width SIMD instructions. This can probably be surmounted, at least sometimes, but it adds complexity. And again, it may lead to suboptimal performance.

I suspect that is something you can list as a disadvantage: suboptimal performance compared to the potential you could have with a fixed-width SIMD instruction set.

* Technically, you could write 128-bit SVE code and then go and test it on the very rare Fujitsu CPU or on an older processor with Neoverse V1 cores. That will only allow you to test the older, floating-point-focused part of the instruction set, not full SVE2.
Also, I think you really want to try a wider spread of widths, not just 128/256, to really see how it behaves (the Fujitsu with its 512-bit width would be better for this, but if SVE allows up to 2048 bits, that's still far from a perfect test of the more hardcore configs).
I don't think there is a problem with the shuffle instructions in a scalable context for application-class processors.

However, it's unrealistic for them to scale much beyond a vector length of 512/1024, since the complexity of a fast, one-cycle-throughput shuffle implementation grows quadratically with vector length. It grows only linearly with vector length if you shuffle a fixed N elements at a time (in one uop). But fixed-width ISAs would have the same problem if they chose a larger SIMD width.

There is a problem, not due to the scalability, but rather due to the free-for-all nature of RISC-V (this doesn't apply as much to SVE, since Arm has the final word), which allows you to target your processor at vastly different applications that prioritize different instructions. For an AI accelerator, you wouldn't spend the silicon to make a permute fast, since you only use it in setup code and never in hot loops. Meanwhile, a fast permute is, as you said, very useful for general-purpose computation, and will be a must for application-class cores.

On the RISC-V side, there are currently two boards with RVV 1.0 support available, one with a 128-bit (XuanTie C908) and one with a 256-bit (SpacemiT X60) vector length. The devboards that have been concretely announced to release this year, or at the beginning of next year, also feature a variety of vector lengths: Milk-V Oasis (16 SiFive P670 cores with VLEN=128 and 4 SiFive X280 cores with VLEN=512 as an AI accelerator), and Andes Qilai (4 AX45MP cores without vector and one NX27V accelerator core with VLEN=512, reportedly quadruple issue with a 512-bit datapath 🤤).

If you look at open-source implementations, there is also Ara, with a vector length of up to 16K, and XiangShan, with 128-bit vectors but a high-performance out-of-order design. You can play with them using RTL simulation, but Ara still has some bugs, and the XiangShan vector implementation currently only works properly on their MinimalConfig, which is still multi-issue out-of-order, but scaled down quite a bit.

As I've written in my earlier comment, I think a lot of problems (if not most, and not just the "bunch of FMAs" kind) can benefit from and scale properly with larger vector lengths, but some will still require specialization, or can only take advantage of smaller vector lengths, due to the nature of the problem.
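
To make the shuffle discussion concrete, here is a minimal vector-length-agnostic sketch using the standard RVV v1.0 C intrinsics (the function name and the reversal kernel are just an illustration). vrgather.vv is the general permute whose hardware cost the scaling argument above is about, and nothing here assumes a particular VLEN:

#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse a buffer of 32-bit elements with vrgather.vv, without ever
 * naming the vector length: vsetvl picks it at run time. */
void reverse_u32(const uint32_t *src, uint32_t *dst, size_t n)
{
    for (size_t i = 0; i < n;) {
        size_t vl = __riscv_vsetvl_e32m1(n - i);
        vuint32m1_t v = __riscv_vle32_v_u32m1(src + i, vl);
        /* build the index vector vl-1, vl-2, ..., 0 */
        vuint32m1_t idx = __riscv_vid_v_u32m1(vl);
        idx = __riscv_vrsub_vx_u32m1(idx, (uint32_t)(vl - 1), vl);
        vuint32m1_t rev = __riscv_vrgather_vv_u32m1(v, idx, vl);
        /* the reversed chunk lands at the mirrored position in dst */
        __riscv_vse32_v_u32m1(dst + n - i - vl, rev, vl);
        i += vl;
    }
}

On a VLEN=128 core this handles four elements per iteration, on VLEN=512 sixteen; the source doesn't change, only the cost of the vrgather does.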
 

Schmide

Diamond Member
Mar 7, 2002
5,590
724
126
One caveat that is often glossed over in the scalable vector space: operations still happen within a lane (128 bits).

So if you do an interleave (unpack on x86, vzip on Arm), all operations happen within a lane.

Using AVX doubles to simplify ("|" represents a lane boundary):

vector 1 = ab | cd
vector 2 = ef | gh

unpacklow 1 2 = ae | cg
unpackhigh 1 2 = bf | dh

As you can see, the operations stay in their lanes. This requires one or more special cross-lane operations to readjust the data.

To linearize the data, a permute is applied.

After the unpack, the vectors are now

vector 1 = ae | cg
vector 2 = bf | dh

permute 1 2 (0x20) = ae | bf
permute 1 2 (0x31) = cg | dh

Now the interleave is linear with respect to memory.

The more lanes you have in a vector, the more extra steps are needed to readjust the operation.

So:

SSE has 1 lane, so no extra operations are needed.

AVX has 2 lanes, so 1 extra operation is needed (as above).

AVX-512 has 4 lanes, so 2 extra operations are needed.

A 1024-bit vector would require 4 extra operations.

Nothing is really free in vector land.
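
For reference, the two-step interleave above maps directly onto intrinsics. A minimal C sketch (the literal values 1..8 stand in for a..h; the intrinsics are the standard AVX ones from immintrin.h):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    /* vector 1 = ab | cd, vector 2 = ef | gh */
    __m256d v1 = _mm256_setr_pd(1.0, 2.0, 3.0, 4.0);
    __m256d v2 = _mm256_setr_pd(5.0, 6.0, 7.0, 8.0);

    /* in-lane unpacks: ae | cg and bf | dh */
    __m256d lo = _mm256_unpacklo_pd(v1, v2);
    __m256d hi = _mm256_unpackhi_pd(v1, v2);

    /* the extra cross-lane step: ae | bf and cg | dh */
    __m256d r0 = _mm256_permute2f128_pd(lo, hi, 0x20);
    __m256d r1 = _mm256_permute2f128_pd(lo, hi, 0x31);

    double out[8];
    _mm256_storeu_pd(out, r0);
    _mm256_storeu_pd(out + 4, r1);
    for (int i = 0; i < 8; i++)
        printf("%g ", out[i]);   /* 1 5 2 6 3 7 4 8: linear in memory */
    return 0;
}

Compile with -mavx; the unpacks are cheap in-lane operations, while _mm256_permute2f128_pd is the extra cross-lane step being counted above.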
 

soresu

Platinum Member
Dec 19, 2014
2,968
2,192
136
FEX is for running x86 on Arm, no?
We want the reverse here
Oh, misread that.

Hmmm, dynarmic can do that too, but I don't know if it works for more general use rather than just for console emulators.

Unfortunately it was essentially tied to the Yuzu code, so the OG repo for dynarmic got yeeted along with everything else; lots of repo clones exist tho.

Stupid Ninty forgot that OSS is a hydra - cut one head off and many more grow to replace it 🤘
 
Jul 27, 2020
17,965
11,709
116
Stupid Ninty forgot that OSS is a hydra - cut one head off and many more grow to replace it 🤘
They are small enough (relative to the other console giants) that someone should buy them out and open source their EVERYTHING!

By the way, now a new thought entered my mind: what about a future Nintendo console/handheld with RISC-V CPU using 512-bit scalable vectors?
 

Nothingness

Platinum Member
Jul 3, 2013
2,757
1,405
136
Hmmm, dynarmic can do that too, but I don't know if it works for more general use rather than just for console emulators.

Unfortunately it was essentially tied to the Yuzu code, so the OG repo for dynarmic got yeeted along with everything else; lots of repo clones exist tho.

Stupid Ninty forgot that OSS is a hydra - cut one head off and many more grow to replace it 🤘
QEMU is much more reliable as far as Arm architecture compliance goes, and its speed, though not the best, is good enough for running almost everything (last time I checked, it could run about 1 billion Arm instructions per second in syscall emulation mode). In my biased opinion, it's the best tool to experiment with the Arm architecture on x86.
 

SarahKerrigan

Senior member
Oct 12, 2014
604
1,469
136
They are small enough (relatively speaking to the other console giants) that someone should buy them out and open source their EVERYTHING!

Do you have seventy billion dollars sitting around?

By the way, now a new thought entered my mind: what about a future Nintendo console/handheld with RISC-V CPU using 512-bit scalable vectors?

Er, from what SoC vendor? Also why?
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,689
1,224
136
By the way, now a new thought entered my mind: what about a future Nintendo console/handheld with RISC-V CPU using 512-bit scalable vectors?
Maybe Tenstorrent (hopefully).

Why? AI-powered games!
Minus the 512-bit vector stuff. DAMO (Discovery, Adventure, Momentum and Outlook) Academy and MiHoYo are somewhat connected.

Dharma SoCs w/ C930+DXT(?) + OpenHarmony (RISC-V Large System) + MiHoYo should mean these games will be running on RISC-V/RVV eventually:
Honkai Impact 3rd
Genshin Impact
Honkai Star Rail
Zenless Zone Zero
Cross-IP PUBG/Fortnite-esque game
Honkai MMORPG

// This is relatively minor compared to the brain-computer interface and nuclear fusion investments.
// Mihoyo+Alibaba deep collaboration = "The friendship between miHoYo and Alibaba Cloud has lasted for eight years." - 2021 article
 

DrMrLordX

Lifer
Apr 27, 2000
21,806
11,161
136
No but a bunch of billionaires do. We gotta entice them!

There are better uses for $70 billion. Plus let Raja rest. He probably wants to be semi-retired. If not then I question his sanity. Now is a great time for him to just kick back, relax, and let a younger generation develop megalomaniacal tendencies er I mean uh innovate in the semiconductor industry.
 

naukkis

Senior member
Jun 5, 2002
782
636
136
One caveat that is often glossed over in the scalable vector space: operations still happen within a lane (128 bits).

So if you do an interleave (unpack on x86, vzip on Arm), all operations happen within a lane.

Using AVX doubles to simplify ("|" represents a lane boundary):

vector 1 = ab | cd
vector 2 = ef | gh

unpacklow 1 2 = ae | cg
unpackhigh 1 2 = bf | dh

As you can see, the operations stay in their lanes. This requires one or more special cross-lane operations to readjust the data.

To linearize the data, a permute is applied.

After the unpack, the vectors are now

vector 1 = ae | cg
vector 2 = bf | dh

permute 1 2 (0x20) = ae | bf
permute 1 2 (0x31) = cg | df

Now the interleave is linear with respect to memory.

The more lanes you have in a vector, the more extra steps are needed to readjust the operation.

So:

SSE has 1 lane, so no extra operations are needed.

AVX has 2 lanes, so 1 extra operation is needed (as above).

AVX-512 has 4 lanes, so 2 extra operations are needed.

A 1024-bit vector would require 4 extra operations.

Nothing is really free in vector land.

RVV is a vector ISA, not SIMD. You are talking about a SIMD architecture; a vector ISA is totally different. Vector-ISA vectors aren't fixed-size but variable, so your example cannot be turned 1:1 into a vector example. BTW, the second permute seems to have a typo; you probably meant dh instead of df. But if a vector ISA needs to make changes to vector registers, it either compresses masked data into a different vector or gathers elements with an index vector. RVV actually has a pretty RISC view of doing those kinds of permutations: traditional vector CPUs have compress/expand and gather/scatter to vector registers, but RVV only has compress and gather. So there's no easy way to program by shuffling vectors back and forth; the only option is to generate code that targets the high-performance path.
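
As an illustration of the compress-only model, here is a minimal stream-compaction sketch, assuming the standard RVV v1.0 C intrinsics (the function name is hypothetical). vcompress.vm packs the lanes selected by the mask to the front of the register, and since there is no register-to-register expand, the inverse direction has no equally direct instruction:

#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

/* Stream compaction: copy the nonzero elements of src to dst, packed,
 * and return how many were kept. vcompress.vm moves the masked lanes
 * to the front of the register; counting the mask bits tells us how
 * many elements the store should write. */
size_t keep_nonzero(const uint32_t *src, uint32_t *dst, size_t n)
{
    size_t kept = 0;
    for (size_t i = 0; i < n;) {
        size_t vl = __riscv_vsetvl_e32m1(n - i);
        vuint32m1_t v = __riscv_vle32_v_u32m1(src + i, vl);
        vbool32_t m = __riscv_vmsne_vx_u32m1_b32(v, 0, vl);    /* nonzero lanes */
        vuint32m1_t packed = __riscv_vcompress_vm_u32m1(v, m, vl);
        size_t cnt = __riscv_vcpop_m_b32(m, vl);               /* lanes kept */
        __riscv_vse32_v_u32m1(dst + kept, packed, cnt);
        kept += cnt;
        i += vl;
    }
    return kept;
}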
 