Question: Are scalable vectors the ultimate solution to fixed-width SIMD?


Schmide

Diamond Member
Mar 7, 2002
naukkis said:
RVV is a vector ISA, not SIMD. You are talking about a SIMD architecture; a vector ISA is totally different. Vector ISA vectors aren't fixed-size but variable, so your example can't be turned 1:1 into a vector example. BTW, the second permute seems to have a typo; you probably meant dh instead of df. If a vector ISA needs to make changes to vector registers, it either compresses masked data into a different vector or gathers elements with an index vector. RVV actually has a pretty RISC view of doing those kinds of permutations: traditional vector CPUs have compress/expand and gather/scatter to vector registers, but RVV only has compress and gather. So there's no easy-programming route of shuffling vectors back and forth; the only way is to write code aimed at the high-performance path.

Nice catch on the df.

The way I see it, a vector ISA is just variable-sized SIMD. The point I'm trying to make still stands: if you exceed the size of the lane (or operational element), the system has to do more passes to achieve the same result, or reorder the data before or after the operation. Using the interleave example on ARM, vzip takes 2 vectors and spits out 1 vector of double the size, which simplifies the operation but somewhat handicaps the functionality. The work is generally the same.
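In C intrinsics terms that looks roughly like this (a minimal sketch; the function name is mine, and the "double size" output just comes back split across two registers):

#include <arm_neon.h>

/* Interleave two 16-byte vectors a and b into {a0,b0,a1,b1,...}.
   vzipq_u8 conceptually produces one double-width result, returned
   split across two registers: val[0] is the low half, val[1] the
   high half. */
uint8x16x2_t interleave_bytes(uint8x16_t a, uint8x16_t b)
{
    return vzipq_u8(a, b);
}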

Gather and scatter seem very useful, but if you think about what they're actually doing, they're reading or writing memory multiple times to achieve the same result.
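In SIMD land that's e.g. an AVX2 memory gather; a sketch in C intrinsics (function name mine):

#include <immintrin.h>

/* AVX2 memory gather: result[i] = base[idx[i]]. One instruction,
   but underneath it's still up to 8 separate memory reads. */
__m256i gather_from_memory(const int *base, __m256i idx)
{
    return _mm256_i32gather_epi32(base, idx, 4); /* scale = 4 bytes per int */
}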

For RVV it sure seems programming will be easier (I hate lanes); whether that translates to better performance is unknown.
 

naukkis

Senior member
Jun 5, 2002
Schmide said:
Nice catch on the df.

The way I see it, a vector ISA is just variable-sized SIMD. The point I'm trying to make still stands: if you exceed the size of the lane (or operational element), the system has to do more passes to achieve the same result, or reorder the data before or after the operation. Using the interleave example on ARM, vzip takes 2 vectors and spits out 1 vector of double the size, which simplifies the operation but somewhat handicaps the functionality. The work is generally the same.

Not everything is worth vectorizing, but a vector ISA does about everything it can to get the most out of vectorization. Vector CPUs do data reordering quite efficiently and in a simple way, and when going from a big vector down to smaller ones, the SIMD ability is still there if there's enough data parallelism.


Schmide said:
Gather and scatter seem very useful, but if you think about what they're actually doing, they're reading or writing memory multiple times to achieve the same result.

See, vector CPU gather and scatter are from registers, not from memory like on a SIMD CPU. Doing a gather from registers is efficient, unlike doing it from memory. GPUs are a prime example of that; they are a kind of vector processor too. A vector CPU memory operation can be contiguous or indexed (the indexed form is what SIMD calls gather/scatter, but as the vector data elements are independent in memory it's an indexed load, rather than scissoring data elements out of loaded vectors).
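The two memory-access flavours, sketched with the current __riscv_* RVV intrinsics (function names mine; the byte-offset index convention is worth double-checking against your toolchain):

#include <riscv_vector.h>
#include <stdint.h>
#include <stddef.h>

/* Contiguous (unit-stride) load: one streaming memory access. */
vint32m1_t load_contiguous(const int32_t *src, size_t vl)
{
    return __riscv_vle32_v_i32m1(src, vl);
}

/* Indexed load: element i comes from src + byte_offsets[i].
   This is the costly, SIMD-gather-like memory operation. */
vint32m1_t load_indexed(const int32_t *src, vuint32m1_t byte_offsets, size_t vl)
{
    return __riscv_vluxei32_v_i32m1(src, byte_offsets, vl);
}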
 

naukkis

Senior member
Jun 5, 2002
SarahKerrigan said:
What a bizarre thing to say.

Vector computers have been doing chained scatter/gather load/store for much longer than commodity SIMD.

I haven't named them. A vector CPU loads vectors either from a contiguous memory location or from non-contiguous indexed memory. After that costly memory read is done and the vector (or part of it) is loaded in a register, it can also be used partially, by selecting parts of it into a new register. RVV has named that "vector register gather". People who only know SIMD designs confuse it with a memory gather, which it isn't, as it gathers from another register, not from memory.
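A sketch of those register-to-register operations in RVV intrinsics (function names mine; note the vcompress operand order has shifted between intrinsics spec versions, this follows v0.12):

#include <riscv_vector.h>
#include <stddef.h>

/* Register gather: result[i] = src[idx[i]], picking elements out of
   an already-loaded vector register. No memory traffic at all,
   unlike a SIMD memory gather. */
vint32m1_t register_gather(vint32m1_t src, vuint32m1_t idx, size_t vl)
{
    return __riscv_vrgather_vv_i32m1(src, idx, vl);
}

/* Register compress: pack the mask-selected elements of src into the
   low elements of the result (the other register permutation RVV
   provides). */
vint32m1_t register_compress(vint32m1_t src, vbool32_t mask, size_t vl)
{
    return __riscv_vcompress_vm_i32m1(src, mask, vl);
}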
 

SarahKerrigan

Senior member
Oct 12, 2014
naukkis said:
I haven't named them. A vector CPU loads vectors either from a contiguous memory location or from non-contiguous indexed memory. After that costly memory read is done and the vector (or part of it) is loaded in a register, it can also be used partially, by selecting parts of it into a new register. RVV has named that "vector register gather". People who only know SIMD designs confuse it with a memory gather, which it isn't, as it gathers from another register, not from memory.

Yeah, no. Vector computers have had scatter/gather load/store forever. That's not just a SIMD-machine thing.
 

camel-cdr

Junior Member
Feb 23, 2024
Schmide said:
One caveat that is often glossed over in the scalable-vector space: operations still happen within a lane (128 bits).

So if you do an interleave (unpack on x86, vzip on ARM), all operations happen within a lane.

Using AVX doubles to simplify ("|" represents a lane boundary):

vector 1 = ab | cd
vector 2 = ef | gh

unpacklow 1 2 = ae | cg
unpackhigh 1 2 = bf | dh

As you can see, the operations stay in their lane. This requires one or more special cross-lane operations to readjust the data.

To linearize the data, a permute is applied.

After the unpack the vectors are now

vector 1 = ae | cg
vector 2 = bf | dh

permute 1 2 (0x20) = ae | bf
permute 1 2 (0x31) = cg | dh

Now the interleave is linear with respect to memory.

The more lanes you have in a vector, the more extra steps are needed to readjust the operation.

So

SSE has 1 lane, so no extra operations are needed.

AVX has 2 lanes, so 1 extra operation is needed (as above).

AVX-512 has 4 lanes, so 2 extra operations are needed.

A 1024-bit vector would require 4 extra operations.

Nothing is really free in vector land.
The lane size usually is the vector length, though. Or to put it differently, the width of the vector execution units, specifically for vector permutations, is usually the same as the vector length.

Look at AVX-512: there you've got a low-latency, high-throughput vpermb, heck even vpermi2b, which gives you an all-to-all permutation within the vector register. You can use that to implement your example in one operation.
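To make that concrete, here is the quoted example in C intrinsics: the AVX2 two-step dance, and the single two-source permute on AVX-512 (a sketch; the function names and the index vector are mine, using the double-precision relative of vpermi2b):

#include <immintrin.h>

/* AVX2: v1 = a b | c d, v2 = e f | g h (per 128-bit lane). */
void interleave_avx2(__m256d v1, __m256d v2, __m256d *lo, __m256d *hi)
{
    __m256d u0 = _mm256_unpacklo_pd(v1, v2);    /* a e | c g */
    __m256d u1 = _mm256_unpackhi_pd(v1, v2);    /* b f | d h */
    *lo = _mm256_permute2f128_pd(u0, u1, 0x20); /* a e | b f */
    *hi = _mm256_permute2f128_pd(u0, u1, 0x31); /* c g | d h */
}

/* AVX-512: one all-to-all two-source permute per output vector.
   Indices 0-7 select from a, indices 8-15 from b, so this yields
   {a0,b0,a1,b1,a2,b2,a3,b3} directly. */
__m512d interleave_lo_avx512(__m512d a, __m512d b)
{
    const __m512i idx = _mm512_setr_epi64(0, 8, 1, 9, 2, 10, 3, 11);
    return _mm512_permutex2var_pd(a, idx, b);
}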

With RVV you are actually more exposed to the lane width, because you can group vector registers using LMUL. vrgather.vv, the vperm* equivalent, does a register/lane-crossing all-to-all permutation when LMUL>1. Since you rarely need this, especially not in hot loops, hardware implements it using a fast vrgather.vv primitive of the lane-width size. That means an LMUL>1 vrgather.vv takes LMUL^2 times the cycles of an LMUL=1 vrgather.vv. vrgather.vv is actually the only RVV instruction with this property; all others can scale linearly with LMUL, even vcompress.vm, although current implementations scale it with LMUL^2/2.
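For example (a sketch with the __riscv_* intrinsics; function name mine, cost figures per the model just described):

#include <riscv_vector.h>
#include <stddef.h>

/* At LMUL=4 a single vrgather.vv permutes all-to-all across a group
   of four vector registers. Hardware built around a lane-width
   gather primitive would take roughly LMUL^2 = 16 times the cycles
   of the LMUL=1 case. */
vint32m4_t gather_m4(vint32m4_t src, vuint32m4_t idx, size_t vl)
{
    return __riscv_vrgather_vv_i32m4(src, idx, vl);
}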

Some of the current lower-performance RVV implementations actually have a lane width smaller than the vector length (at LMUL=1/2) to get a cheaper implementation. Since they can do one LMUL=1/2 vrgather.vv per cycle, an LMUL=1 vrgather.vv takes 4 cycles on them.

The implementation complexity of a single-cycle vrgather.vv scales quadratically with vector length, which means we likely won't see many general-purpose application processors with a vector length much larger than 512 or 1024 bits. Some cores targeting AI/accelerator workloads implement vrgather.vv at 1 cycle per element instead (SiFive X280), or N cycles per element; that scales linearly with vector length, so it's more feasible for such an implementation. This plays back into the third point I made in my first comment, "that the ecosystem hasn't settled down yet, and we can't know for sure what performance characteristics we can rely on".
 