Question: Are scalable vectors the ultimate solution to fixed-width SIMD?


naukkis

Senior member
Jun 5, 2002
779
636
136
That's utter nonsense.

Will you do all memory allocations so that they're 4KB page aligned? And will you ensure all your data structures are smaller than 4KB? Can't wait to see how you achieve that in real life. If you have to do so because your ISA is limited, it's a dead end.

There are no such limitations involved. What that immediate size does is make data localisation effective. If code generates eight base pointers and all data accesses stay within the immediate range of those pointers, all the data the loop touches can be cached in a 32KB 8-way L1 cache. Data prefetching also works in the best possible way, as prefetchers usually won't cross page boundaries. Allow some kind of scaling in those access patterns and everything isn't so simple anymore. To make a great ISA, designers should do what they can to help produce code that is as efficient as possible. That sure seems to be one of the design points of RV.
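A minimal sketch of that layout argument, assuming RISC-V's 12-bit (4KB-reaching) load/store immediate; the function and sizes here are hypothetical:

```c
#include <stddef.h>

enum { PAGE_FLOATS = 4096 / sizeof(float) };   /* one 4KB page per pointer */

/* Eight base pointers, each confined to a single 4KB page: every access
 * stays within the immediate reach of its base register, the loop's whole
 * footprint is 8 x 4KB = 32KB (one page per way of an 8-way 32KB L1), and
 * no stream crosses a page boundary, so prefetchers never have to. */
void kernel(float *p0, const float *p1, const float *p2, const float *p3,
            const float *p4, const float *p5, const float *p6, const float *p7)
{
    for (size_t i = 0; i < PAGE_FLOATS; i++)
        p0[i] = (p1[i] + p2[i]) * (p3[i] + p4[i])
              + (p5[i] + p6[i]) * p7[i];
}
```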
 

SarahKerrigan

Senior member
Oct 12, 2014
602
1,467
136
"Only having a 4K immediate window is great because it forces compilers to be better!" is certainly a take I haven't heard before. That's especially painful for jumps.

Next, do "why spending the encoding bits for an arbitrary link register is good actually."
 
Reactions: Nothingness

SarahKerrigan

Senior member
Oct 12, 2014
602
1,467
136
Yeah, those simple predictable addressing modes can be handled pretty much with fixed-function logic. But the hardware needs a massive out-of-order window, approaching a thousand instructions, to be able to pick those load instructions far enough ahead of the rest of the code that data loads won't stall execution. In the RV model it's also possible just to move the address calculations ahead of the load instructions to achieve the same effect - and, if used wisely, possibly greatly outperform those fixed-function designs. There aren't hardware/software implementations of that kind out yet, so this is of course only speculation about what may come - there might be some really well-performing RV designs in a few years.

Except 99.9999% of pre/post-increment or indexed addressing isn't thousands of uops ahead. It's a loop iteration ahead. "You can hoist all of your address calculations for each loop iteration before the loop starts!" is an insane fantasy.

In other words, that buys you nothing except inflating your dynamic op count.
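Concretely, the addressing-mode difference being argued over looks like this (an illustrative sketch; register choices are arbitrary):

```c
/* AArch64 post-increment: one instruction, the address update comes free:
 *     ldr  x1, [x0], #8      // x1 = *x0; x0 += 8
 *
 * RISC-V has no such mode, so the same step takes two instructions:
 *     ld   a1, 0(a0)         // a1 = *a0
 *     addi a0, a0, 8         // a0 += 8
 *
 * Both implement this C idiom: */
long next(long **p) { return *(*p)++; }
```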
 

naukkis

Senior member
Jun 5, 2002
779
636
136
Except 99.9999% of pre/post-increment or indexed addressing isn't thousands of uops ahead. It's a loop iteration ahead. "You can hoist all of your address calculations for each loop iteration before the loop starts!" is an insane fantasy.

In other words, that buys you nothing except inflating your dynamic op count.
Let's see what the Intel optimization guide has to say:

"The micro-op queue decouples the front end and the out-of-order engine. It stays between the micro-op generation and the renamer as shown in Figure E-4. This queue helps to hide bubbles which are introduced between the various sources of micro-ops in the front end and ensures that four micro-ops are delivered for execution, each cycle.

The micro-op queue provides post-decode functionality for certain instruction types. In particular, loads combined with computational operations and all stores, when used with indexed addressing, are represented as a single micro-op in the decoder or Decoded ICache. In the micro-op queue they are fragmented into two micro-ops through a process called un-lamination: one does the load and the other does the operation. A typical example is the following "load plus operation" instruction:

ADD RAX, [RBP+RSI] ; rax := rax + LD( RBP+RSI )

Similarly, the following store instruction has three register sources and is broken into "generate store address" and "generate store data" sub-components:

MOV [ESP+ECX*4+12345678], AL

The additional micro-ops generated by unlamination use the rename and retirement bandwidth. However, it has an overall power benefit. For code that is dominated by indexed addressing (as often happens with array processing), recoding algorithms to use base (or base+displacement) addressing can sometimes improve performance by keeping the load plus operation and store instructions fused."

OK, your hardware prefers plain base or base+offset addressing over indexed addressing. So how do you design an ISA that performs well? Dropping the addressing modes that don't suit the executing hardware might be a good starting point.
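A sketch of the recoding the guide describes (hypothetical functions; which form a given compiler actually emits will vary):

```c
#include <stddef.h>

/* Indexed addressing: the load-op typically becomes something like
 * "add rax, [rdi+rsi*8]" - one fused micro-op in the decoder, later
 * un-laminated into two in the micro-op queue. */
long sum_indexed(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Recoded to base-only addressing, as the guide suggests: the load-op
 * ("add rax, [rdi]") can stay fused, at the cost of a separate pointer
 * increment per iteration. */
long sum_base(const long *a, size_t n)
{
    long sum = 0;
    for (const long *end = a + n; a != end; a++)
        sum += *a;
    return sum;
}
```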
 

Nothingness

Platinum Member
Jul 3, 2013
2,751
1,397
136
Let's see what the Intel optimization guide has to say:

"The micro-op queue decouples the front end and the out-of-order engine. It stays between the micro-op generation and the renamer as shown in Figure E-4. This queue helps to hide bubbles which are introduced between the various sources of micro-ops in the front end and ensures that four micro-ops are delivered for execution, each cycle.

The micro-op queue provides post-decode functionality for certain instruction types. In particular, loads combined with computational operations and all stores, when used with indexed addressing, are represented as a single micro-op in the decoder or Decoded ICache. In the micro-op queue they are fragmented into two micro-ops through a process called un-lamination: one does the load and the other does the operation. A typical example is the following "load plus operation" instruction:

ADD RAX, [RBP+RSI] ; rax := rax + LD( RBP+RSI )

Similarly, the following store instruction has three register sources and is broken into "generate store address" and "generate store data" sub-components:

MOV [ESP+ECX*4+12345678], AL

The additional micro-ops generated by unlamination use the rename and retirement bandwidth. However, it has an overall power benefit. For code that is dominated by indexed addressing (as often happens with array processing), recoding algorithms to use base (or base+displacement) addressing can sometimes improve performance by keeping the load plus operation and store instructions fused."

OK, your hardware prefers plain base or base+offset addressing over indexed addressing. So how do you design an ISA that performs well? Dropping the addressing modes that don't suit the executing hardware might be a good starting point.
You once more avoid answering your contradictor's point by talking about something else. Or perhaps you've been so brainwashed by R-V propaganda that you don't understand the difference between register offsets and post-increments.

Or, and this doesn't contradict the previous hypothesis, you can never admit you were proven wrong. Which makes any sensible discussion with you impossible. I considered ignoring you, but I think it might be worthwhile for other readers if I debunk your silly statements.
 
Reactions: SarahKerrigan

naukkis

Senior member
Jun 5, 2002
779
636
136
You once more avoid answering your contradictor's point by talking about something else. Or perhaps you've been so brainwashed by R-V propaganda that you don't understand the difference between register offsets and post-increments.

Or, and this doesn't contradict the previous hypothesis, you can never admit you were proven wrong. Which makes any sensible discussion with you impossible. I considered ignoring you, but I think it might be worthwhile for other readers if I debunk your silly statements.

The more complex the addressing mode, the worse it usually performs. If the compiler does an optimization like loop unrolling, it pretty much has to revert to base+offset addressing instead of indexed addressing modes. And post/pre-increment addressing on loads is even more harmful - maybe they have invented some complex hardware to unroll those, but the software side sure can't unroll them.
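A sketch of why unrolling pushes compilers toward base+offset (hypothetical function; the commented instructions are what a RISC-style compiler might emit):

```c
#include <stddef.h>

/* Unrolled 4x with base+offset addressing: the four loads share one base
 * register with constant offsets, so they are independent of each other,
 * and the pointer is advanced once per iteration. With a post-increment
 * mode, each load would also update the pointer, making every load's
 * address depend on the previous load's update. */
long sum4(const long *p, size_t n)   /* assumes n is a multiple of 4 */
{
    long s = 0;
    for (size_t i = 0; i < n; i += 4) {
        s += p[i];       /* ld x, 0(base)  */
        s += p[i + 1];   /* ld x, 8(base)  */
        s += p[i + 2];   /* ld x, 16(base) */
        s += p[i + 3];   /* ld x, 24(base) */
    }
    return s;
}
```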
 

SarahKerrigan

Senior member
Oct 12, 2014
602
1,467
136
Let's see what the Intel optimization guide has to say:

"The micro-op queue decouples the front end and the out-of-order engine. It stays between the micro-op generation and the renamer as shown in Figure E-4. This queue helps to hide bubbles which are introduced between the various sources of micro-ops in the front end and ensures that four micro-ops are delivered for execution, each cycle.

The micro-op queue provides post-decode functionality for certain instruction types. In particular, loads combined with computational operations and all stores, when used with indexed addressing, are represented as a single micro-op in the decoder or Decoded ICache. In the micro-op queue they are fragmented into two micro-ops through a process called un-lamination: one does the load and the other does the operation. A typical example is the following "load plus operation" instruction:

ADD RAX, [RBP+RSI] ; rax := rax + LD( RBP+RSI )

Similarly, the following store instruction has three register sources and is broken into "generate store address" and "generate store data" sub-components:

MOV [ESP+ECX*4+12345678], AL

The additional micro-ops generated by unlamination use the rename and retirement bandwidth. However, it has an overall power benefit. For code that is dominated by indexed addressing (as often happens with array processing), recoding algorithms to use base (or base+displacement) addressing can sometimes improve performance by keeping the load plus operation and store instructions fused."

OK, your hardware prefers plain base or base+offset addressing over indexed addressing. So how do you design an ISA that performs well? Dropping the addressing modes that don't suit the executing hardware might be a good starting point.

"LOOK! SQUIRREL!"
 

DavidC1

Senior member
Dec 29, 2023
387
576
96
The problem with scalable vectors is that wider vectors allow you to do more SIMD work with a single instruction, but it requires recompiling to take advantage of it.

How much software is even using 256-bit AVX? AVX-512 usage is almost at zero. Based on Skymont's gains, having more FP units is the way to go for improving the experience for everyone. It costs more transistors and area, but that seems to be one of the "unspoken" laws of the universe alongside the known laws of physics: space/power-efficient with no gains now and requiring developer effort, or less efficient but benefiting everyone right away.

I guess what should have been done is for Intel, for example, to have gone 512-bit SSE with the Pentium 4, but executed over 8 cycles; as CPUs grew over the decades to 128-, 256-, and 512-bit units, existing programs would have benefited. Oh, and that 512-bit SSE would have included an option to use different vector widths, so some code would use 64 bits, some 128, some 256, and others 512.

The problem is that vector width has grown not just as a consequence of the advances of Moore's Law, but for marketing as well. "Look, we have 512-bit FP now!"
 

SarahKerrigan

Senior member
Oct 12, 2014
602
1,467
136
The problem with scalable vectors is that wider vectors allow you to do more SIMD work with a single instruction, but it requires recompiling to take advantage of it.

How much software is even using 256-bit AVX? AVX-512 usage is almost at zero. Based on Skymont's gains, having more FP units is the way to go for improving the experience for everyone. It costs more transistors and area, but that seems to be one of the "unspoken" laws of the universe alongside the known laws of physics: space/power-efficient with no gains now and requiring developer effort, or less efficient but benefiting everyone right away.

I guess what should have been done is for Intel, for example, to have gone 512-bit SSE with the Pentium 4, but executed over 8 cycles; as CPUs grew over the decades to 128-, 256-, and 512-bit units, existing programs would have benefited. Oh, and that 512-bit SSE would have included an option to use different vector widths, so some code would use 64 bits, some 128, some 256, and others 512.

The problem is that vector width has grown not just as a consequence of the advances of Moore's Law, but for marketing as well. "Look, we have 512-bit FP now!"

Scalable vectors in theory do not require recompilation to take advantage of a higher vector width. It's pretty much the whole point.
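The vector-length-agnostic idiom in question looks roughly like this (a sketch using the RVV 1.0 C intrinsics; the same binary strip-mines itself to whatever vector length the hardware provides):

```c
#include <riscv_vector.h>
#include <stddef.h>

/* y[i] += a * x[i], with no vector width baked into the code: vsetvl
 * reports how many 32-bit elements this core handles per iteration. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n;) {
        size_t vl = __riscv_vsetvl_e32m1(n - i);
        vfloat32m1_t vx = __riscv_vle32_v_f32m1(x + i, vl);
        vfloat32m1_t vy = __riscv_vle32_v_f32m1(y + i, vl);
        vy = __riscv_vfmacc_vf_f32m1(vy, a, vx, vl);
        __riscv_vse32_v_f32m1(y + i, vy, vl);
        i += vl;
    }
}
```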

This works well for sufficiently friendly code streams - i.e., Fortran using the "elemental" keyword. How well it works for less well-behaved software - much of which is C codebases assuming that an iteration of whatever backend processes n pipes at a time (hi, Eigen!) - is left as an exercise for the reader.

I did software optimization work for SXes while RVV was still in draft, and I think it inoculated me a bit against the more extravagant claims made by the scalable-vector gang.

Edited to add: Also, C intrinsics for scalable vectors are kind of awkward. The SVE spec for C intrinsics is a bit nasty, and the RVV one, as with many RV things, seems to be stuck in perpetual "Draft" limbo.
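For a taste of that awkwardness, even a trivial loop in SVE intrinsics drags a predicate through every operation (a sketch using ACLE names):

```c
#include <arm_sve.h>

/* x[i] += 1.0f: every load, op, and store takes a predicate, and the
 * tail is handled by the svwhilelt predicate rather than a scalar loop. */
void add_one(float *x, long n)
{
    for (long i = 0; i < n; i += (long)svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);
        svfloat32_t v = svld1_f32(pg, x + i);
        svst1_f32(pg, x + i, svadd_f32_x(pg, v, svdup_f32(1.0f)));
    }
}
```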
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,422
1,759
136
RV design is very well done, there's not much which could be done better.

This sentence really bothers me. Please explain to me why the link register gets a full register specifier? Spending 5 bits on that is a clear example of mental illness, nothing else can explain it. Not a single person has ever complained about having to use a specific link register.

RISC-V is full of stupid things like that. Choices made strictly for some kind of ideological purity over actual practical usefulness.
 
Jul 27, 2020
17,913
11,685
116
Oh please, tell me you're kidding
I did put two winks there

I mean, just imagine the thousands and thousands of mentally ill hardware/software engineers, hammering away day and night, to make R-V work

And these are the countries where few students dare to question their teachers, so you can be sure there won't be any dearth of people who think R-V is the greatest thing ever coz they were told so by their most beloved teacher!
 

Nothingness

Platinum Member
Jul 3, 2013
2,751
1,397
136
This sentence really bothers me. Please explain to me why the link register gets a full register specifier? Spending 5 bits on that is a clear example of mental illness, nothing else can explain it. Not a single person has ever complained about having to use a specific link register.
That must make it a joy to implement a simple return-stack branch predictor if software messes around with that "great feature".

RISC-V is full of stupid things like that. Choices made strictly for some kind of ideological purity over actual practical usefulness.
That's the fundamentalist approach of some of the 80s RISCs, from a time when it really made sense to ease CPU implementation at all costs. Come the 21st century, and we have a student who designed an ISA for his own needs; then came the RISC guru Patterson, who found an opportunity to get back on the stage.

And now they pile extensions upon extensions to try and get to a more sensible state.
 
Reactions: Bigos

Nothingness

Platinum Member
Jul 3, 2013
2,751
1,397
136
I did put two winks there
Sorry buddy, I'm in a bad mood today and failed to catch it.

I mean, just imagine the thousands and thousands of mentally ill hardware/software engineers, hammering away day and night, to make R-V work
That's the only advantage of R-V: it's easy to make it work. But when you start targeting high performance, you hit the same walls as other ISAs. All you've gained is a stupidly limited ISA that needs many extensions to make sense (and even then companies need to add their own extensions).

And these are the countries where few students dare to question their teachers, so you can be sure there won't be any dearth of people who think R-V is the greatest thing ever coz they were told so by their most beloved teacher!
In all honesty, if I were still a PhD student I would have chosen R-V for my work because of its simplicity and openness. It's great for toy projects.
 
Reactions: igor_kavinski
Jul 27, 2020
17,913
11,685
116
In all honesty, if I were still a PhD student I would have chosen R-V for my work because of its simplicity and openness. It's great for toy projects.
So after completing the toy project for the PhD thesis, which serious ISA would Dr. Nothingness choose to further his career?
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,422
1,759
136
Even when not making toy projects, there are a lot of places where the ISA being suboptimal is much less important than the fact that it is open with a lot of open implementations.

Don't get me wrong, RISC-V existing is amazing and generally a great thing for the world. It just would be even better if the design wasn't such a lemon. At least some people are trying to evolve it towards something better.
 

Nothingness

Platinum Member
Jul 3, 2013
2,751
1,397
136
So after completing the toy project for the PhD thesis, which serious ISA would Dr. Nothingness choose to further his career?
It should be obvious from my post history that I've always liked the Arm ISA, ever since it was announced in '85. It's not perfect and might have grown too much, but it is still great fun to program at the assembly language level.
 

camel-cdr

Junior Member
Feb 23, 2024
14
53
51
As somebody who has done a decent bit of programming with RVV, let me tell you my biggest worries about the scalable nature:

  • Some code doesn't scale trivially (or at all) beyond 128 bits or 512 bits. This is mostly due to existing specifications and standards, but also due to the code structure of current libraries.
    E.g. a common operation in audio/video codecs and image formats like JPEG is a DCT, which often works on an 8x8 or 4x4 matrix of values and needs to transpose this matrix in between calculations. This works well if one row fits into one vector register, so for 4x4 you'd have 4 vector registers that each hold one row.
    It gets more complicated with scalable vectors: if you just store one row in each vector, you lose performance at larger vector lengths. You could use fewer vector registers to hold the matrix, but even that wouldn't scale beyond 512 bits, since the entire matrix fits into 512 bits (one row per 128 bits). A solution could be processing N 4x4 matrices at a time, or, going even further, storing each element of the 4x4 matrix in a separate vector register and then processing N = vector length / element size 4x4 matrices at a time. That would even give you the transpose for free, because you can just adjust which vector registers you use in the operations (see the first sketch after this list).
    Current libraries, however, are not built with this in mind; they are often built around the assumption that you process one matrix at a time, so it can be hard to impossible to reuse such implementations.

  • You currently can't put scalable vectors directly into structs/classes, because the size of structs/classes needs to be known at compile time. Some existing APIs like C++ iterators need this, because you need to be able to quickly move those types to and from the stack.
    This is mostly a toolchain problem, though, because you could add compiler support for types that always use 512 bits of storage but don't use the full thing on smaller vector lengths. That would let you take full advantage of implementations with the most common vector lengths from 128 to 512; the code would also run on processors with larger vector lengths, it just wouldn't take full advantage of them. Still, this seems like a 98% solution, if implemented (see the second sketch below).

  • For RISC-V specifically: because of the number of different vendors, we currently also have the problem that the ecosystem hasn't settled down yet, and we can't know for sure which performance characteristics we can rely on. This should sort itself out once we've seen more implementations and the vendors get to their second- and third-gen designs.
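Here is the element-per-register idea from the first bullet as a rough sketch with the RVV 1.0 C intrinsics (the function is hypothetical, and a real DCT would run its butterflies on the values in registers instead of storing straight back; note the 16 separate variables, since sizeless vector types can't form arrays):

```c
#include <riscv_vector.h>
#include <stddef.h>

/* Transpose n 4x4 float matrices in place, vl matrices per iteration.
 * Vector m_rc holds element (r,c) of vl consecutive matrices, gathered
 * with a stride of one matrix (16 floats). The transpose itself is free:
 * we simply store m_rc back at position (c,r). */
void transpose_4x4_batch(float *mats, size_t n)
{
    for (size_t i = 0; i < n;) {
        size_t vl = __riscv_vsetvl_e32m1(n - i);
        float *b = mats + i * 16;
        ptrdiff_t s = 16 * sizeof(float);

#define LOAD(r, c) __riscv_vlse32_v_f32m1(b + 4 * (r) + (c), s, vl)
        vfloat32m1_t m00 = LOAD(0, 0), m01 = LOAD(0, 1), m02 = LOAD(0, 2), m03 = LOAD(0, 3);
        vfloat32m1_t m10 = LOAD(1, 0), m11 = LOAD(1, 1), m12 = LOAD(1, 2), m13 = LOAD(1, 3);
        vfloat32m1_t m20 = LOAD(2, 0), m21 = LOAD(2, 1), m22 = LOAD(2, 2), m23 = LOAD(2, 3);
        vfloat32m1_t m30 = LOAD(3, 0), m31 = LOAD(3, 1), m32 = LOAD(3, 2), m33 = LOAD(3, 3);
#undef LOAD

        /* Store element (r,c) at position (c,r): the "free" transpose. */
#define STORE(r, c, v) __riscv_vsse32_v_f32m1(b + 4 * (r) + (c), s, v, vl)
        STORE(0, 0, m00); STORE(0, 1, m10); STORE(0, 2, m20); STORE(0, 3, m30);
        STORE(1, 0, m01); STORE(1, 1, m11); STORE(1, 2, m21); STORE(1, 3, m31);
        STORE(2, 0, m02); STORE(2, 1, m12); STORE(2, 2, m22); STORE(2, 3, m32);
        STORE(3, 0, m03); STORE(3, 1, m13); STORE(3, 2, m23); STORE(3, 3, m33);
#undef STORE
        i += vl;
    }
}
```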

For a lot of code RVV works very well, but there is some code where it's non-trivial to take advantage of larger vector lengths, mostly due to existing assumptions made when designing APIs and algorithms.
In the end we can always fall back to code specialized for a specific vector length, like on x86, when needed, but the alternative would obviously be nicer.
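The second sketch, for the struct problem: Arm's ACLE already ships a close cousin of the fixed-storage idea via the arm_sve_vector_bits attribute, though it requires the hardware vector length to match the compile-time flag, so it's not quite the "runs anywhere" version proposed above:

```c
#include <arm_sve.h>

/* Compile with -msve-vector-bits=512: the attribute turns the sizeless
 * SVE type into a fixed 512-bit type with a known sizeof, so it becomes
 * legal as a struct member (or array element, or iterator value). */
typedef svfloat32_t vec512 __attribute__((arm_sve_vector_bits(512)));

struct Accumulator {
    vec512 sum;   /* fine now: the struct has a compile-time size */
};
```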

I personally can't speak to the implementation cost. E.g. I've heard the sentiment that uop splitting for LMUL, especially LMUL=8, is harder to implement in out-of-order designs. How much of that is because it's unfamiliar to designers who have most of their experience with Arm cores, and thus harder to verify and implement, versus the inherent complexity that remains after all kinds of hardware implementation tricks have been figured out, is unclear to me.
 

Nothingness

Platinum Member
Jul 3, 2013
2,751
1,397
136
That's the really hard way! I was hoping you could point me to something more geared towards kids, with fun projects
I'm sorry, but I can't help you then

Perhaps playing with assembly language output from C compilers will help you?
Here is a simple example: https://godbolt.org/z/WaEW68cMP

EDIT: this can be vectorized, but the output is much more complex.
 