Scalable vectors supposedly have upsides (write once, run forever), but those really aren't proven yet. I've already seen opinions that if you write and tune a certain algorithm now, it'll likely be optimal only for 128-bit, and you may lose performance on future CPUs with wider units.
The big problem, of course, is that there are no CPUs with wider units ATM. It's an unproven concept, at least in part.*
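For what it's worth, the "write once" promise boils down to never hardcoding the vector width in your loop. A rough sketch of the vector-length-agnostic pattern (plain Python as a stand-in; the `vlen` parameter models whatever element count the hardware would report at runtime, e.g. via SVE's `svcntw`, and the names here are just for illustration):

```python
# Vector-length-agnostic loop: the code never bakes in a SIMD width.
# `vlen` stands in for the element count the hardware reports at runtime;
# the tail is handled by "predication", i.e. a shorter final chunk
# instead of a scalar cleanup loop.
def add_arrays(a, b, vlen):
    n = len(a)
    out = [0.0] * n
    i = 0
    while i < n:
        active = min(vlen, n - i)          # predicate: how many lanes are live
        for j in range(i, i + active):     # one "vector" operation
            out[j] = a[j] + b[j]
        i += vlen
    return out

# The same source runs unchanged on any width: a 128-bit machine (vlen=4
# for 32-bit floats) and a 512-bit one (vlen=16) give identical results.
a = [float(i) for i in range(10)]
b = [float(2 * i) for i in range(10)]
assert add_arrays(a, b, 4) == add_arrays(a, b, 16) == [3.0 * i for i in range(10)]
```

Whether the *tuning* (unrolling, data layout, shuffle patterns) carries over as cleanly as the correctness does is exactly the part that's unproven.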
There are some known weaknesses: there are algorithms you want to use SIMD for where not having a defined width for your operations makes them hard or complicated to handle.
Shuffle instructions, which are critical for more complicated algorithms (beyond "just need a lot of FMAs, CPU" tasks), are somewhat at odds with scalable-width SIMD. This can probably be surmounted, at least sometimes, but it adds complexity. And again, it may lead to suboptimal performance.
I suspect that is something you can list as a disadvantage: suboptimal performance compared to the potential you could have with a fixed-width SIMD instruction set.
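To make the shuffle friction concrete: many kernels bake a permutation pattern that only makes sense at one width. A toy model (pure Python; the index list stands in for a shuffle control vector, and the pattern shown is a simple even/odd deinterleave):

```python
# Toy model of a SIMD shuffle: `perm` is the control vector of source lanes.
def shuffle(vec, perm):
    return [vec[p] for p in perm]

# Fixed-width thinking: this deinterleave pattern is written for exactly
# 4 lanes (e.g. 128-bit vectors of 32-bit elements)...
assert shuffle([0, 1, 2, 3], [0, 2, 1, 3]) == [0, 2, 1, 3]

# ...and encodes the wrong intent on a wider machine: with 8 lanes,
# "evens first, then odds" needs a different, runtime-computed pattern.
def deinterleave_perm(vlen):
    return list(range(0, vlen, 2)) + list(range(1, vlen, 2))

assert shuffle(list(range(8)), deinterleave_perm(8)) == [0, 2, 4, 6, 1, 3, 5, 7]
```

Computing control vectors at runtime like this works, but it's exactly the kind of extra complexity (and potential overhead) being described above.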
* Technically, you could write 128-bit SVE code and then go test it on the very rare Fujitsu CPU or on an older processor with Neoverse V1 cores. That will only allow you to test the older, floating-point-focused part of the instruction set, not full SVE2.
Also, I think you really want to try a wider spread of widths, not just 128/256, to really see how it behaves (the Fujitsu CPU with its 512-bit width would be better for this, but given that SVE allows up to 2048 bits, it's still far from a perfect test of the more hardcore configs).
I don't think there is a problem with shuffle instructions in a scalable context for application-class processors.
However, it's unrealistic for them to scale much beyond a vector length of 512/1024, since the implementation complexity of a fast, 1-cycle-throughput implementation grows quadratically with vector length. It grows only linearly with vector length if you shuffle a fixed N elements at a time (in one uop). But fixed-size ISAs would have the same problem if they chose a larger SIMD width.
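A back-of-the-envelope version of that scaling argument, treating crossbar cost as roughly one select point per (output lane, input lane) pair. This is a simplification (real costs depend on wiring and port counts), but it shows the quadratic-vs-linear split:

```python
# A full single-uop permute lets every output lane select any input lane:
# roughly lanes * lanes select points -> quadratic in vector length.
def full_permute_cost(lanes):
    return lanes * lanes

# Shuffling a fixed window of N lanes per uop only needs lanes/N small
# N x N crossbars -> linear in vector length.
def windowed_permute_cost(lanes, n):
    return (lanes // n) * n * n

# Doubling the vector length quadruples the full crossbar...
assert full_permute_cost(32) == 4 * full_permute_cost(16)
# ...but only doubles the windowed one.
assert windowed_permute_cost(32, 4) == 2 * windowed_permute_cost(16, 4)
```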
There is a problem, though, not due to the scalability, but rather due to the free-for-all nature of RISC-V (this doesn't apply as much to SVE, since ARM has the final word), which allows you to target your processor at vastly different applications that prioritize different instructions. For an AI accelerator, you wouldn't spend the silicon to make a permute fast, since you only use it in setup code and never in hot loops. Meanwhile a fast permute is, as you said, very useful for general-purpose computation, and will be a must for application-class cores.
On the RISC-V side, there are currently two boards with RVV 1.0 support available, one with a 128-bit (XuanTie C908) and one with a 256-bit (SpacemiT X60) vector length. The dev boards that have been concretely announced to release this year or at the beginning of next year also feature a variety of vector lengths: Milk-V Oasis (16 SiFive P670 cores with VLEN=128 and 4 SiFive X280 cores with VLEN=512 as an AI accelerator), and Andes QiLai (4 AX45MP cores without vector and one NX27V accelerator core with VLEN=512, reportedly quadruple-issue with a 512-bit datapath 🤤).
If you look at open-source implementations, there is also Ara, with a vector length of up to 16K, and XiangShan, with 128-bit vectors but a high-performance OoO design. You can play with them using RTL simulation, but Ara still has some bugs, and the XiangShan vector implementation currently only works properly on their MinimalConfig, which is still multi-issue OoO but scaled down quite a bit.
As I've written in my earlier comment, I think there are a lot of problems (if not most, and not just "a bunch of FMAs" problems) that can benefit from and scale properly with larger vector lengths, but some will still require specialization, or can only take advantage of smaller vector lengths, due to the nature of the problem.