I think that you underestimate how much SIMD is used for number crunching, either because the application programmer used assembler, intrinsics, or a vectorizing compiler (Intel's, for example), or because they rely on some library (BLAS, FFTW, ...) that is vectorized.

I've been thinking about CPU architecture.
Why couldn't we get rid of the FPU and use a larger ALU, for example a 512-bit ALU, and do pure integer math instead?
Do we really need more than 512 bits?
If we expect our hardware to do function «A», is it not intuitively most efficient to implement A directly rather than going through some proxy?
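The idea above can be sketched in software: represent reals as fixed-point values inside wide integers, so that add and multiply become plain integer operations. This is a minimal sketch, not a hardware proposal; Python's arbitrary-precision ints stand in for the hypothetical 512-bit ALU, and the 256.256 split of the word is an arbitrary choice for illustration.

```python
# Fixed-point arithmetic on wide integers, as a stand-in for a
# hypothetical 512-bit integer ALU replacing the FPU.
# (The 256 fraction bits are an assumption, not anything standard.)

FRAC_BITS = 256          # low half of a 512-bit word holds the fraction
SCALE = 1 << FRAC_BITS

def to_fixed(x: float) -> int:
    # Encode a real number as a wide integer.
    return round(x * SCALE)

def fx_add(a: int, b: int) -> int:
    # Addition is a single integer add; no FPU involved.
    return a + b

def fx_mul(a: int, b: int) -> int:
    # Full-width integer multiply, then shift to renormalize.
    return (a * b) >> FRAC_BITS

def to_float(a: int) -> float:
    # Decode back to a native float for display.
    return a / SCALE

a = to_fixed(1.5)
b = to_fixed(2.25)
print(to_float(fx_add(a, b)))   # 3.75
print(to_float(fx_mul(a, b)))   # 3.375
```

The catch, of course, is dynamic range: fixed-point trades the FPU's exponent for uniform precision, which is exactly the kind of "proxy" question the post raises.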
-k