On one of the next advanced architecture design group speculation forums:
x-bit Reg FPFMA
V.x=0, 1-wide SISD/Scalar FMA x-bit op [V.x=0 also allows 80-bit extended precision (EP) for x-bit]
V.x=1, 2-wide SIMD FMA x-bit op
V.x=2, 4-wide SIMD FMA x-bit op
V.x=3, 8-wide SIMD FMA x-bit op
This can overlap with every x87->AVX512 operation using the fewest instructions. [The above covers Centaur's unreleased Extended x87 instruction set, CT64-x87.]
Microsoft's SSEx->NEON transpiler emulator can be improved upon with the above, with the added benefit that the base remains the same and only the extension is swapped: x64-(SSEx) -> x64-(New ISE).
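To make the V.x encoding above concrete, here is a minimal sketch, assuming the 2-bit V.x field simply selects 2^V.x lanes of an x-bit element. The names `FpFma`, `lanes`, `total_bits` and `execute` are hypothetical and only for illustration, and the arithmetic is a plain multiply-then-add in Python floats, not a true single-rounding fused FMA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FpFma:
    """Hypothetical x-bit Reg FPFMA encoding: element width plus 2-bit V.x."""
    elem_bits: int  # the "x" in "x-bit", e.g. 32 or 64
    vx: int         # V.x field, 0..3

    @property
    def lanes(self) -> int:
        # V.x=0 -> 1-wide (scalar), 1 -> 2-wide, 2 -> 4-wide, 3 -> 8-wide
        if not 0 <= self.vx <= 3:
            raise ValueError("V.x is assumed to be a 2-bit field (0..3)")
        return 1 << self.vx

    @property
    def total_bits(self) -> int:
        # Total register width touched by one instruction
        return self.elem_bits * self.lanes


def execute(op: FpFma, a, b, c):
    """Emulate a*b+c per lane (not fused: Python rounds twice here)."""
    n = op.lanes
    if not (len(a) == len(b) == len(c) == n):
        raise ValueError(f"expected {n} elements per operand")
    return [x * y + z for x, y, z in zip(a, b, c)]


if __name__ == "__main__":
    for vx in range(4):
        op = FpFma(elem_bits=64, vx=vx)
        print(f"V.x={vx}: {op.lanes}-wide, {op.total_bits}-bit total")
```

Under this assumption the same opcode covers scalar through 8-wide, and only the V.x field changes between them.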
----
AVX512 1024-bit (11, reserved in current EVEX)
AVX512 512-bit (10)
AVX512 256-bit (01)
AVX512 128-bit (00)
AVX512 Scalar :: 5 instructions for same op
AVX 256-bit
AVX 128-bit
AVX Scalar :: 3 instructions for same op
SSE 128-bit
SSE Scalar :: 2 instructions for same op
Total: 10 instructions for same op
-> New instruction set extension:
1 instruction can implement all of the above. Moving the complexity of operations to the rename/scheduler is more CISC than placing the complexity on decode.
New definitions:
CISC -> Grand Library of 1-ISC(OISC)
RISC -> Small Library of 1-ISC(OISC)
The complexity is still available, but it is up to the micro-architecture, not the extension, to achieve it.
However, you could just google HPC portability to get a simpler version of this:
HPC, for example, is moving away from tuning, in favor of portability:
SIMD Architecture-implementation Agnostic high-level code:
Gen1-Arch: 256-bit low-level runtime code
Gen9-Arch: 2048-bit, plus Packed SIMD Streaming (8x 256-bit) as well... low-level runtime code
Because of the above, modern SIMD instruction sets have made HPC computers lean towards custom RISC-V/ARM, since they can be upgraded and expanded rapidly while knowing the code will work across generations.
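The Gen1/Gen9 idea above can be sketched as a vector-length-agnostic loop: the high-level code never names a width, and the hardware width is only a runtime parameter. This is a rough emulation in plain Python, similar in spirit to how RVV code asks for a vector length with vsetvli each iteration. The names `vla_fma` and `strips` are hypothetical, and 64-bit elements are an assumption.

```python
def strips(n: int, vlen_bits: int, elem_bits: int = 64) -> int:
    """Number of vector iterations needed for n elements at this width."""
    lanes = vlen_bits // elem_bits
    if lanes < 1:
        raise ValueError("vector register narrower than one element")
    return -(-n // lanes)  # ceiling division


def vla_fma(a, b, c, vlen_bits: int, elem_bits: int = 64):
    """Compute a[i]*b[i]+c[i] in strips sized by the hardware vector length."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("operands must have the same length")
    lanes = vlen_bits // elem_bits
    if lanes < 1:
        raise ValueError("vector register narrower than one element")
    out = []
    for start in range(0, len(a), lanes):
        # Active length for this strip; the tail strip may be shorter
        vl = min(lanes, len(a) - start)
        out.extend(a[start + i] * b[start + i] + c[start + i] for i in range(vl))
    return out


if __name__ == "__main__":
    a = [float(i) for i in range(100)]
    b = [2.0] * 100
    c = [1.0] * 100
    gen1 = vla_fma(a, b, c, vlen_bits=256)   # Gen1-Arch: 256-bit
    gen9 = vla_fma(a, b, c, vlen_bits=2048)  # Gen9-Arch: 2048-bit
    print(gen1 == gen9, strips(100, 256), strips(100, 2048))
```

The same source gives the same result on both generations; only the number of strips changes.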
HPC Computer Facility-example:
Building A Phase 1: Gen1 computers
Building E Phase 12: Gen9 computers
If the instruction op is shared between heterogeneous architectures, e.g. CPU and GPU:
If the CPU does V.a_tot at 512-bit~1024-bit and the GPU does V.b_tot at 1024-bit~2048-bit, then the new instruction extension is definitely needed.
RVV Class A custom CPU: RVV supports 128-bit through 1024-bit.
RVV Class B custom GPU(minus graphics): RVV supports 1024-bit through 8192-bit.
Rapids and Shores being around this is a bit silly.