You're seeing it all the time; it's just that it's (likely) only a small percentage of the code.
4 ALUs are already quite good and helped get Zen its 40%+ IPC gain.
The potential single-thread gains from 4 ALUs to 5 (or 6) are going to be much smaller. But at this point even a ~5% IPC increase counts for a lot. And for (SMT2) multithreaded IPC gains, it's bound to be double digits.
The slight downside is that more idle pipes means gating is needed to avoid losing efficiency. Or a 4-way MT scheme; SMT2+?
This is just hand waving, nothing of value. For example, you're only waiting one extra cycle to execute on 4-wide Zen versus 6-wide A12, and Zen has lower latency for simple ALU ops. So unless you can sustain 6 ALU ops per cycle over many back-to-back cycles, you're not gaining anything, yet the A12 has an IPC advantage. Why is that? How do you propose to load or store a damn thing while sustaining 6 ALU ops? Zen 2 doesn't even have enough issue width right now to sustain its 4 ALUs + 3 AGUs. As I have already shown for SPEC int (something you moar-ALU guys have yet to do), x86 instructions with memory operands make up a very large share of the instruction stream.
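To put a rough number on that point, here's a back-of-the-envelope sketch of how AGU ports and issue width cap sustained ALU throughput. The port counts and the memory-op fraction below are illustrative placeholders, not measured SPEC data:

```python
def sustained_alu_ipc(issue_width, n_alu, n_agu, mem_frac):
    """Steady-state upper bound on ALU ops retired per cycle.

    mem_frac is the fraction of ops that need an AGU (loads/stores).
    Three caps apply: total issue width, ALU port count, and the AGU
    ports throttling total flow when mem_frac > 0. Ignores stalls,
    dependencies, and cache misses entirely.
    """
    total_cap = issue_width
    if mem_frac > 0:
        # AGUs limit how many total ops per cycle can flow through
        total_cap = min(total_cap, n_agu / mem_frac)
    return min(n_alu, total_cap * (1 - mem_frac))

# A hypothetical 6-wide machine with 6 ALUs but only 2 AGUs, and an
# assumed 35% memory-op mix: the ALUs can never come close to 6/cycle.
print(sustained_alu_ipc(6, 6, 2, 0.35))
```

The point of the sketch: once a sizable fraction of the instruction stream needs an AGU, adding ALUs past the issue/AGU bound buys nothing in steady state.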
Going to 4 ALUs alone did not get anywhere near a 40% gain. If you want to be specific and correct, Bulldozer could already do 4 ALU ops in a core in a single cycle (not that it would practically happen, or that you would want it to).
Let's be clear here:
much improved L1I cache (no more aliasing)
much improved L1D cache
improved/increased instruction fetch
addition of a µop cache
significantly improved cache hierarchy
dedicated hardware for stack handling (store-to-load forwarding at the front end of the pipeline)
increased instruction dispatch
significantly increased PRF (96 → 168)
improved branch predictors
improved prefetch
improved store forwarding
ALU count increased to 4
All of those things together got the 40% performance uplift, not 4 ALUs, FFS.
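For what it's worth, modest per-feature gains compound multiplicatively, so no single item on that list has to be huge to land at ~40% combined. A toy illustration (the per-feature percentages are made up for the sketch, not AMD's actual attributions):

```python
# Hypothetical per-feature speedup factors (illustrative only).
speedups = {
    "uop cache": 1.08,
    "branch prediction": 1.06,
    "L1/L2 caches": 1.07,
    "wider dispatch": 1.04,
    "bigger PRF": 1.04,
    "4th ALU": 1.03,
}

total = 1.0
for feature, factor in speedups.items():
    total *= factor  # independent speedups multiply, not add

print(f"combined uplift: {total - 1:.0%}")
```

Six features of 3-8% each already compound to well over 30%, which is the whole point: the fourth ALU is one modest contributor among many.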
If you go back and look, the initial thoughts of people like David Kanter were that Zen's 4:2 ALU:AGU configuration was suboptimal and that 3:3 would have been better. Zen 2 comes along and makes it 4:3... funny, that.
So instead of BS hand waving, show me the money! Show me the SPEC int workload that needs to issue 6 ALU instructions cycle after cycle after cycle while not loading or storing a thing.
I'm just going to quote Agner:
Bottlenecks in AMD Ryzen

The throughput of each core in the Ryzen is higher than on any previous AMD or Intel x86 processor, except for 256-bit vector instructions. Loops that fit into the µop cache can have a throughput of five instructions or six µops per clock cycle. Code that does not fit into the µop cache can have a throughput of four instructions or six µops or approximately 16 bytes of code per clock cycle, whichever is smaller. The 16 bytes fetch rate is a likely bottleneck for CPU intensive code with large loops.
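The 16-byte fetch limit is easy to sanity-check against the decode width. A quick sketch, assuming an average x86-64 instruction length of ~4 bytes (that average is my assumption, not a measurement):

```python
# How the per-clock caps from Agner's quote interact for code that
# misses the µop cache. The average instruction length is assumed.
FETCH_BYTES_PER_CLK = 16   # fetch bandwidth, per the quote
DECODE_INSNS_PER_CLK = 4   # instruction throughput cap, per the quote

avg_insn_bytes = 4.0       # assumed average x86-64 instruction size

fetch_limited_ipc = FETCH_BYTES_PER_CLK / avg_insn_bytes
ipc_cap = min(fetch_limited_ipc, DECODE_INSNS_PER_CLK)
print(ipc_cap)  # → 4.0
```

With longer-than-average instructions (common once REX prefixes and memory operands pile up), the fetch side drops below 4 and becomes the bottleneck before any ALU does.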
Funny how there's nothing about ALU bottlenecks in his "optimization guide for assembly programmers and compiler makers".