It comes down largely to the decisions made by the architects. There is no silver bullet, so these decisions are basically trade-offs.
These decisions set the project's sub-budgets - the areas of development investment. These areas are defined by projected goals measured with various metrics.
Originally, the Zen IP targeted mobile, server and even desktop/workstation workloads in a rather symmetric way. With Zen 5, things took a different direction.
Zen 5's caches, core structures and data paths got reworked in order to feed the brand-new 512b-wide FPU. This development investment is disproportionate to the INT investment, and on top of that it led to regressions for various instructions. So a 512b datapath does seem like a grand goal for the core. Is it a generally usable goal, though? Not really.
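To make the 512b point concrete, here is a minimal sketch (my own illustration, not AMD code) of the kind of loop a native 512b datapath is built to feed; it assumes AVX-512F support (compile with -mavx512f) and n divisible by 16. On Zen 4 each 512b op is cracked into two 256b halves ("double-pumped"); Zen 5's full-width FPU can execute it in one pass.

    #include <immintrin.h>
    #include <stddef.h>

    /* dst[i] = a[i] + b[i], 16 floats per 512-bit vector op. */
    void add_f32(float *dst, const float *a, const float *b, size_t n) {
        for (size_t i = 0; i < n; i += 16) {
            __m512 va = _mm512_loadu_ps(a + i);  /* unaligned 512b load */
            __m512 vb = _mm512_loadu_ps(b + i);
            _mm512_storeu_ps(dst + i, _mm512_add_ps(va, vb));
        }
    }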
Well, integer got increased load-store capacity (4 loads / 2 stores / 4 memory ops total per cycle, up from 3/1/3) plus 2 more ALUs, which also improved the mix from 3 simple + 1 complex to 3 simple + 3 complex (although not all three handle every complex instruction). A jump to 8 ALUs would probably be too complex to do at once. (SIMD, by contrast, can't grow in +50% steps - widths double - although arguably you could say AMD did half of that job in Zen 4 and half in Zen 5.)
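For a back-of-envelope of what the wider load-store unit buys (my own numbers, scalar code, ignoring every other limit):

    #include <stddef.h>

    /* Each iteration issues 2 loads, 1 store and 1 ALU add.
       Zen 4: 3 loads/cyc, 1 store/cyc, 3 mem ops/cyc total
              -> min(3/2, 1/1, 3/3) = 1 iteration per cycle
       Zen 5: 4 loads/cyc, 2 stores/cyc, 4 mem ops/cyc total
              -> min(4/2, 2/1, 4/3) ~ 1.33 iterations per cycle
       Note the cap moves from the single store port to the
       shared total-ops limit. */
    void add_arrays(long *dst, const long *a, const long *b, size_t n) {
        for (size_t i = 0; i < n; ++i)
            dst[i] = a[i] + b[i];
    }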
Btw Zen 2 doubled the FPU width after years of being stuck on 128b. For context, by then Intel had already been riding the 512b width for two years.
Perhaps, but I would say AMD was generally late with SIMD upgrades. Fully 128-bit SSE* took them just one year after Intel, with Barcelona; full-speed AVX2 in Zen 2 was 6 years behind Intel. And Intel has had AVX-512 since 2017... (it gets a bit complicated with the question of when it became "full-speed", though).
Generally there are people doubting that AVX-512 is useful, but if you accept you want to have it, then the sooner, the better.
The uncore got no upgrade at all - no investment was made in that area. Was that a good trade-off, given that the previous gen was already bottlenecked there?
Well, the IOD was never going to be updated; sadly, the desktop lineup's policy is to refresh it only on a half-cadence. I absolutely agree that the IOD and the chiplet scheme are the biggest weakness, for performance but also for power consumption and efficiency. I wish they'd get something more efficient via advanced packaging.
Completely reworking the frontend, which (now?) dedicates resources to SMT, is another strange design choice, given that profiling (now?) shows the frontend acting as a significant single-thread bottleneck. Was an investment aimed at server-class workloads a good trade-off?
This is my layman PoV.
The reworked frontend isn't completely limited to SMT mode. The improved branch prediction still works in 1T, I think, as does dual-fetch from the micro-op (op) cache, which is luckily the more common source of instructions, at least when code behaves as AMD's engineers intended. The majority of apps should probably run from the op cache significantly more than 50% of the time, although there may be outliers.
However, there is a good chance the split decoders will get the ability to feed 1T too, the way Intel does it. I have no idea whether that feature was buggy and had to be disabled, or whether it is yet to be added, but I'm pretty confident the whole reason for this scheme is to eventually be able to do it for 1T. If that were not the goal, they would have done it like Golden Cove and Lion Cove and tried to add decoders within a single cluster. They seem to see the x86 future in the Atom-like scheme with multiple decode clusters, which may well be the more efficient way (and thus in line with the balanced "Zen" philosophy, perhaps?) to increase decode width.
Of course, that future prospect doesn't help Zen 5, but if it takes some pain now to get the benefits in the follow-up cores, it may be worth it.