Not exactly true. What actually caused the performance to tank was the general-purpose cores.
The FPU is mostly what makes up for the deficiencies in the general-purpose cores. Essentially, the Bulldozer-Excavator cores were no better than a more optimized, higher-clocked Bobcat-Jaguar core, while the FPU is vastly better than what is in the Zen core. AMD also had a couple of years to fix the FPU's inefficiencies. They even had a more optimized FMAC in test chips. It could in fact split, and it is still more efficient, with lower dependency latency, than bridging FMUL and FADD.
They could have easily fixed the Bulldozer general-purpose cores by actually making full use of the Alpha 21264 core, which had FOUR ALUs, not the mostly two ALUs plus the completely gimped AGLUs. Essentially, AMD Bulldozer had a vastly improved Alpha 21264 front-end, two gimped Alpha 21264 general-purpose cores, and a vastly improved Alpha 21264 FPU (2 Alpha 21264 FPUs in FMAC + 2 FMISC from Stars).
It clearly sucks that AMD did not let the Bulldozer designs use the 2009-2012 improvements that were tested internally.
Different tasks, thus different optimizations. Physically, larger units mean more area, more power, and lower frequency.
The FPU is actually a bunch of 32-bit and 64-bit units running a single instruction. So size isn't really the problem; moving all that data at once is. Load/store is super energy intensive, so one larger load/store costs about the same as many smaller loads/stores. Also, just because it is called an FPU doesn't mean it doesn't do integer work.
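To illustrate the "bunch of 32-bit units running a single instruction" point, here is a rough Python sketch that treats a 128-bit register as four independent 32-bit lanes. The function names and the 128-bit width are my own assumptions for illustration, not anything specific to the Bulldozer FPU. The same 16 bytes can be read as floats or as integers, which is why the "FPU" ends up doing integer work too.

```python
import struct

# A 128-bit "register" is modeled as 16 raw bytes.
# Helper names here are hypothetical, chosen only for this sketch.

def simd_add_f32(a: bytes, b: bytes) -> bytes:
    """One 'instruction': add four 32-bit float lanes independently."""
    lanes_a = struct.unpack("<4f", a)
    lanes_b = struct.unpack("<4f", b)
    return struct.pack("<4f", *(x + y for x, y in zip(lanes_a, lanes_b)))

def simd_add_i32(a: bytes, b: bytes) -> bytes:
    """Same register width, but the lanes are 32-bit integers that wrap on overflow."""
    lanes_a = struct.unpack("<4I", a)
    lanes_b = struct.unpack("<4I", b)
    return struct.pack("<4I", *((x + y) & 0xFFFFFFFF for x, y in zip(lanes_a, lanes_b)))

# Float lanes
fa = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
fb = struct.pack("<4f", 10.0, 20.0, 30.0, 40.0)
float_result = struct.unpack("<4f", simd_add_f32(fa, fb))

# Integer lanes; the last lane overflows and wraps around to 0
ia = struct.pack("<4I", 1, 2, 3, 0xFFFFFFFF)
ib = struct.pack("<4I", 1, 1, 1, 1)
int_result = struct.unpack("<4I", simd_add_i32(ia, ib))
```

Each lane is small and simple. What takes the effort is getting all 16 bytes in and out of the register at once, which is the data-movement cost mentioned above.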