Could someone help explain the A12 Vortex core's execution-unit imbalance, please?
We know that roughly 50% of instructions are loads/stores, and of those most are loads. Given that, the Vortex core appears to have a significant imbalance between its 6 ALUs and 2 LSUs.
The second thing is performance. The two additional ALUs are the simple/branch shared type, which could theoretically bring approximately +20-30% IPC. However, Vortex delivers +58% IPC over Skylake, roughly 3x that expectation. The combination of imbalance and high performance is a mystery. There must be something smart inside.
Did Apple engineers develop some new advanced technique in the reorder buffer? Something like the load ROB predictor on Conroe? Or are they using such a large instruction window that they can extract very high ILP and absorb these costly load/store instructions at the same time?
To answer your questions somewhat:
(a) your load/store characterization is not quite correct. Obviously it depends on the workload, compiler, and ISA, but 25% loads and 10% stores are better approximations.
(ARMv8 can require fewer load/stores because of pairing, but then the rich ISA [things like short shifts, and the fancy MOVs and CSELs that can modify data in easy ways] means there are also fewer logic instructions. So overall you get about the same numbers, maybe 24% loads and 9% stores for ARMv8 vs the x86-64 numbers above. See eg
https://arxiv.org/pdf/1607.02318.pdf )
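(If you want to sanity-check those ratios yourself, here's a minimal sketch of my own, not from the paper: pipe disassembly text through a crude mnemonic classifier. The load/store mnemonic lists are small illustrative ARMv8 subsets, nothing exhaustive.

/* Crude instruction-mix counter: objdump -d ./a.out | ./mixcount
   Mnemonic lists are illustrative ARMv8 subsets, not exhaustive. */
#include <stdio.h>
#include <string.h>

static int is_in(const char *m, const char *const *list) {
    for (; *list; list++)
        if (strcmp(m, *list) == 0) return 1;
    return 0;
}

int main(void) {
    static const char *const loads[]  = { "ldr", "ldrb", "ldrh", "ldp", "ldur", NULL };
    static const char *const stores[] = { "str", "strb", "strh", "stp", "stur", NULL };
    long n = 0, nload = 0, nstore = 0;
    char line[512];

    while (fgets(line, sizeof line, stdin)) {
        /* objdump lines look like: " 4005c4:<tab>52800000 <tab>mov<tab>w0, #0x0"
           so the mnemonic is the token after the second tab; lines without
           two tabs (labels, section headers) are skipped.                    */
        char *tab = strchr(line, '\t');
        if (!tab) continue;
        tab = strchr(tab + 1, '\t');
        if (!tab) continue;
        char mnem[16] = {0};
        if (sscanf(tab + 1, "%15s", mnem) != 1) continue;
        n++;
        if (is_in(mnem, loads))       nload++;
        else if (is_in(mnem, stores)) nstore++;
    }
    if (n)
        printf("loads %.1f%%  stores %.1f%%  (of %ld instructions)\n",
               100.0 * nload / n, 100.0 * nstore / n, n);
    return 0;
}

Crude, but good enough to ballpark the mix and see it is nowhere near 50% memory ops.)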
(b) so the balance is reasonably OK, ie 1/3. The next thing you have to remember is that numbers like the above give global averages, but performance happens over a window of a few hundred instructions. What matters as much as how many units of various types you have is how much flexibility you have (in queue depth and reordering capabilities) to cope with temporary deviations from these averages. You may have a long stretch (think copying a large data structure) that's mainly load/stores. Or a long stretch that's primarily ALU or FP instructions. If you're dominated by load/stores for a run longer than the OoO window, obviously you're throttled by the number of LS units; likewise if you're dominated by ALU work then you're limited by the number of ALUs.
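To make that concrete, here's a toy issue-bandwidth model I put together (my own construction, nothing Apple-specific): no dependencies or latencies, just structural hazards plus a finite window, fed a bursty trace. The unit counts, window size, and burst lengths are all illustrative knobs.

/* Toy issue model: each cycle, issue as many instructions as unit
   counts allow from the WINDOW oldest unfinished instructions.
   No dependencies or latencies are modeled; instructions complete
   the cycle they issue.  Purely illustrative numbers throughout.  */
#include <stdio.h>

#define N      4096   /* instructions in the trace                   */
#define WINDOW  128   /* how far past the oldest unfinished insn
                         the scheduler can look                       */

enum { ALU, LS };

int main(void) {
    static int cls[N], done[N];

    /* Bursty trace: 25% load/store overall, but packed into runs of
       300 consecutive LS ops (think: copying a large structure)
       followed by 900 ALU ops.                                       */
    for (int i = 0; i < N; i++)
        cls[i] = (i % 1200 < 300) ? LS : ALU;

    const int n_alu = 6, n_lsu = 2;
    long cycles = 0;
    int head = 0;                      /* oldest unfinished insn      */

    while (head < N) {
        int alu_left = n_alu, ls_left = n_lsu;
        int limit = (head + WINDOW < N) ? head + WINDOW : N;
        for (int i = head; i < limit; i++) {
            if (done[i]) continue;
            if (cls[i] == ALU && alu_left)    { done[i] = 1; alu_left--; }
            else if (cls[i] == LS && ls_left) { done[i] = 1; ls_left--;  }
        }
        while (head < N && done[head]) head++;
        cycles++;
    }
    printf("IPC = %.2f\n", (double)N / cycles);
    return 0;
}

With the burst (300 LS ops) longer than the window (128), the model crawls at 2 IPC through most of each copy phase because no ALU work is visible to the scheduler. Shrink the burst below the window size, or grow the window, and the ALU work hiding behind the copy becomes visible and overlaps with it, so IPC jumps. That's the flexibility point in (b).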
(c) so what to do? Yes, in an ideal world you provide 50 LS units and 100 ALU units and you're never throttled by anything. In a non-ideal world, you have to make tradeoffs.
LS units are ferociously complicated. ALU units are basically simple.
SO
it makes sense to provide more ALU units than the naive averages suggest...
Yes, sure, much of the time you won't use those extra ALUs (certainly not when copying data, or in "balanced" loops that read/write data and perform a fairly trivial manipulation).
BUT there will be some loops that are dominated by ALU operations, loops where every value you read in gets manipulated a lot before the next value is read. And for loops like THAT, the extra ALUs will kick in and speed things up.
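To illustrate with a pair of hypothetical C loops (function names and constants are mine, purely for illustration): in the first, the two LS units saturate and the extra ALUs sit idle; in the second, every loaded value feeds a chain of cheap integer ops, and the extra ALUs are exactly what keeps things moving.

#include <stddef.h>
#include <stdint.h>

/* "Balanced" loop: one load + one store per element and almost no
   ALU work, so throughput is set by the load/store units.          */
void copy_scale(uint32_t *dst, const uint32_t *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * 2;
}

/* ALU-rich loop: one load per element, then a pile of shifts, xors
   and multiplies (a mixing-hash-style chain) before the next value
   is consumed.  Here extra ALUs are what speed things up.          */
uint32_t hash_sum(const uint32_t *src, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t x = src[i];
        x ^= x >> 16;  x *= 0x7feb352dU;
        x ^= x >> 15;  x *= 0x846ca68bU;
        x ^= x >> 16;
        acc += x;
    }
    return acc;
}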
CPU design is always a balancing act. The things you have to balance include the fact that code comes in phases. Looking at windows of 1000 instructions or so, there will be phases that are LS rich, ALU rich, FPU rich, branch rich... Given that, you want extra backup capability along every dimension! This isn't practical, but it IS practical to add backup capability along the easiest dimensions.
ALU is easiest, FPU second easiest, branch seems trivial but to be useful needs a lot of extra support in the fetch front-end, so that's probably third in line to get backup, and LS is definitely the hardest to grow.