So... Apple destroys any ST opponent, even Intel, but is brutalized in MT by everyone (even MediaTek and AMD) except Qualcomm? Interesting...
This is simply not true. ("Apple is ... brutalized in MT by everyone".) Look at the numbers.
The reason appears to be that the ARM mobile SoCs do not provide enough memory performance to support 4 cores, let alone 8. The worst offenders use a single memory controller connected to a 64-bit wide bus and suffer from both bandwidth and latency (or, if you prefer, queuing) problems. The more aggressive vendors try to improve things by using a 128-bit wide bus, which helps with bandwidth but not with queuing.
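To put rough numbers on the bandwidth half of that argument, here is a back-of-the-envelope sketch in Python. The 3200 MT/s transfer rate is an assumed, illustrative figure (not taken from any particular SoC), and the function names are made up for this example. It only computes theoretical peak bandwidth and an even per-core split; it does not model queuing at all, which is the part a wider bus does not fix.

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfers_per_sec: float) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_per_sec / 1e9


def per_core_share_gbs(bus_width_bits: int, transfers_per_sec: float, cores: int) -> float:
    """Peak bandwidth divided evenly among cores (ignores contention and queuing)."""
    return peak_bandwidth_gbs(bus_width_bits, transfers_per_sec) / cores


# ASSUMPTION: 3200 mega-transfers per second, chosen only for illustration.
ASSUMED_TRANSFERS_PER_SEC = 3200e6

for width in (64, 128):
    for cores in (4, 8):
        share = per_core_share_gbs(width, ASSUMED_TRANSFERS_PER_SEC, cores)
        print(f"{width}-bit bus, {cores} cores: {share:.1f} GB/s per core at best")
```

Under that assumed rate, doubling the bus width doubles each core's best-case share, but eight cores on the wide bus still get no more than four cores on the narrow one.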
You can invent toy benchmarks that run purely in cache, for which all 8 cores of a standard high-end SoC can crank away without stomping on each other trying to reach RAM (although such a benchmark will then very soon force the SoC to throttle...), but for any real code, memory is the real constraint.
This is not purely an "other ARM" problem; even Apple suffers from it. Going from one core to two, Apple's per-core performance drops to about 80-85%, and with their one 3-core CPU so far, it dropped to only about 70%. There's a real engineering problem here: ideally you'd like more parallelism in your memory system (more RAM banks, more RAM pages, more controllers, deeper queues in the controllers), but all this stuff costs power and area, so it's not a great addition to a phone. Even so, Apple's trade-offs basically make sense (they get a real performance boost from that second core), whereas for ARM, especially in the supposed 4+4 mode, the extra cores are nonsense, pretty much guaranteed to spend a lot of time waiting on RAM because other cores submitted their requests first.
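The per-core figures above translate into total throughput like this. This is just the arithmetic, sketched in Python; the function name is made up for the example, and the only inputs are the percentages quoted above.

```python
def aggregate_speedup(cores: int, per_core_efficiency: float) -> float:
    """Total throughput relative to one core running alone.

    per_core_efficiency is each core's performance as a fraction of what
    a single core achieves when it has the memory system to itself.
    """
    return cores * per_core_efficiency


# Figures quoted in the comment above: 80-85% per core with two cores,
# about 70% per core with three.
for cores, efficiency in ((2, 0.80), (2, 0.85), (3, 0.70)):
    total = aggregate_speedup(cores, efficiency)
    print(f"{cores} cores at {efficiency:.0%} each: {total:.2f}x one core")
```

So the second core is worth roughly 0.6 to 0.7 of a full core, which is the "real performance boost" referred to above, while each core added to an already memory-starved design buys progressively less.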
It's worth noting that XGene3, for example, tries to deal with this problem by including 8(!) memory controllers on the die to feed 32 cores. That is one controller per four cores, which may still be sub-optimal (ten might be better, though the more cores and controllers you have active, the better the statistics average out to help you, so you aren't hurt so much by the unfortunate moments when all your cores want memory access at once).
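The "statistics average out" point can be illustrated with a toy model, sketched here in Python. Everything in it is an assumption made for illustration: time is divided into slots, each core independently issues a request in a slot with probability 0.2 (a made-up number), each controller serves one request per slot, any controller can serve any core, and leftover requests are not carried into the next slot. Real memory systems are far messier; the function name is also invented for this example.

```python
from math import comb


def unserved_fraction(cores: int, controllers: int, p_request: float) -> float:
    """Expected fraction of requests that cannot be served in their own slot.

    Toy model: the number of requests per slot is Binomial(cores, p_request),
    and anything beyond `controllers` requests in a slot has to wait.
    """
    expected_requests = cores * p_request
    expected_overflow = 0.0
    for k in range(controllers + 1, cores + 1):
        prob_k = comb(cores, k) * p_request**k * (1 - p_request) ** (cores - k)
        expected_overflow += (k - controllers) * prob_k
    return expected_overflow / expected_requests


# ASSUMPTION: p_request = 0.2 is purely illustrative.
P_REQUEST = 0.2

small = unserved_fraction(cores=4, controllers=1, p_request=P_REQUEST)
large = unserved_fraction(cores=32, controllers=8, p_request=P_REQUEST)
print(f"4 cores / 1 controller:   {small:.3f} of requests wait")
print(f"32 cores / 8 controllers: {large:.3f} of requests wait")
```

Both configurations have the same ratio of four cores per controller, but in this model the pooled 32-core case leaves a noticeably smaller fraction of requests waiting, because bursts from a few cores are absorbed by controllers the other cores are not using at that moment.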