Yes. ARM uses its own extensions, such as NEON, SVE, and SVE2, to do similar things.
RISC instructions are simpler than CISC instructions, and in fact the first stage of decode in a CISC core turns its instructions into RISC-like operations (uniform in length and format, so they can be pipelined easily).
SIMD (Single Instruction, Multiple Data) instructions like SSE, SSE2, AVX, etc. are a good idea for either processor design, since they accomplish in fewer cycles the same work as the chains of scalar instructions they replace.
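To make the cycle-count argument concrete, here is a toy Python sketch. It is an emulation, not real SIMD code: the function names and the 4-lane width are assumptions picked for illustration, and each loop iteration in simd_add stands in for one vector instruction.

```python
def scalar_add(a, b):
    """Add element by element; each addition counts as one instruction."""
    out, ops = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        ops += 1
    return out, ops


def simd_add(a, b, lanes=4):
    """Emulate a lane-wise vector add; each chunk counts as one instruction."""
    out, ops = [], 0
    for i in range(0, len(a), lanes):
        # One "vector instruction" handles up to `lanes` elements at once.
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        ops += 1
    return out, ops


a = list(range(8))
b = [10] * 8
scalar_result, scalar_ops = scalar_add(a, b)
simd_result, simd_ops = simd_add(a, b, lanes=4)
```

With 8 elements and 4 lanes, the scalar version counts 8 add operations and the emulated vector version counts 2, with identical results. Real speedups are smaller or larger depending on memory bandwidth, alignment, and how well the loop vectorizes.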
(2) I have come across various people who have posited that Apple's P cores do not have SMT because they have a very large ROB. I'll try to post links to the posters who said so if I can find them, but the idea is that the need for multithreading is eliminated because the core is very good at out-of-order execution. Is this correct?
PS: I am not very knowledgeable about the low-level details of CPU microarchitecture. I do have a hazy understanding of how decoders, ALUs, ROBs and all the other stuff inside a core work, though, so a simplified explanation would be much appreciated.
For MT, SMT provides a very large return on investment with respect to performance per area and performance per core. This is true because of the need for instruction-level parallelism (ILP) built into a core. This is called "superscalar" design. It allows many calculations to go on in parallel in a single clock cycle (or some number of clock cycles).
To achieve the maximum ILP in a core design, you need more execution units than are necessary under MOST loads, so that the core does not get bogged down on SOME loads. As a result, quite a few of those extra execution units are just sitting around most of the time twiddling their thumbs. Now enter SMT.
SMT allows a core to use those otherwise idle resources to work on a different "thread" of code, essentially turning a single core into 2 cores (or more in some designs).
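The idle-unit argument above can be sketched with a toy model. Everything in it is an assumption for illustration: the 4-wide issue width, the made-up per-cycle "ready instruction" traces, and the simplification that instructions which miss a slot are dropped instead of queued. It is not a model of any real core.

```python
def utilization(ready_traces, width):
    """Fraction of issue slots filled when the given threads share one core.

    ready_traces: one list per hardware thread; entry i is how many
    instructions that thread has ready to issue in cycle i.
    width: issue slots available per cycle (hypothetical value).
    Simplification: instructions that miss a slot are dropped, not queued.
    """
    cycles = len(ready_traces[0])
    issued = sum(min(width, sum(per_cycle)) for per_cycle in zip(*ready_traces))
    return issued / (width * cycles)


# Made-up traces: each thread often has fewer than 4 ready instructions.
thread_a = [4, 1, 0, 2, 1, 0, 3, 1]
thread_b = [1, 2, 3, 0, 2, 4, 1, 2]

alone = utilization([thread_a], width=4)             # thread A by itself
shared = utilization([thread_a, thread_b], width=4)  # A and B sharing the core
```

With these traces, thread A alone fills 37.5% of the issue slots, while A and B together fill 81.25%. The second thread mostly lands in slots the first one was leaving empty, which is the whole pitch for SMT.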
I don't believe it is possible to compete in data center (DC) processors without SMT. Without it, high MT performance in desktop and laptop is inefficient as well, since adding lots of cores duplicates MUCH more of the core transistors than SMT requires.
The cost (IMO) is in the complexity of the schedulers and the validation of the design.
Unlike the PC/server focus of Intel & AMD, Apple's primary market for their core designs is the iPhone, where SMT would be a negative.
Exactly.
CPU cache increases performance primarily by decreasing latency, not by making up for inadequate bandwidth.
A large cache can eliminate the need for more main memory bandwidth (not CPU internal bandwidth) by keeping the information needed in cache and avoiding external memory fetches.
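Both effects can be put into rough numbers with the standard average-access-time formula and a simple traffic estimate. The figures below (1 ns hit time, 80 ns miss penalty, 64-byte lines, 10% vs 2% miss rates, one billion accesses per second) are made up for illustration, and the traffic estimate ignores writebacks and prefetching.

```python
def average_access_time_ns(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time: hit time plus the expected miss cost."""
    return hit_time_ns + miss_rate * miss_penalty_ns


def dram_traffic_bytes_per_s(accesses_per_s, miss_rate, line_bytes=64):
    """Main-memory traffic if every miss fetches exactly one cache line."""
    return accesses_per_s * miss_rate * line_bytes


ACCESSES_PER_S = 1e9  # hypothetical access rate

# Hypothetical miss rates for a smaller and a larger cache.
small_latency = average_access_time_ns(1.0, 0.10, 80.0)
large_latency = average_access_time_ns(1.0, 0.02, 80.0)

small_traffic = dram_traffic_bytes_per_s(ACCESSES_PER_S, 0.10)
large_traffic = dram_traffic_bytes_per_s(ACCESSES_PER_S, 0.02)
```

Under these assumptions, dropping the miss rate from 10% to 2% cuts the average access time from about 9 ns to about 2.6 ns, and the main-memory traffic from about 6.4 GB/s to about 1.28 GB/s. The same change helps latency and external bandwidth demand at once.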