This oft-cited paper has a lot of great raw data, but the ultimate conclusion, that ISA doesn't really influence power consumption, is completely pulled from nowhere. They took a bunch of x86 and ARM platforms with a ton of different variables and concluded that, since the results varied so widely, one particular variable didn't matter. That doesn't make any sense. It's very unscientific.
That isn't to say that ISA does make a large difference, either. That's a very complicated question that's extremely difficult to examine, and probably not something you can settle just by comparing real-world hardware; you just can't isolate the variables. The most qualified people to give insight into that question would probably be CPU architects who are highly familiar with both x86 and ARM. Over the years different architects have commented on this, with somewhat differing answers - of course, the actual answer is never going to be some fixed percentage difference in some metric, but will depend on a ton of other factors.
I'm not a CPU architect, so I can't give the kind of insight they could. All I can really do is look at things as an assembly programmer and look at aspects of existing CPU designs. But from my perspective, x86 has some major flaws in some areas.
x86 uses a variable-length byte encoding, vs ARM64's fixed 4-byte encoding. So x86's encoding is much more flexible and should enable much better code density. Yet in the studies I've seen, the two (generated with the same version of GCC) tend to have comparable code density. And from my experience, I would wager that in SIMD-heavy code x86 actually has worse code density.
Why would this be the case? It's because x86 has been developed in a very inefficient way, by gradually adding new functionality a bit at a time over many steps. This started with 32-bit mode: adding the new operand sizes and expanding the addressing with SIB bytes was suboptimal. Adding MMX and then SSE took more and more prefix bytes. Lots of instructions have redundant encodings or do the same thing as other instructions. VEX undertakes a major restructuring to try to account for this; it's practically a completely new instruction encoding, but it too pays for having to live alongside legacy SSE.
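To make that concrete, here's a minimal sketch of how the same packed 32-bit add has been re-encoded over the years. The byte sequences are hand-assembled from the Intel and ARM manuals, so double-check them before relying on them; the point is the prefix creep and the redundancy, not the exact values:

    #include <stdio.h>

    /* "Add packed 32-bit integers" as x86 has re-encoded it over the years. */
    static const unsigned char mmx[]  = { 0x0F, 0xFE, 0xC1 };       /* paddd mm0, mm1               */
    static const unsigned char sse2[] = { 0x66, 0x0F, 0xFE, 0xC1 }; /* paddd xmm0, xmm1 (66 prefix) */
    static const unsigned char vex[]  = { 0xC5, 0xF9, 0xFE, 0xC1 }; /* vpaddd xmm0, xmm0, xmm1 (VEX) */
    /* The SSE2 and VEX forms do the same 128-bit add; both encodings have to
       be decoded forever. AArch64's "add v0.4s, v0.4s, v1.4s" is 4 bytes, like
       every A64 instruction, with a non-destructive 3-operand form built in. */

    int main(void)
    {
        printf("MMX: %zu bytes, SSE2: %zu bytes, VEX: %zu bytes, A64: 4 bytes\n",
               sizeof mmx, sizeof sse2, sizeof vex);
        return 0;
    }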
So the key benefit of x86 ends up not actually being much of a benefit at all, and you're stuck paying for it. People often say this payment isn't actually anything, just a little extra space in the decoders. But consider the lengths that Sandy Bridge and onward go to to avoid this decoder cost. They have the uop cache, and while it's not known exactly how many bits a uop takes, we do roughly know what uops are capable of, so I would estimate that instructions take up at least 2-3x more space there than they do in the L1 icache. That's before taking into account the extra space wasted on redundant data (from overlapping lines) and the extra metadata needed over a normal cache (to maintain offsets into the cache lines). And to pump out 4+ uops a cycle, the uop cache needs a really wide interface, a lot more wires than reading from the L1 icache would need. All of this to avoid the decoders, which cost several pipeline stages.
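As a back-of-the-envelope version of that estimate: the uop cache geometry below is from Intel's optimization manual, but the per-uop size is a pure guess on my part, since Intel doesn't publish it:

    #include <stdio.h>

    /* Rough sizing of Sandy Bridge's uop cache vs the L1 icache. */
    int main(void)
    {
        const int    uops           = 32 * 8 * 6; /* 1536 uops: 32 sets x 8 ways x 6 uops */
        const double guessed_uop_b  = 10.0;       /* GUESS: ~8-12 bytes per uop, take 10  */
        const double avg_x86_insn_b = 4.0;        /* typical average x86 instruction size */

        printf("uop cache payload: ~%.1f KB for ~%d instructions' worth\n",
               uops * guessed_uop_b / 1024.0, uops);
        printf("space per instruction vs L1I: ~%.1fx\n",
               guessed_uop_b / avg_x86_insn_b);
        return 0;
    }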
And then, even with all of this work put into x86 instructions, even with the x86 instructions being relatively large for what they do... in a lot of ways the instruction set still sucks.
Over the last few weeks I've been doing x86 optimization for my Android app. For me this means targeting SSSE3 in 32-bit x86. This is the realistic baseline for x86 on Android: 64-bit usage is too low (and some x86 SoCs, eg Medfield and Clovertrail+, aren't 64-bit capable at all), and SSE4.x doesn't add that much anyway.
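In practice that baseline is just a compile-time switch; a minimal sketch using the standard GCC/Clang predefined macros (the SIMD_PATH name is mine, purely for illustration):

    /* __SSSE3__ is defined when building with -mssse3;
       __ARM_NEON__ when building for ARMv7 with NEON. */
    #if defined(__SSSE3__)
    #  include <tmmintrin.h>    /* SSSE3 and below */
    #  define SIMD_PATH "x86 SSSE3"
    #elif defined(__ARM_NEON__)
    #  include <arm_neon.h>     /* ARMv7 NEON */
    #  define SIMD_PATH "ARM NEON"
    #else
    #  define SIMD_PATH "scalar C fallback"
    #endif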
I didn't really appreciate this until I had to do it, but I can now list many disadvantages SSE4.2 has vs ARMv7 NEON (let alone ARMv8/AArch64), especially with integer SIMD. SSE has some advantages too, but they're far fewer.
Here's a comparison of the inner loop from two functions to demonstrate some of what I'm saying:
ARMv7 NEON:
http://pastebin.com/7g4Ad46N
x86 SSSE3:
http://pastebin.com/E3wwyTif
Actual performance will vary depending on uarch and all that... but the processor executing the second is going to have a really hard time doing it anywhere close to as efficiently as the processor executing the first. There's only so much uarch can hide; I really can't look at a gulf this big and say that ISA doesn't matter. Now, this is kind of a contrived example, I picked something that looked especially bad after all, and for some functions SSSE3 basically nails it. But it happens often enough to make the difference feel tangible to me.
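For one self-contained taste of the kind of gap I mean (a toy case of my own, not taken from the code above): NEON can shift each lane by its own per-lane amount in a single instruction, negative counts shifting right, while pre-AVX2 SSE has no per-lane variable shift at all. Function names are hypothetical; only the usual intrinsics headers are assumed:

    #include <arm_neon.h>

    /* NEON: per-lane variable shift is one instruction (VSHL.S32).
       Negative counts shift right, as a bonus. */
    int32x4_t shift_each_lane(int32x4_t v, int32x4_t counts)
    {
        return vshlq_s32(v, counts);
    }

And the x86 side:

    #include <stdint.h>
    #include <emmintrin.h>

    /* SSE2/SSSE3: nothing comparable until AVX2, so one reasonable fallback
       is to spill to memory and shift scalar-by-scalar. (In-register tricks
       exist, but they aren't pretty either.) Counts assumed in [0,31]. */
    __m128i shift_each_lane(__m128i v, __m128i counts)
    {
        int32_t vals[4], cnts[4];
        _mm_storeu_si128((__m128i *)vals, v);
        _mm_storeu_si128((__m128i *)cnts, counts);
        for (int i = 0; i < 4; i++)
            vals[i] <<= cnts[i];
        return _mm_loadu_si128((const __m128i *)vals);
    }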
AVX and AVX2 fix some of the disadvantages and add their own unique benefits. This is basically Intel's admission that ISA does matter; that's why they're addressing weaknesses in it. But AVX is not supported on Celeron- and Pentium-branded processors, let alone Atoms, and at this point I'm wondering when it ever will be. From my perspective, these extensions are only there on the processors that need them the least. They're sold as a luxury, not as a feature to make the CPU more competitive.
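Case in point: with AVX2, the per-lane shift from my example above collapses to a single intrinsic, on hardware that has it:

    #include <immintrin.h>

    /* AVX2 finally adds per-lane variable shifts (vpsllvd). */
    __m128i shift_each_lane_avx2(__m128i v, __m128i counts)
    {
        return _mm_sllv_epi32(v, counts);
    }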