Many of you have seen the article posted on EETimes; here's a link: http://www.eetimes.com/author.asp?section_id=36&itc=eetimes_sitedefault&doc_id=1318857 For those interested in the state of mobile benchmarketing, I recommend giving it a read - it's pretty enlightening.
UPDATE: BDTI's president has also commented on AnTuTu, covering a lot of similar material: http://www.bdti.com/InsideDSP/2013/07/11/JeffBierImpulseResponse
You may have also seen me rant on this topic before. For those interested, I figured I'd give some additional analysis behind some of the things I've said.
All of the analysis is taken from disassembling the NDK library files in the APK. This can be done by:
1) Unzip the APK - it's just a normal zip file
2) Go to the lib directory and look at the x86 and armeabi-v7a directories; these are used on x86 and ARM devices respectively
3) Disassemble the libabenchmark.so files inside. For this I used objdump, which you can easily get for both x86 and ARM. This gives assembly listings along with the names of things like functions and global variables, which the library doesn't strip.
First, it's important to understand just what AnTuTu is. I haven't looked for information on all the subtests, but I do know that the CPU-centric integer and floating point portions are using nbench. You can find the source code here: http://www.tux.org/~mayer/linux/bmark.html
I can tell it's nbench because it uses the same function and global variable names, and a cursory look at those functions shows they do the same things.
So what's the big reason for the x86 performance difference between AnTuTu 2.9.3 and 3.3? On the surface you can see that they started using ICC for the x86 compilation. This is obvious because the disassembly is littered with strings that have "intel" in the name - in fact, there's even one with icc in the name: ".text.__icc.get_pc_thunk.si"
ICC is well known for high quality vectorization. An examination of the ARM disassembly shows that vectorization wasn't even enabled. I can tell because there are no integer NEON instructions - a search for things like vadd.u32 or any other permutation of integer types turns up nothing; same for other basic operations like sub, or, and, etc, as well as load and store instructions. Floating point is a little harder to rule out from a simple search because VFP (scalar) instructions look similar to NEON ones, but what's ultimately telling is that I couldn't find any use of quad-word registers except in a few instructions that were clearly part of garbage regions (data, not real code).
AnTuTu using GCC to target vanilla ARMv7-A processors without NEON isn't that bizarre, since there is at least one such processor in the wild that lacks NEON support (Tegra 2). Nonetheless, the NDK doesn't make it that hard to include separate code paths compiled with and without NEON and pick between them at runtime. This is a standard development paradigm that Google documents (a rough sketch of it follows below). What I find really glaring is that they didn't do this, but they did compile the x86 part with ICC, which is totally non-standard and unsupported as far as the NDK is concerned.
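For reference, here's a minimal sketch of what that runtime-dispatch paradigm looks like using the NDK's cpufeatures library. To be clear, this is my own illustration and not anything from AnTuTu; process_neon and process_generic are hypothetical stand-ins for a NEON build and a plain ARMv7-A build of the same routine.
Code:
/* Minimal sketch of runtime NEON dispatch with the NDK cpufeatures library.
   Link against the cpufeatures static module; process_neon/process_generic
   are hypothetical stand-ins compiled with and without NEON enabled. */
#include <cpu-features.h>

extern void process_neon(unsigned long *bitmap, unsigned long nbits);
extern void process_generic(unsigned long *bitmap, unsigned long nbits);

void process(unsigned long *bitmap, unsigned long nbits)
{
    /* Only take the NEON path when the CPU actually reports NEON support
       (a Tegra 2, for example, would not); otherwise fall back. */
    if (android_getCpuFamily() == ANDROID_CPU_FAMILY_ARM &&
        (android_getCpuFeatures() & ANDROID_CPU_ARM_FEATURE_NEON) != 0)
        process_neon(bitmap, nbits);
    else
        process_generic(bitmap, nbits);
}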
UPDATE: jhu found that vectorizing on the current GCC series used with the NDK doesn't yield any benefit (http://forums.anandtech.com/showthread.php?t=2330288), although the jury is still out on how much fiddling with the compiler flags or using a newer version of GCC could have helped things. IMO, if you're going to use Intel's latest and greatest you should at least do the same with GCC.
But that's really just the tip of the iceberg. There's another advantage at play beyond picking the best compiler for the job. Here's an example:
One of the CPU tests in nbench checks how good the CPU is at performing simple bitwise operations - shifts, ands, ors, etc. To do this it sets, clears, or toggles a series of bits in memory, one bit at a time. One of the functions for this is ToggleBitRun, located in nbench1.c.
Here is the function:
Code:
static void ToggleBitRun(farulong *bitmap, /* Bitmap */
                         ulong bit_addr,   /* Address of bits to set */
                         ulong nbits,      /* # of bits to set/clr */
                         uint val)         /* 1 or 0 */
{
    unsigned long bindex;   /* Index into array */
    unsigned long bitnumb;  /* Bit number */

    while(nbits--)
    {
        bindex=bit_addr>>5;     /* Index is number /32 */
        bitnumb=bit_addr % 32;  /* bit number in word */
        if(val)
            bitmap[bindex]|=(1L<<bitnumb);
        else
            bitmap[bindex]&=~(1L<<bitnumb);
        bit_addr++;
    }
    return;
}
This is what the ARM code does, located at 0x46386:
Code:
46386: b5f0 push {r4, r5, r6, r7, lr}
46388: 2501 movs r5, #1
4638a: e00f b.n 463ac <benchmark_ent+0x1fc>
4638c: 094c lsrs r4, r1, #5
4638e: f001 061f and.w r6, r1, #31
46392: fa15 f606 lsls.w r6, r5, r6
46396: f850 7024 ldr.w r7, [r0, r4, lsl #2]
4639a: b10b cbz r3, 463a0 <benchmark_ent+0x1f0>
4639c: 433e orrs r6, r7
4639e: e001 b.n 463a4 <benchmark_ent+0x1f4>
463a0: ea27 0606 bic.w r6, r7, r6
463a4: 3101 adds r1, #1
463a6: 3a01 subs r2, #1
463a8: f840 6024 str.w r6, [r0, r4, lsl #2]
463ac: 2a00 cmp r2, #0
463ae: d1ed bne.n 4638c <benchmark_ent+0x1dc>
463b0: bdf0 pop {r4, r5, r6, r7, pc}
That's a pretty straightforward implementation.
Now here's what the x86 equivalent does (note that the function has been inlined; here's one instance):
Code:
f6416: b8 56 55 55 55 mov $0x55555556,%eax
f641b: 8b cb mov %ebx,%ecx
f641d: f7 eb imul %ebx
f641f: c1 f9 1f sar $0x1f,%ecx
f6422: 2b d1 sub %ecx,%edx
f6424: 8d 34 52 lea (%edx,%edx,2),%esi
f6427: 8b d3 mov %ebx,%edx
f6429: 2b d6 sub %esi,%edx
f642b: 0f 85 82 00 00 00 jne f64b3 <DoBitops+0x593>
...
f64b3: 83 fa 01 cmp $0x1,%edx
f64b6: 0f 85 8d 00 00 00 jne f6549 <DoBitops+0x629>
f64bc: 8b 54 24 10 mov 0x10(%esp),%edx
f64c0: 8d 0c 9a lea (%edx,%ebx,4),%ecx
f64c3: 8b 14 99 mov (%ecx,%ebx,4),%edx
f64c6: 8b 4c 99 04 mov 0x4(%ecx,%ebx,4),%ecx
f64ca: 49 dec %ecx
f64cb: 83 f9 ff cmp $0xffffffff,%ecx
f64ce: 0f 84 f3 00 00 00 je f65c7 <DoBitops+0x6a7>
...
f65c7: 43 inc %ebx
f65c8: 3b 5c 24 0c cmp 0xc(%esp),%ebx
f65cc: 0f 8c 44 fe ff ff jl f6416 <DoBitops+0x4f6>
What it's doing is, where possible, setting entire 32-bit runs to 0 or 1. The lines at f64c3 and f64c6 are critical: they replace 32 iterations of the ARM loop above with those two instructions. Needless to say, it's dozens of times faster doing it this way.
This is what we call breaking the benchmark: the compiler applies some transformation that makes the benchmark much faster by producing results the benchmark accepts as correct (if it even checks) while no longer performing the operations the benchmark is meant to measure. Classic examples include omitting code entirely if the results are never read, or performing a complex computation at compile time instead of run time if the inputs can be determined to be constant (and then just reporting the results).
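To make the transformation concrete, here's a rough C sketch of what the x86 code above amounts to. This is my own illustration of the idea, not decompiled ICC output - the function and variable names are mine.
Code:
/* Sketch of the word-at-a-time trick: instead of touching one bit per loop
   iteration like ToggleBitRun, any whole 32-bit words inside the run are
   written with a single store each. Assumes 32-bit longs, as nbench
   effectively does on these targets. */
static void SetBitRunByWords(unsigned long *bitmap,  /* Bitmap */
                             unsigned long bit_addr, /* First bit of the run */
                             unsigned long nbits,    /* Run length in bits */
                             unsigned int val)       /* 1 or 0 */
{
    /* Leading bits up to the next 32-bit boundary, one at a time. */
    while (nbits && (bit_addr % 32) != 0) {
        if (val) bitmap[bit_addr >> 5] |=  (1L << (bit_addr % 32));
        else     bitmap[bit_addr >> 5] &= ~(1L << (bit_addr % 32));
        bit_addr++; nbits--;
    }
    /* Whole words: one store replaces 32 iterations of the original loop. */
    while (nbits >= 32) {
        bitmap[bit_addr >> 5] = val ? ~0UL : 0UL;
        bit_addr += 32; nbits -= 32;
    }
    /* Trailing partial word, one bit at a time. */
    while (nbits--) {
        if (val) bitmap[bit_addr >> 5] |=  (1L << (bit_addr % 32));
        else     bitmap[bit_addr >> 5] &= ~(1L << (bit_addr % 32));
        bit_addr++;
    }
}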
In this case I'm sure Intel could claim that they're performing a legitimate optimization. Frankly, I doubt it: this kind of optimization would be difficult to recognize and apply in generic code, and it'd be of little benefit, because I've never seen anyone use code like this to set or clear huge runs of bits. That's kind of the catch, because this optimization would actually make the code slower if the run lengths weren't sufficiently large. In nbench's case they are, but there's no way the compiler could have known that on its own.
What's more, this optimization wasn't present in ICC until a recent release. Somehow I don't think they just now discovered it has general-purpose value. The more likely case is that they discovered they could use it to manipulate AnTuTu's scores. That coincides nicely with the third-party report that appeared showing how amazing Atom's perf/W is - using nothing but AnTuTu - or the leaked scores for CloverTrail+ and now BayTrail, which are also AnTuTu. Is this really a coincidence?
But frankly, I blame AnTuTu in all of this. They allowed themselves to be manipulated (probably for a price), despite constantly warning against other people cheating their numbers. I don't know if they're displaying a complete lack of integrity or a complete lack of understanding of how their own software works, or something in between, but whatever the case, I hope they lose all credibility and whatever revenue the program brings them.