AnTuTu and Intel

PPB · Jul 21, 2013

StrangerGuy said:
Funny because my Mediatek quad A7 phone feels really fast. If ~20% slower than a Galaxy S3 is slow that is so off the mark it's ridiculous.

I'm on an OMAP A9 single-core 800 MHz phone. The only thing that feels sluggish is WhatsApp scrolling. TBH I think anything more than a 2-core A15 would feel like overkill in a phone.

But hey, MOAR COARS always good, right?

monstercameron · Jul 21, 2013

PPB said:
I'm on an OMAP A9 single-core 800 MHz phone. The only thing that feels sluggish is WhatsApp scrolling. TBH I think anything more than a 2-core A15 would feel like overkill in a phone.

But hey, MOAR COARS always good, right?

you talking to the cpu guys or the gpu guys?

PPB · Jul 21, 2013

No, talking about those phones with 4 A15 + 4 A7 cores. I think even a 1+1 big.LITTLE configuration would be OK for most of our smartphone needs.

Arachnotronic · Jul 21, 2013

PPB said:
No, talking about those phones with 4 A15 + 4 A7 cores. I think even a 1+1 big.LITTLE configuration would be OK for most of our smartphone needs.

How about 4 power efficient Krait cores? Qualcomm has run circles around ARM's own designs with its Krait.

Exophase · Jul 21, 2013

PPB said:
I'm on an OMAP A9 single-core 800 MHz phone. The only thing that feels sluggish is WhatsApp scrolling. TBH I think anything more than a 2-core A15 would feel like overkill in a phone.

But hey, MOAR COARS always good, right?

Single core Cortex-A9 OMAP @ 800MHz doesn't exist. Did you mean Cortex-A8?

PPB · Jul 21, 2013

Exophase said:
Single core Cortex-A9 OMAP @ 800MHz doesn't exist. Did you mean Cortex-A8?

Thanks for the correction, it's the 800 MHz TI OMAP3610(800), which sports an A8 instead of an A9.

Seems I have settled for two years with even less performance than I thought :biggrin:

Nothingness · Jul 21, 2013

Intel17 said:
How about 4 power efficient Krait cores? Qualcomm has run circles around ARM's own designs with its Krait.

Yeah, especially in MSM82x. Funny how ARM haters will never understand and will always think extreme performance is paramount.

EDIT: or Intel fanboys, I don't know.

PPB · Jul 21, 2013

Yep, 2-core Krait variants seem to be the best compromise of performance and battery life. Seems my next smartphone will be an S400-based one.

krumme · Jul 21, 2013

jhu said:
Not quite. The A7 is really slow. About on par with an A8. What you might not notice is dual core A9. The performance difference between my Barnes & Noble Nook (Cortex A8) and Nook Tablet (dual-core Cortex A9) was night and day.

You might be right; we don't have anything slower than a dual-core A9 in the house, all running 4.1 and up. For me it's all the same on performance. But even so, it just shows that the importance of SoC power is overrated. CPU power is solid, and I still don't get what games people play to use those tons of GPU power.

beginner99 · Jul 22, 2013

Nothingness said:
I have another possible explanation: Razr I used an Intel designed platform both HW and SW and was hence highly tuned.

Link to this? AFAIK it's a Google-controlled phone since Motorola = Google. Of course Google is interested too in showing that Android runs great on x86.

Exophase · Jul 22, 2013

beginner99 said:
Link to this? AFAIK it's a Google-controlled phone since Motorola = Google. Of course Google is interested too in showing that Android runs great on x86.

When was the Google acquisition completed, May 2012? Although RAZR i wasn't released until September 2012 it was announced that Motorola was going to release a Medfield phone way back in January 2012. In fact it was said that Intel and Motorola entered a multi-year contract. Somehow I don't think Google had a huge influence on either the business or engineering end of this.

It'd be kind of weird if RAZR i was a big Intel design though, last I checked it seemed to share a lot in common with RAZR m. But most of the other Medfield phones were Intel's reference platform.

Idontcare · Jul 22, 2013

Nothingness said:
I have another possible explanation: Razr I used an Intel designed platform both HW and SW and was hence highly tuned.

Personally I'd rather buy a highly tuned product than one that just kind of got muddled along through a "the status quo is OK" RIM-style product pipeline.

One is the product of a hungry competitor, the other is relying on the assumption of having a captured audience that won't wander.

beginner99 · Jul 22, 2013

Exophase said:
When was the Google acquisition completed, May 2012? Although RAZR i wasn't released until September 2012 it was announced that Motorola was going to release a Medfield phone way back in January 2012. In fact it was said that Intel and Motorola entered a multi-year contract. Somehow I don't think Google had a huge influence on either the business or engineering end of this.

It'd be kind of weird if RAZR i was a big Intel design though, last I checked it seemed to share a lot in common with RAZR m. But most of the other Medfield phones were Intel's reference platform.

OK. Did not know that.

Yes, the RAZR M and RAZR i have the exact same body/chassis. Since I own it, I can say it feels very, very solid and well built.

mrob27 · Jul 22, 2013

jfpoole said:
John from Primate Labs here (the company behind Geekbench).
[...]

We only found out about this issue a couple of months ago. Given that Geekbench 3 will be out in August, and fixing the issue in Geekbench 2 would break the ability to compare Geekbench 2 scores, we made the call not to fix the issue in Geekbench 2.

If you've got any questions about this (or about anything Geekbench) please let me know and I'd be happy to answer them. [...]

Geekbench needs to measure multi-threaded memory bandwidth (in both the "Memory Performance" and "Stream Performance" sections). I have routinely ignored the overall Geekbench score and looked only at the first two sections (integer and floating-point) because of this problem.

Since Nehalem and Athlon 64, multi-socket workstations have had more memory bandwidth than what a single thread can utilize or measure. In more recent generations the same is true even for single-socket systems.

This matters because it is common for multiple threads to be active at the same time. In normal use scenarios (e.g. browsing during an anti-virus scan), there are multiple separate tasks which share no memory between them so there is no issue of data being transferred from one thread's L1/L2 cache to another.

To measure multi-threaded memory bandwidth, create two or more threads and have each one perform the same test as your existing single-threaded test. Each thread should use its own separate buffer(s), and steps should be taken to ensure that all tests are running during the entire period of time that the measurement is being taken (this might mean that some threads end up doing "more work" while they're waiting for the measurement interval to elapse).
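
A minimal sketch of the measurement described above, assuming POSIX threads and C11 atomics. It is not Geekbench's code: `measure_bandwidth`, `worker`, `BUF_BYTES` and `MAX_THREADS` are made-up names, and the 4 MiB buffer size is only a guess at "bigger than the caches". Each thread copies between its own private buffers until told to stop, and the byte counters are sampled only while every thread is already running.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define MAX_THREADS 16
#define BUF_BYTES ((size_t)4 * 1024 * 1024) /* assumed size, tune per system */

typedef struct {
    unsigned char *src, *dst;  /* private to this thread, never shared */
    _Atomic uint64_t copied;   /* bytes copied so far */
} worker_t;

static atomic_int stop_flag;

static void *worker(void *arg)
{
    worker_t *w = arg;
    /* Keep copying until told to stop, so no thread idles mid-measurement. */
    while (!atomic_load(&stop_flag)) {
        memcpy(w->dst, w->src, BUF_BYTES);
        atomic_fetch_add(&w->copied, (uint64_t)BUF_BYTES);
    }
    return NULL;
}

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

static void sleep_sec(double s)
{
    struct timespec ts;
    ts.tv_sec = (time_t)s;
    ts.tv_nsec = (long)((s - (double)ts.tv_sec) * 1e9);
    nanosleep(&ts, NULL);
}

static uint64_t total_copied(worker_t *w, int n)
{
    uint64_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += atomic_load(&w[i].copied);
    return sum;
}

/* Returns aggregate bytes copied per second, or -1.0 on error. */
double measure_bandwidth(int nthreads, double seconds)
{
    worker_t w[MAX_THREADS];
    pthread_t tid[MAX_THREADS];
    int started = 0, ok = 1;
    double result = -1.0;

    if (nthreads < 1 || nthreads > MAX_THREADS || seconds <= 0.0)
        return -1.0;

    for (int i = 0; i < nthreads; i++) {
        w[i].src = malloc(BUF_BYTES);
        w[i].dst = malloc(BUF_BYTES);
        atomic_init(&w[i].copied, 0);
        if (!w[i].src || !w[i].dst) {
            ok = 0;
        } else {
            memset(w[i].src, i + 1, BUF_BYTES); /* touch pages up front */
            memset(w[i].dst, 0, BUF_BYTES);
        }
    }
    atomic_store(&stop_flag, 0);
    for (int i = 0; ok && i < nthreads; i++) {
        if (pthread_create(&tid[i], NULL, worker, &w[i]) != 0)
            ok = 0;
        else
            started++;
    }
    if (ok) {
        sleep_sec(seconds * 0.1);           /* warm-up: all threads running */
        uint64_t b0 = total_copied(w, nthreads);
        double t0 = now_sec();
        sleep_sec(seconds);                 /* the measurement interval */
        uint64_t b1 = total_copied(w, nthreads);
        double t1 = now_sec();
        assert(t1 > t0);
        result = (double)(b1 - b0) / (t1 - t0);
    }
    atomic_store(&stop_flag, 1);
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    for (int i = 0; i < nthreads; i++) {
        free(w[i].src);
        free(w[i].dst);
    }
    return result;
}
```

Comparing `measure_bandwidth(1, 1.0)` with `measure_bandwidth(4, 1.0)` would show whether one thread can saturate the memory system. Note that this counts bytes copied, so the actual memory traffic (read plus write) is roughly double.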

Concillian · Jul 22, 2013

PPB said:
Yep, 2-core Krait variants seem to be the best compromise of performance and battery life. Seems my next smartphone will be an S400-based one.

I've come to this conclusion as well. I'd rather have battery life than a bunch more cores.

Nothingness · Jul 23, 2013

I had some free time so I decided to play with icc 13.

First a handy site: http://gcc.godbolt.org/. This allows you to enter code and see the resulting assembly after compilation with gcc, icc, clang and gcc-arm.

First version of the code:

Code:

void r(unsigned *bitmap)
{
  unsigned baddr = 0;
  unsigned nb = 32;

  while (nb--) {
    bitmap[baddr >> 5] |= 1 << (baddr & 0x1f);
    baddr++;
  }
}

icc -O3 -m64

Code:

L__routine_start__Z1rPj_0:
r(unsigned int*):
        xorl      %ecx, %ecx                                    #3.18
        movl      $31, %eax                                     #6.10
..B1.2:                         # Preds ..B1.2 ..B1.1
        movl      %ecx, %edx                                    #7.21
        movl      $1, %esi                                      #7.41
        shrl      $5, %edx                                      #7.21
        decl      %eax                                          #6.10
        shll      %cl, %esi                                     #7.41
        incl      %ecx                                          #8.5
        orl       %esi, (%rdi,%rdx,4)                           #7.5
        cmpl      $-1, %eax                                     #6.10
        jne       ..B1.2        # Prob 82%                      #6.10
        ret                                                     #10.1

icc -O3 -m32

Code:

L__routine_start__Z1rPj_0:
r(unsigned int*):
        pushl     %esi                                          #2.1
        pushl     %edi                                          #2.1
        pushl     %esi                                          #2.1
        lea       32, %ecx                                      #
        xorl      %edx, %edx                                    #
        movl      16(%esp), %eax                                #
        movl      %edx, %esi                                    #
        andl      $31, %edx                                     #
        shrl      $5, %esi                                      #
        lea       (%eax,%esi,4), %eax                           #
        movl      %ecx, %esi                                    #
        addl      %edx, %ecx                                    #
        cmpl      $32, %ecx                                     #
        jbe       ..L10         # Prob 50%                      #
        movl      %ecx, %esi                                    #
        movl      %edx, %ecx                                    #
        movl      $-1, %edi                                     #
        shll      %cl, %edi                                     #
        orl       %edi, (%eax)                                  #
        subl      $32, %esi                                     #
        addl      $4, %eax                                      #
        movl      $-1, %edi                                     #
        cmpl      $32, %esi                                     #
        jbe       ..L11         # Prob 50%                      #
..L12:                                                          #
        movl      %edi, (%eax)                                  #
        addl      $4, %eax                                      #
        subl      $32, %esi                                     #
        cmpl      $32, %esi                                     #
        ja        ..L12         # Prob 50%                      #
..L11:                                                          #
        movl      $32, %ecx                                     #
        subl      %esi, %ecx                                    #
        shrl      %cl, %edi                                     #
        orl       %edi, (%eax)                                  #
        jmp       ..L13         # Prob 100%                     #
..L10:                                                          #
        movl      $-1, %edi                                     #
        movl      $32, %ecx                                     #
        subl      %esi, %ecx                                    #
        shrl      %cl, %edi                                     #
        movl      %edx, %ecx                                    #
        shll      %cl, %edi                                     #
        orl       %edi, (%eax)                                  #
..L13:                                                          #
        popl      %ecx                                          #10.1
        popl      %edi                                          #10.1
        popl      %esi                                          #10.1
        ret                                                     #10.1

This is hilarious: the compiler is able to make use of the set all bits to 1 trick, but can't see that the loop count is constant!
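
For anyone not familiar with the trick, here is a rough C rendering of what the -m32 output above appears to do. This is my reading of the assembly, not anything icc emits or documents, and the function names are made up. Instead of one bit per iteration, it ORs a mask into the first partial word, stores all-ones into each full word, and ORs a mask into the last partial word.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reference: one bit per iteration, same shape as the loop above. */
static void set_bits_loop(uint32_t *bitmap, unsigned baddr, unsigned nb)
{
    while (nb--) {
        bitmap[baddr >> 5] |= 1u << (baddr & 0x1f);
        baddr++;
    }
}

/* Word-at-a-time version: head mask, full words, tail mask. */
static void set_bits_masked(uint32_t *bitmap, unsigned baddr, unsigned nb)
{
    assert(bitmap != NULL);
    if (nb == 0)
        return;

    uint32_t *p = bitmap + (baddr >> 5);
    unsigned off = baddr & 0x1f;

    if (off + nb <= 32) {
        /* The whole run fits inside one word. */
        *p |= (0xFFFFFFFFu >> (32 - nb)) << off;
        return;
    }
    *p++ |= 0xFFFFFFFFu << off;      /* head: bits off..31 */
    nb -= 32 - off;
    while (nb > 32) {                /* middle: whole words of ones */
        *p++ = 0xFFFFFFFFu;
        nb -= 32;
    }
    *p |= 0xFFFFFFFFu >> (32 - nb);  /* tail: the remaining 1..32 bits */
}

/* Self-check: both versions agree for every run inside four words. */
int masked_matches_loop(void)
{
    for (unsigned baddr = 0; baddr < 64; baddr++) {
        for (unsigned nb = 0; baddr + nb <= 128; nb++) {
            uint32_t a[4] = {0}, b[4] = {0};
            set_bits_loop(a, baddr, nb);
            set_bits_masked(b, baddr, nb);
            if (memcmp(a, b, sizeof a) != 0)
                return 0;
        }
    }
    return 1;
}
```

With baddr = 0 and nb = 32 known at compile time, all of this folds down to a single OR of 0xFFFFFFFF into bitmap[0], which is the constant-loop-count simplification the compiler misses.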

So I decided to change the function to the equivalent code using bytes instead of ints:

Code:

void r(unsigned char *bitmap)
{
  unsigned baddr = 0;
  unsigned nb = 32;

  while (nb--) {
    bitmap[baddr >> 3] |= 1 << (baddr & 0x7);
    baddr++;
  }
}

icc -O3 -m64

Code:

L__routine_start__Z1rPh_0:
r(unsigned char*):
        xorl      %edx, %edx                                    #3.18
        movl      $31, %eax                                     #6.10
..B1.2:                         # Preds ..B1.2 ..B1.1
        movl      %edx, %esi                                    #7.21
        movl      %edx, %ecx                                    #7.41
        shrl      $3, %esi                                      #7.21
        andl      $7, %ecx                                      #7.41
        movl      $1, %r8d                                      #7.41
        shll      %cl, %r8d                                     #7.41
        decl      %eax                                          #6.10
        incl      %edx                                          #8.5
        orb       %r8b, (%rsi,%rdi)                             #7.5
        cmpl      $-1, %eax                                     #6.10
        jne       ..B1.2        # Prob 82%                      #6.10
        ret                                                     #10.1

icc -O3 -m32

Code:

L__routine_start__Z1rPh_0:
r(unsigned char*):
        pushl     %esi                                          #2.1
        pushl     %edi                                          #2.1
        pushl     %ebx                                          #2.1
        xorl      %edx, %edx                                    #
        movl      16(%esp), %ecx                                #1.6
        movl      $31, %eax                                     #
        movl      %ecx, %esi                                    #
..B1.2:                         # Preds ..B1.2 ..B1.1
        movl      %edx, %edi                                    #7.21
        movl      %edx, %ecx                                    #7.41
        shrl      $3, %edi                                      #7.21
        andl      $7, %ecx                                      #7.41
        movl      $1, %ebx                                      #7.41
        decl      %eax                                          #6.10
        shll      %cl, %ebx                                     #7.41
        incl      %edx                                          #8.5
        orb       %bl, (%esi,%edi)                              #7.5
        cmpl      $-1, %eax                                     #6.10
        jne       ..B1.2        # Prob 82%                      #6.10
        popl      %ebx                                          #10.1
        popl      %edi                                          #10.1
        popl      %esi                                          #10.1
        ret                                                     #10.1

And now in 32-bit mode, the set all bits to 1 trick simply disappeared even though it is still applicable at the byte level (the same applies to 16-bit and 64-bit arrays).

One can also play with decreasing baddr (from 31) instead of increasing; again the optimization disappears.

One can also pass two arrays and do the or assignment to the two arrays; again the optimization disappears.
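
For reference, the two variants just described could look something like the sketch below. These are reconstructions for illustration (the names `r_down` and `r_two` are made up), not necessarily the exact code that was fed to icc. I have written `1u` instead of `1` so that shifting by 31 is well defined; that change alone might affect what a compiler recognizes. The self-check assumes a 32-bit `unsigned`.

```c
#include <assert.h>

/* Variant 1: baddr counts down from 31 instead of up from 0. */
void r_down(unsigned *bitmap)
{
    unsigned baddr = 31;
    unsigned nb = 32;

    while (nb--) {
        bitmap[baddr >> 5] |= 1u << (baddr & 0x1f);
        baddr--;  /* wraps after the last iteration, which is harmless */
    }
}

/* Variant 2: the same OR assignment applied to two arrays. */
void r_two(unsigned *bitmap1, unsigned *bitmap2)
{
    unsigned baddr = 0;
    unsigned nb = 32;

    while (nb--) {
        bitmap1[baddr >> 5] |= 1u << (baddr & 0x1f);
        bitmap2[baddr >> 5] |= 1u << (baddr & 0x1f);
        baddr++;
    }
}

/* Both variants still just set all 32 bits of the first word. */
void check_variants(void)
{
    unsigned a = 0, b = 0, c = 0;

    r_down(&a);
    r_two(&b, &c);
    assert(a == 0xFFFFFFFFu);
    assert(b == 0xFFFFFFFFu && c == 0xFFFFFFFFu);
}
```

Both produce exactly the same result as the original loop, so an optimizer that understood the original in general terms would be expected to handle these too.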

I know some will still deny, but all my doubts have vanished: icc is definitely cheating.

jfpoole · Jul 23, 2013

mrob27 said:
Geekbench needs to measure multi-threaded memory bandwidth (in both the "Memory Performance" and "Stream Performance" sections).

All of the memory workloads will be multi-threaded in v3. The biggest difference is on multi-socket systems, but we've even seen a boost on some Android and iOS handsets.

ashetos · Jul 23, 2013

Nothingness said:
I had some free time so I decided to play with icc 13.

First a handy site: http://gcc.godbolt.org/. This allows you to enter code and see the resulting assembly after compilation with gcc, icc, clang and gcc-arm.

First version of the code:

Code:

void r(unsigned *bitmap)
{
  unsigned baddr = 0;
  unsigned nb = 32;

  while (nb--) {
    bitmap[baddr >> 5] |= 1 << (baddr & 0x1f);
    baddr++;
  }
}

This is hilarious: the compiler is able to make use of the set all bits to 1 trick, but can't see that the loop count is constant!

So I decided to change the function to the equivalent code using bytes instead of ints:

Code:

void r(unsigned char *bitmap)
{
  unsigned baddr = 0;
  unsigned nb = 32;

  while (nb--) {
    bitmap[baddr >> 3] |= 1 << (baddr & 0x7);
    baddr++;
  }
}

And now in 32-bit mode, the set all bits to 1 trick simply disappeared even though it is still applicable at the byte level (the same applies to 16-bit and 64-bit arrays).

One can also play with decreasing baddr (from 31) instead of increasing; again the optimization disappears.

One can also pass two arrays and do the or assignment to the two arrays; again the optimization disappears.

I know some will still deny, but all my doubts have vanished: icc is definitely cheating.

In the first case we have the expression:
1 << (baddr & 0x1f)
which sets all 32 bits of the bitmap for baddr (31...0)

In the second case we have the expression:
1 << (baddr & 0x7)
which only works correctly for baddr (7...0).
For all baddr values greater than 7 the result is always 0x01 which means not all 32 bits of the bitmap are set to 1

These pieces of code are already for decreasing baddr. If we are to conclude intel is cheating I think you should post different variations of the code so that there is no doubt.

Nothingness · Jul 23, 2013

ashetos said:
In the first case we have the expression:
1 << (baddr & 0x1f)
which sets all 32 bits of the bitmap for baddr (31...0)

In the second case we have the expression:
1 << (baddr & 0x7)
which only works correctly for baddr (7...0).
For all baddr values greater than 7 the result is always 0x01 which means not all 32 bits of the bitmap are set to 1

These pieces of code are already for decreasing baddr. If we are to conclude intel is cheating I think you should post different variations of the code so that there is no doubt.

Oh I tried a for loop going up or down, and also 8 bits for the char case. In all of these cases the optimisation disappears. On my side I'm 99.9% confident Intel cheated.

EDIT: You got it wrong for the & 7 case

Schmide · Jul 23, 2013

ashetos said:
In the first case we have the expression:
1 << (baddr & 0x1f)
which sets all 32 bits of the bitmap for baddr (31...0)

In the second case we have the expression:
1 << (baddr & 0x7)
which only works correctly for baddr (7...0).
For all baddr values greater than 7 the result is always 0x01 which means not all 32 bits of the bitmap are set to 1

These pieces of code are already for decreasing baddr. If we are to conclude intel is cheating I think you should post different variations of the code so that there is no doubt.

Those are masks to get the bit address, not to set it. The low 3 bits of baddr (0x07 in binary is 00000111) are masked out, which gives the 0-7 bit position within the byte. A 1 is then left-shifted by that amount and ORed into the byte.

Think about it: 10 is greater than 7 and corresponds to byte 1 (10 >> 3) and bit 2 (10 & 7).

ashetos · Jul 23, 2013

Schmide said:
Those are masks to get the bit address, not to set it. The low 3 bits of baddr (0x07 in binary is 00000111) are masked out, which gives the 0-7 bit position within the byte. A 1 is then left-shifted by that amount and ORed into the byte.

Think about it: 10 is greater than 7 and corresponds to byte 1 (10 >> 3) and bit 2 (10 & 7).

They are masks, and for baddr (8...31) the mask is always 0 (baddr & 0x07) which always results in setting only the first bit for those bytes, which as I said is 0x01. This means, one byte is full of ones, the rest three bytes are only 0x01. Right?

dastral · Jul 23, 2013

Exophase said:
Single core Cortex-A9 OMAP @ 800MHz doesn't exist. Did you mean Cortex-A8?

Do we have a reliable website with ARM benchmarks?
I have a Galaxy Tab 7 (the first one) and I'm looking for a 6 to 7" phone/tablet.

I've seen a few from 200€ (Asus Fonepad) to 500€ (Samsung Galaxy Mega).
I just need something that can play twitch.tv streams at 480p (or more), something the first Tab 7 can't do anymore... (don't ask me why).

I assume it's mostly CPU performance, but I can't find a shop with open Wi-Fi that will let me test the twitch.tv APK before buying.

Schmide · Jul 23, 2013

ashetos said:
They are masks, and for baddr (8...31) the mask is always 0 (baddr & 0x07) which always results in setting only the first bit for those bytes, which as I said is 0x01. This means, one byte is full of ones, the rest three bytes are only 0x01. Right?

A walk through

The first 10 numbers

Code:

#      byte address              bit address             first 2 bytes in bits
          (# >> 3)                (# & 0x07)

0             0                         0                 0000 0000 0000 0001
1             0                         1                 0000 0000 0000 0011
2             0                         2                 0000 0000 0000 0111
3             0                         3                 0000 0000 0000 1111
4             0                         4                 0000 0000 0001 1111
5             0                         5                 0000 0000 0011 1111
6             0                         6                 0000 0000 0111 1111
7             0                         7                 0000 0000 1111 1111
8             1                         0                 0000 0001 1111 1111
9             1                         1                 0000 0011 1111 1111
10            1                         2                 0000 0111 1111 1111

Nothingness · Jul 23, 2013

baddr & 7 is the same as baddr % 8 (the remainder after dividing by 8), in case you know modular arithmetic.
This means for b in [0..7], b & 7 = b; for b in [8..15], b & 7 = b - 8; and so on.
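
A quick way to convince yourself, as a small sketch: the byte loop below is the same as the one posted earlier (renamed `r_bytes` here), and the two helper functions are made up for the check. It confirms that the mask equals the modulus for every baddr the loop visits, and that all four bytes end up as 0xFF rather than one 0xFF and three 0x01.

```c
#include <assert.h>

/* The byte version of the loop from earlier in the thread. */
void r_bytes(unsigned char *bitmap)
{
    unsigned baddr = 0;
    unsigned nb = 32;

    while (nb--) {
        bitmap[baddr >> 3] |= 1 << (baddr & 0x7);
        baddr++;
    }
}

/* Returns 1 if (b & 7) == (b % 8) for every b the loop visits. */
int mask_equals_modulus(void)
{
    for (unsigned b = 0; b < 32; b++) {
        if ((b & 7) != (b % 8))
            return 0;
    }
    return 1;
}

/* Runs the loop on a zeroed buffer and checks that every byte is 0xFF. */
void check_bytes(void)
{
    unsigned char buf[4] = {0, 0, 0, 0};

    r_bytes(buf);
    for (int i = 0; i < 4; i++)
        assert(buf[i] == 0xFF);
}
```

For example, baddr = 10 lands in byte 10 >> 3 = 1 at bit 10 & 7 = 2, so every byte gets each of its 8 bits set exactly once over the 32 iterations.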

ashetos · Jul 23, 2013

Thanks guys, got it. So then, this means intel really is cheating again, huh.
