Core M vs. A8X in Geekbench 3

Page 4 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

asendra

Member
Nov 4, 2012
156
12
81
Oh, I'm not saying it's not impressive. I'm blown away that these things are so close to my MacBook Air 2012 in performance.

Just that saying "imagine if they released it on 20nm" doesn't really mean anything at the end of the day, at least to me, and today. When and if they do it, we'll have another thread like this commenting on it.
AMD could have said the same thing for the better part of the last decade, and it still wouldn't have helped them.

Every design has its compromises. Apple's core, for example, has to scale from a 4.7" device to a 9.7" one, and the CPU cores are a relatively small part of the die, because they cram A LOT of things in there that don't necessarily help performance directly (besides the GPU). And yes, I know the A8X is different from the A8; by "scaled up" I meant that, technically, the core design is the same, only with one more core added.

I didn't know there were already smartphones announced with Tegra K1 or Denver, but we have yet to see how that would scale. It's not easy to scale a design down while keeping it competitive. If it were, Intel would have already crammed some design with good performance into smartphones and called it a day.
 
Last edited:

jfpoole

Member
Jul 11, 2013
43
0
66
Can I nitpick a bit?

Please do!

Note, though, that the GEMM kernel in the blog post is a simplified kernel designed to illustrate the compiler issue we encountered with Visual C++ and is different from the actual GEMM kernel that ships with Geekbench 3. The actual kernel may address issue #1 (don't thrash the LS) but still has issue #2 (don't have dependent mul/accumulate). I'm happy to send you the actual kernel via email if you're interested in taking a look. Just drop me a note at john@primatelabs.com.
 

Khato

Golden Member
Jul 15, 2001
1,225
280
136
That's our expectation as well. This approach won't help with bugs in the "backend" of the compiler (e.g., code generation) for architectures that aren't part of the automated system. Our goal for v4 is to have the automated system cover all of the architectures we ship so that we won't have to fall back on manual validation.
Yeah, that definitely makes sense. There's still the possibility of 'backend' optimizations or failings on the part of the compiler specific to an ISA, but at least the 'frontend' parsed code should be similar for each platform.

Extending the automated system certainly sounds intriguing. Would it be entirely automated from the start, or would it simply compare against a manually verified baseline?

For Geekbench 3 all of the cross-ISA checks we performed were manual. We examined the results for "similar" processors with different instruction sets. We examined the generated code for the smaller kernels. Even enabling and disabling compiler optimizations and observing how that affected different architectures (in particular vectorization) was useful in tracking down issues.

If you know of cases where architectures with different ISAs but similar execution resources produced unexpected results with Geekbench 3, please let me know, as we'd like to investigate further. While we won't be able to fix Geekbench 3, anything we find will help contribute to the v4 development process.

Again, sounds reasonable. Especially since I'm not aware of any other 'good' way to try and compare cross-ISA other than to look at the generated code... which becomes quite the daunting task with anything complex.

As for specific cases? Well, I would cite the A8X versus Broadwell (or A7 or Denver versus Haswell, given that those are the architectures we have a better picture of) comparisons and some of the peculiar variations... But how similar they are in execution resources depends entirely upon the specific code executed. For example, from what we know of A7+/Denver, they're designed without any resource contention between integer and FP workloads, while Intel's designs do have such contention. So for 'simple' code that mixes integer and FP it's quite plausible for A7+/Denver to come out ahead... whereas if the code makes use of Haswell's 256-bit SIMD capabilities, Haswell will come out ahead, especially since its load/store and uncore are actually built to support that raw compute throughput. (Yes, it's a completely non-helpful non-answer, I know.)
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
IIRC, Core M 5Y70 was demonstrated to have Sunspider scores close to 100ms and 3dmark Ice Storm Unlimited overall score close to 45,000, so it is a very very fast application processor in comparison (and hence why it is priced at a premium).

Hardly.

100ms is in IE. In Chrome it gets about 200ms. That suggests the 5Y10 gets 260-270ms.

Also the Yoga 3 Pro is way worse than that. Probably has to do a lot with the design but does not instill any confidence in the Core M, or 14nm process for that matter.

3.5W average seems low but spikes to 12W is likely higher than an iPad ever hits, hence the active cooling.
The "12W" isn't relevant on any meaningful benchmark. By that metric the 15W Haswell U chips have a "peak" of something like 40W.

If it's set at 3.5W, that means longer term it has to live within 3.5W.

Either way, it doesn't look good for Lenovo NOR Intel.
 

III-V

Senior member
Oct 12, 2014
678
1
41
Hardly.

100ms is in IE. In Chrome it gets about 200ms. That suggests the 5Y10 gets 260-270ms.

Also the Yoga 3 Pro is way worse than that. Probably has to do a lot with the design but does not instill any confidence in the Core M, or 14nm process for that matter.

The "12W" isn't relevant on any meaningful benchmark. By that metric the 15W Haswell U chips have a "peak" of something like 40W.

If it's set at 3.5W, that means longer term it has to live within 3.5W.

Either way, it doesn't look good for Lenovo NOR Intel.
I'm not at all worried about Core M's performance. The problem seems strictly limited to the Yoga 3, though we don't have anything else with Core M to compare it to. If the manufacturer set the desired TDP at 3.5W, and the chip is intended to run at a desired TDP of 4.5W... well, there's not really much more that needs to be said. It's a 30% difference in TDP, even if it's only a single watt.
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
ARMv8 has instructions for SHA-1 and SHA-2 (SHA-256). Geekbench 3 only uses the ARMv8 instructions for the SHA-1 test (SHA-2 is a straightforward-ish software implementation). IIRC ARMv8 SHA-1 is 4x faster than C++ SHA-1 on the A8.

Happy to post 32-bit results for the A7, A8, and A8X if folks would find that useful?

You are already using ARMv8 instructions in GB 3?

That's very nice. And cutting edge on ARM.

I therefore fully expect that x86 GB 3 has AVX2 extensions and the like and is specifically coded for the architecture.
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
Why so passive-aggressive?

Sorry if I'm coming off that way.

We already have confirmation that DGEMM is implemented poorly. So far I have heard nothing to refute the point that GB is being tuned extremely well for ARM. I have nothing on whether GB uses the much wider cores to their fullest. I've been given the multi-buffer SHA story (which is not true; fileserver-type operations rely a lot on it). Not to mention that other tests such as DGEMM work much better on a GPU and in the real world are deployed in ever-increasing amounts in that fashion (i.e. real-world vs. synthetic performance). Or my criticism that the integer tests are almost exclusively cryptography and compression/decompression.

If I hear otherwise, then perhaps GB is acceptable. I appreciate jfpoole's input and find it very illuminating, but I'm not totally convinced that this is a good cross-platform benchmark.
 

jdubs03

Senior member
Oct 1, 2013
377
0
76
I think with next year's A9X (if they follow the same SKU scheme), there's a shot it can perform near the level of Core M in single-core. Multi-threaded they're already there, with the help of that extra core. But the transition to FinFET, likely with a new uarch, will make for an interesting comparison against Skylake-Y, though I would expect Skylake to be quite a nice product.

Nvidia is still there though; CPU-wise they have the single-core lead on 32-bit over the A8X on 64-bit, which will only increase once Android Lollipop is supported, and I expect the other benchmarks to be pretty beastly (though I did see that the A8X does ~63,000 in AnTuTu compared to ~54,000 on Denver; extra-core advantage?). It's very impressive what Nvidia has done with their CPU on an n-1 node compared to Apple, and n-2.5 compared to Intel. GPU-wise it'll be interesting to see the comparison between the tweaked Kepler and the GX6650, but it already looks very close judging from the Shield Tablet.

Next year around this time will be far more interesting for me, though I wish the next-gen Nvidia part were on 16FF/+, as they'd be at n-1.5 to both Apple and Intel (assuming TSMC and Intel processes are equal, which they won't be). If Nvidia were on the same process as their competitors, they would be on top (as I was saying two months ago).
 

Qwertilot

Golden Member
Nov 28, 2013
1,604
257
126
There isn't going to be a single Core M score, because of the potential turbo headroom, throttling, etc. There probably wouldn't be a single A8X score either, of course, except it's only in one device.

One thing that NV have coming for certain next time is a ~doubling of the efficiency of their GPU bits from using Maxwell instead of Kepler.
 

UnmskUnderflow

Junior Member
Sep 24, 2014
7
0
0
Please do!

Note, though, that the GEMM kernel in the blog post is a simplified kernel designed to illustrate the compiler issue we encountered with Visual C++ and is different from the actual GEMM kernel that ships with Geekbench 3. The actual kernel may address issue #1 (don't thrash the LS) but still has issue #2 (don't have dependent mul/accumulate). I'm happy to send you the actual kernel via email if you're interested in taking a look. Just drop me a note at john@primatelabs.com.

I'm in hardware, so I'll get in touch with some trace guys who already have a channel to you through more official channels (preserving my anonymity here and not causing any conflicts of interest outside the job). I'll work with them on some suggestions.

Anyway, I just noticed the GEMM format in the blog post and realized why some chips might accelerate the bench in unexpected ways (ARM Cortex cores specifically). That's not detracting from what the Cortex cores did on their dependent chains... that FMA was a hell of an accomplishment, but it just makes the trace act more like Horner's rule than classic GEMM.

If you want the trace to show latencies of dependent FP ops, by all means leave it alone. If you want it to go more classic GEMM, which just measures how well the LS/caches/bandwidth and FP play together, then I'll feed back through my company.

Again, I appreciate GB's openness to feedback. SPEC really dropped the ball, and there's just a void of benchmarks in the ARM space, let alone ones to cross against x86. It's not perfect, but if you keep taking feedback, that's way more than the AnTuTus of the world.

Thx again, will be in touch indirectly.
 

jfpoole

Member
Jul 11, 2013
43
0
66
Extending the automated system certainly sounds intriguing, would such be entirely automated from the start or would it simply be comparing against a manually verified baseline?

Our plan is to automate as much as possible. During development the workloads change too much for a manually-verified baseline to remain relevant for more than a couple of days. Once the workloads are stable, though, a baseline will be useful.
 

jfpoole

Member
Jul 11, 2013
43
0
66
Geekbench's GEMM workloads weren't written to measure peak performance, they were written to measure relative performance. Take a look at the performance of Geekbench's GEMM implementation compared to Accelerate's GEMM implementation on both x86 and ARM:

Code:
Apple A8 @ 1.4 GHz

SGEMM Geekbench    1.60 Gflops
      Accelerate   5.25 Gflops
      
DGEMM Geekbench    0.77 Gflops
      Accelerate   2.55 Gflops

Intel Core i5-4258U @ 2.4 GHz

SGEMM Geekbench    7.38 Gflops (4.6x faster than A8)
      Accelerate  21.5  Gflops (4.1x faster than A8)

DGEMM Geekbench    3.99 Gflops (5.2x faster than A8)
      Accelerate   9.04 Gflops (3.5x faster than A8)

For SGEMM we see the ratio between architectures is roughly the same for both implementations, which suggests that Geekbench's SGEMM may be a good measure of relative performance.

For DGEMM we see the ratio between architectures is 50% higher for Geekbench than for Accelerate. Either Geekbench's DGEMM favors x86 or Accelerate's DGEMM favors ARM (or some combination of the two). While I wouldn't dismiss Geekbench's DGEMM as a bad measure of performance based on this one result, I would suggest that the assertion that Geekbench "is being tuned extremely well for ARM" is incorrect.
 

Khato

Golden Member
Jul 15, 2001
1,225
280
136
For DGEMM we see the ratio between architectures is 50% higher for Geekbench than for Accelerate. Either Geekbench's DGEMM favors x86 or Accelerate's DGEMM favors ARM (or some combination of the two). While I wouldn't dismiss Geekbench's DGEMM as a bad measure of performance based on this one result, I would suggest that the assertion that Geekbench "is being tuned extremely well for ARM" is incorrect.

Or they're both just measuring some intermediate capability of the processor which doesn't provide any indication of how the architecture would perform running an actual program, which would likely be using an optimized GEMM kernel. Because I know that for DGEMM Haswell can achieve ~85% of its theoretical 16 flops/cycle, but what about an A8? Is the 2.55 Gflops which Accelerate achieves as good as it gets, or is it also only demonstrating a fraction of its potential?

I'll agree that there's nothing wrong with measuring relative performance on non-optimized code, as that's what the majority of applications running on these architectures are. From that perspective Geekbench is perfectly valid, though with the trivial processing requirements of that majority of applications, it's questionable how much measuring that performance actually matters. It just frustrates me when reviewers, tech enthusiasts, and especially analysts take those numbers as absolute measures for comparing the capabilities of one architecture versus another.
 

UnmskUnderflow

Junior Member
Sep 24, 2014
7
0
0
I would suggest that the assertion that Geekbench "is being tuned extremely well for ARM" is incorrect.

As mentioned, I'd like to feed back formally through my perf team.

GEMM has been around for decades, and there's an absolute swarm of literature out there about it, CPUs and GPUs alike. I need not google/IEEE Xplore for you.

But to the point, how about an easy find from this year (2014):
"A portable and high-performance general matrix-multiply (GEMM) library for GPUs and single-chip CPU/GPU systems"

Tuning OpenCL implementations of important library functions such as dense general matrix multiply (GEMM) for a particular device is a difficult problem. Further, OpenCL kernels tuned for a particular architecture perform poorly on other architectures.

And this...look familiar?

A naive row-major NN matrix-multiply kernel is shown in Listing 1. A naive OpenCL implementation will assign computation of one element of C to one work-item. However, such an implementation will make poor use of the memory hierarchy of current compute devices. Thus, typically matrix multiplication is tiled...

int i, j, k;
for (i = 0; i < M; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < K; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
Listing 1: Row-major NN matrix-multiply

Every paper ever on the topic says to interleave and tile the computations. It makes the for loops look like hell, but if you want the chip to sing, that's what's necessary.


I CARE ABOUT YOUR KERNELS. I don't care whether we disagree on a nerdy forum like this one. I DO care about the fact that real decisions are being made on your results.

Geekbench has hit hardware like a Mack truck, Wall Street listens to what you guys are doing, and I'm happy for your success. But I ask that you now treat your kernels with some more responsibility and get real feedback from the scientific/engineering community. First-order kernels are easy and good enough, and you have that already. Truly detailed looks are very hard. SPEC used to do this for this reason, and nothing's been done since 2006.

So you guys are it. You are not obligated to change it. I simply ask that you take feedback, as I don't see your sway in benchmarks going away anytime soon.

As for cascading FMAs, they're out there, so don't be surprised if a chip suddenly arrives with some obnoxious GEMM score on this kind of kernel.

The chips you showed do NOT have cascading.
 

Space69

Member
Aug 12, 2014
39
0
66
I don't think there's a point in going real deep into a specific algorithm, since every benchmark in Geekbench is a suboptimal implementation as far as peak architecture performance goes.
We all know we should pack the matrices, skip intrinsics, and go asm (compilers somehow think they're better at ordering) with prefetching: voila, 91% of theoretical maximum without any magic.
I don't think Primate Labs has any intention of going that route with any of the benchmarks. Unfortunate, but understandable, since it's extremely time consuming. I do agree that Geekbench could lead to wrong conclusions, but the only solution is probably a new modular benchmark tool with participation from hardware vendors.
 
Last edited:

jfpoole

Member
Jul 11, 2013
43
0
66
As mentioned, I'd like to feed back formally through my perf team.

Sounds good.

We won't be able to incorporate any feedback into Geekbench 3 (we want to keep v3 scores as comparable as possible) but we'll certainly consider incorporating your feedback for v4.
 
Mar 10, 2006
11,715
2,012
126
Looks like the A8 in iPhone 6 is more or less Silvermont class in 32-bit mode, but gains substantially in AArch64 mode.

The Denver looks very impressive assuming there's minimal difference between "ARMv7" and "AArch32" (to my knowledge these are identical, but I'm not 100% sure).
 

jfpoole

Member
Jul 11, 2013
43
0
66
Looks like the A8 in iPhone 6 is more or less Silvermont class in 32-bit mode, but gains substantially in AArch64 mode.

The Denver looks very impressive assuming there's minimal difference between "ARMv7" and "AArch32" (to my knowledge these are identical, but I'm not 100% sure).

AArch32 is ARMv7+Crypto (plus a handful of new NEON instructions).
 
Mar 10, 2006
11,715
2,012
126
AArch32 is ARMv7+Crypto (plus a handful of new NEON instructions).

Gotcha. In which subtests (other than crypto, obviously) would AArch32 show a significant improvement over ARMv7? Would it be the same floating point tests that show a benefit in AArch64?

Thank you for taking the time to answer our questions.
 