You need to understand that:
a) A benchmark does not claim to use the optimal algorithm for a given problem. It simply does not need to.
b) An optimal implementation of DGEMM on an i7-4770 would use AVX, most likely via Intel's Math Kernel Library (MKL). Geekbench deliberately chose not to use NEON/SSE/AVX extensions.
As you might be able to see, your argument is void.
a) No it doesn't. But it should give a "realistic" representation. That means it should be somewhere near a decent implementation. If it were at 120 GFLOPS, which is still low, I wouldn't mind, but 20 GFLOPS is just pure shit.
This is not a benchmark that measures the capabilities of a CPU, but a benchmark that measures how poor the programming skills of the Primate Labs employees are.
b) They use AES instructions in their AES benchmark and SHA instructions in their SHA benchmark. They use special instructions where those favour ARM. In the benchmarks where they would favour x86, because of its wider SIMD and FMA units, they don't use them.
They also don't include benchmarks that would favour x86, like a high-quality random number generator, where x86 has a dedicated instruction (RDRAND).
You could argue that apps written by average programmers wouldn't contain many SIMD instructions at all, and that would be true. But the algorithms Geekbench uses, like bzip2, JPEG and PNG compression and decompression, Sobel, Dijkstra, Mandelbrot, sharpen filter, blur filter, SGEMM, DGEMM, SFFT, DFFT, N-body and raytracing, are all algorithms for which high-quality implementations exist. These are also algorithms normal programmers don't implement themselves; instead they use libraries, which do support AVX & co. Many of these algorithms are not even used on mobile devices, but only on desktop, server and HPC machines, e.g. raytracing.
c) The memory performance section is the worst of all. Every benchmark in this section does the same thing: measure maximum memory bandwidth. It doesn't matter whether you add these numbers or scale them. The ALU is simply not the bottleneck, and you don't have to be a genius to see that.
There are dozens of reviews out there that tested the performance impact of DDR4 versus DDR3 and of quad-channel versus dual-channel memory. The result is always the same: increasing memory bandwidth by a large amount (high double-digit percentages) increases the performance of real-world applications only by low single-digit percentages.
In other words, maximum memory bandwidth is completely irrelevant for real-world applications (except for games, but we are testing CPU performance here).
But it still contributes heavily to the total score, again in favour of small mobile chips, where the gap between ALU throughput and memory bandwidth is not as large as in bigger CPUs.