Qualcomm moves Cortex A72 to the mid-range

witeken · Feb 19, 2015

NTMBK said:
Hah, right. For a more realistic analysis of Broadwell-U/Y, take a look at TechReport's review of the Broadwell NUC: http://techreport.com/review/27798/intel-broadwell-powered-nuc-mini-pc-reviewed It gets compared with the Haswell NUC, which has the same form factor and power budget for the APU. It's the fairest comparison we can get, with no massive 13" display sucking down power and throwing power consumption comparisons way out. The result? Broadwell-U is slightly faster than Haswell-U , with ~10-15% power reduction. Not shabby, but none of this 3X madness.

I was talking about BDW-Y vs. HSW-U i3. If you're talking about i5, i7, it does not compete. No surprise. It also depends on the benchmark: Photoshop and Lightroom are close, but in Handbrake, Haswell i3 wins quite substantially against 5Y10, but 5Y70 is close. So Core m does compete against i3, certainly for graphics, so never mind comparisons against 15W Celerons. Core m is a great achievement of Intel, the 14nm skepticism is overblown.

The heat production is also good with Prime95 and 24° (average) / 38° (max).

NTMBK · Feb 19, 2015

witeken said:
I was talking about BDW-Y vs. HSW-U i3. If you're talking about i5, i7, it does not compete. No surprise. It also depends on the benchmark: Photoshop and Lightroom are close, but in Handbrake, Haswell i3 wins quite substantially against 5Y10, but 5Y70 is close. So Core m does compete against i3, certainly for graphics, so never mind comparisons against 15W Celerons. Core m is a great achievement of Intel, the 14nm skepticism is overblown.

What you're seeing is not Haswell->Broadwell improvement, you're seeing just how crippled the i3-U is. Turbo boost is massively important to a mobile device, and any mobile part which ships without it is basically dead to me. (I'm looking at you, Temash.) Haswell-U i3 is 1.7-2GHz with no turbo, while the 5Y70 can turbo up to 2.6GHz (and has an extra 1MB of cache)- and in a CPU-only benchmark with no GPU load, that CPU turbo will be kicking in nicely.

I do agree that Core M is a nice looking chip, but I wouldn't overstate things too much.

oobydoobydoo · Feb 19, 2015

NTMBK said:
Hah, right. For a more realistic analysis of Broadwell-U/Y, take a look at TechReport's review of the Broadwell NUC: http://techreport.com/review/27798/intel-broadwell-powered-nuc-mini-pc-reviewed It gets compared with the Haswell NUC, which has the same form factor and power budget for the APU. It's the fairest comparison we can get, with no massive 13" display sucking down power and throwing power consumption comparisons way out. The result? Broadwell-U is slightly faster than Haswell-U , with ~10-15% power reduction. Not shabby, but none of this 3X madness.

Thank you for posting this! I hadn't read it yet, and I've been looking to find a fair comparison to see power consumption delta. I will edit this post after I read it.

Edit: After reading this article... I'm not sure what to think. I was very worried that intel was hiding poor load power consumption (and that Broadwell would be worse in this respect to Haswell) but that doesn't appear to be the case. What does appear to be the case is that the idle power consumption hasn't changed, while load power consumption has been reduced by ~15%. That is in addition to a roughly 6-10% increase in performance.

So Broadwell is an improvement basically on the level of haswell for performance, but not nearly as good as haswell for power consumption. Is that poor for a node shrink? It definitely looks poor in comparison with what Qualcomm, Samsung, and Apple appear to be achieving with a node shrink.

Not very impressive.

imported_ats · Feb 19, 2015

lopri said:
But big.LITTLE was conceived not just for phones, at least from ARM's perspective. In MT bench it shows its strength. The latest implementation (Exynos 7420) is fast in ST performance and its MT performance is often 5 times the ST performance or more, per Geekbench subtests.

MT geekbench is beyond useless. Basically, the only time you should pay any attention at all to MT geekbench is when there is no or negative speedup!

And BL was pretty much conceived just for phones.

I have my misgivings on these 8 core little.LITTLE configurations, though. Also I am not sure if 2 + 4 configuration is really all that. I know 2 big + 4 LITTLE has a certain intuitive appeal, but we are talking about general purpose cores that are supposed to run the same instructions. If there is a significant overhead already in 4+4, I wonder if the overhead will be even bigger in an asymmetrical design like 2+4? (stalls, misses, waking up wrong cores, etc.)

It largely doesn't matter cause all the thing you worry about are all things that it already does on symmetrical configurations. The original version required symmetrical configurations because it was even more brain dead and basically completely replicated the state of one cluster to the other.

imported_ats · Feb 19, 2015

lopri said:
I too would like to see more 2+2 designs. (even 1+1 design) I guess A7 and A53 are so small and cheap that OEMs do not feel the need for more optimization there. According to AT's latest investigation, the LITTLE cores are scary small. (0.40mm² for A7, 0.70mm² for A53, on 20nm)

It doesn't matter how small they are if they are not needed. You could literally replace 2 of the 4 with cache and get better performance in most workloads!

MisterLilBig said:
Less cores sure didn't help the Nexus 9!

People need to stop generalizing, Apple made a great SoC, and expect it to keep growing. Jeez, its not like they will stay at 2-3 cores forever.

But what other SoC that has less than 4 cores is great? None!

Any Nexus 9 issues don't have anything to do with the number of cores but what those cores are.

And any smart consumer is going to want to stay at 2-3 cores forever unless there is a workload reason to change. And clue drop, unless things radically change with iOS or Android and there is real useful multitasking going on that can use those additional cores, there will never be a reason for more!

You do realize there is a reason that PCs have stayed at basically 2-4 cores for a decade, right? And that PCs generally have a whole lot more multi-threading and multi-tasking than any phone!

AKA, you are buying into pure marketing BS and not engineering. Take any 4+4 or 8 core phone, cut the cores in half, replace with cache, get better performance!

imported_ats · Feb 19, 2015

Nothingness said:
Clock gating is used but as far as I know no CPU limits resources the way you describe (width reduction, queue restrictions, etc.) dynamically to reduce power consumption. Power gating is very expensive and is done only on larger blocks (FPU for instance).

I might be wrong but I'll need a serious reference to be convinced

There are several CPUs that have anti-speculation for power savings and more are coming.

krumme · Feb 19, 2015

imported_ats said:
Why? The phone can't physically actually run all 4 a57 cores without burning up. And the Cyclone core gives you the performance you actually need with is ST performance.

The area is basically immaterial even in a phone SOC. The area is generally vastly dominated by non-gpu/non-cpu logic. Power is what is important, and at any given power level, Cyclone delivers vastly superior performance not the least of which is that cyclone can actually run continuously unlike a57 as aptly demonstrated by this very site.

Its not a shame. Its reality. Its like people saying just give VLIW more time for the compilers to eventually catch up. Except, its been decades and the compilers still haven't caught up. bl relies on something that basically requires precognition.

The fundamental problem with bl is that the costs of switching processes between contexts is too high and always will be too high from both a power and performance perspective. The only way that bl works is if you integrate the bl into a single core. Basically you have your high performance core design that has advanced clock and power gating and de-spectulation abilities.

So what you actually do is design a 3-4 wide core that can reduce fetch and fetch related speculation down to 2 wide when required. You design your OoO queues such that they have be effectively reduced to cover pipeline delays only. You design your various pipelines such that pipelines beyond the bare minimum can be clock and power gates. And what you end up with is in reality what already exists in advanced power efficient cores like Cyclone, Core ix, etc.

bl therefore is basically a dead end. it is a nice intellectual idea that ignores all the realities.

Caling the size of Cyclone immaterial is only right viewed from the likes of Apple with premium products. The argument about immaterial size could just as well be applied to the rest of the SOC different parts. That argument goes nowhere in any industrial production.

But 0.7mm2 is immaterial for all, so who cares if its marketing - as it is ofcource - look at Tegra 3 success.

Besides the entire idea that eg. it should be replaces by cache giving better performance is nonsense, as dev cost would far outweight the production cost.

And that also touches the idea of bl. The idea of moving cost from producting (damn expensive proces competences) to the arch is brilliant as the effect would last years; in order benefits is here to stay. Secondly it fit minor producers that dont have the ability to tune process the same way the big guys have.

There is a reason ARM keeps it for A72, so lets wait and see. There is some slight progress for each gen, but even if it takes 5 years to get the software right it will be worth it. But ofcource its a risk.

lopri · Feb 19, 2015

NTMBK said:
Asymmetrical core counts implies that they are using global task scheduling, which should work much better than the old cluster migration. Put main app thread on an A72, put the garbage collector on an A53, get a responsive app without blowing up the power budget.

My question already took GTS into account. My question was whether the GTS overhead, which is quite high already in a symmetrical design (4+4), negates whatever theoretical gain on power consumption by an asymmetrical design (in this case 2+4).

"Putting main app on A72, put the garbage collector on A53" kind of scenario is what I alluded to be "intuitive in theory," but we have to remember that the LITTLE cores are not specialized co-processors, but full-blown, independent cores that are capable of everything that big cores are. In practice, all workloads are treated equal until they reach certain thresholds, and I doubt such clean divides are possible (e.g. big cores -> big apps, small cores -> small apps). GTS overhead that was investigated by AT last week shows just that.

Imagine a situation where 4 active threads occupying 4 little cores. What happens when 3 of them need higher performance starting the next cycle? Which of the 3 threads are "discriminated" in a 2+4 configuration? The CPU will suddenly have to make decisions like this which was not present in 4+4 configuration. That was the gist of my question and I have not seen anyone's research on this yet. I hope Andre will get to explore this question some day.

@imported_ats: You seem to prefer "expressing" yourself to explaining your reasoning or providing evidence. Sorry to say but I am not persuaded.

Andrei. · Feb 19, 2015

lopri said:
What happens when 3 of them need higher performance starting the next cycle? Which of the 3 threads are "discriminated" in a 2+4 configuration? The CPU will suddenly have to make decisions like this which was not present in 4+4 configuration. That was the gist of my question and I have not seen anyone's research on this yet. I hope Andre will get to explore this question some day.

Well I can answer that already now. All three threads will move onto the 2 big cores. As long as free "capacity" which is defined by IPC and frequency and actual load on the big cores and the the maximum capacity of the little cores is exceeded, those will be used. Those 3 threads then get balanced around on the big cluster. This problem is no different than in aSMP systems like Kraits where different CPUs run at different speeds. What does the scheduler do when there's 6+ threads? Same thing.

lopri · Feb 19, 2015

Thank you for the answer, Andrei. Given your explanation, do you think 2+4 configuration has the "intuitive" advantage over the 4+4 configuration that many of us think of on mobile, at least for now, until much higher workloads are processed on mobile in the future?

I suppose the answer highly depends on workloads, power consumption on each frequency, how the scheduler works, how well the scheduler works,.... all that. But I assume Qualcomm et al. must have considered all those and then some prior to settle on each SKU.

P.S. What is your view on little.LITTLE configuration, such as (4xA7)+(4xA7) or (4xA53)+(4xA53)? Do you plan to explore one of those in the future?

NTMBK · Feb 19, 2015

lopri said:
My question already took GTS into account. My question was whether the GTS overhead, which is quite high already in a symmetrical design (4+4), negates whatever theoretical gain on power consumption by an asymmetrical design (in this case 2+4).

"Putting main app on A72, put the garbage collector on A53" kind of scenario is what I alluded to be "intuitive in theory," but we have to remember that the LITTLE cores are not specialized co-processors, but full-blown, independent cores that are capable of everything that big cores are. In practice, all workloads are treated equal until they reach certain thresholds, and I doubt such clean divides are possible (e.g. big cores -> big apps, small cores -> small apps). GTS overhead that was investigated by AT last week shows just that.

Imagine a situation where 4 active threads occupying 4 little cores. What happens when 3 of them need higher performance starting the next cycle? Which of the 3 threads are "discriminated" in a 2+4 configuration? The CPU will suddenly have to make decisions like this which was not present in 4+4 configuration. That was the gist of my question and I have not seen anyone's research on this yet. I hope Andre will get to explore this question some day.

@imported_ats: You seem to prefer "expressing" yourself to explaining your reasoning or providing evidence. Sorry to say but I am not persuaded.

Oh I certainly haven't done any benchmarking or measurement into this- definitely agree that it would be very interesting to see if someone could measure this well. It's the sort of thing that traditional benchmarks completely fail to expose, but makes a big difference to overall system performance- and I suspect will vary significantly between different vendors' implementations, even if they have the "same" licensed ARM cores.

Bit of interesting reading- Qualcomm has been working on a multithreaded runtime which is meant to make this problem a little easier by adding "hints" into code: http://semiaccurate.com/2014/12/04/qualcomm-shows-off-mare-parallel-api-runtime/ (i.e. tagging a thread as either high or low intensity, so the scheduler can predict its behaviour better.) Obviously it will need code changes to work, so it's not going to improve your average application from the app store, but I wouldn't be surprised if Qualcomm hasn't patched it into chunks of the OS and key apps like Chrome for their device images.

imported_ats · Feb 19, 2015

krumme said:
Caling the size of Cyclone immaterial is only right viewed from the likes of Apple with premium products. The argument about immaterial size could just as well be applied to the rest of the SOC different parts. That argument goes nowhere in any industrial production.

The size of the SoC designs is largely immaterial to the sizes of the cores. In fact, the cores tend to make up on the order of 10-15% of the SoC sizes across a wide range of vendors.

But 0.7mm2 is immaterial for all, so who cares if its marketing - as it is ofcource - look at Tegra 3 success.

Because the marketing and marketing trend is actually detrimental to actual delivered end user performance.

Besides the entire idea that eg. it should be replaces by cache giving better performance is nonsense, as dev cost would far outweight the production cost.

No dev cost wouldn't. Most of the designs aren't using the maximum cache sizes.

And that also touches the idea of bl. The idea of moving cost from producting (damn expensive proces competences) to the arch is brilliant as the effect would last years; in order benefits is here to stay. Secondly it fit minor producers that dont have the ability to tune process the same way the big guys have.

There is a reason ARM keeps it for A72, so lets wait and see. There is some slight progress for each gen, but even if it takes 5 years to get the software right it will be worth it. But ofcource its a risk.

BL does jack all to shift costs. Cause it really is like VLIW or anything else where you hope software will figure it out. If you are relying on software figuring it out, you are relying on failure. This isn't an opinion. Its fact born out again and again and again. Every single time hardware designers have done one of these "let software figure it out" things, it has failed.

BL is something that THEORETICALLY CANNOT BE DONE RIGHT! It requires precognition!

imported_ats · Feb 19, 2015

lopri said:
@imported_ats: You seem to prefer "expressing" yourself to explaining your reasoning or providing evidence. Sorry to say but I am not persuaded.

Cause this has been gone over ad infinitum. MT geekbench is useless because it has no correlation to any actual workload every run on any phone/tablet SoC. Its meaningless. Its like telling someone 0-60 numbers when they want to know how much it can tow. MT geekbench is largely worthless for the same reasons 4+ core phone/tablet SoCs are worthless, because its virtually impossible to actually use the cores. So you are just wasting area and power for no benefit except marketing, marketing which is actually delivering worse end user performance.

And BL is largely useless because the only way to get rid of the problems with BL is to either have precognition of the workloads or integrate the big and little core together so that the cache/state issues are not longer issues which means you don't actually have bl but a dynamically power optimized OoE core.

These are all things that have been covered on this forum and many other places. Its not "expressing" myself, its giving the answer because all the reasoning and evidence is already widely known and documented.

simboss · Feb 19, 2015

imported_ats said:
BL is something that THEORETICALLY CANNOT BE DONE RIGHT! It require precognition!

Any form of power management would work much better if you knew what the workload will be in the future

The only difference is that bL requires a lot of SW to work, whereas more traditional techniques are mostly HW based.
So its reaction time is longer, which means that some use cases are going to be very bad.
Automated web browsing tests are probably a worst case scenario where you will have to switch back and forth for each new page, much more often than a real user would do in most cases.
Same thing for benchmarks with a mixture of CPU and GPU heavy stuff. The overhead will appear much bigger than it will be for a normal user that is not constantly changing what he is doing.

The little cores are very well suited to handle background tasks which afaik, no one is really testing when doing battery life tests.
So in the end the benchmark/number fanatics (which most people on these forums are ) will always see the not very convincing hard data, while standard user will enjoy their improved endurance without even knowing what bL is.

Andrei. · Feb 19, 2015

lopri said:
Thank you for the answer, Andrei. Given your explanation, do you think 2+4 configuration has the "intuitive" advantage over the 4+4 configuration that many of us think of on mobile, at least for now, until much higher workloads are processed on mobile in the future?

I suppose the answer highly depends on workloads, power consumption on each frequency, how the scheduler works, how well the scheduler works,.... all that. But I assume Qualcomm et al. must have considered all those and then some prior to settle on each SKU.

P.S. What is your view on little.LITTLE configuration, such as (4xA7)+(4xA7) or (4xA53)+(4xA53)? Do you plan to explore one of those in the future?

2+4 has the advantage of having to deal less with overkill scheduling on the big cores but otherwise there is no real advantage. Spreading threads out to more cores whenever possible is still the best power efficient method of operation, as long as it's actually possible to do that.

I look more or less in such a scenario in the MX4Pro review: http://www.anandtech.com/show/8800/the-meizu-mx4-pro-review/6

You see the 1+4 use-case is the best perf/W. But the problem on the 4+4 scenario is that it is trying to perform too well and going past the peak in the perf/W curve and wasting power for little performance benefit. In a world where the scheduler would be more aware of such large time-scales (several seconds) instead of just only 32 milliseconds then a 4+4 solution would always be better.

I don't think we have any incoming device with little.little configs. I guess it's a stopgap solution if the faster cluster is actually much higher clocked than the low power cluster. To be honest even then the power differences is so small that I don't see the point other than it having 8 cores. I have no real opinion on these SoCs.

Nothingness · Feb 19, 2015

imported_ats said:
There are several CPUs that have anti-speculation for power savings and more are coming.

Sorry, but without any reference I can't just take your words as true. None of the designs I have worked on do that for reasons applicable to any CPU: on top of added complexity and potentially increased critical paths, it's more efficient to simply downclock and lower voltage.

Again some designs might be doing it, but I want more than a statement with no ref.

MisterLilBig · Feb 20, 2015

imported_ats said:
Any Nexus 9 issues don't have anything to do with the number of cores but what those cores are.

A buggy implementation of a core.

And any smart consumer is going to want to stay at 2-3 cores forever unless there is a workload reason to change. And clue drop, unless things radically change with iOS or Android and there is real useful multitasking going on that can use those additional cores, there will never be a reason for more! You do realize there is a reason that PCs have stayed at basically 2-4 cores for a decade, right? And that PCs generally have a whole lot more multi-threading and multi-tasking than any phone!

Considering that most companies will go for a mass market product, why build the software when the hardware hasn't been there. And advising people to never move away from 2-3 cores? Damn, Apple better stuck themselves with 3 cores forever!

AKA, you are buying into pure marketing BS and not engineering.

Software needs hardware.

imported_ats said:
If you are relying on software figuring it out, you are relying on failure.

Denver.

cut the cores in half, replace with cache, get better performance!

Like the L4 made incredible difference for Intel?

witeken · Feb 20, 2015

MisterLilBig said:
Like the L4 made incredible difference for Intel?

30% for games.

MisterLilBig · Feb 20, 2015

witeken said:
30% for games.

Where is the proof of that?

witeken · Feb 20, 2015

MisterLilBig said:
Where is the proof of that?

intel crystalwell filetypedf

Page 13.

Nothingness · Feb 20, 2015

witeken said:
intel crystalwell filetypedf

Page 13.

I guess there's a missing link. But if it's Intel marketing, don't bother to fix it :biggrin:

krumme · Feb 20, 2015

Nothingness said:
I guess there's a missing link. But if it's Intel marketing, don't bother to fix it :biggrin:

Lol. We missed some blue ppt !!! Arggg
But well not for the cost of 2 times 0.7mm2 as two a53 anyway where this cache nonsense started. You dont get much l4 for that. Even with intels stellar sram density.

MisterLilBig · Feb 20, 2015

Nothingness said:
But if it's Intel marketing, don't bother to fix it

The fact that he tried to provide proof that way...lol!

witeken · Feb 20, 2015

Nothingness said:
I guess there's a missing link. But if it's Intel marketing, don't bother to fix it :biggrin:

Are people still having issues to discern the difference between marketing slides and presentations given for developers (hence: IDF).

You were actually supposed to copy and paste that line into Google, but I guess you didn't bother. You can also just take my word for it that there are a bunch of benchmarks showing a 1.2 to 1.7x or so improvement.

Arachnotronic · Feb 20, 2015

witeken said:
Are people still having issues to discern the difference between marketing slides and presentations given for developers (hence: IDF).

You were actually supposed to copy and paste that line into Google, but I guess you didn't bother. You can also just take my word for it that there are a bunch of benchmarks showing a 1.2 to 1.7x or so improvement.

While I tend to agree with Nothingness that taking Intel's word for something isn't always the best way to go (independent verification is helpful), I did find that slide deck useful. Thanks.

Qualcomm moves Cortex A72 to the mid-range

Diamond Member

Lifer

Senior member

Senior member

Senior member

Senior member

Diamond Member

Elite Member

Senior member

Elite Member

Lifer

Senior member

Senior member

Member

Senior member

Platinum Member

Senior member

Diamond Member

Senior member

Diamond Member

Platinum Member

Diamond Member

Senior member

Diamond Member

Lifer