Samsung outs Exynos 9 Series 9810

itsmydamnation · Feb 16, 2018

Thala said:
Yes, for x86 vs x86 a uop cache saves power, as you do not have to run full decode every time. It is not about powering down though, as the decoders do not have separate/split power domains - it is just about less activity in the decoders, which saves power.

They state in the core design that they have aggressive clock gating with multiple regions. They don't say what the regions are, but it would be surprising if the picker/decode doesn't clock gate off, because there are queues on either side of it. Generally speaking, for a heavily speculative core, taking a few cycles to clock up isn't costing very much performance; you can/would also speculatively clock it up.

I just happen to disagree with Linus. Seems he is no CPU architect

Honest question: are you?

and his SW arguments are blown up out of proportion.

At this point I'm sure every argument raised is minimized or blown out of proportion based on beliefs, like half of these:

There are lots of other issues with x86 on top like memory model - which also impacts cache coherency implementation, small architectural register set, memory operands, atomic operations, descriptor tables, segmentation etc. Many of the things, which were okayish in the seventies you still find today only in x86.

We are not running a small in-order core or a small OoOE core; these are big cores with very large RPFs, schedulers and load/store queues. Sure, the compiler is going to push a store sooner on x86 than on ARMv8, but the core is going to store as late as practical, and store-to-load forward if needed. This generally bears out in all the dynamic x86 vs x64 instruction counts I've ever seen.

I'm sure code density is another thing that doesn't matter...

Now the funny thing about all of this is: what was the actual clock of the M3, 2.5 or 2.9 GHz? When looking to make a comparison, 2.5 vs 2.9 shaves another 100mV off Zen's power consumption.

I asked a simple question before (because I don't know): what is the peak/95th-percentile power a wide OoOE ARM core uses during a non-thermally-constrained benchmark? I got crickets. Without data like this it's all just going to be opinion. We can get this data for x86; we can manually set cTDP states and test on x86...

DeletedMember377562 · Feb 16, 2018

el etro said:
I completely disagree. Soon enough they will all meet the same dead end ILP extraction wise that Intel (and soon AMD) met. It's easier to follow than to trail blaze, don't expect the meteoric performance jumps we see right now to continue for long. There isn't much more that can be improved hardware wise for absolute performance, not without completely blowing up power budgets to the point where just adding more cores is far more efficient.

And even though it appears like Apple and soon Samsung are gaining on Intel and AMD, remember that the two have been stuck on 14nm for years now, while Apple and Samsung are getting the benefits of a node shrink. Once Intel and AMD move to 7nm (Well, 10nm for Intel), the bar will be set higher for ARM designs to beat.

What is there to catch up to? This statement about "it will stop at some time" has been repeated over and over again, and has been proven wrong again and again. It was first said to me on this forum when the A9 released. Now we're at A11 with 60% improved single core performance since then. And they (and Samsung) are achieving this with 4W SOCs! This fact can't be repeated enough. Compare that to Intel's 4.5W Core-M CPUs, and there is a clear difference in actual performance between them.

We might as well mention Intel's attempt at entering the smartphone segment with their Atom SOCs, which, despite a ton of resources behind it and enough attempts, was still considerably inferior to the Qualcomm, Samsung and Apple alternatives. After a stupendous amount of subsidies, Intel ended up acknowledging defeat and leaving the segment altogether. When a company of Intel's size, and with their enormous R&D, walks away from a huge money-making machine like the smartphone sector, it should tell you something... Intel also knew these SOCs would eventually catch up to and threaten the laptop segment, which they currently dominate.

Also, you say they (Apple, Samsung, Qualcomm, etc.) have the benefit of a lower process node, which I am also unsure is the case. One, because Intel's 14nm++ is as good as (or better than) TSMC and Samsung's 10nm. And two, because process node shrinks in the past have not shown any attempt by Intel to make larger cores or increase performance in any way. SB to IB gave us next to no difference in performance. HW, which was on the same process node as IB, had a "huge" leap of ~10%. HW to BW again had little to no difference, with BW to SL -- again same process node -- being around a ~6% jump.

People on this forum have constantly defended Intel, claiming they aren't doing more because more can't be done. But look at the mobile platform. Even AMD are beating Intel's ass with their first mobile Ryzen attempt, with almost equal CPU performance, and a whopping ~250% GPU performance advantage. And they're doing that on an inferior 14nm process. Now, Apple and Samsung are catching up to them, with similar CPU performance on a third of the power usage. And there's no sign of them stopping their yearly performance improvements either. For Apple that's now down to 15-20% improvements -- but that's still as much as Intel did in 5 redacted years with 3 different architectures.

28nm to 14nm (or rather "fake 14nm", as TSMC's and GloFo's definition seems to differ from Intel's) gave us a performance increase of around ~70-80% in, for example, the GPU segment, where AMD and NVIDIA took advantage of smaller nodes to increase the number of transistors. What did 32nm to 14nm give us in the desktop CPU segment? An IPC increase of 20%... I've never understood that. Then again, we're talking about the same high tech company whose IHS soldering is so bad, a simple delid and removal of residue glue improves temperatures by 15C. A company whose thermal paste is as effective as toothpaste in terms of cooling abilities. A company whose "let's do redacted-all" attitude has now made ARM catch up to them on mobile -- something that has been happening for a long while. Hell, even Qualcomm, with their weak SOCs, are now striking deals with Microsoft, Lenovo, HP and others to use Snapdragon chips in their future products. Intel are even being pressured in the server segment.

It kind of makes you think what their R&D budget has been going to.

CatMerc said:
While I disagree about the GPU notion, I do agree that we should be more careful about interpretation of comments. It just degrades discussion quality otherwise. +1

I agree as well, which is why I won't allow Thala to lie himself out of this situation by claiming I'm putting words in his mouth.

In Thala's comment about valuing GPU perf more than singlethreaded perf, I sarcastically remarked "Yes, 10% higher GPU performance is clearly better than 70% higher singlethreaded performance..."

To which Thala responded (To make it easier, I have shortened part of the post; you can find it in its entirety here).

Thala said:
Might depend on the particular use-case...when firing up a game CPU performance will gain you nothing while you see immediate benefits of higher GPU performance.

To be clear i am talking about user-experience and do not just go by numbers, which would indicate that 70%>10%.

He is very well aware we're talking about a 70% singlethreaded advantage and a 10% GPU performance disadvantage for the E9810 vs the SD845. And yet he goes on about how he values the GPU performance increase more, again and again and again.

You've been told several times already,
but there is NO profanity in the tech areas.

AT Mod Usandthem

Dolan · Feb 17, 2018

ILP is determined by the implemented algorithm and the ISA.

And since the ARM and x86 ISAs are completely different, there is no reason to believe that ARM will hit the same IPC wall as x86.

ARM, for example, has an advantage in its limited backward compatibility, so new instructions are encoded more efficiently. When ARM has a peak decode rate of 4, that means 4 instructions per cycle. For comparison, when Skylake is working with some new instructions, the decode rate might fall to 1 (a single) instruction per cycle, limited purely by the decoders even when there is more bandwidth/resources free and available.

Lodix · Feb 17, 2018

I think this picture speaks for itself for comparing ISAs and architectures.

Andrei. · Feb 17, 2018

Lodix said:
I think this picture speaks for itself for comparing ISAs and architectures.

That picture says absolutely nothing about ISAs and architectures. Beyond the fact the ARM core values are larger than reality, we have extreme counter-examples of this over the last two mobile generations.
I take Linley's group shot of cores :

Again some of these values are incorrect but let's just go with the general point of the shot.

At one extreme you had Qualcomm's Kryo being absolutely huge and power inefficient and in the other extreme you had ARM's designs. The A73 is smaller than an A72 and performs far better. These are the same ISA architectures at very similar performance levels yet we see PPA differences of 2-3x just in-between vendors.

Intel's cores are huge because they use vastly bigger transistor libraries to reach the core clocks where they are used - again this is no indication of ISA but simply an effect of the implementation of that product. There are so many factors at play here going from architecture (ISA), micro-architecture, circuit-design quality, physical implementation quality (RTL to GDSII isn't some one step process) to market decisions on where the core ends up being used in and subsequent decisions which compromise area for performance.

Die shot comparisons are nice but the only thing they're good for is showing an estimate of the competitive PPA within the market segment and utterly useless for things like architecture discussions.

CatMerc · Feb 17, 2018

Andrei. said:
That picture says absolutely nothing about ISAs and architectures. Beyond the fact the ARM core values are larger than reality, we have extreme counter-examples of this over the last two mobile generations.
I take Linley's group shot of cores :

Again some of these values are incorrect but let's just go with the general point of the shot.

At one extreme you had Qualcomm's Kryo being absolutely huge and power inefficient and in the other extreme you had ARM's designs. The A73 is smaller than an A72 and performs far better. These are the same ISA architectures at very similar performance levels yet we see PPA differences of 2-3x just in-between vendors.

Intel's cores are huge because they use vastly bigger transistor libraries to reach the core clocks where they are used - again this is no indication of ISA but simply an effect of the implementation of that product. There are so many factors at play here going from architecture (ISA), micro-architecture, circuit-design quality, physical implementation quality (RTL to GDSII isn't some one step process) to market decisions on where the core ends up being used in and subsequent decisions which compromise area for performance.

Die shot comparisons are nice but the only thing they're good for is showing an estimate of the competitive PPA within the market segment and utterly useless for things like architecture discussions.

I'd be happy to update the chart if you tell me where the core actually starts and where it ends on the Snapdragon 835's Kryo 280. I assumed taking the cluster and cutting it in four (plus 1/4th of the L2) would be good enough.

As for the rest, I agree. I never intended this to be an ISA comparison or even an architecture comparison, I made this for fun.
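The per-core estimate described above (a quarter of the cluster plus a quarter of the L2) can be sketched as follows. This is only an illustration: the function name, the `shared_logic_mm2` parameter and every number are hypothetical and are not measurements from any die shot. It also assumes the measured cluster area excludes the L2.

```python
def per_core_area_mm2(cluster_area_mm2, l2_area_mm2, cores=4, shared_logic_mm2=0.0):
    """Estimate one core's area as an equal share of the cluster plus an
    equal share of the L2.

    shared_logic_mm2 is the area of shared cluster logic to leave out before
    splitting the cluster between cores (0.0 reproduces the simple estimate).
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    core_only = (cluster_area_mm2 - shared_logic_mm2) / cores
    l2_share = l2_area_mm2 / cores
    return core_only + l2_share


# Hypothetical inputs, chosen only to show how the estimate moves.
simple = per_core_area_mm2(cluster_area_mm2=8.0, l2_area_mm2=2.0)
adjusted = per_core_area_mm2(cluster_area_mm2=8.0, l2_area_mm2=2.0,
                             shared_logic_mm2=0.8)
print(f"simple: {simple:.2f} mm^2, excluding shared logic: {adjusted:.2f} mm^2")
```

Leaving shared cluster logic in the split inflates every core's figure by the same amount, so it matters most when comparing against designs measured at the core boundary.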

Dolan said:
ILP is determined by the implemented algorithm and the ISA.

And since the ARM and x86 ISAs are completely different, there is no reason to believe that ARM will hit the same IPC wall as x86.

ARM, for example, has an advantage in its limited backward compatibility, so new instructions are encoded more efficiently. When ARM has a peak decode rate of 4, that means 4 instructions per cycle. For comparison, when Skylake is working with some new instructions, the decode rate might fall to 1 (a single) instruction per cycle, limited purely by the decoders even when there is more bandwidth/resources free and available.

But at the same time the x86 instructions tend to be denser. One instruction decode can turn into multiple micro ops.

french toast · Feb 17, 2018

Andrei. said:
That picture says absolutely nothing about ISAs and architectures. Beyond the fact the ARM core values are larger than reality, we have extreme counter-examples of this over the last two mobile generations.
I take Linley's group shot of cores :

Again some of these values are incorrect but let's just go with the general point of the shot.

At one extreme you had Qualcomm's Kryo being absolutely huge and power inefficient and in the other extreme you had ARM's designs. The A73 is smaller than an A72 and performs far better. These are the same ISA architectures at very similar performance levels yet we see PPA differences of 2-3x just in-between vendors.

Intel's cores are huge because they use vastly bigger transistor libraries to reach the core clocks where they are used - again this is no indication of ISA but simply an effect of the implementation of that product. There are so many factors at play here going from architecture (ISA), micro-architecture, circuit-design quality, physical implementation quality (RTL to GDSII isn't some one step process) to market decisions on where the core ends up being used in and subsequent decisions which compromise area for performance.

Die shot comparisons are nice but the only thing they're good for is showing an estimate of the competitive PPA within the market segment and utterly useless for things like architecture discussions.

Does this mean ARM cores will have to dedicate more transistors to enable higher frequencies, like Vega and those Intel cores?

Intel 14nm++ is probably capable of higher frequencies than TSMC/Samsung 10nm, but the latter is likely better at low power and has slightly better density.
In other words they are not too far from each other.

Yet these ARM cores are catching up fast... frequency is also increasing along with IPC in sub-4W form factors. I have to believe that if Intel made an ARM desktop core with all of their IP and technology and baked it on 14nm++, it would completely destroy their x86 Skylake designs in perf/watt.
Those Intel ARM cores would lose some die space to enable higher frequencies, SMT and beefy SIMD/FP units, but would probably still be smaller than Skylake cores.

Andrei. · Feb 17, 2018

french toast said:
Does this mean ARM cores will have to dedicate more transistors to enable higher frequencies, like Vega and those Intel cores?

This is universal of any circuit.

eastofeastside · Feb 17, 2018

https://www.anandtech.com/show/1231...n-exclusive-interview-with-dr-lisa-su-amd-ceo

Q25: Several years ago, there was the expectation that AMD would be producing a number of ARM based processors, and then we got the singular A1100 product. Then it fell by the wayside: are there any plans along those lines anymore? Or is it evaluated?

LS: We continue to do ARM-based development in a couple of areas. We have custom products around ARM, and we actually use ARM in our PSP as you know. We always look at ARM, and if it makes sense down the road to introduce another ARM standard product, we’ll consider that.

But the focus for us has really been around x86 and our GPU roadmap. Part of what we have done is really focus the R&D efforts. I still believe that there are lots of good places for ARM in the industry so we’ll continue to take a look at that.

So if K12, or some other Zen-based ARM development, were continuing, would she say so? This is a recent article, from January 2018.

I was hopeful K12 would reboot around the next-gen ARM core, but these comments don't give much encouragement to that expectation.

Andrei. · Feb 17, 2018

eastofeastside said:
https://www.anandtech.com/show/1231...n-exclusive-interview-with-dr-lisa-su-amd-ceo

So if K12, or some other Zen-based ARM development, were continuing, would she say so? This is a recent article, from January 2018.

I was hopeful K12 would reboot around the next-gen ARM core, but these comments don't give much encouragement to that expectation.

AMD is resource constrained. It's better for them to double-down on x86 Zen than water down R&D in uncertain avenues.

Nothingness · Feb 17, 2018

CatMerc said:
But at the same time the x86 instructions tend to be denser. One instruction decode can turn into multiple micro ops.

If x86_64 is slightly denser than AArch64, it's not due to instruction complexity, but rather to the encoding.

I made some measurements years ago on SPECint 2006 comparing x86_64 vs AArch64 with similar versions of gcc. The dynamic number of instructions was almost identical. Total dynamic code size executed was about 5% larger on AArch64.
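As a rough worked example of what those two figures imply: AArch64 (A64) instructions are a fixed 4 bytes, so an equal dynamic instruction count and a roughly 5% larger dynamic code size on AArch64 put the average executed x86_64 instruction at a bit under 4 bytes. The sketch below just does that arithmetic; the function name and the treatment of "almost identical" as exactly equal are assumptions made for illustration.

```python
# A64 instructions have a fixed 32-bit (4-byte) encoding.
AARCH64_BYTES_PER_INSN = 4.0


def implied_x86_avg_insn_bytes(size_ratio, count_ratio=1.0):
    """Average executed x86_64 instruction length implied by two ratios.

    size_ratio:  AArch64 dynamic code bytes / x86_64 dynamic code bytes
    count_ratio: AArch64 dynamic instruction count / x86_64 dynamic count
                 (1.0 assumes the counts are exactly equal)
    """
    if size_ratio <= 0 or count_ratio <= 0:
        raise ValueError("ratios must be positive")
    # aarch64_bytes = 4 * aarch64_count
    # x86_bytes     = aarch64_bytes / size_ratio
    # x86_avg       = x86_bytes / x86_count = 4 * count_ratio / size_ratio
    return AARCH64_BYTES_PER_INSN * count_ratio / size_ratio


avg = implied_x86_avg_insn_bytes(size_ratio=1.05)
print(f"{avg:.2f} bytes per executed x86_64 instruction")
```

Under those assumptions the result is about 3.81 bytes, which is consistent with the point that the density gap comes from encoding rather than from each x86 instruction doing more work.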

Dolan · Feb 17, 2018

CatMerc said:
But at the same time the x86 instructions tend to be denser. One instruction decode can turn into multiple micro ops.

You can find different studies on this topic, some in favor of x86, some in favor of ARM... And mostly it is just 2 operations.

A little off topic, since this is not about process, but I want to add: in theory Intel's process looks good; thanks to split pitching they achieved the best dimensions in class. But in reality working in 1D is so difficult that their routed-gate density is a disaster, a generation behind others. Given that ARM designs are usually transistor hungry, that might not be good. Did you ever wonder why every single ICF customer ran away from them (or got bought)?

Intel has actually already shown some ARM cores, and none of them was good enough.

Performance is bad too. Skylake can hit 5GHz only because of good optimization and careful binning. But that won't tell you much about the performance of the process itself. There are a lot of other factors in a high-level design like a CPU. But if you look at SerDes performance (which is relatively simpler), Intel can support only 10-12G transceivers while others have no problem with 25-32G.

CatMerc · Feb 17, 2018

Andrei. said:
AMD is resource constrained. It's better for them to double-down on x86 Zen than water down R&D in uncertain avenues.

That and they'd need to build demand for high performance ARM systems. It's better for them to let others pave the way and get server customers on board and then swoop in with their own design.

That and keeping x86 is better for them from a competitive standpoint. Growing the ARM ecosystem is increasing competition for AMD.

CatMerc · Feb 17, 2018

@Andrei.
I updated the comparison after our exchange on twitter. I double checked the Monsoon values, and either the die size is incorrect, or there aren't any accurate annotations that I could find of where the big core begins and ends. I did update the image with the better core shot though.
As for the Kryo 280, I discovered I goofed. Aside from your correction about shared cluster logic, it appears I made an error in my calculation, causing the values to be larger than they actually are. I'll upload the corrected version on Twitter if you don't see any more errors:

french toast · Feb 17, 2018

Great work cat for trying to compile size list, do we have any data for A73 & A75?
Edit; kryo 280 is A73 based correct?

Lodix · Feb 18, 2018

french toast said:
Great work cat for trying to compile size list, do we have any data for A73 & A75?
Edit; kryo 280 is A73 based correct?

Kryo 280 is Cortex A73, yep. As for the A75, it is too early; we will have to wait for the release of the Samsung Galaxy S9.

CatMerc · Feb 19, 2018

Here is the corrected chart including the Monsoon. Thanks to @Andrei. for the corrections!

french toast · Feb 19, 2018

Lodix said:
Kryo 280 is Cortex A73, yep. As for the A75, it is too early; we will have to wait for the release of the Samsung Galaxy S9.

It will be interesting to see what the differences are between the A73 and A75... which look tiny compared to Monsoon vs M3...

ARM have done remarkably well with power and area efficiency in their current designs. I have no doubt that if they dedicated full resources and engineering talent they could make a competitive big core to rival Monsoon.

itsmydamnation · Feb 19, 2018

CatMerc said:
Here is the corrected chart including the Monsoon. Thanks to @Andrei. for the corrections!

Can you add more of the Apple cores? It would be great to see the evolution over time.

I've always found it interesting how everyone's cores except AMD's are very square and rigid, while AMD's are very bloomy (since Jaguar/Piledriver); I assume that's from their automated layout. It would be really interesting to find out the man-hour effort of each core (we can only dream).

We will have to wait for 7nm to find out, but a Zen/Broadwell core on a TSMC-like 10nm process, excluding the L2, would probably have been 2.5 to 3 mm². MTr/mm² is 32 for 14LPP and 60 for TSMC 10nm. But what Monsoon really shows is just how much 10nm has hurt Intel: they should have been at 100 MTr/mm² in the middle of 2016 and smashing everybody; now it looks like in 2018 they will lose on their own metric to TSMC 7nm (116).
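The area estimate above is a density-ratio scaling, which can be sketched as below. The densities (32 and 60 MTr/mm²) are the ones quoted in the post; the 5.0 mm² starting area is a purely hypothetical input, not a measured Zen or Broadwell core size. The sketch also assumes every part of the core shrinks by the headline density ratio, which real designs generally do not achieve.

```python
def scaled_area_mm2(area_mm2, density_from, density_to):
    """Scale a block's area by the ratio of transistor densities (MTr/mm^2),
    assuming the transistor count stays the same across processes."""
    if density_from <= 0 or density_to <= 0:
        raise ValueError("densities must be positive")
    return area_mm2 * density_from / density_to


DENSITY_14LPP = 32.0    # MTr/mm^2, figure quoted in the post
DENSITY_TSMC10 = 60.0   # MTr/mm^2, figure quoted in the post

# Hypothetical core area on 14LPP, excluding L2 (illustrative only).
core_area_14lpp = 5.0
on_10nm = scaled_area_mm2(core_area_14lpp, DENSITY_14LPP, DENSITY_TSMC10)
print(f"{core_area_14lpp} mm^2 on 14LPP -> {on_10nm:.2f} mm^2 on TSMC 10nm")

# Running the scaling backwards shows what 14LPP area the 2.5-3 mm^2
# range corresponds to under the same assumption.
low = scaled_area_mm2(2.5, DENSITY_TSMC10, DENSITY_14LPP)
high = scaled_area_mm2(3.0, DENSITY_TSMC10, DENSITY_14LPP)
print(f"2.5-3.0 mm^2 on 10nm <- {low:.2f}-{high:.2f} mm^2 on 14LPP")
```

With these inputs, the 2.5 to 3 mm² range corresponds to roughly 4.7 to 5.6 mm² of core on 14LPP.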

Thala · Feb 19, 2018

generalako said:
I agree as well, which is why I won't allow Thala to lie himself out of this situation by claiming I'm putting words in his mouth.

OK, first you were putting words into my mouth, now you accuse me of lying.
I made it very clear that this is my personal preference. I also made it very clear that I am looking at a particular use-case. Still you insist that I made a general statement.

You made it first here:

Yes, 10% higher GPU performance is clearly better than 70% higher singlethreaded performance...

You made it again here:

According to Thala, it's completely fine to have a 60-75% single core + ~15% multi core disadvantage, as long as we have 10% GPU advantage on the SD845...

I did ask you politely to stop this nonsense - but you chose to go on...

Your inability to comprehend the simplest statements, either willingly or by mistake, is slowly getting tiresome. In any case I will report your invalid accusations.

Thala · Feb 19, 2018

Andrei. said:
This is universal of any circuit.

Yes, it's universal, but there are some measures which are not linked to a change of architecture. If you have a mobile design with ultra-low power requirements you typically have:

1) Low leakage, high-vt cell library
2) low voltage
3) low track cell library with reduced drive current
4) typically no binning, so you have to sign-off frequency for worst case conditions and worst/slow corner of the die
5) no additional drivers for signal propagation

Just modifying the above parameters will give you a very wide range of frequencies, in particular if you start from a mobile use-case.
Anyway, the increase in size is not necessarily related to changing the architecture at RTL level, but just to the backend.

Andrei. · Feb 19, 2018

Thala said:
4) typically no binning, so you have to sign-off frequency for worst case conditions and worst/slow corner of the die

Every mobile chip does binning, it's just that they do power binning instead of performance binning that's usually done in the desktop space. You might end up with a turd using 30% or more power than a good chip.

eastofeastside · Feb 19, 2018

Looking at the bigger picture, it just seems obvious that the time of x86's inevitable technological obsolescence is here. It's not even a matter of debating the technical merits of one ISA versus another. Mobile disrupted previously dominant PCs, and by extension the ARM ISA and business model is disrupting x86 and Intel's business.

Intel was dominant for a long time and got complacent resting on its laurels (as every successful unchallenged business does) with no incentive to give consumers the bleeding edge of efficiency, and now the reality of competitive challenge has set in too late. Intel x86 is now in less of a technological battle, than locked in a more serious irreversible trend of decreasing relevance.

Every technology has its time of inevitable obsolescence and x86 isn't a special all enduring exception. The big picture is obvious looking at the forest from the trees.

Thala · Feb 19, 2018

Andrei. said:
Every mobile chip does binning, it's just that they do power binning instead of performance binning that's usually done in the desktop space. You might end up with a turd using 30% or more power than a good chip.

You are right, I was referring to performance binning, because as I said it's about frequency sign-off considering the slowest corner. Ironically, for power it's the other way around: the fastest chips give you the most headaches regarding thermals and power. In any case, if you only have one SKU you cannot sell those fast chips as a faster and more power-consuming variant; instead you have to cope with the slowest chips when deciding frequency.
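A toy illustration of that last point, assuming a hypothetical list of per-chip maximum stable frequencies: with a single SKU the shipped frequency is capped by the slowest chip, while performance binning lets faster chips be sold at a higher floor. The function names and sample numbers are made up for the sketch. Real sign-off is done against process corners in simulation rather than by taking the minimum of measured parts.

```python
from collections import Counter


def single_sku_frequency(fmax_ghz):
    """With one SKU, every shipped chip must meet the same frequency,
    so the slowest chip sets it."""
    if not fmax_ghz:
        raise ValueError("need at least one chip")
    return min(fmax_ghz)


def performance_bins(fmax_ghz, bin_floors_ghz):
    """Assign each chip to the highest bin floor it meets and count the
    chips per bin. Chips below the lowest floor are not counted."""
    floors = sorted(bin_floors_ghz, reverse=True)
    counts = Counter()
    for f in fmax_ghz:
        for floor in floors:
            if f >= floor:
                counts[floor] += 1
                break
    return dict(counts)


# Hypothetical per-chip maximum stable frequencies in GHz.
sample = [2.6, 2.7, 2.9, 3.0, 2.8]
print("single SKU ships at:", single_sku_frequency(sample), "GHz")
print("two performance bins:", performance_bins(sample, [2.6, 2.8]))
```

In this made-up sample, three of the five chips could run at 2.8 GHz or more, but a single SKU has to ship all of them at 2.6 GHz.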

Lodix · Feb 20, 2018

This new score is higher, but the reported frequency is 1.95GHz.
