Samsung outs Exynos 9 Series 9810

Andrei. · Feb 15, 2018

Gideon said:
Yeah, claiming "better than ZEN IPC 100%" is jumping the gun "a bit". At least a while ago Samsung invested heavy resources into optimizing for benchmarks.

I don't understand what the fuss is about admitting that it has higher IPC. It's a much lower-clocked architecture. Apple has the highest IPC in the industry right now.

Gideon said:
I still remember a A9 Samsung S3 outpacing a Krait Nexus-4 in many popular CPU benchmarks in phone reviews (while Krait essentially had 40%+ more IPC and similar clocks).

Err no. Krait had lower IPC than an A9. The 4412 was the better SoC that generation.

Gideon · Feb 15, 2018

Yeah my bad, I was confusing it with A8 I guess (though Krait might have had better FP performance). I just remembered seeing benches, where a Samsung phone, that definitely had an old in-order architecture, soundly beat the Nexus 4 left and right (except the GPU benches).

And I wouldn't consider 2.9 GHz to be "low". Well, I guess then the x86 chipmakers should really be ashamed of themselves for delivering such uncompetitive designs (especially Intel, for all those years!) while mobile chipmakers manage to extract so much extra performance out of every generation.

CatMerc · Feb 15, 2018

Zen and Skylake aside from clocking higher also have to deal with workloads that no smartphone is ever expected to do. SPECInt is a better representative of desktop (yes not just workstation) performance than Geekbench, as the old and crusty Windows software environment is quite different to Android or iOS, and in these workloads you see just how big the gap is. At 3.2GHz EPYC is easily 3x the single threaded performance of Exynos 8895, and with SMT that's 5x per core.

Not to take anything away from Apple and Samsung, but this is complete apples to oranges (heh) to an extreme measure. I don't think it's fruitful to make IPC comparisons of mobile SoC's vs full blown x86 behemoths in Geekbench. It doesn't really tell you anything of value. The x86 designs aren't made to scale down to phones, and the ARM designs aren't made to scale up to servers.

It's alright to make the comparison in Geekbench, but it needs to be understood and not just looked at as is. Otherwise you draw conclusions that are based on false premises.

asendra · Feb 15, 2018

CatMerc said:
Zen and Skylake aside from clocking higher also have to deal with workloads that no smartphone is ever expected to do. SPECInt is a better representative of desktop (yes not just workstation) performance than Geekbench, as the old and crusty Windows software environment is quite different to Android or iOS, and in these workloads you see just how big the gap is. At 3.2GHz EPYC is easily 3x the single threaded performance of Exynos 8895, and with SMT that's 5x per core.

Not to take anything away from Apple and Samsung, but this is complete apples to oranges (heh) to an extreme measure. I don't think it's fruitful to make IPC comparisons of mobile SoC's vs full blown x86 behemoths in Geekbench. It doesn't really tell you anything of value. The x86 designs aren't made to scale down to phones, and the ARM designs aren't made to scale up to servers.

It's alright to make the comparison in Geekbench, but it needs to be understood and not just looked at as is. Otherwise you draw conclusions that are based on false premises.

https://www.anandtech.com/show/9766/the-apple-ipad-pro-review/4

A9X wasn't 3x to 5x slower than Skylake on SPEC06, and that was two SoC generations ago for Apple.
Also, I would say Apple has advanced their CPU performance quite a bit more than Intel during this time.

But yes, in general there's little to gain by comparing such different designs. I only find it interesting in Apple's case due to the theoretical overlap between iPad Pros and MacBooks, which share very similar TDPs and underlying OSs, which might allow Apple to ditch Intel if they wanted to.

CatMerc · Feb 15, 2018

asendra said:
https://www.anandtech.com/show/9766/the-apple-ipad-pro-review/4

A9x wasn't 3x to 5x slower than Skylake on SPEC06, and that was two SOC generations ago for Apple.
Also, I would say Apples has advanced their CPU performance quite a bit more than Intel during this time..

but yes, in general there's little to gain by comparing such different designs. I only find it interesting in Apples case due to the theoretical overlap between iPad Pros and MacBooks, which share very similar TDPs and underlying OSs, which might allow Apple to ditch Intel if they wanted to

The CPUs here have an order of magnitude less power available to them than the one I was using. That, and the A9X is still faster than the SoCs I mentioned in single core, so some of the gap closes there.

So in the end we are both getting the same comparative results.

Nothingness · Feb 15, 2018

asendra said:
A9x wasn't 3x to 5x slower than Skylake on SPEC06, and that was two SOC generations ago for Apple.
Also, I would say Apples has advanced their CPU performance quite a bit more than Intel during this time..

That comparison was heavily biased due to the use of icc in 32-bit mode for x86[*]. It's pointless to compare x86 vs ARM chips.

[*] Last time I checked icc vs gcc on an i3770 the geomean of the int part was 45% better for icc. icc is a SPEC compiler
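
A minimal sketch of how a geomean over per-test speedups is computed, assuming the input is a list of icc/gcc score ratios, one per integer test. The ratio values below are hypothetical placeholders, not the measured ones, and `geomean` is just an illustrative helper name.

```python
import math


def geomean(ratios):
    """Geometric mean of positive per-benchmark ratios (e.g. icc score / gcc score)."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("need at least one positive ratio")
    # Average in log space, then exponentiate.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical icc/gcc ratios for three integer tests (placeholders only).
example_ratios = [1.20, 1.45, 1.75]
print(f"geomean speedup: {geomean(example_ratios):.2f}x")
```

A geomean is used rather than an arithmetic mean so that one outlier test cannot dominate the summary figure.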

itsmydamnation · Feb 15, 2018

Andrei. said:
I thought you were following the thread.

No need to be smart....

Based on these two since we can be relatively certain about the clocks:
http://browser.geekbench.com/v4/cpu/3301296
http://browser.geekbench.com/v4/cpu/4534181

I can see the issue, I thought the claimed peak clock was 2.9 not 2.5. That accounts for the difference in our numbers.
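
To illustrate why the assumed clock matters so much, here is a rough sketch that normalizes a benchmark score by clock. The score is a hypothetical placeholder (not taken from the linked Geekbench results), and score per GHz is only a crude stand-in for IPC.

```python
def per_clock(score, clock_ghz):
    """Benchmark points per GHz, a crude proxy for IPC."""
    if clock_ghz <= 0:
        raise ValueError("clock must be positive")
    return score / clock_ghz


score = 1000.0  # hypothetical placeholder score, not a real result

at_2_5 = per_clock(score, 2.5)  # assuming the chip ran at 2.5 GHz
at_2_9 = per_clock(score, 2.9)  # assuming the chip ran at 2.9 GHz

# The same score looks this much better per clock at the lower assumed frequency.
ratio = at_2_5 / at_2_9
print(f"per-clock estimate differs by {100 * (ratio - 1):.0f}%")
```

Whatever the actual score is, assuming 2.5 GHz instead of 2.9 GHz raises the per-clock estimate by the factor 2.9 / 2.5.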

Thala · Feb 15, 2018

Nothingness said:
That comparison was heavily biased due to the use of icc in 32-bit mode for x86[*]. It's pointless to compare x86 vs ARM chips.

[*] Last time I checked icc vs gcc on an i3770 the geomean of the int part was 45% better for icc. icc is a SPEC compiler

Indeed. Contrary to popular belief, it's only the compiler that matters for low-level benchmarks, not the OS. Likewise, even when going with SPEC to compare ARM vs. x86, the compilers used should be the same (either llvm or gcc).

Thala · Feb 15, 2018

itsmydamnation said:
Jim said we can, for the same number of transistors, have about a 10% bigger OOOE engine with ARM than with x86. Michael Clark said we can deliver Zen level of performance regardless of ISA.

Except of course that only getting Zen performance would not be an achievement for any ARM architecture of similar size and power.

itsmydamnation said:
IF you search RWT you will find the wars about ISA covered very well, to me i would summarize the issue as at 4 wide decode x86 spends more transistors on the front end but it doesn't cost you power, uop caches help that limit and save power, over 4 is a big problem. ARM ISA has some nicer load operations.

At that point you're done; everything else, such as weak vs strong memory ordering, is just a different trade-off for different workloads.

There is so much wrong with the x86 ISA that I do not even know where to start. But claiming that the uop cache saves power is...interesting. In addition, I do not know of any workload where a strong memory ordering model gives you an advantage. On the contrary, implementing sequential consistency costs you gates and power, in addition to limiting your performance. It is very natural for an OOO architecture to have weak memory ordering.
The issue is that, due to backwards compatibility, x86 never made the jump to weak ordering. Back in the seventies, sequential consistency came naturally. Today, however, when you have several transactions in flight at any given point in time, it is a challenge to make them observable in program order without barriers.

CatMerc · Feb 15, 2018

Thala said:
Except of course that only getting Zen performance would not be an achievement for any ARM architecture of similar size and power.

There is so much wrong with the x86 ISA, that i do not even know where to start. But claiming that uop cache saves power is...interesting. In addition i do not know any workload where a strong memory ordering model gives you advantages. In contrary implementing sequential consistency cost you gates and power in addition to limiting your performance. It is very natural for an OOO architecture to have weak memory ordering.
Issue is, that due to backwards compatibility x86 never made the jump to weak ordering. Back in the seventies sequential consistency was naturally given. Today however, when at any given point in time you have several transactions ongoing, it is a challenge to make them observable in program order without barriers.

uOp cache saves power on x86 vs not having one. Of course not having the power hungry decoder saves more, but x86 is x86.

itsmydamnation · Feb 15, 2018

Thala said:
Except of course that only getting Zen performance would not be an achievement for any ARM architecture of similar size and power.

There is so much wrong with the x86 ISA, that i do not even know where to start. But claiming that uop cache saves power is...interesting. In addition i do not know any workload where a strong memory ordering model gives you advantages. In contrary implementing sequential consistency cost you gates and power in addition to limiting your performance. It is very natural for an OOO architecture to have weak memory ordering.
Issue is, that due to backwards compatibility x86 never made the jump to weak ordering. Back in the seventies sequential consistency was naturally given. Today however, when at any given point in time you have several transactions ongoing, it is a challenge to make them observable in program order without barriers.

The uop cache is claimed by both Intel and AMD to save power. Does ARM claim loop caches save power? (I would assume they do.) Any time you can power down your front end is a good thing.

I don't really have time to write a full reply myself, so I'll just link this:
https://www.realworldtech.com/forum/?threadid=131745&curpostid=131806

So how strong is the memory ordering in the M3............

eastofeastside · Feb 15, 2018

Great Zen vs M3 discussion.

My interest in asking is on the potential for ARM to have an impact over x86 in low to mid-range Windows 10 laptops and Chromebooks.

And secondly, on the possibility for an ARM based PS5/XBOX next-gen console CPU.

I specifically wanted to know if a big ARM core could have an advantage over mobile Ryzen and i3 and i5. Ryzen is not a pure mobile core like M3, perhaps K12 was supposed to address the mobile low-power part of AMD's strategy.

ARM Ares next-gen ARM core will launch this summer at Computex. I'm excited for the future of next-gen ARM cores to expand beyond phone and tablets, especially when they hit 7nm.

Thala · Feb 16, 2018

itsmydamnation said:
The uop cache is claimed by both Intel and AMD to save power. Does ARM claim loop caches save power? (I would assume they do.) Any time you can power down your front end is a good thing.

Yes, x86 vs x86, the uop cache saves power, as you do not have to run full decode every time. It is not about powering down though, as the decoders do not have separate/split power domains. It is just the reduced activity in the decoders that saves power.

itsmydamnation said:
I don't really have time to write a full reply myself, so I'll just link this:
https://www.realworldtech.com/forum/?threadid=131745&curpostid=131806

I just happen to disagree with Linus. Seems he is no CPU architect and his SW arguments are blown up out of proportion. Barriers are typically not needed in user level code but are hidden within the OS. The background is, that say two threads/contexts only have to reason about memory ordering at the synchronization points - that is unless you do implement synchronization in application-level code, but this would be bad practice anyway.
Regarding verification of the OS itself, indeed you might end up with more barriers than needed on a particular architecture, but correctness is pretty much decidable at this point.

CatMerc · Feb 16, 2018

eastofeastside said:
Great Zen vs M3 discussion.

My interest in asking is on the potential for ARM to have an impact over x86 in low to mid-range Windows 10 laptops and Chromebooks.

And secondly, on the possibility for an ARM based PS5/XBOX next-gen console CPU.

I specifically wanted to know if a big ARM core could have an advantage over mobile Ryzen and i3 and i5. Ryzen is not a pure mobile core like M3, perhaps K12 was supposed to address the mobile low-power part of AMD's strategy.

ARM Ares next-gen ARM core will launch this summer at Computex. I'm excited for the future of next-gen ARM cores to expand beyond phone and tablets, especially when they hit 7nm.

There wouldn't be much benefit. There's a reason K12 was shelved, despite already having running engineering samples.

The reality is that x86 and ARM these days barely have any difference. You are looking at maybe a mm^2 of saving on chip size and maybe 10% higher efficiency. And with every subsequent node this drops lower and lower, as the x86 parts of the chip become relatively smaller and less power hungry since they don't become more complex.

DeletedMember377562 · Feb 16, 2018

Thala said:
Lol, and this is because you say so? Provided evidence is no prerequisite/necessary condition for the truth of a statement. Your statement is irrational.

Actually provided evidence is exactly a prerequisite for truth of statement. My statement is completely rational, unlike yours. You're making prediction claims about the performance of a future architecture based on some evidence you say you saw yourself, personally, but refuse to provide us with that same evidence. If you can't see the issue here then you have some serious issues.

el etro said:
I still think at 2.9Ghz with this IPC on final product. Anyway is still a dead end, being tied to buy a Samsung or a Apple phone to have this kind of performance is far from what we want.

Doesn't matter. According to Thala, it's completely fine to have a 60-75% single core + ~15% multi core disadvantage, as long as we have 10% GPU advantage on the SD845...

eastofeastside · Feb 16, 2018

CatMerc said:
There wouldn't be much benefit. There's a reason K12 was shelved, despite already having running engineering samples.

The reality is that x86 and ARM these days barely have any difference. You are looking at maybe a mm^2 of saving on chip size and maybe 10% higher efficiency. And with every subsequent node this drops lower and lower, as the x86 parts of the chip become relatively smaller and less power hungry since they don't become more complex.

I appreciate what you are saying about ARM versus x86 efficiency differences.

I still have another question from another perspective, though. Ryzen is a server/desktop core from which the lower grade, mobile versions are binned. Take Jaguar and Atom as example of cores architected specifically for low power, isn't there a significant tdp/mm2 advantage gained from using a dedicated low power chip for a low power application, versus using a bigger mobile variation of a desktop grade core clocked down for low power use?

Seeing as Jaguar is dead and Atom isn't an option for a console, would an ARM core made specifically for the thermal range of a console have a significant mm2/tdp advantage over trying to adapt Ryzen for a console application?

Maybe K12 was supposed to be the lower power solution for AMD in the place of Jaguar. I hope it pops back on the radar before too long.

Thala · Feb 16, 2018

CatMerc said:
The reality is that x86 and ARM these days barely have any difference. You are looking at maybe a mm^2 of saving on chip size and maybe 10% higher efficiency. And with every subsequent node this drops lower and lower, as the x86 parts of the chip become relatively smaller and less power hungry since they don't become more complex.

Where are these numbers coming from? From my experience with both architectures, the efficiency deviation is much higher. Similarly with node shrinks: what I am seeing is that the gap is not closing, due to the higher impact of leakage. Case in point, the uop cache helps to save dynamic power but increases leakage. x86 will never become an efficient architecture due to inherent flaws, which can only be worked around at increasingly higher cost. On a very abstract level the latest ARM and x86 architectures might look similar, but they look very different when you look into the actual microarchitecture.

CatMerc · Feb 16, 2018

Thala said:
Where are these numbers coming from? From my experience with both architectures, the efficiency deviation is much higher. Similar with node drops, what i am seeing is, that the gap is not closing due higher impact of leakage. Point in case, uop cache helps to save dynamic power but you have increased leakage. x86 will never become an efficient architecture due to inherent flaws, which only can be worked around with increasing higher cost. On a very abstract level the latest ARM and x86 architectures might look similar, but they look very different when you look into the actual micro architecture.

The decode part is absolutely tiny, and it doesn't get any more complex with a node shrink or architectural update. It just shrinks, so the relative area it takes shrinks with each generation.

As for power efficiency, this guy knows things about K12 that aren't public.
Considering K12 is as close to an x86 design converted to ARM (or vice versa) as we have, it speaks volumes about what the difference is these days. The differences you're talking about are true for simple x86 designs, but Intel and AMD, with their decades of iteration on x86, managed to close the gap to the point where it really doesn't matter. The uOp cache was the final nail in the coffin for the ARM vs x86 difference. FinFETs dropped leakage a LOT too.

ARM will still be best for small cores, but x86 isn't going anywhere for tablet and higher performance.

eastofeastside said:
I appreciate what you are saying about ARM versus x86 efficiency differences.

I still have another question from another perspective, though. Ryzen is a server/desktop core from which the lower grade, mobile versions are binned. Take Jaguar and Atom as example of cores architected specifically for low power, isn't there a significant tdp/mm2 advantage gained from using a dedicated low power chip for a low power application, versus using a bigger mobile variation of a desktop grade core clocked down for low power use?

Seeing as Jaguar is dead and Atom isn't an option for a console, would an ARM core made specifically for the thermal range of a console have a significant mm2/tdp advantage over trying to adapt Ryzen for a console application?

Maybe K12 was supposed to be the lower power solution for AMD in the place of Jaguar. I hope it pops back on the radar before too long.

The main objective of Jaguar and Atom was being small and cheap, not just power efficiency. Zen and modern Intel cores can scale down in power well enough to the point where they're often more efficient than the small cores. The problem is they're bigger, and therefore costlier to make. It's actually the main reason why Apple get so far ahead in both performance and efficiency. A big core isn't necessarily less efficient, in fact with the right engineering it can be even more efficient. But it will cost more.

For AMD, they'd rather eat the per chip costs of having a bigger core than the costs of designing a new small core just for consoles.

Thala · Feb 16, 2018

generalako said:
Doesn't matter. According to Thala, it's completely fine to have a 60-75% single core + ~15% multi core disadvantage, as long as we have 10% GPU advantage on the SD845...

Do not put words in my mouth! I never made a general statement like this.

What i said was:

Personally i would value a better GPU performance higher, than better single core performance.

CatMerc · Feb 16, 2018

Thala said:
Do not put words in my mouth! I never made a general statement like this.

What i said was:

While I disagree about the GPU notion, I do agree that we should be more careful about interpretation of comments. It just degrades discussion quality otherwise. +1

Thala · Feb 16, 2018

CatMerc said:
The decode part is absolutely tiny, and it doesn't get any more complex with a node shrink or architectural update. It just shrinks, so the relative area it takes shrinks with each generation.

First if you are going wider you eventually need to add decoders so you have to scale the frontend along with backend. Second the decoders are monsters compared to ARM decoders in particular if you include the uop cache. Missing in the cache costs you additional delays making the pipeline frontend-bound much more often than on ARM. There are lots of other issues with x86 on top like memory model - which also impacts cache coherency implementation, small architectural register set, memory operands, atomic operations, descriptor tables, segmentation etc. Many of the things, which were okayish in the seventies you still find today only in x86.

CatMerc said:
As for power efficiency, this guy knows things about K12 that aren't public.

Ok. Rumored power numbers while running an unknown use case on an unreleased architecture. That does not sound particularly credible. To my knowledge, K12 was never really finished.

CatMerc said:
FinFETs dropped leakage a LOT too.

Yes, FinFETs would have lower leakage compared to a similarly small planar process. However, the truth is that leakage went up from 28nm planar TSMC to 14nm FinFET Intel. It would have increased more when going down to a hypothetical planar 14nm process, but that was not my point.

CatMerc said:
ARM will still be best for small cores, but x86 isn't going anywhere for tablet and higher performance.

My point is that we will see ARM cores that are faster than anything x86, at lower power, in the not too distant future. Whether they will easily penetrate the Windows desktop market is a different question. I assume it also depends on how well Microsoft plays its cards with Windows on ARM.

Nothingness · Feb 16, 2018

Thala said:
There are lots of other issues with x86 on top like memory model - which also impacts cache coherency implementation, small architectural register set, memory operands, atomic operations, descriptor tables, segmentation etc. Many of the things, which were okayish in the seventies you still find today only in x86.

As a side note, I bet we'll see more ARM cores do "magic" D to I cache snooping. JIT has become too prevalent to ignore the cost of explicit cache maintenance.

eastofeastside · Feb 16, 2018

CatMerc said:
The main objective of Jaguar and Atom was being small and cheap, not just power efficiency. Zen and modern Intel cores can scale down in power well enough to the point where they're often more efficient than the small cores. The problem is they're bigger, and therefore costlier to make. It's actually the main reason why Apple get so far ahead in both performance and efficiency. A big core isn't necessarily less efficient, in fact with the right engineering it can be even more efficient. But it will cost more.

For AMD, they'd rather eat the per chip costs of having a bigger core than the costs of designing a new small core just for consoles.

K12? Was K12 supposed to be the "new small core"? It was obviously designed to a high level before being shelved. Is expecting K12 or a future variant in consoles, low power Windows laptops, or Chromebooks a stretch?

CatMerc · Feb 16, 2018

eastofeastside said:
K12? Was K12 supposed to be the "new small core"? It was obviously designed to a high level before being shelved. Is expecting K12 or a future variant in consoles, low power Windows laptops, or Chromebooks a stretch?

I never said K12 was supposed to be a small core. As for using it on consoles, would make backwards compat harder, which was one of the reasons both Sony and MS moved to x86. When a new gen arrives, backwards compat would be far easier. Especially MS who is moving away from the traditional generations model, and moving more towards having something like the smartphones model.

Low power windows laptops maybe, but that would be assuming there's a benefit for it over normal Zen. Chromebooks maybe.

AMD shelved it as a product, it will only make an appearance if a customer orders it as a semi-custom product, per Lisa's words.

Thala said:
Second the decoders are monsters compared to ARM decoders in particular if you include the uop cache.

https://en.wikichip.org/w/images/0/0e/amd_zen_core_die.png
Out of a 7mm^2 Zen core, the decode area uOp cache included is 0.865mm^2. And out of a 213mm^2 chip that is the final Zeppelin 8 core design, the total cost is 6.92mm^2.
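
Worked out as fractions, using only the area figures quoted above (the variable names are illustrative, and the areas are the estimates from this post, not official numbers):

```python
# Area figures as quoted in the post (mm^2).
core_area = 7.0      # one Zen core
decode_area = 0.865  # decode + uOp cache within one core
die_area = 213.0     # full 8-core Zeppelin die
cores = 8

total_decode = decode_area * cores     # decode area summed over all cores
share_of_core = decode_area / core_area
share_of_die = total_decode / die_area

print(f"decode total: {total_decode:.2f} mm^2")
print(f"share of one core: {share_of_core:.1%}")
print(f"share of the die:  {share_of_die:.1%}")
```

So by these numbers decode plus uOp cache is roughly an eighth of a core, and a few percent of the whole die.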

Thala said:
From my knowledge K12 was never really finished.

K12 got to the running engineering sample stage. It just never turned into a product since AMD didn't see the value.

Thala said:
My point is that we will see ARM cores, which are faster than anything x86 at lower power in the not too distant future.

I completely disagree. Soon enough they will all meet the same dead end ILP extraction wise that Intel (and soon AMD) met. It's easier to follow than to trail blaze, don't expect the meteoric performance jumps we see right now to continue for long. There isn't much more that can be improved hardware wise for absolute performance, not without completely blowing up power budgets to the point where just adding more cores is far more efficient.

And even though it appears like Apple and soon Samsung are gaining on Intel and AMD, remember that the two have been stuck on 14nm for years now, while Apple and Samsung are getting the benefits of a node shrink. Once Intel and AMD move to 7nm (Well, 10nm for Intel), the bar will be set higher for ARM designs to beat.
