Speculation: Ryzen 4000 series/Zen 3


Richie Rich

Senior member
Jul 28, 2019
470
229
76
Ah, but at what clock frequency does power consumption jump through the roof due to the uArch's mobile optimisations?

And we already know the intrinsic vector length limits of NEON SIMD are below those of AMD, let alone Intel with AVX-512.

Of course this will change in the future with SVE2, but that is then; this is now.

There still seems to be a gulf between benchmarking the two platforms in a way that respects all possible performance avenues, and vector/SIMD length is a big one in certain use cases.
Do you want to question the SPEC2006 benchmark, likely the best professional multi-platform benchmark ever?

I'm not saying ARM is better at everything right now. However, I do say that ARM has:
  1. the first 6xALU core in the world (Apple A11, A12 and A13),
  2. ARM will have SVE2 with scalable 128-2048-bit vector width soon,
  3. ARM performance grows faster than x86's,
  4. and the last thing is the Cortex A78/A79 (a future generic 6xALU core + 2048-bit FPU), which can be produced by many Chinese manufacturers for a quarter of the Ryzen price.

AMD should bring a 6xALU core in Zen 3 to eliminate the gap to Apple.
 

Yotsugi

Golden Member
Oct 16, 2017
1,029
487
106
Do you want to question the SPEC2006 benchmark, likely the best professional multi-platform benchmark ever?

I'm not saying ARM is better at everything right now. However, I do say that ARM has:
  1. the first 6xALU core in the world (Apple A11, A12 and A13),
  2. ARM will have SVE2 with scalable 128-2048-bit vector width soon,
  3. ARM performance grows faster than x86's,
  4. and the last thing is the Cortex A78/A79 (a future generic 6xALU core + 2048-bit FPU), which can be produced by many Chinese manufacturers for a quarter of the Ryzen price.

AMD should bring a 6xALU core in Zen 3 to eliminate the gap to Apple.
Are you a seronx sockpuppet?



Insults in tech are not allowed.


esquared
Anandtech Forum Director
 
Last edited by a moderator:

soresu

Platinum Member
Dec 19, 2014
2,956
2,173
136
AMD should bring a 6xALU core in Zen 3 to eliminate the gap to Apple.
They don't really compete with Apple; at best they do with the new iPadOS-powered iPads, but that is about it - and people buying into that will buy Apple anyway, so it really does them no good to chase Apple specifically.

Their main goals are to claw back market share from Intel in CPUs and Nvidia in GPUs.

If anything, Apple is a customer rather than a competitor, given that Vega 12 exists solely for their benefit, and the Vega II Duo as well if I'm not mistaken.
 

soresu

Platinum Member
Dec 19, 2014
2,956
2,173
136
ARM will have SVE2 with scalable 128-2048-bit vector width soon,
and the last thing is the Cortex A78/A79 (a future generic 6xALU core + 2048-bit FPU)
Processing vector lengths above 512 bits does not come without a serious cost in both power and area, regardless of the instruction set design.

At best we may see 512 bit SVE2 in an ARM big core for mobiles and laptops, but even that is a stretch for current process constraints.

256 bit SVE2 is a solid improvement over 128 bit NEON for big cores, with 128 bit SVE2 for little cores for the foreseeable future.

1024- and 2048-bit SVE2 implementations will remain the preserve of custom core systems like Fujitsu's 'post-K' machine, which is pioneering the first SVE implementation.
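Worth noting how SVE makes that whole 128-2048-bit range workable in the first place: code is vector-length agnostic, so the same loop runs at whatever width the silicon implements. A rough Python simulation of the predicated loop structure (illustrative only; this mimics the whilelt-style tail predicate, it is not real intrinsics):

```python
def vla_add(a, b, vl_bytes):
    """Simulate an SVE-style vector-length-agnostic loop: the same 'code'
    produces the same result for any hardware vector length."""
    n = len(a)
    lanes = vl_bytes // 4          # 32-bit lanes per vector
    out = [0] * n
    i = 0
    while i < n:
        # whilelt-style predicate: only the active lanes of the tail execute
        active = min(lanes, n - i)
        for lane in range(active):
            out[i + lane] = a[i + lane] + b[i + lane]
        i += lanes
    return out

a, b = list(range(10)), list(range(10))
# identical results whether the hardware is 128-bit or 2048-bit wide
assert vla_add(a, b, 16) == vla_add(a, b, 256) == [2 * x for x in a]
```

That portability is why a vendor can ship a narrow 128-bit little core and a wide big core without recompiling anything.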
 

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Apple is winning at IPC not by its width, but by its frontend and its big low-latency caches.
You sound like Dirk Meyer, trying to concentrate on everything except the execution units. As Keller said: "They tried to boil the ocean." It's a pretty exact description: they tried to increase performance without adding execution units. That is not feasible.

I don't understand this cache obsession. Anybody who talks about INCREASING performance via cache is terribly wrong. Cache is not an execution unit; it's just supporting logic that only DECREASES bottlenecks. 2x cache = +10% perf, another 4x cache = +5%, and in the limit it goes to zero. Wasted transistors. On the other side, execution units scale almost linearly. The A12 has 158% of Skylake's IPC in SPECint. This cannot be achieved with 4xALU even if you give it 1 GB of L1 cache (it is like trying to increase the performance of a Toyota Corolla's four-cylinder engine by fitting the intake filter from a Challenger Demon).

If you want to radically step up performance, you need to step up the number of execution units and build the rest of the CPU around them accordingly.
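For what it's worth, the diminishing-returns arithmetic claimed here does match the classic "square-root rule" toy model of cache miss rates, where miss rate falls roughly with the square root of capacity (the base size, penalty factor, and the rule itself are illustrative assumptions, not measurements of any real core):

```python
import math

def rel_perf(cache_kb, base_kb=512, miss_penalty=0.3):
    """Toy model: relative miss rate follows the 'square-root rule'
    (miss ~ 1/sqrt(size)); performance loss is proportional to misses.
    All numbers are illustrative, not measured data."""
    miss = math.sqrt(base_kb / cache_kb)   # 1.0 at the base size
    return 1.0 / (1.0 + miss_penalty * miss)

base = rel_perf(512)
for size_kb in (1024, 4096, 16384):
    gain = rel_perf(size_kb) / base - 1
    print(f"{size_kb // 1024} MB: {gain:+.1%} vs 512 KB")
```

Each doubling buys a smaller gain than the last, which is the shape of the "2x = +10%, 4x more = +5%" claim; whether the absolute numbers hold for any given core is a different question.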
 
Last edited:

soresu

Platinum Member
Dec 19, 2014
2,956
2,173
136
You sound like Dirk Meyer, trying to concentrate on everything except the execution units. As Keller said: "They tried to boil the ocean." It's a pretty exact description: they tried to increase performance without adding execution units. That is not feasible.

I don't understand this cache obsession. Anybody who talks about INCREASING performance via cache is terribly wrong. Cache is not an execution unit; it's just supporting logic that only DECREASES bottlenecks. 2x cache = +10% perf, another 4x cache = +5%, and in the limit it goes to zero. Wasted transistors. On the other side, execution units scale almost linearly. The A12 has 158% of Skylake's IPC in SPECint. This cannot be achieved with 4xALU even if you give it 1 GB of L1 cache (it is like trying to increase the performance of a Toyota Corolla's four-cylinder engine by fitting the intake filter from a Challenger Demon).

If you want to radically step up performance, you need to step up the number of execution units and build the rest of the CPU around them accordingly.
The fact that the Cortex A77 does not immediately match Apple's 6-wide Ax designs contradicts your statement.

Hell, the A76 is 4-wide and it competes favourably with Samsung's first-gen 6-wide M3 design.

There is a lot more to CPU design than the mere number of execution units it has; that much is clear from the differences we see in the market.
 

soresu

Platinum Member
Dec 19, 2014
2,956
2,173
136
Do you want to question the SPEC2006 benchmark, likely the best professional multi-platform benchmark ever?
Did any 256 bit SIMD implementations exist in 2006?

Let alone 512 bit....

Benchmarks need to be updated to reflect both the uArch capabilities in the market and the workloads they are used to run; otherwise you are missing a significant part of the reason for benchmarks in the first place.
 

soresu

Platinum Member
Dec 19, 2014
2,956
2,173
136
This cannot be achieved with 4xALU even if you give it 1 GB of L1 cache
L1 is typically tiny, even by the standards of the now-defunct floppy disk and Bill Gates' old 640 KB quote.

I doubt that raising it above even 1 MB per core would improve any CPU design, to say nothing of the fact that 1 GB of SRAM would be larger than most GPUs, given how area-inefficient it is.
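A quick back-of-the-envelope supports the area point. Assuming TSMC's published 7nm high-density 6T SRAM cell of roughly 0.027 µm² (real arrays add sense amps, decoders and redundancy on top, easily 1.5-2x):

```python
# Back-of-the-envelope: raw cell area of 1 GiB of SRAM on 7nm.
CELL_UM2 = 0.027                       # TSMC 7nm HD 6T cell, ~0.027 um^2 (published figure)
bits = 1 * 1024**3 * 8                 # 1 GiB in bits
cell_area_mm2 = bits * CELL_UM2 / 1e6  # um^2 -> mm^2
print(f"raw cell area: {cell_area_mm2:.0f} mm^2")
```

That lands around 230 mm² for the bit cells alone, before any array overhead or the fact that an L1 needs far less dense, far faster cells, so a 1 GB L1 really is in GPU-die territory and beyond.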
 

Thunder 57

Platinum Member
Aug 19, 2007
2,811
4,094
136
L1 is typically tiny, even by the standards of the now-defunct floppy disk and Bill Gates' old 640 KB quote.

I doubt that raising it above even 1 MB per core would improve any CPU design, to say nothing of the fact that 1 GB of SRAM would be larger than most GPUs, given how area-inefficient it is.

Not to mention it goes completely against the purpose of a cache: to be small, very fast, and very low latency. There is a reason it is broken down into levels. I mean, I think everyone expects Zen to get wider, but it has to be done in a way that makes sense.
 
Reactions: Tlh97 and soresu

itsmydamnation

Platinum Member
Feb 6, 2011
2,863
3,413
136
You sound like Dirk Meyer, trying to concentrate on everything except the execution units.
You sound like a simpleton who latched onto the only difference he can see.

As Keller said: "They tried to boil the ocean." It's a pretty exact description: they tried to increase performance without adding execution units. That is not feasible.
This isn't even close to an accurate description of how and why AMD went with Bulldozer, and, sad as it is, being only a 2-wide design (some limited instructions could also be executed on the AGUs) wasn't even close to being its biggest failing.

I don't understand this cache obsession. Anybody who talks about INCREASING performance via cache is terribly wrong.
Way to miss the point. The point was about keeping the core fed by having data loaded into it before it is actually needed (prefetch/predict); cache just helps that happen faster if there is some form of locality. That's why there are things like L2/L3 stream prefetchers that don't even know what the core is doing; they just pull data based off memory address patterns.
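That address-pattern-only behaviour can be sketched in a few lines. A minimal stride detector (illustrative Python, not any real design; real prefetchers track many streams with confidence counters):

```python
class StridePrefetcher:
    """Minimal sketch of a stream/stride prefetcher: it sees only the
    address sequence, detects a repeated constant stride, and fetches
    ahead of the stream -- it never knows what the core is computing."""
    def __init__(self, degree=2):
        self.last = None
        self.stride = None
        self.degree = degree  # how many lines ahead to fetch

    def observe(self, addr):
        prefetches = []
        if self.last is not None:
            stride = addr - self.last
            if stride == self.stride and stride != 0:
                # pattern confirmed twice: issue prefetches ahead of it
                prefetches = [addr + stride * (i + 1) for i in range(self.degree)]
            self.stride = stride
        self.last = addr
        return prefetches

pf = StridePrefetcher()
for a in (0x1000, 0x1040, 0x1080):        # a 64-byte-stride stream
    hints = pf.observe(a)
print([hex(h) for h in hints])            # prefetch 0x10c0 and 0x1100 ahead of the stream
```

Two observations establish the stride; from the third access on, the lines arrive before the core asks for them.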

Cache is not an execution unit; it's just supporting logic that only DECREASES bottlenecks. 2x cache = +10% perf, another 4x cache = +5%, and in the limit it goes to zero.
Why do you keep going on about cache? Are you deliberately mischaracterising my post, or can you just not read and comprehend properly? (If English isn't your first language this is actually understandable.)

On the other side, execution units scale almost linearly. The A12 has 158% of Skylake's IPC in SPECint. This cannot be achieved with 4xALU even if you give it 1 GB of L1 cache (it is like trying to increase the performance of a Toyota Corolla's four-cylinder engine by fitting the intake filter from a Challenger Demon).
This is just gibberish.



If you want to radically step up performance, you need to step up the number of execution units and build the rest of the CPU around them accordingly.

Now it's time for the money shot...

For the following I'm going to assume the A12's performance is relatively the same in 2019 as in 2006, because I can't find 2019 data for the A1* to check performance relative to Xeon v4.

spec 2k19 dynamic instruction distribution

Isn't it funny how in INTRate 60% of all instructions contain a memory operation, and in most cases ~40% are loads? If you then look at the instruction breakdown (adds, subs, compares), Zen can execute 4 of any of those per cycle with a latency of 1 cycle. If you read the PDF you will see that for the vast majority of SPECrate, IPC is 1.5 or below.

So I challenge you to explain how simply going to 6 ALUs from 4 will improve IPC when single-threaded integer performance is so heavily tied to memory performance. If you miss, it's 100 ns to memory; if you're operating at 3 GHz, that's 300 cycles the CPU is waiting. Consider that for the vast majority of actual ALU operations in INTRate, Zen has an r,r latency of 1 and a throughput of 4, and explain how making that 6 will dramatically increase IPC.

Your 6-wide ALU fetish is ignorant.
I would suggest that if either Intel or AMD could make the JMP instruction take 1 cycle instead of 2, that would make a far bigger performance impact than simply being able to execute 6 simple ALU ops per cycle!
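The arithmetic in that argument can be checked with a toy model (the MPKI figure is invented for illustration, and no miss overlap/MLP is modeled, which deliberately overstates the stall; the point is the ratio of stall cycles to issue slots):

```python
# 100 ns to DRAM at 3 GHz really is 300 cycles of waiting.
freq_ghz = 3.0
mem_ns = 100
stall_cycles = mem_ns * freq_ghz          # cycles lost per miss
print(f"cycles lost per miss: {stall_cycles:.0f}")

# Toy IPC model: instructions issue at `width` per cycle, plus a full
# stall per miss (no overlap modeled -- a deliberate simplification).
def effective_ipc(width, misses_per_1k_instr):
    cycles = 1000 / width + misses_per_1k_instr * stall_cycles
    return 1000 / cycles

for width in (4, 6):
    print(f"{width}-wide at 2 MPKI: IPC = {effective_ipc(width, 2):.2f}")
```

With even a couple of misses per thousand instructions, the memory term dwarfs the issue term, so widening 4 to 6 moves effective IPC only modestly, which is the post's point.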
 

Thunder 57

Platinum Member
Aug 19, 2007
2,811
4,094
136
...Why do you keep going on about cache? Are you deliberately mischaracterising my post, or can you just not read and comprehend properly? (If English isn't your first language this is actually understandable.)...

I would have to guess that English is not his first language; at least it sure seems that way. So I have a feeling some stuff is getting "lost in translation".
 

krumme

Diamond Member
Oct 9, 2009
5,956
1,595
136
Cost and available space.
At 7nm+, is there even space for a core that much wider on AM4?

History is full of examples of wider cores not living up to expectations. M3 vs A76 was mentioned, but A73 vs A75 is also a candidate IMO. There are successful examples too.

But when did we last see a revamped, modern cache subsystem fail to deliver?
 

soresu

Platinum Member
Dec 19, 2014
2,956
2,173
136
A73 vs A75 is also a candidate IMO
Eh?

The A75 was a follow-on from the A73, from the same design team (Sophia).

It barely increased power draw per MHz, but got a solid IPC increase thanks to going from 2- to 3-wide - no doubt with other optimisations too; it's been a while, so they don't come to mind easily.

Admittedly, some of that meager power draw increase may have been due to the better process of the A75 vs the A73 (10nm vs 14/16nm).

Gotta hand it to those Sophia magicians though; they are great at the perf/watt efficiency game. It leaves me interested to see what we will get from them after Hercules with Matterhorn, assuming it's their design, that is.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,688
1,222
136
At 7nm+, is there even space for a core that much wider on AM4?
EUV 7nm is more route-efficient than DUV 7nm, with DUV having its best routing at 9-track and EUV at 6-track. The 7.5T-HPC library for Zen 2 must be heavily inefficient in area for that reason.

---
Jaguar design goals:
- Improve on "Bobcat" performance in a given power envelope:
  – More IPC
  – Better frequency at a given voltage
  – Improved power efficiency through clock gating and unit redesign
- Update the ISA/feature set
- Increase process portability

14h -> 16h is the minimum bar for 17h -> 19h. A similar case is 15h 30h-3Fh to 15h 60h-7Fh.
Bobcat to Jaguar => >15% IPC improvement + >10% frequency improvement.
Steamroller to Excavator => 5% IPC improvement + 40% less power + 23% less area.
~Both of which were touted as power-efficiency improvements.

78CPP/9T (576nm) to 64CPP/7.5T (300nm), vs the same (576nm) to 54CPP/6T (240nm). However, DUV to EUV plus the track-height reduction should allow for an average closer to 50%, putting it at a normal node shrink, with N5's 6T or 5T giving density equivalent to going from 14nm, skipping 10nm, to 7nm.

Static execution width is inefficient with node shrinks on FinFETs. Either go wide or die (R.I.P. Cannon Lake, Ice Lake, Tiger Lake OG cores).



The biggest shrink for XV from 13T to 9T came from the FPU [left of the red square]: the FPU scheduler shrank 40% and the FMAC 35%. Include that with the shrink of the integer units and there is definitely room for a second replication of the integer execution's 4 ALUs/3 AGUs [red square].

The store-load allocation queues/L0ds would probably have the same latency as Neoverse E1 (2-cycle); I don't know whether the load-store queue/L1d latency will increase to 6-cycle, but I doubt it.
 
Last edited:
Reactions: lightmanek

amd6502

Senior member
Apr 21, 2017
971
360
136
The interesting thing there is that going from 2 to 3 threads seems to show good gains; it's going from 3 to 4 where it really diminishes to the point of not being worth it.

It probably is just the pipelines getting saturated. When available execution units are hard to come by, adding a thread only gets small gains.


Thanks. But I think there are two issues with what you just said. One, compared to the DEC Alpha simulations, Zen's SMT-2 yield is low (55% lower). So you need to divide the SMT-2 to SMT-3 yield difference by 2, which means it is less significant than it appears.

Yes, this makes sense, as the EV8 seems to be really wide (6+4 integer units) if I am understanding this right. Richie Rich mentioned before that typical code spends most of its time making do with 2 ALUs (which is why the Bulldozer designers made the "smart" choice of going 2-ALU wide in its int core). Zen is only 4+2 wide, so it is going to be closer to saturating the execution units compared to the 6+4-wide EV8 running SMT2.

We know Zen 2 is 4+3 wide (and due to the added AGU, we expect its SMT2 yield to increase slightly over Zen's).

I think for Zen 3 it would be nice to again increment, this time adding a simple ALU (bit operations, compares, and addition only). That shouldn't add too many transistors or much complexity, and should be pretty efficient.

I could see Zen 4 going 6+4 wide like its Alpha EV8 ancestor, especially if going SMT3/SMT4. If remaining SMT2/SMT2+, maybe only 6+3 wide.

Going from 4 ALUs to 6 ALUs would give a bigger boost to multithread than to single-thread. However, both 5-ALU and 6-ALU should still give a bit of ST uplift over Zen 2.

Cost and available space.
At 7nm+, is there even space for a core that much wider on AM4?

Absolutely; the denser process is what allowed them to double the FPU for the 7nm generation. Adding a pipe or two should add far fewer transistors than the doubled-up FPU did.
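The saturation intuition above fits a one-line toy model: each thread can usefully demand only so much issue bandwidth, and the core retires at most its width per cycle (the per-thread demand figure here is invented purely to illustrate the shape of the curve):

```python
# Toy SMT saturation model: each thread offers ~1.3 IPC of demand and the
# core can retire at most `width` IPC total. Numbers are illustrative only.
def smt_throughput(threads, width=4, per_thread_ipc=1.3):
    return min(threads * per_thread_ipc, width)

for t in (1, 2, 3, 4):
    print(f"SMT{t}: {smt_throughput(t):.1f} IPC")
```

The first two added threads each bring a full increment; the fourth thread slams into the width ceiling and adds almost nothing, matching the "3 to 4 is not worth it" observation.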
 
Last edited:

Gideon

Golden Member
Nov 27, 2007
1,710
3,927
136
Thanks, this is an excellent argument for Zen 3 being an optimization round. Some form of advanced TAGE predictor is assumed to be standard in CPUs by Intel and Apple already, and lagging AMD only introduced it in Zen 2 (and even there it was mentioned as one of the parts originally intended for Zen 3). The predictor needs to improve to be able to use the prefetcher more efficiently. A lot of the patents DisEnchantment previously mentioned in this thread are based around cache handling improvements that only make sense in conjunction with significant improvements in the predictor and prefetcher logic. This also meshes well with the leaked announcement that the two CCXes' L3$ on each CCD will be "unified" in Zen 3, which IMO in the light of the above information is not necessarily a statement about the topology but could be more about significant changes in the cache handling.

AnandTech's Andrei mentioned that the Ryzen 3xxx prefetcher/predictor is the best available on desktop currently (in both tweets and the review). Now, obviously that's compared to Skylake rather than Ice Lake, but how on earth is that lagging behind?
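For anyone following the predictor discussion without the background: TAGE-class predictors are elaborate multi-table, geometric-history designs, but the core idea of indexing counters by branch history can be shown with their much simpler ancestor, gshare (a tiny illustrative sketch, nothing like a production predictor):

```python
class GShare:
    """Tiny gshare branch predictor: XOR of the PC with a global history
    register indexes a table of 2-bit saturating counters. A far simpler
    relative of the TAGE family discussed above; illustrative only."""
    def __init__(self, bits=10):
        self.mask = (1 << bits) - 1
        self.table = [1] * (1 << bits)   # start weakly not-taken
        self.history = 0

    def predict(self, pc):
        return self.table[(pc ^ self.history) & self.mask] >= 2

    def update(self, pc, taken):
        i = (pc ^ self.history) & self.mask
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)
        self.history = ((self.history << 1) | taken) & self.mask

bp = GShare()
hits = 0
for n in range(1000):                 # a loop branch, taken 9 of every 10 times
    taken = (n % 10) != 9
    hits += bp.predict(0x400) == taken
    bp.update(0x400, int(taken))
print(f"accuracy: {hits / 1000:.0%}")
```

Because the history register uniquely identifies the loop's phase, the predictor learns the exit branch too, not just the common direction; TAGE generalises this by keeping many tables with different history lengths and picking the longest matching one.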
 
Last edited:

Richie Rich

Senior member
Jul 28, 2019
470
229
76
Isn't the Cortex A77 already a 6-wide design?
You mean 6-wide in the INT core. The A77 has 4 ALUs + 2 dedicated branch units; the A12's Vortex has 6 ALUs, of which 2 are combined ALU/branch. The main difference between those cores is that Apple's 6xALU core is a brand-new design built around its 6 ALUs, whereas the A77 is just a beefed-up 4xALU A76, limited by the rest of the core (AGUs and front-end).

The fact that the Cortex A77 does not immediately match Apple's 6-wide Ax designs contradicts your statement.

Hell, the A76 is 4-wide and it competes favourably with Samsung's first-gen 6-wide M3 design.

There is a lot more to CPU design than the mere number of execution units it has; that much is clear from the differences we see in the market.
Yeah, keep telling me why it's not possible to do it.

Did any 256 bit SIMD implementations exist in 2006?

Let alone 512 bit....

Benchmarks need to be updated to reflect both the uArch capabilities in the market and the workloads they are used to run; otherwise you are missing a significant part of the reason for benchmarks in the first place.
It's even more impressive when a brand-new core beats competitors in 13-year-old code. It proves that a wide core is faster everywhere, not only in, for example, AVX-friendly code.


This isn't even close to an accurate description of how and why AMD went with Bulldozer, and, sad as it is, being only a 2-wide design (some limited instructions could also be executed on the AGUs) wasn't even close to being its biggest failing.
I guess you're trying to say Bulldozer's biggest failing was the cache or the AGUs, right? Half the execution units (2xALU) compared to 4xALU Haswell - that's negligible. I wonder why they begged Keller to help them design a brand-new 4xALU core. I'm open to reading your analysis of how to make a 2xALU core faster than a Zen core. Do not hesitate to write it down, please.


Way to miss the point. The point was about keeping the core fed by having data loaded into it before it is actually needed (prefetch/predict); cache just helps that happen faster if there is some form of locality. That's why there are things like L2/L3 stream prefetchers that don't even know what the core is doing; they just pull data based off memory address patterns.
I know how the memory system and prefetchers work. However, that is just feeding the core (thus eliminating a bottleneck). You could have the best caches, the fastest memory with super-low latency, the best prefetch ever, and tons of AGUs, and you would still have a terribly slow CPU because of just 2 ALUs. CPU engineers throughout history have tried to add more execution units: we went from the 2xALU Pentium Pro, to the 3xALU PIII, to 4xALU Haswell, to the 6xALU Apple A12. Every time they went from a 3xALU down to a 2xALU design (PIII -> P4, or K10 -> Bulldozer) it was a disaster with significant market loss. You are arguing with historically proven data, not with me.


So I challenge you to explain how simply going to 6 ALUs from 4 will improve IPC when single-threaded integer performance is so heavily tied to memory performance. If you miss, it's 100 ns to memory; if you're operating at 3 GHz, that's 300 cycles the CPU is waiting. Consider that for the vast majority of actual ALU operations in INTRate, Zen has an r,r latency of 1 and a throughput of 4, and explain how making that 6 will dramatically increase IPC.
Apple proved they can reach 158% of Skylake's IPC with a 6xALU core. There are always thousands of people saying that something is not possible, until somebody (like Jobs, Musk or Keller) does it and shows that it is. Well, given the choice, I believe Apple's engineers more than you. They have some significant performance results behind them. It's pure pragmatism.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
Okay you're definitely a seronx sockpuppet.

This is rather unfair to SOIronx; I have never seen him go aggressive towards those who "doubted" the SOI revolution or the inevitable return of the Jedi Bulldozer. He just continued to throw out random patent, job description, etc. links.
This specimen is much worse: when faced with facts, he responds with aggressive BS.

But it is good for all of us that he chose 6-ALU and not something more sensible. I mean, when AMD/Intel/IBM all don't run 6 ALUs, there must be more immediate bottlenecks than a straightforward increase in execution resources.


 

Yotsugi

Golden Member
Oct 16, 2017
1,029
487
106
This is rather unfair to SOIronx; I have never seen him go aggressive towards those who "doubted" the SOI revolution or the inevitable return of the Jedi Bulldozer. He just continued to throw out random patent, job description, etc. links.
This specimen is much worse: when faced with facts, he responds with aggressive BS.

But it is good for all of us that he chose 6-ALU and not something more sensible. I mean, when AMD/Intel/IBM all don't run 6 ALUs, there must be more immediate bottlenecks than a straightforward increase in execution resources.
Eh, let's just [Redacted] throw 8 ALUs in, like Zen 4.
The more, the merrier!



Profanity in tech is not allowed.


esquared
Anandtech Forum Director
 
Last edited by a moderator:

NostaSeronx

Diamond Member
Sep 18, 2011
3,688
1,222
136
¯\_(ツ)_/¯
x86 is limited to two macro-instructions per clock per thread. There is a ~2 barrier; good luck getting past it.

SMT2 = 4 ALUs
SMT4 = 8 ALUs

1x8 ALUs is bad.
1x6 ALUs is bad.
1x4 ALUs is the worst configuration to compete against what is coming in Matterhorn (Neoverse post-N2), the Willow Cove tock (TGL)/Golden Cove tick (ADL), etc.

AMD caused sharks, so they will get their due if they stagnate.
 
Last edited:
Reactions: Richie Rich