Speculation: Ryzen 4000 series/Zen 3

Saylick · Oct 15, 2019

Cardyak said:
There’s loads of potential for further increases, and that’s without radical redesigns needed.

Just some basic stuff off the top of my head

- More execution units (Doesn’t have to be ALU, can be AGU, LEA, FPU, etc...)
- Larger Caches
- Larger ROB, memory and scheduler buffers
- More ports to dispatch instructions to execution units and reduce back-end bottlenecks

Makes sense. In a post I wrote about 2 months ago, I thought that the number of micro-ops that could be dispatched was limiting the overall throughput of the core:
http://www.portvapes.co.uk/?id=Latest-exam-1Z0-876-Dumps&exid=threads/speculation-ryzen-4000-series-zen-3.2567589/post-39896162

That's not to say the 6 ops/cycle dispatch and 8 ops/cycle retire was the main bottleneck, but given how wide the core is and how many micro-ops the micro-op cache can deliver (8 ops per cycle), AMD could afford to increase the dispatch rate to match. That alone, in theory, gives a maximum of 33% more instructions in a given cycle, assuming nothing else is a bottleneck. Everything else is just making sure there are enough load/store resources and enlarging buffers to keep up.
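
As a quick sanity check on that 33% figure, here is a minimal sketch using only the two rates quoted above (6 ops/cycle dispatch, 8 ops/cycle from the micro-op cache). The function name `peak_gain` is made up for illustration, and the result is a theoretical ceiling, not a predicted IPC gain.

```python
# Figures taken from the post above; everything else is illustrative.
DISPATCH_NOW = 6    # ops/cycle the core dispatches today
OP_CACHE_RATE = 8   # ops/cycle the micro-op cache can deliver


def peak_gain(old_rate: int, new_rate: int) -> float:
    """Relative increase in peak ops/cycle when dispatch is widened."""
    return new_rate / old_rate - 1.0


gain = peak_gain(DISPATCH_NOW, OP_CACHE_RATE)
print(f"Upper bound on extra ops per cycle: {gain:.1%}")
```

This only holds if dispatch is the sole limiter, which is exactly the caveat in the post.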

Richie Rich · Oct 15, 2019

amd6502 said:
4ALU is quite good already and got Zen its 40%+ IPC gain.

The potential single-thread gains from going 4ALU to 5ALU (or 6ALU) are going to be much smaller. But here, even a ~5% IPC increase is going to count a lot. And for (SMT2) multithread IPC gains, it's bound to be double digits.

Why does the Apple A12 use 6xALU in a power-sensitive mobile CPU? Why would they let those ALUs sit idle?
Why is the Apple A12 +58% faster than Skylake-X in INT IPC? Isn't it because Apple found a way to feed them efficiently?

It looks like those 6xALU are not idling much. They are delivering pure performance.

With SMT2 it must be even easier to utilize all 6xALUs. It looks like the lowest-hanging fruit available.
It's like tuning a Corolla 4-cylinder engine for every last bit of horsepower. It's way cheaper and easier to buy a car with a V6 engine.

itsmydamnation · Oct 15, 2019

Richie Rich said:
Why does the Apple A12 use 6xALU in a power-sensitive mobile CPU? Why would they let those ALUs sit idle?
Why is the Apple A12 +58% faster than Skylake-X in INT IPC? Isn't it because Apple found a way to feed them efficiently?

It looks like those 6xALU are not idling much. They are delivering pure performance.

With SMT2 it must be even easier to utilize all 6xALUs. It looks like the lowest-hanging fruit available.
It's like tuning a Corolla 4-cylinder engine for every last bit of horsepower. It's way cheaper and easier to buy a car with a V6 engine.

Correlation != causation.
Back up your constant posting with some actual data, otherwise it's nothing but https://bit.ly/2Bc1eB8

Also, if Skylake SPECint has an average IPC of 1.5 (that includes memory ops) and A13 is 50% faster (so 2.25), how the hell are 8 pipelines (6 ALU, 2 load/store) "not idling much"? Like, actually answer a question for once!
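
For what it's worth, the arithmetic in that question can be sketched as below. It uses only the figures in the post (1.5 average IPC, 50% faster, 8 pipelines) and assumes, as a simplification, that each retired instruction occupies one pipeline for one cycle. The variable names are illustrative.

```python
# Figures quoted in the post; the one-instruction-per-pipeline-slot
# model is a simplifying assumption, not a measured result.
SKYLAKE_IPC = 1.5     # claimed average SPECint IPC, memory ops included
APPLE_SPEEDUP = 1.5   # "50% faster"
PIPELINES = 8         # 6 ALU + 2 load/store, as counted in the post

apple_ipc = SKYLAKE_IPC * APPLE_SPEEDUP
utilization = apple_ipc / PIPELINES

print(f"Implied IPC: {apple_ipc:.2f}")
print(f"Average share of pipelines busy per cycle: {utilization:.1%}")
```

On these numbers well under a third of the issue slots are busy on average, which is the point the question is making.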

Richie Rich · Oct 15, 2019

itsmydamnation said:
Correlation != causation.
Back up your constant posting with some actual data, otherwise it's nothing but https://bit.ly/2Bc1eB8

Also, if Skylake SPECint has an average IPC of 1.5 (that includes memory ops) and A13 is 50% faster (so 2.25), how the hell are 8 pipelines (6 ALU, 2 load/store) "not idling much"? Like, actually answer a question for once!

To be clear, you're talking about an IPC of 1.5 for CISC code, right? And do you know that the ALUs execute RISC-like instructions internally? It sounds like you're mixing two different things together. Jim Keller said that a current Intel CPU executes 3-6 instructions per clock. Is he wrong, and why?

Anyway, if the Apple A12 with 6xALU (+50% more ALUs than 4xALU Skylake) is about +58% faster in SPEC2006int, that means Skylake's ALUs idle more than Apple's. That's pretty impressive to me. It leads me to the idea that a 6xALU core is a very efficient way to increase IPC. I'm not saying it's easy to design such a state-of-the-art core. Engineering was always a harder job than selling burgers at McDonald's.

JoeRambo · Oct 15, 2019

Richie Rich said:
that means Skylake's ALUs idle more than Apple's.

Yeah, and that in turn means that the bottlenecks are elsewhere and your fetish with 6 ALU designs is full of uninformed bs.

DrMrLordX · Oct 15, 2019

Saylick said:
What I'm curious about is what else is there to do that would also give another 15% IPC on top of the 15% that Zen 2 brought. Larger registers? Another L/D unit? More ALUs?

We already know that there will be a shared L3 cache between CCX pairs in Zen3. That has all kinds of interesting performance implications. There may also be some performance improvements for the TAGE branch predictor that was launched a bit early with Zen2 (thanks AMD!). And um, some other stuff.

maddie · Oct 15, 2019

Why all the intense hate for Richie Rich posts?

All he's saying is that 6 ALU will be faster than 4 ALU IF the rest of the support structures (cache, decode, retire, etc.) are in place. Not that 6 is always better than 4 irrespective of anything else.

Atari2600 · Oct 15, 2019

maddie said:
Why all the intense hate for Richie Rich posts?

All he's saying is that 6 ALU will be faster than 4 ALU IF the rest of the support structures (cache, decode, retire, etc.) are in place. Not that 6 is always better than 4 irrespective of anything else.

Those transistors don't come for free.

If you have 4x 4ALU cores vs. 3x 6ALU cores - which is quicker? So which is the best use of budget?
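
One way to frame that trade-off is as a break-even point. The sketch below assumes, purely for illustration, that the 4x 4ALU and 3x 6ALU configurations cost the same transistor budget (as the question implies) and that a fully threaded workload scales perfectly across cores. Neither assumption comes from real die-area data, and `breakeven_per_core_gain` is a hypothetical helper.

```python
def breakeven_per_core_gain(narrow_cores: int, wide_cores: int) -> float:
    """Per-core throughput gain each wide core needs so that
    wide_cores of them match narrow_cores narrow cores in total."""
    return narrow_cores / wide_cores - 1.0


# Core counts from the post: 4x 4ALU cores vs. 3x 6ALU cores.
needed = breakeven_per_core_gain(narrow_cores=4, wide_cores=3)
print(f"Each 6ALU core must be {needed:.1%} faster to tie 4x 4ALU cores")
```

Under those assumptions the wider core only wins on total throughput if it is more than a third faster per core. In lightly threaded work the comparison tilts toward whichever core is faster on its own.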

maddie · Oct 15, 2019

Atari2600 said:
Those transistors don't come for free.

If you have 4x 4ALU cores vs. 3x 6ALU cores - which is quicker? So which is the best use of budget?

Of course they don't, but what is the path after you've optimized for the 4 ALU core? There comes a point when you will have to increase them. Are we there yet?

Intel, for one, is obviously paying the price for their lack of progress in transistor budgets and the resultant stagnation in IPC.

I've often mentioned the Soft Machines work, as I'm interested in its possible use of variable sized virtual cores assembled from simpler structures on the fly.

soresu · Oct 15, 2019

DrMrLordX said:
We already know that there will be a shared L3 cache between CCX pairs in Zen3.

The AMD slide displaying Zen3 unified L3 literally defines a CCX as a complex of cores bound by a shared L3; this effectively makes the new CCD a single CCX.

This will probably count only for the desktop/enthusiast/server parts though.

We may get a multi chiplet APU with a full 8 core CCD in the future, but that will be early 2021 at the earliest.

soresu · Oct 15, 2019

maddie said:
I've often mentioned the Soft Machines work, as I'm interested in its possible use of variable sized virtual cores assembled from simpler structures on the fly.

I was very interested at the time, but I feel that if it was quite as good as they were making it out to be, they would not have sold out to Intel so easily.

At the very least they would have pitched to Qualcomm, Samsung or some Chinese venture capital interests if it was all that.

Their quick capitulation felt like they had something, but with more of a "look, see what we have, come buy us, former employers, or we will go elsewhere!" vibe.

DrMrLordX · Oct 15, 2019

soresu said:
The AMD slide displaying Zen3 unified L3 literally defines a CCX as a complex of cores bound by a shared L3; this effectively makes the new CCD a single CCX.

It doesn't if there's an IF link between cores 1-4 and cores 5-8 just like in Zen2.

Atari2600 · Oct 15, 2019

maddie said:
Of course they don't, but what is the path after you've optimized for the 4 ALU core? There comes a point when you will have to increase them. Are we there yet?

AMD tried to change the programming model before.

Remind me how Bulldozer worked out again?

maddie · Oct 15, 2019

Atari2600 said:
AMD tried to change the programming model before.

Remind me how Bulldozer worked out again?

I heard it also took quite a few attempts before the Wrights finally flew.

Guru · Oct 15, 2019

birdie said:
The usual questions spring to mind:

Will it support currently existing motherboards (300/400/500 series chipsets)?

What kind of IPC increase are we talking about?

Will AMD manage to squeeze more frequencies?

What node will it use?

What will be its TDP?

Will it support AVX512 instructions?

When and if we can expect Ryzen 4000 CPUs with modern onboard graphics (e.g. Navi10/Navi20)?

We pretty much know most of these.
1. AMD said they are going to support the same chipset until 2020, so yeah, even Zen 3 should be compatible with current chipsets and mobos.
2. This is up in the air, but they won't be doing a Zen 3 design if they can't get at least 5% more IPC over Zen 2.
3. Of course. TSMC has said that their 7nm+ is much better than their 7nm; in fact it's even better than their 6nm, though it is also more expensive for now. So yeah, I do expect some small frequency increases, probably 100 MHz on the low to mid end and up to 200 MHz on the high end.
4. It's going to be 7nm+.
5. We know 7nm+ can be up to 20% more power efficient or up to 15% faster, so if AMD decides to use the 7nm+ benefits for reduced TDP, then about 20% lower TDP, assuming everything else stays the same. But again, if they add more frequency or more IPC, the reduction might be smaller.
6. Probably.
7. The earliest Ryzen 4000 CPUs are at least 9 months away, and their G series CPUs come a few months after that, so at least a year.

soresu · Oct 15, 2019

DrMrLordX said:
It doesn't if there's an IF link between cores 1-4 and cores 5-8 just like in Zen2.

Be careful, it discussed both Rome AND Milan on those slides.

Not sure where you get that from, I don't see it on the slide with the diagram.

Atari2600 · Oct 15, 2019

maddie said:
I heard it also took quite a few attempts before the Wrights finally flew.

Ludicrous comparison.

A more apt analogy would be Airbus building the A380.

Remind me how that went again...?

soresu · Oct 15, 2019

Guru said:
5. We know 7nm+ can be up to 20% more power efficient or up to 15% faster, so if AMD decides to use the 7nm+ benefits for reduced TDP, then about 20% lower TDP, assuming everything else stays the same. But again, if they add more frequency or more IPC, the reduction might be smaller.

No, it's a 20% area decrease, 15% more power efficient at iso design/clock, or 10% more clock at iso power.

NostaSeronx · Oct 15, 2019

DrMrLordX said:
It doesn't if there's an IF link between cores 1-4 and cores 5-8 just like in Zen2.

Or the new structure could give the same cores 2x the bandwidth, since both IF links drop into the same L3 cache. If the workload on the CCD is only quad-core, in Zen2 only 16 megabytes and 1/2 of the total CCD GB/s could be used, while with Zen3 the four cores would have 32+ megabytes and complete global saturation.
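
The capacity side of that argument is simple arithmetic. The sketch below uses the 16 MB per-CCX figure from the post and assumes two CCXs per Zen2 CCD; the unified number is just the sum, so it is a lower bound consistent with the "32+" in the post. Bandwidth is not modeled, and the function name is illustrative.

```python
ZEN2_L3_PER_CCX_MB = 16   # figure from the post
CCX_PER_CCD = 2           # assumption: two 4-core CCXs per Zen2 CCD


def l3_visible_to_quad_core_mb(unified: bool) -> int:
    """L3 capacity reachable by a 4-core workload confined to one CCX."""
    if unified:
        # Unified L3: the same four cores can reach the whole CCD's L3.
        return ZEN2_L3_PER_CCX_MB * CCX_PER_CCD
    return ZEN2_L3_PER_CCX_MB


zen2_mb = l3_visible_to_quad_core_mb(unified=False)
unified_mb = l3_visible_to_quad_core_mb(unified=True)
print(f"Split L3: {zen2_mb} MB, unified L3: {unified_mb} MB "
      f"({unified_mb / zen2_mb:.0f}x)")
```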

DisEnchantment · Oct 15, 2019

Wow, AMD has a slew of customers in this video.
Including a potential exascale system for GENCI from ATOS.
First time I've heard about a French exascale effort.

JDG1980 · Oct 15, 2019

Richie Rich said:
Why does the Apple A12 use 6xALU in a power-sensitive mobile CPU? Why would they let those ALUs sit idle?
Why is the Apple A12 +58% faster than Skylake-X in INT IPC? Isn't it because Apple found a way to feed them efficiently?

It looks like those 6xALU are not idling much. They are delivering pure performance.

Perhaps ARM CPUs are more amenable to extreme parallelism than x86 CPUs?

Richie Rich said:
With SMT2 it must be even easier to utilize all 6xALUs. It looks like the lowest-hanging fruit available.
It's like tuning a Corolla 4-cylinder engine for every last bit of horsepower. It's way cheaper and easier to buy a car with a V6 engine.

I'm not sure if you've noticed, but car manufacturers have been squeezing more performance out of 4-cylinder engines and avoiding 6-cylinder engines wherever possible. These days to get a V6 you pretty much have to buy a truck or a specialty car.

soresu · Oct 15, 2019

JDG1980 said:
I'm not sure if you've noticed, but car manufacturers have been squeezing more performance out of 4-cylinder engines and avoiding 6-cylinder engines wherever possible. These days to get a V6 you pretty much have to buy a truck or a specialty car.

Oh gawd no, don't encourage people to use car engine metaphors!

amd6502 · Oct 15, 2019

moar of everything

moar car engine metaphors

moar ALUs

moar cores

moar SMT or general MT threads!

moar Nosta speculation and patent digging

soresu · Oct 15, 2019

amd6502 said:
moar of everything

moar car engine metaphors

moar ALUs

moar cores

moar SMT or general MT threads!

moar Nosta speculation and patent digging

Moar 2: Moar Moar.....

DrMrLordX · Oct 16, 2019

soresu said:
Be careful, it discussed both Rome AND Milan on those slides.

Not sure where you get that from, I don't see it on the slide with the diagram.

I saw the slides. It doesn't show how the cores on each side of the massive L3 are linked at all. It shows four cores on the left and four on the right, with the L3 in between. And . . . that's it! No link diagram, no topology, no nothing. It does make it very clear that the cores are separated into two blocks just like in Rome. Just the L3 is different.
