Speculation: Ryzen 4000 series/Zen 3


NostaSeronx

Diamond Member
Sep 18, 2011
3,689
1,224
136
Why the hell are we still discussing SMT4 in Zen 3?
Well...
Then Milan was designed to further erase any asterisks that remain, so in thinking about it, in the original strategy, Milan was where we expected to be back to IPC (or better) parity across all workloads.
- https://www.anandtech.com/show/14568/an-interview-with-amds-forrest-norrod-naples-rome-milan-genoa

Zen 2, in its hi (128-bit) and lo (128-bit) FPU halves, has FP pipes MUL0/MUL1/MUL2/MUL3 and ADD0/ADD1/ADD2/ADD3, while all ports can do VMISC/FMISC - which is enough to support AVX-512.

Now the SMT4 comes into play, as there are 4x FMUL + 4x FADD pipes; each thread thus gets 1x FMUL + 1x FADD. Since the majority of legacy code is 128-bit, and there are workloads that can't be scaled to 256-bit, it makes more sense to support 128-bit units over 256-bit/512-bit units.
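As a back-of-envelope sketch of that split (pipe counts as speculated above; the FP32-lane arithmetic is mine, purely for illustration):

Code:
/* Per-thread FP split under the speculated SMT4 scheme: 4x 128-bit FMUL +
 * 4x 128-bit FADD pipes shared by 4 threads, counting FP32 lanes.
 * Illustrative numbers only, not a claim about the real design. */
#include <stdio.h>

int main(void) {
    const int fmul_pipes = 4, fadd_pipes = 4;  /* per core, as speculated */
    const int threads    = 4;                  /* SMT4                    */
    const int fp32_lanes = 4;                  /* lanes per 128-bit pipe  */

    int core_flops   = (fmul_pipes + fadd_pipes) * fp32_lanes;  /* per cycle */
    int thread_flops = core_flops / threads;

    printf("core:   %d FP32 FLOP/cycle\n", core_flops);    /* 32 */
    printf("thread: %d FP32 FLOP/cycle\n", thread_flops);  /*  8 */
    return 0;
}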

The SMT4, going by AMD's hiring of researchers, is of the constrained variety, combined with the dynamic nature of the new SMT model - i.e. at prefetch or dispatch it goes 1T/2T/3T/4T on demand, etc. That makes it more effective than any previous x86 SMT implementation.

No value added to Milan = Intel regains their lead. Icelake-SP (XCC*2) w/ SunnycoveX (10nm++ core) isn't a low-volume product, nor does it have fewer cores than 64. It's a >72-core monster that easily replaces Xeon Phi. There is also the ultra-secret Icelake-MDFI w/ mesh chiplets (w/ L4 depot cache).
 
Reactions: Richie Rich

soresu

Platinum Member
Dec 19, 2014
2,969
2,200
136
No value added to Milan = Intel regains their lead.
They recently made a statement about Milan.

They said it will have superior perf/watt to IceLake, not superior raw performance.

I expect modest IPC gains with some power efficiency gains too; better to expect that and be surprised by more if it comes.
 

maddie

Diamond Member
Jul 18, 2010
4,787
4,771
136
They recently made a statement about Milan.

They said it will have superior perf/watt to IceLake, not superior raw performance.

I expect modest IPC gains with some power efficiency gains too; better to expect that and be surprised by more if it comes.
If you continue with your line of reasoning, then you're implying that power drops are coming. I take it to mean that the increased perf/W will translate to higher performance as I really don't see them lowering the TDP ratings. Do you?
 

soresu

Platinum Member
Dec 19, 2014
2,969
2,200
136
If you continue with your line of reasoning, then you're implying that power drops are coming. I take it to mean that the increased perf/W will translate to higher performance as I really don't see them lowering the TDP ratings. Do you?
That meant (in my parlance anyways) modest IPC/clock gains and modest power drops, but not a great amount of either given the process change is fairly meagre.

Others have stated otherwise, and some have stated in a rather overly optimistic way that a change to 6-wide is coming - but I prefer to expect less and receive more (if indeed there is more); it's better that way.

Of course I'm just as happy to get a regular 20% IPC bump per gen, but even ARM can't do that all the time - A73 case in point.

Having said that, does anyone have a concrete figure for the Cortex A57 -> A72 IPC improvement?
 

DisEnchantment

Golden Member
Mar 3, 2017
1,687
6,243
136
Yotsugi said:
It's upper teens for IPC and some clocks to boot, the silicon is already up and running, like, Windows.

In the video (around the 104 second mark) they said they are already sampling the chip.
 

soresu

Platinum Member
Dec 19, 2014
2,969
2,200
136
as I really don't see them lowering the TDP ratings. Do you?
Depends on the segment. The 2700E may have been a part that was nigh on impossible to lay your hands on, but it was a significant TDP drop for a meagre clock decrease at 45W.

Also, at 14nm Zen there was only a single 65W 8-core product (the 1700); now we have several at 7nm - wasn't there a 65W 12-core too?

In the APU segment I would definitely expect a sub-15W TDP SKU; they have more than enough efficiency now to achieve a very good performer at 10W or below, especially if Navi performs as efficiently as you might hope at lower clock speeds.
 

maddie

Diamond Member
Jul 18, 2010
4,787
4,771
136
Depends on the segment. The 2700E may have been a part that was nigh on impossible to lay your hands on, but it was a significant TDP drop for a meagre clock decrease at 45W.

Also, at 14nm Zen there was only a single 65W 8-core product (the 1700); now we have several at 7nm - wasn't there a 65W 12-core too?

In the APU segment I would definitely expect a sub-15W TDP SKU; they have more than enough efficiency now to achieve a very good performer at 10W or below, especially if Navi performs as efficiently as you might hope at lower clock speeds.
Doesn't matter. As long as they keep the 95W or any other existing rating, increased perf/W is really increased perf.
 

tomatosummit

Member
Mar 21, 2019
184
177
116
Doesn't matter. As long as they keep the 95W or any other existing rating, increased perf/W is really increased perf.
This
Perfect example was the R7 1700 to the R7 2700. Although it was amplified by the better boost algorithms, the 2700 maintains higher clock speeds at the same 65W TDP thanks to the 12nm improvements.
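Put as arithmetic (illustrative numbers, not a claim about Milan): performance is just perf/W times watts, so at a fixed TDP any perf/W gain is a like-for-like performance gain.

Code:
/* Minimal sketch: at a fixed TDP, a perf/W uplift is a perf uplift by
 * definition. The 15% figure below is an assumed example, not a spec. */
#include <stdio.h>

int main(void) {
    const double tdp_watts  = 65.0;   /* fixed power budget   */
    const double perf_per_w = 1.00;   /* old part, normalised */
    const double uplift     = 1.15;   /* assumed +15% perf/W  */

    double old_perf = perf_per_w * tdp_watts;
    double new_perf = perf_per_w * uplift * tdp_watts;

    printf("perf gain at fixed %.0fW: %.0f%%\n",
           tdp_watts, (new_perf / old_perf - 1.0) * 100.0);  /* 15% */
    return 0;
}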
 

soresu

Platinum Member
Dec 19, 2014
2,969
2,200
136
Doesn't matter. As long as they keep the 95W or any other existing rating, increased perf/W is really increased perf.
Not really how I see it, but I have offline CG rendering goggles on.

I will always prefer more cores at the same power rather than a few hundred MHz on the same number of cores.

If a 65W 16 core model comes out, I would probably buy it even if it costs more than the higher clocked model at 95W-105W.
 

Saylick

Diamond Member
Sep 10, 2012
3,389
7,154
136
It will have superior everything.

It's upper teens for IPC and some clocks to boot, the silicon is already up and running, like, Windows.
Out of curiosity, how do you know the IPC gains are upper teens? Is this just speculation?
 

Saylick

Diamond Member
Sep 10, 2012
3,389
7,154
136
Nah, the cat's outta the bag already in China.
I haven't been keeping up too frequently in this thread so sorry if it's already been posted, but can you provide the source (assuming it's on Chiphell or some Chinese forum) of this info?

Also, "upper teen" IPC improvement implies the jump from Zen 2 to Zen 3 is as large, if not larger, than from Zen+ to Zen 2. I'd really like to see some proof because that's a BIG jump in IPC.
 

Saylick

Diamond Member
Sep 10, 2012
3,389
7,154
136
If Zen 3 has the same amount of development effort or more, then high teens doesn't seem unreasonable; they did what, 13-15%-ish, while also doubling the datapath width with Zen 2.
I have an easier time understanding the 15% IPC gains in Zen 2 because the larger mop cache and improved predictor are items that I've seen in the past that directly improve IPC. What I'm curious about is what else is there to do that would also give another 15% IPC on top of the 15% that Zen 2 brought. Larger registers? Another L/D unit? More ALUs?
 

Yotsugi

Golden Member
Oct 16, 2017
1,029
487
106
but can you provide the source (assuming it's on Chiphell or some Chinese forum) of this info?
Later, when I dig it out of Twitter DMs.
Also, "upper teen" IPC improvement implies the jump from Zen 2 to Zen 3 is as large, if not larger, than from Zen+ to Zen 2
What's so special about that? Every numbered core is a tock.
What I'm curious about is what else is there to do that would also give another 15% IPC on top of the 15% that Zen 2 brought
You'll see.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,867
3,418
136
I have an easier time understanding the 15% IPC gains in Zen 2 because the larger mop cache and improved predictor are items that I've seen in the past that directly improve IPC. What I'm curious about is what else is there to do that would also give another 15% IPC on top of the 15% that Zen 2 brought. Larger registers? Another L/D unit? More ALUs?
The answer's easy: keep the core fed. So a bigger, better front end, yes more PRF, more dispatch/retire, an increased OoO window. The known L3 cache change can make a big IPC difference to various workloads.

In terms of ALUs, I'm still waiting for this mythical single-thread integer workload that doesn't do any loads or stores and has an IPC of >4, with heaps of ILP just lying around waiting for more ALUs.
 
Reactions: lightmanek

Saylick

Diamond Member
Sep 10, 2012
3,389
7,154
136
Later, when I dig it out of Twitter DMs.

I look forward to it.

What's so special about that? Every numbered core is a tock.

That's a fair point, but then again, Intel has used 10% IPC gains as a tock and that's considered a respectable IPC gain for an architectural improvement. 15% on top of another 15% is a fresh change of pace given the incremental improvements we've seen from Intel in the last few years.

The answer's easy: keep the core fed. So a bigger, better front end, yes more PRF, more dispatch/retire, an increased OoO window. The known L3 cache change can make a big IPC difference to various workloads.

In terms of ALUs, I'm still waiting for this mythical single-thread integer workload that doesn't do any loads or stores and has an IPC of >4, with heaps of ILP just lying around waiting for more ALUs.
Hahaha, fair enough. That's like me asking, "How do you make a new Corvette faster than the generation before it?", and you replying, "Well, you can make it have more horsepower, fatter tires, better weight distribution, and more downforce." I mean, you'd be right because it's true, but it's the smaller details and reasoning behind certain design decisions that I think are more interesting.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,867
3,418
136
Hahaha, fair enough. That's like me asking, "How do you make a new Corvette faster than the generation before it?", and you replying, "Well, you can make it have more horsepower, fatter tires, better weight distribution, and more downforce." I mean, you'd be right because it's true, but it's the smaller details and reasoning behind certain design decisions that I think are more interesting.
well of course,

have a look at the patent for how they made the 3rd AGU work; it's quite interesting. It makes me wonder whether the reason Apple have 4-cycle latency on simple ALU ops is that they are doing the same kind of thing on the ALUs that AMD did for the AGUs. If they did something like that, they probably wouldn't have to do any kind of internal clustering of ALUs.
 
Reactions: amd6502

Cardyak

Member
Sep 12, 2018
73
161
106
There’s loads of potential for further increases, and that’s without needing radical redesigns.

Just some basic stuff off the top of my head

- More execution units (Doesn’t have to be ALU, can be AGU, LEA, FPU, etc...)
- Larger Caches
- Increased ROB, memory and scheduler buffers
- More ports to dispatch instructions to execution units and reduce back-end bottlenecks
 

amd6502

Senior member
Apr 21, 2017
971
360
136
In terms of ALUs, I'm still waiting for this mythical single-thread integer workload that doesn't do any loads or stores and has an IPC of >4, with heaps of ILP just lying around waiting for more ALUs.

You're seeing it all the time. It's just that it's only (likely) a smaller percentage of the code.

4 ALUs is quite good already and got Zen the 40%+ IPC gain.

The potential single-thread gains from 4 ALUs to 5 (or 6) are going to be much less. But here, even a ~5% IPC increase is going to count a lot. And for (SMT2) multithread IPC gains, it's bound to be double digits.

The slight downside is that more idle pipes means it would need gating to avoid losing efficiency. Or a 4-way MT scheme; SMT2+?
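As a toy Amdahl-style sketch of why the single-thread step from 4 to 5 or 6 ALUs is small (the 10% share of ALU-throughput-bound cycles is purely my assumption, not a measurement):

Code:
/* Toy estimate: only the cycles that are ALU-throughput-bound at 4 ALUs
 * can get faster with a wider machine; everything else is unchanged.
 * The bound fraction is an assumption for illustration. */
#include <stdio.h>

int main(void) {
    const double alu_bound = 0.10;        /* assumed ALU-bound share of cycles */
    const int    widths[]  = {5, 6};

    for (int i = 0; i < 2; i++) {
        double rel_cycles = (1.0 - alu_bound) + alu_bound * 4.0 / widths[i];
        printf("4 -> %d ALUs: ~%.1f%% IPC gain\n",
               widths[i], (1.0 / rel_cycles - 1.0) * 100.0);  /* ~2%, ~3.4% */
    }
    return 0;
}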
 
Reactions: jaymc

itsmydamnation

Platinum Member
Feb 6, 2011
2,867
3,418
136
You're seeing it all the time. It's just that it's only (likely) a smaller percentage of the code.

4 ALUs is quite good already and got Zen the 40%+ IPC gain.

The potential single-thread gains from 4 ALUs to 5 (or 6) are going to be much less. But here, even a ~5% IPC increase is going to count a lot. And for (SMT2) multithread IPC gains, it's bound to be double digits.

The slight downside is that more idle pipes means it would need gating to avoid losing efficiency. Or a 4-way MT scheme; SMT2+?
This is just hand waving, nothing of value. For example, you're only waiting one extra cycle to execute going from 4-wide Zen to 6-wide A12, and Zen has lower latency for simple ALU ops. So unless you have sustained 6 ALU ops a cycle over many cycles back to back, you're not gaining anything, yet the A12 has an IPC advantage. Why is that? How do you propose to load or store a damn thing while sustaining 6 ALU ops? Zen 2 doesn't even have enough issue width right now to sustain the 4 ALUs + 3 AGUs. As I have already proved for SPECint (something you moar-ALU guys have yet to do), x86 instructions with memory ops make up a very large share of all instructions.
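To make that concrete, here is the sort of inner loop integer code is full of (a made-up sketch, not taken from SPEC): per element the core has to do a load, index arithmetic and a compare/branch around the one "real" ALU op, so there's never a run of six back-to-back ALU ops to feed.

Code:
/* Sketch of a typical integer inner loop: each iteration needs a load,
 * an index update and a compare/branch alongside the single AND+ADD,
 * so memory and control ops dominate the instruction mix. */
#include <stddef.h>

long sum_masked(const long *a, size_t n, long mask) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {   /* index update + compare + branch */
        s += a[i] & mask;              /* load + AND + ADD                */
    }
    return s;
}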

Going to 4 ALUs alone did not get anywhere near a 40% gain. If you want to be specific and correct, Bulldozer could already do 4 ALU ops in a core in a single cycle (not that it would practically happen, or that you would want it to).

Let's be clear here:
much improved L1I cache (no more aliasing)
much improved L1D cache
improved instruction fetch / increased fetch
addition of a µop cache
significantly improved cache hierarchy
dedicated hardware for stack handling (store-to-load forwarding at the front end of the pipeline)
increased instruction dispatch
significantly increased PRF (96 to 168)
improved branch predictors
improved prefetch
improved store forwarding
ALUs increased to 4.

All of those things together got the 40% performance uplift, not 4 ALUs FFS.

If you go back and look, the initial thoughts of people like David Kanter were that Zen's 4:2 ALU:AGU configuration was suboptimal and that 3:3 would have been better. Zen 2 comes along and makes it 4:3... funny that.

So instead of BS handwaving, show me the money! Show me the SPECint workload that needs, cycle after cycle after cycle, to issue 6 ALU instructions while not loading or storing a thing.

I'm just going to quote Agner:

Bottlenecks in AMD Ryzen The throughput of each core in the Ryzen is higher than on any previous AMD or Intel x86 processor, except for 256-bit vector instructions. Loops that fit into the µop cache can have a throughput of five instructions or six µops per clock cycle. Code that does not fit into the µop cache can have a throughput of four instructions or six µops or approximately 16 bytes of code per clock cycle, whichever is smaller. The 16 bytes fetch rate is a likely bottleneck for CPU intensive code with large loops.
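In other words (using my own rough average of ~4 bytes per x86-64 instruction, which is not Agner's number), the 16-byte fetch limit alone caps the legacy decode path well below what the µop-cache path can sustain:

Code:
/* Back-of-envelope reading of the quoted fetch limit. The average
 * instruction length is an assumption for illustration. */
#include <stdio.h>

int main(void) {
    const double fetch_bytes_per_clock = 16.0;  /* from the quote above   */
    const double avg_insn_bytes        = 4.0;   /* assumed average length */

    printf("fetch-limited throughput ~ %.0f instructions/clock\n",
           fetch_bytes_per_clock / avg_insn_bytes);  /* ~4 */
    return 0;
}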

Funny how there's nothing about ALU bottlenecks in his "optimization guide for assembly programmers and compiler makers".
 