Speculation: Ryzen 4000 series/Zen 3

Richie Rich · Oct 30, 2019

DrMrLordX said:
I would assume that 256-bit vector processing is much faster on x86 hardware already since that really isn't a thing on mobile hardware. There are also scenarios where AMD's implementation of SMT in particular makes Zen2 much more attractive. For example, I can easily clear an MT score of 14000 in Geekbench 5 on a 3900x with clockspeeds sitting in, I dunno, the 4.2 GHz range? An A13 with 2 Lightning (2.66 GHz) and 4 Thunder (??? GHz) cores scores a measly GB5 MT score of 3400-3500 (varies). I have twice the cores and . . . I guess ~57% (or more) higher clockspeed than an A13, but better than 400% the MT performance. Take two A13s, jack up their clockspeeds by +57%, and you get an MT score of around 11k (hypothetically). Yeah my 3900x sucks power, but big deal. Let's see Apple scale that A13 up to a 95W TDP (or higher).

That ST score is scary, and the MT score may be more a result of throttling than anything else. So A13 deserves a lot of credit. Just not all the credit.

Rumor says the MacBook will have 8c. So the right math is to multiply the A13 score by 4, which gives you a score of 13800 @ 2.66 GHz. This matches the 12c Ryzen 3900X. And if they clock it +30% higher at 3.5 GHz, the score will rise +25%. And we're talking about the old A13; the MacBook will have the new A14 IMO.
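
For anyone who wants to check the back-of-envelope numbers in this exchange, here is a minimal sketch. It assumes scores scale perfectly linearly with core count and clock, which real multi-threaded results generally do not. The function name `scale_score` and the A13 score of 3450 (midpoint of the quoted 3400-3500 range) are illustrative assumptions, not measured data.

```python
def scale_score(base_score, core_factor=1.0, clock_factor=1.0):
    """Naive estimate: score scales linearly with core count and clock."""
    return base_score * core_factor * clock_factor

A13_MT = 3450  # assumed midpoint of the quoted 3400-3500 GB5 MT range

# DrMrLordX's estimate: two A13s with clocks raised by 57%
two_a13_fast = scale_score(A13_MT, core_factor=2, clock_factor=1.57)

# Richie Rich's estimate: A13 score multiplied by 4 at stock clock
four_x = scale_score(A13_MT, core_factor=4)

# Richie Rich's follow-up: a further +25% score from the higher clock
four_x_clocked = scale_score(four_x, clock_factor=1.25)

print(round(two_a13_fast))    # 10833
print(round(four_x))          # 13800
print(round(four_x_clocked))  # 17250
```

Both posters' figures fall out of the same linear model; they differ only in which multiplier they apply to the A13 baseline.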

soresu · Oct 30, 2019

Richie Rich said:
So the right math is to multiply the A13 score by 4, which gives you a score of 13800 @ 2.66 GHz.

Past scores have not shown a fully linear scaling on the MT scores, so not right math really.

dnavas · Oct 30, 2019

Richie Rich said:
Rumor says the MacBook will have 8c. So the right math is to multiply the A13 score by 4...

Only if the Thunder cores are not counted for Macbook and/or disabled for the base number. Is the rumor that the Macbook will include 16 thunder cores?

TheGiant · Oct 30, 2019

Richie Rich said:
Rumor says the MacBook will have 8c. So the right math is to multiply the A13 score by 4, which gives you a score of 13800 @ 2.66 GHz. This matches the 12c Ryzen 3900X. And if they clock it +30% higher at 3.5 GHz, the score will rise +25%. And we're talking about the old A13; the MacBook will have the new A14 IMO.

dream on
by dumping x86 they will lose so much
they will improve the iPad Pro as a second computing machine but keep the MacBook within the x86 world
Ice Lake and Tiger Lake don't look bad at all

DrMrLordX · Oct 30, 2019

Richie Rich said:
Rumor says the MacBook will have 8c.

About that . . .

dnavas said:
Only if the Thunder cores are not counted for Macbook and/or disabled for the base number. Is the rumor that the Macbook will include 16 thunder cores?

Exactly. If it's 3 Lightning 5 Thunder or 4 Lightning 4 Thunder then you aren't getting into the 13800 territory on GB5. And even if you did, my 3900x would still be faster in that bench that is admittedly very friendly to mobile hardware. So ha!

@TheGiant

4c TigerLake looks pretty good actually.

amd6502 · Oct 30, 2019

Richie Rich said:
Why are you mixing 1.5 CISC ILP with RISC execution units? Keller said IceLake is executing 3-6 instructions at once. Maybe you can explain why Apple moved from A7 (4xALU, 2xLSU) to wider A11/12/13 (6xALUs with still only 2xLSU). I think they had pretty good reason to do that (especially when we know there is massive +58% IPC gain over SkyLake).

The interesting thing is that AVT saw exactly the same slides (graphics), just with SMT4 on them. This is the point. They put it there either to identify leakers or because Zen 3 is SMT4 capable. Could be both.

Cache, cache, cache. I feel like I'm in the Tron movie, surrounded by programs caught in an endless cycle. No offense, but it's funny how many people want to increase code execution without increasing execution units. The leaked Zen 3 IPC gain of >8% (others say >10%) cannot be achieved by L3 cache alone.

BTW a comparison of evolution of Apple/Intel cores:

2012 - Intel IvyBridge (3xALU)... Apple A6 (2xALU) .... Apple is way behind

2013 - Intel Haswell (4xALU)... Apple A7 (4xALU) .... Apple is on par with Intel

2017 - Intel Coffee Lake (4xALU)... Apple A11 (6xALU) .... Apple became tech leader

Isn't this interesting?

I think they may go wider but 6 ALU (with 4 AGU?) would be a little overkill. I think 5 ALU with one (or some other mix) of ALUs being a simple unit would be quite enough.

I also think that if Zen3 were SMT4 it would have come out in the HPC presentation/leak. However, if it were some other 4-way MT that in performance is closer to SMT2 (I call it SMT2+ or aSMT4), then for all practical purposes of an HPC-oriented talk it would be called SMT2. Supercomputing (numerical modeling) couldn't care less if you have extra background threads. Conversely, the odds of SMT2+ are also not high: under 10%. But the odds of SMT2+ for Zen4 go up quite a bit. This would be a feature very useful in server and also somewhat useful in mobile.

Saylick · Oct 31, 2019

I'm still thinking they'll beef up the core with an additional L/S Unit (so 2 Load/2 Store), widen dispatch to 8 ops/cycle, widen retire to 10 ops/cycle, and shared L3 across all 8 cores on each CCD. This is on top of the usual enlarging of registers and buffers.

soresu · Oct 31, 2019

Saylick said:
shared L3 across all 8 cores on each CCD

That much was confirmed by the new presentation slide, exact quantity of the L3 is still up in the air though beyond 32MB+.

Given how much area L3 takes up though, I'm inclined to think anything more than 40MB (+25%) might be too big for the 7nm+ move.

Gideon · Oct 31, 2019

soresu said:
That much was confirmed by the new presentation slide, exact quantity of the L3 is still up in the air though beyond 32MB+.

Given how much area L3 takes up though, I'm inclined to think anything more than 40MB (+25%) might be too big for the 7nm+ move.

My guess is, that it will be an all inclusive cache and will therefore grow, to fit the L2 of all cores in CCD (in order not to effectively shrink in size compared to zen 2). This means extra 4MB, provided the L2 remains unchanged - so in total 36MB of L3 per chiplet.

As the L3 latency in Zen2 is already measurably slower than Zen+ (11ns vs 9ns) and unifying the cache will probably make it a tad worse still, I wouldn't rule out L2 being enlarged to compensate for it, so 1MB L2 per core + 40MB L3 per chiplet is also a (less likely) possibility.

If one is to believe the ~15% IPC gain rumors, I think the entire cache hierarchy will be redesigned as the unification of L3 is a major redesign anyway. My (somewhat wild and wishful) predictions for Milan memory hierarchy in that case are:

Memory Compression for chiplet-to-chiplet communication at least on server (probably configurable in BIOS). They were issued patents for it a while ago and it would save a considerable amount of power (in EPYC and Threadripper) that could be used in the core chiplets instead of being wasted transporting data.
40MB of all-inclusive L3 cache per CCD (36MB if L2 stays the same)
1MB of L2
48KB of 12-way L1 Data Cache "Ice Lake Style" (this is the least likely prediction IMO)
improvements to the uop cache, so that it is competitively shared between SMT threads, rather than statically partitioned (effectively doubling it for lightly threaded loads).
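
The 36MB and 40MB figures above follow from simple capacity arithmetic, sketched below. This is a simplified model that assumes an inclusive L3 duplicates every L2 line, so it must grow by the total L2 size to keep the same non-duplicated capacity as Zen 2's 32MB of L3 per CCD (2 x 16MB) with 8 cores at 512KB L2 each. The function name is hypothetical.

```python
def inclusive_l3_needed_mb(unique_l3_mb, cores, l2_per_core_mb):
    """L3 size (MB) needed so that an inclusive L3 still offers
    `unique_l3_mb` of capacity not duplicated from the cores' L2s."""
    return unique_l3_mb + cores * l2_per_core_mb

# L2 unchanged at 512KB per core
print(inclusive_l3_needed_mb(32, 8, 0.5))  # 36.0

# L2 doubled to 1MB per core
print(inclusive_l3_needed_mb(32, 8, 1.0))  # 40.0
```

Real inclusive designs do not necessarily size the L3 this way; the sketch only reproduces the reasoning in the post.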

Thunder 57 · Oct 31, 2019

Gideon said:
My guess is, that it will be an all inclusive cache and will therefore grow, to fit the L2 of all cores in CCD (in order not to effectively shrink in size compared to zen 2). This means extra 4MB, provided the L2 remains unchanged - so in total 36MB of L3 per chiplet.

As the L3 latency in Zen2 is already measurably slower than Zen+ (11ns vs 9ns) and unifying the cache will probably make it a tad worse still, I wouldn't rule out L2 being enlarged to compensate for it, so 1MB L2 per core + 40MB L3 per chiplet is also a (less likely) possibility.

If one is to believe the ~15% IPC gain rumors, I think the entire cache hierarchy will be redesigned as the unification of L3 is a major redesign anyway. My (somewhat wild and wishful) predictions for Milan memory hierarchy in that case are:

Memory Compression for chiplet-to-chiplet communication at least on server (probably configurable in BIOS). They were issued patents for it a while ago and it would save a considerable amount of power (in EPYC and Threadripper) that could be used in the core chiplets instead of being wasted transporting data.

40MB of all-inclusive L3 cache per CCD (36MB if L2 stays the same)

1MB of L2

48KB of 12-way L1 Data Cache "Ice Lake Style" (this is the least likely prediction IMO)

improvements to the uop cache, so that it is competitively shared between SMT threads, rather than statically partitioned (effectively doubling it for lightly threaded loads).

Have to admit, an inclusive L3 cache would be interesting. It is certainly more realistic than some other ideas being floated around here.

Ajay · Oct 31, 2019

Gideon said:
If one is to believe the ~15% IPC gain rumors

I'll be shocked if Zen3 has this much IPC gain. And I'll be dismayed, slightly, if I build a Ryzen 3000 series system.
I think a 15% overall performance gain would be impressive, and would finally leave Intel behind in gaming.

Cardyak · Oct 31, 2019

Gideon said:
improvements to the uop cache, so that it is competitively shared between SMT threads, rather than statically partitioned (effectively doubling it for lightly threaded loads).

That’s an interesting notion. Similar to Ivy Bridge where improvements were made to the ROB and other sections to reduce static partitioning.

This alone could offer a modest IPC increase.

soresu · Oct 31, 2019

Ajay said:
I think a 15% overall performance gain would be impressive

Impressive, most impressive.....

I'm just going to leave now.

amd6502 · Oct 31, 2019

inclusive means L1/L2 data is copied to L3.

L2 doubling to 1MB seems like a very good bet. And semi-decent chance L1 growing too.

Seems they took the all roads lead to Rome theme further in Zen3 by doing this at the scale of the CCX.

Does Zen2 follow Zen1 in L3 being victim cache?

For Zen3 the L3 cache unit may be a complex compound of its own and also now may have the role of acting as hub. It may be a smart hybrid thing that is not one or the other; internally it may dedicate a good fraction ~30% capacity as L4-ish victim cache (with L3 = hub+L3+L4).

Inclusive would help in its role as hub when there are shared memory addresses being updated by several cores (more energy efficient, less latency than exclusive, but you have a slightly smaller total cache footprint: 32 MB vs. 36 or 40 MB). Something more flexible (non-exclusive or partially inclusive), based on whether addresses are shared between cores, would have the best of both worlds.

Tarkin77 · Nov 1, 2019

Direct quote from Dr. Lisa Su

Going forward, we are not relying on process technology as the main driver. We think process technology is necessary. It’s necessary to be sort of at the leading edge of process technology. And so, today, 7-nanometer is a great node, and we’re getting a lot of benefit from it. We will transition to the 5-nanometer node at the appropriate time and get great benefit from that as well. But we’re doing a lot in architecture. And I would say, that the architecture is where we believe the highest leverage is for our product portfolio going forward.

from the Q3 conference call this week.
source: https://www.overclock3d.net/news/cp...tecture_not_process_tech_says_amd_s_lisa_su/1

Arzachel · Nov 1, 2019

Ajay said:
I'll be shocked if Zen3 has this much IPC gain. And I'll be dismayed, slightly, if I build a Ryzen 3000 series system.
I think a 15% overall performance gain would be impressive, and would finally leave Intel behind in gaming.

It would be impressive but also absolutely necessary to match Intel's pace. Ice Lake might be stuck on a dud node but clock for clock it's still the fastest x86 cpu.

Ajay · Nov 1, 2019

Arzachel said:
It would be impressive but also absolutely necessary to match Intel's pace. Ice Lake might be stuck on a dud node but clock for clock it's still the fastest x86 cpu.

AMD doesn’t really have anything to worry about on the desktop till 2022! They will definitively take not just the multithreaded crown but also the single threaded crown in 2020. So yes, they do have to keep up the pace; I believe Intel will finally bring their A game in 2022 (unless something is horribly wrong with 7nm EUV).

Abwx · Nov 1, 2019

Arzachel said:
It would be impressive but also absolutely necessary to match Intel's pace. Ice Lake might be stuck on a dud node but clock for clock it's still the fastest x86 cpu.

Clock for clock, and in Cinebench R15, a Zen 2 is....2% faster in single thread than Icelake, so that doesn't bode well for Intel's alleged 18% IPC improvement...

Topweasel · Nov 1, 2019

Abwx said:
Clock for clock, and in Cinebench R15, a Zen 2 is....2% faster in single thread than Icelake, so that doesn't bode well for Intel's alleged 18% IPC improvement...

Depends on the tests; I have seen -10% to -6% to +2%. I think IF speed can be key here, at least in these kinds of benches. But even at 10% better than Zen 2 it's a wash with their current clocks. If they can get another 5-7% with Zen 3 and Zen 4 while maintaining clocks, then I think we start to really see AMD pull away. Everything Intel is trying, AMD basically has a smoother thing going. I mean, look at IF, Zen 2, and their packaging vs. where Intel is with EMIB. Intel's solution seems like it would be more elegant. But does that matter if AMD is selling twice as many cores for half the price of their best chips at higher clocks?
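
The "it's a wash" argument can be sketched with the usual first-order model, where single-thread performance is IPC times clock. The function names and the example ratios below are hypothetical, chosen only to illustrate the point; they are not measurements of any specific chip.

```python
def relative_perf(ipc_ratio, clock_ratio):
    """Single-thread performance of chip A relative to chip B,
    modeled simply as IPC ratio times clock ratio."""
    return ipc_ratio * clock_ratio

def breakeven_clock_ratio(ipc_ratio):
    """Clock ratio (A/B) at which A's IPC lead is exactly cancelled."""
    return 1.0 / ipc_ratio

# Hypothetical: chip A has 10% better IPC but clocks 10% lower than chip B.
print(round(relative_perf(1.10, 0.90), 2))    # 0.99

# Clock ratio at which a 10% IPC lead yields identical performance.
print(round(breakeven_clock_ratio(1.10), 3))  # 0.909
```

Under this model a 10% IPC lead is erased once the other chip clocks roughly 10% higher, which is the sense in which the two come out even.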

Abwx · Nov 1, 2019

Topweasel said:
Depends on the tests; I have seen -10% to -6% to +2%. I think IF speed can be key here, at least in these kinds of benches. But even at 10% better than Zen 2 it's a wash with their current clocks. If they can get another 5-7% with Zen 3 and Zen 4 while maintaining clocks, then I think we start to really see AMD pull away. Everything Intel is trying, AMD basically has a smoother thing going. I mean, look at IF, Zen 2, and their packaging vs. where Intel is with EMIB. Intel's solution seems like it would be more elegant. But does that matter if AMD is selling twice as many cores for half the price of their best chips at higher clocks?

Dunno what those "other" tests you're talking about are; so far I've seen no exhaustive review with fixed or known clocks, if we except the CB R15 ST score. All other tests are done at a given power, which doesn't say what the clock rate is..

yuri69 · Nov 1, 2019

Abwx said:
Dunno what those "other" tests you're talking about are; so far I've seen no exhaustive review with fixed or known clocks, if we except the CB R15 ST score. All other tests are done at a given power, which doesn't say what the clock rate is..

OT, but Phoronix managed to obtain ~18% IPC uplift compared to Kaby. This was done using their standard suite.

uzzi38 · Nov 1, 2019

yuri69 said:
OT, but Phoronix managed to obtain ~18% IPC uplift compared to Kaby. This was done using their standard suite.

Anandtech also validated the 18% IPC uplift.

Ajay · Nov 1, 2019

Gideon said:
Memory Compression for chiplet-to-chiplet communication at least on server

Hmm, but it would introduce more latency.

Gideon said:
improvements to the uop cache, so that it is competitively shared between SMT threads, rather than statically partitioned

AMD should just double the uop cache anyway.

Thunder 57 · Nov 1, 2019

Ajay said:
Hmm, but it would introduce more latency.

AMD should just double the uop cache anyway.

What?? They just doubled it with Zen 2 at the expense of L1-I cache. Pretty sure that stays put for Zen 3.

soresu · Nov 1, 2019

yuri69 said:
OT, but Phoronix managed to obtain ~18% IPC uplift compared to Kaby. This was done using their standard suite.

uzzi38 said:
Anandtech also validated the 18% IPC uplift.

With or without all pertinent mitigations of security problems?
