Speculation: Ryzen 4000 series/Zen 3


mikk

Diamond Member
May 15, 2012
4,173
2,211
136
Minimum 8 CUs, which is 512 shaders.

So do you have a source?

Don't worry about tests for the time being; just note RDNA's improvement in graphics and bandwidth efficiency over Vega. A 5600XT performs on par with a Vega64 despite just over half the CUs and just over half the memory bandwidth.

You should have a test, because you did claim it will demolish Tigerlake. I have to wonder, because we still don't even know exactly how Tigerlake performs in the real world and at lower power. A statement like this is bold, so I'm asking for your source.
 

uzzi38

Platinum Member
Oct 16, 2019
2,703
6,405
146
So do you have a source?



You should have a test, because you did claim it will demolish Tigerlake. I have to wonder, because we still don't even know exactly how Tigerlake performs in the real world and at lower power. A statement like this is bold, so I'm asking for your source.
Yeah, _rogame's findings in driver code tell us we're looking at a multiple of 4 WGPs. 4 WGPs is 8 CUs, which is 512 shaders.

As for the second half, wait and see. But Van Gogh and Mero are both not normal parts anyway, so w/e.
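
For reference, the arithmetic behind that shader count, as a minimal sketch (the 2-CUs-per-WGP and 64-shaders-per-CU figures follow from the "8 CUs = 512 shaders" statement above):

```python
# RDNA organisation: each WGP holds 2 CUs, and each CU has 64 shaders
# (hence 8 CUs = 512 shaders, as stated above).
CUS_PER_WGP = 2
SHADERS_PER_CU = 64

def shaders_from_wgps(wgps: int) -> int:
    """Total shader count for a given WGP count."""
    return wgps * CUS_PER_WGP * SHADERS_PER_CU

# Multiples of 4 WGPs, per the driver-code findings:
for wgps in (4, 8, 12, 16):
    print(f"{wgps} WGPs -> {shaders_from_wgps(wgps)} shaders")
# 4 WGPs -> 512 shaders, i.e. the minimum 8 CU configuration.
```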
 
Reactions: raghu78

Antey

Member
Jul 4, 2019
105
153
116
Interesting observation. Did you see a test? Could you share the link? How many units will it have?

And what about Tiger Lake? Did you see a test? We don't even know if we will see Tiger Lake at <10W this year.
 

RTX2080

Senior member
Jul 2, 2018
322
511
136
I have two points to make to that.

1. AMD's roadmaps in the last 2 years have changed incredibly drastically. I doubt any claims that this is very old, and even if it were, it's most certainly not accurate. You've seen one of the changes to the roadmap in this thread, I'm sure - Zen 3 was originally going to have SMT4, and AMD have talked about other changes, such as Zen 2 getting the TAGE branch predictor of Zen 3.

2. The people I've talked to that spend time on Weibo have said not to trust him, as he's highly pro-Intel and has gone so far as to make up stuff regarding AMD to make Intel look better. Treat his roadmaps as something that needs to be proven accurate first before we take them as anything remotely close to fact.

I don't know what AMD is going to do with Zen3 or later gens, and I don't know that guy's personality, but that roadmap looks a bit odd, because I've never seen AMD make that style of roadmap: Xnm + codename + mem gen + arch in a single blank lol
 

uzzi38

Platinum Member
Oct 16, 2019
2,703
6,405
146
I don't know what AMD is going to do with Zen3 or later gens, and I don't know that guy's personality, but that roadmap looks a bit odd, because I've never seen AMD make that style of roadmap: Xnm + codename + mem gen + arch in a single blank lol
No comment on AMD's normal roadmap style. However, I will point out there is never a reason to have so many different types of chips on the same roadmap.

AMD is barely telling OEMs about the silicon they need to know about, never mind the silicon they may never see (such as Van Gogh).
 

mikk

Diamond Member
May 15, 2012
4,173
2,211
136
Yeah, _rogame's findings in driver code tell us we're looking at a multiple of 4 WGPs. 4 WGPs is 8 CUs, which is 512 shaders.

As for the second half, wait and see. But Van Gogh and Mero are both not normal parts anyway, so w/e.

You should wait and see; I didn't make a bold statement like 'Van Gogh will demolish Tigerlake' without real info. The big Vega variants couldn't utilize all of their shaders well enough; it remains to be seen how big the difference will be on a very low shader count GPU. The RX 570 and RX 5500 XT have similar FP32 performance, yet their real-world performance differs by 20%. And Vega/Polaris were manufactured on GlobalFoundries' 14 nm process; TSMC 7nm is vastly better than that.
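
To put rough numbers on that comparison (an illustrative sketch; the shader counts and boost clocks are the commonly cited specs for these cards, and peak FP32 is the usual shaders × 2 × clock):

```python
# Peak FP32 = shaders * 2 (FMA counts as 2 ops/clock) * clock in GHz,
# giving GFLOPS; divide by 1000 for TFLOPS. Boost clocks are approximate.
def fp32_tflops(shaders: int, clock_ghz: float) -> float:
    return shaders * 2 * clock_ghz / 1000.0

rx570 = fp32_tflops(2048, 1.244)     # Polaris, 32 CUs
rx5500xt = fp32_tflops(1408, 1.845)  # RDNA, 22 CUs
print(f"RX 570:     {rx570:.1f} TFLOPS")    # ~5.1
print(f"RX 5500 XT: {rx5500xt:.1f} TFLOPS") # ~5.2
# Near-identical peak FP32, yet ~20% apart in real performance:
# the difference is how well each architecture feeds its shaders.
```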
 

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
We have just received bad news.

AMD aren't going to be able to meet their timelines. They won't be able to use chiplets to jump from node to node. It is simply too expensive; AMD is broke, and they aren't going to stay on the leading edge anymore. With this, Frontier and El Capitan are planned to be delayed to 2023/2025 respectively.

7nm for 2019, 2020, 2021, 2022 (APU). It is obviously more expensive to have 5nm APUs before 5nm CPU chiplets. So Durango/Rembrandt aren't 5nm.

AMD's venture into full-fledged 5nm EUV is a failure, just like everyone else's. Failed, unyielding test chips have doomed AMD to a later, fixed 5nm node for 2022 CPUs and 2023 APUs.



I have just received awful news: AMD CEO Lisa Su had an internal company affair before becoming CEO and will be booted out of AMD immediately. /s
I think this was the first time ever that you didn't forget to put '/s' at the end of your post
 

uzzi38

Platinum Member
Oct 16, 2019
2,703
6,405
146
You should wait and see; I didn't make a bold statement like 'Van Gogh will demolish Tigerlake' without real info. The big Vega variants couldn't utilize all of their shaders well enough; it remains to be seen how big the difference will be on a very low shader count GPU. The RX 570 and RX 5500 XT have similar FP32 performance, yet their real-world performance differs by 20%. And Vega/Polaris were manufactured on GlobalFoundries' 14 nm process; TSMC 7nm is vastly better than that.

The 5500XT has plenty of other problems on its plate, with bottlenecks that are neither core nor memory related, amongst other things. AMD made a lot of cuts to minimise costs on that die.

Process node has no effect on clearly uArch-related bottlenecks like the ones we're discussing.

Anyway, you'll get to witness firsthand what happens when AMD starts diverting resources to Radeon properly, when RDNA2 launches soon. All the comments saying AMD is going to quickly fall to third place in the GPU market in terms of performance are hilarious, to say the least.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,422
1,759
136
The change from 6 links (4 cores) to 32 links (8 cores) to ensure every slice is connected to every slice seems to me like a rather huge change to be sure that this is no problem and comes without any additional power cost.

There are not 6 links in the system right now; there are 16. Every core has a direct link to every L3 slice, for a total of 16. We've been over this before.

On x86, the only way for two cores to communicate with each other is through memory access, and on the Zen architecture this happens at the L3. Cores do not link to other cores; they link to L3. The link that core 1 uses to communicate with L3 slice 4 cannot be used by core 4 to connect to L3 slice 1. It is easy to show this by measuring that there is no contention whatsoever if you do just that. The only way you can measure any contention is if multiple cores are simultaneously making requests to the same L3 slice; ergo, cores are fully connected to the L3 with dedicated links. Do not draw too many conclusions from sloppily drawn arrows on top of marketing material.
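
A toy model of that argument (purely illustrative, not a measurement of real hardware):

```python
# Toy model of the dedicated-links claim: every (core, slice) pair has
# its own link, so contention can only appear when two requests target
# the same slice in the same cycle -- never on the links themselves.
from collections import Counter

def slice_conflicts(requests: list[tuple[int, int]]) -> int:
    """requests = [(core, slice), ...] all issued in the same cycle."""
    hits = Counter(s for _, s in requests)
    return sum(n - 1 for n in hits.values() if n > 1)

# Core 1 -> slice 4 while core 4 -> slice 1: separate links, no conflict.
print(slice_conflicts([(1, 4), (4, 1)]))  # 0
# Cores 1 and 3 both hitting slice 2: one request must wait.
print(slice_conflicts([(1, 2), (3, 2)]))  # 1
```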

(Also, assuming 8 L3 slices (and it is not at all necessary for there to be as many of them as there are cores!), the total number of links is 64.)

If that's really no problem without any compromises, to what number of cores is that scalable?
The compromise is that every core/L3 slice you add adds physical distance, which in turn adds latency to the worst-case L3 access. Because all cache lines are striped equally over every L3 slice, this directly hurts average-case latency. This is why everyone expects the L3 latency of Zen3 to be slightly worse than Zen2's. Power also goes up in direct proportion to the added length of the links.

Other than that, having more links costs (very) few extra transistors at both ends of the link, at the L3 slice and at the core. Arbitrating access between the different cores would probably be a very bad idea, so the L3 will likely have a dedicated "reservation station" for every core, one that can fit at least as many requests as a core can issue before getting a message back that the L3 slice is overworked. But the links themselves are free, because they occupy a part of the die that is not really used for anything else: the upper metal layers above the caches.
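
A rough sketch of why striping drags the average case down with the worst case (the latency numbers here are made up for illustration, not real Zen figures):

```python
# Toy latency model (made-up numbers): striping means every access is
# equally likely to hit any slice, so the average latency is the mean
# over all slice distances from the requesting core.
def avg_l3_latency(n_slices: int, base_cycles: int = 30) -> float:
    # Assume +1 cycle per "slice of distance" from the requesting core.
    return sum(base_cycles + d for d in range(n_slices)) / n_slices

print(avg_l3_latency(4))  # 31.5 cycles with 4 slices
print(avg_l3_latency(8))  # 33.5 cycles with 8 slices -- bigger CCX,
                          # longer worst-case wires, worse average latency
```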
 

mikk

Diamond Member
May 15, 2012
4,173
2,211
136
We're discussing memory and uArch bottlenecks which could cause issues in APUs. Neither is dependent on process node in the slightest.

This isn't just architecture related:

A 5600XT performs on par with a Vega64 despite just over half the CUs and just over half the memory bandwidth.

This is process related as well, because a better-performing process node is a big factor: it allows for higher clock speeds. But anyway, comparing Vega with a low-shader-count GPU could be a bad idea.
 

leoneazzurro

Golden Member
Jul 26, 2016
1,010
1,608
136
5600XT clocks are not SO MUCH higher than Vega64's (Vega64: 1247 base / 1546 max boost; 5600XT: 1130 / 1560). It's true that a better process node allows power to be kept in check, so boost can last longer, but the Vega64 also has a LOT more power and cooling at its disposal. Btw, it's also true that the revision of Vega in Renoir is quite different (not only in production process but also in refinement of the power gating and management).
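
Putting numbers on that (a quick sketch using the boost clocks quoted above and the standard CU counts; peak FP32 is the usual shaders × 2 × clock):

```python
# Peak FP32 = CUs * 64 shaders * 2 ops (FMA) * clock, using the boost
# clocks quoted above (Vega64: 1546 MHz, 5600XT: 1560 MHz).
def tflops(cus: int, clock_ghz: float) -> float:
    return cus * 64 * 2 * clock_ghz / 1000.0

vega64 = tflops(64, 1.546)    # ~12.7 TFLOPS
rx5600xt = tflops(36, 1.560)  # ~7.2 TFLOPS
print(f"Vega64 / 5600XT raw throughput: {vega64 / rx5600xt:.2f}x")  # ~1.76x
# Nearly identical boost clocks, ~1.76x the raw FP32, yet similar game
# performance: most of the gap is architecture, not clock speed.
```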
 

maddie

Diamond Member
Jul 18, 2010
4,787
4,771
136
There are not 6 links in the system right now; there are 16. Every core has a direct link to every L3 slice, for a total of 16. We've been over this before.

On x86, the only way for two cores to communicate with each other is through memory access, and on the Zen architecture this happens at the L3. Cores do not link to other cores; they link to L3. The link that core 1 uses to communicate with L3 slice 4 cannot be used by core 4 to connect to L3 slice 1. It is easy to show this by measuring that there is no contention whatsoever if you do just that. The only way you can measure any contention is if multiple cores are simultaneously making requests to the same L3 slice; ergo, cores are fully connected to the L3 with dedicated links. Do not draw too many conclusions from sloppily drawn arrows on top of marketing material.

(Also, assuming 8 L3 slices (and it is not at all necessary for there to be as many of them as there are cores!), the total number of links is 64.)


The compromise is that every core/L3 slice you add adds physical distance, which in turn adds latency to the worst-case L3 access. Because all cache lines are striped equally over every L3 slice, this directly hurts average-case latency. This is why everyone expects the L3 latency of Zen3 to be slightly worse than Zen2's. Power also goes up in direct proportion to the added length of the links.

Other than that, having more links costs (very) few extra transistors at both ends of the link, at the L3 slice and at the core. Arbitrating access between the different cores would probably be a very bad idea, so the L3 will likely have a dedicated "reservation station" for every core, one that can fit at least as many requests as a core can issue before getting a message back that the L3 slice is overworked. But the links themselves are free, because they occupy a part of the die that is not really used for anything else: the upper metal layers above the caches.
How are you getting this number of links?

If you're using a 4-core CCX as we have at present, then the number is 6 links to fully connect each L3 to the others: 4 cores with 3 links each, divided by 2 as they are bidirectional. (4!/(4-2)!)/2 = 6

For an 8-core CCX with 8 L3 slices you will get (8!/(8-2)!)/2 = 28
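
For what it's worth, that count is just the binomial coefficient C(n, 2); a quick check of both figures:

```python
from math import comb

# Fully interlinking n slices pairwise needs C(n, 2) bidirectional links.
for n in (4, 8):
    print(f"{n} slices fully interlinked: {comb(n, 2)} links")
# 4 -> 6 and 8 -> 28, matching (n!/(n-2)!)/2 above.
```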
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,422
1,759
136
How are you getting this number of links?

If you're using a 4-core CCX as we have at present, then the number is 6 links to fully connect each L3 to the others: 4 cores with 3 links each, divided by 2 as they are bidirectional. (4!/(4-2)!)/2 = 6

Each core needs a link to each L3 slice. The controller on the closest L3 slice is still quite far from the core. The L3 slice closest to a core is not associated with it in any special way, and requests going to other slices do not travel through it. The links are only bidirectional in the sense that a core can both read and write through the link; core 1 communicating with L3 slice 4 does not use the same link as core 4 communicating with L3 slice 1. The links do not even terminate near each other; there is no reason why you'd need to take that huge detour.

So every core has a link to every L3 slice, or 4*4 = 16. With an 8-core CCX with 8 L3 slices (and I keep pointing this out: there is no fundamental reason why L3 slice count must equal core count!), there will be 8*8 = 64 links.
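
The same counting as a one-liner (nothing AMD-specific here, just the arithmetic):

```python
# Dedicated point-to-point links, one per (core, L3 slice) pair:
# a full crossbar of cores * slices links.
def crossbar_links(cores: int, slices: int) -> int:
    return cores * slices

print(crossbar_links(4, 4))  # 16 for today's 4-core CCX
print(crossbar_links(8, 8))  # 64 for a hypothetical 8-core, 8-slice CCX
```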

Or, with my terrible paint skills:

Note that the orange link is still a massive distance on die that requires significant infrastructure to cross; you can't just magic it away because the two blocks are close to each other in the block diagram.
 

maddie

Diamond Member
Jul 18, 2010
4,787
4,771
136
Each core needs a link to each L3 slice. The controller on the closest L3 slice is still quite far from the core. The L3 slice closest to a core is not associated with it in any special way, and requests going to other slices do not travel through it. The links are only bidirectional in the sense that a core can both read and write through the link; core 1 communicating with L3 slice 4 does not use the same link as core 4 communicating with L3 slice 1. The links do not even terminate near each other; there is no reason why you'd need to take that huge detour.

So every core has a link to every L3 slice, or 4*4 = 16. With an 8-core CCX with 8 L3 slices (and I keep pointing this out: there is no fundamental reason why L3 slice count must equal core count!), there will be 8*8 = 64 links.

Or, with my terrible paint skills:
[attached diagram]
Note that the orange link is still a massive distance on die that requires significant infrastructure to cross; you can't just magic it away because the two blocks are close to each other in the block diagram.
Asking. Are you sure that is the connection?

The reason I ask is that if each (green) is connected to its own (red), and the (red) are then fully interlinked, you would only need 10 (6+4) links to do what you're saying. A big saving over 16.
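
Counting that proposed topology the same way (again just arithmetic; "green" = cores, "red" = L3 slices, as in the attached diagram):

```python
from math import comb

# Proposed topology: each of the n cores links to "its own" slice
# (n links), and the n slices are fully interlinked (C(n, 2) links).
def core_plus_slice_mesh_links(n: int) -> int:
    return n + comb(n, 2)

print(core_plus_slice_mesh_links(4))  # 4 + 6 = 10, vs 16 for a crossbar
print(core_plus_slice_mesh_links(8))  # 8 + 28 = 36, vs 64 for a crossbar
```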

 

Attachments

  • AMD_Zen_2_CCD_links-mod.png
Reactions: Vattila

Tuna-Fish

Golden Member
Mar 4, 2011
1,422
1,759
136
Asking. Are you sure that is the connection?

The reason I ask is that if each (green) is connected to its own (red), and the (red) are then fully interlinked, you would only need 10 (6+4) links to do what you're saying.

Yes, very sure, because otherwise one of the L3 slices would be much faster than the other 3, due to only having to do one hop. Their speeds are too similar for this.

A big saving over 16.

It's a big saving on something that does not need to be saved. The links themselves are free (as they are implemented in the upper metal layers that the cache does not need above it); only the endpoints have any cost, and by having dedicated links you reduce the complexity of the endpoints, as you remove any need to share the links themselves.

Other than that, by having dedicated links you get a nice latency win both because the signal doesn't need to come down from the upper layers until it's at the endpoint, and because there is no processing at the midpoint.
 
Reactions: Ajay

Shivansps

Diamond Member
Sep 11, 2013
3,873
1,527
136
surprised no one linked this:


So I'm guessing Warhol is a DDR5 part on a new socket, with the same chiplet but a different IO die.
Ra.* looks to be the first Zen4 part, alongside Rembrandt. I wonder if Ra.* will be an I/O die + GPU die (Navi3) + chiplet die, while Rembrandt is a monolithic die targeting mobile.

I would have preferred Zen4 to be a 2021 product; hopefully it's early '22.
I also wonder if whatever comes after Van Gogh is 8-core Zen3, and if Rembrandt is >8 cores; both of those should be 5nm, which could allow for that.



Notice how there is a Navi IGP on Raphael; it looks like CPU-only is going to disappear.

Also Cezanne is Vega 7? WTH AMD.
 

soresu

Platinum Member
Dec 19, 2014
2,966
2,188
136
Notice how there is a Navi IGP on Raphael
I thought Raphael was the Vermeer successor?

Edit: Never mind, I read the rest of your post.

It doesn't necessarily mean that CPU-only will die out just because Raphael is an APU; it may just mean that they are segmenting APUs more finely, with a high end and a low end.

Threadripper will then become the enthusiast CPU only platform.
 

eek2121

Diamond Member
Aug 2, 2005
3,051
4,276
136
And what about Tiger Lake? Did you see a test? We don't even know if we will see Tiger Lake at <10W this year.

Considering Intel has stated that Tiger Lake is a 10-65W design, you will likely never see parts that are <10W. Both Intel and AMD rarely release 6W parts; AMD's 6W parts, as an example, are still 14nm.

I think this was the first time ever that you didn't forget to put '/s' at the end of your post

Most of us have ignored him at this point. He does not contribute to the discussion.
 

Antey

Member
Jul 4, 2019
105
153
116
Considering Intel has stated that Tiger Lake is a 10-65W design, you will likely never see parts that are <10W. Both Intel and AMD rarely release 6W parts; AMD's 6W parts, as an example, are still 14nm.

If their plan is to replace the Kaby Lake Y series with a Lakefield successor in late 2021 or early 2022, I think Van Gogh has nothing to fear from Intel.
 