Speculation: Ryzen 4000 series/Zen 3


mikk

Diamond Member
May 15, 2012
4,173
2,211
136
Minimum 8 CUs, which is 512 shaders.

So do you have a source?

Don't worry about tests for the time being; just note RDNA's improvement in graphics and bandwidth efficiency over Vega. A 5600XT performs on par with a Vega64 despite just over half the CUs and just over half the memory bandwidth.

You should have a test, because you did claim it will demolish Tigerlake. I have to wonder, because we still don't even know exactly how Tigerlake performs in the real world and at lower power. A statement like this is bold, so I'm asking for your source.
 

uzzi38

Platinum Member
Oct 16, 2019
2,703
6,405
146
So do you have a source?



You should have a test, because you did claim it will demolish Tigerlake. I have to wonder, because we still don't even know exactly how Tigerlake performs in the real world and at lower power. A statement like this is bold, so I'm asking for your source.
Yeah, _rogame's findings in driver code tell us we're looking at a multiple of 4 WGPs. 4 WGPs is 8 CUs, which is 512 shaders.

As for the second half, wait and see. But Van Gogh and Mero are both not normal parts anyway, so w/e.
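
For reference, the arithmetic behind that shader count, as a minimal sketch (the 2-CUs-per-WGP and 64-shaders-per-CU figures follow from the "8 CUs = 512 shaders" statement above):

```python
# RDNA organisation: each WGP holds 2 CUs, and each CU has 64 shaders
# (hence 8 CUs = 512 shaders, as stated above).
CUS_PER_WGP = 2
SHADERS_PER_CU = 64

def shaders_from_wgps(wgps: int) -> int:
    """Total shader count for a given WGP count."""
    return wgps * CUS_PER_WGP * SHADERS_PER_CU

# Multiples of 4 WGPs, per the driver-code findings:
for wgps in (4, 8, 12, 16):
    print(f"{wgps} WGPs -> {shaders_from_wgps(wgps)} shaders")
# 4 WGPs -> 512 shaders, i.e. the minimum 8 CU configuration.
```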
 
Reactions: raghu78

Antey

Member
Jul 4, 2019
105
153
116
Interesting observation. Did you see a test? Could you share the link? How many units will it have?

And what about Tiger Lake? Did you see a test? We don't even know if we will see Tiger Lake at <10W this year.
 

RTX2080

Senior member
Jul 2, 2018
322
511
136
I have two points to make to that.

1. AMD's roadmaps in the last 2 years have changed incredibly drastically. I doubt any claims that this is very old, and even if it were, it's most certainly not accurate. You've seen one of the changes to the roadmap in this thread, I'm sure - Zen 3 was originally going to have SMT4, and AMD have talked about other changes, such as Zen 2 getting the TAGE branch predictor of Zen 3.

2. The people I've talked to that spend time on Weibo have said not to trust him, as he's highly pro-Intel and has gone so far as to make up stuff regarding AMD to make Intel look better. Treat his roadmaps as something that needs to be proven accurate first before we take them as anything remotely close to fact.

I don't know what AMD is going to do with Zen3 or later gens, and I don't know that guy's personality, but that roadmap looks a bit odd, because I've never seen AMD make that style of roadmap: Xnm + codename + mem gen + arch in a single blank lol
 

uzzi38

Platinum Member
Oct 16, 2019
2,703
6,405
146
I don't know what AMD is going to do with Zen3 or later gens, and I don't know that guy's personality, but that roadmap looks a bit odd, because I've never seen AMD make that style of roadmap: Xnm + codename + mem gen + arch in a single blank lol
No comment on AMD's normal roadmap style. However, I will point out there is never a reason to have so many different types of chips on the same roadmap.

AMD is barely telling OEMs about the silicon they need to know about, never mind the silicon they may never see (such as Van Gogh).
 

mikk

Diamond Member
May 15, 2012
4,173
2,211
136
Yeah, _rogame's findings in driver code tell us we're looking at a multiple of 4 WGPs. 4 WGPs is 8 CUs, which is 512 shaders.

As for the second half, wait and see. But Van Gogh and Mero are both not normal parts anyway, so w/e.

You should wait and see; I didn't make a bold statement like 'Van Gogh will demolish Tigerlake' without real info. The big Vega variants couldn't utilize all of their shaders well enough; it remains to be seen how big the difference will be on a very low shader count GPU. The RX 570 and RX 5500 XT have similar FP32 performance, yet their real-world performance differs by 20%. And Vega/Polaris were manufactured on GlobalFoundries' 14 nm process; TSMC 7nm is vastly better than that.
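
To put rough numbers on that comparison (an illustrative sketch; the shader counts and boost clocks are the commonly cited specs for these cards, and peak FP32 is the usual shaders × 2 × clock):

```python
# Peak FP32 = shaders * 2 (FMA counts as 2 ops/clock) * clock in GHz,
# giving GFLOPS; divide by 1000 for TFLOPS. Boost clocks are approximate.
def fp32_tflops(shaders: int, clock_ghz: float) -> float:
    return shaders * 2 * clock_ghz / 1000.0

rx570 = fp32_tflops(2048, 1.244)     # Polaris, 32 CUs
rx5500xt = fp32_tflops(1408, 1.845)  # RDNA, 22 CUs
print(f"RX 570:     {rx570:.1f} TFLOPS")    # ~5.1
print(f"RX 5500 XT: {rx5500xt:.1f} TFLOPS") # ~5.2
# Near-identical peak FP32, yet ~20% apart in real performance:
# the difference is how well each architecture feeds its shaders.
```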
 

lobz

Platinum Member
Feb 10, 2017
2,057
2,856
136
We have just received bad news.

AMD aren't going to be able to meet their timelines. They won't be able to use chiplets to jump from node to node. It is simply too expensive; AMD is broke, and they aren't going to stay on the leading edge anymore. With this, Frontier and El Capitan are planned to be delayed to 2023/2025 respectively.

7nm for 2019, 2020, 2021, 2022 (APU). It is obviously more expensive to have 5nm APUs before 5nm CPU chiplets. So Durango/Rembrandt aren't 5nm.

AMD's venture into full-fledged 5nm EUV is a failure, just like everyone else's. Failed, unyielding test chips have doomed AMD to a later, fixed 5nm node for 2022 CPUs and 2023 APUs.



I have just received awful news: AMD CEO Lisa Su had an internal company affair before becoming CEO and will be booted out of AMD immediately. /s
I think this was the first time ever that you didn't forget to put '/s' at the end of your post
 

uzzi38

Platinum Member
Oct 16, 2019
2,703
6,405
146
You should wait and see; I didn't make a bold statement like 'Van Gogh will demolish Tigerlake' without real info. The big Vega variants couldn't utilize all of their shaders well enough; it remains to be seen how big the difference will be on a very low shader count GPU. The RX 570 and RX 5500 XT have similar FP32 performance, yet their real-world performance differs by 20%. And Vega/Polaris were manufactured on GlobalFoundries' 14 nm process; TSMC 7nm is vastly better than that.

The 5500XT has plenty of other problems on its plate, with bottlenecks that are neither core nor memory related, amongst other things. AMD made a lot of cuts to minimise costs on that die.

Process node has no effect on clearly uArch-related bottlenecks like the ones we're discussing.

Anyway, you'll get to witness firsthand what happens when AMD starts diverting resources to Radeon properly, when RDNA2 launches soon. All the comments saying AMD is going to quickly fall to third place in the GPU market in terms of performance are hilarious, to say the least.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,422
1,759
136
The change from 6 links (4 cores) to 32 links (8 cores) to ensure every slice is connected to every slice seems to me like a rather huge change to be sure that this is no problem and comes without any additional power cost.

There are not 6 links in the system right now; there are 16. Every core has a direct link to every L3 slice, for a total of 16. We've been over this before.

On x86, the only way for two cores to communicate with each other is through memory access, and on the Zen architecture this happens at the L3. Cores do not link to other cores; they link to L3. The link that core 1 uses to communicate with L3 slice 4 cannot be used by core 4 to connect to L3 slice 1. It is easy to show this by measuring that there is no contention whatsoever if you do just that. The only way you can measure any contention is if multiple cores are simultaneously making requests to the same L3 slice; ergo, cores are fully connected to the L3 with dedicated links. Do not draw too many conclusions from sloppily drawn arrows on top of marketing material.
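
A toy model of that argument (purely illustrative, not a measurement of real hardware):

```python
# Toy model of the dedicated-links claim: every (core, slice) pair has
# its own link, so contention can only appear when two requests target
# the same slice in the same cycle -- never on the links themselves.
from collections import Counter

def slice_conflicts(requests: list[tuple[int, int]]) -> int:
    """requests = [(core, slice), ...] all issued in the same cycle."""
    hits = Counter(s for _, s in requests)
    return sum(n - 1 for n in hits.values() if n > 1)

# Core 1 -> slice 4 while core 4 -> slice 1: separate links, no conflict.
print(slice_conflicts([(1, 4), (4, 1)]))  # 0
# Cores 1 and 3 both hitting slice 2: one request must wait.
print(slice_conflicts([(1, 2), (3, 2)]))  # 1
```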

(Also, assuming 8 L3 slices (and it is not at all necessary for there to be as many of them as there are cores!), the total number of links is 64.)

If that's really no problem without any compromises, to what number of cores is that scalable?
The compromise is that every core/L3 slice you add adds physical distance, which in turn adds latency to the worst-case L3 access. Because all cache lines are striped equally over every L3 slice, this directly hurts average-case latency. This is why everyone expects the L3 latency of Zen3 to be slightly worse than Zen2's. Power also goes up in direct proportion to the added length of the links.

Other than that, having more links costs (very) few extra transistors at both ends of the link, at the L3 slice and at the core. Arbitrating access between the different cores would probably be a very bad idea, so the L3 will likely have a dedicated "reservation station" for every core, one that can fit at least as many requests as a core can issue before getting a message back that the L3 slice is overworked. But the links themselves are free, because they occupy a part of the die that is not really used for anything else: the upper metal layers above the caches.
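
A rough sketch of why striping drags the average case down with the worst case (the latency numbers here are made up for illustration, not real Zen figures):

```python
# Toy latency model (made-up numbers): striping means every access is
# equally likely to hit any slice, so the average latency is the mean
# over all slice distances from the requesting core.
def avg_l3_latency(n_slices: int, base_cycles: int = 30) -> float:
    # Assume +1 cycle per "slice of distance" from the requesting core.
    return sum(base_cycles + d for d in range(n_slices)) / n_slices

print(avg_l3_latency(4))  # 31.5 cycles with 4 slices
print(avg_l3_latency(8))  # 33.5 cycles with 8 slices -- bigger CCX,
                          # longer worst-case wires, worse average latency
```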
 

mikk

Diamond Member
May 15, 2012
4,173
2,211
136
We're discussing memory and uArch bottlenecks which could cause issues in APUs. Neither is dependent on process node in the slightest.

This isn't just architecture related:

A 5600XT performs on par with a Vega64 despite just over half the CUs and just over half the memory bandwidth.

This is process related as well, because a better-performing process node is a big factor: it allows for higher clock speeds. But anyway, comparing Vega with a low-shader-count GPU could be a bad idea.
 

leoneazzurro

Golden Member
Jul 26, 2016
1,010
1,608
136
5600XT clocks are not SO MUCH higher than Vega64's (Vega64: 1247 base / 1546 max boost; 5600XT: 1130 / 1560). It's true that a better process node allows power to be kept in check, so boost can last longer, but the Vega64 also has a LOT more power and cooling at its disposal. Btw, it's also true that the revision of Vega in Renoir is quite different (not only in production process but also in refinement of the power gating and management).
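
Putting numbers on that (a quick sketch using the boost clocks quoted above and the standard CU counts; peak FP32 is the usual shaders × 2 × clock):

```python
# Peak FP32 = CUs * 64 shaders * 2 ops (FMA) * clock, using the boost
# clocks quoted above (Vega64: 1546 MHz, 5600XT: 1560 MHz).
def tflops(cus: int, clock_ghz: float) -> float:
    return cus * 64 * 2 * clock_ghz / 1000.0

vega64 = tflops(64, 1.546)    # ~12.7 TFLOPS
rx5600xt = tflops(36, 1.560)  # ~7.2 TFLOPS
print(f"Vega64 / 5600XT raw throughput: {vega64 / rx5600xt:.2f}x")  # ~1.76x
# Nearly identical boost clocks, ~1.76x the raw FP32, yet similar game
# performance: most of the gap is architecture, not clock speed.
```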
 

maddie

Diamond Member
Jul 18, 2010
4,787
4,771
136
There are not 6 links in the system right now; there are 16. Every core has a direct link to every L3 slice, for a total of 16. We've been over this before.

On x86, the only way for two cores to communicate with each other is through memory access, and on the Zen architecture this happens at the L3. Cores do not link to other cores; they link to L3. The link that core 1 uses to communicate with L3 slice 4 cannot be used by core 4 to connect to L3 slice 1. It is easy to show this by measuring that there is no contention whatsoever if you do just that. The only way you can measure any contention is if multiple cores are simultaneously making requests to the same L3 slice; ergo, cores are fully connected to the L3 with dedicated links. Do not draw too many conclusions from sloppily drawn arrows on top of marketing material.

(Also, assuming 8 L3 slices (and it is not at all necessary for there to be as many of them as there are cores!), the total number of links is 64.)


The compromise is that every core/L3 slice you add adds physical distance, which in turn adds latency to the worst-case L3 access. Because all cache lines are striped equally over every L3 slice, this directly hurts average-case latency. This is why everyone expects the L3 latency of Zen3 to be slightly worse than Zen2's. Power also goes up in direct proportion to the added length of the links.

Other than that, having more links costs (very) few extra transistors at both ends of the link, at the L3 slice and at the core. Arbitrating access between the different cores would probably be a very bad idea, so the L3 will likely have a dedicated "reservation station" for every core, one that can fit at least as many requests as a core can issue before getting a message back that the L3 slice is overworked. But the links themselves are free, because they occupy a part of the die that is not really used for anything else: the upper metal layers above the caches.
How are you getting this number of links?

If you're using a 4-core CCX as we have at present, then the number is 6 links to fully connect each L3 to the others: 4 cores with 3 links each, divided by 2 as they are bidirectional. (4!/(4-2)!)/2 = 6

For an 8-core CCX with 8 L3 slices you will get (8!/(8-2)!)/2 = 28
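
For what it's worth, that count is just the binomial coefficient C(n, 2); a quick check of both figures:

```python
from math import comb

# Fully interlinking n slices pairwise needs C(n, 2) bidirectional links.
for n in (4, 8):
    print(f"{n} slices fully interlinked: {comb(n, 2)} links")
# 4 -> 6 and 8 -> 28, matching (n!/(n-2)!)/2 above.
```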
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,422
1,759
136
How are you getting this number of links?

If you're using a 4-core CCX as we have at present, then the number is 6 links to fully connect each L3 to the others: 4 cores with 3 links each, divided by 2 as they are bidirectional. (4!/(4-2)!)/2 = 6

Each core needs a link to each L3 slice. The controller on the closest L3 slice is still quite far from the core. The L3 slice closest to a core is not associated with it in any special way, and requests going to other slices do not travel through it. The links are only bidirectional in the sense that a core can both read and write through the link; core 1 communicating with L3 slice 4 does not use the same link as core 4 communicating with L3 slice 1. The links do not even terminate near each other; there is no reason why you'd need to take that huge detour.

So every core has a link to every L3 slice, or 4*4 = 16. With an 8-core CCX with 8 L3 slices (and I keep pointing this out: there is no fundamental reason why L3 slice count must equal core count!), there will be 8*8 = 64 links.
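
The same counting as a one-liner (nothing AMD-specific here, just the arithmetic):

```python
# Dedicated point-to-point links, one per (core, L3 slice) pair:
# a full crossbar of cores * slices links.
def crossbar_links(cores: int, slices: int) -> int:
    return cores * slices

print(crossbar_links(4, 4))  # 16 for today's 4-core CCX
print(crossbar_links(8, 8))  # 64 for a hypothetical 8-core, 8-slice CCX
```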

Or, with my terrible paint skills:

Note that the orange link is still a massive distance on die that requires significant infrastructure to cross; you can't just magic it away because the two blocks are close to each other in the block diagram.
 

maddie

Diamond Member
Jul 18, 2010
4,787
4,771
136
Each core needs a link to each L3 slice. The controller on the closest L3 slice is still quite far from the core. The L3 slice closest to a core is not associated with it in any special way, and requests going to other slices do not travel through it. The links are only bidirectional in the sense that a core can both read and write through the link; core 1 communicating with L3 slice 4 does not use the same link as core 4 communicating with L3 slice 1. The links do not even terminate near each other; there is no reason why you'd need to take that huge detour.

So every core has a link to every L3 slice, or 4*4 = 16. With an 8-core CCX with 8 L3 slices (and I keep pointing this out: there is no fundamental reason why L3 slice count must equal core count!), there will be 8*8 = 64 links.

Or, with my terrible paint skills:
[attached diagram]
Note that the orange link is still a massive distance on die that requires significant infrastructure to cross; you can't just magic it away because the two blocks are close to each other in the block diagram.
Asking. Are you sure that is the connection?

The reason I ask is that if each (green) is connected to its own (red), and the (red) are then fully interlinked, you would only need 10 (6+4) links to do what you're saying. A big saving over 16.
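
Counting that proposed topology the same way (again just arithmetic; "green" = cores, "red" = L3 slices, as in the attached diagram):

```python
from math import comb

# Proposed topology: each of the n cores links to "its own" slice
# (n links), and the n slices are fully interlinked (C(n, 2) links).
def core_plus_slice_mesh_links(n: int) -> int:
    return n + comb(n, 2)

print(core_plus_slice_mesh_links(4))  # 4 + 6 = 10, vs 16 for a crossbar
print(core_plus_slice_mesh_links(8))  # 8 + 28 = 36, vs 64 for a crossbar
```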

 

Attachments

  • AMD_Zen_2_CCD_links-mod.png
Reactions: Vattila

Tuna-Fish

Golden Member
Mar 4, 2011
1,422
1,759
136
Asking. Are you sure that is the connection?

The reason I ask is that if each (green) is connected to its own (red), and the (red) are then fully interlinked, you would only need 10 (6+4) links to do what you're saying.

Yes, very sure, because otherwise one of the L3 slices would be much faster than the other 3, due to only having to do one hop. Their speeds are too similar for this.

A big saving over 16.

It's a big saving on something that does not need to be saved. The links themselves are free (as they are implemented in the upper metal layers that the cache does not need above it); only the endpoints have any cost, and by having dedicated links you reduce the complexity of the endpoints, as you remove any need to share the links themselves.

Other than that, by having dedicated links you get a nice latency win both because the signal doesn't need to come down from the upper layers until it's at the endpoint, and because there is no processing at the midpoint.
 
Reactions: Ajay

Shivansps

Diamond Member
Sep 11, 2013
3,873
1,527
136
surprised no one linked this:


So I'm guessing Warhol is a DDR5 part on a new socket, with the same chiplet but a different IO die.
Ra.* looks to be the first Zen4 part, alongside Rembrandt. I wonder if Ra.* will be an I/O die + GPU die (Navi3) + chiplet die, while Rembrandt is a monolithic die targeting mobile.

I would have preferred Zen4 to be a 2021 product; hopefully it's early '22.
I also wonder if whatever comes after Van Gogh is 8-core Zen3, and if Rembrandt is >8 cores; both of those should be 5nm, which could allow for that.



Notice how there is a Navi IGP on Raphael; it looks like CPU-only is going to disappear.

Also Cezanne is Vega 7? WTH AMD.
 

soresu

Platinum Member
Dec 19, 2014
2,966
2,188
136
Notice how there is a Navi IGP on Raphael
I thought Raphael was the Vermeer successor?

Edit: Never mind, I read the rest of your post.

It doesn't necessarily mean that CPU-only will die out just because Raphael is an APU; it may just mean that they are segmenting APUs more finely, with a high end and a low end.

Threadripper will then become the enthusiast CPU only platform.
 

eek2121

Diamond Member
Aug 2, 2005
3,051
4,276
136
And what about Tiger Lake? Did you see a test? We don't even know if we will see Tiger Lake at <10W this year.

Considering Intel has stated that Tiger Lake is a 10-65W design, you will likely never see parts that are <10W. Both Intel and AMD rarely release 6W parts; AMD's 6W parts, as an example, are still 14nm.

I think this was the first time ever that you didn't forget to put '/s' at the end of your post

Most of us have ignored him at this point. He does not contribute to the discussion.
 

Antey

Member
Jul 4, 2019
105
153
116
Considering Intel has stated that Tiger Lake is a 10-65W design, you will likely never see parts that are <10W. Both Intel and AMD rarely release 6W parts; AMD's 6W parts, as an example, are still 14nm.

If their plan is to replace the Kaby Lake Y series with a Lakefield successor in late 2021 or early 2022, I think Van Gogh has nothing to fear from Intel.
 