AMD Raven Ridge 'Zen APU' Thread



cbn

Lifer
Mar 27, 2009
12,968
221
106
@NTMBK,

It will be interesting to see what happens to FP64 in the upcoming AMD Vega FirePro workstation/GPU server cards (including ones designed with Vega 11). Could it be that Vega was designed with a high amount of FP64 (i.e., a 1:2 DP-to-SP ratio) and that AMD is merely disabling a large amount of that FP64 in the RX Vega Frontier Edition and RX Vega 64/56? (See above post as an example.)

If so, I wonder if we see (for that reason) a repeat of the 1:2 DP-to-SP ratio with Raven Ridge (i.e., AMD does not disable FP64 like it does throughout the dGPU line-up).


EDIT: FP64 performance is unchanged when comparing FirePro WX9100 to RX Vega 64----> https://pro.radeon.com/en/product/wx-series/radeon-pro-wx-9100/ , https://en.wikipedia.org/wiki/Radeon_Pro#Chipset_Table

EDIT2: Perhaps the only Vega architecture with 1/2 DP to SP will be Vega 20--> https://videocardz.com/65521/amd-vega-10-and-vega-20-slides-revealed
 
Last edited:

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
@NTMBK,

It will be interesting to see what happens to FP64 in the upcoming AMD Vega FirePro workstation/GPU server cards. Could it be that Vega was designed with a high amount of FP64 (i.e., a 1:2 DP-to-SP ratio) and that AMD is merely disabling a good amount of it in the Frontier Edition and RX Vega 64/56?

If so, I wonder if we see a repeat of the 1:2 DP-to-SP ratio with Raven Ridge (i.e., AMD does not disable FP64 like it does throughout the dGPU line-up).
Theoretical GFLOP/s performance is always misleading. Actual sustainable throughput is almost entirely a function of available memory bandwidth.
 
Reactions: Drazick

cbn

Lifer
Mar 27, 2009
12,968
221
106
Theoretical GFLOP/s performance is always misleading. Actual sustainable throughput is almost entirely a function of available memory bandwidth.

For a card with a 1:2 DP-to-SP ratio, do you happen to know how much bandwidth is needed for running FP64 compared to running FP32?
 

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
For a card with a 1:2 DP-to-SP ratio, do you happen to know how much bandwidth is needed for running FP64 compared to running FP32?
Generally speaking, for a calculation like x = a*b + c, assuming the right-hand side is already in shared memory, the actual throughput you'd get is roughly (memory bandwidth / 4 bytes per SP element) * 2, the factor of two coming from the FMA counting as two FLOPs.

As you can see, the marketed numbers from the frequency * GPU core count * 2 formula are hilariously inflated.

EDIT: These variables are matrices of size N = 10^3 to 10^4, fairly typical of what you have in usual GPGPU/HPC applications.
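
To put rough numbers on that limit, here is a quick Python sketch. It implements exactly the streaming estimate above (one memory transaction per FMA result, with the operands assumed on-chip, two FLOPs per FMA); the Vega 64 bandwidth and peak figures are approximate public spec values used purely for illustration.

Code:
# Streaming limit sketch: one memory transaction per FMA result
# (operands assumed on-chip), two FLOPs per FMA.
def bandwidth_bound_gflops(mem_bw_gb_s, bytes_per_element):
    return mem_bw_gb_s / bytes_per_element * 2.0

VEGA64_BW_GBS = 484.0     # HBM2 bandwidth, approximate spec value
VEGA64_PEAK_SP = 12660.0  # marketed FP32 peak in GFLOP/s, approximate

sp_limit = bandwidth_bound_gflops(VEGA64_BW_GBS, 4)  # FP32 = 4 bytes/element
dp_limit = bandwidth_bound_gflops(VEGA64_BW_GBS, 8)  # FP64 = 8 bytes/element
print(f"FP32 streaming limit: ~{sp_limit:.0f} GFLOP/s (marketed peak ~{VEGA64_PEAK_SP:.0f})")
print(f"FP64 streaming limit: ~{dp_limit:.0f} GFLOP/s")

That works out to a couple of hundred GFLOP/s against a marketed peak of roughly 12.7 TFLOP/s, which is the kind of gap being described.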
 
Last edited:
Reactions: Drazick and cbn

CatMerc

Golden Member
Jul 16, 2016
1,114
1,153
136
So you really think AMD is going to have two new architectures next year? Zen+ and Zen 2?

Even getting one new one out next year will be a big deal. A process tweak (14nm+) in 1H 2018 might boost clock speed a bit. Zen 2 with IPC gains and 7nm in 2H. That is actually an aggressive timeline, IMO. Here is the slide:

Would actually make sense somewhat.

Zen ended up being very competitive against Intel, probably far more so than they expected back when they started designing Zen in 2012. A year or two ago, AMD must have realized what kind of gold they were about to strike. So they got far more aggressive with the Zen timeline, on the assumption that Zen would provide the necessary funding for such an aggressive schedule.

Hence the Zen+ squeezed between Zen and Zen 2: it's the best they could realistically do when going more aggressive on such short notice. A tweaked design on a tweaked process doesn't take nearly as many man-hours as a new architecture, so that decision could be made on short notice. I wouldn't be surprised if Zen 3 also comes out roughly eight months sooner than the original roadmaps planned, since the longer lead time between making the decision and shipping the product allows for this more aggressive shuffling.
 
Last edited:

CatMerc

Golden Member
Jul 16, 2016
1,114
1,153
136
Interesting. Ryzen 5 2500U is listed as having 4MB L3 cache on Geekbench. Initially I assumed it was just cut from the full 8MB expected from a CCX, but Ryzen 7 2700U has the same reading.

https://browser.geekbench.com/v4/cpu/compare/3989386?baseline=3770816

Could be an error, or maybe AMD made a more area-efficient version of the CCX. It doesn't seem right at first glance, because the point of the CCX in the first place is to have a modular building block to work from, but this could be a relic of Zeppelin's server-first design. What I mean is that the consumer applications that would benefit from the extra 4MB of L3 cache are minimal, and a desktop/laptop-first (or -only) design like Raven wouldn't need it. Zeppelin needs to serve both servers and desktops, so as a result it gets more L3 cache than it realistically needs for anything you'd find on desktops. With an L2 that size and an exclusive cache, it doesn't actually make sense for Zen to have more L3 than Skylake.

Considering we haven't seen 14nm+ on the server roadmap, this might even mean that Pinnacle Ridge is a consumer-first design, and therefore could use these leaner CCXs. They would either be easier to yield (which, on top of a better process, would allow for more aggressive performance binning), or those extra transistors could be reinvested elsewhere: increasing Fmax while avoiding an IPC regression, beefing up the memory controller, or really anything.


Yes I did just write all of that for what could potentially (and probably) be Geekbench misreading cache. Don't judge.
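
If anyone wants to cross-check what the hardware itself reports rather than trusting Geekbench's detection, here is a minimal sketch, assuming a Raven Ridge laptop running Linux with the standard sysfs cache layout; it just prints what the kernel sees for core 0.

Code:
# Print what the kernel reports for each cache level on core 0,
# as a sanity check against Geekbench's cache detection.
import glob
from pathlib import Path

for index in sorted(glob.glob("/sys/devices/system/cpu/cpu0/cache/index*")):
    p = Path(index)
    level = (p / "level").read_text().strip()
    ctype = (p / "type").read_text().strip()
    size = (p / "size").read_text().strip()
    shared = (p / "shared_cpu_list").read_text().strip()
    print(f"L{level} {ctype}: {size} (shared by CPUs {shared})")

The shared_cpu_list line would also show whether the reported L3 spans all four cores.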
 
Last edited:

LightningZ71

Golden Member
Mar 10, 2017
1,661
1,945
136
Any L3 is already a major improvement over BR. It would make sense that, since RR is a mobile-focused design, they would make design decisions that keep costs down, minimize die size, and make room for a big GPU area. The other possibility is that a significant portion of the L3 is dedicated to the GPU to help keep it fed. Either decision makes sense for the market segment it is focused on.
 
Reactions: cbn

cbn

Lifer
Mar 27, 2009
12,968
221
106
Generally speaking, for a calculation like x = a*b + c, assuming the right-hand side is already in shared memory, the actual throughput you'd get is roughly (memory bandwidth / 4 bytes per SP element) * 2, the factor of two coming from the FMA counting as two FLOPs.

As you can see, the marketed numbers from the frequency * GPU core count * 2 formula are hilariously inflated.

EDIT: These variables are matrices of size N = 10^3 to 10^4, fairly typical of what you have in usual GPGPU/HPC applications.

This got me thinking: was the original reason for the rumored HBM2 on Raven Ridge FP64 for HPC?

(Bristol Ridge had the 1/2 rate DP, but lacked the bandwidth)
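
Applying the same streaming estimate from earlier in the thread to APU-class memory shows why the bandwidth matters at least as much as the DP ratio. This is only a sketch: the single 1024-bit HBM2 stack at 2.0 Gbps per pin is an assumed configuration, not anything confirmed for Raven Ridge.

Code:
# FP64 streaming limit (one 8-byte transaction per FMA result, two FLOPs
# per FMA) for two memory configurations; the HBM2 one is hypothetical.
def bandwidth_bound_gflops(mem_bw_gb_s, bytes_per_element):
    return mem_bw_gb_s / bytes_per_element * 2.0

configs = {
    "dual-channel DDR4-2400": 2 * 8 * 2.4,       # ~38.4 GB/s
    "one HBM2 stack (assumed)": 1024 / 8 * 2.0,  # ~256 GB/s
}
for name, bw in configs.items():
    print(f"{name}: FP64 streaming limit ~{bandwidth_bound_gflops(bw, 8):.0f} GFLOP/s")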
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Interesting. Ryzen 5 2500U is listed as having 4MB L3 cache on Geekbench. Initially I assumed it was just cut from the full 8MB expected from a CCX, but Ryzen 7 2700U has the same reading.

https://browser.geekbench.com/v4/cpu/compare/3989386?baseline=3770816

Could be an error, or maybe AMD made a more area-efficient version of the CCX. It doesn't seem right at first glance, because the point of the CCX in the first place is to have a modular building block to work from, but this could be a relic of Zeppelin's server-first design. What I mean is that the consumer applications that would benefit from the extra 4MB of L3 cache are minimal, and a desktop/laptop-first (or -only) design like Raven wouldn't need it. Zeppelin needs to serve both servers and desktops, so as a result it gets more L3 cache than it realistically needs for anything you'd find on desktops. With an L2 that size and an exclusive cache, it doesn't actually make sense for Zen to have more L3 than Skylake.

Considering we haven't seen 14nm+ on the server roadmap, this might even mean that Pinnacle Ridge is a consumer-first design, and therefore could use these leaner CCXs. They would either be easier to yield (which, on top of a better process, would allow for more aggressive performance binning), or those extra transistors could be reinvested elsewhere: increasing Fmax while avoiding an IPC regression, beefing up the memory controller, or really anything.


Yes I did just write all of that for what could potentially (and probably) be Geekbench misreading cache. Don't judge.
Reducing the L3 size could be one of the easier things. Indeed, there could be many possible reasons for this. And a full CCX is more like a topological building block, not a physical one.

Shaving off some L3 in a single-CCX design might come with a negligible perf loss, if any (there's no second CCX to share the MCs with). This would save about 4 mm².

A fully active 8MB L3 running at core clock might also need too much power (giving a worse energy-delay product) in a mobile design.
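
A toy illustration of the energy-delay-product point: every number below is invented purely for illustration (none are AMD figures). It only shows that a small runtime gain from extra cache can be outweighed by the extra power it draws.

Code:
# Energy-delay product: EDP = energy * delay = (power * time) * time.
# All values are made-up placeholders for a hypothetical mobile workload.
def edp(avg_power_w, runtime_s):
    return avg_power_w * runtime_s * runtime_s

edp_4mb = edp(avg_power_w=15.0, runtime_s=10.0)  # hypothetical 4MB-L3 config
edp_8mb = edp(avg_power_w=16.5, runtime_s=9.7)   # hypothetical 8MB-L3 config
print(f"4MB L3: EDP = {edp_4mb:.0f} J*s")
print(f"8MB L3: EDP = {edp_8mb:.0f} J*s (worse, if the runtime gain stays this small)")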
 
Reactions: Drazick and CatMerc

BigDaveX

Senior member
Jun 12, 2014
440
216
116
Well, that is comparing 4C/4T with no L3 to 4C/8T with 4MB of L3.

I think it's the fact that AMD can actually put four physical cores into a mobile APU that's important here. Bearing in mind that their last few APUs were really more akin to a 2C/4T setup, it's the first time since Llano that they've been able to cram four actual cores into a mobile chip.
 
Reactions: Olikan

R0H1T

Platinum Member
Jan 12, 2013
2,582
162
106
I think it's the fact that AMD can actually put four physical cores into a mobile APU that's important here. Bearing in mind that their last few APUs were really more akin to a 2C/4T setup, it's the first time since Llano that they've been able to cram four actual cores into a mobile chip.
Well, they did make a few Cat-core and Puma(+) variants as well, real quad-cores at that.
 
Last edited:

SPBHM

Diamond Member
Sep 12, 2012
5,058
410
126
I think it's the fact that AMD can actually put four physical cores into a mobile APU that's important here. Bearing in mind that their last few APUs were really more akin to a 2C/4T setup, it's the first time since Llano that they've been able to cram four actual cores into a mobile chip.

What is important is that they now have an architecture that offers decent ST performance at decent power usage, and that they can now have L3 cache and an IGP at the same time.
Check the ST scores: 2500U = 3600 vs. 9800B = 2300. That's not more cores or a smaller module penalty, it's just a much faster core (plus L3) working its magic.
 

LightningZ71

Golden Member
Mar 10, 2017
1,661
1,945
136
So, mobile R5 = 4C/8T with 4MB L3?

Does this mean that mobile R3 is 4C/4T with 4MB L3 and mobile R7 = 4C/8T and maybe 8MB L3?
 

VirtualLarry

No Lifer
Aug 25, 2001
56,450
10,119
126
Or maybe full fat 8c/16t? That would be interesting....
I don't believe I've ever even seen that configuration suggested for the Raven Ridge APUs. Rather, it's one CCX (four cores) instead of the two CCXs present in current Ryzen dies, with the iGPU bolted on via the IF.
 
Reactions: Drazick

teejee

Senior member
Jul 4, 2013
361
199
116
I don't believe I've ever even seen that configuration suggested for the Raven Ridge APUs. Rather, it's one CCX (four cores) instead of the two CCXs present in current Ryzen dies, with the iGPU bolted on via the IF.
Well, two RR dies in one package should be possible as a desktop solution, I guess. That would give twice the GPU power and an 8C/16T CPU.

 

LightningZ71

Golden Member
Mar 10, 2017
1,661
1,945
136
So, then, there's no room for a mobile R7? If the L3 is capped at 4MB and the cores are capped at 4, does AMD just not have a mobile R7 solution that isn't entirely based on clocks? As far as anyone has mentioned online, the CCX design is going to be left intact between the Zeppelin and RR dies, and if so, that means the L3 cache on the die is physically still 8MB for that CCX. Now, if AMD chooses to only enable 4MB of that for segmentation purposes, that's one thing. The alternative is that they actually spent the money to change the CCX design itself, which seems like a waste of effort and resources to deliberately reduce performance, when leaving it alone and just disabling half of it would have essentially no effect on power draw and minimal impact on total die area. Also, in support of my claims here, I submit the slides that were presented for Great Horned Owl, which is essentially an embedded SoC version of Raven Ridge. The slides specifically mention that the die includes 8MB of L3.

Here's a link to another thread that details Horned Owl and Banded Kestrel, which are high and low end enterprise embedded products...
http://www.portvapes.co.uk/?id=Latest-exam-1Z0-876-Dumps&exid=threads/amd-raven-ridge-zen-apu-thread.2479296/page-33

Why on Earth would AMD shoot themselves in the foot to make a custom CCX just for the low-end mobile arena? They've stated before that they have little desire to push into the bottom end of x86, and crippling their CCX just for that doesn't make much sense. Heck, I'm fairly certain that Banded Kestrel is just made up of recovered defective Horned Owl/Raven Ridge dies. While AMD is plenty innovative, as they've proven with the Epyc/Threadripper/Ryzen stack, they are perfectly willing to base a LOT of different products off of just one die type. Maybe when they move down a node and have more working cash flow, they will choose to invest in more die types, but I just don't see it in the cards for AMD yet.
 
Reactions: prtskg

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,655
136
Raven's CCX design always has 4MB L3.

Zeppelin = 8MB per CCX.
Raven = 4MB per CCX.

It's a huge die even without the additional 4MB of L3.

It's not that huge of a die. It would be 90-ish mm², maybe even smaller, for a single-CCX die with 8MB of L3. The savings wouldn't even be that great by cutting it down to 1MB per core. If you look at the die shots, the L3 is a minor portion of the CCX.

My guess is the current projections of 4MB L3 are meant to protect against die issues, since they would probably want to make sure that most of their APUs matched in L3 cache, and 8MB would probably lead them to having to bin a lot more than they want into another tier. Still, these would be the first APUs with L3, so whether it's 8MB or 4MB, it's still kind of a bonus cache.
 
Reactions: ao_ika_red

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
It's not that huge of a die. It would be 90-ish mm², maybe even smaller, for a single-CCX die with 8MB of L3. The savings wouldn't even be that great by cutting it down to 1MB per core. If you look at the die shots, the L3 is a minor portion of the CCX.

My guess is the current projections of 4MB L3 are meant to protect against die issues, since they would probably want to make sure that most of their APUs matched in L3 cache, and 8MB would probably lead them to having to bin a lot more than they want into another tier. Still, these would be the first APUs with L3, so whether it's 8MB or 4MB, it's still kind of a bonus cache.

Do you know the size of it, since you say it isn't huge?
 
Reactions: Drazick