New Zen microarchitecture details


superstition

Platinum Member
Feb 2, 2008
2,219
221
101
So it will be "unfair" to use modern video codecs, renderers or mathematical workloads / benchmarks which implement AVX2 and newer, just because Zen doesn't have competitive AVX2 performance? It's time to stop having double standards and stop treating AMD like a disabled child.
I didn't post that argument.

What I spoke to was taking a niche context and extrapolating it into general performance. Shenanigans.

If, let's say, ABD (a hypothetical instruction superset) results in a 25% performance boost in 4% of the typical workload an enthusiast gamer deals with and ABD performance is used to construct the backbone of a benchmark (let's say 60% of its total performance) — don't you think it's rather incorrect to make that the community de facto standard for general CPU performance comparisons?
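
To put rough numbers on that hypothetical, here is a minimal sketch using Amdahl's law. It assumes the 4% and 60% figures are fractions of run time that benefit from the hypothetical ABD extension, which is only one way to read them; the function name `overall_speedup` is made up for illustration.

```python
def overall_speedup(accelerated_fraction, boost):
    """Amdahl's law: overall speedup when only part of the run time gets faster.

    accelerated_fraction: share of run time that benefits (0..1), an assumed reading
    boost: fractional gain on that share, e.g. 0.25 for +25%
    """
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / (1.0 + boost))


typical = overall_speedup(0.04, 0.25)    # typical enthusiast gamer workload
benchmark = overall_speedup(0.60, 0.25)  # benchmark built mostly around ABD

print(f"typical workload: +{(typical - 1) * 100:.1f}%")
print(f"benchmark score:  +{(benchmark - 1) * 100:.1f}%")
```

Under those assumptions the benchmark would report a gain of roughly 13.6% while the typical workload sees less than 1%.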

Or, we could ask ourselves if measuring the FPU performance of a quad FPU 8 integer design is equivalent to measuring the performance of an 8 integer quad FPU design — for the purpose of comparing CPUs' overall performance.

Also, still wondering about this: If AMD were to use the maximum possible chip size for its 14nm process and four Zen CPU cores, what is the maximum iGPU it could squeeze in?
Problem is, HD7770 is already marginal for 1080p gaming except for older or less demanding games. And by the time Zen apus come out, there will be an entire new generation of 14/16 nm dgpus in the hundred dollar range that will offer much better performance than the ancient HD7770. And when making bandwidth comparisons with a dgpu one must also consider that the already limited bandwidth (and thermal budget) must be shared with the cpu.
That's why I asked The Stilt that question. It would be interesting to know what the maximum is that AMD could achieve with its tech (assuming it wanted to go the route of a big chip) — in order to put things like a 7770 into context.
 

superstition

Platinum Member
Feb 2, 2008
2,219
221
101
Of course, AMD or its supporters would *never* do the same and pick best case benchmarks for AMD.
Even if they did it would be a tu quoque issue at best and ignore the fact that Intel is, by far, the dominant force in x86 CPUs.

AMD's weakened position means it is more vulnerable to benchmark shenanigans.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
I didn't post that argument.

If, let's say, ABD (a hypothetical instruction superset) results in a 25% performance boost in 4% of the typical workload an enthusiast gamer deals with and ABD performance is used to construct the backbone of a benchmark (let's say 60% of its total performance) — don't you think it's rather incorrect to make that the community de facto standard for general CPU performance comparisons?

Or, we could ask ourselves if measuring the FPU performance of a quad FPU 8 integer design is equivalent to measuring the performance of an 8 integer quad FPU design — for the purpose of comparing CPUs' overall performance.

Also, still wondering about this: If AMD were to use the maximum possible chip size for its 14nm process and four Zen CPU cores, what is the maximum iGPU it could squeeze in?

That's why I asked The Stilt that question. It would be interesting to know what the maximum is that AMD could achieve with its tech (assuming it wanted to go the route of a big chip) — in order to put things like a 7770 into context.

No you didn't.
As a "de facto" benchmark absolutely not, but neither should it be excluded. Especially when it is a potential game changer for certain workloads.

Technically AMD could probably include any GPU they wanted to into Raven, since the GPU blocks on APUs are pretty much copy and paste. Practically they most likely want and need to stay below ~300mm² size wise, since most likely the yields even on a 232mm² P10 are nothing to brag about.

If we exclude the usual marketing BS targeting purely paper figures, and AMD plays it smart this time:

8GCU @ 920MHz - 1000MHz (2933MHz - 3200MHz DRAM)
11GCU @ 666MHz - 727MHz (2933MHz - 3200MHz DRAM)

Both of these options would be > 30 - 45% faster than the A10-7890K (2400MHz DRAM) iGPU performance-wise. That's due to the bandwidth limitation, which is absolutely monstrous on Godavari but nearly non-existent in the Raven scenario. That includes the (potentially) improved DCC over GCN3 too.
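
As a quick sanity check on those two configurations, the sketch below compares their theoretical FP32 throughput. It assumes 64 SPs per CU and 2 FLOPs per SP per clock (the usual GCN figures); `peak_gflops` is a hypothetical helper for illustration only, and paper throughput says nothing about bandwidth limits.

```python
SP_PER_CU = 64         # assumed: stream processors per GCN compute unit
FLOPS_PER_SP_CLK = 2   # assumed: one fused multiply-add per SP per clock


def peak_gflops(cus, mhz):
    """Theoretical FP32 throughput in GFLOPS for a given CU count and clock."""
    return cus * SP_PER_CU * FLOPS_PER_SP_CLK * mhz / 1000.0


# (CU count, low clock in MHz, high clock in MHz) from the two options above
options = [(8, 920, 1000), (11, 666, 727)]

for cus, low, high in options:
    print(f"{cus} CUs: {peak_gflops(cus, low):.0f} - {peak_gflops(cus, high):.0f} GFLOPS")
```

On paper the two options land within about half a percent of each other at both ends of the clock range, so the trade-off is fewer CUs at higher clocks versus more CUs at lower clocks.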
 

KTE

Senior member
May 26, 2016
478
130
76
L1d has at least 4 cycles (see GCC patch -> +4 cycles delta for each load+ex op). Going to the FPU adds another 3 cycles.
Which isn't a good thing anyway. Too slow if high IPC is the focus. Typical of a speed demon design...

Same with the L3. Needs to be 35-40 cycles max.

3 -> 14 -> 50 is the worst I'd bank on for this chip to be high performance on DT.

Sent from HTC 10
(Opinions are own)
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Zen is by no means a speed demon design. On the contrary actually. It is closer to Hounds than Bulldozer in terms of latencies.
 

KTE

Senior member
May 26, 2016
478
130
76
Zen is by no means a speed demon design. On the contrary actually. It is closer to Hounds than Bulldozer in terms of latencies.
Do you mean that based on the expected clocks, instruction latencies or in terms of the cache latencies?

Hounds was 3 cycle L1, increased to 4 cycle for BD's speed targets. Any fetch from L3/RAM was so slow, ~20 cycles higher than SB IIRC.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
In terms of clocks, similar to a Hounds derivative, and in terms of cache latencies, similar to Hounds. L1 might be an exception though.
I'd say L1 latency on BD has nothing to do with the design speed targets. L1 never was the limit for Fmax on any BD iteration, while L2 was the sole limitation on all of them.
AMD has never mastered L2 caches (or large caches for that matter), no matter if they belong in a CPU or a GCN GPU :sneaky: On GCN at least you can address it to meet the frequency targets.

Let's wait for HotChips :sneaky:
 

yuri69

Senior member
Jul 16, 2013
536
962
136
Which isn't a good thing anyway. Too slow if high IPC is the focus. Typical of a speed demon design...

Same with the L3. Needs to be 35-40 cycles max.

3 -> 14 -> 50 is the worst I'd bank on for this chip to be high performance on DT.
Intel went from Netburst's 2 cycle L1D to 3 cycles for Conroe, and from Conroe's 3 cycles to 4 cycles for Nehalem...

Load to use latency is only a part of the whole thing.
 

JDG1980

Golden Member
Jul 18, 2013
1,663
570
136
Problem is, HD7770 is already marginal for 1080p gaming except for older or less demanding games. And by the time Zen apus come out, there will be an entire new generation of 14/16 nm dgpus in the hundred dollar range that will offer much better performance than the ancient HD7770. And when making bandwidth comparisons with a dgpu one must also consider that the already limited bandwidth (and thermal budget) must be shared with the cpu.

It's true that Cape Verde won't cut the mustard for 1080p in most newer AAA titles (except perhaps at low details). However, you underestimate the popularity of older and less demanding games. The two most popular titles on Steam are DOTA 2 and CS:GO.

Back in 2013, Tom's Hardware tested DOTA 2 and found that HD 7770 can do a minimum of 77.0 FPS (average 85.5). iGPUs weren't in that benchmark but based on comparison with discrete cards it's clear that all except maybe the expensive Iris Pro would fall below the 60 FPS mark.

CS:GO averages 128 FPS on a HD 7770 at 1080p with high details. Most iGPUs can't consistently hit 60 FPS.

Again, these are the two most popular games on Steam, worldwide. You don't need a very good discrete GPU to play them at a consistent 60 FPS @ 1080p with high details, but you do need a dGPU. Raven Ridge should change that. Saving $99 may not be much for some of us in America and western Europe, but in the emerging markets, it's a big deal.
 
Aug 11, 2008
10,451
642
126
Well, that is assuming it reaches HD7770 speeds, which is not a sure thing, and assuming the user is willing to limit himself to those types of games. I am not going to get into an ongoing argument about the merits of igp gaming, but if one is spending hundreds of dollars for a gaming desktop, I just don't buy the "oh I have to save 50 bucks and give up 50% performance" argument. In fact, if one is that financially strapped, he would probably be better off to get a cheap used or close out/reconditioned OEM system and stick in a dgpu.

In any case, it is just too hard to say now, since we don't know the price, or performance, of either Zen APUs or next gen $100-level cards.
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
For those assuming performance and making charts there is one important thing to remember.

Like Nvidia with their 1060 performance claims, AMD is most likely lying/exaggerating Zen performance.

Please don't take AMD's numbers with blind faith. The "Overclocker's dream" and "2.8x the efficiency for the 480" were wrong. Likely Zen's +40% is also wrong.

I'm not knocking AMD specifically. But I would not take AMD, Nvidia or Intel at their word ever. Wait for them to deliver. Hope for the best, but prepare for the worst.
 
Aug 11, 2008
10,451
642
126
For those assuming performance and making charts there is one important thing to remember.

Like Nvidia with their 1060 performance claims, AMD is most likely lying/exaggerating Zen performance.

Please don't take AMD's numbers with blind faith. The "Overclocker's dream" and "2.8x the efficiency for the 480" were wrong. Likely Zen's +40% is also wrong.

I'm not knocking AMD specifically. But I would not take AMD, Nvidia or Intel at their word ever. Wait for them to deliver. Hope for the best, but prepare for the worst.

Well, they keep sticking to this claim. I do believe that in selected benchmarks, they will meet the claim. However, it remains to be seen if this translates to 40% faster overall performance in real world scenarios. And of course, it is also critical what clockspeeds they are able to reach. If they gain 40% ipc and lose 20% clockspeed, they are screwed. (That is just an example, I think they will do better than that.)
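
A quick sketch of that example, assuming performance scales simply as IPC times clock speed (which ignores memory and other bottlenecks); `net_performance` is just an illustrative name.

```python
def net_performance(ipc_gain, clock_change):
    """Relative performance after an IPC change and a clock change.

    Both arguments are fractions: 0.40 means +40%, -0.20 means -20%.
    Assumes performance = IPC x clock, a deliberate simplification.
    """
    return (1.0 + ipc_gain) * (1.0 + clock_change)


net = net_performance(0.40, -0.20)
print(f"net change: {(net - 1) * 100:+.0f}%")
```

With those example numbers the net result is only about 12% faster, far short of the headline 40%.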
 

coercitiv

Diamond Member
Jan 24, 2014
6,631
14,066
136
What is obviously wrong is your statement, the 2.8x is not related to the 480 but to the 470..
We've been over this, they used the 2.8x figure in the RX 480 presentation made by Raja Koduri. Top left title says "RX 480 on 14nm FinFET" while middle of the screen hosts a glorious "UP TO 2.8X PPW". No 470 in sight. When you hold a presentation specifically for RX 480 and include performance/watt data from another product with only a tiny footnote to warn the unsuspecting audience, you are definitely trying to deceive.

Likely Zen's +40% is also wrong.
We've been over this as well: the +40% is the minimum required for a Zen core to not fall behind a construction module in terms of throughput at the same speed. It's really not that special, and does not make or break the product. What makes or breaks Zen is the performance @ 95W - perf/w for both arch and process combined.
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
Quite a straw here, AMD is not Nvidia..

Context: This quote is being used to support the claim that companies deceive.
It is not a behavior specific to AMD.

And is there a particular reason for a Zen core NOT to fall behind a construction module in absolute throughput? It's quite obviously a design goal, but the claim seems to rest on nothing more than the assumption that Zen must be better or it will be a failure.

Wider cores always have a worse ratio of theoretical performance to real-world performance, all other considerations held constant (compare Atom to Skylake, Jaguar to Steamroller, POWER8 to Skylake).

Module: Designed for absolute throughput.
Zen: High(er) IPC and efficiency.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
For those assuming performance and making charts there is one important thing to remember.

Like Nvidia with their 1060 performance claims, AMD is most likely lying/exaggerating Zen performance.

Please don't take AMD's numbers with blind faith. The "Overclocker's dream" and "2.8x the efficiency for the 480" were wrong. Likely Zen's +40% is also wrong.

I'm not knocking AMD specifically. But I would not take AMD, Nvidia or Intel at their word ever. Wait for them to deliver. Hope for the best, but prepare for the worst.

2.8x isn't really that far off from the improvement over Hawaii.

See here for my power consumption improvement estimate last night (I also did the GTX 1060):

http://www.overclock.net/t/1603915/polaris-rx-480-rx-470-rx-460-discussion-thread/800#post_25377614

TLDR: RX 480 improved 2.36x, GTX 1060 improved 2.0x

As the RX 480 is decidedly faster than the cut-down-to 2304SPs R9 290 would be, the actual performance improvement will be higher. In some cases, the RX 480 is 25% faster than the cut-down R9 290 would be, but it averages about 15%. That gives a range of 2.6x to 2.8x PPW improvement.

In other words, AMD didn't lie at all, they gave the higher range and used the term "up to" as a catch-all.

They don't do that with Zen - at all. When a company doesn't use "up to" in advance, when making this claim so carefully all the rest of the time, and making this new claim many more times - over the course of a year - then that 40% must be a pretty confident figure. The lead designer said that the 40% IPC was an architectural-only improvement and that the process and other factors would provide more.

So 40% should really be our floor for any estimate. My own estimates include a 5% 'process' improvement on top of the 40% architectural improvement. I then did SIMD and multi-threaded shaping, which took into account the known characteristics of Excavator and Zen.

 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
That's why I asked The Stilt that question. It would be interesting to know what the maximum is that AMD could achieve with its tech (assuming it wanted to go the route of a big chip) — in order to put things like a 7770 into context.

Playing Devil's Advocate:

7770 performance is in reach, but just. Zen APU will allegedly have up to ~704 GCN4 SPs (or, possibly, the presumably updated Vega SPs). Memory compression will provide a ~20 to 25% increase in effective bandwidth (so about 48GB/s) and the many architectural updates made since 7770 further reduce demand on memory bandwidth by some 10% net over the 7770. So it would be like giving the 7770 52GB/s bandwidth... or almost exactly 2/3 of the bandwidth.

Bandwidth isn't everything, though, but it is very important. RX 480 only loses 1~4% from losing 12.5% of its bandwidth, so the 7770 should be expected to follow a similar curve - losing 5~15% of its more idealized performance, but architectural improvements should make up for that and the Zen APU should be fairly close to the 7770 in performance.

BUT

Doing things another way, though, leads us to believe AMD wouldn't bother with 704 or more SPs on a Zen APU:

RX 480 has 36 CUs, and 256GB/s of hardware memory bandwidth.
ZN APU has 11 CUs and ~40GB/s of hardware memory bandwidth.

RX 480 has 7.1GB/s per CU.
ZN APU has 3.6GB/s per CU

I have a problem with this 11 CU business. It doesn't make any sense given the physical layout of GCN - basically the CUs are organized in pairs, so the number should always be even unless one CU is disabled, which is odd for the top APU SKU.

Adding one CU would take it to 768 SPs - Xbox One, and RX 460, territory, but the memory bandwidth would make this a waste, and the larger die wouldn't be a good thing either. Bringing this down one CU would take us to 640 SPs (10 CUs). This seems better. Fewer CUs means less power, smaller die, and higher achievable clocks - and more bandwidth per CU.

Bandwidth wise, things look a little better:

RX 480 has 7.1GB/s per CU.
ZN APU has 4.0GB/s per CU

That reduction in bandwidth would easily cost 20% of the performance potential... so why not just take off 20% of the CUs?

So, let's take off two more CUs, and stick with the 512 SPs and 8 CUs AMD has been using on their APUs.

RX 480 has 7.1GB/s per CU.
ZN APU has 5.0GB/s per CU

Now we're talking! Still a memory bandwidth issue, but a much smaller one.
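
The per-CU figures above are just total memory bandwidth divided by CU count. A minimal sketch of that arithmetic, taking the ~40GB/s APU figure as given (it is an estimate, not a confirmed spec); `bandwidth_per_cu` is a made-up helper name.

```python
def bandwidth_per_cu(total_gb_s, cus):
    """Hardware memory bandwidth available per compute unit, in GB/s."""
    return total_gb_s / cus


# (label, total bandwidth in GB/s, CU count); the APU bandwidth is an assumed ~40GB/s
configs = [
    ("RX 480", 256.0, 36),
    ("ZN APU, 11 CUs", 40.0, 11),
    ("ZN APU, 10 CUs", 40.0, 10),
    ("ZN APU, 8 CUs", 40.0, 8),
]

for name, total, cus in configs:
    print(f"{name}: {bandwidth_per_cu(total, cus):.1f} GB/s per CU")
```

Even at 8 CUs the APU would have roughly 70% of the RX 480's bandwidth per CU, before counting what the CPU cores take from the same memory.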

Run that little GPU at 1GHz to 1.2GHz and you have a sizeable improvement in performance over current APUs - and you don't threaten the RX 460's reason for existence. But you also can't reach HD 7770 or modern console levels of performance...

So, read my pecks: More than 512SPs only makes sense if the Zen APU will have an L4 cache. Even a 32MB reasonably fast L4 would permit HD 7770 levels of performance... if HBM is used, then we will likely see effectively dedicated VRAM on the top APUs - and console-like graphics performance.
 

coercitiv

Diamond Member
Jan 24, 2014
6,631
14,066
136
And is there a particular reason for a Zen core NOT to fall behind a construction module in absolute throughput? It's quite obviously a design goal, but the claim seems to rest on nothing more than the assumption that Zen must be better or it will be a failure.
High performance server oriented arch with lower throughput than previous arch at comparable transistor count. Ok.
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
2.8x isn't really that far off from the improvement over Hawaii.

Every single performance per watt chart on the internet shows nothing close to 2.8x.

See here for my power consumption improvement estimate last night (I also did the GTX 1060):

Nonsense. You can't add and subtract shaders and the like to 'fit' performance and power characteristics. Or boost or lower clock speeds.

Especially the nonsense comparison to cut GM104 when the 1060 is GP106 and a non-cut die.

http://www.hardware.fr/articles/951-9/consommation-efficacite-energetique.html

Worse than Fury, equivalent to Fury X.

TPU shows the same.

420W for a 290X is also astonishingly high.

Then there is the argument that Polaris is NOT a Hawaii replacement. More of a Pitcairn replacement (Hawaii has a ton of DP).

High performance server oriented arch with lower throughput than previous arch at comparable transistor count. Ok.

Bulldozer and its derivatives are a throughput driven design, engineered for die size efficiency.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Each GCN array can contain up to 16 GCUs. Therefore having an odd number of CUs is not an issue as long as there are < 16 CUs. But yeah, 11 GCUs doesn't sound right because it is an odd number which is rarely used. 8 GCUs would be absolutely fine, however I feel like AMD looks to increase the number for marketing purposes. After all they've had 512SPs since 2013.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,706
1,233
136
Each GCN array can contain up to 16 GCUs.
Each GCN Cluster can contain up to 4 CUs. GCN1-GCN3 can support 4 of these clusters. GCN4 can support 6 of these clusters as long as they are configured as follows:

Up to 1 Power/Physics CCU (Always On, unless GPU is turned off) // Up to 4 CUs
Up to 4 Graphic CCU (TMUs + RBE) // Up to 16 CUs
Up to 1 Audio CCU (TruAudio Next) // Up to 4 CUs
Supporting up to 24 CUs per CCA

CCA -{Clustered Compute Array}
CCU -{Clustered Compute Unit}

Raven 1X is thus;
1 CU for Power/Physics
2 CU for Audio
4x2 CU for Graphics

Power -> Graphics -> Sound

Polaris 10 = 1 CU[Power] + 2 CU[Graphics] + 2 CU[Graphics] + 2 CU[Pseudo-Physics/Graphics] + 2 CU[Pseudo-Audio/Graphics]
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Because the card hasn't been reviewed yet.

RX 470 is likely to be relatively less power efficient than the RX 480, since it uses harvested ASICs. Unless of course they clock it below the threshold of the process / the design (i.e. <= 1GHz range) :sneaky:

The 2.8x total figure AMD displayed for RX 480 is so far off that I wonder if they actually mixed up the numbers. For a P11 the figures might be plausible, if it is clocked low enough. AFAIK P11 is just basically a "cropped" P10 (or vice versa).
 