New Zen microarchitecture details

Page 97

mohit9206

Golden Member
Jul 2, 2013
So the rumor is that the Zen APU will feature Radeon graphics as powerful as the RX 460. But I imagine that will only be for the top-end $200 models; the cheaper ones around $100-120 will probably be much weaker. Is it a good plan to wait for a Zen APU if I just play at 1600x900, since I won't have to buy a discrete card? Even a $120 Zen APU should be faster than a GT 730 GDDR5, right?
 

The Stilt

Golden Member
Dec 5, 2015
There is not enough bandwidth to include anything even remotely as powerful as the RX 460. Raven will have < 47GB/s (the maximum officially supported) of shared bandwidth, while the RX 460 has 96GB/s of dedicated bandwidth. You could include such a GPU, but it would mostly be a waste of power.
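A rough sketch of where a "< 47GB/s" ceiling comes from: each 64-bit DDR channel moves 8 bytes per transfer, so dual-channel peak bandwidth is the transfer rate times 8 bytes times 2 channels. The DDR4-2933 speed is taken from a later post in this thread; treating it as Raven's official maximum is an assumption here.

```python
def peak_bandwidth_gbs(mt_per_s: float, channels: int = 2, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes).

    mt_per_s  -- transfer rate in MT/s (e.g. 2933 for DDR4-2933, assumed here)
    channels  -- number of 64-bit channels (2 for a typical desktop APU)
    bus_bytes -- bytes per transfer per channel (8 for a 64-bit channel)
    """
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9

raven = peak_bandwidth_gbs(2933)   # shared between the CPU cores and the iGPU
rx460 = 96.0                       # RX 460's dedicated GDDR5 bandwidth, from the post
print(f"Raven (DDR4-2933, dual channel): {raven:.1f} GB/s")
print(f"RX 460 / Raven bandwidth ratio:  {rx460 / raven:.2f}x")
```

Note this is a theoretical peak; real sustained bandwidth is lower, and the CPU cores take their share first.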
 

mohit9206

Golden Member
Jul 2, 2013

My GT 730 already has 40GB/s of bandwidth. At the very least I'm expecting HD 7770 performance from a $120 APU; otherwise the APU will be a flop.
Edit: Never mind, the Zen APU will not come out until late next year. Best to go with Intel Skylake/Kaby Lake plus a discrete GPU.
 

Dresdenboy

Golden Member
Jul 28, 2003
citavia.blog.de
That would be a pretty damaging way for them to measure it.

As for the 16B fetch window, that was just a random size chosen by me. It all depends upon how deep the pre-decode logic can look for non-dependent instructions and how accurately it can predict branches. Smaller fetches simply make the task more difficult, but larger fetches are only useful if the hardware is fast and accurate enough to find non-dependent instructions that actually need to be executed.
It would be strange, but not damaging, as an SMT thread has ~65% of the performance of a single thread on a single core, while a CMT thread performs at about 80% when sharing the module.
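Taking those per-thread figures at face value, the combined throughput of two threads sharing one core (SMT) or one module (CMT), relative to one thread running alone, works out as below. The 65% and 80% values are the post's rough estimates, not measurements.

```python
def aggregate_throughput(per_thread_fraction: float, threads: int = 2) -> float:
    """Combined throughput of `threads` co-scheduled threads, in units of one
    thread running alone, given each thread's fraction of solo performance."""
    return per_thread_fraction * threads

smt = aggregate_throughput(0.65)   # two SMT threads on one core
cmt = aggregate_throughput(0.80)   # two CMT threads on one module
print(f"SMT core:   {smt:.2f}x a lone thread")
print(f"CMT module: {cmt:.2f}x a lone thread")
```

So a lower per-thread figure under SMT is expected and doesn't by itself make the core weaker; what matters is how strong the single thread is to begin with.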

Here is some info about fetching:
U.S. Pat. No. 20150121050 said:
[0045]
The IC pipeline is a three stage pipeline that can fetch 32 bytes of instruction data per cycle. Each address in the PRQ, depending on the predicted start and end location within the 64 byte prediction window, needs either one or two flows down the IC pipeline to forward all of the data to the DE.
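A minimal sketch of the "one or two flows" rule in the quoted patent text. It assumes, which the excerpt does not state, that each IC pipeline flow returns one 32-byte-aligned half of the 64-byte prediction window; under that assumption the flow count is simply the number of distinct 32-byte halves touched between the predicted start and end offsets. The function name is hypothetical.

```python
FETCH_BYTES = 32    # bytes delivered per IC pipeline flow (from the patent text)
WINDOW_BYTES = 64   # size of one prediction window (from the patent text)

def ic_flows(start: int, end: int) -> int:
    """Flows needed to forward bytes [start, end] (inclusive offsets within one
    64-byte prediction window) to decode, assuming each flow covers one
    32-byte-aligned half of the window (an illustrative assumption)."""
    if not 0 <= start <= end < WINDOW_BYTES:
        raise ValueError("offsets must satisfy 0 <= start <= end < 64")
    # Index of the last 32-byte half touched minus the first, plus one.
    return end // FETCH_BYTES - start // FETCH_BYTES + 1

print(ic_flows(0, 31))   # stays in the first half: 1 flow
print(ic_flows(8, 40))   # crosses the 32-byte boundary: 2 flows
print(ic_flows(40, 63))  # second half only: 1 flow
```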

For their Last-Level Cache, Level-4 cache.

My speculation is:
64B L1i Fetch Window (AMD64 Variable-op)
2x4 AMD64 to Macro-op Decode (Same as Steamroller/Excavator)
16B or 8 Macro-op L0i Fetch Window (Internal Macro-ops, RISC-like)

3-cycle (simple load), 4-cycle (complex load); L1d
10-cycle; L2
52-cycle; L3 (L3 -> L2 => 64B vs Jaguar's 16B)
52 cycles + 50 ns; L4
52 cycles + 100 ns; Memory

Misprediction to L1 BTB; ~18 cycles
Misprediction to L0 BTB/BTQ; ~9 cycles
L1d has at least 4 cycles (see the GCC patch: +4 cycles delta for each load+ex op). Going to the FPU adds another 3 cycles.
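The speculated latency list mixes core-clock cycles with nanoseconds for L4 and memory. To put everything on one scale, here is a sketch that converts each entry to nanoseconds at a hypothetical core clock; 3.0 GHz is an arbitrary example, not a known Zen clock, and the cycle counts are the speculated ones, not confirmed figures.

```python
CLOCK_GHZ = 3.0  # hypothetical core clock, chosen only for illustration

# (level, cycles, extra_ns) taken from the speculated list above
LATENCIES = [
    ("L1d (simple load)", 3, 0.0),
    ("L1d (complex load)", 4, 0.0),
    ("L2", 10, 0.0),
    ("L3", 52, 0.0),
    ("L4", 52, 50.0),
    ("Memory", 52, 100.0),
]

def to_ns(cycles: int, extra_ns: float, clock_ghz: float = CLOCK_GHZ) -> float:
    """Convert `cycles` core cycles plus a fixed `extra_ns` into nanoseconds."""
    return cycles / clock_ghz + extra_ns

for level, cycles, extra in LATENCIES:
    ns = to_ns(cycles, extra)
    # Also express the total back in core cycles at the same clock.
    print(f"{level:20s} {ns:7.2f} ns  (~{ns * CLOCK_GHZ:.0f} cycles)")
```

The fixed nanosecond terms dominate at any realistic clock, which is why the L4 and DRAM entries are given as "52 cycles + X ns" rather than as cycle counts.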


There is not enough bandwidth to include anything even remotely as powerful as RX 460. Raven will have < 47GB/s (maximum officially supported) of shared bandwidth while RX 460 has 96GB/s of dedicated bandwidth. You could include such GPU, but it would be nothing but a waste of power mostly.

Some more of that small area L2 in the GPU might mitigate the bandwidth needs somewhat.
 

coffeemonster

Senior member
Apr 18, 2015
My GT 730 already has 40GB/s of bandwidth. At the very least I'm expecting HD 7770 performance from a $120 APU; otherwise the APU will be a flop.
Edit: Never mind, the Zen APU will not come out until late next year. Best to go with Intel Skylake/Kaby Lake plus a discrete GPU.
The GT 730 roughly ties with Kaveri APUs, which have only 34GB/s but twice the texture mapping units. A Zen APU would have to be worse than Kaveri not to beat a GT 730.
Edit: Oh, and you can get that performance from a $72 A8-7600.
 

looncraz

Senior member
Sep 12, 2011

What's more, the Zen APU will be using GCN4 memory compression, so effective bandwidth will be higher still. Good caching could conceivably push performance to console level (well, Xbox One level...), but it would more likely fall just shy of it.
 

laamanaator

Member
Jul 15, 2015
An APU without on-chip memory (GDDR5, HBM) will never reach Xbone levels of performance. The Xbone has almost 5 times more memory bandwidth than any APU will ever have with dual-channel DDR3/4 memory. Bandwidth-wise, it's just not plausible. Sure, you could put 4096 SPs on an APU to get an insane amount of theoretical performance, but you'd always be bandwidth-constrained without proper high-speed on-chip memory.
 

JDG1980

Golden Member
Jul 18, 2013
My GT 730 already has 40GB/s of bandwidth. At the very least I'm expecting HD 7770 performance from a $120 APU; otherwise the APU will be a flop.

HD 7770 (Cape Verde) has 10 GCN 1.0 CUs (640 shaders) running at 1000 MHz. The leaks so far indicate that Raven Ridge will have 11 CUs (704 shaders), presumably using the Polaris or Vega architecture. Even if the GPU clock speed were only 900 MHz, that should be about equivalent to Cape Verde in terms of raw compute power. Raven Ridge will only have about 65% as much memory bandwidth as the HD 7770, but the architectural changes between GCN 1.0 and Polaris should be able to make up for that deficit.
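The "about equivalent" claim can be checked with the usual GCN peak-rate formula: 64 shaders per CU, each retiring one fused multiply-add (2 FLOPs) per clock. The HD 7770's 72GB/s (128-bit GDDR5 at 4.5 Gbps) is an outside figure supplied here; the 47GB/s Raven ceiling comes from earlier in the thread.

```python
def gcn_gflops(compute_units: int, clock_mhz: float, shaders_per_cu: int = 64) -> float:
    """Peak single-precision GFLOPS for a GCN GPU: shaders x 2 FLOPs (FMA) x clock."""
    return compute_units * shaders_per_cu * 2 * clock_mhz / 1000

cape_verde = gcn_gflops(10, 1000)   # HD 7770: 640 shaders at 1000 MHz
raven = gcn_gflops(11, 900)         # rumored 11 CUs at a conservative 900 MHz
print(f"HD 7770: {cape_verde:.0f} GFLOPS, Raven: {raven:.1f} GFLOPS")
print(f"Raven / HD 7770 compute:   {raven / cape_verde:.0%}")
print(f"Raven / HD 7770 bandwidth: {47 / 72:.0%}")
```

Compute comes out essentially equal, while bandwidth lands at roughly two thirds, which is the deficit the post expects Polaris-era compression and caching to cover.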

Based on what we know so far, Cape Verde-level performance sounds like a reasonable expectation.
 

lefty2

Senior member
May 15, 2013
Quick question. The 40% IPC gain that AMD is claiming, is that for single threaded performance or multi-threaded?
 

jpiniero

Lifer
Oct 1, 2010

I don't think AMD has ever actually clarified it, but the assumption is that they are talking about single threaded, comparing an integer unit versus a Zen core without the HT.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
An APU without on-chip memory (GDDR5, HBM) will never reach Xbone levels of performance. The Xbone has almost 5 times more memory bandwidth than any APU will ever have with dual-channel DDR3/4 memory. Bandwidth-wise, it's just not plausible. Sure, you could put 4096 SPs on an APU to get an insane amount of theoretical performance, but you'd always be bandwidth-constrained without proper high-speed on-chip memory.

The Xbox One actually has only 68GB/s of memory bandwidth, which is augmented by a small 32MB eSRAM cache. It has zero memory compression, with 768 GCN 1.1 SPs running at 853MHz. That's 'just' 1.31TFLOPS of total shader power.
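The shader-power figure follows from shaders x 2 FLOPs per clock (one FMA) x clock. At 853MHz, 768 SPs give about 1.31TFLOPS; about 1.23TFLOPS corresponds to the 800MHz GPU clock the console was originally announced with, before it was raised to 853MHz.

```python
def shader_tflops(shaders: int, clock_mhz: float) -> float:
    """Peak FP32 TFLOPS: each shader retires one FMA (2 FLOPs) per clock."""
    return shaders * 2 * clock_mhz / 1e6

print(f"768 SPs @ 853 MHz: {shader_tflops(768, 853):.2f} TFLOPS")
print(f"768 SPs @ 800 MHz: {shader_tflops(768, 800):.2f} TFLOPS")
```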

AMD's new memory compression increases effective memory bandwidth about 18% over GCN 1.2, and about 35% over GCN 1.1, IIRC... at least when the data is compressible. All told, I'd hazard a guess of a net 20% bandwidth gain due to compression.

All that would suggest an effective bandwidth of ~48GB/s for Zen APUs, or about 70% of the Xbox One's DDR3 bandwidth. That will hamper performance, but not drastically: the 4GB Polaris 10 isn't immensely hampered by having only 87.5% of the 8GB Polaris 10's bandwidth, losing only 3~5%. Losing another 10% or so from its maximum would just bring GCN4 performance closer to GCN 1.1 performance.
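Both percentages in this post are simple ratios. The 224GB/s and 256GB/s bandwidths of the 4GB and 8GB Polaris 10 cards (RX 480) are outside figures supplied here; the ~48GB/s effective Zen APU bandwidth is the post's own estimate, not a measured value.

```python
XBOX_ONE_DDR3 = 68.0        # GB/s, from the post
ZEN_APU_EFFECTIVE = 48.0    # GB/s, the post's estimate including compression
RX480_4GB = 224.0           # GB/s, 4GB Polaris 10
RX480_8GB = 256.0           # GB/s, 8GB Polaris 10

print(f"Zen APU vs Xbox One DDR3: {ZEN_APU_EFFECTIVE / XBOX_ONE_DDR3:.1%}")
print(f"Polaris 10 4GB vs 8GB:    {RX480_4GB / RX480_8GB:.1%}")
```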

That leaves the Xbox One's eSRAM cache as the only notable advantage for that console over a theoretical Zen APU. If the top Zen APU has an L4 cache, that advantage could vanish entirely.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106

Polaris 10 4GB has 38.4MB/s of dedicated bandwidth per GFLOP, while the 8GB model has 43.9MB/s. If Raven had an RX 460-grade iGPU, it would have 18.7MB/s (at 2933MHz) of shared bandwidth per GFLOP. Even the GCN3 iGPUs (CZ/BR) are absolutely choked to death with their 41.7MB/s (at 2133MHz / 800MHz GEC) of shared bandwidth per GFLOP. And GCN3 already has very effective DCC.
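The per-GFLOP figures are memory bandwidth divided by peak shader throughput. The configurations below are outside facts assumed for the sketch: RX 480 (Polaris 10) with 2304 SPs at a 1266MHz boost clock and 224/256GB/s, and an 8-CU (512 SP) Carrizo iGPU at 800MHz on dual-channel DDR3-2133. With those inputs the post's 38.4, 43.9 and 41.7 figures come out.

```python
def gflops(shaders: int, clock_mhz: float) -> float:
    """Peak FP32 GFLOPS (one FMA = 2 FLOPs per shader per clock)."""
    return shaders * 2 * clock_mhz / 1000

def mb_per_s_per_gflop(bandwidth_gbs: float, peak_gflops: float) -> float:
    """Memory bandwidth available per GFLOP of peak shader throughput, in MB/s."""
    return bandwidth_gbs * 1000 / peak_gflops

polaris10 = gflops(2304, 1266)          # RX 480 at its boost clock
carrizo = gflops(512, 800)              # 8-CU GCN3 iGPU at 800 MHz
ddr3_2133_dual = 2133 * 8 * 2 / 1000    # GB/s, shared with the CPU cores

print(f"Polaris 10 4GB: {mb_per_s_per_gflop(224, polaris10):.1f} MB/s per GFLOP")
print(f"Polaris 10 8GB: {mb_per_s_per_gflop(256, polaris10):.1f} MB/s per GFLOP")
print(f"Carrizo iGPU:   {mb_per_s_per_gflop(ddr3_2133_dual, carrizo):.1f} MB/s per GFLOP")
```

The iGPU number also overstates what the GPU actually gets, since the CPU cores draw on the same bus.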
 

superstition

Platinum Member
Feb 2, 2008
2,219
221
101

If AMD were to use the maximum possible chip size for its 14nm process and four Zen CPU cores, what is the maximum iGPU it could squeeze in?

(Also, which library do you think would be the best choice for such a product? High density with low clocks like Excavator, the opposite, or something in-between?)
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Quick question. The 40% IPC gain that AMD is claiming, is that for single threaded performance or multi-threaded?

AMD initially only said IPC, but one other time (which I can't find right now) they specifically said 40% single-threaded IPC.

Multi-threaded performance should be slightly more than doubled (over Piledriver) with the same number of physical cores... at the same clock speed. Most likely, though, Zen will have a clock speed deficit.

Assuming a 700MHz clock deficit for Zen, Zen should have ~90% higher multi-threaded performance (assuming 20% SMT scaling per core), ~60% higher multi-threaded performance without SMT, and ~35% higher single-threaded performance.

All of that assumes a net 64% IPC improvement over Piledriver, or, specifically, over an FX-4350 at a fixed 4.2GHz.
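The single-threaded and SMT figures can be reproduced from the stated assumptions: IPC ratio times clock ratio for single-threaded, and the ~60% non-SMT multi-threaded gain times 1.2 for SMT. All inputs are the post's assumptions (64% net IPC gain, 4.2GHz vs 3.5GHz, 20% SMT scaling), not measured Zen data.

```python
PILEDRIVER_CLOCK = 4.2               # GHz, FX-4350 baseline
ZEN_CLOCK = PILEDRIVER_CLOCK - 0.7   # assumed 700 MHz deficit
IPC_GAIN = 1.64                      # assumed net IPC ratio, Zen over Piledriver
SMT_SCALING = 1.20                   # assumed per-core SMT gain

# Single-threaded performance scales with IPC times clock.
single_thread = IPC_GAIN * ZEN_CLOCK / PILEDRIVER_CLOCK
print(f"Single-threaded: {single_thread - 1:+.0%}")

# Start from the post's ~60% multi-threaded gain without SMT and apply SMT scaling.
mt_no_smt = 1.60
mt_smt = mt_no_smt * SMT_SCALING
print(f"Multi-threaded with SMT: {mt_smt - 1:+.0%}")
```

This gives about +37% and +92%, in line with the post's rounded ~35% and ~90%.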
 
Mar 10, 2006

The claim was 40% IPC improvement for Zen core over a single XV core.
 

el etro

Golden Member
Jul 21, 2013
1,581
14
81
Yes, which is about 64% over Piledriver :sneaky:
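The two figures are consistent only if Excavator itself has roughly 17% higher IPC than Piledriver. The chain of ratios, using only the 40% and 64% numbers quoted in this thread:

```python
ZEN_OVER_EXCAVATOR = 1.40    # AMD's stated IPC gain over an Excavator core
ZEN_OVER_PILEDRIVER = 1.64   # figure used in this thread

# Ratios chain multiplicatively, so divide to get the implied middle step.
implied = ZEN_OVER_PILEDRIVER / ZEN_OVER_EXCAVATOR
print(f"Implied Excavator-over-Piledriver IPC gain: {implied - 1:.0%}")
```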

I like comparing with Piledriver because it has L3 and 8-core configurations.

Looking at FX versus Core i7 benchmarks makes it look like it's impossible for AMD to achieve the IPC of any recent Intel core.
 

superstition

Platinum Member
Feb 2, 2008
2,219
221
101
Zen's architecture is very different.

The biggest concern appears to be the strength of the 14nm process, more than the design. If the chip is well designed but can't reach high enough clock speeds at a reasonable power consumption level (while keeping chip-to-chip variability low enough), AMD could be in for quite a bit of trouble, particularly in the server space, where performance-per-watt is generally weighted more heavily than on the desktop.

The mobile space also, of course, generally places a significant premium on that, although there is more wiggle room since budget models are a thing. In fact, when it comes to those, OEMs sometimes reduce performance-per-watt significantly by doing things like sapping performance with the choice of single-channel RAM.

I would watch for benchmark shenanigans aimed at the desktop space, too. If Zen doesn't do well in AVX 2 or whatever then expect the de facto community benchmark to lean heavily on it. That could include separate code for older Intel chips to mask the AVX 2 focus, for the clever benchmark maker. Give those just enough of a deficit to make it look like the benchmark is completely natural while killing Zen performance. I have a feeling AMD went for a more Intel-like design for Zen to minimize this factor. Of course, it can also be attributed to Intel's design being better. It makes plenty of sense for a company with greater resources to produce a better design. After the Pentium IV fiasco execs are naturally going to be more alert to maintaining competitiveness so people shouldn't expect such a big opportunity again for AMD.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
If Zen doesn't do well in AVX 2 or whatever then expect the de facto community benchmark to lean heavily on it.

So it will be "unfair" to use modern video codecs, renderers or mathematical workloads / benchmarks which implement AVX2 and newer, just because Zen doesn't have competitive AVX2 performance? It's time to stop having double standards and stop treating AMD like a disabled child.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Looking at FX versus Core i7 benchmarks makes it look like it's impossible for AMD to achieve the IPC of any recent Intel core.

Not exactly. My numbers include an assumed 700MHz penalty for Zen compared to Piledriver, so there's a 20% penalty built in. Even with that penalty, the quad-core Zen with SMT, in theory, fares reasonably well against the i7-4790K:

Code:
TEST               ZEN         i7 4790K
(HIGHER IS BETTER)
Handbrake LQ       583           582.8
Handbrake 4K*      18.7          23.2
WinRAR             56.8          51.3
x265 4K            1.43          1.75
CB R15 ST*         137           181
CB R15 MT*         636           894

(LOWER IS BETTER)
Agisoft Stg1*      6.97s         5.57s
Agisoft Stg2*      7.75s         8.54s
Dolphin*           7.11s         6.8s

* Likely under-representative of Zen's performance improvements due to heavy reliance on areas drastically changed in Zen compared to the construction cores.
Lots of asterisks, I know, but it's important to remember just how poorly Piledriver does in certain categories of performance, areas where Zen will not suffer. If you add 20% to these scores to account for my clock-speed penalty, assuming you can overclock to 4.2GHz, then you have a dead ringer for the performance of the i5-4690K... with some areas greatly outperforming expectations due to Piledriver having some particularly strong performance areas (such as WinRAR).
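"Adding 20%" works directly for throughput scores, but for the time-based results (lower is better) the equivalent is dividing by 1.2. A sketch that projects the table's Zen column to 4.2GHz and compares it with the i7-4790K column. It assumes performance scales perfectly with clock, which is optimistic, and it treats every row above the "(LOWER IS BETTER)" marker, including WinRAR, as higher-is-better; that reading of the table is an assumption.

```python
CLOCK_FACTOR = 4.2 / 3.5   # undo the assumed 700 MHz penalty (= 1.2)

# (test, zen_at_3_5ghz, i7_4790k, higher_is_better) from the table above
RESULTS = [
    ("Handbrake LQ", 583.0, 582.8, True),
    ("Handbrake 4K", 18.7, 23.2, True),
    ("WinRAR", 56.8, 51.3, True),
    ("x265 4K", 1.43, 1.75, True),
    ("CB R15 ST", 137.0, 181.0, True),
    ("CB R15 MT", 636.0, 894.0, True),
    ("Agisoft Stg1", 6.97, 5.57, False),
    ("Agisoft Stg2", 7.75, 8.54, False),
    ("Dolphin", 7.11, 6.8, False),
]

def at_42ghz(score: float, higher_is_better: bool) -> float:
    """Project a 3.5 GHz result to 4.2 GHz, assuming perfect clock scaling:
    scores go up by the clock factor, times go down by it."""
    return score * CLOCK_FACTOR if higher_is_better else score / CLOCK_FACTOR

for name, zen, intel, hib in RESULTS:
    projected = at_42ghz(zen, hib)
    # Relative performance; above 100% means the projected Zen is faster.
    relative = projected / intel if hib else intel / projected
    print(f"{name:14s} Zen@4.2GHz {projected:8.2f}   vs i7-4790K: {relative:.0%}")
```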

I have a much more thorough examination of Zen's potential performance, based on my own Athlon X4 845 testing, here:

http://excavator.looncraz.net/

This is the most relevant chart, IMHO:

[Chart: performance relative to Sandy Bridge, at 3GHz]
 
Aug 11, 2008
HD 7770 (Cape Verde) has 10 GCN 1.0 CUs (640 shaders) running at 1000 MHz. The leaks so far indicate that Raven Ridge will have 11 CUs (704 shaders), presumably using the Polaris or Vega architecture. Even if the GPU clock speed was only 900 MHz, that should be about equivalent to Cape Verde in terms of raw compute power. Raven Ridge will only have about 65% as much memory bandwidth as the HD 7770, but the architecture changes between GCN 1.0 and Polaris should be able to make up for that deficit.

Based on what we know so far, Cape Verde-level performance sounds like a reasonable expectation.

The problem is, the HD 7770 is already marginal for 1080p gaming except in older or less demanding games. And by the time Zen APUs come out, there will be an entire new generation of 14/16nm dGPUs in the $100 range that offer much better performance than the ancient HD 7770. And when making bandwidth comparisons with a dGPU, one must also consider that the already limited bandwidth (and thermal budget) must be shared with the CPU.
 