New Zen microarchitecture details

Pilum · Jul 27, 2016

The discussion about the number of ALUs misses an important point: ALUs don't matter if you can't keep them fed. For that you need a sufficient number of Load/Store Units. Zen will retain the two LSUs from the Bulldozer family. This will probably pose a pretty hard limit on the performance of various integer algorithms. Just as reference: Intel moved to three LSUs with Haswell in 2013. While RISC systems (e.g. POWER8, Apple A9) often have only half as many LSUs as their sustained instruction throughput, on x86 you want a higher ratio due to the fused load-execute ops.

I haven't found traces of current x64 software, but for x86 traces from the 90s, load/store ops often amounted for more than 50% of the dynamic instruction count. For SPEC2000 int-gzip, it was nearly 77%; for int-gcc 82%. These are the most extreme cases, but then the lowest percentage is about 40%.

If current x64 code still has similar ratios, we won't see a general 40% integer IPC uplift vs. EXV, as the LSUs will pose a bottleneck for some types of integer code. For FP this is a different story, as FP stores seem to use only one FP pipeline without using a LSU, so Zen will act as having three LSUs when running FP code.

While the design doesn't seem competitive for a 2017 high-performance x86 product, the design choice makes sense if the execution backend was primarily intended for K12, as it would have the 1:2 ratio of LSUs to issue width which seems normal for RISC architectures.

BTW, any news on K12? I haven't seen any news on it for a long time. Is it still mentioned by AMD, or does it appear to be canceled?

looncraz · Jul 27, 2016

Pilum said:
The discussion about the number of ALUs misses an important point: ALUs don't matter if you can't keep them fed.

That hasn't been lost on many of us here, I don't believe :biggrin:

Pilum said:
For that you need a sufficient number of Load/Store Units. Zen will retain the two LSUs from the Bulldozer family. This will probably pose a pretty hard limit on the performance of various integer algorithms.

I think you mean AGU (address generation unit). Zen contains a number of improvements that reduce the stress on the AGUs. And you can usually generate addresses much faster than you can read the data you need from those addresses, which is why you don't see massive increases from Intel adding more AGUs and even dedicate data store units.

Pilum said:
While RISC systems (e.g. POWER8, Apple A9) often have only half as many LSUs as their sustained instruction throughput, on x86 you want a higher ratio due to the fused load-execute ops.

I haven't found traces of current x64 software, but for x86 traces from the 90s, load/store ops often amounted for more than 50% of the dynamic instruction count. For SPEC2000 int-gzip, it was nearly 77%; for int-gcc 82%. These are the most extreme cases, but then the lowest percentage is about 40%.

For many reasons, the actual hardware utilization of the AGUs does not match the memory op calls in the code. Frequently that code will reference the same address over and over, so the CPU contains buffers/caches which are checked prior to scheduling an AGU calculation.

Further, some of those memory ops can be done - as they are in Zen - by ALUs. LEA (Load effective address) is one such optimized address calculation.

Zen has some rather meaningful improvements in this area that can be quite enlightening.

You can see one of the discussions here:

http://forums.anandtech.com/showthread.php?p=37905139#post37905139

superstition · Jul 27, 2016

NostaSeronx said:
22/20/14 nanometer Bulldozer improvements lost while Zen was implemented.

Clustered Multithreading 2.0

I've read conflicting opinions about the value of CMT. This article, for instance, states that it doesn't even exist or "doesn't work", while another one states that there is a third type which I can't recall and treats CMT as if it is potentially just as advantageous as SMT, depending upon the workload targeted. That latter article seems to be significantly more credible, in terms of its tone but that could be misleading (tone policing fallacy). However, given the shrillness of that person's argumentation it seems suspect.

I found this, which is interesting. It says the third type is FMT (fine-grained multithreading).

... a more powerful mechanism than either coarse or fine grained multithreading to exploit TLP. Called Simultaneous Multithreading (SMT), it allows the instructions from two or more threads to be issued to execution units each cycle. The advantage of SMT is that it permits TLP to be exploited all the way down to the most fundamental level of hardware operation - instruction issue slots in a given clock period. This allows instructions from alternate threads to take advantage of individual instruction execution opportunities presented by the normal ILP inefficiencies of single thread program execution. SMT can be thought of as equivalent to the airline practice of using standby passengers to fill seats that would have otherwise flown empty. Consider a single thread executing on a superscalar processor. Conventional superscalar processors such fall well short of utilizing all the available instruction issue slots. This is caused by execution inefficiencies including data dependency stalls, cycle by cycle shortfall between thread ILP and the processor resources given limited re-ordering capability, and memory accesses that miss in cache. The big advantage of SMT over other approaches is its inherent flexibility in providing good performance over a wide spectrum of workloads. Programs that have a lot of extractable ILP can get nearly all the benefit of the wide issue capability of the processor. And programs with poor ILP can share with other threads instruction issue slots and execution resources that otherwise would have gone unused.

So, is this a case of terminology changing? "Coarse-grained" versus "Cluster-based"? Is AMD's CMT something different?

Is it possible to design a CPU that takes advantage of both CMT and SMT or do they step on each others' toes too much? I've read that Bulldozer does have a bit of SMT happening. This suggests a marriage in "CSMT".

Hruska said:
In theory, AMD’s design could have given it an advantage, since each core contains a full set of execution units as opposed to SMT, where those resources are shared, but in practice Bulldozer’s low efficiency crippled its scaling.

This is what I am still unclear about. If CMT theoretically works why doesn't it work in practice? Is it just because AMD's design was flawed? If so, why go to SMT now? I can't find the article at the moment but it listed pros and cons for both CMT and SMT — making the case that SMT isn't necessarily the superior option for all cases. The Stilt said AMD doesn't do large caches well and people have discussed the slowness of Bulldozer's L3 cache in particular. So, has anyone come up with a model/schematic that would show a CMT design that is quite efficient and effective?

NostaSeronx · Jul 27, 2016

Cluster-based Multithreading, Clustered Multithreading, Multiclustered Multithreading, and Chip Multithreading are all based on Simultaneous Multithreading.

The difference is that Simultaneous Multithreading at nominal definition has threads sharing critical resources. Critical resources are the Instruction Bus(Retirement, Schedular, etc), the Data Bus(Load/Store, L1d[First-Level Cache]), the Datapaths(ALUs, AGUs, PRFs), and the Control Unit(Branches, Context, etc). The above forks of simultaneous multithreading replicate these critical resources. This replication of critical resources increases performance that it can operate as if it were using Chip Multiprocessing.

CMT in the context of AMD is;
Step 1. Make Simultaneous Multithreading architecture.
Step 2. Replicate Integer/Memory pipelines and separate thread context across those pipelines.
Step 3. Make it work.
Step 4. Done.

It is now a simultaneous multithreading architecture with the nominal/average performance of a standard dual-core processor. It is clustered multithreading now.

One issue into implementation is that Bulldozer isn't a fully faithful interpretation of CMT. One of the areas where it really ruined the design is the Load-Store pipeline. The design on paper had what is currently the L1d being shared between the cores. The write-coalescing would occur at the L0d -> L1 level with lower latency and faster bandwidth. This would allow the L1d to not be at a premium. Instead, it would follow the L1i cache in being placed right next to the L2 Interface.

So, if AMD came out with an optimized Bulldozer architecture that was faithful to the original; [Using Excavator as basis and following the 8x - 1÷8 rule of caches]
1 MB L2$ => 1x128 KB L1i$ & 1x128 KB L1d$ => 2x8 KB L0i$ & 2x8 KB L0d$ per module [This cache array implies a nice mobile/low power processor sort of like Banias/Dothan][[Possible wonder why AMD made Stoney Ridge so late in the game.]]

There is also evolutions in Simultaneous Multithreading that can be used in AMD's CMT to further increase performance.
Scalable Simultaneous Multithreading => Which would provide Front-end orientated performance increases. [Follows the trend with duplicated decoders in Steamroller]
Clustered Simultaneous Multithreading => Which would provide FPU orientated performance increases.

superstition · Jul 27, 2016

Thanks for the explanation. Do you think AMD made the wrong choice by not using those variations of CMT for Zen and going exclusively to SMT? Is the choice related to reducing the chance of compilers to favor Intel designs? Also, did AMD change the Load-Store pipeline in order to maximize clocks at the expense of efficiency?

NostaSeronx · Jul 27, 2016

superstition said:
Thanks for the explanation. Do you think AMD made the wrong choice by not using those variations of CMT for Zen and going exclusively to SMT?

Yes, I do think AMD has made a bad choice. Zen is literally Bulldozer 32/28 Redux on 14nm. It has none of the proposed modifications in 22/20/14 Bulldozer.

*Zen FPU vs 22/20/14 Hypothetical Bulldozer-derived architecture;
-> Zen => 2 Muls, 2 Adds, or 2 FMAs // Only FP128
-> Bulldozer => 4 Muls, 4 Adds, or 4 FMAs or any mix like 3 adds, 1 mul or 3 muls, 1 add. // FP128 mode ,, 2 Muls, 2 Adds, or 2 FMAs or 1 add/mul and 1 FMA // FP256 mode [Marketing would probably call it FlexFPU 2.0; link]

*Zen Memory vs 22/20/14 Hypothetical Bulldozer-derived architecture;
-> Zen => 2 threads / 2 AGUs / LSU for 2 threads [2x16B_L + 1x16B_S]
-> Bulldozer => 2 threads / 4 AGUs / LSU0 for 1 thread [2x32B_L + 1x32B_S] + LSU1 for 1 thread [2x32B_L + 1x32B_S]

*Zen core vs 22/20/14 Hypothetical Bulldozer-derived architecture core;
-> Zen => 4 ALUs [+ 2 AGUs] with 6 units attached to PRF.
-> Bulldozer => 4 ALUs (2 ALUs intermixed with AGUs) with 4 units attached to PRF
-> The ramifications of above imply the Bulldozer design would clock higher or use lower energy.

superstition said:
Is the choice related to reducing the chance of compilers to favor Intel designs?

The only issue with compilers is with Instruction Set Enhancements. AMD will still have issues if the compiler gives them 128-bit SSE2 2-operand while Intel gets 256-bit/512-bit AVX 3-operand.

superstition said:
Also, did AMD change the Load-Store pipeline in order to maximize clocks at the expense of efficiency?

I think the L0d->WCC->L1d was mostly lost in design proposed to design acceptance. It shows with the WCC write-through policy not only going to the L2 cache, but the L3 and memory as well. This is primarily caused by the WCC location being in the L2 interface. Which means it has access to write into the System Request Interface. The WCC was meant to only write-through to a single cache that is inclusive to cores data caches within the module. This meant the next policy onwards would of course be write-back. Hindsight? Foresight? Nope, just negligence.

KTE · Jul 27, 2016

Abwx said:
Infos are here but what is lacking is either the will or the capability to make an accurate estimation, as said ad nauseam Zen has 33% more ALUs than Sandy Bridge or Ivy Bridge and 100% more than EXV.

Click to expand...

Schematicaly all ops are either computed by the ALUs, or if it s FP they are completed by the ALUs, so throughput is dependent of the ALU count, that s why Piledriver could have lower IPC than a Phenom core in some instances that rely on brute force availability.

Which translates into how much performance increase exactly? (once again )

superstition · Jul 27, 2016

NostaSeronx said:
...

It's hard to imagine that bright engineers would make such serious errors.

I'm also puzzled when someone with the reputation of Jim Keller would allegedly make bad decisions for such a crucial product. Does he have anything in his track record that suggests a capacity for that? The only things I can think of:

A) A desire to get positive press via "AMD dumped Bulldozer/CMT, hooray" sentiment

B) The Zen design team being more familiar with SMT, so their choices are due to a limited vision

C) We're missing information that will show that Zen is quite good.

D) (Paranoia alert) Intel people work at AMD and are sabotaging its products.

It's definitely interesting to see someone say something positive about CMT. All the media coverage I've seen about Zen that mentions the decision to go to SMT exclusively has been favorable.

riggnix · Jul 27, 2016

NostaSeronx said:
-> The ramifications of above imply the Bulldozer design would clock higher or use lower energy.

Maybe thats the whole point. Could it be related to the LP process? Maybe higher clocks aren't possible because of it, so it wouldn't help anyway?

Erenhardt · Jul 27, 2016

Zen+
*mindblown

looncraz · Jul 27, 2016

NostaSeronx said:
*Zen FPU vs 22/20/14 Hypothetical Bulldozer-derived architecture;
-> Zen => 2 Muls, 2 Adds, or 2 FMAs // Only FP128

Zen's FPU can do 3 concurrent adds (SSEi, MMX), using FPU0, FPU!, & FPU3 pipelines. It can do 2 SSE adds. And, maximally, 4 adds when performing MMX/SSEi and SSE or (presumably) x87 at the same time.

It can also do 2 divisions, shuffles, comparisons, etc... vector performance should be notably better relative to other comparisons. x87 should improve as well, but that depends on AMD's microcode, decode, and scheduler setup and optimizations.

It is definitely optimized for SMT, geared towards reducing the number holes in execution with two threads simultaneously executing on the FPU. It should do even better than Intel's port-based solution here.

NostaSeronx said:
-> Zen => 2 threads / 2 AGUs / LSU for 2 threads [2x16B_L + 1x16B_S]
-> Bulldozer => 2 threads / 4 AGUs / LSU0 for 1 thread [2x32B_L + 1x32B_S] + LSU1 for 1 thread [2x32B_L + 1x32B_S]

Each thread would still only have 2AGUs - and they'd be sometimes doing ALU ops - and they'd still be contending with another core for cache access.

Zen compares modestly favorably here - particularly as the second thread is only expected to deliver 20~30% more total performance than just using the first thread.

NostaSeronx said:
-> Zen => 4 ALUs [+ 2 AGUs] with 6 units attached to PRF.
-> Bulldozer => 4 ALUs (2 ALUs intermixed with AGUs) with 4 units attached to PRF
-> The ramifications of above imply the Bulldozer design would clock higher or use lower energy.

Well, and that single threaded IPC should be notably higher on Zen...

Clock speed ramifications would depend on the PRF design - 14nm LPP should provide an assistance there.

Zen's focus is on single threaded performance, Bulldozer is on module throughput. When a Bulldozer module can only just get the throughput of a single Sandy Bridge SMT core, but loses horribly in single threaded work-loads, you have a recipe for market-place disaster.

Continuing with that design philosophy would have only put AMD deeper into its deficit against Intel. Zen refocuses to put priority on the single thread running on a single core - while making a second thread running on that same core a first class citizen, sharing ALL available execution resources.

looncraz · Jul 27, 2016

riggnix said:
Maybe thats the whole point. Could it be related to the LP process? Maybe higher clocks aren't possible because of it, so it wouldn't help anyway?

That is likely part of the reasoning - though the initial design predates 14nm LPP.

AMD knows that getting 5Ghz shipping clocks will never happen with their CMT designs... the cache is their biggest problem - if they had Intel-like caches, their CMT designs could really shine.

So, with Zen, what Jim Keller did was look at where AMD was strong, and where they were weak... and then sought to create a design that exploited AMD's strengths.

In a construction-core module you have four ALUs, but only two can be used on one thread - if all four could be used on one thread you'd have better performance for that one thread. Allowing the other thread to also execute on those four ALUs makes total sense - you don't have to completely recreate everything, AMD already had all the IP... you could get the design done in just a couple of years that way.

Reuse reuse reuse!

(Some of this is assumption)

So Zen uses the original Bulldozer's front-end, slightly modified. It uses the threading logic, decode logic, just about everything really. Then the front-end was updated with the newer prediction logic and an instruction cache for loop optimization and reduced mispredict penalties (Intel does something similar to great effect). The three pathways to the schedulers were repurposed - one for integer, one for memory, and one for the floating point.

All of the ALU's from the module were moved to the single integer scheduler and the execution resources were more evenly distributed according to the needs of two threads, so DIV and MUL were split between two different pipelines, and only one DIV and one MUL unit included - no need to keep two of each... that just wastes power. He kept both branch units - they're even still spaced far apart.

The AGUs were made more pure, but they kept a few abilities that enables their direct use for indirect branching, 'leave', shifts, and for streaming vector math, and other purposes. Using two fully-capable, identical, AGUs made the memory scheduling easier. Adding a third may have been beneficial, but it seems there were concerns about power usage/die size/utilization and who knows what else... or perhaps it was found that the third one was not needed due to other optimizations (improved page walks, caches, etc.) so they excluded the underutilized AGU (or perhaps made those improvements to make up for it...).

The FPU was given the same type of quick special treatment, then they started working on all the nitty gritty of making it all work together.

And that is how the Zen was made.

(or so I believe)

Dresdenboy · Jul 27, 2016

Enigmoid said:
And it does. AMD is using the 2.8x in terms of performance/TDP, not perf/W. Which it looks like will align if the 470 uses around 110W of power.

So forget process and watts for RAM. Just use TDP.

Latest 470 specs leak says 120W TDP now (plus clocks, etc.):
http://videocardz.com/62672/amd-radeon-rx-470-and-radeon-rx-460-official-specs-and-performance

frozentundra123456 · Jul 27, 2016

Dresdenboy said:
Latest 470 specs leak says 120W TDP now (plus clocks, etc.):
http://videocardz.com/62672/amd-radeon-rx-470-and-radeon-rx-460-official-specs-and-performance

Off topic in this forum, but that is very strange spacing for the 3 models. 470 is very close to the 480 in SP, while there is a huge gap down to the 460. I was hoping for something in the range of 1280 SP with a sub hundred watt TDP, or maybe even able to run from the PCIe slot alone.

Looking at compute performance and SP count vs TDP, it doesnt appear markedly more efficient than the 480. Maybe 5 to slightly over 10%. Maybe in the real world it will fare better.

looncraz · Jul 28, 2016

frozentundra123456 said:
Off topic in this forum, but that is very strange spacing for the 3 models. 470 is very close to the 480 in SP, while there is a huge gap down to the 460. I was hoping for something in the range of 1280 SP with a sub hundred watt TDP, or maybe even able to run from the PCIe slot alone.

Looking at compute performance and SP count vs TDP, it doesnt appear markedly more efficient than the 480. Maybe 5 to slightly over 10%. Maybe in the real world it will fare better.

It is strange to me as well. Less than half the SPs of the RX 470... However, the die holds exactly half (1024), which would make sense.

The 470 being so close the 480 is a surprise as well. If I was in charge RX 480 would be 2816 SPs like Hawaii, with only 1Ghz clock speeds (despite its energy efficiency). It would perform better, that's for sure.

The RX 470 would be 2048SPs.

And the RX 460 would be 1280, again with lower clocks.

AMD should have learned by now to keep their clocks in the peak efficiency range... which is made easier by using a larger-than-needed GPU.

The ROPs are all fine, though.

dark zero · Jul 28, 2016

frozentundra123456 said:
Off topic in this forum, but that is very strange spacing for the 3 models. 470 is very close to the 480 in SP, while there is a huge gap down to the 460. I was hoping for something in the range of 1280 SP with a sub hundred watt TDP, or maybe even able to run from the PCIe slot alone.

Looking at compute performance and SP count vs TDP, it doesnt appear markedly more efficient than the 480. Maybe 5 to slightly over 10%. Maybe in the real world it will fare better.

OFF Topis again: Maybe because the 460 is a cut down chip.... the full die is 1280 and the one who is about to sell has less than that.

ON Topic: Ok, Zen is supposed to hit this year, but, I wonder when Zen is about to hit the budger market?
And I read that there is Zen Lite, what is supposed to be?

superstition · Jul 28, 2016

dark zero said:
I wonder when Zen is about to hit the budger market?
And I read that there is Zen Lite, what is supposed to be?

VR World said:
New console will bring not just new levels of performance thanks to AMD’s switch to 14nm FinFET process, enabling the company to deliver a “Zen Lite” CPU and fully fledged Polaris 10 GPU inside a single package a.k.a. APU.

Hardware specifications as per our exclusive story are as follows:

AMD custom-built silicon
GlobalFoundries 14nm FinFET
8 ‘ZEN Lite’ Cores
2304 Polaris 10 GPU Cores
Enhanced Memory controller for APU
512 MB System RAM
8GB GDDR5 in double density configuration (4 instead of 8 chips)

link

Zen Lite is a pretty awful mixed metaphor.

The Stilt · Jul 28, 2016

dark zero said:
ON Topic: Ok, Zen is supposed to hit this year, but, I wonder when Zen is about to hit the budger market?

I'd say right at the launch, unless AMD manages to increase the clocks on Zeppelin significantly (by 15-20%) within ~ a month*. The yields most likely are far from great, so there should be a good supply of harvested dies with two cores per CCX and 4MB of L3 fully functional.

Therefore it would be wise to release harvested SKUs also at the same time, just to gain momentum and to lock people on AM4 / DDR4 infra. I'm aware that AMD has previously stated that initially there wouldn't be 4C/8T models available, but I expect that they made the statement at a time when the behavior of the manufacturing process wasn't known in the full extent yet.

*(if expecting any availability in 2016 and considering the lead time of ~ 3 months from PR status to shipping).

Unless the clock frequencies on all SKUs improve significantly from the current (alleged) figures then I would expect:

- 4C/8T top shelf SKU < 199$
- 8C/16T top shelf SKU < 359$

IMO AMD cannot ask higher than that for any Zeppelin consumer SKU. i7-5820K sells for > 369$ and it should be significantly faster in ST workloads, tie in legacy MT workloads and annihilate in those few rare ST/MT 256-bit workloads. Also it will overclock better than Zeppelin does.

/Speculation

looncraz · Jul 28, 2016

superstition said:
link

Zen Lite is a pretty awful mixed metaphor.

I suspect Zen lite is just Zen without the L3 (or a minimally sized L3 if the intra-module communication requires it), and fewer SoC-loike resources (PCI-e lanes and the like).

The use of fewer memory chips is an issue - it means less bandwidth. Really makes me think that 512MB "System RAM" might be HBM. A singe stack of HBM2 would provide 256GB/s of bandwidth... which looks oddly familiar to another product we know which has 2304SPs...

superstition · Jul 28, 2016

The Stilt said:
Unless the clock frequencies on all SKUs improve significantly from the current (alleged) figures then I would expect:

- 4C/8T top shelf SKU < 199$
- 8C/16T top shelf SKU < 359$

Maybe Zen Lite will be 8C with 8 threads (à la i5).

KTE · Jul 28, 2016

looncraz said:
That is likely part of the reasoning - though the initial design predates 14nm LPP.

So these are two completely separate departments.
A design aims for certain process level characteristics... Many years later, when that design is ready, it then needs to be re-evaluated for that process in time and realistic targets set. It's a management decision whether to go ahead with what they can currently achieve or delay the design for more tweaking on a better process. Even an excellent uarch can fail with a poor process. But the public only sees them as one and the same.

AMD knows that getting 5Ghz shipping clocks will never happen with their CMT designs... the cache is their biggest problem - if they had Intel-like caches, their CMT designs could really shine.

We don't really know if caches are the only major bottleneck here, but I would wager there is A LOT more holding these CMT designs back. Poor branch misprediction rates being one.

In a construction-core module you have four ALUs, but only two can be used on one thread - if all four could be used on one thread you'd have better performance for that one thread.

And significantly more cache contention/thrashing and power draw

Sent from HTC 10
(Opinions are own)

superstition · Jul 28, 2016

looncraz said:
and an instruction cache for loop optimization and reduced mispredict penalties (Intel does something similar to great effect)

I remember reading an analysis that said one reason why Sandy beat Bulldozer so much is because Bulldozer lacked that.

Johan De Gelas said:
Most people pointed to high latency caches as a reason for subpar Bulldozer performance, but the real explanation of why Bulldozer's performance was underwhelming is a lot more complex

KTE said:
We don't really know if caches are the only major bottleneck here

Johan De Gelas said:
First of all, in most applications, an OOO processor can easily hide the 4-cycle latency of an L1 cache. Intel introduced a 4-cycle latency cache three years ago with their Nehalem architecture, and Intel's engineers claim that simulations show that a 3-cycle L1 would only boost performance by 2-3% (at the same clock), which is peanuts compared to the performance boost that is the result of the higher clock speed headroom.

Secondly, a dedicated 4-way 16KB cache, although relatively small, is hardly worse than Intel's 8-way 32KB data cache that is shared by two threads. The cache is also predicted lowering the power to search, so the Bulldozer data cache organisation does have its advantages.

Considering that SAP and Libquantum tell us that Bulldozer's prefetching works quite well, the 20-cycle L2 cache latency might not be a showstopper after all in server and HPC applications.

We do agree that it is a serious problem for desktop applications as most of our profiling shows that games and other consumer applications are much more sensitive to L2 cache latency. Lowly threaded desktop applications run best in a large, low latency L2 cache. But for server applications, we found worse problems than the L2 cache.

KTE said:
I would wager there is A LOT more holding these CMT designs back. Poor branch misprediction rates being one.

Johan De Gelas said:
The Real Shortcomings: Branch Misprediction Penalty and Instruction Cache Hit Rate

Bulldozer is a deeply pipelined CPU, just like Sandy Bridge, but the latter has a µop cache that can cut the fetching and decoding cycles out of the branch misprediction penalty. The lower than expected performance in SAP and SQL Server, plus the fact that the worst performing subbenches in SPEC CPU2006 int are the ones with hard to predict branches, all points to there being a serious problem with branch misprediction.

Our Code Analyst profiling shows that AMD engineers did a good job on the branch prediction unit: the BPU definitely predicts better than the previous AMD designs. The problem is that Bulldozer cannot hide its long misprediction penalty, which Intel does manage with Sandy Bridge.

That also explains why AMD states that branch prediction improvements in "Piledriver" ("Trinity") are only modest (1% performance improvements). It will be interesting to see if AMD will adopt a µop cache in the near future, as it would lower the branch prediction penalty, save power, and lower the pressure on the decoding part. It looks like a perfect match for this architecture.

So, this is probably the analysis I recall reading (my bold added).

Johan De Gelas said:
Another significant problem is that the L1 instruction cache does not seem to cope well with 2-threads. We have measured significantly higher miss rates once we run two threads on the 2-way 64KB L1 instruction cache. It looks like the associativity of that cache is simply too low. There is a reason why Intel has an 8-way associative cache to run two threads.

Desktop Performance Was Not the Priority

No matter how rough the current implementation of Bulldozer is, if you look a bit deeper, this is not the architecture that is made for high-IPC, branch intensive, lightly-threaded applications. Higher clock speeds and Turbo Core should have made Zambezi a decent chip for enthusiasts. The CPU was supposed to offer 20 to 30% higher clock speeds at roughly the same power consumption, but in the end it could only offer a 10% boost at slightly higher power consumption.

Server Workloads: There Is Hope

If there is one thing this article should have made clear, it's that server applications have completely different demands than SPEC CPU or workstation software. They are much more limited by MLP, come with lower IPC, and are more scalable. They also come with a much larger memory footprint and punish small, low latency caches with high miss rates. Therefore a higher latency but larger L2 cache assisted by good prefetchers can perform adequately.

We strongly believe the concepts behind Bulldozer are sound ones for the professional IT world. The trade-offs are well made for these workloads, but there seem to be four show stoppers. So far we found out that the instruction cache, the branch misprediction penalty, and the lack of clock speed are the main reasons why Bulldozer underperforms in the server world.

The lack of clock speed seems to be addressed in Piledriver with the use of hard edge flops and the resonant clock edge, which is especially useful for clock speeds beyond 3GHz.

But what about the fourth show stopper? That is probably one of the most interesting ones because it seems to show up (in a lesser degree) in Sandy Bridge too. However, we're not quite ready with our final investigations into this area, so you'll have to wait a bit longer. To be continued....

Interesting statement in bold. Maybe AMD just has lacked the resources needed to make a competitive CMT-based design.

leoneazzurro · Jul 28, 2016

looncraz said:
The 470 being so close the 480 is a surprise as well. If I was in charge RX 480 would be 2816 SPs like Hawaii, with only 1Ghz clock speeds (despite its energy efficiency). It would perform better, that's for sure.

The RX 470 would be 2048SPs.

And the RX 460 would be 1280, again with lower clocks.

AMD should have learned by now to keep their clocks in the peak efficiency range... which is made easier by using a larger-than-needed GPU.

The ROPs are all fine, though.

It is entirely OT here, but anyway.. RX470 is probably set at a lower voltage, also the boost clock is not so different to RX480's one (and when gaming, this will matter more). What AMD needs is how to fed efficiently its engine - theoretical peak rates on Ellesmere hint for a lot of untapped potential - and how to set up efficiently the memory bandwidth usage/fill rate: it is amusing to see how GTX1060 fares well enough at high resolution despite the bandwidth disadvantage.

looncraz · Jul 28, 2016

KTE said:
So these are two completely separate departments.

Yes, riggnix was responding to NostaSeronx about clock-speed and power usage ramifications of Zen vs Bulldozer designs favoring Bulldozer - and riggnix stated "Maybe that's the whole point."

And it probably was - AMD didn't expect to be able to clock higher, so the design-enabled higher clocks of the Construction cores are going to waste.

With AMD aware that every move they make to a new process with the construction cores has reduced clockspeed they knew upcoming processes would likely follow the same pattern, so they'd need to design for IPC and forego frequency.

KTE said:
We don't really know if caches are the only major bottleneck here, but I would wager there is A LOT more holding these CMT designs back. Poor branch misprediction rates being one.

The construction cores actually have decent branch predictors, it's just the penalty that causes problems - and the penalty is largely derived from the slow caches.

KTE said:
And significantly more cache contention/thrashing and power draw

Not so much, really. You already had many of these resources shared in a module, three schedulers acting on them, and a net 4+3+4 pipeline design. In a way, Zen will have less issues with the 4+2+4 design, improved buffers, and presumably improved schedulers and prediction capabilities.

Con cores have four pipelines per core, and three for the FPU, IIRC, which is 11 pipelines being fed by shared, and inferior, resources to what will be feeding Zen's 10 pipelines.

KTE · Jul 28, 2016

looncraz said:
The construction cores actually have decent branch predictors, it's just the penalty that causes problems - and the penalty is largely derived from the slow caches.

Decent, but not good enough compared to Intels. Run some branchy bench you know of and use CodeAnalyst to check the misprediction rates. I'll run the same with Intel Skylake.

I disagree that it's just cache and the rest was equally matched to Intel.

Slow cache is also a major oversimplification - it's ways, it's how many accesses per line can be made simultaneously, it's the latency and bandwidth with cache contention.

Sent from HTC 10
(Opinions are own)

New Zen microarchitecture details

Member

Senior member

Platinum Member

Diamond Member

Platinum Member

Diamond Member

Senior member

Platinum Member

Junior Member

Diamond Member

Senior member

Senior member

Golden Member

Lifer

Senior member

Platinum Member

Platinum Member

Golden Member

Senior member

Platinum Member

Senior member

Platinum Member

Golden Member

Senior member

Senior member