New Zen microarchitecture details


Dresdenboy

Golden Member
If AMD planned to design BD for servers and HPC (FMA!), they likely designed for recompiled code. There are many results showing big differences between code optimized for generic, Intel, and bdverX targets. P3DNow! had results for differently optimized versions of Lame on equally clocked PD, SR, and XV:
http://www.planet3dnow.de/cms/18564...tekturen/subpage-praxistests-fritzchess-lame/
LAME - 64-bit generic vs. LAME - 64-bit amdbd-optimized (result charts in the linked article): the amdbd-optimized variants are 33% to 35% faster.

This was one of the worst decisions.
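
For anyone who wants to reproduce this kind of target comparison, a minimal sketch follows (the file and flags are my own toy example, not Planet3DNow!'s actual setup):

// hotloop.cpp - toy stand-in for a LAME-style FP inner loop.
// Build the same source for different targets and compare runtimes:
//   g++ -O3 -march=x86-64 hotloop.cpp -o lame_generic   (generic 64-bit)
//   g++ -O3 -march=bdver2 hotloop.cpp -o lame_pd        (Piledriver)
//   g++ -O3 -march=bdver3 hotloop.cpp -o lame_sr        (Steamroller)
//   g++ -O3 -march=bdver4 hotloop.cpp -o lame_xv        (Excavator)
#include <cstdio>

int main() {
    float acc = 0.0f;
    for (long i = 0; i < 200000000L; ++i)
        acc += static_cast<float>(i & 1023) * 0.5f;   // simple FP work
    std::printf("%f\n", acc);   // print so the loop isn't optimized away
    return 0;
}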
 

The Stilt

Golden Member
Did some tests with Bullet physics library using Steamroller, Excavator and Haswell.

I built the libraries and the test program with various settings (e.g. instructions up to SSE3 / SSE4.2 / AVX / AVX2, and µarch-specific tunes).

Based on the results, Excavator appears to be the slowest of the three. Regardless of the compiler settings used, Steamroller is 9.2% faster on average. On both Steamroller and Excavator, non-architecture-specific settings combined with instructions up to AVX & FMA always produce the best results, even better than the specific "march" setting. It doesn't seem that much effort has gone into optimizing GCC presets for AMD architectures. Hope this will change with Zen...
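
For reference, the build matrix was roughly along these lines (the exact commands are my guess, not the actual build script; kernel.cpp is a hypothetical stand-in for a Bullet hot path):

// kernel.cpp - a*b+c over arrays; with -mfma, GCC typically contracts this to vfmadd.
//   g++ -O3 -c -msse3 kernel.cpp          (instructions up to SSE3)
//   g++ -O3 -c -msse4.2 kernel.cpp        (up to SSE4.2)
//   g++ -O3 -c -mavx -mfma kernel.cpp     (AVX + FMA, no march/mtune)
//   g++ -O3 -c -march=bdver3 kernel.cpp   (Steamroller-specific)
//   g++ -O3 -c -march=bdver4 kernel.cpp   (Excavator-specific)
#include <cstddef>

void fmadd(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] * b[i] + c[i];
}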

Haswell is nearly 61% faster than Steamroller when SSE3 is used and around 74% faster than Excavator when SSE3 or SSE4.2 is used.

Bullet is used pretty widely in recent games, benchmarks and rendering applications (e.g. GTA V, 3DMark, Blender, Cinema 4D, etc). Could partly explain why AMD CPUs do so badly in GTA V and 3DMark physics tests :\

I also noticed an interesting phenomenon with Haswell. When AVX or AVX2 and FMA are enabled simultaneously and no "march" or "mtune" parameter is given, one of the individual tests (136 ragdolls) slows down by over 500%. However, as soon as a "march" parameter is given, with both AVX/2 and FMA obviously remaining active, the phenomenon ceases to exist.

I wouldn't expect GCC 5.3 to have such a bug since Haswell has been supported for a "while" now.
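
If someone wants to isolate it, the only variable that needs toggling is the march parameter; a toy A/B harness along these lines would do (my own sketch, not the actual ragdoll test):

// bench.cpp - same ISA in both builds, only the tuning preset differs:
//   g++ -O3 -mavx2 -mfma bench.cpp -o bench_generic     (no march given)
//   g++ -O3 -march=haswell bench.cpp -o bench_haswell   (march given)
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1 << 20;
    std::vector<float> a(n, 1.5f), b(n, 2.5f), c(n, 0.5f);
    auto t0 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 500; ++rep)
        for (std::size_t i = 0; i < n; ++i)
            c[i] = a[i] * b[i] + c[i];   // FMA-contractible inner loop
    auto t1 = std::chrono::steady_clock::now();
    std::printf("%.1f ms (check: %f)\n",
        std::chrono::duration<double, std::milli>(t1 - t0).count(), c[0]);
    return 0;
}

Diffing the two binaries with objdump -d is probably more telling than the timings themselves.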

In case anyone wants to try themselves: https://onedrive.live.com/redir?resid=8329B08E8413A80E!546&authkey=!AAnGDZ1Nv6fMfEw&ithint=file,7z

The benchmark itself is from the Bullet 2.82 build, while the libraries are from the newest build available on Git (2.83.xxx). They have been compiled with GCC 5.3 x86-64.
It is the same benchmark as OpenBenchmarking uses, but the build options differ. It is also single-threaded only.
 

NostaSeronx

Diamond Member
It would be preferable for the 15h architecture to use the instruction set it is built around:

XOP = {x87, MMX, SSE, SSE2, SSE3, SSSE3, SSE4A, SSE4.1, SSE4.2}>Legacy, {SSE5}>XOP

-mno-avx
-mno-fma
-mxop (XOP implies XOP and all Legacy XOP + FMA4 capability for both XOP and Legacy XOP instructions.)

Excavator;
-mno-avx
-mno-avx2
-mno-fma
-mxop (XOP implies XOP and all Legacy XOP + FMA4 capability for both XOP and Legacy XOP instructions.)

If you want to test the legacy path, which isn't really talked about at all even though it appears in the SSE5 and XOP guide, simply set -mno for everything not XOP. It is weird to do, but the code executes faster.

Turning off FMA3 but enabling FMA4 allows for register renaming to primarily focus on dependencies.
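
For anyone curious what exercising that combination looks like, here's a minimal sketch (my own toy example, not Nosta's code; note that in GCC -mxop pulls in FMA4 as well, and this only runs on XOP-capable Bulldozer-family hardware):

// xop_fma4.cpp - build with:  g++ -O3 -mxop xop_fma4.cpp -o xop_fma4
#include <x86intrin.h>
#include <cstdio>

int main() {
    __m128 a = _mm_set1_ps(2.0f), b = _mm_set1_ps(3.0f), c = _mm_set1_ps(1.0f);
    __m128 r = _mm_macc_ps(a, b, c);        // FMA4: 4-operand r = a*b + c
    __m128i v = _mm_set1_epi32(0x01020304);
    __m128i rot = _mm_roti_epi32(v, 8);     // XOP: rotate each dword left by 8
    float out[4];
    _mm_storeu_ps(out, r);
    std::printf("fma4: %f, xop rot: %08x\n", out[0],
                static_cast<unsigned>(_mm_cvtsi128_si32(rot)));
    return 0;
}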
 

The Stilt

Golden Member
bdver4 uses the bdver3 pipeline description. And according to this quick latency/throughput comparison, XV shouldn't suffer IPC-wise. Is it some thermal limit instead?

-march=bdverX implies -mprefer-avx128, which you might revert by -mno-prefer-avx128. Did you try this?

The cTDP was at 35/42W, which is plenty for a single threaded workload when there is no real background activity. No thermal limits either as it hits barely 50°C during the test.

I'll try disabling the splitting, but I don't think it really matters. AVX has no effect on the performance with this library; only FMA does.

In the compiler guide for Excavator, AMD recommends the following optimizations in GCC (5.20); a single-command version follows the list:

-march=bdver4
-funroll-all-loops
-fprefetch-loop-arrays --param prefetch-latency=300
-fno-aggressive-loop-optimizations
-fschedule-insns2 -fsched-pressure
-mno-avx256-split-unaligned-store
-flive-range-shrinkage
-ftree-vectorize
-fstack-arrays
-fbranch-target-load-optimize
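
Pulled into a single invocation, the preset would look roughly like this (my own assembly of the list above, with an assumed -O3; -fstack-arrays is a Fortran-only option and is omitted):

g++ -O3 -march=bdver4 -funroll-all-loops \
    -fprefetch-loop-arrays --param prefetch-latency=300 \
    -fno-aggressive-loop-optimizations \
    -fschedule-insns2 -fsched-pressure \
    -mno-avx256-split-unaligned-store \
    -flive-range-shrinkage -ftree-vectorize \
    -fbranch-target-load-optimize kernel.cpp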

I'll try adding some of those and do a re-test.

Perhaps it is just the smaller L2 which is giving the pain?
 

The Stilt

Golden Member
Neither disabling the split (-mno-prefer-avx128) nor using the AMD-suggested tunings made any difference. In fact, with those tunings (excl. -fstack-arrays, which is Fortran-only) the results were somewhat worse than without them. In a single test they improved the performance by 1.5% or so.
 

Abwx

Lifer
The Stilt said:
Regardless of the compiler settings used, Steamroller is 9.2% faster on average.

In some memory-latency-sensitive integer apps, XV's higher latency sometimes has SR performing better; this can be checked in the link posted by Dresdenboy.
 

Dresdenboy

Golden Member
The Stilt, which clock frequencies or true IPC do you see with SR and XV during the benchmark runs?

According to instlatx64, XV doesn't support special cases for DIV/SQRT anymore (e.g. div by 2.0, sqrt of 1.0). A look at the assembler code of some hot spots might reveal something.

Just as a side note: Intel optimized SKL for higher FO4 per stage, leading to lower latencies but limited clock frequency headroom: http://users.atw.hu/instlatx64/HSWvsBDWvsSKL.txt
 

The Stilt

Golden Member
Dresdenboy said:
The Stilt, which clock frequencies or true IPC do you see with SR and XV during the benchmark runs?

According to instlatx64, XV doesn't support special cases for DIV/SQRT anymore (e.g. div by 2.0, sqrt of 1.0). A look at the assembler code of some hot spots might reveal something.

Just as a side note: Intel optimized SKL for higher FO4 per stage, leading to lower latencies but limited clock frequency headroom: http://users.atw.hu/instlatx64/HSWvsBDWvsSKL.txt

I made a debug build and made full profiles on both SR & XV using AMD CodeXL (a huge mistake). There was no noticeable difference in hot spots, and all of the hot spots contained the same instructions: vmovss and vmovaps. Some of the even smaller hot spots contained mostly mulss, movsxd, adds and imul instructions.

I said using CodeXL was a huge mistake, since its "save project as" feature doesn't actually save the recorded data... D:

Btw, 3DMark uses the same Bullet 2.83 physics library as I'm using. In the Sky Diver physics test ("the rock swing") Steamroller is 12.5% faster than Excavator on average.

Also when I'm determining the difference to Haswell I always set the chip to run at 3.2GHz static. That's because 3.2GHz is the highest frequency my Haswell can run at.
 

yuri69

Senior member
The main diffs in the XV core, compared to SR, should be the cache sizes:

* L1D doubled from a fireblast 16KiB to a reasonable 32KiB
* L2 halved to 1MiB per CU, with a few cycles lower L2 latency, IIRC

Is the data set small enough to benefit from the larger L1D, or is it hurt by the smaller L2? That's the question.
 

The Stilt

Golden Member
Got it verified :sneaky:
It is the smaller L2 which is giving the pain.

I compared Steamroller with different L2 configurations (128x8K & 256x8K) and the difference with otherwise identical settings is 12% in Bullet.

Steamroller 2M:

3000 fall: 10.5812
1000 stack: 12.3339
136 ragdolls: 1.2856
1000 convex: 9.3273
prim-trimesh: 1.9392
convex-trimesh: 2.2484
raytests: 5.3975

Total: 43.1131s

Steamroller 1M:

3000 fall: 12.3344
1000 stack: 13.5746
136 ragdolls: 1.5441
1000 convex: 10.0518
prim-trimesh: 2.3245
convex-trimesh: 2.6162
raytests: 5.8672

Total: 48.3128s

Excavator 1M:

3000 fall: 12.4454
1000 stack: 13.2581
136 ragdolls: 1.5529
1000 convex: 9.7177
prim-trimesh: 2.3886
convex-trimesh: 2.5709
raytests: 5.5253

Total: 47.4588s
 

The Stilt

Golden Member
Who knows :sneaky:

I just hope that the L2 on Zen is conservative in terms of latency. AMD has always had a hard time with their caches, especially the L2. Partially that's probably because they are larger than Intel's, but there has to be more to it. I wouldn't expect a 20-cycle L2 to be the first limiting factor in terms of Fmax, like it is on Piledriver for example D:

What's the L3 latency on Broadwell-E for example? :sneaky:
 

Dresdenboy

Golden Member
The Stilt said:
Who knows :sneaky:

I just hope that the L2 on Zen is conservative in terms of latency. AMD has always had a hard time with their caches, especially the L2. Partially that's probably because they are larger than Intel's, but there has to be more to it. I wouldn't expect a 20-cycle L2 to be the first limiting factor in terms of Fmax, like it is on Piledriver for example D:

What's the L3 latency on Broadwell-E for example? :sneaky:
In addition to being big (high latencies), part of the construction cores' L2 problems might be caused by inefficient handling of multiple overlapping requests (from the core or the prefetchers).
 

PPB

Golden Member
Dresdenboy said:
In addition to being big (high latencies), part of the construction cores' L2 problems might be caused by inefficient handling of multiple overlapping requests (from the core or the prefetchers).
Yep. Cache thrashing is a real thing in the Con uarch, and it is really the only glaring fault in AMD's implementation of CMT.

 

Dresdenboy

Golden Member
PPB said:
Yep. Cache thrashing is a real thing in the Con uarch, and it is really the only glaring fault in AMD's implementation of CMT.
And there's also that Write Coalescing Cache, the FPU <-> L1 bottleneck and so on.

BTW, just found this nice quote:
AMD Investor presentation Feb 2016 said:
Expect revenue to grow y/y in 2016 driven by:
[...]
  • "Bristol Ridge", "Summit Ridge", and next-generation graphics based on Polaris architecture

Also for 2016:
"Zen" launch with "Summit Ridge" desktop product and server product sampling in 2016
Does it mean SR launch + server samples in 2016 or "launch" with both SR + server samples?

And AMD was already expecting further shrinking PC revenue at least a year ago:

http://www.tomshardware.com/news/amd-financial-analysis-2015,29056.html
 

Exophase

Diamond Member
Dresdenboy said:
And there's also that Write Coalescing Cache, the FPU <-> L1 bottleneck and so on.

One other weakness of CMT is that it's a lot messier to handle fine-grained power management of the cores. Since some parts are shared and some parts aren't, the best you could really do is have separate power gating for the shared part of the module and the unshared portions of the two cores, i.e. three sections. And when only one core is active, the shared part has to be entirely on, even though it's oversized (and thus overpowered) for single-core execution.

In practice, the result was simply that you could only power gate the entire module, AFAIK.
 

NostaSeronx

Diamond Member
Exophase said:
One other weakness of CMT is that it's a lot messier to handle fine-grained power management of the cores. Since some parts are shared and some parts aren't, the best you could really do is have separate power gating for the shared part of the module and the unshared portions of the two cores, i.e. three sections. And when only one core is active, the shared part has to be entirely on, even though it's oversized (and thus overpowered) for single-core execution.

In practice, the result was simply that you could only power gate the entire module, AFAIK.
Cluster multithreading can easily handle fine-grained power management of the cores. So, is it a weakness of CMT? No. Is it a perceived weakness of Bulldozer? Yes.
 

NostaSeronx

Diamond Member
You don't exactly present a convincing counter argument.
You don't present a convincing argument whatsoever. I'll go over what you typed:
One other weakness of CMT is that it's a lot messier to handle fine-grained power management of the cores.
Fine-grained power management of the cores in CMT is not messy, nor is it messier than simultaneous multithreading. In fact, cluster multithreading is easier, as it uses physical replication rather than logical replication.
Since in SMT some parts are shared and some parts aren't, the best you could really do is have separate power gating for the shared part of the module and the unshared portions of the two cores/threads, or three sections.

And when only one core/thread is active, the shared part has to be entirely on, even though it's oversized (and thus overpowered) for single core/thread execution.
The issue with the above is that you are describing any simultaneous multithreading architecture as well as Bulldozer's implementation of CMT. I have added bold addenda side by side to replace or add information.

Full-purpose CMT is divided between µ-architecture and n-architecture. No implementation of standard SMT can deviate from the µ-architecture, thus an n-architecture isn't possible.

The cores in a CMT design like Bulldozer can each be given their own voltage and their own frequency, provided each is a division or a multiple of the module voltage or module frequency.

Base Clock × Module Ratio (µ-architecture) × Core Ratio (n-architecture) = Core Frequency
Base Voltage × Module Ratio (µ-architecture VR) × Core Ratio (n-architecture IVR) = Core Voltage
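
To put hypothetical numbers on it: with a 100 MHz base clock, a module ratio of 40 and a per-core ratio of 0.95, that core runs at 100 MHz × 40 × 0.95 = 3.8 GHz, while its sibling at a core ratio of 0.5 would sit at 2.0 GHz.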

Shared and overprovisioned resources aren't a problem in simultaneous multithreading. Thus, they aren't an issue in cluster multithreading.
http://www.cs.cmu.edu/~seth/wild-and-crazy09/koreysewell.pdf
EXtreme Virtual Pipelining (XVP) takes a step toward scalable, multithreaded processors by avoiding the pitfalls constraining conventional designs. Instead of increasing the size of critical path resources and attempting to learn optimal allocations, XVP chooses to virtualize pipeline resources, to provide mechanisms for those resources to dynamically partition themselves, and to add a 3rd L1-Cache for storage of those resources. Future versions of XVP could virtualize other non-critical path shared resources like branch predictors, branch target buffers or load-wait tables. Additionally, XVP's virtualization methods can be used to optimize single-threaded processors by providing the illusion of more pipeline resources than is traditionally available on a single-threaded processor.
 

Exophase

Diamond Member
So basically your argument is that you can disable parts of the unused core in a CMT module more than you can disable an unused SMT thread (which only barely exists as a hardware construct). That's not a fair comparison. No one was ever calling separate threads in SMT "cores", and the added compulsory power overhead by making a core SMT is nothing compared to that of turning a core into a CMT module. Furthermore, in the SMT case nearly 100% of the core can be utilized by a single thread, while that's very far from true in the CMT case.

You can give the cores in a module separate clocks and voltage domains... if you're okay with having three clock and voltage domains in a single module, which adds a lot of overhead for extra regulators and adds latency for clock crossing between the shared and unshared regions. There's a reason why most CPU makers today don't even have single cores on separate DVFS domains.

I think there's some pretty compelling reasons why AMD is abandoning CMT with Zen and instead joining much of the rest of the industry with SMT.
 

Lepton87

Platinum Member
"Vivid" would be the understatement of the century :sneaky:

Whimsical imagination? Better?
PS: Nosta, how is your love affair with FD-SOI going? It's been a long time since you wrote something about it.
 
May 11, 2008
While reading this website, I read about the Zen Lite core.

http://vrworld.com/2016/05/11/amd-confirms-sony-playstation-neo-based-zen-polaris/

The only mandate the company received was to keep the hardware changes invisible to the game developers, but that was also changed when Polaris 10 delivered a substantial performance improvement over the original hardware. The new 14nm FinFET APU consists out of eight x86 'Zen Lite' LP cores at 2.1 GHz (they're not Jaguar cores, as previously rumored) and a Polaris GPU, operating on 15-20% faster clock than the original PS4.

Now, since the new PS4 APU will have 8 cores, this to me sounds like the SMT capabilities will be removed, for they may never be used. What else could be removed from such an 8-core CPU to match the 8 Jaguar cores of the current PS4?

I mean, the goal is to stay compatible with the old Jaguar cores to be able to run existing software, and to make the APU as cheap as possible.
I am sure that the OS on the PS4 hides the clock differences and other architectural changes from the games. I assume that the games are not programmed in a "bare metal" sense.
 