AMD Ryzen (Summit Ridge) Benchmarks Thread (use new thread)

lolfail9001 · Sep 24, 2016

superstition said:
Remember that it was only implemented as a victim cache in Broadwell C. Skylake, by contrast, was developed so it can be used as a DRAM buffer.

And, Peter Bright disagrees.

Well, it's up to you to prove that it gives larger gains as a DRAM buffer than as a victim cache. From what I recall, the gains from it on the 6770HQ/6700HQ were similar to the gains on Broadwell. With a small caveat: eDRAM on Skylake kills memory scaling. Basically it looks like a replacement for fast memory, little else.

The 4790K isn't clocked at the level of Broadwell C nor did it have the same low TDP.

Also, if Haswell is that competitive with Skylake it also supports my argument that enthusiasts would have been better-served if Intel had sold a higher-TDP Broadwell C part (possibly with iGPU disabled for yields) and/or a Skylake with the EDRAM (particularly in the apparently improved condition of being a DRAM buffer not just a victim cache). Why ask everyone to buy new RAM and a new motherboard for a minor improvement?

I thought you would know that TDP means jack on desktop CPUs. Broadwell (both mainstream and HEDT) caps off at 4.2-4.4 GHz, so even if Intel released a 200 W part, it would not be any faster than an overclocked 5775C.

IBM's results with the latest Power stuff also suggest that EDRAM is far from being overrated. There is EDRAM all over the place.

Well, what is its impact on performance, then? Considering that POWER is a no-compromise performance part by design.

Anyways, leaks went quiet. And that's boring.

bjt2 · Sep 24, 2016

itsmydamnation said:
Remember AMD decodes into Mops (assuming they haven't changed the way they have done things since K7 or so). 1 x86 instruction = 1 to 2 Mops. Traditionally, each AMD decoder can decode 1-2 Mops. Each Mop can have arithmetic and memory uops.

Yes, I know... I always wondered why they included the ability to avoid the microcode ROM (which is awfully slow) for 2-M(u)OP instructions, given that about 95-96% of instructions are 1 M(u)OP... The reason is simple: there are a few instructions, quite common in actual code, that require 2 MOPs on AMD's split-scheduler architecture: floating point moves. I wonder if this is also the case on Intel architectures... But now I can see the reason for including a potentially costly feature... Maybe simulations demonstrated that it was worth it.

cdimauro said:
It was a

it was a lapsus. Decoders -> dispatchers

Ok, now "INT and FP" makes sense...

Tuna-Fish · Sep 24, 2016

bjt2 said:
So a memory FP instruction is decoded into at least 2 uops? Now I get the REAL reason to have a double path decoder... There are a few instructions that need 2 uops, but they are very important...
I bet that a memory FP instruction is decoded as a single uop in Intel CPUs, due to the unified scheduler...

No. An FP+read instruction is decoded into one fused uop, which splits into one ALU uop and one AGU uop in the scheduler. Similarly, I expect that in Zen, a read + FP op enters the dispatcher as one op, but is dispatched to both sides, duplicating it at that point. The double path decoders are required for widening AVX -- 256-bit ops are converted into two 128-bit ones.

superstition · Sep 24, 2016

lolfail9001 said:
It's up to you to prove that ...

Not really, since the EDRAM as a victim cache has already demonstrated that it's a worthwhile addition.

lolfail9001 said:
I thought you would know that TDP means jack on desktop CPUs. Broadwell (both mainstream and HEDT) caps off at 4.2-4.4 GHz, so even if Intel released a 200 W part, it would not be any faster than an overclocked 5775C.

Peter Bright said:
Those 5775C results tantalized us with the prospect of a comparable Skylake part. Pair that ginormous cache with Intel's latest-and-greatest core and raise the speed limit on the clock speed by giving it a 90-odd W power envelope

I guess he doesn't know what TDP means, either.

Getting Broadwell C to not throttle, even at stock...

lolfail9001 · Sep 24, 2016

superstition said:
Not really, since the EDRAM as a victim cache has already demonstrated that it's a worthwhile addition.

Yeah, but as a DRAM buffer with DDR4-4000 in play? There is evidence that EDRAM gives a performance boost with DDR4-2133. But there is also evidence that with EDRAM in play on Skylake, the gains from faster memory are reduced to a minimum. That is very weird to observe on Skylake, of all recent uarches.

I guess he doesn't know what TDP means, either.

Getting Broadwell C to not throttle, even at stock...

Yeah, TDP means thermal design power; it has nothing directly to do with actual operating frequencies.
By the way, do you have evidence of Broadwell C throttling at stock? Let alone in an overclocked state, when power limits go out of the window entirely.

superstition · Sep 24, 2016

lolfail9001 said:
By the way, do you have evidence of Broadwell C throttling at stock?

It was discussed in the Anandtech review (why the 5675C outperformed the 5775C at times, due to its greater ability to turbo within the tiny 65 W envelope).

bjt2 · Sep 24, 2016

Tuna-Fish said:
No. An FP+read instruction is decoded into one fused uop, which splits into one ALU uop and one AGU uop in the scheduler. Similarly, I expect that in Zen, a read + FP op enters the dispatcher as one op, but is dispatched to both sides, duplicating it at that point. The double path decoders are required for widening AVX -- 256-bit ops are converted into two 128-bit ones.

Ok, this is the AMD concept of MOP and uop. A Macro Op can be composed of several uops. 1 MOP occupies one reservation station in the retire buffer, but can occupy more than one uop slot due to the composition of the MOP... This was true in the K7-K10 and seems to be true also for Zen...

lolfail9001 · Sep 24, 2016

superstition said:
It was discussed in the Anandtech review (why the 5675C outperformed the 5775C at times, due to its greater ability to turbo within the tiny 65 W envelope).

I've only seen that happen when the iGPU was in action, however.

Dresdenboy · Sep 24, 2016

Tuna-Fish said:
No. An FP+read instruction is decoded into one fused uop, which splits into one ALU uop and one AGU uop in the scheduler. Similarly, I expect that in Zen, a read + FP op enters the dispatcher as one op, but is dispatched to both sides, duplicating it at that point. The double path decoders are required for widening AVX -- 256-bit ops are converted into two 128-bit ones.

These ops run under many names at AMD (MacroOp, Mop, COP, instruction, etc.). I understood the FP+read handling the same way as you described. For 256b AVX Mike Clark answered one of my questions at the Q&A, saying that those are decoded as "fastpath doubles" into "one uop" (or "instruction") and split up when entering the dispatch buffer.

bjt2 · Sep 24, 2016

Dresdenboy said:
These ops run under many names at AMD (MacroOp, Mop, COP, instruction, etc.). I understood the FP+read handling the same way as you described. For 256b AVX Mike Clark answered one of my questions at the Q&A, saying that those are decoded as "fastpath doubles" into "one uop" (or "instruction") and split up when entering the dispatch buffer.

Nice: so 256-bit instructions still occupy 1 reservation station in the retire buffer... This is understandable, and tells us that the 256-bit register is treated as a whole and not as 2x128-bit registers... And also that the retire unit manages 256-bit registers as a whole: one PRF entry etc...

Dresdenboy · Sep 25, 2016

bjt2 said:
Nice: so 256-bit instructions still occupy 1 reservation station in the retire buffer... This is understandable, and tells us that the 256-bit register is treated as a whole and not as 2x128-bit registers... And also that the retire unit manages 256-bit registers as a whole: one PRF entry etc...

Hmm, I still assume that it's more like this for 256b SIMD:
1 x86 AVX/AVX2 instruction -> fetch -> decode -> 1 Mop -> dispatch -> 2 uops/2 PRs -> FPU -> 1+1 uop done -> retire (resolves dependency) -> x86 op retired

Mike also said that the FPU would individually resolve dependencies. So these 2 uops might run down 2 pipelines in parallel or serially depending on free issue slots.

This part of the updated patch might actually be a hint at the fast path double decode capabilities as it directly maps znver1-double to znver1-direct. And the "fix me" hasn't been updated so far.

Code:

+;; Direct instructions can be issued to any of the four decoders.
+(define_reservation "znver1-direct" "znver1-decode0|znver1-decode1|znver1-decode2|znver1-decode3")
+
+;; Fix me: Need to revisit this later to simulate fast path double behaviour.
+(define_reservation "znver1-double" "znver1-direct")
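
The 256b SIMD flow described in this post (one Mop out of decode, two 128-bit uops at dispatch, retirement once both halves are done) can be sketched as a toy model in Python. This is only an illustration of the assumed flow, not a description of the real hardware or of the GCC scheduler model; `MacroOp`, `split`, `can_retire` and `issue` are hypothetical names invented for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class MacroOp:
    """One decoded 256-bit AVX instruction (hypothetical toy model)."""
    name: str
    halves_done: set = field(default_factory=set)

    def split(self):
        # At dispatch the single Mop becomes two 128-bit uops: low and high half.
        return [(self, "lo"), (self, "hi")]

    def can_retire(self):
        # The x86 op retires only after both halves have executed.
        return self.halves_done == {"lo", "hi"}


def issue(uops, free_pipes):
    """Issue uops in order, at most `free_pipes` per cycle; return cycles used."""
    if free_pipes < 1:
        raise ValueError("need at least one free pipe")
    pending = list(uops)
    cycles = 0
    while pending:
        batch, pending = pending[:free_pipes], pending[free_pipes:]
        for mop, half in batch:
            mop.halves_done.add(half)
        cycles += 1
    return cycles
```

With two free issue slots both halves go down in parallel in one cycle; with only one free slot they run serially over two cycles, which matches the "in parallel or serially depending on free issue slots" remark.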

NostaSeronx · Sep 25, 2016

Dresdenboy said:
Hmm, I still assume that it's more like this for 256b SIMD:
1 x86 AVX/AVX2 instruction -> fetch -> decode -> 1 Mop -> dispatch -> 2 uops/2 PRs -> FPU -> 1+1 uop done -> retire (resolves dependency) -> x86 op retired

1x FP256 instruction Fetched;
Decode both Arithmetic and Load/Store;
2 Macro-ops (lower 128-bit & upper 128-bit) -> Dispatch -> Retire Queue -> Scheduler -> 4 Micro-ops (lower 128-bit math op & load/store op, upper 128-bit math op & load/store op) -> FP Dispatch -> Unit 0 * 2 or Unit 1 * 2, etc.
Rather than using both FMACs, only one is used. This allows two FP256 operations to occur without stalling both cores.

^-- Bulldozer to Excavator. The issue is that x86/FP256 operations do not have register renaming, while FP128 does. This forces the usage of FP128 if maximum performance and reliability is wanted.

All macro-ops and micro-ops are fixed length.

cdimauro · Sep 26, 2016

"This allows two FP256 operations to occur without stalling both cores."

Only if one of the two doesn't access memory, since one operation uses both available Load/Store units, according to your interesting post.

NostaSeronx · Sep 26, 2016

cdimauro said:
Only if one of the two doesn't access memory, since one operation uses both available Load/Store units, according to your interesting post.

This is where vertical multithreading comes in. Core 0 will finish, then Core 1. No stalls until it hits the WCC, which buffers the write-through operation of both cores. An FPU store will complete first on core 0, then operate on core 1. Based on first come, first served (FIFO scheduler).

Sweepr · Sep 30, 2016

New benchmarks:

https://browser.primatelabs.com/geekbench3/8076870
https://browser.primatelabs.com/geekbench3/8076878

jpiniero · Sep 30, 2016

So terrible. Looks like it's going to be a while before AMD releases Zen Server; it needs way more work.

Arachnotronic · Sep 30, 2016

Not a single Zen leak has actually been good. AMD may have produced another Dozer.

Azuma Hazuki · Sep 30, 2016

Arachnotronic said:
Not a single Zen leak has actually been good. AMD may have produced another Dozer.

Pray to whatever you believe in they haven't. Competition keeps your precious Intel chips affordable. Honestly I will never understand your hatred of AMD...it's really creepy and gross. You act like someone at Sunnyvale ate your cat or something.

crashtech · Sep 30, 2016

A little quick and dirty arithmetic puts the ST perf of that leaked result on par with an old Phenom, assuming linear frequency scaling.
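
The kind of back-of-the-envelope normalisation meant here can be sketched in Python. It only encodes the stated assumption that performance scales linearly with frequency; the function names are hypothetical, and the numbers in the comment are placeholders for illustration, not values taken from the leaked result.

```python
def score_per_ghz(score, clock_ghz):
    """Single-thread score divided by clock: a crude per-clock figure."""
    if clock_ghz <= 0:
        raise ValueError("clock must be positive")
    return score / clock_ghz


def scale_score(score, measured_ghz, target_ghz):
    """Project a score to another clock, assuming linear frequency scaling."""
    return score_per_ghz(score, measured_ghz) * target_ghz


# Placeholder example (made-up numbers): a score of 1000 measured at 2.0 GHz
# projects to 1500 at 3.0 GHz under the linear assumption.
projected = scale_score(1000, 2.0, 3.0)
```

Real chips rarely scale perfectly linearly (memory latency does not shrink with core clock), so a projection like this is an upper bound at best.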

Arachnotronic · Sep 30, 2016

Azuma Hazuki said:
Pray to whatever you believe in they haven't. Competition keeps your precious Intel chips affordable. Honestly I will never understand your hatred of AMD...it's really creepy and gross. You act like someone at Sunnyvale ate your cat or something.

I don't hate AMD. I dislike personal attacks, though. Anyway, I think it would benefit you if you understood where I'm coming from. I am not hoping, wishing, or praying that Zen is good or bad. I am merely looking at what information is out there and drawing a conclusion.

It does nobody any good to look at data but then completely disregard it due to "hopes and dreams."

Anyway, Intel CPUs are quite affordable. $340 buys you the best gaming CPU on the market, hands down. Lasts for several years and has good resale value at the end of ownership.

Interesting how the 6700K is affordable even though AMD hasn't been competitive in years. Maybe AMD's continued presence in the desktop CPU market isn't as essential as some people think?

jpiniero · Sep 30, 2016

Arachnotronic said:
Not a single Zen leak has actually been good. AMD may have produced another Dozer.

The Ashes leak performance wasn't terrible, although at the stock clocks it's not going to be all that great for gaming. We don't know how good of an overclocker Zen is.

crashtech said:
A little quick and dirty arithmetic puts the ST perf of that leaked result on par with an old Phenom, assuming linear frequency scaling.

I'm convinced the MCM is wreaking havoc.

Azuma Hazuki said:
Pray to whatever you believe in they haven't. Competition keeps your precious Intel chips affordable. Honestly I will never understand your hatred of AMD...it's really creepy and gross. You act like someone at Sunnyvale ate your cat or something.

Intel's got plenty of competition out there; Apple (The ARMy as a whole really), nVidia, IBM (Power)...

Arachnotronic · Sep 30, 2016

jpiniero said:
The Ashes leak performance wasn't terrible, although at the stock clocks it's not going to be all that great for gaming. We don't know how good of an overclocker Zen is.

The AoTS benchmark suggested perf/clock well below Haswell levels and clock speeds don't look high either. I wouldn't bet on it being a good overclocker given that the die is a server-first design. We'll see how it does when it finally arrives, though.

I'm convinced the MCM is wreaking havoc.

I would blame the poor multicore scaling on both the MCM and the fact that the die is organized into CCXs with 4 cores + shared L3$ each. There has got to be some serious overhead involved in such a structure compared to a monolithic die with all of the cores sharing a single large pool of L3$.

Intel's got plenty of competition out there; Apple (The ARMy as a whole really), nVidia, IBM (Power)...

True, also Intel faces competition from a PC market that's in structural decline and competition from what it sold customers in the past (they need compelling reasons to upgrade otherwise they stick to what they have).

Arachnotronic · Sep 30, 2016

crashtech said:
A little quick and dirty arithmetic puts the ST perf of that leaked result on par with an old Phenom, assuming linear frequency scaling.

I wouldn't read too much into GB3, that was a pretty flawed benchmark. I think the best indications we have on Zen performance are the GB4 results as well as the AoTS results.

AMD has built a better core than anything Dozer, so that could help them in a number of markets. But those expecting AMD to seriously challenge Skylake in terms of per-core performance will, IMHO, be disappointed. Even if you look at AMD's own public statement of +40% IPC over XV, Zen is still likely to be behind Haswell more often than not, and very much behind Skylake.

I think Zen will offer Sandy Bridge/Ivy Bridge levels of performance-per-clock. We already know what clocks AMD is planning to ship Summit Ridge at, the question for enthusiasts/gamers really comes down to -- as jpiniero points out -- how well this thing will overclock. I suspect Intel will have an advantage in this regard with its HEDT chips.
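
The "+40% IPC over XV" argument in this post comes down to simple ratio arithmetic, sketched below in Python. The only number taken from the thread is the 40% uplift; how Excavator compares per clock to Haswell or Skylake is left as an input, because no such figure is given here. All function and constant names are hypothetical.

```python
ZEN_UPLIFT_OVER_XV = 1.40  # AMD's public +40% IPC claim, as quoted in the post


def zen_vs_reference(xv_vs_reference):
    """Zen per-clock performance relative to a reference core.

    `xv_vs_reference` is Excavator's per-clock performance as a fraction of
    the reference core (an assumed input, not a number from this thread).
    """
    return ZEN_UPLIFT_OVER_XV * xv_vs_reference


def break_even_xv_ratio():
    """XV-to-reference ratio at which a +40% uplift exactly matches the reference."""
    return 1.0 / ZEN_UPLIFT_OVER_XV
```

In other words, Zen only matches a given Intel core per clock if Excavator already reaches about 71% of that core's per-clock performance; below that ratio, the 40% uplift still leaves it behind.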

AtenRa · Sep 30, 2016

Arachnotronic said:
Maybe AMD's continued presence in the desktop CPU market isn't as essential as some people think?

If you want quad cores (that's what the Core i7 6700K is) at $300-350 for the next 10 years, then yes, it's not essential to have AMD around.

lolfail9001 · Sep 30, 2016

AtenRa said:
If you want quad cores (that's what the Core i7 6700K is) at $300-350 for the next 10 years, then yes, it's not essential to have AMD around.

Well, go ahead and cop yourself a 2620v4, if you want cores for the low.
