AMD Ryzen (Summit Ridge) Benchmarks Thread (use new thread)

Status
Not open for further replies.

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
Remember that it was only implemented as a victim cache in Broadwell C. Skylake, by contrast, was developed so it can be used as a DRAM buffer.

And, Peter Bright disagrees.
Well, it's up to you to prove that it gives larger gains as a DRAM buffer than as a victim cache. From what I recall, gains from it on the 6770HQ/6700HQ were similar to the gains on Broadwell. With a small caveat: eDRAM on Skylake kills memory scaling. Basically it looks like a replacement for fast memory, little else.
The 4790K isn't clocked at the level of Broadwell C nor did it have the same low TDP.

Also, if Haswell is that competitive with Skylake it also supports my argument that enthusiasts would have been better-served if Intel had sold a higher-TDP Broadwell C part (possibly with iGPU disabled for yields) and/or a Skylake with the EDRAM (particularly in the apparently improved condition of being a DRAM buffer not just a victim cache). Why ask everyone to buy new RAM and a new motherboard for a minor improvement?
I thought you would know that TDP means jack on desktop CPUs. Broadwell (both mainstream and HEDT) tops out at 4.2-4.4 GHz, so even if Intel released a 200 W part, it would not be any faster than an overclocked 5775C.
IBM's results with the latest Power stuff also suggest that EDRAM is far from being overrated. There is EDRAM all over the place.
Well, what is its impact on performance, then? Considering that POWER is a no-compromise performance part by design.

Anyways, leaks went quiet. And that's boring.
 

bjt2

Senior member
Sep 11, 2016
784
180
86
Remember AMD decodes into Mops (assuming they haven't changed the way they have done things since like K7). 1 x86 instruction = 1 to 2 Mops. Traditionally, each AMD decoder can decode 1-2 Mops. Each Mop can have arithmetic and memory uops.

Yes, I know... I always wondered why they included the ability to avoid the microcode ROM (which is awfully slow) for 2-M(u)OP instructions, given that about 95-96% of instructions are 1 M(u)OP... The reason is simple: there are a few instructions, quite common in actual code, that require 2 MOPs on AMD's split-scheduler architecture: floating point moves. I wonder if this is also the case on Intel architectures... But now I can see the reason for including a potentially costly feature... Maybe simulations demonstrated that it was worth it.

it was a lapsus. Decoders -> dispatchers
Ok, now "INT and FP" makes sense...
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,474
1,960
136
So a memory FP instruction is decoded into at least 2 uops? Now I get the REAL reason to have a double path decoder... There are a few instructions that need 2 uops, but they are very important...
I bet that a memory FP instruction is decoded as a single uop in Intel CPUs, due to the unified scheduler...

No. An FP+read instruction is decoded into one fused uop, which splits into one ALU uop and one AGU uop in the scheduler. Similarly, I expect that in Zen, a read + FP op enters the dispatcher as one op, but is dispatched to both sides, duplicating it at that point. The double path decoders are required for widening AVX -- 256-bit ops are converted into two 128-bit ones.
 

superstition

Platinum Member
Feb 2, 2008
2,219
221
101
It's up to you to prove that ...
Not really since the EDRAM as victim cache has demonstrated that it's a worthwhile addition already.

I thought you would know that TDP means jack on desktop CPUs. Broadwell (both mainstream and HEDT) tops out at 4.2-4.4 GHz, so even if Intel released a 200 W part, it would not be any faster than an overclocked 5775C.

Peter Bright said:
Those 5775C results tantalized us with the prospect of a comparable Skylake part. Pair that ginormous cache with Intel's latest-and-greatest core and raise the speed limit on the clock speed by giving it a 90-odd W power envelope

I guess he doesn't know what TDP means, too.

Getting Broadwell C to not throttle, even at stock...
 

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
Not really since the EDRAM as victim cache has demonstrated that it's a worthwhile addition already.
Yeah, but as a DRAM buffer with DDR4-4000 in play? There is evidence that EDRAM gives a performance boost with DDR4-2133. But there is also evidence that with EDRAM on Skylake, gains from faster memory are reduced to a minimum. Something very weird to observe on Skylake, of all recent uarches.
I guess he doesn't know what TDP means, too.

Getting Broadwell C to not throttle, even at stock...
Yeah, TDP means thermal design power; it has nothing directly to do with actual operating frequencies.
By the way, do you have evidence of Broadwell C throttling at stock? Let alone in an overclocked state, when power limits go out of the window entirely.
 

bjt2

Senior member
Sep 11, 2016
784
180
86
No. An FP+read instruction is decoded into one fused uop, which splits into one ALU uop and one AGU uop in the scheduler. Similarly, I expect that in Zen, a read + FP op enters the dispatcher as one op, but is dispatched to both sides, duplicating it at that point. The double path decoders are required for widening AVX -- 256-bit ops are converted into two 128-bit ones.
OK, this is the AMD concept of MOP and uop. A macro-op can be composed of multiple uops. 1 MOP occupies one reservation station in the retire buffer, but can occupy more than one uop slot due to the composition of the MOP... This was true in K7-K10 and seems to be true for Zen as well...
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
No. An FP+read instruction is decoded into one fused uop, which splits into one ALU uop and one AGU uop in the scheduler. Similarly, I expect that in Zen, a read + FP op enters the dispatcher as one op, but is dispatched to both sides, duplicating it at that point. The double path decoders are required for widening AVX -- 256-bit ops are converted into two 128-bit ones.
These ops run under many names at AMD (MacroOp, Mop, COP, instruction, etc.). I understood the FP+read handling the same way as you described. For 256b AVX Mike Clark answered one of my questions at the Q&A, saying that those are decoded as "fastpath doubles" into "one uop" (or "instruction") and split up when entering the dispatch buffer.
 

bjt2

Senior member
Sep 11, 2016
784
180
86
These ops run under many names at AMD (MacroOp, Mop, COP, instruction, etc.). I understood the FP+read handling the same way as you described. For 256b AVX Mike Clark answered one of my questions at the Q&A, saying that those are decoded as "fastpath doubles" into "one uop" (or "instruction") and split up when entering the dispatch buffer.

Nice: so 256-bit instructions still occupy 1 reservation station in the retire buffer... This is understandable, and tells us that the 256-bit register is treated as a whole and not as 2x128-bit registers... And also that the retire unit manages 256-bit registers as a whole: one PRF entry etc...
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Nice: so 256-bit instructions still occupy 1 reservation station in the retire buffer... This is understandable, and tells us that the 256-bit register is treated as a whole and not as 2x128-bit registers... And also that the retire unit manages 256-bit registers as a whole: one PRF entry etc...
Hmm, I still assume that it's more like this for 256b SIMD:
1 x86 AVX/AVX2 instruction -> fetch -> decode -> 1 Mop -> dispatch -> 2 uops/2 PRs -> FPU -> 1+1 uop done -> retire (resolves dependency) -> x86 op retired

Mike also said that the FPU would individually resolve dependencies. So these 2 uops might run down 2 pipelines in parallel or serially depending on free issue slots.
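
As a rough illustration of the flow described above, here is a toy Python model. It is only a sketch of the idea (one Mop in, two 128-bit uops out at dispatch, retire once both halves are done), not a description of the real hardware. The names `MacroOp`, `MicroOp`, `dispatch` and `can_retire` are made up for this example.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class MacroOp:
    """One decoded x86 instruction (hypothetical model, not AMD's format)."""
    mnemonic: str
    width_bits: int


@dataclass(frozen=True)
class MicroOp:
    """One unit of work sent to the FPU; 'half' is 'lo', 'hi' or 'full'."""
    mnemonic: str
    half: str


def dispatch(mop: MacroOp) -> List[MicroOp]:
    """Split a 256-bit Mop into two 128-bit uops; pass narrower ops through."""
    if mop.width_bits == 256:
        return [MicroOp(mop.mnemonic, "lo"), MicroOp(mop.mnemonic, "hi")]
    return [MicroOp(mop.mnemonic, "full")]


def can_retire(mop: MacroOp, completed: Set[MicroOp]) -> bool:
    """The Mop retires only after every uop it was split into has completed."""
    return all(uop in completed for uop in dispatch(mop))


if __name__ == "__main__":
    mop = MacroOp("vaddps", 256)
    uops = dispatch(mop)
    print(uops)                          # two uops, lower and upper half
    print(can_retire(mop, {uops[0]}))    # only one half done
    print(can_retire(mop, set(uops)))    # both halves done
```

Because the two uops are independent objects in this model, they could complete in either order or in the same cycle, which matches the "in parallel or serially" remark.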

This part of the updated patch might actually be a hint at the fast path double decode capabilities as it directly maps znver1-double to znver1-direct. And the "fix me" hasn't been updated so far.
Code:
+;; Direct instructions can be issued to any of the four decoders.
+(define_reservation "znver1-direct" "znver1-decode0|znver1-decode1|znver1-decode2|znver1-decode3")
+
+;; Fix me: Need to revisit this later to simulate fast path double behaviour.
+(define_reservation "znver1-double" "znver1-direct")
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,701
1,228
136
Hmm, I still assume that it's more like this for 256b SIMD:
1 x86 AVX/AVX2 instruction -> fetch -> decode -> 1 Mop -> dispatch -> 2 uops/2 PRs -> FPU -> 1+1 uop done -> retire (resolves dependency) -> x86 op retired
1x FP256 instruction Fetched;
Decode both Arithmetic and Load/Store;
2 macro-ops (lower 128-bit & upper 128-bit) -> Dispatch -> Retire Queue -> Scheduler -> 4 micro-ops (lower 128-bit math op & load/store op, upper 128-bit math op & load/store op) -> FP Dispatch -> Unit 0 * 2 or Unit 1 * 2, etc.
Rather than using both FMACs, only one is used. This allows two FP256 operations to occur without stalling both cores.

^-- Bulldozer to Excavator. The issue is that x86/FP256 operations do not have register renaming, while FP128 does. This forces the usage of FP128 if maximum performance and reliability is wanted.

All macro-ops and micro-ops are fixed length.
 

cdimauro

Member
Sep 14, 2016
163
14
61
"This allows two FP256 operations to occur without stalling both cores."

Only if one of the two doesn't access memory, since one operation uses both available Load/Store units, according to your interesting post.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,701
1,228
136
Only if one of the two doesn't access memory, since one operation uses both available Load/Store units, according to your interesting post.
This is where vertical multithreading comes in. Core 0 will finish, then Core 1. No stalls until it hits the WCC, which buffers the write-through operations of both cores. An FPU store will complete first on core 0, then on core 1, on a first-come, first-served basis (FIFO scheduler).
 

jpiniero

Lifer
Oct 1, 2010
15,091
5,655
136
So terrible. Looks like it's going to be a while before AMD releases Zen Server; it needs way more work.
 

crashtech

Lifer
Jan 4, 2013
10,573
2,145
146
A little quick and dirty arithmetic puts the ST perf of that leaked result on par with an old Phenom, assuming linear frequency scaling.
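
For anyone who wants to redo that kind of estimate, a minimal sketch of the linear-frequency-scaling assumption in Python follows. The function name and the numbers in the usage lines are placeholders chosen for illustration; they are not taken from the leak.

```python
def scale_score_linearly(score: float, measured_ghz: float, target_ghz: float) -> float:
    """Project a single-thread score to another clock speed.

    Assumes the score is directly proportional to frequency, which ignores
    memory latency, turbo behaviour and anything else that does not scale
    with core clock.
    """
    if measured_ghz <= 0 or target_ghz <= 0:
        raise ValueError("clock speeds must be positive")
    return score * (target_ghz / measured_ghz)


if __name__ == "__main__":
    # Placeholder inputs: a score of 1000 measured at 2.0 GHz, projected to 3.0 GHz.
    print(scale_score_linearly(1000, 2.0, 3.0))
```

Two chips can then be compared by projecting both scores to the same target clock. The result is only as good as the proportionality assumption.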
 
Mar 10, 2006
11,715
2,012
126
Pray to whatever you believe in they haven't. Competition keeps your precious Intel chips affordable. Honestly I will never understand your hatred of AMD...it's really creepy and gross. You act like someone at Sunnyvale ate your cat or something.

I don't hate AMD. I dislike personal attacks, though. Anyway, I think it would benefit you if you understood where I'm coming from. I am not hoping, wishing, or praying that Zen is good or bad. I am merely looking at what information is out there and drawing a conclusion.

It does nobody any good to look at data but then completely disregard it due to "hopes and dreams."

Anyway, Intel CPUs are quite affordable. $340 buys you the best gaming CPU on the market, hands down. Lasts for several years and has good resale value at the end of ownership.

Interesting how the 6700k is affordable even though AMD hasn't been competitive in years. Maybe AMD's continued presence in the desktop CPU market isn't as essential as some people think?
 

jpiniero

Lifer
Oct 1, 2010
15,091
5,655
136
Not a single Zen leak has actually been good. AMD may have produced another Dozer.

The Ashes leak performance wasn't terrible, although at the stock clocks it's not going to be all that great for gaming. We don't know how good of an overclocker Zen is.

A little quick and dirty arithmetic puts the ST perf of that leaked result on par with an old Phenom, assuming linear frequency scaling.

I'm convinced the MCM is wreaking havoc.

Pray to whatever you believe in they haven't. Competition keeps your precious Intel chips affordable. Honestly I will never understand your hatred of AMD...it's really creepy and gross. You act like someone at Sunnyvale ate your cat or something.

Intel's got plenty of competition out there; Apple (The ARMy as a whole really), nVidia, IBM (Power)...
 
Mar 10, 2006
11,715
2,012
126
The Ashes leak performance wasn't terrible, although at the stock clocks it's not going to be all that great for gaming. We don't know how good of an overclocker Zen is.

The AoTS benchmark suggested perf/clock well below Haswell levels and clock speeds don't look high either. I wouldn't bet on it being a good overclocker given that the die is a server-first design. We'll see how it does when it finally arrives, though.

I'm convinced the MCM is wreaking havoc.

I would blame the poor multicore scaling on both the MCM and the fact that the die is organized into CCXs with 4 cores + shared L3$ each. There has got to be some serious overhead involved in such a structure compared to a monolithic die with all of the cores sharing a single large pool of L3$.


Intel's got plenty of competition out there; Apple (The ARMy as a whole really), nVidia, IBM (Power)...
True, also Intel faces competition from a PC market that's in structural decline and competition from what it sold customers in the past (they need compelling reasons to upgrade otherwise they stick to what they have).
 
Mar 10, 2006
11,715
2,012
126
A little quick and dirty arithmetic puts the ST perf of that leaked result on par with an old Phenom, assuming linear frequency scaling.

I wouldn't read too much into GB3, that was a pretty flawed benchmark. I think the best indications we have on Zen performance are the GB4 results as well as the AoTS results.

AMD has built a better core than anything Dozer, so that could help them in a number of markets. But those expecting AMD to seriously challenge Skylake in terms of per-core performance will, IMHO, be disappointed. Even if you look at AMD's own public statement of +40% IPC over XV, Zen is still likely to be behind Haswell more often than not, and very much behind Skylake.
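
The arithmetic behind that kind of claim is simple enough to sketch in Python. Everything is expressed relative to Excavator = 1.0. The +40% figure is the AMD statement mentioned above, while the reference core's IPC over XV is an input you have to supply yourself (the 1.75 in the usage line is a made-up placeholder, not a measurement). The function name is hypothetical.

```python
def zen_vs_reference(zen_uplift_over_xv: float, reference_over_xv: float) -> float:
    """Return Zen's per-clock performance as a fraction of a reference core's.

    Both cores are expressed relative to Excavator = 1.0, so a +40% uplift
    is passed as 0.40 and a core with 1.75x Excavator's IPC as 1.75.
    """
    if reference_over_xv <= 0:
        raise ValueError("reference IPC ratio must be positive")
    return (1.0 + zen_uplift_over_xv) / reference_over_xv


if __name__ == "__main__":
    # Break-even case: a reference core exactly 1.40x Excavator ties Zen.
    print(zen_vs_reference(0.40, 1.40))
    # Placeholder case: a hypothetical reference core at 1.75x Excavator.
    print(zen_vs_reference(0.40, 1.75))
```

In other words, under the +40% figure Zen trails any core whose IPC is more than 1.40x Excavator's, and by how much depends entirely on the ratio you plug in.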

I think Zen will offer Sandy Bridge/Ivy Bridge levels of performance-per-clock. We already know what clocks AMD is planning to ship Summit Ridge at, the question for enthusiasts/gamers really comes down to -- as jpiniero points out -- how well this thing will overclock. I suspect Intel will have an advantage in this regard with its HEDT chips.
 