AMD Ryzen (Summit Ridge) Benchmarks Thread (use new thread)

itsmydamnation · Feb 15, 2017

You have to remember shrinks matter less and less. The big benefit at 22nm for Intel and 14/16nm for GF/TSMC/etc. is FinFET for low voltage/leakage. So even if 22nm was equal to 14nm, it's nowhere near as big of a deal as, say, 130nm vs 90nm. What we have to wait and see is what new techniques, if any, Intel brings to 10nm (qfet or something), because the shrink itself isn't going to make or break anyone.

Doom2pro · Feb 15, 2017

itsmydamnation said:
You have to remember shrinks matter less and less. The big benefit at 22nm for Intel and 14/16nm for GF/TSMC/etc. is FinFET for low voltage/leakage. So even if 22nm was equal to 14nm, it's nowhere near as big of a deal as, say, 130nm vs 90nm. What we have to wait and see is what new techniques, if any, Intel brings to 10nm (qfet or something), because the shrink itself isn't going to make or break anyone.

Another thing is library density, packing more into the same area on the same process.

Arachnotronic · Feb 15, 2017

itsmydamnation said:
You have to remember shrinks matter less and less. The big benefit at 22nm for Intel and 14/16nm for GF/TSMC/etc. is FinFET for low voltage/leakage. So even if 22nm was equal to 14nm, it's nowhere near as big of a deal as, say, 130nm vs 90nm. What we have to wait and see is what new techniques, if any, Intel brings to 10nm (qfet or something), because the shrink itself isn't going to make or break anyone.

You've really hit it on the head here. People associate shrinks with improved performance, but these are actually fairly orthogonal. You can get big performance/power improvements without shrinking (e.g. 20nm -> 16FF+) and you can shrink and get minimal performance improvement (e.g. 28nm HPm -> 20nm SoC).

Shrinks are done for economic reasons, not performance. But as shrinks become harder (i.e. harder to match yields in a given time), it can be more cost effective to improve performance of older process with newer transistors/materials and keep the design rules the same than to shrink.

Agent-47 · Feb 15, 2017

.vodka said:
I originally did this vs the Ryzen baseline when it showed up, but decided to run some extra numbers.

CPU mark, 2500k @ 4.5GHz, fixed 10-11-10-30-1T timings: [screenshots for 800, 1066, 1333, 1600, 1866 and 2133 not preserved]

Memory mark, 2500k @ 4.5GHz, fixed 10-11-10-30-1T timings: [screenshots for 800, 1066, 1333, 1600, 1866 and 2133 not preserved]

I might have made a mistake here and there since I feel like crap (stupid cold/flu-ish), I could do the same with the CPU clocked up to 5.1GHz if needed.

Great stuff! From your results I realize I'm varying bandwidth and latency at the same time having left timings equal throughout the run, but the results mirror itsmydamnation's Ivy Bridge results. Sandy and Ivy are pretty much the same at a high level after all, so it was expected.

Extrapolating to Ryzen, its results at around 14ns could be compared to my DDR3-1333 15ns run. Look how much performance was left on the table by not using faster RAM in latency sensitive workloads.

I know if I'm getting my hands on Ryzen I'll get sticks capable of what my DDR3-1866 is (around 10ns or better, that could be DDR4 3000 CAS 15 -perfect 10ns- to start with) as not to unnecessarily gimp the CPU.

Great stuff! From this data, AMD's score will surely improve by at least 25% with better RAM, since the test uses 4 MB per thread, which means it's accessing system memory often. But Zen is 50 to 70% behind according to the normalized scores.

It will seem strange that we are talking so much about a seemingly rather unimportant benchmark. But I think it illustrates a bottleneck somewhere which may pop up in other cases. After all the code does not use any exotic instructions like crc32/avx2 or something. And we have seen integer performance match BWE.
This test is a pure int64 based test coupled with branches and mods

Edit: before someone points it out: 1) these are older leaks, but they are in line with the newer ones. 2) Prime number scores don't scale linearly with the number of cores, which should be kept in mind. But that does not mean a 4c will be as fast as an 8c.

.vodka · Feb 15, 2017

lopri said:
That is an awesome work. :beer:

Thanks!

cytg111 said:
Yea, great work. Damn, that scaling. Now to keep speed the same and only mess with latencies?

Sure, let's dial speed down to a more flexible DDR3-1333 so I can do whatever I want with timings. According to my RAM datasheet, it supports CAS 6, 7, 8, 9, 10, 11, 13. Let's set VDIMM to 1.65v and see what happens. CPU remains at 4.5GHz. ASUS MemOK button sure saved me a few times exploring the lower CAS limits here... Tightest these could go at DDR3-1333 is 6-7-6-18 1T, 6-6-6 would bootloop. Those could be crappy cheap DDR2 timings, nice.

CPU mark and Memory mark: [screenshots not preserved]

Hmm, altering only CAS latency doesn't do much. The rest of the timings are still tight...

-----------------------------------------------------------------------------

Now, let's do n, n, n, 3*n 1T and see what happens

CPU mark and Memory mark: [screenshots not preserved]

Now that's more like it. Still not as pronounced as bandwidth + timings scaling. If I had chosen DDR3-1066 or 800 for these tests the results would've been much more pronounced because of bandwidth starving... but to be realistic who uses such slow speeds? It'd been useless data. I'd say the first post with fixed 10-11-10-30 1T timings and varying memory speeds is a better indicator of real life conditions.

Anyway, the lesson here is to pick a decent memory kit with fast speeds and tight latencies to make the most out of what the CPU can do... nothing new to be learned from this (memory performance is a function of both latency and frequency)

Agent-47 · Feb 15, 2017

^^these results are true for intel arch, may not be true for ryzen. but as a general rule any heavily memory accessing process that takes more space than the on die memory will depend on how fast the system memory is

EDIT:
ryzen was tested with 2400 MHz at 17-17-17 i believe, going to 3200 at 15 will result in a 28% jump based on your result. But a 8c16t ryzen is 83% behind a 8c16t intel


Agent-47 · Feb 15, 2017

Nothingness said:
Yes branching certainly is more an issue than the use of mod.

True
For this C++ code:

Code:

    double mod;
    int val = 1004;
    int m = 10;
    mod = val % m;

the following assembly is generated:

Code:

    int val = 1004;
0111166E  mov         dword ptr [val],3ECh
    int m = 10;
01111675  mov         dword ptr [m],0Ah
    mod = val % m;
0111167C  mov         eax,dword ptr [val]
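0111167F  cdq            -- sign-extend eax into edx:eax (convert doubleword to quadword)
01111680  idiv        eax,dword ptr [m] -- divide; quotient in eax, remainder in edx
01111683  cvtsi2sd    xmm0,edx    -- convert int (the remainder) to double
01111687  movsd       mmword ptr [mod],xmm0 -- move scalar double (store the result)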

According to CPC, if I am reading it correctly, idiv has the same throughput on both Ryzen and KL?
http://imgur.com/ZS0hwfR
Maybe someone can confirm/correct it.

If that is true, then it's not the code, but either RAM (which looks unlikely based on vodka's results), L1/L3 or branching.

OrangeKhrush · Feb 15, 2017

Sven_eng said:
Haswell-E is close to 2x the size of Ryzen so what do you think?

And twice the die size.

OrangeKhrush · Feb 15, 2017

Agent-47 said:
^^these results are true for intel arch, may not be true for ryzen. but as a general rule any heavily memory accessing process that takes more space than the on die memory will depend on how fast the system memory is

EDIT:
ryzen was tested with 2400 MHz at 17-17-17 i believe, going to 3200 at 15 will result in a 28% jump based on your result. But a 8c16t ryzen is 83% behind a 8c16t intel


Can't assume unless you know the 5960X specs. Granted, quad channel will also help.

I have known AMD's IMC to be good but not where it should be. That said, Zen+ will address that.

Nothingness · Feb 16, 2017

Agent-47 said:

True
For this C++ code:

Code:

    double mod;
    int val = 1004;
    int m = 10;
    mod = val % m;

the following assembly is generated:

Code:

    int val = 1004;
0111166E  mov         dword ptr [val],3ECh
    int m = 10;
01111675  mov         dword ptr [m],0Ah
    mod = val % m;
0111167C  mov         eax,dword ptr [val]
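0111167F  cdq            -- sign-extend eax into edx:eax (convert doubleword to quadword)
01111680  idiv        eax,dword ptr [m] -- divide; quotient in eax, remainder in edx
01111683  cvtsi2sd    xmm0,edx    -- convert int (the remainder) to double
01111687  movsd       mmword ptr [mod],xmm0 -- move scalar double (store the result)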

Sorry but either your compiler is a PoS or you didn't turn on optimizations

Here is what gcc outputs as soon as you turn on optimization:

Code:

        movsd   .LC0(%rip), %xmm0
...
.LC0:
        .long   0
        .long   1074790400

Your code produces a constant...

cytg111 · Feb 16, 2017

.vodka said:
Thanks!

Sure, let's dial speed down to a more flexible DDR3-1333 so I can do whatever I want with timings. According to my RAM datasheet, it supports CAS 6, 7, 8, 9, 10, 11, 13. Let's set VDIMM to 1.65v and see what happens. CPU remains at 4.5GHz. ASUS MemOK button sure saved me a few times exploring the lower CAS limits here... Tightest these could go at DDR3-1333 is 6-7-6-18 1T, 6-6-6 would bootloop. Those could be crappy cheap DDR2 timings, nice.

CPU mark: [screenshots not preserved; one run each at DDR3-1333 6-7-6-18, 7-7-6-18, 8-7-6-18, 9-7-6-18, 10-7-6-18, 11-7-6-18 and 13-7-6-18, all 1T]

Memory mark: [screenshots not preserved; same seven settings]

Hmm, altering only CAS latency doesn't do much. The rest of the timings are still tight...

-----------------------------------------------------------------------------

Now, let's do n, n, n, 3*n 1T and see what happens

CPU mark: [screenshots not preserved; one run each at DDR3-1333 6-7-6-18, 7-7-7-21, 8-8-8-24, 9-9-9-27, 10-10-10-30, 11-11-11-33 and 13-13-13-39, all 1T]

Memory mark: [screenshots not preserved; same seven settings]

Now that's more like it. Still not as pronounced as bandwidth + timings scaling. If I had chosen DDR3-1066 or 800 for these tests the results would've been much more pronounced because of bandwidth starving... but to be realistic who uses such slow speeds? It'd been useless data. I'd say the first post with fixed 10-11-10-30 1T timings and varying memory speeds is a better indicator of real life conditions.

Anyway, the lesson here is to pick a decent memory kit with fast speeds and tight latencies to make the most out of what the CPU can do... nothing new to be learned from this (memory performance is a function of both latency and frequency)

That is awesome!! You might say that there's nothing new to be learned, but never dismiss the value of basic research, especially when circumstances change a bit. This is gold IMO.

lopri · Feb 16, 2017

Agent-47 said:
^^these results are true for intel arch, may not be true for ryzen. but as a general rule any heavily memory accessing process that takes more space than the on die memory will depend on how fast the system memory is

EDIT:
ryzen was tested with 2400 MHz at 17-17-17 i believe, going to 3200 at 15 will result in a 28% jump based on your result. But a 8c16t ryzen is 83% behind a 8c16t intel


Quad-channel can give double the scores of dual-channel in memory bound tests like this. Looking at the rest of the scores, it seems rather obvious that this test is not so much about IPC as about number of cores + memory performance. That the FX-8370 scores like the 7600K and 7700K should be a strong tell. And why is the 5930K @ 3.4 GHz scoring higher than at stock? A stock 5930K is a 3.5/3.7 GHz part and it should never score lower than at 3.4 GHz, all other things being equal. Seeing the weird nature of this benchmark, my guess is that the 3.4 GHz run was done with high performance memory and the stock run with lower performance memory.

That takes us to the final conclusion: Given that many have contributed to prove this benchmark is highly sensitive to memory performance, without knowing the memory configuration of every system on which these CPUs were tested, the results in the chart are utterly meaningless.

krumme · Feb 16, 2017

lopri said:
Quad-channel can give double the scores of dual-channel in memory bound tests like this. Looking at the rest of the scores, it seems rather obvious that this test is not so much about IPC as about number of cores + memory performance. That the FX-8370 scores like the 7600K and 7700K should be a strong tell. And why is the 5930K @ 3.4 GHz scoring higher than at stock? A stock 5930K is a 3.5/3.7 GHz part and it should never score lower than at 3.4 GHz, all other things being equal. Seeing the weird nature of this benchmark, my guess is that the 3.4 GHz run was done with high performance memory and the stock run with lower performance memory.

That takes us to the final conclusion: Given that many have contributed to prove this benchmark is highly sensitive to memory performance, without knowing the memory configuration of every system on which these CPUs were tested, the results in the chart are utterly meaningless.

People will be using a flashlight to find irrelevant synthetic tests that show where Zen is slow. As if Intel is the relevant benchmark here.
Portraying it as if Zen+, for example, needs 256-bit FP and 4 memory channels is not only wrong, it's counterproductive. It's exactly the lack of those features that makes it so damn fast, efficient and cheap at the same time. Zen looks extremely balanced. It's the Intel solutions that are off for most real-world loads.

Agent-47 · Feb 16, 2017

lopri said:
Quad-channel can give double the scores of dual-channel in memory bound tests like this.

I am afraid not. If it was quad channel 7700k would have at least half the score of ryzen. Also is evident by looking at the normalized score I posted which shows 7700k has the same IPC as the quad channel intel

Riek · Feb 16, 2017

Agent-47 said:
I am afraid not. If it was quad channel 7700k would have at least half the score of ryzen. Also is evident by looking at the normalized score I posted which shows 7700k has the same IPC as the quad channel intel

Your assumption being that scaling from 4 to 8 core is the same if the memory configuration stays the same, which is highly doubtful given that we see huge memory impact on a quad core.

That said, this discussion is much ado about nothing. We already know these are synthetic tests... and we also know the result is highly memory dependent.

So for all the discussion about performance, differences, crappy branch prediction etc., the only conclusion you can draw at the moment is that the quad channel Intel subsystem is better than the dual channel Ryzen one!

Agent-47 · Feb 16, 2017

Nothingness said:
Sorry but either your compiler is a PoS or you didn't turn on optimizations

Here is what gcc outputs as soon as you turn on optimization:

Code:

        movsd   .LC0(%rip), %xmm0
...
.LC0:
        .long   0
        .long   1074790400

Your code produces a constant...

My assembly code was from the Intel compiler, viewed in the MS disassembler. No optimization, as the code is too simple and optimization will reduce it to a constant, since the results do not change at run time.

If you want you can use cin to let the user set the values and it should generate the assembly you want even with optimization.

Agent-47 · Feb 16, 2017

Riek said:
Your assumption being that scaling from 4 to 8 core is the same if the memory configuration stays the same, which is highly doubtful given that we see huge memory impact on a quad core.

That said, this discussion is much ado about nothing. We already know these are synthetic tests... and we also know the result is highly memory dependent.

So for all the discussion about performance, differences, crappy branch prediction etc., the only conclusion you can draw at the moment is that the quad channel Intel subsystem is better than the dual channel Ryzen one!

Gee. Have you looked at the results? Compare the 7700k at 3.4 and the 8c Intel CPU at 3.4. It's 32 vs 68. That's linear core scaling. Maybe you can take a 6800k on dual channel and see if it changes anything. But even if it does, Ryzen does not have quad channel, so Ryzen will be handicapped. But I doubt it, as I have yet to see any synthetic benchmark that gives an 80% performance improvement going quad channel (except pure mem read/write ones).

As to why this is important: there is a reason why synthetic CPU benchmarks are still around. While most of us here are gamers and video encoders, some use their CPUs for large arithmetical analysis.

It will seem strange that we are talking so much about a seemingly rather unimportant benchmark. But I think it illustrates a bottleneck somewhere which may pop up in other cases. After all the code does not use any exotic instructions like crc32/avx2 or something. And we have seen integer performance match BWE.
This test is a pure int64 based test coupled with branches and mods

dfk7677 · Feb 16, 2017

I think we have seen what Ryzen can do in video encoding (almost 10% better than 6900K?).

As far as gaming is concerned, if the Ryzen prime result is bottlenecked by memory (which it apparently is), it is still better than the 'best' gaming CPU, 7700K.

coercitiv · Feb 16, 2017

dfk7677 said:
As far as gaming is concerned, if the Ryzen prime result is bottlenecked by memory (which it apparently is), it is still better than the 'best' gaming CPU, 7700K.

The prime result is not relevant for gaming.

Agent-47 · Feb 16, 2017

This is from AdoredTV's recently uploaded video, with an i7 at the correct RAM and clocks: 2400MHz RAM, 3.4GHz base clock

Everything is almost double except ST, PN and Physics. If it was a memory bottleneck, why is database sorting not affected?

The String Sorting Test uses the qSort algorithm to see how fast the CPU can sort strings (single byte characters). A very common task in many applications. This tests uses memory buffers totaling about 25MB per core.

dfk7677 · Feb 16, 2017

coercitiv said:
The prime result is not relevant for gaming.

I wasn't the one implying that

The Stilt · Feb 16, 2017

Any specific workloads (must be available for Windows) you would like to see, besides the current ones?
Preferably open source, but that's not mandatory if the workload can otherwise be justified.

Floating Point:

3DPM V2.0b1 (Custom binary, ICL 2017)
Blackscholes (Custom binary, ICL 2017)
Blender 2.78.4 (Custom binary, MSVC 2015 + ICL 2017)
libBullet 2.85 (Custom binary, ICL 2017)
C-Ray (Custom binary, ICL 2017)
Caselab Euler3D (Public binary, ICL)
Cinebench 10 (Public binary, ICL)
Cinebench R11.5 (Public binary, ICL)
Cinebench R15 (Public binary, ICL)
Embree 2.13.0 (Public binary, ICL)
Euler3D CFD (Custom binary, ICL 2017)
GMPBench 0.2 / libGMP 6.12 (Custom binary, GCC 6.2)
Himeno (Custom binary, GCC 6.3)
Linpack 2017.014.0 (Public binary, ICL)
MCRT (Custom binary, ICL 2017)
NAMD 2.12 (Public binary, ICL)
NBody (Custom binary, ICL 2017)

Integer:

7Zip 16.04 x64 (Public binary, MSVC)
GCC 6.3 x86-64 (Public binary, GCC)
NQueen (Custom binary, iFortran)
OpenSSL 1.1.0d (Custom binary, GCC 6.2)
Stockfish 8 (Custom binary, GCC 6.3)
VampireNumbers (Custom binary, GCC 6.3)
X264 r2762 (Custom binary, GCC 5.3 + YASM)
X265 2+2 (Custom binary, GCC 6.3 + YASM)
WinRar 5.40 x64 (Public binary, MSVC?)

Some might wonder why ICL 2017 is the most common compiler used here.
That's because it is currently the fastest overall compiler (for FP) for all of the µarchs I'm testing (XV, Zen, HSW, KBL).
Naturally the vendor dependent instruction dispatcher has been removed from all of the custom binaries. Needless to say, the newer ICLs (>= 2011) are no longer hostile towards AMD, so removing the dispatcher generally makes no difference.
The only two workloads where removing the dispatcher does make a difference are Caselab Euler3D (ICL from 2009 (?)) and Linpack. In Caselab Euler3D removing the dispatcher improves the performance on AMD CPUs by >30%, while in Linpack the dispatcher doesn't degrade the actual performance but prevents the program from running on AMD CPUs altogether.

coercitiv · Feb 16, 2017

Agent-47 said:
This is from AdoredTV's recently uploaded video, with an i7 at the correct RAM and clocks: 2400MHz RAM, 3.4GHz base clock

Everything is almost double except ST, PN and Physics. If it was a memory bottleneck, why is database sorting not affected?

This is an i5 6600K @ 3.4Ghz with DDR4 2400 17-17-17-41. Every benchmark is lower than 6700K @ 3.4Ghz, except for Prime Numbers. Any thoughts on this?

malitze · Feb 16, 2017

coercitiv said:
This is an i5 6600K @ 3.4Ghz with DDR4 2400 17-17-17-41. Every benchmark is lower than 6700K @ 3.4Ghz, except for Prime Numbers. Any thoughts on this?

Strange benchmark indeed. My first naive guess would be that there's twice as many threads competing for the same memory bandwidth, though I would expect the number of concurrent threads should be factored in the score somehow.

Agent-47 · Feb 16, 2017

coercitiv said:
This is an i5 6600K @ 3.4Ghz with DDR4 2400 17-17-17-41. Every benchmark is lower than 6700K @ 3.4Ghz, except for Prime Numbers. Any thoughts on this?

the plot thickens. My guess is HT does not help it

i just ran the test on FX.
6c => 22
1c => 4.

so it scales.
Can someone with i7 or i3 run the test with HT disabled?
