AMD Ryzen (Summit Ridge) Benchmarks Thread (use new thread)

Dresdenboy · Oct 31, 2016

AtenRa said:
I would suggest you use the Core i3 6100 instead. The A12-9800 is an APU with L2 cache only at 65W TDP, while the BDW-E 6950X is a 140W TDP SKU with 25MB of L3.

Interesting thought. But I think using a L3$ chip vs. a max. L2$ chip is fair, because of this 40% including the L3$. It should mitigate the effect of the smaller L2$.

TheELF · Oct 31, 2016

Dresdenboy said:
IPC usually means measured instructions per clock over several cycles (e.g. 1M, or all cycles of an application run), not "issue width", which is kind of a peak value, as the decoders and µOp$ simply can't provide enough instructions per cycle.

Yes, usually it does, but ever since AMD won the lawsuit against them about what constitutes a core, they pretty much have carte blanche to use any technical term in any way they like. I mean, they are selling 12 core laptops now because of this... there is no trust anymore.

Dresdenboy said:
Sometimes even smart people might run into kind of a dead end, thought-wise. Then it might help to take them out of their seat, shake them a bit, and put them back in place. Or alternatively, one might kindly ask them to step back and check their POV.

Intel has specialized execution units that reduce the cycles that instructions need until they are done.
If Blender can actually use all available instructions, 10 for Zen and 8 for Broadwell, then Zen would have to be 20% faster to match the "measured instructions per clock over several cycles" you mentioned. It isn't, so it is slower.
Even in the scenario where the software can use all instructions... as you said, most software doesn't.

bjt2 · Oct 31, 2016

Dresdenboy said:
I'd expected some ratio: % of max possible, not of all.
Here is an explanation:
http://www.eetimes.com/document.asp?doc_id=1276117

Zen likely has roughly similar values.

So since Zen has a longer pipeline (19 vs maybe 12-15) than Jaguar and Bobcat, there are more clock gating opportunities, and so its numbers should be even better...
Very good...
So with leakage being 1/6 on 14nm FF LPP vs 28nm bulk, there is more room to increase dynamic power, and thus clocks...

bjt2 · Oct 31, 2016

TheELF said:
Intel has specialized execution units that reduce the cycles that instructions need until they are done.

I don't think I have understood... Can you elaborate more on this?

Dresdenboy · Oct 31, 2016

TheELF said:
Yes, usually it does, but ever since AMD won the lawsuit against them about what constitutes a core, they pretty much have carte blanche to use any technical term in any way they like. I mean, they are selling 12 core laptops now because of this... there is no trust anymore.

Intel has specialized execution units that reduce the cycles that instructions need until they are done.
If Blender can actually use all available instructions, 10 for Zen and 8 for Broadwell, then Zen would have to be 20% faster to match the "measured instructions per clock over several cycles" you mentioned. It isn't, so it is slower.
Even in the scenario where the software can use all instructions... as you said, most software doesn't.

OK, there are so many fallacies and mistakes in here, that I won't continue this discussion.

If you like, you might get familiar with superscalar OoO execution first.
https://lagunita.stanford.edu/c4x/Engineering/CS316/asset/Processor_Microarchitecture.pdf
http://www.lighterra.com/papers/modernmicroprocessors/

The Stilt · Oct 31, 2016

Cycles has recently received quite a few SIMD optimizations, so it might be worth checking the differences between the old and the current builds. The developers reported an 8% reduction in rendering time from a single commit improving AVX2 (IIRC) alone, in the "BMW" scene.

Dresdenboy · Oct 31, 2016

The Stilt said:
Cycles has recently received quite a few SIMD optimizations, so it might be worth checking the differences between the old and the current builds. The developers reported an 8% reduction in rendering time from a single commit improving AVX2 (IIRC) alone, in the "BMW" scene.

8% is not bad. Depending on the code changes and code path selection for Zen, there might be zero to a small change.

Abwx · Oct 31, 2016

The Stilt said:
The developers reported 8% reduction in rendering time from a single commit improving AVX2 (IIRC) alone, in the "BMW" scene.

If the CPU has to downscale its frequency when AVX2 is used, then those 8% are moot. As pointed out by Dresdenboy, this wouldn't even noticeably impact throughput if it's executed in two passes and the frequency is kept constant.

The Stilt · Oct 31, 2016

Dresdenboy said:
8% is not bad. Depending on the code changes and code path selection for Zen, there might be zero to a small change.

The devs were not lying.
In Blender 2.78 AVX2 is indeed beneficial. I measured > 11.8% average performance improvement from enabling AVX2. The SMT yield seems to be coming down to sane levels too (> 59% in 2.77, > 35% in 2.78).

Binaries compiled with MSVC 2015.
Tested with the "BMW Rev. 4" scene, 1920x1080 resolution, 40x40 tile size (96 X, 54 Y), on Haswell-EP HCC (18C/36T).

Non-AVX2: 270.52s
AVX2: 241.89s

Blender 2.78 (MSVC 2015, AVX2 / Non-AVX2) (111MB).

Pass: "blender" (without the quotes).
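
For anyone who wants to re-derive the percentage from the two render times above, here is a minimal sketch in Python. Only the two timings come from the post; the function name is just illustrative. It also shows that "performance improvement" and "render time reduction" are not the same number, which matters when comparing against the 8% the developers quoted.

```python
def avx2_gain(non_avx2_s, avx2_s):
    """Return (throughput gain %, render time reduction %) for two render times in seconds."""
    # Throughput gain: how much more work per second the faster build does.
    speedup_pct = (non_avx2_s / avx2_s - 1.0) * 100.0
    # Time reduction: how much shorter the faster render is, relative to the slower one.
    time_saved_pct = (1.0 - avx2_s / non_avx2_s) * 100.0
    return speedup_pct, time_saved_pct


# Timings from the post: BMW Rev. 4 scene, Blender 2.78, MSVC 2015 builds.
speedup, saved = avx2_gain(270.52, 241.89)
print(f"throughput gain: {speedup:.1f}%  render time reduction: {saved:.1f}%")
```

With these inputs the throughput gain comes out at roughly 11.8% and the render time reduction at roughly 10.6%.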

AtenRa · Oct 31, 2016

Arachnotronic said:
I don't have a Core i3 6100 available to me at the moment, sorry. Anyway, the i3 6100 is Skylake based and the L3$ is actually a lot faster on SKL than it is on BDW-E (2.8GHz only). GB4 likes L3$ speed more than it likes L3$ size, AFAICT.

Dresdenboy said:
Interesting thought. But I think using a L3$ chip vs. a max. L2$ chip is fair, because of this 40% including the L3$. It should mitigate the effect of the smaller L2$.

The problem is not so much the L3 cache as the TDP, which is less than half (65W vs 140W). From my testing, the A8-7600 had a substantial ST performance difference at 45W TDP vs 65W TDP in CB.

Take special note that the A10-7700K has the same 3.8GHz single-thread clock as the A8-7600, but the A10-7700K is at 95W TDP. So going from 45W TDP to 95W TDP (~double) while keeping the same 3.8GHz single-thread clock, we get 8-10% higher single-thread performance using the same core/architecture.

Now, the A12-9800 has a 4.2GHz single-thread clock at 65W TDP, while the Core i7 6950X has a single-thread clock of 3.5GHz at 140W TDP. For an apples to apples comparison, we need both the same single-core clocks AND the same TDP.

mikk · Oct 31, 2016

For Intel it doesn't matter, even with 65W such a Singlethread test with Geekbench is no issue, it performs with max clock. As for AMD it just shows that their Turbo implementation is very weak, I really doubt their CPU needs more than 45W when only one core is fully loaded, especially in Geekbench.

cdimauro · Oct 31, 2016

The Stilt said:
The devs were not lying.
In Blender 2.78 AVX2 is indeed beneficial. I measured > 11.8% average performance improvement from enabling AVX2. The SMT yield seems to be coming down to sane levels too (> 59% in 2.77, > 35% in 2.78).

Binaries compiled with MSVC 2015.
Tested with the "BMW Rev. 4" scene, 1920x1080 resolution, 40x40 tile size (96 X, 54 Y), on Haswell-EP HCC (18C/36T).

Non-AVX2: 270.52s
AVX2: 241.89s

Blender 2.78 (MSVC 2015, AVX2 / Non-AVX2) (111MB).

Pass: "blender" (without the quotes).

That's really strange, since looking at Blender's code it shouldn't make use of 256-bit vectors.

Can you try to force the usage of 128-bit sized vector registers with AVX code, if possible?

The Stilt · Oct 31, 2016

mikk said:
For Intel it doesn't matter, even with 65W such a Singlethread test with Geekbench is no issue, it performs with max clock. As for AMD it just shows that their Turbo implementation is very weak, I really doubt their CPU needs more than 45W when only one core is fully loaded, especially in Geekbench.

It was, until Carrizo. Prior to Carrizo, the CPUs & APUs didn't know their actual power consumption, meaning they adjusted their frequency based on a fused power vs. P-State (frequency state) lookup table. Needless to say, the results weren't that great.
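
To illustrate the difference between the two approaches, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the P-state names, frequencies, wattages, and both function names are invented, and the real fused tables are not public. It only shows why a fixed table can hold back single-thread boost when the actual load draws far less than the table assumes.

```python
# Hypothetical P-state table, ordered fastest first:
# (name, frequency in MHz, fused power estimate in W). All values are invented.
FUSED_TABLE = [
    ("boost", 3800, 70.0),
    ("base", 3300, 55.0),
    ("low", 2800, 40.0),
]


def pick_by_table(power_limit_w):
    """Pick the fastest P-state whose fused power estimate fits under the limit."""
    for name, mhz, estimate_w in FUSED_TABLE:
        if estimate_w <= power_limit_w:
            return name, mhz
    # Nothing fits: fall back to the slowest state.
    return FUSED_TABLE[-1][0], FUSED_TABLE[-1][1]


def pick_by_measurement(power_limit_w, measured_w):
    """Pick the fastest P-state whose measured power for the current load fits."""
    for name, mhz, _ in FUSED_TABLE:
        if measured_w[name] <= power_limit_w:
            return name, mhz
    return FUSED_TABLE[-1][0], FUSED_TABLE[-1][1]


# Hypothetical measured power of a light single-thread load at each P-state.
light_load = {"boost": 42.0, "base": 33.0, "low": 25.0}

print(pick_by_table(45.0))                    # table-driven choice at a 45W limit
print(pick_by_measurement(45.0, light_load))  # measurement-driven choice
```

With these made-up numbers, the table-driven picker settles on the slowest state at a 45W limit, while the measurement-driven one can stay at the boost state for the same light load.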

JoeRambo · Oct 31, 2016

cdimauro said:
That's really strange, since looking at Blender's code it shouldn't make use of 256-bit vectors.

Can you try to force the usage of 128-bit sized vector registers with AVX code, if possible?

These are the AVX2 changes for Blender:

https://git.blender.org/gitweb/gitw.../cycles/kernel/geom/geom_triangle_intersect.h

Some epic vectorization skills at work, a black belt in making use of AVX2 with vector permutations/shuffles where no obvious vectorization is possible.

cdimauro · Oct 31, 2016

But they are very very limited, as you can see: 256-bit instructions are rare birds.

The biggest part with Blender is represented by scalar code, or at most some 128-bit vector usage.

I doubt that the whole speed-up of this patch is related to the 256-bit vector usage. Unfortunately there's no way to force/set only 128-bit vectors.

JoeRambo · Oct 31, 2016

It's not "vector" usage; the algorithm was transformed by some magic code into one capable of using 256 bits of hardware instead of just floats. Btw, from the commit log it seems that the gain is actually 20% on a Windows machine. The commit log also revealed that some Intel guy (Maxym Dmytrychenko) gave them ideas about how to do it, as it is not obvious (understatement of the month).

cdimauro · Oct 31, 2016

I know. I call it vector whenever we aren't using scalar quantities.

Yes, the idea isn't obvious at all. Congrats to the colleague.

nismotigerwvu · Oct 31, 2016

AtenRa said:

Man, I wouldn't have expected the slightly tweaked Stars core in Llano to have put on that strong of a showing from a performance standpoint. I sort of wish we could have seen what it was capable of on a process that wasn't a mess, especially considering how nicely that 32nm node matured by the end.

KTE · Nov 1, 2016

Dresdenboy said:
Maybe I didn't understand your point, but this is a simplistic form of my view here:
For ST (@iso freq):
T'put = #inst./cycle
Speed= #inst./cycle

And who said, that in real ST code XV constantly maxes out an IPC of 7.0 (2ALU+2AGU+3FP)? Why would a SW need to reach 10 ops/clock if in reality the value is more like 1.0 to 3.0, sometimes higher

I ran an analysis on Thuban a while back using CodeAnalyst. Typical was 0.7-1.3 IPC. Max was 2.2 IPC, by Prime95.

Sent from HTC 10
(Opinions are own)
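
As a small illustration of the distinction discussed above (measured IPC vs. peak issue width), here is a sketch in Python. The counter values are hypothetical, picked only to land in the ranges quoted in this post, and the peak width of 7 is borrowed from the 2ALU+2AGU+3FP figure quoted above purely as an example. Real numbers would come from a profiler's retired-instruction and cycle counters.

```python
def ipc(instructions_retired, core_cycles):
    """Average IPC over a measurement window: retired instructions / elapsed core cycles."""
    if core_cycles <= 0:
        raise ValueError("core_cycles must be positive")
    return instructions_retired / core_cycles


# Hypothetical counter samples (not measured data): (instructions retired, core cycles).
samples = {
    "typical app": (1_000_000_000, 1_000_000_000),
    "Prime95-like": (2_200_000_000, 1_000_000_000),
}

ASSUMED_PEAK_WIDTH = 7  # assumption for illustration, not a property of any measured chip

for name, (instructions, cycles) in samples.items():
    value = ipc(instructions, cycles)
    share = value / ASSUMED_PEAK_WIDTH
    print(f"{name}: IPC {value:.2f} ({share:.0%} of the assumed peak width)")
```

The point of the sketch is only that sustained IPC is an average over a whole run, so it sits well below whatever the peak width is.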

Dresdenboy · Nov 1, 2016

KTE said:
I ran an analysis on Thuban a while back using CodeAnalyst. Typical was 0.7-1.3 IPC. Max was 2.2 IPC, by Prime95.

I was speaking for Intel... no, that's not true. I just roughly remembered what is sometimes seen in papers.

But based on your measurements, Intel's current cores with improved mem access handling over Thuban, etc., should land somewhere in the range I gave.

I'm not surprised about Prime95. <storymode>While discussing K8 optimizations with George Woltman and others, I learned a lot of interesting things about the SSE2 (P4) optimizations. Prime95's split radix FFTs were 3 times faster than the next fastest lib (FFTW, IIRC). George not only scheduled SSE2 instructions to perfection (and FFT radixes of course), but also did nice tricks with TLBs, etc.

KTE · Nov 1, 2016

Dresdenboy said:
I was speaking for Intel... no, that's not true. I just roughly remembered what is sometimes seen in papers.

But based on your measurements, Intel's current cores with improved mem access handling over Thuban, etc., should land somewhere in the range I gave.

I'm not surprised about Prime95. <storymode>While discussing K8 optimizations with George Woltman and others, I learned a lot of interesting things about the SSE2 (P4) optimizations. Prime95's split radix FFTs were 3 times faster than the next fastest lib (FFTW, IIRC). George not only scheduled SSE2 instructions to perfection (and FFT radixes of course), but also did nice tricks with TLBs, etc.

That's why it can pull so much more power than typical apps

Sent from HTC 10
(Opinions are own)

Dresdenboy · Nov 1, 2016

KTE said:
That's why it can pull so much more power than typical apps

Yep, close to a power virus, but doing useful calculations.

jpiniero · Nov 1, 2016

Speaking of Blender, I looked at Tom's review of Broadwell-E, and they did test Blender. The difference between Broadwell-E and Haswell-E wasn't all that much when you account for clocks; but the difference between Broadwell-E and Ivy Bridge-E was like 25%. I can't imagine it was just IPC and memory bandwidth so I have to think there was some AVX2 going on even in the version that they tested. BTW, the 6700K nearly tied the 6 core 4960X enough that the 7700K would be faster for sure. The 8 core Zen should be faster than the 7700K; but you would have to have confidence that AMD didn't rig the test beyond downclocking the Broadwell-E machine and disabling AVX2 like they did with that Polaris power consumption demo.

Sven_eng · Nov 1, 2016

How did AMD rig the Polaris power consumption video?

Trolling is not allowed
Markfw900
Anandtech Moderator

Sweepr · Nov 2, 2016

New Blender results with an ES (Snowy Owl? / Naples?)

AMD Engineer Sample render time: 69 seconds
E5-2699 v3 (Haswell-EP, 2014) render time: 35 seconds

http://blenchmark.com/cpu-benchmarks
