AMD Ryzen (Summit Ridge) Benchmarks Thread (use new thread)

Page 60
Status
Not open for further replies.

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
I would suggest you use the Core i3 6100 instead. The A12-9800 is an APU with L2 cache only at 65W TDP, whereas the BDW-E 6950X is a 25MB L3, 140W TDP SKU.
Interesting thought. But I think using a L3$ chip vs. a max. L2$ chip is fair, because of this 40% including the L3$. It should mitigate the effect of the smaller L2$.
 

TheELF

Diamond Member
Dec 22, 2012
3,993
744
126
IPC usually means measured instructions per clock over several cycles (e.g. 1M, or all cycles of an application run), not "issue width", which is kind of a peak value, as the decoders and µOp$ simply can't provide enough instructions per cycle.
Yes, usually it does, but ever since AMD won the lawsuit against them about what constitutes a core, they pretty much have carte blanche to use any technical term in any way they like. I mean, they are selling 12 core laptops now because of this... there is no trust anymore.
Sometimes even smart people might run into kind of a dead end, thought-wise. Then it might help to take them out of their seat, shake them a bit, and put them back in place. Or alternatively, one might kindly ask them to step back and check their POV.
Intel has specialized execution units that reduce the cycles that instructions need until they are done.
If Blender can actually use all available instructions, 10 for Zen and 8 for Broadwell, then Zen would have to be 20% faster to match the "measured instructions per clock over several cycles" you mentioned. It isn't, so it is slower.
Even in the scenario where the software can use all instructions... as you said, most software doesn't.
 

bjt2

Senior member
Sep 11, 2016
784
180
86
I'd expected some ratio: % of max possible, not of all.
Here is an explanation:
http://www.eetimes.com/document.asp?doc_id=1276117

Zen likely has roughly similar values.

So since Zen has a longer pipeline (19 vs. maybe 12-15 stages) than Jaguar and Bobcat, there are more clock gating opportunities, and so its numbers should be even better...
Very good...
So with leakage being 1/6 on 14nm FF LPP vs. 28nm bulk, there is more room to increase dynamic power, and thus clocks...
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Yes, usually it does, but ever since AMD won the lawsuit against them about what constitutes a core, they pretty much have carte blanche to use any technical term in any way they like. I mean, they are selling 12 core laptops now because of this... there is no trust anymore.

Intel has specialized execution units that reduce the cycles that instructions need until they are done.
If Blender can actually use all available instructions, 10 for Zen and 8 for Broadwell, then Zen would have to be 20% faster to match the "measured instructions per clock over several cycles" you mentioned. It isn't, so it is slower.
Even in the scenario where the software can use all instructions... as you said, most software doesn't.
OK, there are so many fallacies and mistakes in here, that I won't continue this discussion.

If you like, you might get familiar with superscalar OoO execution first.
https://lagunita.stanford.edu/c4x/Engineering/CS316/asset/Processor_Microarchitecture.pdf
http://www.lighterra.com/papers/modernmicroprocessors/
 
Last edited:

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Cycles has recently received quite a few SIMD optimizations, so it might be worth checking the differences between the old and the current builds. The developers reported an 8% reduction in rendering time from a single commit improving AVX2 (IIRC) alone, in the "BMW" scene.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Cycles has recently received quite a few SIMD optimizations, so it might be worth checking the differences between the old and the current builds. The developers reported an 8% reduction in rendering time from a single commit improving AVX2 (IIRC) alone, in the "BMW" scene.
8% is not bad. Depending on the code changes and code path selection for Zen, there might be zero to a small change.
 

Abwx

Lifer
Apr 2, 2011
11,167
3,862
136
The developers reported 8% reduction in rendering time from a single commit improving AVX2 (IIRC) alone, in the "BMW" scene.

If the CPU has to downclock when AVX2 is used, then those 8% are moot. As pointed out by Dresdenboy, this wouldn't even noticeably impact throughput if it's executed in two passes and the frequency is kept constant.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
8% is not bad. Depending on the code changes and code path selection for Zen, there might be zero to a small change.

The devs were not lying.
In Blender 2.78 AVX2 is indeed beneficial. I measured > 11.8% average performance improvement from enabling AVX2. The SMT yield seems to be coming down to sane levels too (> 59% in 2.77, > 35% in 2.78).

Binaries compiled with MSVC 2015.
Tested with the "BMW Rev. 4" scene, 1920x1080 resolution, 40x40 tile size (96 X, 54 Y), on Haswell-EP HCC (18C/36T).

Non-AVX2: 270.52s
AVX2: 241.89s

Blender 2.78 - MSVC 2015, AVX2 / Non-AVX2 (111MB).

Pass: "blender" (without the quotes).
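
A quick sanity check of the two timings above, as a minimal sketch. It also shows why a "reduction in rendering time" and a "performance improvement" give different percentages for the same pair of runs. The function names are made up for illustration; only the two timings come from the post.

```python
def speedup_percent(baseline_s, improved_s):
    """Performance improvement: how much more work per unit time, in percent."""
    return (baseline_s / improved_s - 1.0) * 100.0


def time_reduction_percent(baseline_s, improved_s):
    """Reduction in wall-clock rendering time, in percent."""
    return (1.0 - improved_s / baseline_s) * 100.0


# Timings quoted in the post (seconds, "BMW Rev. 4" scene).
non_avx2_s = 270.52
avx2_s = 241.89

print(f"performance improvement: {speedup_percent(non_avx2_s, avx2_s):.2f}%")
print(f"render time reduction:   {time_reduction_percent(non_avx2_s, avx2_s):.2f}%")
```

The first figure lands just above 11.8%, matching the number in the post. The second is the metric to compare against the developers' "8% reduction in rendering time".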
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,361
136
I don't have a Core i3 6100 available to me at the moment, sorry. Anyway, the i3 6100 is Skylake based and the L3$ is actually a lot faster on SKL than it is on BDW-E (2.8GHz only). GB4 likes L3$ speed more than it likes L3$ size, AFAICT.

Interesting thought. But I think using a L3$ chip vs. a max. L2$ chip is fair, because of this 40% including the L3$. It should mitigate the effect of the smaller L2$.

The problem is not so much the L3 cache as the TDP, which is less than half (65W vs 140W). From my testing, the A8-7600 showed a substantial ST performance difference at 45W TDP vs 65W TDP in CB.





Take special note that the A10-7700K has the same 3.8GHz single-thread clock as the A8-7600, but the A10-7700K is at 95W TDP. So going from 45W TDP to 95W TDP (roughly double) while keeping the same 3.8GHz single-thread clock, we get 8-10% higher single-thread performance using the same core/architecture.

Now, the A12-9800 has a 4.2GHz single-thread clock at 65W TDP, whereas the Core i7 6950X has a single-thread clock of 3.5GHz at 140W TDP. For an apples-to-apples comparison, we need both the same single-core clocks AND the same TDP.
 

mikk

Diamond Member
May 15, 2012
4,173
2,211
136
For Intel it doesn't matter, even with 65W such a Singlethread test with Geekbench is no issue, it performs with max clock. As for AMD it just shows that their Turbo implementation is very weak, I really doubt their CPU needs more than 45W when only one core is fully loaded, especially in Geekbench.
 

cdimauro

Member
Sep 14, 2016
163
14
61
The devs were not lying.
In Blender 2.78 AVX2 is indeed beneficial. I measured > 11.8% average performance improvement from enabling AVX2. The SMT yield seems to be coming down to sane levels too (> 59% in 2.77, > 35% in 2.78).

Binaries compiled with MSVC 2015.
Tested with the "BMW Rev. 4" scene, 1920x1080 resolution, 40x40 tile size (96 X, 54 Y), on Haswell-EP HCC (18C/36T).

Non-AVX2: 270.52s
AVX2: 241.89s

Blender 2.78 - MSVC 2015, AVX2 / Non-AVX2 (111MB).

Pass: "blender" (without the quotes).
That's really strange, since looking at Blender's code it shouldn't make use of 256-bit vectors.

Can you try to force the usage of 128-bit sized vector registers with AVX code, if possible?
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
For Intel it doesn't matter, even with 65W such a Singlethread test with Geekbench is no issue, it performs with max clock. As for AMD it just shows that their Turbo implementation is very weak, I really doubt their CPU needs more than 45W when only one core is fully loaded, especially in Geekbench.

It was until Carrizo. Prior to Carrizo, the CPUs and APUs didn't know their actual power consumption, meaning they adjusted their frequency based on a fused power vs. P-state (frequency state) lookup table. Needless to say, the results weren't that great.
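
To illustrate the difference, here is a rough sketch of table-based versus measurement-based frequency selection. This is not AMD's actual algorithm. The table values, the 45W limit, the 0.6 load factor and both function names are hypothetical, picked only to show why a fixed worst-case table ends up conservative under a light single-threaded load.

```python
# Hypothetical fused table: (frequency in MHz, assumed worst-case power in W).
FUSED_TABLE = [(3800, 62.0), (3600, 52.0), (3400, 44.0), (3100, 36.0)]


def pick_frequency_table(table, power_limit_w):
    """Pick the highest frequency whose fused worst-case power fits the limit."""
    fitting = [freq for freq, power in table if power <= power_limit_w]
    # Fall back to the lowest state if nothing fits.
    return max(fitting) if fitting else min(freq for freq, _ in table)


def pick_frequency_measured(table, power_limit_w, load_factor):
    """Same selection, but using the power the current load actually draws.

    load_factor is the (hypothetical) ratio of real power to the fused
    worst-case value, e.g. 0.6 for a light single-threaded workload.
    """
    scaled = [(freq, power * load_factor) for freq, power in table]
    return pick_frequency_table(scaled, power_limit_w)


print(pick_frequency_table(FUSED_TABLE, 45.0))          # table-based choice
print(pick_frequency_measured(FUSED_TABLE, 45.0, 0.6))  # measurement-based choice
```

With these made-up numbers, the table-based chip stays below its top state at a 45W limit, while the chip that knows its real power can hold the highest clock.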
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136

cdimauro

Member
Sep 14, 2016
163
14
61
But they are very very limited, as you can see: 256-bit instructions are rare birds.

The biggest part with Blender is represented by scalar code, or at most some 128-bit vector usage.

I doubt that the whole speed-up of this patch is related to the 256-bit vector usage. Unfortunately there's no way to force/set only 128-bit vectors.
 

JoeRambo

Golden Member
Jun 13, 2013
1,814
2,105
136
It's not "vector" usage; the algorithm was transformed into one capable of using 256 bits of hardware instead of just floats by some magic code. Btw, from the commit log it seems that the gain is actually 20% on a Windows machine. The commit log also revealed that some Intel guy (Maxym Dmytrychenko) gave them ideas about how to do it, as it is not obvious (understatement of the month).
 

cdimauro

Member
Sep 14, 2016
163
14
61
I know. I say "vector" whenever we aren't using scalar quantities.

Yes, the idea isn't obvious at all. Congrats to the colleague.
 

nismotigerwvu

Golden Member
May 13, 2004
1,568
33
91

Man, I wouldn't have expected the slightly tweaked Stars core in Llano to have put on that strong of a showing from a performance standpoint. I sort of wish we could have seen what it was capable of on a process that wasn't a mess, especially considering how nicely that 32nm node matured by the end.
 

KTE

Senior member
May 26, 2016
478
130
76
Maybe I didn't understand your point, but this is a simplistic form of my view here:
For ST (@iso freq):
T'put = #inst./cycle
Speed= #inst./cycle

And who said, that in real ST code XV constantly maxes out an IPC of 7.0 (2ALU+2AGU+3FP)? Why would a SW need to reach 10 ops/clock if in reality the value is more like 1.0 to 3.0, sometimes higher

I ran an analysis on Thuban a while back using CodeAnalyst. Typical was 0.7-1.3 IPC. Max was 2.2 IPC with Prime95.
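
For reference, the distinction between measured IPC and peak width can be sketched like this. The counter values in the usage line are placeholders, not measurements. The peak width of 7.0 and the 1.0 to 3.0 range are the XV figures mentioned above. The function names are made up.

```python
def ipc(instructions_retired, cycles):
    """Measured IPC: retired instructions divided by elapsed core cycles."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions_retired / cycles


def peak_utilization(measured_ipc, peak_width):
    """Fraction of the theoretical per-cycle maximum that was actually reached."""
    return measured_ipc / peak_width


PEAK_WIDTH_XV = 7.0  # 2 ALU + 2 AGU + 3 FP, as quoted in the post

# Placeholder counter readings, e.g. from a profiler run.
print(f"IPC = {ipc(2_200_000, 1_000_000):.2f}")

for measured in (1.0, 3.0):
    share = peak_utilization(measured, PEAK_WIDTH_XV)
    print(f"IPC {measured:.1f} uses {share:.0%} of a peak width of {PEAK_WIDTH_XV:.0f}")
```

Even the upper end of that range leaves well over half of the peak width unused, which is why peak width says little about measured IPC.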

Sent from HTC 10
(Opinions are own)
 
Reactions: Dresdenboy

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
I ran an analysis on Thuban a while back using CodeAnalyst. Typical was 0.7-1.3 IPC. Max was 2.2 IPC with Prime95.
I was speaking for Intel... no, that's not true. I just roughly remembered what is sometimes seen in papers.

But based on your measurements, Intel's current cores with improved mem access handling over Thuban, etc., should land somewhere in the range I gave.

I'm not surprised about Prime95. <storymode>While discussing with George Woltman and others about K8 optimizations, I learned a lot of interesting things about the SSE2 (P4) optimizations. Prime95's split radix FFTs were 3 times faster than the next fastest lib (FFTW, IIRC). George did not only schedule SSE2 instructions to perfection (and FFT radixes of course), but also did nice tricks with TLBs, etc.
 
Reactions: KTE

KTE

Senior member
May 26, 2016
478
130
76
I was speaking for Intel... no, that's not true. I just roughly remembered what is sometimes seen in papers.

But based on your measurements, Intel's current cores with improved mem access handling over Thuban, etc., should land somewhere in the range I gave.

I'm not surprised about Prime95. <storymode>While discussing with George Woltman and others about K8 optimizations, I learned a lot of interesting things about the SSE2 (P4) optimizations. Prime95's split radix FFTs were 3 times faster than the next fastest lib (FFTW, IIRC). George did not only schedule SSE2 instructions to perfection (and FFT radixes of course), but also did nice tricks with TLBs, etc.
That's why it can pull so much more power than typical apps

Sent from HTC 10
(Opinions are own)
 

jpiniero

Lifer
Oct 1, 2010
14,840
5,456
136
Speaking of Blender, I looked at Tom's review of Broadwell-E, and they did test Blender. The difference between Broadwell-E and Haswell-E wasn't all that much when you account for clocks, but the difference between Broadwell-E and Ivy Bridge-E was like 25%. I can't imagine it was just IPC and memory bandwidth, so I have to think there was some AVX2 going on even in the version that they tested. BTW, the 6700K came close enough to the 6 core 4960X that the 7700K would be faster for sure. The 8 core Zen should be faster than the 7700K, but you would have to have confidence that AMD didn't rig the test beyond downclocking the Broadwell-E machine and disabling AVX2, like they did with that Polaris power consumption demo.
 

Sven_eng

Member
Nov 1, 2016
110
57
61
How did AMD rig the Polaris power consumption video?


Trolling is not allowed
Markfw900
Anandtech Moderator
 
Last edited by a moderator: