AMD Ryzen (Summit Ridge) Benchmarks Thread (use new thread)

Status
Not open for further replies.

bjt2

Senior member
Sep 11, 2016
784
180
86
Since the Blender test shows that Zen's mean MT IPC is the same as Broadwell-E's, there are 2 possibilities:

1) AMD's SMT is better than Intel's, so in ST Zen will be slightly inferior.
2) AMD's SMT is worse than Intel's, so in ST Zen will be slightly better.

This is in Blender.
Pick your choice...

But please note that, at least in Blender, SMT and ST IPC can't both be worse than Intel's...
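The trade-off can be sketched with hypothetical numbers (the IPC figures and the 1.25 SMT scaling factor below are illustrative assumptions, not measurements):

```python
# Model: aggregate MT IPC of one SMT core = ST IPC x SMT scaling factor.
# If two designs land on the same MT IPC, a lower ST IPC on one side
# forces a higher SMT scaling factor on that side -- both can't be lower.

def mt_ipc(st_ipc: float, smt_scaling: float) -> float:
    """Aggregate IPC of a core running two SMT threads."""
    return st_ipc * smt_scaling

intel_mt = mt_ipc(st_ipc=1.00, smt_scaling=1.25)         # baseline: 1.25
zen_mt   = mt_ipc(st_ipc=0.95, smt_scaling=1.25 / 0.95)  # same 1.25

assert abs(intel_mt - zen_mt) < 1e-9  # equal MT IPC by construction
assert 1.25 / 0.95 > 1.25             # lower ST implies better SMT scaling
```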

Anyway, Zen was an early ES, with 2 memory channels vs. 4, and probably with the RAM at a low clock, so the final product can be better (at least in clock frequencies)...
 

witeken

Diamond Member
Dec 25, 2013
3,899
193
106
In volume notebooks, AMD will have to deal with CNL-U...at least a full generation behind on process technology again.
It would seem more than a generation ahead given how super aggressive Intel is on 10nm power consumption. I wonder if Apple has had any influence there. Between the modem pilot experiment that Apple did, which seems to have been successful even if the modem itself sucks, and the ARM Artisan IP announcement at IDF, a transition to Intel 10nm in 2018 would be a possibility. I think the battle of Intel 10nm vs. TSMC 7nm for the A12 has not yet been fought.
 
Mar 10, 2006
11,715
2,012
126
Since the Blender test shows that Zen's mean MT IPC is the same as Broadwell-E's, there are 2 possibilities:

1) AMD's SMT is better than Intel's, so in ST Zen will be slightly inferior.
2) AMD's SMT is worse than Intel's, so in ST Zen will be slightly better.

This is in Blender.
Pick your choice...

But please note that, at least in Blender, SMT and ST IPC can't both be worse than Intel's...

Anyway, Zen was an early ES, with 2 memory channels vs. 4, and probably with the RAM at a low clock, so the final product can be better (at least in clock frequencies)...

In the specific Blender test shown by AMD, with unspecified hardware and software configurations, just before AMD announced a follow-on stock offering to pay off debt.
 

jpiniero

Lifer
Oct 1, 2010
14,841
5,456
136
It would seem more than a generation ahead given how super aggressive Intel is on 10nm power consumption.

Intel has said so little about 10nm that such a claim seems premature, which makes comparing Cannonlake to Zen at this point fruitless.
 

blublub

Member
Jul 19, 2016
135
61
101
For AMD to survive 2017, all ZEN has to do is be a lot better than XV, and all signs point in the direction that it's going to be just that.
If ZEN+ doesn't narrow the gap to Intel, that would be worrisome, as it would indicate that the design can't be improved much.

Although I have to admit that if ZEN benchmarks fall below Haswell, the stock is gonna tank, since AMD's Blender test suggests it can keep up.
 

NTMBK

Lifer
Nov 14, 2011
10,269
5,134
136
In volume notebooks, AMD will have to deal with CNL-U...at least a full generation behind on process technology again.

In servers, Zen needs to go up against Skylake-EP, which will certainly be a formidable architecture.

Gaining share against Intel is not going to be easy.

Certainly not. A lot depends on how well Intel can get their 10nm process yielding.
 

witeken

Diamond Member
Dec 25, 2013
3,899
193
106
Intel has said so little about 10nm that such a claim seems premature, which makes comparing Cannonlake to Zen at this point fruitless.
They gave a presentation at IDF with a slide showing gate delay and energy. The energy decrease compared to other nodes, even FinFET ones, was so large that it can't just be coincidence. Bohr explicitly said in the webcast that the focus was on energy, not performance.
 

cdimauro

Member
Sep 14, 2016
163
14
61
I remember your article on the Python interpreter... In this case it's true that it's a big switch, but as with common CPU instructions, which are encoded with shorter codes or decoded into one uop (the famous 96% of instructions decoded into 1 uop), the same can be said for emulators. If the uop cache is big enough, you will find in it the most used code snippets, corresponding to the most used emulated instructions... I bet that more than 90% of emulated code can be reduced to a few dozen instructions, and the switch branches that handle them can safely be stored in a 1K-2K uop cache... When a less used opcode must be emulated, the new snippet kicks out a less used bunch of instructions... But the majority would stay in cache... This is, in essence, the task of a cache...
Yes, but the micro-op cache is very small, at least compared to the regular L1C. And we don't know how it works. For example, it might work well only for loops, and have scarce or no associativity at all (e.g., only a few "fragments" are cached).
Long story short: with fragmented, branch-heavy code its contribution can be small or even nil.

Regarding emulators (and VMs too), it's true that some opcodes/instructions are more common, but there are still too many of them, containing too many instructions, to keep all (or even a good part) of them in a uop cache.

I know (C)Python's VM for sure: it executes A LOT of instructions for a very simple operation, like the very common "INT + INT". If you take a look at the executed code, there are some branches and several instructions, which will be split into many uops. Repeat the same for some other common operations, and you can easily see that even the L1C isn't enough to keep the average working set.
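As a quick illustration using CPython's own dis module (exact opcodes vary by interpreter version), even a one-line integer addition compiles to several bytecodes, each of which costs a full trip through the interpreter's dispatch loop in C:

```python
import dis

def add(a, b):
    return a + b

# Disassemble: expect loads of the two locals, a binary add, and a return.
dis.dis(add)

ops = [ins.opname for ins in dis.Bytecode(add)]
print(len(ops), "bytecodes for a single addition")
```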
My own opinion is that it's looking to end up 20-25% slower than Intel's SKL "all around". I don't care to delve into the ST/MT depths just yet.

Part of that will be SKL's clockspeed advantage, which I'm sure of.

IPC wise, I couldn't put a figure on it.

This is the same difference the BD 8150 had against the SNB 2600K, and the top models were even further ahead.

AMD typically loses a lot of revenue due to its repeated quarterly delays. Such delays buy a lot of time for the competitor.

They fail miserably in ambush or sudden counter fire.

Sent from HTC 10
(Opinions are own)
If the very last information reported above is true, the situation is much worse...
 

bjt2

Senior member
Sep 11, 2016
784
180
86
In the specific Blender test shown by AMD, with unspecified hardware and software configurations, just before AMD announced a follow-on stock offering to pay off debt.

So you don't buy that Zen has similar IPC to BW-E in Blender? It seems they used the same official version downloadable from the site...

Yes, but the micro-op cache is very small, at least compared to the regular L1C. And we don't know how it works. For example, it might work well only for loops, and have scarce or no associativity at all (e.g., only a few "fragments" are cached).
Long story short: with fragmented, branch-heavy code its contribution can be small or even nil.

Regarding emulators (and VMs too), it's true that some opcodes/instructions are more common, but there are still too many of them, containing too many instructions, to keep all (or even a good part) of them in a uop cache.

I know (C)Python's VM for sure: it executes A LOT of instructions for a very simple operation, like the very common "INT + INT". If you take a look at the executed code, there are some branches and several instructions, which will be split into many uops. Repeat the same for some other common operations, and you can easily see that even the L1C isn't enough to keep the average working set.

If the very last information reported above is true, the situation is much worse...

In this case the 4-way decoder could give a steady flow of 2 instructions per thread per clock... I imagine that an interpreter doesn't have a high IPC anyway, probably below 1, certainly below 2... So 4 decoders for two threads, plus something in the uop cache, should be sufficient... Anyway, on Intel CPUs the uop cache is 1.5K uops if I remember correctly, not so small... I hope that on Zen it's at least 1K uops...
I remember the fact that there are apparently simple instructions that are translated into a very long chain in the Python interpreter, but I hope that there are also simpler cases...
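A back-of-envelope check of that decoder math (the figures are the post's own assumptions, not measurements):

```python
# Assumed figures from the discussion: a 4-wide decoder shared between
# 2 SMT threads, feeding an interpreter whose IPC is around 1.
decode_width = 4                      # instructions decoded per clock
threads = 2                           # SMT threads sharing the decoder
per_thread = decode_width / threads   # steady-state share per thread

assert per_thread == 2.0              # 2 instructions/thread/clock
interpreter_ipc = 1.0                 # assumed dispatch-bound interpreter IPC
assert per_thread >= interpreter_ipc  # decode bandwidth wouldn't be the limit
print(f"{per_thread:.0f} instr/thread/clock vs. interpreter IPC ~{interpreter_ipc:.0f}")
```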
 

krumme

Diamond Member
Oct 9, 2009
5,956
1,595
136
In the specific Blender test shown by AMD, with unspecified hardware and software configurations, just before AMD announced a follow-on stock offering to pay off debt.
And the big institutionalized professional investors didn't even see it, and still haven't!
 

LTC8K6

Lifer
Mar 10, 2004
28,520
1,575
126
Does it also bug you when Intel relabels i3's as i5's and i7's in laptops?
No, because i3's did not have turbo. If the laptop 2c/4t chips had turbo, then they were different from i3's. As far as I know, the mobile i5 and i7 chips all had turbo boost, differentiating them from i3 chips.

It may be that the new Kaby Lake i3's have turbo, though.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Besides AMD has already demoed Zeppelin with SMT enabled (Blender).
Good point. OTOH this was the 8C variant, while Naples is a different story. But so is the type of SMT problem. It might not be a bug per se (non-functional SMT), but some configuration/OS issue resulting in lower performance with SMT enabled. There is some adaptive, application-aware power management, which might be going wrong here.
 

witeken

Diamond Member
Dec 25, 2013
3,899
193
106
I read that Zen overclocks to about 4.2GHz. So at least with Zen, people will stop complaining about Intel's processors not overclocking high enough, since Zen is even worse. That same dude says it matches Haswell performance.

Don't know if this was already posted here.

That Cinebench R15 MT score:
6900K @ 5.1GHz: ~2,100
8C/16T ZEN @ 5.1GHz: 2,000 (workaround disabled)

https://webcache.googleusercontent....20/amd-zen-141214/+&cd=2&hl=en&ct=clnk&gl=eng
https://webcache.googleusercontent....41214/index3.html+&cd=3&hl=eng&ct=clnk&gl=eng
 

KTE

Senior member
May 26, 2016
478
130
76
I don't think SMT is buggy at this stage per se, but maybe a multi-socket implementation needs work.

I'd expect AMD's SMT to be pretty darn good, given the much lower IPC of their previous uarchs in general.

It's their ST (w/o SMT), MHz, and power I'm worried about. The first because they've left it too late; 20-40% simply wouldn't be enough to compete above the cheap value segment in 2017-2018. 40-60% would be excellent and would make up a lot of lost turf, but would need improving on ASAP to stay competitive.

Secondly, 95W is fine, but 3.1GHz @ 95W and then 3.3GHz @ 125W isn't.

They will pit n+2 cores against Intel's, with SMT, so I'm not worried about the MT perf.

Sent from HTC 10
(Opinions are own)
 
Mar 10, 2006
11,715
2,012
126

piesquared

Golden Member
Oct 16, 2006
1,651
473
136
It has been known for several months that desktop comes first. It sounds a lot like the A64 and FX-51 coming to desktop first, with server to follow. Although the FX-51 with Socket 940 required ECC RAM, whereas Summit Ridge and AM4 don't appear to require it.
 

Soulkeeper

Diamond Member
Nov 23, 2001
6,713
142
106
Thanks, I guess I'll be waiting a while, unless I see some good deals on Xeon parts for the holidays.
 

Thunder 57

Platinum Member
Aug 19, 2007
2,814
4,108
136
It has been known for several months that desktop comes first. It sounds a lot like the A64 and FX-51 coming to desktop first, with server to follow. Although the FX-51 with Socket 940 required ECC RAM, whereas Summit Ridge and AM4 don't appear to require it.

Except that Opteron came before the A64 by a good six months or so. This time around, though, AMD has said desktop will come first.
 

cdimauro

Member
Sep 14, 2016
163
14
61
So you don't buy that Zen has similar IPC to BW-E in Blender? It seems they used the same official version downloadable from the site...
Which doesn't use AVX/AVX2...
In this case the 4-way decoder could give a steady flow of 2 instructions per thread per clock... I imagine that an interpreter doesn't have a high IPC anyway, probably below 1, certainly below 2...
It depends on the emulated code/instructions, but I think it's reasonable to expect an IPC close to 1.
So 4 decoders for two threads, plus something in the uop cache, should be sufficient... Anyway, on Intel CPUs the uop cache is 1.5K uops if I remember correctly, not so small... I hope that on Zen it's at least 1K uops...
It doesn't change the picture: they are not enough; not even the L1C is enough. And we don't know the policies used to cache the uops.
I remember the fact that there are apparently simple instructions that are translated into a very long chain in the Python interpreter, but I hope that there are also simpler cases...
Sure: one of the most common cases (LOAD_FAST) is very fast and needs only a few instructions. But "few" still means around ten if you follow the path from the bytecode fetch to the actual execution and the jump back to the fetch section, and consider that 3 conditional jumps are executed, plus an unconditional one (at the very end).

Which is a normal scenario for an emulator that works in the usual, "interpretative", way.
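A hypothetical miniature of that interpretive pattern (toy opcodes, not CPython's real ones): every emulated operation pays the fetch/dispatch/jump-back overhead described above before any real work happens:

```python
# Toy stack-machine interpreter: fetch an opcode, branch on it in a
# dispatch chain, execute the handler, then jump back to the fetch.
LOAD_CONST, ADD, RETURN = 0, 1, 2

def run(code, consts):
    stack, pc = [], 0
    while True:                          # the "jump back to the fetch" loop
        op = code[pc]; pc += 1           # bytecode fetch
        if op == LOAD_CONST:             # dispatch: one branch per opcode
            stack.append(consts[code[pc]]); pc += 1
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == RETURN:
            return stack.pop()

# "INT + INT" costs four dispatched handlers here, each with its own
# branches, just to perform a single native addition:
print(run([LOAD_CONST, 0, LOAD_CONST, 1, ADD, RETURN], [2, 3]))  # -> 5
```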

When a JIT is involved it's quite different, because most of the overhead / non-linear code consists of software "guards" (checking types, or checking whether some interrupt/signal happened) and "loop guards" (to avoid being stuck indefinitely in a loop).
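A sketch of those guards (the function names and the budget threshold are invented for illustration): the JIT-style fast path runs straight-line code, but only after cheap checks that its assumptions still hold:

```python
# Type guard: the specialized path is only valid for plain ints;
# anything else "deoptimizes" back to a generic fallback path.
def jitted_add(a, b, fallback):
    if type(a) is not int or type(b) is not int:
        return fallback(a, b)            # guard failed: deoptimize
    return a + b                         # straight-line fast path

# Loop guard: bail out rather than stay stuck indefinitely in a loop.
def guarded_sum(n, budget=1_000_000):
    total = 0
    for i in range(n):
        if i >= budget:
            raise RuntimeError("loop guard tripped")
        total = jitted_add(total, i, lambda x, y: x + y)
    return total

print(guarded_sum(10))  # -> 45
```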
 