AMD Zen “RYZEN” CPUs Detailed – 8 Cores, 3.4Ghz+ & Auto Overclocking With “XFR”

Abwx · Dec 15, 2016

Tuna-Fish said:
I expect AMD to have done just that and that the core can retire 4 ops per thread.

Doesn't make sense, as that would mean that in ST the core couldn't even reach 50% of its throughput. Or are you telling us that the SMT gain is 100%?

Btw, an EXV core can retire 4 ops/cycle; if you were right there would be no ST gain with Zen. If the latter can retire 6 ops/cycle in ST, then it could get as much as 50% better ST perf than XV, and there would still be room for an 8/6 = 33% SMT gain.
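
The arithmetic in this post can be checked with a short sketch. It assumes, as the post does, that retire width is the only limit on throughput. The figures (4 ops/cycle for XV, 6 ops/cycle for Zen in ST, 8 ops/cycle in total) are the ones quoted in the discussion, not confirmed specifications, and `relative_gain` is a helper name made up for illustration.

```python
def relative_gain(new_width: float, old_width: float) -> float:
    """Upper bound on the throughput gain if retire width were the only limit."""
    return new_width / old_width - 1.0


# Figures quoted in the thread (assumptions, not confirmed specs).
xv_retire = 4         # ops/cycle claimed for the XV core
zen_st_retire = 6     # hypothetical single-thread retire rate for Zen
zen_total_retire = 8  # total retire width assumed for Zen

st_gain = relative_gain(zen_st_retire, xv_retire)
smt_gain = relative_gain(zen_total_retire, zen_st_retire)
# Share of the total retire width one thread would get at 4 ops/thread.
st_share_at_4 = 4 / zen_total_retire

print(f"ST headroom over XV:       {st_gain:.0%}")
print(f"SMT headroom on top of ST: {smt_gain:.0%}")
print(f"ST share at 4 ops/thread:  {st_share_at_4:.0%}")
```

These are ceilings under the retire-only assumption; whether retire is ever the real bottleneck is exactly what the replies dispute.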

Tuna-Fish · Dec 15, 2016

Abwx said:
Doesn't make sense, as that would mean that in ST the core couldn't even reach 50% of its throughput. Or are you telling us that the SMT gain is 100%?

Btw, an EXV core can retire 4 ops/cycle; if you were right there would be no ST gain with Zen. If the latter can retire 6 ops/cycle in ST, then it could get as much as 50% better ST perf than XV, and there would still be room for an 8/6 = 33% SMT gain.

This logic would be correct if it were indeed realistic that EXV would retire 4 ops/cycle or that Zen would reach above a 4 ops/cycle/thread average. Retire is the last stage in the pipeline, it "lags" tens, potentially hundreds of cycles after the actual execution. While most of the other structures deal with bursty loads, the retire can run at full throughput all the time so long as there are ops available, without hurting speed. That is, every other part of the core needs to raise its peak throughput well above the average to bring the average up, while for retire, average = peak, or it's not retire-limited and retire won't matter. (Also note that retire almost certainly works on large, unsplit uops. That is, a load+op would issue 1 uop to both the AGU and the ALU, but in retire it would be a single op.)

Even current Intel cores are practically never retire-limited when executing a single thread. If 4-wide retire would actually limit Zen in ST loads, that would be a genuinely wonderful problem to have.

cytg111 · Dec 15, 2016

krumme said:
Yep. But if that's the case, then what does that tell us?
1. Base freq was way higher than anticipated by most.
2. SMT is stronger than Intel's implementation.

This is even more freaky than anyone could imagine. Ask e.g. The Stilt. It took Intel 10 years to get there. It's absolutely stunning if AMD's implementation gives near the same utilization.

Suddenly we see a lot of new goalposts here. Power delta. It's like evaluating a Prius without the hybrid system. And somehow a presumed (and we actually don't know) excellent SMT is turned into something a bit negative because it relatively hurts how we evaluate ST perf. Man. We have a CPU built on a far smaller R&D budget using less die size, especially considering Intel's node is a bit denser. The performance needs to come from somewhere. And it's smart rather than brute force using lots of transistors. There are no miracles here, so there need to be some trade-offs.

Right now we want AMD's HT implementation to suck, because that means single-threaded perf rocks.

LTC8K6 · Dec 15, 2016

cytg111 said:
Right now we want AMD's HT implementation to suck, because that means single-threaded perf rocks.

Shame that they didn't run the tests with "HT" off on both chips.

coercitiv · Dec 15, 2016

cytg111 said:
Right now we want AMD's HT implementation to suck, because that means single-threaded perf rocks.

ST design by Jim Keller
SMT design by Retard Chimp

The only thing everyone should wish for is for the arch to perform at similar efficiency levels for a wide spectrum of loads/code.

Abwx · Dec 15, 2016

Tuna-Fish said:
Retire is the last stage in the pipeline, it "lags" tens, potentially hundreds of cycles after the actual execution.

While most of the other structures deal with bursty loads, the retire can run at full throughput all the time so long as there are ops available, without hurting speed.

There are 192 uops in the retire queue; at 8/cycle we're talking about 24 cycles to retire all the uops available at a given moment. That's hardly hundreds of cycles.
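
A quick check of the drain-time figure, assuming a full 192-entry retire queue, a constant retire rate and no stalls (the best case this post describes). `drain_cycles` is a hypothetical helper, and both the queue size and the retire widths are the numbers used in the thread rather than verified specifications.

```python
import math


def drain_cycles(queue_entries: int, retire_width: int) -> int:
    """Cycles needed to empty the queue at a constant retire rate, no stalls."""
    return math.ceil(queue_entries / retire_width)


cycles_at_8 = drain_cycles(192, 8)  # total retire width assumed in the thread
cycles_at_4 = drain_cycles(192, 4)  # the 4-wide case discussed earlier

print(f"8 ops/cycle: {cycles_at_8} cycles")
print(f"4 ops/cycle: {cycles_at_4} cycles")
```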

Tuna-Fish said:
Even current Intel cores are practically never retire-limited when executing a single thread. If 4-wide retire would actually limit Zen in ST loads, that would be a genuinely wonderful problem to have.

The limitation is in the actual exe resources: if 8 uops are available, then there must be 8 suitable exe units available each cycle. Of course this is not possible, but in this respect AMD's uarch is less limited than Intel's due to its parallel implementation of exe units, while Intel relies on exe clusters that deal alternately with FP and INT instructions. That's good for power efficiency but not as good past a certain level of ILP.

MajinCry · Dec 15, 2016

Do we have any idea if Zen has finally cured AMD's draw call deficit in contrast to Intel? If we have another Phenom II <> Nehalem situation, where they crunch numbers equally but with the AMD chips being >3x slower with draw calls...Oh boy.

CatMerc · Dec 15, 2016

MajinCry said:
Do we have any idea if Zen has finally cured AMD's draw call deficit in contrast to Intel? If we have another Phenom II <> Nehalem situation, where they crunch numbers equally but with the AMD chips being >3x slower with draw calls...Oh boy.

wat
AMD's draw call woes are purely coming from the graphics drivers load, it has nothing to do with the CPU.
A faster CPU will increase how many draw calls your PC can handle, but it's not a cure.

MajinCry · Dec 15, 2016

CatMerc said:
wat
AMD's draw call woes are purely coming from the graphics drivers load, it has nothing to do with the CPU.
A faster CPU will increase how many draw calls your PC can handle, but it's not a cure.

That's not true. Compare an AMD CPU vs an Intel CPU, coupled with the fastest NVidia GPU. The AMD CPU will choke out long before the Intel, as they handle draw calls abysmally.

And when you chuck in an AMD GPU...Oh boy. I've inquired about this with Boris Vorontsov, a Russian dude who has been reverse engineering renderers for years. He explicitly states that AMD CPUs are 3-4x slower at draw calls than their Intel performance equivalents.

http://enbseries.enbdev.com/forum/viewtopic.php?f=2&t=4666&start=150

Marcurios
AMD cpu like yours is about 3-4 times slower this draw calls bottleneck compared to Intel, so you can't do anything, except lowering drawing distance for everything.

This guy defo' knows what he's talking about, by and by.

Tuna-Fish · Dec 15, 2016

Abwx said:
There are 192 uops in the retire queue; at 8/cycle we're talking about 24 cycles to retire all the uops available at a given moment. That's hardly hundreds of cycles.

That's if everything is going smoothly. However, since retire is strictly in-order, all it takes is one memory reference that misses cache, making retire stall until it's satisfied, and now there is potentially a difference of hundreds of cycles between frontend and retire. After that, the retire queue is smoothly drained until it runs out, since the retire can operate at full speed while no frontend will actually manage 4 insns per cycle in the long run. Unless another cache-missing memory op happens, ofc.
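
The stall described here can be illustrated with a toy simulation. This is a minimal sketch under loud assumptions: a 192-entry queue, a frontend that allocates 4 ops/cycle, 8-wide in-order retire, every op completing one cycle after allocation, and a single load that takes 300 cycles. None of these numbers are measured Zen parameters, and `simulate` is a name made up for this example.

```python
from collections import deque


def simulate(num_ops, rob_size, alloc_width, retire_width,
             miss_index=None, miss_latency=300, exec_latency=1):
    """Toy in-order-retire model. Returns (total_cycles, max_lag_in_cycles).

    Lag is the number of cycles between an op entering the queue
    and that same op retiring.
    """
    rob = deque()  # (alloc_cycle, done_cycle), oldest op on the left
    allocated = retired = cycle = max_lag = 0
    while retired < num_ops:
        # Retire strictly in order: stop at the first op that is not done.
        slots = retire_width
        while rob and slots > 0 and rob[0][1] <= cycle:
            alloc_cycle, _ = rob.popleft()
            max_lag = max(max_lag, cycle - alloc_cycle)
            retired += 1
            slots -= 1
        # The frontend allocates new ops while the queue has room.
        slots = alloc_width
        while allocated < num_ops and len(rob) < rob_size and slots > 0:
            latency = miss_latency if allocated == miss_index else exec_latency
            rob.append((cycle, cycle + latency))
            allocated += 1
            slots -= 1
        cycle += 1
    return cycle, max_lag


smooth_cycles, smooth_lag = simulate(1000, 192, 4, 8)
miss_cycles, miss_lag = simulate(1000, 192, 4, 8, miss_index=10)

print(f"no miss:  {smooth_cycles} cycles, max lag {smooth_lag}")
print(f"one miss: {miss_cycles} cycles, max lag {miss_lag}")
```

In this model the missing load blocks the head of the queue, the queue fills behind it, and the frontend stalls until the load returns. After that, retire drains faster than the frontend refills, which is the behaviour the post argues for.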

VirtualLarry · Dec 15, 2016

BeepBeep2 said:
TDP is universally defined as Thermal Design Power, how many watts of heat a heatsink must dissipate.

Actual power consumption can be significantly higher than TDP. The keyword here is thermal.

And every watt of power going into a CPU gets converted to heat and dissipated.

If you mean that "peak instantaneous power consumption" can be higher than TDP, then you are correct. If you mean that CPUs can, on average, consume more power than they dissipate as heat, then no, you would be wrong there.

krumme · Dec 15, 2016

cytg111 said:
Right now we want AMD's HT implementation to suck, because that means single-threaded perf rocks.

Yes, and we don't care if it's 180mm2 or 280mm2 as long as we can get it for 300 USD and reuse our old DDR3.

sirmo · Dec 15, 2016

CatMerc said:
wat
AMD's draw call woes are purely coming from the graphics drivers load, it has nothing to do with the CPU.
A faster CPU will increase how many draw calls your PC can handle, but it's not a cure.

They have made big strides on this with their driver. Someone posted a 3DMark draw call benchmark for every driver since the RX 480 release, and almost every driver showed an increase. It's probably the reason we're seeing 5-8% performance improvements with GCN cards. I saw a post by a guy who noticed a 30fps jump in Overwatch, from 100fps to 130fps, on a 280X.

Ajay · Dec 15, 2016

cytg111 said:
I somewhat agree with TheELF here. Either way you put it, if we take the relative performance of an 8c/16t Zen to be equal to that of an 8c/16t Broadwell, then it stands to reason that, for the given workload, if HT threads on the Zen platform reach a higher utilization level than the equivalent HT on the Intel platform, then Zen ST performance will be lower.

Not necessarily so. Abwx made it clear, and TheELF agreed, that 100% of core resources are available for ST. If Zen has a higher net throughput in single core SMT, it is likely due to fast dynamic partitioning of the core to match the current instruction stream better (unlike Intel's HT, where the first thread has first dibs). There is no dynamic re-partitioning for ST.

krumme · Dec 15, 2016

Major posted some SMT vs. non-SMT Blender results with the AMD Zen file on a 2C SKL here:
http://www.portvapes.co.uk/?id=Latest-exam-1Z0-876-Dumps&exid=index.php?posts/38632293
Pretty darn good scaling imo. What do you think?

TheELF · Dec 15, 2016

Ajay said:
Not necessarily so. Abwx made it clear, and TheELF agreed, that 100% of core resources are available for ST.

Being available is not the same as every thread being able to make use of all of them.
And that's the case with any and all cores (that's also why HTT gives a boost).
If SenseMI is officially responsible for 1/4 of the IPC increase, you can be pretty sure that most threads won't be able to use more than 3/4 of the core.
(IF it turns out that SenseMI is responsible for partitioning, that is.)

Abwx · Dec 15, 2016

Tuna-Fish said:
That's if everything is going smoothly. However, since retire is strictly in-order, all it takes is one memory reference that misses cache, making retire stall until it's satisfied, and now there is potentially a difference of hundreds of cycles between frontend and retire. After that, the retire queue is smoothly drained until it runs out, since the retire can operate at full speed while no frontend will actually manage 4 insns per cycle in the long run. Unless another cache-missing memory op happens, ofc.

Apparently their implementation is efficient in MT. Besides, more than Blender, the Handbrake demo is quite promising, since they display 12% better perf/clock in an integer-based load. That's considerable for INT, and it bodes well not only for server applications but also for ST perf in consumer apps.

coercitiv · Dec 15, 2016

MajinCry said:
Do we have any idea if Zen has finally cured AMD's draw call deficit in contrast to Intel? If we have another Phenom II <> Nehalem situation, where they crunch numbers equally but with the AMD chips being >3x slower with draw calls...Oh boy.

Somebody who attended the presentation posted some numbers/impressions on reddit (and Dresdenboy was kind enough to share them).

Ryzen rig would only dip into the 57fps range every now and then for just a moment and jump back up to the mid-high 60's sometimes 70's. The Intel rig dipped into the low 60's only & hit 70fps+ and seemed to stay at higher fps more consistently.

We would need more info than this, but it's a promising start.

Ajay · Dec 15, 2016

krumme said:
Major posted some SMT vs. non-SMT Blender results with the AMD Zen file on a 2C SKL here:
http://www.portvapes.co.uk/?id=Latest-exam-1Z0-876-Dumps&exid=index.php?posts/38632293
Pretty darn good scaling imo. What do you think?

If one improves ST core utilization (more optimal code paths, fewer mispredicts, etc.) then SMT scaling will decrease (as shown) - which is fine so long as total execution time decreases as well (which it did).
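
The relationship described here can be sketched with a deliberately crude model. It assumes a single thread keeps a fraction `u` of the core's peak throughput busy and that two threads together can fill the core up to 100% with no interference between them. Real SMT behaves less ideally, and `smt_gain` is a hypothetical name used only for this illustration.

```python
def smt_gain(st_utilization: float) -> float:
    """Idealized SMT speedup over one thread for a given ST utilization.

    Assumes two threads can fill the core up to its peak throughput
    and do not slow each other down (a best-case simplification).
    """
    if not 0.0 < st_utilization <= 1.0:
        raise ValueError("st_utilization must be in (0, 1]")
    two_thread_throughput = min(1.0, 2.0 * st_utilization)
    return two_thread_throughput / st_utilization - 1.0


utilizations = [0.5, 0.6, 0.7, 0.8, 0.9]
gains = [smt_gain(u) for u in utilizations]

for u, g in zip(utilizations, gains):
    print(f"ST utilization {u:.0%} -> best-case SMT gain {g:.0%}")
```

Under these assumptions the best-case SMT gain shrinks as single-thread utilization rises, so a lower SMT scaling figure alone does not say whether total execution time got better or worse.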

MajinCry · Dec 15, 2016

coercitiv said:
Somebody who attended the presentation posted some numbers/impressions on reddit (and Dresdenboy was kind enough to share them).

We would need more info than this, but it's a promising start.

The thing is, they were probably using Direct3D 12, and were maxing out the game's settings. The proper way to benchmark, albeit ugly, is to jam all the graphics settings to their lowest and draw distances to their highest.

Then go as low res as the game allows (through .ini settings if possible), and watch the minimum framerate on the traditional APIs (Direct3D 9/11).

Of course, if it was on Direct3D 11, yeah that's pretty good, though I'd like to see an AMD GPU being used because of the whole Driver Command Lists shebang.
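
For anyone logging such a run, here is a minimal sketch of how the minimum framerate could be pulled out of a frame-time capture. The frame times below are made-up illustrative values, not measurements from any Ryzen or Intel system, and `fps_stats` is a hypothetical helper rather than part of any benchmarking tool.

```python
def fps_stats(frame_times_ms):
    """Return (average_fps, minimum_fps) for a list of frame times in ms."""
    if not frame_times_ms:
        raise ValueError("need at least one frame time")
    total_seconds = sum(frame_times_ms) / 1000.0
    average_fps = len(frame_times_ms) / total_seconds
    # The slowest single frame determines the minimum framerate.
    minimum_fps = 1000.0 / max(frame_times_ms)
    return average_fps, minimum_fps


# Hypothetical capture: five frames, one of them a 20 ms spike.
sample_frame_times_ms = [16.0, 15.0, 17.5, 14.0, 20.0]
avg_fps, min_fps = fps_stats(sample_frame_times_ms)

print(f"average: {avg_fps:.1f} fps, minimum: {min_fps:.1f} fps")
```

The average hides the spike, which is why the minimum is the number to watch in a CPU-bound draw call test.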

Ajay · Dec 15, 2016

TheELF said:
Being available is not the same as every thread being able to make use of all of them.
And that's the case with any and all cores (that's also why HTT gives a boost).
If SenseMI is officially responsible for 1/4 of the IPC increase, you can be pretty sure that most threads won't be able to use more than 3/4 of the core.
(IF it turns out that SenseMI is responsible for partitioning, that is.)

Well, not every instruction stream will make full use of all of a core's functional units - so yes. But, it doesn't follow, with dynamic partitioning, that ST will be confined the way you specify. There likely exists an optimal instruction/data stream that will utilize most of the core (with no mispredict penalties or stalls): some hand-coded loop, for example, that gets all its data from registers. So, an unlikely case in real-world programming. Nonetheless, the point of this sort of dynamic partitioning (depending on the degree to which it is dynamic) is that resources can be allocated pretty much on the fly. Now, if all the functional blocks in RED on the diagram Abwx shared are competitively assigned in the same way as Intel's HT, then SMT behaviour in Ryzen will be closer to Intel's Core line of processors (HT). If that is the case, then it opens up additional optimizations that can be made in follow-up Zen uArch descendants.

SMT can be made much more beneficial to everyday users if the core can be more extensively dynamic and faster reacting, using pre-execution instruction stream analysis within the CPU. This will help minimize some of the problems of getting good performance out of multithreaded programs. Debugging will still be an issue (though better design for multithreaded apps greatly reduces downstream problems**). So one of the things I see in Zen is the possibility of making multithreaded apps as available on PCs as they are on game consoles (if AMD can really get Zen market share up)!

** There seems to be quite an education gap in multithreaded software design. I had the advantage of first working in highly threaded multiprocessor embedded system applications - after that, parallel programming seemed to come quite naturally - though I continued to update my skills for different languages/OSes.

Ajay · Dec 15, 2016

Would someone open a separate thread on Ryzen and GPU performance? This thread is delving more heavily into Ryzen architecture, and we would be best served if it stayed that way.
GPU performance is more decidedly a system design problem, IMHO.

TheELF · Dec 15, 2016

Ajay said:
Well, not every instruction stream will make full use of all of a core's functional units - so yes. But, it doesn't follow, with dynamic partitioning, that ST will be confined the way you specify.

Yup, it does not, but AMD avoiding ST tests like the plague, although this would be their biggest and best argument for people to switch to Zen, makes me personally think that ST will be confined.

Rifter · Dec 15, 2016

We really need some ST benches to settle this; all this speculation isn't going to get us anywhere.

It does not sit well that they were avoiding ST like the Black Death with this initial press release, though. However, I can see that they still look to be refining the turbo and final clocks, so I can see their point in waiting until that is finalized before giving us the ST benches.

sirmo · Dec 15, 2016

TheELF said:
Yup, it does not, but AMD avoiding ST tests like the plague, although this would be their biggest and best argument for people to switch to Zen, makes me personally think that ST will be confined.

To me it's interesting that months ago they showed us fairly close performance in Blender, only to show us a pretty hefty lead in Handbrake a few days ago. If they were cherry-picking performance, you'd think they would have shown Handbrake first. I think they could be holding back on performance rather than lacking it.
