I see a lot of misconceptions in this thread.
1. The thread parent, in the same paragraph, claimed that Zen would be a hit in HPC while simultaneously denying the existence of AVX software. HPC runs 100% AVX workloads, and HPC-specific literature discusses AVX throughput extensively.
Zen is clearly an efficient design, and it could potentially replace Intel or act as a second vendor in a range of server use cases, particularly web and database. However, there are several obstacles preventing it from being a general replacement for Intel's Xeon series. The lack of AVX (especially AVX-512) units makes it entirely irrelevant in HPC. Current rumors for Naples indicate a 4-die MCM package, which would essentially mean three layers of NUMA (2 sockets, 4 dies, 2 CCX per die). Compared to Intel's unified L3 cache on Xeon E5s, this would be a significant challenge for scale-up (big server) workloads.
As a scientist who actually uses a number of HPC programs to run calculations, I take issue with your statement that HPC is "100% AVX workloads." While there certainly are programs out there that support AVX and gain a significant speedup from it, the number of programs that use AVX and benefit from it is smaller than you suggest. I never intended to claim that there were no AVX HPC programs (and if I implied otherwise, that was my mistake), just that the number of real HPC programs that use and take advantage of AVX is not as large as you might believe.
The issue with AVX is that its acceleration is achieved, as its name implies, through new vector instructions. However, not all problems can be easily vectorized (or even vectorized at all). Those with even minimal programming experience should understand that turning scalar code into vector code isn't always a trivial task, and is sometimes outright impossible. There are two issues here: 1) the theoretical issue of vectorization (how do I express my problem in a vectorized/array manner?) and 2) the practical implementation (how do I get the system to recognize and accelerate the vectorized code, whether through intrinsics or a compiler). Again, I'm not a programming expert, but I can grasp and appreciate the difficulty of this problem.
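To make those two issues concrete, here's a small Python/NumPy sketch (my own toy example, not taken from any real HPC code). The first computation is trivial to express in array form, so it maps directly onto wide SIMD units; the second has a loop-carried dependency, where each step needs the previous step's result, so it can't simply be turned into one wide vector operation:

```python
import numpy as np

x = np.arange(1_000_000, dtype=np.float64)

# Easy to vectorize: every element is independent, so the whole
# computation maps onto SIMD lanes (this is the case AVX accelerates).
y_vectorized = 2.0 * x + 1.0

# Hard to vectorize as written: each iteration depends on the value
# computed in the previous one (a loop-carried dependency).
def recurrence(v):
    out = np.empty_like(v)
    out[0] = v[0]
    for i in range(1, len(v)):
        out[i] = v[i] + 0.5 * out[i - 1]  # needs out[i-1] first
    return out

z = recurrence(np.ones(4))
# z == [1.0, 1.5, 1.75, 1.875]
```

Real codes are obviously messier than this, but the recurrence is the shape of the problem: no amount of vector width helps a loop that must run one element at a time.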
Code with many scalar sections is unlikely to see much of a speedup (see the link below). Having heard directly from some programmers writing HPC chemistry ab initio/DFT code, the largest gains they've seen from AVX are around 10%. The issue, quoting them directly, is that scalar sections (as discussed above) don't get a speedup and thus limit the benefit you get from AVX.
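The arithmetic behind that limit is just Amdahl's law. Here's a quick sketch with illustrative numbers of my own choosing (not measurements from those codes): if only a small fraction of the runtime is vectorizable, even a generous 8-wide AVX speedup on that fraction barely moves the total.

```python
def amdahl_speedup(vector_fraction, vector_speedup):
    """Overall speedup when only part of the runtime is accelerated."""
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_speedup)

# 10% vectorizable code with an 8x AVX speedup on that part:
print(amdahl_speedup(0.10, 8.0))  # ~1.096, i.e. roughly a 10% overall gain
# A 90%-vectorizable kernel, by contrast, would see a big win:
print(amdahl_speedup(0.90, 8.0))  # ~4.7x
```

So a reported ~10% gain is exactly what you'd expect from a code that is mostly scalar, no matter how fast the vector units are.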
You don't even have to believe me or the programmers I'm talking to.
This presentation from CERN's physics HPC group brings up many of the exact same issues that I discussed.
As for the literature on AVX in HPC, I think some critical reading is needed. I won't deny that AVX is capable of producing tremendous gains in terms of FP performance. But translating those FLOPS into actual, real-life performance isn't easy, as the article cited by Kalmquist and the document I linked to both show.
IMO the most important thing AMD needs to worry about for HPC is not AVX, but memory bandwidth and the processor interconnect in SMP systems. A lot of programs are limited not just by processing speed but by the system's ability to keep all of those processors fed from memory. The "three layers of NUMA" you alluded to might be more problematic than weak AVX performance.
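One way to see the bandwidth point is the roofline model: attainable performance is capped by the lesser of peak FLOPS and bandwidth times arithmetic intensity. A back-of-the-envelope sketch, with made-up but plausible numbers (these are assumptions for illustration, not specs of any real part):

```python
def attainable_gflops(peak_gflops, bw_gb_s, flops_per_byte):
    """Roofline model: the ceiling is either compute- or memory-bound."""
    return min(peak_gflops, bw_gb_s * flops_per_byte)

# Hypothetical chip: 1000 GFLOP/s peak, 80 GB/s per-socket bandwidth.
# A STREAM-triad-like kernel does ~1 flop per 12 bytes moved:
print(attainable_gflops(1000.0, 80.0, 1 / 12))  # ~6.7 GFLOPS: memory-bound
# Doubling the vector width (peak FLOPS) wouldn't help such a kernel at all:
print(attainable_gflops(2000.0, 80.0, 1 / 12))  # still ~6.7
```

For a bandwidth-bound kernel like this, wider vector units change nothing; only more memory bandwidth (or a friendlier interconnect topology) moves the ceiling.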