Ryzen's halved 256bit AVX2 throughput


Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
GPU video encode does not look bad; there are tons of videos on YouTube encoded with the GPU. In fact, every camera on the market encodes its footage in real time with a dedicated chip, not a general-purpose x86 CPU.

The 3D engines are not prepared for GPU rendering yet - they just render using OpenGL - but it is moving fast. Cinema 4D is going to add the AMD renderer, so we will see in a few months.

Maybe "look bad" was the wrong term, but they definitely don't look on par with what CPUs deliver.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
What makes you say that? How do you know whether it is true or not?

First, why would they use CPUs? There is no benefit. GPUs are purpose built for rendering - they are much more efficient at it. Not only that, there is much better software support for GPU rendering than CPU rendering. They just don't gain anything from CPU rendering.

Don't shoot the messenger man. I looked really hard to find a render farm that used GPUs, but I couldn't find any. They all use CPUs, and I think one of the biggest reasons why is because CPUs can access more RAM.

Also bear in mind that rendering and encoding are two separate steps. They could render on GPUs and then encode on CPUs if they really wanted to. Although even then, a GPU is likely a better choice.

I'm curious: when was the last time you saw a quality comparison between footage encoded by a CPU versus a GPU, or Quick Sync?
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
If you are talking about dedicated hardware (Intel QuickSync, AMD VCE, ...), then yes, CPUs do a better job, but it's not true for CUDA/OpenCL encoders.

I'll ask you the same question. When was the last time you saw a qualitative comparison between GPU and CPU encoders?
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
My understanding is that most (all?) large rendering farms use only CPUs. Easier integration, fewer software issues, more RAM, etc.

Exactly. GPUs are much faster, but what if your asset size routinely exceeds the GPU's framebuffer capacity?
 

Stormflux

Member
Jul 21, 2010
140
26
91
Now rendering is another matter, as rendering is naturally more suited to the GPU. But even that has some limitations, as GPUs don't have as much RAM to play with as CPUs. I tried looking, but I couldn't find any examples of a big rendering studio that uses GPUs for rendering.

You're trying real hard to justify the results and purpose of the thread...

Most big studios have spent millions on their CPU-based farms because for decades that has been the only option, and any change isn't going to happen overnight. Now this will be a double whammy...

When investing in future technology and nodes, do you think these big studios are going to wipe out their current investments by adopting AVX2, which their established farms probably don't even support? Or go with the status quo and possibly double down on price/perf?

Let alone find or develop any software TO utilize AVX2, which will be the REAL indicator of future industry trends. RenderMan, the most established renderer on the market, shows no signs of AVX2-specific utilization, but it recently added GPU-accelerated denoising. GPU rendering is fighting over 30 years of established norms and is actually gaining traction. I personally know a few studios that will be using Redshift, a newer (last 3 years) GPU renderer with an out-of-core architecture that can spill into system RAM when GPU memory is exhausted.

The fact is that the industry is leaning on GPU advances more than CPU advances, and any effort is going toward a heterogeneous environment, not specialized segmentation, because of investment returns. Any change has to show it's worth it.

Edit:

The only renderer that pushed out support for a modern instruction set is V-Ray, via Embree. But even then, they're falling into irrelevance, and they have a GPU path of their own.
 
Last edited:
Reactions: lightmanek

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
Ryzen is so good because it lacks 256-bit ops, which are just a waste of power. A 6900K running AVX is 140W TDP; the 1800X is 95W. Ryzen already has at least equal performance per watt compared to Intel's best offerings, and let's wait and compare the 180W Naples to Intel's 180W offerings - it will be very competitive in AVX workloads too.

TDP and power usage aren't the same thing, you understand? The 95W TDP refers to the heat dissipation capability required for the chip. In the Computerbase.de review that I cited in my OP, the 1800X consumed 160W in the multicore Cinebench test, which was just 5W less than the Broadwell-E CPUs.

As for AVX2, here is a benchmark that has heavy use of AVX2:



As you can see, the 6700K is able to overtake the 1800x due to its formidable SIMD capabilities. What this says to me is that as software becomes more and more optimized, we'll see Intel's offerings demonstrate a greater lead over AMD's. Here is a more real world benchmark which has heavy use of SIMD:



Intel should drop 256-bit ops or AMD will offer a faster all-around CPU soon.

LOL sure!
 
Reactions: pcp7 and Sweepr

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
The only renderer that pushed out support for a modern instruction set is V-Ray, via Embree. But even then, they're falling into irrelevance, and they have a GPU path of their own.

I'll take your word for it, man, as you know more about this than I do. Like I said, rendering is naturally suited more to the GPU than the CPU, so it makes sense. I just couldn't find any examples of render farms that use GPUs with a Google search. That's not to say they don't exist, however.

Encoding is another matter though. From everything I've read, GPUs might be fast, but for doing high quality encodes, the CPU is better.
 
Reactions: pcp7

Janooo

Golden Member
Aug 22, 2005
1,067
13
81
The results might be affected more by memory throughput than by CPU capability.
Until the memory speeds are confirmed, any discussion of CPU instructions in light of these results doesn't make much sense to me.
 

TandemCharge

Junior Member
Mar 10, 2017
4
4
16
There are three variables at play here:
1) Ryzen has half the FMA throughput of Intel.
2) Ryzen has 2 AGUs while Intel has 3 (it has an extra store port).
3) Ryzen CPUs have dual-channel memory while the Intel HEDT parts have quad-channel.

Your examples (DAWbench) mostly point to the last two factors rather than AVX2's halved throughput. Unfortunately, due to time constraints, none of the reviewers ran a profiler on the benchmarks, so we may never know for sure.
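For intuition on point 1, here is a back-of-the-envelope peak-throughput sketch (the unit counts and the 3.6 GHz clock are illustrative assumptions for this comparison, not measured figures):

```python
def peak_sp_gflops(fma_units, simd_bits, clock_ghz):
    """Theoretical single-precision peak; each FMA counts as 2 FLOPs."""
    lanes = simd_bits // 32              # 32-bit float lanes per unit
    return fma_units * lanes * 2 * clock_ghz

# Zen 1 executes FP as 128-bit micro-ops (roughly 2x128-bit FMA per cycle);
# Haswell/Broadwell/Skylake have 2x256-bit FMA units.
zen = peak_sp_gflops(fma_units=2, simd_bits=128, clock_ghz=3.6)    # ~57.6
intel = peak_sp_gflops(fma_units=2, simd_bits=256, clock_ghz=3.6)  # ~115.2
```

At equal clocks the 256-bit design has exactly twice the per-core peak, which is the "halved AVX2 throughput" of the thread title; real workloads land well below either peak.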
 

xdfg

Member
Mar 6, 2017
25
5
36
I do not know what "rendering" means in this thread, but a lot of DSP and HPC software already uses AVX. The FFTW library is a common building block in all manner of audio and video DSP, and it has supported AVX+FMA for a long time. Anyone who has run Prime95 (FFT) knows how important SIMD width is to dense compute loads. In fact, Skylake greatly enhances the ability to sustain AVX workloads, due to its dramatically increased L3 cache bandwidth. Optimizing software to fit in L1 cache is often impossible or intractably time-consuming, but L3 is a much easier target (1 MB/thread).

Those claiming that "GPU" will replace software video encoding clearly have no knowledge of the domain. Real-time streaming has always run on ASICs (these days, Intel Xeon E3 QuickSync accelerators), but high-quality offline encoding is strictly a CPU workload. This is not going to change, either, as video encoding is a highly serial workload, where the results of a DSP block (e.g. DCT) are immediately fed into highly branching mode-decision logic.

As for AMD's future plans in Zen 2, they will likely add an AVX-256 unit. I doubt we will see AVX-512 units though, and it may not matter as AVX-512 is mainly desirable for compatibility with Xeon Phi. Intel is unlikely to ever bring AVX-512 to consumer platforms, due to the die area cost.
 
Reactions: Carfax83

krumme

Diamond Member
Oct 9, 2009
5,956
1,595
136
AVX2 is darn costly in mm². Not only the beefy units themselves but the mm² for feeding them. There is probably also some side effect from beefing up all those registers, beyond the pure mm² cost.
My guess is it's a Zen 3, 7nm add-on, if we see it at all. They might just leave that market to Intel and bet on Vega tech tacked on for similar segments with less branchy loads.
 

naukkis

Senior member
Jun 5, 2002
782
637
136
TDP and power usage aren't the same thing, you understand?

Sure. But since 256-bit execution is very power hungry, Intel CPUs will throttle to their rated TDP. Without 256-bit ops, the 140W-TDP Intel chips consume only ~90W, and that's even with the higher voltage and clocks of boost states.

Without 256-bit ops, Intel CPUs could probably have an easy +1GHz of clock headroom. Trading that for 256-bit ops, which are almost useless for most workloads, is just a giveaway to competitors.
 

guachi

Senior member
Nov 16, 2010
761
415
136
The statements in the OP aren't backed up by the graphs.

At least the statement that the AMD chip falls behind as the workload lengthens isn't borne out by the evidence shown. Every chip in the middle-length Blender test gains on the 6900K. All of them.

It's only in the third test where Ryzen falls behind the 6900K. But even then the other AMD chips gain (and so do the other two Intel chips tested). How is it that the lowly 1090T has a 40% gain between the first Blender test and the third, where it goes from 20% as fast as the 6900K to 28% as fast?

In the Handbrake test, the 6850K gains 2% in the H.265 test and the 7700K gains 10% versus their respective H.264 tests. All the Ryzen chips lose 16% relative to the 6900K. The 9590 loses 20% and the 1090T a whopping 70%.
 
Reactions: lightmanek

itsmydamnation

Platinum Member
Feb 6, 2011
2,868
3,419
136
TDP and power usage aren't the same thing, you understand? The 95W TDP refers to the heat dissipation capability required for the chip. In the Computerbase.de review that I cited in my OP, the 1800X consumed 160W in the multicore Cinebench test, which was just 5W less than the Broadwell-E CPUs.

As for AVX2, here is a benchmark that has heavy use of AVX2:



As you can see, the 6700K is able to overtake the 1800x due to its formidable SIMD capabilities. What this says to me is that as software becomes more and more optimized, we'll see Intel's offerings demonstrate a greater lead over AMD's. Here is a more real world benchmark which has heavy use of SIMD:





LOL sure!

It has little to do with SIMD capabilities outside FMA.
Zen has 512 bits of total SIMD width; Skylake has 512 bits of SIMD width. One of the big differences is load/store bandwidth and memory bandwidth. So what that actually means is that as workloads become more and more optimized (more SIMD ops per memory op), Zen is likely to improve more than Skylake does.
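The "SIMD ops per memory op" argument is essentially the roofline model. A minimal sketch (the peak and bandwidth numbers here are made up purely for illustration):

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline model: performance is capped either by the compute peak
    or by how fast memory can feed the execution units."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# Two hypothetical cores with the same memory bandwidth but a 2x gap
# in SIMD peak, across kernels of rising arithmetic intensity:
for intensity in (0.5, 2.0, 8.0):                  # FLOPs per byte moved
    narrow = attainable_gflops(60, 40, intensity)
    wide = attainable_gflops(120, 40, intensity)
    print(intensity, narrow, wide)
```

At low intensity both cores are bandwidth-bound and tie; the wider SIMD unit only pulls ahead once the kernel does enough math per byte, which is the crux of the disagreement about whose design benefits more from further optimization.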

But that said, you make it sound like AVX/AVX2 are some new things that no one uses and everyone is going to suddenly start using for some reason... That completely ignores reality: AVX has been available for six years.
 

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
Zen has 512bits of SIMD width, Skylake has 512bits of SIMD width.
Yet somehow Zen has half the FMA throughput ^^
But that said you make is sound like AVX/2 are some new things that no one uses and everyone is going to just start using it because some reason......
For reference: see how Blender's AVX support landed just months after AMD showed Zeppelin against the 6900K in it. Need more reasons?
 
Reactions: pcp7 and Carfax83

itsmydamnation

Platinum Member
Feb 6, 2011
2,868
3,419
136
Yet somehow Zen has half the FMA throughput ^^
learn to read mate......

For reference: see how Blender's AVX support landed just months after AMD showed Zeppelin against the 6900K in it. Need more reasons?
When you have a four-year release cycle that can happen; if you cared enough, you could have compiled from source and gotten all of that improvement ages ago.
 

naukkis

Senior member
Jun 5, 2002
782
637
136
256-bit SIMD units do not increase power consumption by anywhere near two-fold, which is the whole point of vector units. At stock frequencies, it adds perhaps 10-20 W to TDP for over 50% performance gains on appropriate workloads. AVX also does not limit attainable OC by anywhere near 1 GHz. Kaby Lake is commonly stable at 5 GHz for non-AVX and 4.7-4.8 GHz for AVX, but it supports a separate AVX frequency target, so you can always be running at maximum OC.

You are seriously underestimating the power needs of 256-bit execution at high frequencies. Intel's core arch increases power usage massively when executing SIMD - it's not far from doubling. And since Haswell, the design priority has been 256-bit execution, so the cores have 256-bit load/store, more L1-L2 bandwidth, etc., needing huge fabric at the core level, which limits clocking potential and makes the design much more complicated.

Zen instead is very balanced - no huge power increase with SIMD execution - and it has massively more clocking potential than Intel's archs, and Intel surely acknowledges that now. I bet Intel has no choice but to strike back with a similar design - abandoning too-wide SIMD to stay competitive.
 

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
- it has massively more clocking potential than Intel's archs,
TomCruiseLaugh.jpg

I mean, seriously, not on 14nm LPP.
When you have a four-year release cycle that can happen; if you cared enough, you could have compiled from source and gotten all of that improvement ages ago.
And you would be wrong, as illustrated by The Stilt right around the time of the demo.
 
Reactions: pcp7

naukkis

Senior member
Jun 5, 2002
782
637
136
Is that why Kaby Lake can routinely OC to 5 GHz, while ZEN only achieves 4 GHz? The entire point of SIMD is increased power efficiency by reducing the fraction of resources spent on instruction fetch/decode and maximizing utilization of ALUs. As I previously stated, you can set a separate target frequency for AVX OC in KBL.

Zen OC is severely limited by the manufacturing process. On the SIMD ALU point I agree, and that was the approach with Sandy/Ivy, but with Haswell Intel decided to implement 256-bit load/store capability to increase SIMD ALU utilization, which increased complexity and decreased clocking potential.

And a CPU benefits more from high clocks than from wider SIMD. There are very few applications where 256-bit SIMD execution brings real performance increases, as most heavy SIMD execution is memory-bandwidth limited anyway.

We'll see soon when Naples launches; it's spot on in bringing more memory and L3 bandwidth instead of wider SIMD, and it will probably end up very competitive against Intel's offerings even in highly optimized AVX2 workloads.

Didn't you guys see what AMD demoed with Naples? Seismic analysis, which is absolutely something that benefits from wide SIMD - and Intel has optimization guides for Xeon Phi and 512-bit SIMD...
 
Last edited:

lopri

Elite Member
Jul 27, 2002
13,211
597
126
Is that why Kaby Lake can routinely OC to 5 GHz, while ZEN only achieves 4 GHz? The entire point of SIMD is increased power efficiency by reducing the fraction of resources spent on instruction fetch/decode and maximizing utilization of ALUs. As I previously stated, you can set a separate target frequency for AVX OC in KBL.
Kaby Lake is a quad-core design, and Zen is an octa-core design. And Zen is AMD's first shot at 14nm, while Kaby Lake is Intel's 4th (or 3rd, depending on who you ask). What you stated about SIMD efficiency is true enough, but the issue for both Intel and end users is that AVX is used for only parts of the code (if at all). That means even when the workload is 5% AVX, once it is detected the CPU clock has to be lowered, penalizing the 95% of the workload that is non-AVX. This causes an actual productivity loss.
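That clock-penalty argument can be sketched as a toy time model (the frequencies, the 5% AVX share, and the 2x per-clock AVX gain are all hypothetical numbers, not measurements):

```python
def mixed_runtime(work, avx_ghz, avx_frac, avx_speedup):
    """Time for a job whose AVX fraction runs avx_speedup times faster
    per clock, while the *entire* job is held at the lower AVX clock."""
    avx_time = work * avx_frac / avx_speedup / avx_ghz
    rest_time = work * (1 - avx_frac) / avx_ghz
    return avx_time + rest_time

scalar_only = 100 / 5.0                         # all-scalar job at 5.0 GHz
with_avx = mixed_runtime(100, 4.7, 0.05, 2.0)   # 5% AVX, clock drops to 4.7
# With so little AVX work, the clock penalty on the other 95%
# outweighs the speedup, and the AVX-using build is slower overall.
```

Flip the AVX fraction up toward 50% and the trade inverts, which is why the argument hinges on how much of a real workload actually vectorizes.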
 

TandemCharge

Junior Member
Mar 10, 2017
4
4
16
I don't think AMD is looking for high overclocks with this chip; they are using high-density libraries for it.
Their primary target is Broadwell-D: if you look, the clocks and memory configuration are similar. There is another thread about power consumption at lower clocks and voltages, and you will find that Ryzen is competitive in this regard.
The other target is the EX-series chips, which is where Naples goes.

All of these chips - whether Ryzen, Broadwell, or Skylake EX - will run at lower clocks in server usage, and performance is probably targeted at that.

Intel has massive engineering resources, which means it can design a chip that targets client and server workloads at the same time. AMD doesn't have that luxury, and hence the lucrative server market is their first target.
 

knutinh

Member
Jan 13, 2006
61
3
66
You're missing the point. The point is that Intel's margin of victory increases with heavier workloads and with greater AVX2 optimization. I've never done rendering, but I've heard that professional rendering/encoding jobs can take many hours to complete. With that in mind, the performance gap will likely be much larger than in the small rendering and encoding jobs that we see in the Ryzen reviews, including even Computerbase.de's review, and they used the biggest jobs I could find.
So Intel has 2x the peak AVX FMA throughput of AMD. Even with 2x the memory bandwidth, I would not necessarily expect a 2x speedup on something like "professional rendering" - perhaps on something really streamlined and FMA-centric like matrix multiplication or convolution.

For maximum performance, you need a smart compiler with optimized settings and/or someone hand-writing intrinsics for the target. Did they use that for this new AMD chip?

I do not understand why Intel should suddenly get the upper hand on big jobs. Some warm-up effect of branch prediction? Thermals?
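One generic reason to expect less than 2x is Amdahl's law: only the FMA-bound fraction of a render accelerates. A rough sketch (the 60% vector fraction is a made-up example, not a profile of any real renderer):

```python
def overall_speedup(vector_frac, vector_speedup):
    """Amdahl's law: only the vectorized fraction of the job gets faster."""
    return 1.0 / ((1.0 - vector_frac) + vector_frac / vector_speedup)

# Even if doubled FMA throughput made the SIMD-heavy portion 2x faster,
# a job that is 60% SIMD work speeds up well under 2x end to end:
print(round(overall_speedup(0.6, 2.0), 2))   # ~1.43
```

The longer the non-SIMD tail (scene traversal, shading logic, I/O), the smaller the end-to-end gap between the two designs.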
 