Will AMD support AVX-512 and Intel TSX?


NTMBK

Lifer
Nov 14, 2011
10,269
5,134
136
Ehh, I wouldn't use an integrated GPU for everything when it has a far more restricted programming model than a CPU and very little register space too. There are simply too many pitfalls for an integrated GPU to entirely replace the tasks that a CPU's SIMD extensions would handle.

First of all, there's an overhead cost to be paid for CPU-GPU communication despite all of AMD's gallant efforts with HSA, whereas a SIMD extension is integrated directly into the CPU.

Second, there aren't many established frameworks out there that embrace the idea of heterogeneous programming. Many app writers would rather have an ISA extension than deal with another ISA altogether; just look at how slowly CUDA adoption has progressed, and then there's the Cell processor, a spectacular example of a failed heterogeneous design (it didn't help that it also had an extreme NUMA setup). The consensus is that symmetry is highly valued when it comes to hardware design.

Third, I feel that GPUs are arguably too wide for their own good to take advantage of the many different vectorizable loops that a CPU SIMD unit would. A growing trend with GPUs is that you need an extreme amount of data-level parallelism to exploit them, and another problem is that you're also limited in the number of vector registers a GPU gives you. With a CPU the common worst case is spilling to the L2 or even the L3 cache, but with a GPU you'll run the caches dry very quickly and soon hit the wall that is main memory, which causes significant slowdowns once you're practically bandwidth bound.

In short, GPUs have too many pitfalls in comparison to CPU SIMD. I think AVX-512 strikes the ideal balance between programming model and parallelism. It also makes vectorization at the compiler level easier compared to AVX2 or previous SIMD extensions. Game consoles should have AVX-512 since it meets the needs of game engine developers and arguably hardware designers, with lower latency as a bonus side effect.

AVX-512 is also low-hanging fruit for increasing the performance of next-gen console CPUs (we're already at our limit on the CPU side for current-gen consoles, and there will always be loops that are too small to see a speedup on GPUs, so a wider SIMD would come in handy along with a vastly more refined programming model). I believe AVX-512 is the console manufacturers' salvation for their perf/mm^2 and perf/watt targets when we consider that ALUs are cheap these days. They can't increase clocks without sacrificing perf/watt, and they definitely do not want something as beefy as Ryzen that sacrifices so much die space for relatively little gain in ILP (graphics and physics programmers are going to get mad about where all that die space went), so I don't think it's a coincidence that AVX-512 would slot in nicely in these scenarios ...

The area spent on AVX-512 and the datapaths to support it would be considerable. There are better ways to spend that area- Jaguar CPUs are far from maxed out in scalar ILP. What you are proposing sounds more like Xeon Phi than a general purpose CPU. And even if you don't want to beef up the CPU, you can throw some more GPU shaders in instead.
 

raghu78

Diamond Member
Aug 23, 2012
4,093
1,475
136
AVX-512 is a terrible waste of die space and power from AMD's point of view. They are better off improving the CPU execution engine, branch prediction, cache performance, and memory latency, which will have a direct impact on all types of applications rather than a very minor subset. I do not think AMD will add 256-bit AVX units even with Zen 2 and Zen 3. AMD's strategy should be to use the best compute engine for the workload: for general-purpose integer workloads it's the CPU, and for HPC/AI/machine learning it's the GPU. AMD also has APUs which provide the best of both.

I would like to see AMD move to an 8-core CCX with higher fabric speeds in Zen 2, although it seems like they may move to a 6-core CCX. For the next-gen consoles, which will probably come out in late 2020, AMD should focus on 4K 60fps gameplay as a standard. AMD are better off pushing core clock speeds as high as possible within the 30-35W TDP budget for the CPU portion of a game console SoC, to make sure that a CPU bottleneck never arises in games and the next-gen consoles are able to churn out buttery-smooth 60 fps gaming.
 
Last edited:
Reactions: Drazick

itsmydamnation

Platinum Member
Feb 6, 2011
2,864
3,417
136
You people are missing the point; the internal unit cost is only one part. The bigger part is that every interface has to be twice as wide. AMD has 4x 128-bit SIMD units; Intel in effect has 2x 128/256-bit SIMD units plus one for specific tasks. If AMD wants more SIMD throughput they can increase the number of load/store/AGU resources in the core. This would have the added benefit of increasing SMT performance and making it more likely to sustain 5-6 uops a cycle, which would be a win vs Intel, as Skylake is really limited to about 4 uops a cycle. The question is what the power cost and the cache complexity cost would be.

Processing a 256-bit vector in 128-bit units costs only one extra cycle for most operations; with 4 SIMD units, load and store are far more likely to be the biggest current bottleneck. Moving to 256-bit loads and stores doesn't help anything except 256-bit ops.
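To make that concrete, here's a rough sketch (AVX intrinsics, purely illustrative of the idea rather than what the hardware literally does) of a 256-bit add expressed as the two 128-bit halves a core with 128-bit datapaths would crack it into:

Code:
#include <immintrin.h>

// Illustrative only: one 256-bit add done as two independent 128-bit adds,
// which is roughly how a core with 128-bit SIMD units executes a 256-bit op.
static __m256 add256_as_two_128(__m256 a, __m256 b)
{
    __m128 lo = _mm_add_ps(_mm256_castps256_ps128(a),    // lower 128 bits
                           _mm256_castps256_ps128(b));
    __m128 hi = _mm_add_ps(_mm256_extractf128_ps(a, 1),  // upper 128 bits
                           _mm256_extractf128_ps(b, 1));
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
}

The two halves are independent, which is why the cost is roughly one extra cycle rather than double; it's the loads and stores feeding them that are the harder thing to double up.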
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,655
136
You people are missing the point; the internal unit cost is only one part. The bigger part is that every interface has to be twice as wide. AMD has 4x 128-bit SIMD units; Intel in effect has 2x 128/256-bit SIMD units plus one for specific tasks. If AMD wants more SIMD throughput they can increase the number of load/store/AGU resources in the core. This would have the added benefit of increasing SMT performance and making it more likely to sustain 5-6 uops a cycle, which would be a win vs Intel, as Skylake is really limited to about 4 uops a cycle. The question is what the power cost and the cache complexity cost would be.

Processing a 256-bit vector in 128-bit units costs only one extra cycle for most operations; with 4 SIMD units, load and store are far more likely to be the biggest current bottleneck. Moving to 256-bit loads and stores doesn't help anything except 256-bit ops.
Well, that's a better way of stating my point. I was more talking about what the "actual" penalty is for AMD with AVX2 (on Ryzen 7, none) and what it means against its competition (very small losses, to breaking even with the higher-core-count offerings). Besides what you said, there isn't value in upping single-core AVX2 work as long as they have a core advantage. But your point is just as good. What would full 256-bit AVX bring? It would make AMD better, because of their core lead, in AVX2-specific tasks. But that is where it ends. Nothing else would benefit from working that into the core. About the only reason for AMD to do it would be if AVX512 became important and they wanted to do the same resource sharing (which honestly I don't know is even possible).
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
Except, per mm^2, including AVX512 is very poor value for processing every other instruction.

Actually AVX-512 provides very good value in terms of compute density compared to every other instruction. It's the best value, since you can process sixteen 32-bit elements versus the one 32/64-bit element a fully fleshed-out scalar execution unit can handle. It's pretty clear that a lone execution unit will almost never match a SIMD unit in transistors per processed element, and to make matters worse, quite a few of the execution units on a CPU are 64 bits wide while game code rarely needs that much precision, so the worst case is that a compiler just promotes most of those 8/16-bit ops to higher precision, which translates to waste in the end ...
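To put a number on that density, a single AVX-512 instruction covers what a scalar loop needs sixteen iterations for (a sketch with AVX-512F intrinsics; assumes n is a multiple of 16 just to keep it short):

Code:
#include <immintrin.h>

// Scalar: one 32-bit add per instruction.
void add_scalar(float* dst, const float* a, const float* b, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}

// AVX-512F: sixteen 32-bit adds per instruction (n assumed a multiple of 16).
void add_avx512(float* dst, const float* a, const float* b, int n)
{
    for (int i = 0; i < n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(dst + i, _mm512_add_ps(va, vb));
    }
}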

Except, you have to radically downclock your core to remain within thermal margins. Which introduces all sorts of issues in code parallelism where it's not simply a matter of feeding every core with similar instructions, and in decoupling the core running AVX512 from the uncore feeding other threads.

Depends on what you mean by "radically". We have no data on AVX-512, but if we take a look at the Intel Xeon E5-2697 v3 (14 cores, 35MB of cache), there's only a ~15% hit in clocks going from SSE to AVX1/2. That's not bad, all things considered, when you get roughly a 70% FLOPs increase in return (twice the width at ~0.85x the clock works out to about 1.7x the throughput) ...

They can. Process shrinks. IIRC neither console on a store shelf is on the latest process at the minute.

You act as if Dennard scaling still existed, but it never accounted for leakage current or threshold voltage, and those don't scale with transistor dimensions, so frequency is far from low-hanging fruit when it comes to perf/watt. The newest mid-generation console updates, the PS4 Pro and Project Scorpio, only give frequency boosts of 31% and 37% respectively, and they're built on the much more recent 16/14nm process nodes.

Just how long do you think it will remain viable to increase frequency for relatively little decrease in perf/watt?

Note also that AVX512 sacrifices clocks to fit in the same power envelope as other operations.

It's a much better trade off when we look at it from a FLOPs/watt perspective ...

As opposed to a core that devotes significant space to delivering one subset of instructions to the detriment of others?

Increasing your throughput by 2x the ops isn't so bad when we take a look at Skylake vs Zen. The former has a narrower pipeline than the latter (4 uops vs 5 uops per cycle) yet delivers higher real-world IPC. It's kind of funny that an 8-core Skylake-X will most likely have a budget of 3.x billion transistors compared to Ryzen 7's 4.8 billion, yet the former will deliver higher IPC and full-rate AVX/AVX2 when all is said and done ...

If you have AVX512, you're probably going to be sacrificing the number of cores you have to use (to fit in the same mm^2 and power budget).

I'm not sure why you're exaggerating the cost of having a 2x wider SIMD so much if we take a look at Westmere-EX and Sandy Bridge-EP. A full-fledged Westmere-EX has 10 cores whereas Sandy Bridge-EP has 8, but the latter also has a roughly proportionally smaller die and lower transistor count, all the while getting a 2x wider SIMD in the end. I fail to see what's so significant about the cost of moving to a wider SIMD from a perf/mm^2 perspective ...

From a programmer perspective, it may be nice to have. But, if you are wise enough to appreciate the losses you'll have elsewhere in the CPU to deliver that AVX512 (all other things being equal), then it makes little sense.

What's not wise is convincing yourself to naively believe that there will be a higher payoff from increasing IPC, when Skylake vs Zen shows that's not the case ...
 

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
Why should game consoles take the responsibility of increasing AVX512 adoption?

Is LINPACK restricted to L1 a "game"?

CPUs spilling over their caches isn't a problem but GPUs doing the same is? Why do you think SKL-X has 4X the L2 then?

SSE 4.1/4.2 has been around for ten years, but only now are we seeing games not working on older CPUs that don't support it. AVX is even more specialized.
 
Reactions: Drazick

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
The area spent on AVX-512 and the datapaths to support it would be considerable. There are better ways to spend that area- Jaguar CPUs are far from maxed out in scalar ILP. What you are proposing sounds more like Xeon Phi than a general purpose CPU. And even if you don't want to beef up the CPU, you can throw some more GPU shaders in instead.

Kind of a stretch to claim that what I'm proposing has more in common with Xeon Phi, when I'm not wishing for next-gen console CPUs based on P54C or Airmont cores with two 512-bit vector units. My highest expectation is cut-down Haswell cores (for lower power) with a 512-bit vector unit bolted on, a bug-free TSX implementation, and a modest clock increase to 2.8 GHz. A higher-clocked 16-core Xeon D with AVX-512 is more in line with what I had in mind ...

Trust me, you definitely do not want to use GPU shaders for everything that's deemed parallelizable when the programming models of CPUs are light years ahead of GPUs. CPU SIMD is as ideal as it gets for small-width, register-heavy loops. To really get a noticeable speedup with GPUs you need very wide loops with very low register pressure ...

That's why I keep lobbying AMD to push AVX-512 in their processors, so that one day AVX-512 may fall into the hands of console programmers, since it makes sense from a perf/watt, perf/mm^2, and ease-of-use perspective (low latency and a homogeneous ISA; hardly anybody wants a repeat of the PS3 with its DMA engines and 3 ISAs if you include the GPU ISA) ...

Even half-rate AVX-512 would be fine in next-gen consoles, given the programming improvements in the extension itself compared to AVX/AVX2 ...
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
Why should game consoles take the responsibility of increasing AVX512 adoption?

Is LINPACK restricted to L1 a "game"?

CPUs spilling over their caches isn't a problem but GPUs doing the same is? Why do you think SKL-X has 4X the L2 then?

SSE 4.1/4.2 has been around for ten years, but only now are we seeing games not working on older CPUs that don't support it. AVX is even more specialized.

I didn't claim that spilling past the cache isn't a problem for a CPU; it's that GPUs are much more likely to spill to main memory, since their register files are multiple times larger than their caches, whereas it's the opposite for CPUs, where the register file is tiny in comparison to the cache.

The short of it is spilling to cache > spilling to MEM ...

I also don't think game consoles should be the ones with absolute sole responsibility to increase AVX-512 adoption but I expect AAA games to be a 'trendsetter' for many other high performance applications. AAA games soon became 64-bit only which only resulted in the betterment of technical quality ...

For all the downplaying and bashing consoles get they seem to set higher standards (D3D12 equivalent gfx API and AVX) than us high end 'privileged' PC users if we can even call ourselves that when there's nothing privileged about getting downgraded software where the devs are enjoying their walled gardens called 'game consoles' ...

Kind of pathetic that the devs behind No Man's Sky wouldn't keep the SSE4.2 requirement because of some Phenom users out there, and I'm absolutely disgusted that Intel would even think about selling CPUs with no AVX in this day and age ...
 

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
I didn't claim that spilling past the cache isn't a problem for a CPU; it's that GPUs are much more likely to spill to main memory, since their register files are multiple times larger than their caches, whereas it's the opposite for CPUs, where the register file is tiny in comparison to the cache.

The short of it is spilling to cache > spilling to MEM ...

I also don't think game consoles should be the ones with absolute sole responsibility to increase AVX-512 adoption but I expect AAA games to be a 'trendsetter' for many other high performance applications. AAA games soon became 64-bit only which only resulted in the betterment of technical quality ...

For all the downplaying and bashing consoles get they seem to set higher standards (D3D12 equivalent gfx API and AVX) than us high end 'privileged' PC users if we can even call ourselves that when there's nothing privileged about getting downgraded software where the devs are enjoying their walled gardens called 'game consoles' ...

Kind of pathetic that the devs behind No Man's Sky wouldn't keep the SSE4.2 requirement because of some Phenom users out there, and I'm absolutely disgusted that Intel would even think about selling CPUs with no AVX in this day and age ...
It's not as if GPUs aren't making any progress in addressing these issues regarding cache spillage. Volta has configurable L1, better cache utilization and bigger registers. GCN has also made some progress in this regard.

Cache spillage is the biggest issue when it comes to running AVX on CPU.

AVX doesn't result in the 2X speedup it promises wrt SSE, except in highly specialized cases, and in those cases most of the time is spent waiting for vectorized loops to complete. I don't think this is what your typical game logic is like.

No Man's Sky is far from being a AAA title that pushed technical boundaries. Resident Evil 7 is a better comparison, though it was strange that it ran perfectly until a point in the game on CPUs without SSE4, which suggests to me that it was not a technical limitation per se.
 
Reactions: Drazick

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
It's not as if GPUs aren't making any progress in addressing these issues regarding cache spillage. Volta has configurable L1, better cache utilization and bigger registers. GCN has also made some progress in this regard.

Cache spillage is the biggest issue when it comes to running AVX on CPU.

AVX doesn't result in the 2X speedup it promises wrt SSE, except in highly specialized cases, and in those cases most of the time is spent waiting for vectorized loops to complete. I don't think this is what your typical game logic is like.

No Man's Sky is far from being a AAA title that pushed technical boundaries. Resident Evil 7 is a better comparison, though it was strange that it ran perfectly until a point in the game on CPUs without SSE4, which suggests to me that it was not a technical limitation per se.

I'm not saying that GPUs aren't making any progress, but GPUs and CPUs will most likely forever have different parallel programming paradigms that are not very compatible with each other ...

The GV100 still dwarfs its 6MB of L2 cache with 21MB worth of registers! It's a little better than the GP100 in that respect, since the register file size increased by 40% and the L2 cache by 50% with Volta.

AVX has lots of applicability outside of special cases. Your game logic would have a lot more 'breadth', so to say, in terms of different objects, actors, and sometimes mechanics and AI in the game engine. You might have to populate the virtual world with a bit more detail to take advantage of a wider SIMD width, but it's a performance gain in the end when you only burden 2 cores instead of 4 with these things, so you can save those precious CPU cycles for something else, like audio codecs, or for increasing the other qualities of your game logic ...
 

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
AVX has lots of applicability outside of special cases.
That's debatable.

I certainly don't believe that's the case, else we would have seen it adopted widely as it has been supported since SB. In the context of gaming, maybe it would be useful in things like emulation, where you call a few instructions repeatedly. But that is a niche application.

AVX2 and AVX512 being used by games extensively in the future is unlikely to happen. Console generations stay in the market long enough that the next iteration being faster due to process shrinks and/or architectural changes is worth more than any potential uplift from special instructions, in my opinion.

AVX is best utilized in specific HPC workloads, and I don't see it changing in the future.
 
Reactions: Drazick

NTMBK

Lifer
Nov 14, 2011
10,269
5,134
136
Kind of a stretch to claim that what I'm proposing has more in common with Xeon Phi, when I'm not wishing for next-gen console CPUs based on P54C or Airmont cores with two 512-bit vector units. My highest expectation is cut-down Haswell cores (for lower power) with a 512-bit vector unit bolted on, a bug-free TSX implementation, and a modest clock increase to 2.8 GHz. A higher-clocked 16-core Xeon D with AVX-512 is more in line with what I had in mind ...

Sorry, for context what I thought you were proposing was adding AVX-512 to the existing console CPU cores. Those Jaguar cores are very similar to Silvermont in size and performance, which is where my comparison to Knights Landing came from.

However, that said- 8 Broadwell level cores plus AVX-512 would be an enormous jump in die area, power consumption and cost for the consoles. Given that Zen is already at roughly Broadwell level performance, you're basically proposing Zen plus AVX-512. 8 Zen cores are already 95W- even with a die shrink and reduced clock speeds, taking that and adding AVX-512 would take up a massive chunk of the console's thermal budget. Again, that thermal budget could be better spent on extra GPU shaders.

Trust me, you definitely do not want to use GPU shaders for everything that's deemed parallelizable when the programming models of CPUs are light years ahead of GPUs. CPU SIMD is as ideal as it gets for small-width, register-heavy loops. To really get a noticeable speedup with GPUs you need very wide loops with very low register pressure ...

Remember that with modern console GPUs, multiple compute and graphics tasks can be scheduled simultaneously- you don't need a task wide enough to saturate the entire GPU.

The latency of launching and waiting for a GPU task is still far higher, of course. But I feel that the subset of problems which are wide enough for 8-core 32-element CPU SIMD to be valuable, while not being wide enough for a GPU compute task, is really not big enough to justify the added complexity.

As for the programming model... eh. I've played around with vectorizing compilers in the past, and they're great up to a certain point. You can massage an awful lot of things into autovectorized loops. But then another developer comes along with a bug to fix or a feature to add, slaps an innocuous if/else branch into what looks like a regular old "for" loop, and all of a sudden the vectorization becomes dramatically less efficient... if it doesn't fail to vectorize entirely. Getting good performance out of that vectorized code, and maintaining that good performance, means that you basically need to think like a GPU developer anyway. And if I need my whole team to think like a GPU developer, I'd rather use a GPU-focused API and make it more explicit.
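For a concrete picture of the failure mode I mean (a made-up example, but representative), compare these two loops. The first vectorizes trivially; the second is the kind of "innocuous" edit that forces the compiler into blend/predication tricks, or makes it give up entirely, depending on the compiler and target:

Code:
// Vectorizes cleanly: straight-line arithmetic over contiguous data.
void scale(float* x, float k, int n)
{
    for (int i = 0; i < n; ++i)
        x[i] *= k;
}

// The innocuous-looking edit: a data-dependent branch in the loop body.
// Without good masking support the compiler may emit much slower
// blend/scalar code, or fail to vectorize the loop at all.
void scale_clamped(float* x, float k, float limit, int n)
{
    for (int i = 0; i < n; ++i) {
        if (x[i] < limit)
            x[i] *= k;
        else
            x[i] = limit;
    }
}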

(That being said, I'm not a console developer, I'm a desktop CPU + CUDA developer. So take what I say with a pinch of salt )

That's why I keep lobbying AMD to push AVX-512 in their processors, so that one day AVX-512 may fall into the hands of console programmers, since it makes sense from a perf/watt, perf/mm^2, and ease-of-use perspective (low latency and a homogeneous ISA; hardly anybody wants a repeat of the PS3 with its DMA engines and 3 ISAs if you include the GPU ISA) ...

Even half-rate AVX-512 would be fine in next-gen consoles, given the programming improvements in the extension itself compared to AVX/AVX2 ...

I do like the improvements in AVX-512 from a developer point of view. Masked operations? Scatter support? This is exactly what you need to make autovectorization more feasible. But those features are going to add more and more complexity to the vector units. The number of (named) vector registers has doubled, and each vector is twice as wide- you're up to 2KB of named register space per core. And then you have the additional mask registers on top too. We haven't yet seen how this is reflected in the physical register file, but I can only assume that this will also have grown significantly in SKX. They certainly had to significantly boost L2 cache size in order to deal with the increased cache pressure. And the actual execution units will have to increase in complexity in order to deal with the new masked operations- there will have to be wiring for this additional operand, along with more complication in the scheduling and retiring.
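To show why those two features matter, here's a hand-written sketch (hypothetical, not anything a current compiler necessarily emits): a per-element condition becomes a mask register, and scatter lets the results land at arbitrary indices, which are exactly the patterns that used to block autovectorization:

Code:
#include <immintrin.h>

// AVX-512 sketch: conditionally scale 16 floats at a time, then scatter the
// results to positions given by idx[]. Assumes n is a multiple of 16 and
// that the indices within each batch are distinct.
void scale_and_scatter(float* out, const float* in, const int* idx,
                       float k, float limit, int n)
{
    for (int i = 0; i < n; i += 16) {
        __m512    v = _mm512_loadu_ps(in + i);
        __mmask16 m = _mm512_cmp_ps_mask(v, _mm512_set1_ps(limit), _CMP_LT_OQ);
        // Lanes below the limit get scaled; the rest keep the limit value.
        __m512 r = _mm512_mask_mul_ps(_mm512_set1_ps(limit), m,
                                      v, _mm512_set1_ps(k));
        __m512i vi = _mm512_loadu_si512(idx + i);
        _mm512_i32scatter_ps(out, vi, r, sizeof(float));
    }
}

All of that mask and index plumbing is precisely what the k-registers, the wider physical register file and the beefed-up L2 have to support.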

I just don't think it's a sensible trade off to make in a console. It's a game of tradeoffs, working within constraints to provide the optimal solution. AVX-512 would be a nice feature, but would you really choose it over 20% more GPU shaders?
 
Reactions: tamz_msc

Atari2600

Golden Member
Nov 22, 2016
1,409
1,655
136
Actually AVX-512 provides very good value in terms of compute density compared to every other instruction.

That is not what I meant.

When it's not being used, the baggage necessary to enable AVX512 is much more significant.

Your entire datapath from L1 uncore to SIMD needs to be wide enough to carry the 512bit operation. That is not insignificant. It would also be largely idle for other operations - unless you are willing to kill SMT performance when performing AVX512 ops?

AVX512 is not a free lunch. It has its place. But right now that place is in niche HPC where utilisation is high enough to justify inclusion. Maybe when we're at 5nm processes or something then viability will improve for more general purpose CPUs.
 
Reactions: Drazick

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
Sorry, for context what I thought you were proposing was adding AVX-512 to the existing console CPU cores. Those Jaguar cores are very similar to Silvermont in size and performance, which is where my comparison to Knights Landing came from.

However, that said- 8 Broadwell level cores plus AVX-512 would be an enormous jump in die area, power consumption and cost for the consoles. Given that Zen is already at roughly Broadwell level performance, you're basically proposing Zen plus AVX-512. 8 Zen cores are already 95W- even with a die shrink and reduced clock speeds, taking that and adding AVX-512 would take up a massive chunk of the console's thermal budget. Again, that thermal budget could be better spent on extra GPU shaders.

8 fully fleshed-out Broadwell cores with AVX-512 would be a substantial jump in die area and power consumption, but I wasn't aiming for that either. A cut-down Haswell core as a base would be my ideal starting point: somewhere in between Sandy Bridge and Haswell, then work our way up from there to extend those cores with AVX-512 and bug-free TSX, with higher clocks for a 16-core part ...

All of that seems like a reasonable budget of around ~5 billion transistors with 5nm GAAFETs, and we really don't want to hold back on hitting an aggressive target of under 16.6ms for game logic, since this very well might be our last console generation as we reach a plateau in transistor technology, and I don't want AMD to hold back when they could realistically score a home run with new consoles in 202X having AVX-512 ...

Remember that with modern console GPUs, multiple compute and graphics tasks can be scheduled simultaneously- you don't need a task wide enough to saturate the entire GPU.

It's just absolutely not wise to use the GPU for everything when all a game programmer wants to do is hit performance targets like frametimes and GPUs just don't quite provide the low latencies that a wider SIMD extension could ...

In fact, you could actually hurt performance if you naively try to port moderate-sized workloads from the CPU to the GPU: there's a real possibility that frametimes go up, which is just bad news when sticking with the CPU would have kept them lower!

The latency of launching and waiting for a GPU task is still far higher, of course. But I feel that the subset of problems which are wide enough for 8-core 32-element CPU SIMD to be valuable, while not being wide enough for a GPU compute task, is really not big enough to justify the added complexity.

Which brings me back to the above! I do not believe AVX-512 amounts to a mere practical subset of GPUs, when there are many areas a GPU can't touch in terms of delivering a speedup in applications that a CPU would normally excel in ...

Let's say the field of applications is shaped like a square. Corners 1 & 2 can be accelerated with a GPU, but it can't touch corners 3 & 4. Corner 3 is currently occupied by the CPU, but no processor could touch corner 4 until now, with an extension like AVX-512 showing a definitive advantage on this new family of CPUs ...

It is my view that AVX-512 is meant to reach spots that neither a plain CPU nor a GPU ever could and bring new performance heights to a continuum of applications with different bottlenecks (whether it be throughput, bandwidth, latency, etc.), as laid out by the 4 corners in my example ...

As for the programming model... eh. I've played around with vectorizing compilers in the past, and they're great up to a certain point. You can massage an awful lot of things into autovectorized loops. But then another developer comes along with a bug to fix or a feature to add, slaps an innocuous if/else branch into what looks like a regular old "for" loop, and all of a sudden the vectorization becomes dramatically less efficient... if it doesn't fail to vectorize entirely. Getting good performance out of that vectorized code, and maintaining that good performance, means that you basically need to think like a GPU developer anyway. And if I need my whole team to think like a GPU developer, I'd rather use a GPU-focused API and make it more explicit.

(That being said, I'm not a console developer, I'm a desktop CPU + CUDA developer. So take what I say with a pinch of salt )

That's not true at all. The most common thing you have to worry about on both is getting a good data layout; that's practically half the setup work right there. As long as you can ensure the data sets are big enough, you will almost certainly get a speedup with AVX-512 from the compiler side alone, whereas on a GPU you have to go through many trials of radically GPU-centric algorithms just to get things running efficiently (gotta be careful about that register pressure and those memory access patterns). You can mix and match scalar and vector instructions fairly freely on CPUs, but doing that on a GPU will rear its ugly head with one bad surprise after another, and there are far fewer ways to synchronize across threads/waves ...
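By "good data layout" I mean the usual AoS-to-SoA transformation, roughly sketched below; once the data sits like this, both an autovectorizer and a GPU kernel are happy, which is why I call it half the work:

Code:
// Array-of-structures: each particle's fields are interleaved, so loading
// sixteen consecutive x values means gathering across a stride.
struct ParticleAoS { float x, y, z, mass; };

// Structure-of-arrays: each field is contiguous, so x[0..15] is one straight
// 512-bit load and the loop below vectorizes trivially.
struct ParticlesSoA {
    float* x;
    float* y;
    float* z;
    float* mass;
    int    count;
};

void integrate_x(ParticlesSoA& p, const float* vx, float dt)
{
    for (int i = 0; i < p.count; ++i)
        p.x[i] += vx[i] * dt;   // contiguous loads/stores, no gather needed
}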

I do like the improvements in AVX-512 from a developer point of view. Masked operations? Scatter support? This is exactly what you need to make autovectorization more feasible. But those features are going to add more and more complexity to the vector units. The number of (named) vector registers has doubled, and each vector is twice as wide- you're up to 2KB of named register space per core. And then you have the additional mask registers on top too. We haven't yet seen how this is reflected in the physical register file, but I can only assume that this will also have grown significantly in SKX. They certainly had to significantly boost L2 cache size in order to deal with the increased cache pressure. And the actual execution units will have to increase in complexity in order to deal with the new masked operations- there will have to be wiring for this additional operand, along with more complication in the scheduling and retiring.

And that's what game programmers have been asking for a long time on their wishlist!

I could understand having to increase the L2 cache to reduce some bank conflicts, and widening the load and store ports to 512 bits to sustain the throughput, but AVX-512 isn't ballooning out of control in complexity like you seem to imply. Even the quadrupled register file may look hefty at first, but it's really meager in comparison to the amount GPUs have to dedicate, and GPUs make it seem easier than it is to double the amount of ALUs. Plus, in the future, once we get to denser transistor sizes, the so-called 'complexity' will become vanishingly small, like we see today with x87, MMX, and SSE ...

I just don't think it's a sensible trade off to make in a console. It's a game of tradeoffs, working within constraints to provide the optimal solution. AVX-512 would be a nice feature, but would you really choose it over 20% more GPU shaders?

I don't see how AVX-512 is NOT worthwhile in the future when we consider that CPUs will take up smaller and smaller portions of the die, if console hardware designers only target a modest 85% of Haswell's IPC (basically Sandy Bridge) and bake AVX-512 into those cores. All of this sounds very doable within a ~5 billion transistor budget (all the while getting 16 cores in the process!) when we consider that a Broadwell core is twice as fat (20mm^2, using Xeon E5 as a reference) as a Knights Landing core (10mm^2), which has two AVX-512 units, despite the former having 4x less SIMD width ... (both built on the same 14nm process)

If the new consoles were built around 5nm GAAFETs, the CPUs would most likely only be worth 30mm^2 at MOST, so console manufacturers would really be paying peanuts for AVX-512 in the end, and if some game programmers don't want to mess around with AVX-512 they can just turn the functionality off for higher clocks, so it's a win-win scenario all around as time goes on. Console manufacturers also wouldn't face another codebase fragmentation if they plan to upgrade again (the PS4 Pro is feeling this since its GPU doesn't have the same ISA, though I wonder if it's backwards compatible). Increasing IPC is comparatively transparent at the compiler level when you share the same ISA, so if console manufacturers want, they can go in the direction of upgrading the CPU microarchitecture instead ...

I don't see the big deal of paying a little (low risk) to see big gains (high potential reward), and that's exactly how I view AVX-512 in the future. I'm sure every once in a while an engine programmer would like to see their engine hit the 16.6ms threshold just to reach the magical 60FPS, in case they ever want it.

Besides, if we take a look at the PS4 APU, the Jaguar cores consumed roughly 30% of the die size, so console manufacturers don't share your vision of an extremely skewed CPU-to-GPU compute ratio being ideal as it is, and lots of game programmers don't find that fun either (aside from maybe the graphics programmers).
 
Reactions: Carfax83

knutinh

Member
Jan 13, 2006
61
3
66
...
AVX512 is not a free lunch. It has its place. But right now that place is in niche HPC where utilisation is high enough to justify inclusion. Maybe when we're at 5nm processes or something then viability will improve for more general purpose CPUs.
I would be curious to know what the potential would be for Adobe products (Photoshop, Lightroom) if they hired competent programmers/optimizers and targeted AVX512 + high core counts, given that the iMac Pro is being used by "media professionals".

Is "pixel processing" the bottle neck in those products (as opposed to, say, memory/disk bandwidth, SQL Lite or some LUA cruft in the case of Lightroom), and is the potential for vectorized speedup near infinite as my intuition says?

If you are pushing a slider called "brightness" and a 20-40MP raw (or intermediate developed) source file is processed on-the-fly to render to a 5K display at (ideally) 60fps by "multiplying float32 numbers", it would seem that really wide vector hardware would be a good thing. So why was the "GPU acceleration" a flop that everyone seems to suggest disabling? Perhaps hw/drivers are fragmented. Perhaps the GPU is not sufficiently flexible, doesn't support the programming models and libraries that photo editors rely upon, forces data to be shuffled back and forth between CPU and GPU, or constrains the functionality offered to the user?
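To make the intuition concrete, the inner loop of such a brightness adjustment is, in its simplest hypothetical form, just a streaming multiply, which is about as SIMD-friendly as code gets (sketch with AVX-512 intrinsics; the real pipeline is obviously more involved):

Code:
#include <immintrin.h>

// Hypothetical sketch: apply a linear "brightness" gain to a float32 image.
// Each iteration handles 16 pixels; a tail mask covers the remainder so no
// scalar cleanup loop is needed.
void apply_brightness(float* dst, const float* src, float gain, size_t n)
{
    const __m512 g = _mm512_set1_ps(gain);
    size_t i = 0;
    for (; i + 16 <= n; i += 16)
        _mm512_storeu_ps(dst + i, _mm512_mul_ps(_mm512_loadu_ps(src + i), g));
    if (i < n) {
        __mmask16 tail = (__mmask16)((1u << (n - i)) - 1);
        __m512 v = _mm512_maskz_loadu_ps(tail, src + i);
        _mm512_mask_storeu_ps(dst + i, tail, _mm512_mul_ps(v, g));
    }
}

Whether the real bottleneck is a loop like that or the surrounding database/scripting/I/O plumbing is, of course, exactly my question.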

-k
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
That's debatable.

I certainly don't believe that's the case, else we would have seen it adopted widely as it has been supported since SB. In the context of gaming, maybe it would be useful in things like emulation, where you call a few instructions repeatedly. But that is a niche application.

AVX has already been used in PhysX for quite some time, for cloth simulation. If AVX-512 did get used, it's likely that it would be employed in a physics engine, where the code naturally responds well to vectorization. AVX-512 could very well nullify GPU physics completely. While GPUs are extremely powerful, there is still a massive latency penalty involved with moving these calculations back and forth between the CPU and the GPU. There is actually a lot of growth potential for more advanced physics in games, as physics has been comparatively neglected by developers compared to graphics. I could also see AVX-512 being used for other types of calculations, like occlusion culling.

AVX is best utilized in specific HPC workloads, and I don't see it changing in the future.

To really get the most out of wider vectors, developers will probably have to rethink their coding strategies entirely, similarly to how the advent of multicore CPUs led to game engines which are designed around task based parallelism.
 

tamz_msc

Diamond Member
Jan 5, 2017
3,865
3,729
136
AVX has already been used in PhysX for quite some time, for cloth simulation. If AVX-512 did get used, it's likely that it would be employed in a physics engine, where the code naturally responds well to vectorization. AVX-512 could very well nullify GPU physics completely. While GPUs are extremely powerful, there is still a massive latency penalty involved with moving these calculations back and forth between the CPU and the GPU. There is actually a lot of growth potential for more advanced physics in games, as physics has been comparatively neglected by developers compared to graphics. I could also see AVX-512 being used for other types of calculations, like occlusion culling.



To really get the most out of wider vectors, developers will probably have to rethink their coding strategies entirely, similarly to how the advent of multicore CPUs led to game engines which are designed around task based parallelism.
PhysX is for a subset of PCs with a particular type of hardware. This is about consoles. Do you think there aren't latency issues involved with wider SIMD on the CPU?

Perhaps there are reasons why developers can get away with the level of physics simulation currently present in today's games: more simulation doesn't necessarily imply a better visual presentation. Maybe resources are better spent utilizing more GPU compute for better lighting, weather simulation and post-processing, among other things that are more immediately noticeable.

The 2X throughput that each iteration of AVX promises over the previous one is most readily achieved when calculating loops of the type

Code:
A[i] = B[i]*C[i] + D[i]

Real AVX-accelerated code is far from providing the same kind of speedup. If you can identify the kind of operation I'm talking about, then you'll probably realize that NVIDIA, at least, has you covered. They could easily put a few tensor cores in a consumer GPU if they wanted to, and do the same thing with vectors instead of 4x4 matrices.
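For reference, that loop written out with AVX2/FMA intrinsics (a sketch; in practice you'd let the compiler generate it) is the near-ideal case where the promised 2X over SSE is actually reachable, because it's pure streaming FMA work:

Code:
#include <immintrin.h>

// A[i] = B[i]*C[i] + D[i], eight floats per iteration with AVX2/FMA.
// Assumes n is a multiple of 8; real code needs a remainder loop.
void fma_loop(float* A, const float* B, const float* C, const float* D, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 b = _mm256_loadu_ps(B + i);
        __m256 c = _mm256_loadu_ps(C + i);
        __m256 d = _mm256_loadu_ps(D + i);
        _mm256_storeu_ps(A + i, _mm256_fmadd_ps(b, c, d));
    }
}

Anything less regular than this, and the measured gain over SSE shrinks quickly.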
 
Reactions: Drazick

ryzenmaster

Member
Mar 19, 2017
40
89
61
As vectors become wider, their use cases become narrower. I've heard people chuckle when bringing up Threadripper because they think they have the trump card: "oh, but Intel has better AVX support and in the future it will become more useful". Except that their trump card is about as legit as the other Trump (the one in the White House).

The vast majority of us do not, and should not, care about AVX2 or AVX-512. I find it unlikely that AMD would hold it as a high priority given how vocal they've been about heterogeneous computing. Bulldozer may have been slightly optimistic and ahead of its time when it comes to offloading FP math to the GPU, but I don't see them abandoning the principle any time soon.
 

mattiasnyc

Senior member
Mar 30, 2017
356
337
136
But is it really underperforming? If the rumors on the top 16c chip being $1k are true, it will be priced at the cost of a 10c Intel chip. Clocks will be a wash, with TR possibly having an all-core advantage when running AVX2. It will have a 6-core advantage. So while it will probably be down on AVX, the difference will be marginal at best.

I know what you're getting at and I personally agree. I think it might be a wash on a performance per dollar basis, especially including all relevant benchmarks.

I think part of it boils down to workflow and perception. I know that for my type of work (audio) there are certain things I've ended up doing to make up for my home computer being a dinosaur. That works for me because most of my paid work is done elsewhere. But if I did all of my work at home then the equation changes slightly (for me). Even one task where one CPU is behind might be time-consuming enough to warrant spending more on a CPU that does the job slightly faster. In other words over the lifetime of the computer it's a net gain to spend more up front.

But in my industry (pro-audio) AVX is probably 99.8% irrelevant, and in pro-video editing/color grading probably "should" be irrelevant, but possibly isn't.

Either way I'm as excited about x399 coming out as I am annoyed that I don't know the release date and pricing (since I'm really wanting a new computer now...)

The real question, outside of the people who use that number to convince themselves to get those 12c+ options, is going to be how many people really need AVX2 performance so badly that all the other advantages are wiped out by being a little slower (16c TR vs 12c i9), or by the just plain insanely priced 14c+ options.

Well, I've followed the discussions on Ryzen after its release, in forums related to content creation. While there were some absolutely legitimate concerns it was also pretty clear that people really do express themselves in a way that looks very biased and also then read what people write with a 'filter' on their glasses.

So for example (and there's an on-topic point coming...), if I understand correctly the official specs for DDR4 has it go up only so far, and most of the speeds people want to hit are technically overclocking the memory. Well, when people discussed Ryzen the relatively low amount of validated memory from the board manufacturers wasn't described that way in that context, instead it became "Ryzen has a problem with DDR4 memory". Now, one can argue back and forth just how much of a benefit user X sees in task Y when memory goes from 2400 to 3200, but regardless of that benefit it was as far as I can see never a "problem" or "issue" with Ryzen. All it was was a matter of giving it a little bit of time for things to get certified. It's really no big deal.

So the point I was getting at was that similarly to memory at launch there will be some that "play up" the importance of AVX2 or AVX512 and the perception will be different from yours. Instead of looking at all the tasks the CPU will perform and then evaluating total performance, some will see it as a true CPU issue, a problem that disqualifies the CPU, regardless of how the rest of it shakes out.....

I type too much... I'm sorry..
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,655
136
I know what you're getting at and I personally agree. I think it might be a wash on a performance per dollar basis, especially including all relevant benchmarks.

I think part of it boils down to workflow and perception. I know that for my type of work (audio) there are certain things I've ended up doing to make up for my home computer being a dinosaur. That works for me because most of my paid work is done elsewhere. But if I did all of my work at home then the equation changes slightly (for me). Even one task where one CPU is behind might be time-consuming enough to warrant spending more on a CPU that does the job slightly faster. In other words over the lifetime of the computer it's a net gain to spend more up front.

But in my industry (pro-audio) AVX is probably 99.8% irrelevant, and in pro-video editing/color grading probably "should" be irrelevant, but possibly isn't.

Either way I'm as excited about x399 coming out as I am annoyed that I don't know the release date and pricing (since I'm really wanting a new computer now...)
I guess that's my point. TBK wants us to believe that AVX512 is so important to everything that AMD is missing out on a bunch by not having it. But the actual implementations are so small, and it's probably one or two use cases where most of the instructions are AVX512 or almost none of them are, and it doesn't make a whole lot of sense to spend a bunch of time developing for that niche if the opportunity for entry into that use is small, as those customers are already pretty comfortable with their Intel implementation, probably because they gave Intel feedback on what instructions to include. Especially if it means compromising their design too much. Then of course there is the "Ryzen is bad at AVX2" mindset it has to overcome. It's not worse if you have the core count and clock speed to make up for it. I would just assume that people who look at these products would at least figure out what the actual "AVX2 penalty" is and not lock onto it. I can't think of a scenario where AVX2 or 512 would be so important for consumers or businesses outside servers that they would take the hit on the rest of the workload. I get that if a single job is the most important job in a workflow and ties up everything, then that job should be the focus. Especially if it is the only part that is performance driven. I just don't see that existing for either SIMD set.
Well, I've followed the discussions on Ryzen after its release, in forums related to content creation. While there were some absolutely legitimate concerns it was also pretty clear that people really do express themselves in a way that looks very biased and also then read what people write with a 'filter' on their glasses.

So for example (and there's an on-topic point coming...), if I understand correctly the official specs for DDR4 has it go up only so far, and most of the speeds people want to hit are technically overclocking the memory. Well, when people discussed Ryzen the relatively low amount of validated memory from the board manufacturers wasn't described that way in that context, instead it became "Ryzen has a problem with DDR4 memory". Now, one can argue back and forth just how much of a benefit user X sees in task Y when memory goes from 2400 to 3200, but regardless of that benefit it was as far as I can see never a "problem" or "issue" with Ryzen. All it was was a matter of giving it a little bit of time for things to get certified. It's really no big deal.
It's pretty hilarious how things become a problem. It's a problem when memory doesn't run out of spec, it's a problem when a CPU doesn't have the OC headroom someone wants, it's a problem when one person out of thousands posting about their Ryzen configurations has an issue with an M.2 drive dropping in and out. This works on both sides of the aisle though: if you're a fan you defend your choice to the death, and on the other side you blow up every hiccup as a major issue. Though Intel luckily doesn't have to deal with the lingering mental imprint that AMD does when an unbiased but uninformed person hears that.

So the point I was getting at was that similarly to memory at launch there will be some that "play up" the importance of AVX2 or AVX512 and the perception will be different from yours. Instead of looking at all the tasks the CPU will perform and then evaluating total performance, some will see it as a true CPU issue, a problem that disqualifies the CPU, regardless of how the rest of it shakes out.....

I type too much... I'm sorry..

Yeah, I get that TBK really does think AVX will be really important, or has some grand scheme he would love to see implemented. Though console-wise, as long as they use Cat- or Atom-like CPUs they will never widen the bus to add AVX-512 or similar functionality; it defeats the design elements that make them such an intriguing option for consoles. If they are willing to up the power envelope and complicate the cores a little more with something like Zen in the future, then maybe, but that is at minimum another 4 years out. But really, attempting to make a big deal out of this is like trying to make a big deal out of the TIM in SL-X (well, this one is more civil).
 

NTMBK

Lifer
Nov 14, 2011
10,269
5,134
136
8 fully fleshed-out Broadwell cores with AVX-512 would be a substantial jump in die area and power consumption, but I wasn't aiming for that either. A cut-down Haswell core as a base would be my ideal starting point: somewhere in between Sandy Bridge and Haswell, then work our way up from there to extend those cores with AVX-512 and bug-free TSX, with higher clocks for a 16-core part ...

All of that seems like a reasonable budget of around ~5 billion transistors with 5nm GAAFETs, and we really don't want to hold back on hitting an aggressive target of under 16.6ms for game logic, since this very well might be our last console generation as we reach a plateau in transistor technology, and I don't want AMD to hold back when they could realistically score a home run with new consoles in 202X having AVX-512 ...



It's just absolutely not wise to use the GPU for everything when all a game programmer wants to do is hit performance targets like frametimes and GPUs just don't quite provide the low latencies that a wider SIMD extension could ...

In fact, you could actually hurt performance if you naively try to port moderate-sized workloads from the CPU to the GPU: there's a real possibility that frametimes go up, which is just bad news when sticking with the CPU would have kept them lower!



Which brings me back to the above! I do not believe AVX-512 amounts to a mere practical subset of GPUs, when there are many areas a GPU can't touch in terms of delivering a speedup in applications that a CPU would normally excel in ...

Let's say the field of applications is shaped like a square. Corners 1 & 2 can be accelerated with a GPU, but it can't touch corners 3 & 4. Corner 3 is currently occupied by the CPU, but no processor could touch corner 4 until now, with an extension like AVX-512 showing a definitive advantage on this new family of CPUs ...

It is my view that AVX-512 is meant to reach spots that neither a plain CPU nor a GPU ever could and bring new performance heights to a continuum of applications with different bottlenecks (whether it be throughput, bandwidth, latency, etc.), as laid out by the 4 corners in my example ...



That's not true at all. The most common thing you have to worry about on both is getting a good data layout; that's practically half the setup work right there. As long as you can ensure the data sets are big enough, you will almost certainly get a speedup with AVX-512 from the compiler side alone, whereas on a GPU you have to go through many trials of radically GPU-centric algorithms just to get things running efficiently (gotta be careful about that register pressure and those memory access patterns). You can mix and match scalar and vector instructions fairly freely on CPUs, but doing that on a GPU will rear its ugly head with one bad surprise after another, and there are far fewer ways to synchronize across threads/waves ...



And that's what game programmers have been asking for a long time on their wishlist!

I could understand having to increase the L2 cache to reduce some bank conflicts, and widening the load and store ports to 512 bits to sustain the throughput, but AVX-512 isn't ballooning out of control in complexity like you seem to imply. Even the quadrupled register file may look hefty at first, but it's really meager in comparison to the amount GPUs have to dedicate, and GPUs make it seem easier than it is to double the amount of ALUs. Plus, in the future, once we get to denser transistor sizes, the so-called 'complexity' will become vanishingly small, like we see today with x87, MMX, and SSE ...



I don't see how AVX-512 is NOT worthwhile in the future when we consider that CPUs will take up smaller and smaller portions of the die, if console hardware designers only target a modest 85% of Haswell's IPC (basically Sandy Bridge) and bake AVX-512 into those cores. All of this sounds very doable within a ~5 billion transistor budget (all the while getting 16 cores in the process!) when we consider that a Broadwell core is twice as fat (20mm^2, using Xeon E5 as a reference) as a Knights Landing core (10mm^2), which has two AVX-512 units, despite the former having 4x less SIMD width ... (both built on the same 14nm process)

If the new consoles were built around 5nm GAAFETs, the CPUs would most likely only be worth 30mm^2 at MOST, so console manufacturers would really be paying peanuts for AVX-512 in the end, and if some game programmers don't want to mess around with AVX-512 they can just turn the functionality off for higher clocks, so it's a win-win scenario all around as time goes on. Console manufacturers also wouldn't face another codebase fragmentation if they plan to upgrade again (the PS4 Pro is feeling this since its GPU doesn't have the same ISA, though I wonder if it's backwards compatible). Increasing IPC is comparatively transparent at the compiler level when you share the same ISA, so if console manufacturers want, they can go in the direction of upgrading the CPU microarchitecture instead ...

I don't see the big deal of paying a little (low risk) to see big gains (high potential reward), and that's exactly how I view AVX-512 in the future. I'm sure every once in a while an engine programmer would like to see their engine hit the 16.6ms threshold just to reach the magical 60FPS, in case they ever want it.

Besides, if we take a look at the PS4 APU, the Jaguar cores consumed roughly 30% of the die size, so console manufacturers don't share your vision of an extremely skewed CPU-to-GPU compute ratio being ideal as it is, and lots of game programmers don't find that fun either (aside from maybe the graphics programmers).

How far in the future are you talking? Because I feel like we may be talking at cross purposes somewhat. The jump from what's in today's consoles to what you are proposing is enormous. For reference, here's a comparison of die sizes of CPUs:



Even comparing 28nm to 22nm, a Haswell core without AVX-512 is almost five times the size. We will need a lot of die shrinks to happen before we can fit 8 Haswell-plus-AVX-512 CPUs onto a console APU without significantly increasing die size.

I would agree that in the long, long term, it could well become a small enough addition to a core to make its way in. It's cool tech, and it has its usage. But by the time it's feasible in a console in, say, 2027, will we even still be on x86? Maybe we'll be talking about ARM SVE instead.
 

krumme

Diamond Member
Oct 9, 2009
5,956
1,595
136
How far in the future are you talking? Because I feel like we may be talking at cross purposes somewhat. The jump from what's in today's consoles to what you are proposing is enormous. For reference, here's a comparison of die sizes of CPUs:



Even comparing 28nm to 22nm, a Haswell core without AVX-512 is almost five times the size. We will need a lot of die shrinks to happen before we can fit 8 Haswell-plus-AVX-512 CPUs onto a console APU without significantly increasing die size.

I would agree that in the long, long term, it could well become a small enough addition to a core to make its way in. It's cool tech, and it has its usage. But by the time it's feasible in a console in, say, 2027, will we even still be on x86? Maybe we'll be talking about ARM SVE instead.
What about a single CCX of 6c Zen 2 at 7nm for a console launching in 2019? With reduced or no L3 and no AVX-512. Or perhaps even an 8c sans L3? Is that technically doable, and what about die size?
If I remember right, a lot of the 193mm2 area on Ryzen's 2 CCXs is "wasted" anyway, so if they stay within a CCX they gain mm2 and avoid the cross-CCX latency issues?
 

Topweasel

Diamond Member
Oct 19, 2000
5,436
1,655
136
What about a single CCX of 6c Zen 2 at 7nm for a console launching in 2019? With reduced or no L3 and no AVX-512. Or perhaps even an 8c sans L3? Is that technically doable, and what about die size?
If I remember right, a lot of the 193mm2 area on Ryzen's 2 CCXs is "wasted" anyway, so if they stay within a CCX they gain mm2 and avoid the cross-CCX latency issues?

I don't want to say you can't have Zen without L3, but it would be pretty close to the truth. A big reason AMD went L3-less on their APUs was that any additional cache they added subtracted from the number of GPU CUs they could fit. With Zen they obviously didn't want to be stuck in that situation again, so they made L3 more important than ever and placed it at the center of the CCX, instead of doing what they and Intel have done in the past, which is bolt the cache on afterwards once they figure out how much they want to add.

Zen really isn't about the cores themselves but the CCX module. That is Zen. Just like the Bulldozer architecture wasn't about a single BD core but the full CMT module.

Now again, these are semi-custom customer chips, so it's all up for grabs. But it's going to take a lot of money to get AMD to basically carve up a CCX.
 

ThatBuzzkiller

Golden Member
Nov 14, 2014
1,120
260
136
That's debatable.

I certainly don't believe that's the case, else we would have seen it adopted widely as it has been supported since SB. In the context of gaming, maybe it would be useful in things like emulation, where you call a few instructions repeatedly. But that is a niche application.

AVX2 and AVX-512 being used extensively by games in the future is unlikely to happen. Console generations stay in the market long enough that, in my opinion, making the next iteration faster through process shrinks and/or architectural changes is worth more than any potential uplift from special instructions.

AVX is best utilized in specific HPC workloads, and I don't see it changing in the future.

We don't see it because AVX isn't faster in the case of consoles when it's practically half rate, and Sandy Bridge supporting it isn't good enough either, since there are lots of users who refuse to ditch old systems (as we saw from the No Man's Sky and 3DMark Time Spy backlash when they required SSE4.1), so app developers hold off on using the new extensions ... (There's a lot of gravity involved in software support, such as the userbase, drivers, and compilers, before new hardware and extensions get adopted)

I believe that only consoles can show common end users how important these features are, by demonstrating how this reflects in PC games ...

Realistically speaking, if you can find a speedup using SSE in a game engine, then you can almost certainly find a speedup using AVX-512. On a GPU you would usually need tens of thousands of elements to work on before offloading starts to pay off, whereas if you only need several hundred or a couple thousand iterations, AVX-512 is better suited to the task: the vector units run at higher clocks and have a lower startup cost, since you can keep the scalar and vector code interleaved with each other. That translates to lower latency in the end, which is a fair bit of a win ...
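
To make that concrete, here's a minimal sketch of what keeping the scalar and vector paths side by side looks like with AVX-512 intrinsics. This is my own hypothetical example (the function name and the position/velocity arrays are made up), not code from any shipping engine. The hot loop chews through 16 floats per iteration, and the leftover tail is finished with a write mask in the same function, so there's no separate kernel launch or copy like you'd need to offload the same work to a GPU.

Code:
// Hypothetical kernel in the style of a game-engine update loop:
// pos[i] += vel[i] * dt for n floats. Assumes an AVX-512F capable CPU
// and the matching compiler flag (e.g. -mavx512f on GCC/Clang).
#include <immintrin.h>
#include <stddef.h>

void integrate_positions(float *pos, const float *vel, float dt, size_t n)
{
    const __m512 vdt = _mm512_set1_ps(dt);       // broadcast dt to all 16 lanes
    size_t i = 0;

    // Main vector loop: 16 elements per iteration.
    for (; i + 16 <= n; i += 16) {
        __m512 p = _mm512_loadu_ps(pos + i);
        __m512 v = _mm512_loadu_ps(vel + i);
        p = _mm512_fmadd_ps(v, vdt, p);          // p = v * dt + p
        _mm512_storeu_ps(pos + i, p);
    }

    // Tail: the remaining (n - i) < 16 elements are handled with a mask,
    // right next to the vector code -- no separate scalar cleanup loop.
    if (i < n) {
        __mmask16 m = (__mmask16)((1u << (n - i)) - 1u);
        __m512 p = _mm512_maskz_loadu_ps(m, pos + i);
        __m512 v = _mm512_maskz_loadu_ps(m, vel + i);
        p = _mm512_fmadd_ps(v, vdt, p);
        _mm512_mask_storeu_ps(pos + i, m, p);
    }
}

A few hundred or a few thousand iterations of something like that is exactly the regime where firing up a GPU kernel wouldn't be worth the startup cost, which is the point I'm getting at.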
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
PhysX is for a subset of PCs with a particular type of hardware. This is about consoles. Do you think there aren't latency issues involved with wider SIMD on the CPU?

A subset of PCs with a particular type of hardware? No idea what you're talking about. PhysX isn't restricted to PCs; it's also found in loads of console games. The Witcher 3 used PhysX, for instance, and so does every Unreal Engine 3 and Unreal Engine 4 game, plus many others. In fact, PhysX is by far the most popular middleware physics engine.

As for latency issues with wider SIMD, I would imagine that SIMD generally reduces latency, since it takes fewer instructions to perform the same calculations. Also, remember that AVX-512 doubles the number of vector registers, from 16 to 32.

Perhaps there are reasons why developers can get away with the level of physics simulation that is currently present in today's games - because it doesn't necessarily imply a better visual presentation. Maybe resources are better spent in utilizing more GPU compute for better lighting, weather simulation and post-processing among other things - things which are more immediately noticeable.

I actually agree with you here. This is precisely why developers haven't focused on improving the quality of physics. Graphical improvements are much more noticeable and therefore, much more marketable. But that doesn't mean that improved physics can't offer a great deal in terms of enhanced gameplay and immersion.

The 2X throughput that each iteration of AVX promises over the previous one is very often only reached when calculating loops of the type

Code:
A[i] = B[i]*C[i] + D[i]

Real AVX-accelerated code is far from providing the same kind of speedup.

I'm not a programmer, but from what I've read, most of the problems with AVX stem from the lack of certain instructions, or their performance. For instance, when AVX was first introduced with SB, there was no gather instruction. We had to wait till AVX2 with Haswell for the gather instruction, and even then, it was considered too slow to use. But Intel kept refining it, first with Broadwell and then with Skylake.

And now with AVX-512, you have scatter and masked operations, which make auto-vectorization easier and more performant. So the point is that wide SIMD on x86 has been a work in progress, and AVX-512 is a major milestone; not just in terms of performance, but practicality as well.
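
Since I've been reading up on this, here's a rough sketch of the kind of loop those features unlock -- a toy example of my own (the function and the "only scale positive entries" rule are made up for illustration), not anything from a real codebase. A gather pulls values in through an index array, a compare produces a mask, the arithmetic only runs on the masked lanes, and a scatter writes just those lanes back; that's the pattern older SIMD extensions forced you to handle element by element.

Code:
// Hypothetical AVX-512F example: gather + masked multiply + scatter.
// Processes exactly 16 indices; idx must point to 16 ints within bounds of data.
#include <immintrin.h>

void scale_selected(float *data, const int *idx, float factor)
{
    __m512i vi = _mm512_loadu_si512(idx);                   // load 16 indices
    __m512  v  = _mm512_i32gather_ps(vi, data, 4);          // gather data[idx[k]]

    // Mask of lanes holding positive values -- the "condition" a compiler
    // would otherwise have to branch on per element.
    __mmask16 m = _mm512_cmp_ps_mask(v, _mm512_setzero_ps(), _CMP_GT_OQ);

    // Multiply only where the mask is set; other lanes pass through unchanged.
    v = _mm512_mask_mul_ps(v, m, v, _mm512_set1_ps(factor));

    // Scatter back only the masked lanes.
    _mm512_mask_i32scatter_ps(data, m, vi, v, 4);
}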
 