OpenCL/OpenGL review at Tom's Hardware


Arzachel

Senior member
Apr 7, 2011
903
76
91
I could say that AVX2 vs OpenCL is apples to snow moles, that OpenCL has a huge install base and AVX2 has none, that software renderers aren't a thing anymore for a good reason, but I think one word is enough.

Larrabee.
 

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
OpenCL is a new "standard" where there is no need for one, leading to fragmentation and wildly varying performance. Just look at how the GTX 680 fails against a quad-core CPU! With homogeneous computing like AVX2, developers can use existing languages, and higher performance across the board with less effort.

The GTX 680 is purposely limited by nVidia in anything GPGPU-related so they can sell Teslas. It actually loses to the GTX 580 in pretty much everything other than gaming.
 

Schmide

Diamond Member
Mar 7, 2002
5,590
724
126
AVX2 brings GPU technology into the CPU cores. It offers the same computing power, without the overhead or limitations. So there's plenty of reason to get excited over AVX2.

This is as far as I got before I was like WTF are you talking about.

AVX at most allows like 3-4 operations to run in parallel per core, while GPUs offer many orders more parallel execution units. It's not even the same idea, let alone the same level.
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
This is as far as I got before I was like WTF are you talking about.

AVX at most allows like 3-4 operations to run in parallel per core, while GPUs offer many orders more parallel execution units.
You are clueless. Haswell will be able to do 32 floating-point operations per core per cycle. That's because AVX2 is 256-bit = 8 x 32-bit, and Haswell will have two such units for fused multiply-add (FMA) floating-point operations. That's 500 GFLOPS for a quad-core at 3.9 GHz.

For reference, Ivy Bridge's iGPU consists of 16 cores each capable of 2 x 4 FMA operations. At 1.15 GHz, that's only 300 GFLOPS.

So where's this "many orders more" you're talking about? And this is just peak theoretical throughput. In practice, GPUs can perform even much worse when facing general-purpose computing tasks. That's 3 TFLOPS losing against 230 GFLOPS. Quite pathetic.
It's not even the same idea let alone the same level.
How is AVX2 not the same idea as a GPU instruction set? It's wide SIMD instructions used in an SPMD fashion, including gather support. What do you think is missing to make it the "same idea"?

I'm afraid you've been fooled by GPU marketing. They count each SIMD lane as a separate core. Counting the same way, mainstream Haswell would have 64 cores, not four.
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
So, does AVX2 optimized software actually make Moar Coars beneficial?
AVX2 extracts only Data Level Parallelism (DLP), while multi-core extracts Task Level Parallelism (TLP) or DLP. So independent cores are more versatile, but they also require more transistors and consume more power. Furthermore, having many cores makes it quadratically harder to synchronize between them. Haswell introduces TSX technology to help with that too.

So AVX2 and multi-core are orthogonal technologies. AVX2 does not make more cores beneficial, but TSX does.
 

Schmide

Diamond Member
Mar 7, 2002
5,590
724
126
You are clueless. Haswell will be able to do 32 floating-point operations per core per cycle. That's because AVX2 is 256-bit = 8 x 32-bit, and Haswell will have two such units for fused multiply-add (FMA) floating-point operations. That's 500 GFLOPS for a quad-core at 3.9 GHz.

Dude? I'm clueless? I've seen nothing on this 32 flops per cycle. Just because it's encoded doesn't mean it can execute in one cycle. I highly doubt Intel is going to move from averaging 1.5 flops* per cycle to 32 flops per cycle in one iteration. Oh, and btw, it's impossible to execute an FMA in one cycle (unless you have a time machine?). You can't add till the multiply completes, and as far as I know Haswell is still a 14-stage pipeline. An AVX FMA is probably going to shave 40% off of that pipeline because it never has to write back the intermediate result.

PS 1.5 flops per cycle is amazing in actual code. I think the theoretical limit would be about 4 but that's a strict loop never loading or storing the result. If Intel makes over 2 I will be very impressed.

For reference, Ivy Bridge's iGPU consists of 16 cores each capable of 2 x 4 FMA operations. At 1.15 GHz, that's only 300 GFLOPS.

I just looked this up. (here and here)

Intel currently with AVX delivers 294 GFLOPS per node (2x Xeon E5-2670) in an HPC setup under Linpack with MKL. That's about 19 GFLOPS per core! I'm sure they could do better with a more specific task.

So where's this "many orders more" you're talking about? And this is just peak theoretical throughput. In practice, GPUs can perform even much worse when facing general-purpose computing tasks. That's 3 TFLOPS losing against 230 GFLOPS. Quite pathetic.

I do believe that's Intel's iGPU scoring that, and I would add it does rather nicely in that mark considering its size and unit count. I would actually say Intel's iGPU is maybe better at computing than graphics.

How is AVX2 not the same idea as a GPU instruction set? It's wide SIMD instructions used in an SPMD fashion, including gather support. What do you think is missing to make it the "same idea"?

It's not the instruction set that makes the performance; it's the architecture. I think you should look into it and get back to this. When you learn how to program a massively parallel task with 128+ threads doing the work, you'll get it.

I'm afraid you've been fooled by GPU marketing. They count each SIMD lane as a separate core. Counting the same way, mainstream Haswell would have 64 cores, not four.

Still has only 2 threads per core and a 14-stage pipeline fed by a general-purpose cache system. If you define it that way, it's kind of sad how idle all those cores are.

I don't buy into hype, I look at the problems and the tools that provide the solutions. The general-purpose processor is a different beast than the massively parallel APU. They complement each other very well.

The most unique thing you will see in the next generation, and we're seeing it a bit in Trinity, is a flat memory model where you don't have to move massive amounts of data around to make use of all the resources.
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
Dude? I'm clueless? I've seen nothing on this 32 flops per cycle.
Hence my point.
Just because it's encoded doesn't mean it can execute in one cycle. I highly doubt Intel is going to move from averaging 1.5 flops* per cycle to 32 flops per cycle in one iteration. Oh, and btw, it's impossible to execute an FMA in one cycle (unless you have a time machine?).
It doesn't have to be completed in one cycle. It just has to start a new one each cycle. So please stop making a fool of yourself and check the Intel and AMD optimization manuals. Multiply, add, and fused multiply-add all have a throughput of one per cycle per execution port.
You can't add till the multiply completes and as far as I know Haswell is still a 14 stage pipeline.
That's the length of the pipeline from the fetch to the retirement stage. The execution latency is much shorter. And again, don't confuse latency with throughput.
PS 1.5 flops per cycle is amazing in actual code. I think the theoretical limit would be about 4 but that's a strict loop never loading or storing the result. If Intel makes over 2 I will be very impressed.
How do you think GPUs achieve TFLOPS? By executing multiple independent loop iterations in parallel! AVX2 allows the CPU to achieve the exact same thing. Hence Haswell can do 32 floating-point operations per core per cycle.
I don't buy into hype, I look at the problems and the tools that provide the solutions. The general purpose processor is a different beast that the massively parallel APU.
Clearly you do buy into hype. You even believe in magic. You believe there's something a GPU can do that a CPU can't, even though they're both made out of silicon. Is it multi-core? No, both have that. Is it wide SIMD vector execution units? No, both have that too. Is it gather support? No, Haswell will support that too.

So please do tell me what magical technology a GPU has, which couldn't possibly be merged into the CPU as AVX2... Or maybe you're simply wrong and you have to stop believing in fairy tales?
The most unique thing you will see in the next generation, and we're seeing it a bit in Trinity, is a flat memory model where you don't have to move massive amounts of data around to make use of all the resouces.
Having a flat memory model doesn't make the data movement go away. It just hides it from the developer. The overhead is still there. And that's why heterogeneous computing does not scale.

AVX2 adds all the features which make GPUs fast at throughput computing into the CPU cores, so it doesn't suffer from the heterogeneous overhead.
 

Riek

Senior member
Dec 16, 2008
409
14
76
AVX2 brings GPU technology into the CPU cores. It offers the same computing power, without the overhead or limitations. So there's plenty of reason to get excited over AVX2.

And yes, no CPU supports it yet. But neither does any APU today support a unified address space and context switches. That's only planned to be complete by 2014. So AVX2 will get there sooner.

GCC 4.7 supports AVX2, LLVM 3.1 supports AVX2 and Visual Studio 2012 supports AVX2. So compilers are well ahead of schedule too.

Because AMD has yet to announce that they'll support AVX2. It's inevitable that they will, but they'd rather have people use HSA instead. In other words they're betting the farm on other technology. Looking at what can already be achieved with AVX, and all the phenomenal things added by AVX2, that's really going to turn out to be a big mistake on AMD's part.

Just like NVIDIA realized, they should back away from making compromises to graphics performance for the sake of GPGPU. General purpose computing is what the CPU is for, and AVX2 adds a lot more oomph to it. Heterogeneous computing doesn't scale, due to the round-trip latency and bandwidth bottleneck. So the GPU should concentrate on pure graphics only, which is a one-way process.

It's not really OpenCL versus AVX2. It's homogeneous versus heterogeneous general purpose throughput computing. OpenCL is just one way to get code auto-vectorized. But AVX2 supports many more programming languages and frameworks. So it's not a question of one or the other. Indeed as you indicate, one is hardware and the other is software. That said, OpenCL may not survive long after homogeneous computing proves to be superior, since it will have to compete against other languages which have fewer restrictions.

AVX2 can be used by any language as-is. All you need is loops with independent iterations to auto-vectorize them. AVX2's gather support is critical in enabling that. And it means developers can use languages they already know and love, instead of trying to shoehorn things into the OpenCL framework and losing performance on heterogeneous architectures.

Sure, it depends on the underlying hardware whether it's a high performance implementation or not. But that's equally true for GPUs!

Haswell's implementation of AVX2 will have three 256-bit execution units per core. Two of these will be capable of FMA operations, resulting in a peak performance of 500 GFLOPS for a quad-core. On a performance/area metric that's actually quite close to any GPU. And you don't lose any of the existing CPU qualities like far superior sequential speed, large cache space per thread, branch prediction to prevent stalls, etc.

Last but not least, AVX2 is not the end of the road. The encoding format supports extending it up to 1024-bit registers. This can be used to lower the power consumption of the CPU's front-end and out-of-order execution, by executing AVX-1024 instructions in four cycles (i.e. same ALU throughput for four times less power consumption in the rest of the pipeline). This would effectively make the CPU behave much more like a GPU in terms of power consumption. So heterogeneous computing won't have any benefits left.

AVX2 isn't a competitor to heterogeneous computing, because it is a part of it... You clearly know that, yet you don't accept that fact.

AVX2 is an instruction set that can be handled by the CPU. Whether it originates from OpenCL, Java, .NET or assembler code doesn't matter. If you want to use AVX2, all of those will eventually allow you to use it. If you want to use it thoroughly you will need to optimize your code to do so, for either platform.

OpenCL provides (being heterogeneous) the ability to combine all the power in your PC. Be it AMD, Intel, Nvidia. Heck, even older hardware can make use of OpenCL (although the advantage will diminish rapidly for some tasks).

Haswell will have 2.5 times the GPU power of IvB in terms of raw calculations. Kaveri will reach 1 TFLOP on the GPU side. This is additional power ABOVE the CPU power (be it through SSE, AVX, AVX2). (It's an AND-AND, not OR.)

Both APUs will deliver over 1 TFLOPS of power that can be harnessed by OpenCL, which is not possible for any non-heterogeneous approach.

Does that mean OpenCL is the future for all? Of course not, it's only useful for certain situations. For integer-based code, OpenCL will not bring any advantage except platform portability. For some floating-point based programs the overhead will be too big to gain performance with GPUs. (Since apps will still be able to use AVX2, the difference in performance won't be there... assuming the same optimization and no restrictions on either side.)
But seeing those benchmarks proves that OpenCL can improve performance a lot (huge improvements) in some applications. A lot more than possible with next-gen CPUs alone... (and that is even excluding the next-gen CPU+IGP combinations, which will scale due to enhancements on both parts).


Eventually you are right in saying AVX2 is a step the CPU makes towards a GPU... Just like GPUs have made strides to become more like a CPU. But as long as one chip doesn't replace the other in hardware, using both chips will be >>> than one in those applications OpenCL aims at.

edit: I forgot to add ARM also as a supported partner within OpenCL.
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
AVX2 isn't a competitor to heterogeneous computing, because it is a part of it... You clearly know that, yet you don't accept that fact.
Don't be silly. There's nothing heterogeneous about AVX2. It's a homogeneous part of the x86 ISA. I'm curious what makes you think I would "know" otherwise. You're probably just confusing SPMD computing with heterogeneous computing.
AVX2 is an instruction set that can be handled by the CPU. Whether it originates from OpenCL, Java, .NET or assembler code doesn't matter. If you want to use AVX2, all of those will eventually allow you to use it. If you want to use it thoroughly you will need to optimize your code to do so, for either platform.
Yes, one of the strengths of AVX2 is that it can be used with any programming language. But what do you mean by "either platform"? AVX2 doesn't suffer from round-trip latencies or bandwidth bottlenecks because the data can stay in the CPU caches between sequential and parallel processing. Also it still benefits from CPU features like out-of-order execution and a large amount of cache per thread, which avoids many stalls. So contrary to heterogeneous computing it takes far less effort, if any, to get good performance out of AVX2.
OpenCL provides (being heterogeneous) the ability to combine all the power in your PC.
OpenCL itself is not heterogeneous. It's just an API and language for SPMD processing. It's software. Only the hardware can be heterogeneous or not.

Of course OpenCL targets heterogeneous processing. There is little point in using OpenCL for homogeneous processing, even though it's perfectly possible.

And that's exactly OpenCL's problem. AVX2 and its successors will prove to be far more versatile than heterogeneous processing. Developers won't choose OpenCL over just using a vectorizing compiler and the programming languages they already know. So OpenCL will die out together with heterogeneous general purpose processing.
Haswell will have 2.5 times the gpu power of IvB in terms of raw calculations. Kaveri will reach 1TFlop on the gpu side. This is additional power ABOVE the cpu power(be it through SSE, AVX, AVX2). (its an AND AND not OR).
Only mobile Haswell chips with a GT3 will have that much GPU power. And no, you can't just add up the GFLOPS. Tasks can't migrate between the CPU and GPU. Of course it's technically feasible, but it's just not desirable both due to software complexity and the overhead for moving things.

Also let me remind you that the GPU's raw performance is meaningless, as proven by a 230 GFLOPS CPU beating a 3 TFLOPS GPU. So with the arrival of AVX2, developers will always use the CPU for general purpose computing, which is what it has always been designed for. The GPU will have to go back to what it was designed for as well: graphics. So all of AMD's effort will go to waste. They've crippled the CPU, making it less good at general purpose computing, and they've crippled the GPU, making it less good at graphics!
But seeing those benchmarks proves that OpenCL can improve performance a lot (huge improvements) in some applications. A lot more than possible with next-gen CPUs alone...
You can't conclude that from those benchmarks at all. First of all, they're comparing a 125 Watt FX-8150 + 200 Watt 7970, against a 35 Watt mobile dual-core i5. And despite that, in several cases the GPU's lead really isn't that big, and you have to keep in mind that AVX2 doubles the arithmetic throughput but also requires 18 times fewer instructions for gather! In other cases the results can really only be explained by running blatantly unoptimized CPU code, just like NVIDIA has done with PhysX. They should at the very least have taken the effort to run OpenCL on the CPU to be able to tell how it compares.
 

Riek

Senior member
Dec 16, 2008
409
14
76
Don't be silly. There's nothing heterogeneous about AVX2. It's a homogeneous part of the x86 ISA. I'm curious what makes you think I would "know" otherwise. You're probably just confusing SPMD computing with heterogeneous computing.

OpenCL itself is not heterogeneous. It's just an API and language for SPMD processing. It's software. Only the hardware can be heterogeneous or not.

So heterogeneous computing does not use the instructions available on a CPU when it addresses a CPU? Did you tell Intel and AMD that they shouldn't put in decoders anymore?

Great to know... Every day one might learn something... Now I know I can use OpenCL to avoid using x86 commands on a CPU. Good riddance of the x86 instruction set... finally. (puts up a huge neon flashing sarcasm sign)

See, here you make it seem you get the point... but every other sentence you write contradicts the fact that you understand that OpenCL and an instruction set are not related to each other.

Also let me remind you that the GPU's raw performance is meaningless, as proven by a 230 GFLOPS CPU beating a 3 TFLOPS GPU.

So you mean the 7870 with 2.5 TFLOPS single precision (with FMA) isn't +180% faster than the quad core?
You like to take numbers as they suit you, don't you?
RAW performance is irrelevant as soon as you add FMA into the mix (which you will eagerly do for Haswell also... despite the double throughput it would bring only ~10% extra performance... that will make Haswell flunk in efficiency also).

You can't conclude that from those benchmarks at all. First of all, they're comparing a 125 Watt FX-8150 + 200 Watt 7970, against a 35 Watt mobile dual-core i5. And despite that, in several cases the GPU's lead really isn't that big, and you have to keep in mind that AVX2 doubles the arithmetic throughput but also requires 18 times fewer instructions for gather!
AVX2 with FMA will bring ~10% performance benefit with double the throughput... great... now let's all get a hot cecemel and a cookie.
The APUs, which are a quad core + GPU, are doing pretty well also... and they're not above 100W. But you didn't notice those, did you? Heck, I doubt you even looked at the whole review on tomshardware instead of limiting yourself to CS6... (I am right, am I not?)
Also, in case you might have missed the whole damn thread... this isn't an AMD vs Intel thing... Who cares if they compare with a select set of CPUs? The FX-8150 isn't exactly the slowest CPU...


requires 18 times fewer instructions for gather
This is a completely useless statement without:
a) latency
b) OOO impact on those instructions.

Try looking at XOP and what it does for the number of instructions... and see how much the benefit was from that.


will AVX2 bring any benefits for gaming if software developers use it? thanks
Not really. Games are not limited by that; it's mostly branches that kill gaming performance. Maybe if PhysX gets a decent compile you will see a performance jump for those parts, but that would be the case without AVX(2) also.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
So you mean the 7870 with 2.5 TFLOPS single precision (with FMA) isn't +180% faster than the quad core?
You like to take numbers as they suit you, don't you?

Just a note.

The numbers listed by nVidia and AMD for GPUs are essentially theoretical. Only Tesla and FireStream cards can actually deliver, and even then it can be hard to fully utilize the cards.

Second issue: 1 flop ain't equal to 1 flop. We already saw how nVidia, for example, did much better with fewer flops on their cards. Or, for example, some projects need 4 flops on a GPU to do what the CPU does in 1 flop.

But otherwise I agree. GPUs will still be faster for these GPGPU tasks for a while yet, at least.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
BenchPress: Also let me remind you that the GPUs raw performance is meaningless, as proven by a 230 GFLOPS CPU beating a 3 TFLOPS GPU.

Problem I have is that you are being very selective in examples. What about the 7970 that achieves 4x the score? But you HAVE to use the worst one, don't you?

Riek: AVX2 with FMA will bring ~10% performance benifit with double the throughput... great ... now all lets get a hot cecemel and a coockie.

Actually, CPUs usually achieve closer to their flops rating. The 10% performance gain is merely your speculation, right? That may be true in client applications, but for what we are talking about, gains close to theoretical aren't out of reach. The caveat is that it depends on the application. In Xeon E5, applications commonly gain 25-50% from enabling AVX, while some reach 100%. They say Sandy Bridge's AVX is limited by the load ports in most applications, but what's to say that won't change with yet another 2x improvement in FP throughput?

The way I see it though, whether it is a CPU trying to be good in graphics/parallel or a GPU trying to be good in compute, all ends up being a CPU. The companies can market it whatever they want, but they would eventually fit it in the original CPU definition.

ShintaiDK: But otherwise I agree. GPUs will still be faster for these GPGPU tasks for a while yet, at least.

Like what? The word GPGPU is general in itself. It can mean so many things you can be right and wrong simultaneously.
 

Riek

Senior member
Dec 16, 2008
409
14
76
Riek: AVX2 with FMA will bring ~10% performance benefit with double the throughput... great... now let's all get a hot cecemel and a cookie.

Actually, CPUs usually achieve closer to their flops rating. The 10% performance gain is merely your speculation, right? That may be true in client applications, but for what we are talking about, gains close to theoretical aren't out of reach. The caveat is that it depends on the application. In Xeon E5, applications commonly gain 25-50% from enabling AVX, while some reach 100%. They say Sandy Bridge's AVX is limited by the load ports in most applications, but what's to say that won't change with yet another 2x improvement in FP throughput?

Well, you read my statement wrong, but that's probably because I wasn't all that clear. I was aiming at the FMA operation (not the AVX 128-bit -> 256-bit change),
which doubles the theoretical throughput numbers but doesn't give nearly as much benefit as the theoretical increase would indicate. And the 10% is indeed an estimate remaining from the discussion on how much FMA would benefit Bulldozer. (Although you may correct me on that one, I don't expect it to be that far off.)
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
Riek: Which doubles the theoretical throughput numbers, but doesn't give nearly as much benefit as the theoretical increase would indicate. And the 10% is indeed an estimate remaining from the discussion on how much FMA would benefit Bulldozer. (Although I don't expect it to be that far off.)

What, based on client applications? Also, let's not use Bulldozer as an example. You don't know if there are bottlenecks elsewhere keeping it from reaching full potential. It's like with Core 2: Intel was greatly ahead in client but almost on par with the Opterons in servers. But when Nehalem came, servers got HUGE gains. The Penryn-based server chips had a powerful CPU core but a weak platform.
 

Olikan

Platinum Member
Sep 23, 2011
2,023
275
126
The way I see it though, whether it is a CPU trying to be good in graphics/parallel or a GPU trying to be good in compute, all ends up being a CPU. The companies can market it whatever they want, but they would eventually fit it in the original CPU definition.

Yes, both architectures (CPU and GPU) are converging...

AMD calls it APUs... but today it's just a marketing name...
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
will AVX2 bring any benefits for gaming if software developers use it? thanks
Yes, most definitely.

What makes it very attractive to (game) developers is that it only takes minimal effort to benefit from AVX2. Some scalar code can even become eight times faster just by changing a compiler setting. This is in stark contrast to the effort it takes to rewrite things for OpenCL and make it run well on many configurations.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
AVX2 with FMA will bring ~10% performance benefit with double the throughput

Where did you pull this measly 10% from?

I'll say that for any compute-intensive kernel with balanced ADD/MUL, FMA will be 2.5x to 3x the performance of SSE code on the same Haswell chip, and probably an even bigger performance increase vs. previous chips thanks to other uarch enhancements. So AVX2 vs. SSE is 4x the peak throughput with up to 3x the effective throughput, and more than 3x the effective throughput compared to the previous generation. This is clearly a big deal.

Reference: slide 62 of the IDF Spring AVX2 presentation, BJ12_ARCS002_102_ENGf.pdf, downloadable from intel.com/go/idfsessionsBJ: FMA vs SSE shows a 2.31x effective speedup, and that's with more shuffling code than in a typical high-performance kernel.
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
So heterogeneous computing does not use the instructions available on a CPU when it addresses a CPU? Did you tell Intel and AMD that they shouldn't put in decoders anymore?
When only the CPU is being used, it's called homogeneous processing. Heterogeneous processing means multiple instruction set architectures are being used. Hence you can't "address" a CPU only with heterogeneous processing. As I suspected before, you are confusing SPMD computing with heterogeneous computing. One does not imply the other.
You like to take numbers as they suit you, don't you?
No, I like to take relevant numbers from real-world cases. The GTX 680 is a popular cutting edge GPU. Nobody can claim GPUs are more powerful than CPUs and ignore this utterly pathetic result for a high-end NVIDIA GPU. The fact that the HD 7970 is faster doesn't make up for that. And NVIDIA will likely continue to optimize its consumer GPUs for graphics instead of GPGPU. They realize there is no future in consumer GPGPU when AVX2 is clearly going to be superior.

Even looking at the HD 7970 result there is no reason to be impressed. This is a huge chip against four tiny CPU cores: 3.8 TFLOPS against 230 GFLOPS. And keep in mind that the majority of people don't have a GPU like that, while quad-core CPUs did become mainstream. So add to this that AVX2 doubles the throughput and radically speeds up gather operations, and it's clear that heterogeneous computing becomes a tough sell. The ROI for AVX2 is much higher.
RAW performance is irrelevant as soon as you add FMA into the mix (which you will eagerly do for Haswell also... despite the double throughput it would bring only ~10% extra performance... that will make Haswell flunk in efficiency also).
FMA improves performance in floating-point intensive workloads way more than 10%. It should be more like 50%.

But that's not all. Sandy/Ivy Bridge were actually held back by cache bandwidth. Haswell will double it. And gather support fixes the sequential memory access bottleneck. So all these things taken together it shouldn't be all that hard to achieve twice the effective throughput, if not more.
This is a completely useless statement without:
a) latency
b) OOO impact on those instructions.
a) The most likely gather implementation will be capable of accessing one cache line per clock cycle. Also, it should consist of three uops, but they execute on different ports (look at vblendvps for how that would work). In any case the exact latency depends on the memory access pattern. However, gather is invaluable in GPUs, so there's no reason to expect any less of an effect in CPUs.
b) There's a dependency chain for performing the same operation with extract/insert. Gather should be able to reorder the read operations. So this is definitely very beneficial to out-of-order execution. How much exactly depends on the use case and the precise implementation.

In any case, replacing a sequence of 18 instructions with a single instruction must have a profound impact on the performance of SPMD code with memory indirections. It's not "completely useless" without the full details.
 

Olikan

Platinum Member
Sep 23, 2011
2,023
275
126
And NVIDIA will likely continue to optimize its consumer GPUs for graphics instead of GPGPU. They realize there is no future in consumer GPGPU when AVX2 is clearly going to be superior.

yes... that's why they just announced GK110 Tesla cards
 