> software renderers aren't a thing anymore for a good reason
But I think one word is enough: Larrabee.
OpenCL is a new "standard" where there is no need for one, leading to fragmentation and wildly varying performance. Just look at how the GTX 680 fails against a quad-core CPU! With homogeneous computing like AVX2, developers can use existing languages and get higher performance across the board with less effort.
> You are clueless. Haswell will be able to do 32 floating-point operations per core per cycle. That's because AVX2 is 256-bit = 8 x 32-bit, and Haswell will have two such units for fused multiply-add (FMA) floating-point operations. That's 500 GFLOPS for a quad-core at 3.9 GHz.

This is as far as I got before I was like WTF are you talking about.
AVX at most allows for like 3-4 operations to run in parallel per core, while GPUs offer many orders more parallel execution units.
> How is AVX2 not the same idea as a GPU instruction set? It's wide SIMD instructions used in an SPMD fashion, including gather support. What do you think is missing to make it the "same idea"?

It's not even the same idea, let alone the same level.
> AVX2 extracts only Data Level Parallelism (DLP), while multi-core extracts Task Level Parallelism (TLP) or DLP. So independent cores are more versatile, but they also require more transistors and consume more power. Furthermore, having many cores makes it quadratically harder to synchronize between them. Haswell introduces TSX technology to help with that too.

So, does AVX2-optimized software actually make Moar Coars beneficial?
You are clueless. Haswell will be able to do 32 floating-point operations per core per cycle. That's because AVX2 is 256-bit = 8 x 32-bit, and Haswell will have two such units for fused multiply-add (FMA) floating-point operations. That's 500 GFLOPS for a quad-core at 3.9 GHz.
For reference, Ivy Bridge's iGPU consists of 16 cores each capable of 2 x 4 FMA operations. At 1.15 GHz, that's only 300 GFLOPS.
So where's this "many orders more" you're talking about? And this is just peak theoretical throughput. In practice, GPUs can do far worse on general-purpose computing tasks. That's 3 TFLOPS losing against 230 GFLOPS. Quite pathetic.
How is AVX2 not the same idea as a GPU instruction set? It's wide SIMD instructions used in an SPMD fashion, including gather support. What do you think is missing to make it the "same idea"?
I'm afraid you've been fooled by GPU marketing. They count each SIMD lane as a separate core. Counting the same way, mainstream Haswell would have 64 cores, not four.
Hence my point.

> Dude? I'm clueless? I've seen nothing on this 32 flops per cycle.
> Just because it's encoded doesn't mean it can execute in one cycle. I highly doubt Intel is going to move from averaging 1.5 flops* per cycle to 32 flops per cycle in one iteration. Oh, and btw: it's impossible to execute an FMA in one cycle. (Unless you have a time machine?)

It doesn't have to be completed in one cycle. It just has to start a new one each cycle. So please stop making a fool of yourself and check the Intel and AMD optimization manuals. Multiply, add, and fused multiply-add all have a throughput of one per cycle per execution port.
> You can't add till the multiply completes, and as far as I know Haswell is still a 14-stage pipeline.

That's the length of the pipeline from the fetch to the retirement stage. The execution latency is much shorter. And again, don't confuse latency with throughput.
> PS: 1.5 flops per cycle is amazing in actual code. I think the theoretical limit would be about 4, but that's a strict loop never loading or storing the result. If Intel makes over 2 I will be very impressed.

How do you think GPUs achieve TFLOPS? By executing multiple independent loop iterations in parallel! AVX2 allows the CPU to achieve the exact same thing. Hence Haswell can do 32 floating-point operations per core per cycle.
> I don't buy into hype; I look at the problems and the tools that provide the solutions. The general-purpose processor is a different beast than the massively parallel APU.

Clearly you do buy into hype. You even believe in magic. You believe there's something a GPU can do that a CPU can't, even though they're both made out of silicon. Is it multi-core? No, both have that. Is it wide SIMD vector execution units? No, both have that too. Is it gather support? No, Haswell will support that too.
> The most unique thing you will see in the next generation, and we're seeing it a bit in Trinity, is a flat memory model where you don't have to move massive amounts of data around to make use of all the resources.

Having a flat memory model doesn't make the data movement go away. It just hides it from the developer. The overhead is still there. And that's why heterogeneous computing does not scale.
AVX2 brings GPU technology into the CPU cores. It offers the same computing power, without the overhead or limitations. So there's plenty of reason to get excited over AVX2.
And yes, no CPU supports it yet. But neither does any APU today support a unified address space and context switches. That's only planned to be complete by 2014. So AVX2 will get there sooner.
GCC 4.7 supports AVX2, LLVM 3.1 supports AVX2 and Visual Studio 2012 supports AVX2. So compilers are well ahead of schedule too.
Because AMD has yet to announce that they'll support AVX2. It's inevitable that they will, but they'd rather have people use HSA instead. In other words they're betting the farm on other technology. Looking at what can already be achieved with AVX, and all the phenomenal things added by AVX2, that's really going to turn out to be a big mistake on AMD's part.
Just like NVIDIA realized, they should back away from making compromises to graphics performance for the sake of GPGPU. General purpose computing is what the CPU is for, and AVX2 adds a lot more oomph to it. Heterogeneous computing doesn't scale, due to the round-trip latency and bandwidth bottleneck. So the GPU should concentrate on pure graphics only, which is a one-way process.
It's not really OpenCL versus AVX2. It's homogeneous versus heterogeneous general purpose throughput computing. OpenCL is just one way to get code auto-vectorized. But AVX2 supports many more programming languages and frameworks. So it's not a question of one or the other. Indeed as you indicate, one is hardware and the other is software. That said, OpenCL may not survive long after homogeneous computing proves to be superior, since it will have to compete against other languages which have fewer restrictions.
AVX2 can be used by any language as-is. All you need is loops with independent iterations to auto-vectorize them. AVX2's gather support is critical in enabling that. And it means developers can use languages they already know and love, instead of trying to shoehorn things into the OpenCL framework and losing performance on heterogeneous architectures.
Sure, it depends on the underlying hardware whether it's a high performance implementation or not. But that's equally true for GPUs!
Haswell's implementation of AVX2 will have three 256-bit execution units per core. Two of these will be capable of FMA operations, resulting in a peak performance of 500 GFLOPS for a quad-core. On a performance/area metric that's actually quite close to any GPU. And you don't lose any of the existing CPU qualities like far superior sequential speed, large cache space per thread, branch prediction to prevent stalls, etc.
Last but not least, AVX2 is not the end of the road. The encoding format supports extending it up to 1024-bit registers. This can be used to lower the power consumption of the CPU's front-end and out-of-order execution, by executing AVX-1024 instructions over four cycles (i.e. the same ALU throughput while the rest of the pipeline consumes a quarter of the power). This would effectively make the CPU behave much more like a GPU in terms of power consumption. So heterogeneous computing won't have any benefits left.
> AVX2 isn't a competitor vs heterogeneous computing because it is a part of it... You clearly know that, yet you don't accept that fact.

Don't be silly. There's nothing heterogeneous about AVX2. It's a homogeneous part of the x86 ISA. I'm curious what makes you think I would "know" otherwise. You're probably just confusing SPMD computing with heterogeneous computing.
> AVX2 is an instruction set that can be handled by the CPU. Whether it originates from OpenCL, Java, .NET, or assembler code doesn't matter. If you want to use AVX2, all those will eventually allow you to use it. If you want to use it thoroughly you will need to optimize your code to do so, for either platform.

Yes, one of the strengths of AVX2 is that it can be used with any programming language. But what do you mean by "either platform"? AVX2 doesn't suffer from round-trip latencies or bandwidth bottlenecks because the data can stay in the CPU caches between sequential and parallel processing. Also it still benefits from CPU features like out-of-order execution and a large amount of cache per thread, which avoids many stalls. So contrary to heterogeneous computing, it takes far less effort, if any, to get good performance out of AVX2.
> OpenCL provides (being heterogeneous) the ability to combine all the power in your PC.

OpenCL itself is not heterogeneous. It's just an API and language for SPMD processing. It's software. Only the hardware can be heterogeneous or not.
> Haswell will have 2.5 times the GPU power of IvB in terms of raw calculations. Kaveri will reach 1 TFLOP on the GPU side. This is additional power ABOVE the CPU power (be it through SSE, AVX, or AVX2). (It's an AND-AND, not an OR.)

Only mobile Haswell chips with a GT3 will have that much GPU power. And no, you can't just add up the GFLOPS. Tasks can't migrate between the CPU and GPU. Of course it's technically feasible, but it's just not desirable, both due to software complexity and the overhead of moving things around.
> But seeing those benchmarks proves that OpenCL can improve performance a lot (huge improvements) in some applications. A lot more than is possible with next-gen CPUs alone.

You can't conclude that from those benchmarks at all. First of all, they're comparing a 125 Watt FX-8150 + 200 Watt 7970 against a 35 Watt mobile dual-core i5. And despite that, in several cases the GPU's lead really isn't that big, and you have to keep in mind that AVX2 doubles the arithmetic throughput but also requires 18 times fewer instructions for gather! In other cases the results can really only be explained by running blatantly unoptimized CPU code, just like NVIDIA has done with PhysX. They should at the very least have taken the effort to run OpenCL on the CPU to be able to tell how it compares.
Also let me remind you that a GPU's raw performance is meaningless, as proven by a 230 GFLOPS CPU beating a 3 TFLOPS GPU.
> You can't conclude that from those benchmarks at all. First of all, they're comparing a 125 Watt FX-8150 + 200 Watt 7970 against a 35 Watt mobile dual-core i5. And despite that, in several cases the GPU's lead really isn't that big, and you have to keep in mind that AVX2 doubles the arithmetic throughput but also requires 18 times fewer instructions for gather!

AVX2 with FMA will bring ~10% performance benefit with double the throughput... great... now let's all get a hot cecemel and a cookie.
> requires 18 times fewer instructions for gather

This is a completely useless statement without: a) the latency, and b) the OOO impact on those instructions.
> Will AVX2 bring any benefits for gaming if software developers use it? Thanks.

Not really. Games are not limited by that; it's mostly branches that kill gaming performance. Maybe if PhysX gets a decent compile you will see a performance jump for those parts, but that will be the case without AVX(2) also.
So you mean the 7870 with 2.5TFlops single precision (with FMA) isn't +180% faster than the quad core?
You like to take numbers as they suit you, don't you?
> Riek: AVX2 with FMA will bring ~10% performance benefit with double the throughput... great... now let's all get a hot cecemel and a cookie.
Actually, CPUs usually achieve closer to their FLOPS rating. The 10% performance gain is merely your speculation, right? That may be true in client applications, but for what we are talking about, gains close to theoretical aren't out of reach. The caveat is that it depends on the application. On Xeon E5, applications commonly gain 25-50% from enabling AVX, while some reach 100%. They say Sandy Bridge's AVX is limited by the load ports in most applications, but who's to say it won't change with yet another 2x improvement in FP throughput?
The way I see it though, whether it's a CPU trying to be good at graphics/parallel work or a GPU trying to be good at compute, everything ends up being a CPU. The companies can market it however they want, but it will eventually fit the original CPU definition.
> Will AVX2 bring any benefits for gaming if software developers use it? Thanks.

Yes, most definitely.
> So heterogeneous does not use the instructions available on a CPU when it addresses a CPU? Did you tell Intel and AMD that they shouldn't put in decoders anymore?

When only the CPU is being used, it's called homogeneous processing. Heterogeneous processing means multiple instruction set architectures are being used. Hence you can't "address" a CPU only with heterogeneous processing. As I suspected before, you are confusing SPMD computing with heterogeneous computing. One does not imply the other.
> You like to take numbers as they suit you, don't you?

No, I like to take relevant numbers from real-world cases. The GTX 680 is a popular cutting-edge GPU. Nobody can claim GPUs are more powerful than CPUs and ignore this utterly pathetic result for a high-end NVIDIA GPU. The fact that the HD 7970 is faster doesn't make up for that. And NVIDIA will likely continue to optimize its consumer GPUs for graphics instead of GPGPU. They realize there is no future in consumer GPGPU when AVX2 is clearly going to be superior.
> Raw performance is irrelevant as soon as you add FMA to the mix (which you will eagerly do for Haswell also... despite the double throughput it would bring only ~10% extra performance... that will make Haswell flunk in efficiency also).

FMA improves performance in floating-point intensive workloads way more than 10%. It should be more like 50%.
> This is a completely useless statement without:
> a) latency
> b) the OOO impact on those instructions

a) The most likely gather implementation will be capable of accessing one cache line per clock cycle. Also, it should consist of three uops, but they execute on different ports (look at vblendvps for how that would work). In any case the exact latency depends on the memory access pattern. However, gather is invaluable in GPUs, so there's no reason to expect any less of an effect in CPUs.
> And NVIDIA will likely continue to optimize its consumer GPUs for graphics instead of GPGPU. They realize there is no future in consumer GPGPU when AVX2 is clearly going to be superior.
Yes... that's why they just announced the GK110 Tesla cards.