As I said, a single GCN CU has the same per-clock throughput as four Haswell cores. You can pack 12 CUs into roughly the same transistor budget as four Haswell cores (without LLC).
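In rough numbers (counting peak single-precision throughput, with an FMA as two FLOPs, and ignoring clock speeds):

```c
#include <stdio.h>

int main(void) {
    /* Peak single-precision FLOPs per clock, FMA counted as 2 FLOPs. */
    int gcn_cu       = 4 * 16 * 2;     /* 1 GCN CU: 4 SIMD units x 16 lanes       = 128 */
    int haswell_quad = 4 * 2 * 8 * 2;  /* 4 Haswell cores: 2 FMA ports x 8 lanes  = 128 */
    printf("GCN CU: %d FLOPs/clock, four Haswell cores: %d FLOPs/clock\n",
           gcn_cu, haswell_quad);
    return 0;
}
```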
That's meaningless because those compute units are completely helpless by themselves! It's like comparing an F1 engine against a family car. Sure, the former is small and powerful, but it's not going anywhere without a frame, wheels, steering wheel, etc. So again, you have to look at the system as a whole to be able to compare things.
And when you start doing that, it becomes easy to see that CPUs can increase their vector throughput by a very large factor with only a modest increase in die size. FMA support barely impacts die area, yet it already doubles the peak throughput. Beyond that, the die-size increases will be noticeable but still relatively minor. Meanwhile, GPUs can no longer increase their compute density. Sure, Kepler increased the theoretical compute density, but it has laughable compute performance in practice. So GPUs have reached their limit, while CPUs have lots of potential left.
You can get more transistors on a chip with every new node (Moore's law), but facing the utilization wall, you can't lower the energy required to switch them as much as you would need. The only way around this is to design the chip for a much lower pJ/instruction figure.
Which is exactly why I mentioned executing AVX-1024 instructions over multiple cycles! Today's CPUs spend the majority of their power consumption on fetching/decoding/scheduling instructions, not on their actual execution. So the trick is to use wider vectors and execute the instructions over multiple cycles. GCN does the exact same thing.
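To make that concrete, here is a purely illustrative sketch (no such instruction exists; the type and function names are made up): a 1024-bit register gets processed on today's 256-bit datapath over four back-to-back passes, so the fetch/decode/schedule energy is paid once for four cycles' worth of execution.

```c
#include <immintrin.h>

/* Hypothetical 1024-bit register, modelled as four 256-bit quarters. */
typedef struct { __m256 q[4]; } v1024;

/* One "AVX-1024" FMA, sequenced over the existing 256-bit FMA units.
   In hardware it would be fetched, decoded and scheduled once, then
   stream through an execution port for four consecutive cycles. */
static v1024 fma1024(v1024 a, v1024 b, v1024 c) {
    v1024 r;
    for (int i = 0; i < 4; ++i)
        r.q[i] = _mm256_fmadd_ps(a.q[i], b.q[i], c.q[i]);
    return r;
}
```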
AVX2 is not for CPU cores.
You can't be serious. It's part of
the x86 instruction set.
They implement it now because they must, but AVX2 has some problems:
- It only supports gather, and not scatter. I know gather is more important, but scatter is also very useful.
That's a non-issue. Scatter operations are very rare in data parallel algorithms. For the few cases where they're useful, AVX2 features a versatile permutation instruction. And besides, they can still implement a fully generic scatter instruction later. So this really isn't an argument against homogeneous computing.
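For example (a minimal sketch; the helper names and index layout are just placeholders), AVX2's hardware gather next to a scatter emulated with a spill and eight scalar stores:

```c
#include <immintrin.h>

/* AVX2 gather: load 8 floats from arbitrary indices with one instruction. */
static __m256 gather8(const float *base, __m256i idx) {
    return _mm256_i32gather_ps(base, idx, 4);   /* scale = sizeof(float) */
}

/* Emulated scatter: spill the vector, then do 8 scalar stores.
   Slower than a native scatter, but rarely on the critical path. */
static void scatter8(float *base, const int idx[8], __m256 v) {
    float tmp[8];
    _mm256_storeu_ps(tmp, v);
    for (int i = 0; i < 8; ++i)
        base[idx[i]] = tmp[i];
}
```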
- To feed two 256-bit FMA units in a core you'd need 6 reads and 2 writes per clock. Haswell only has 4 ports, so this is a limitation.
That's just plain wrong. Operands can be read from the bypass network, register file, and cache ports. Agner Fog found
no practical limit to the number of register reads for Sandy Bridge, meaning it can already sustain 6 reads and 3 writes per clock.
GPUs are much more flexible in these matters, and you can achieve their peak performance on many more workloads.
No. Anyone who has ever done any GPGPU programming can tell you that more often than not you can't get anywhere near peak performance. Again, just look at these pathetic results. The HD 7970, which is rated at 3800 GFLOPS, is only three times faster than a quad-core CPU rated at 230 GFLOPS. It's insane to call GPUs flexible.
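Run those numbers yourself (using the figures above, and assuming the CPU gets reasonably close to its own peak):

```c
#include <stdio.h>

int main(void) {
    double gpu_peak = 3800.0;  /* HD 7970 rated GFLOPS        */
    double cpu_peak =  230.0;  /* quad-core CPU rated GFLOPS  */
    double observed =    3.0;  /* speedup actually measured   */

    double theoretical = gpu_peak / cpu_peak;             /* ~16.5x on paper */
    printf("theoretical advantage: %.1fx\n", theoretical);
    printf("fraction realized:     %.0f%%\n", 100.0 * observed / theoretical);
    return 0;
}
```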
Skylake will have Larrabee cores (probably with 1024-bit vectors) extending 2 or 4 main cores. This is a heterogeneous route.
Wrong again. Intel has declared that it will consolidate VEX and MVEX. That's a homogeneous route.
Suppose that HSA has no future. We would still need a virtual ISA with a good open infrastructure to make programming data-parallel code much easier than OpenCL C/C++.
No we don't. AVX2+ can be used by
any programming language through auto-vectorization. No need for any new virtual ISA. AMD is chasing a pipe dream if it thinks another software layer will fix all the fundamental problems with heterogeneous computing.
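For instance, a plain loop like this one, with no intrinsics and no new language, is turned into 256-bit FMA code by GCC, Clang and ICC at -O3 with the right -m flags (-mavx2 -mfma):

```c
/* saxpy: the auto-vectorizer maps this onto AVX2/FMA instructions. */
void saxpy(float a, const float *restrict x, float *restrict y, int n) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```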
You are talking nonsense with vertex processing and other things that are insignificant for GPGPU.
I don't think you understood my analogy. You say heterogeneous computing is superior because the classic CPU and GPU are each more optimized for a specific task. But the same was true for vertex and pixel pipelines on old GPUs, and yet they unified them into homogeneous shader cores. So clearly something is wrong about your theory. What you lose in "optimization" for a more specific task is very minor compared to the advantages of unification!
Nothing is preventing future CPU architectures from achieving high enough throughput to become superior to a heterogeneous architecture.
If the code is data-parallel, then run it on the iGPU; if it's serial or task-parallel, then run it on the CPU cores. This is the only aspect that should be considered.
It's not quite that simple. First of all, code is never completely data-parallel or completely sequential. There's a complex mix of ILP, DLP and TLP, which varies over time. Secondly, data transfers between heterogeneous cores take precious bandwidth, and synchronization between them has high latency. So developers are basically forced to 'categorise' code and ensure minimal interaction between those sections of code. This is not an easy task (read: it takes lots of time and thus money), and you always lose performance, either by switching between the core types too often or by running code on a core type that is less optimal for it. This is inherent to heterogeneous computing.
With a homogeneous architecture these fundamental issues disappear. The cores can efficiently switch between parallel and sequential code from one cycle to the next, and they can deal with any mix of ILP/DLP/TLP.
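A contrived example of what that means in practice (the function is made up purely for illustration): vector-friendly work and a serial, data-dependent decision alternate every iteration, so a heterogeneous system either pays a CPU<->GPU round trip per iteration or runs one of the two phases on the wrong kind of core.

```c
/* Illustrative only: data-parallel and sequential phases tightly interleaved. */
float process(const float *block, int blocks, int n) {
    float state = 0.0f;                  /* serial dependency across iterations */
    for (int b = 0; b < blocks; ++b) {
        float sum = 0.0f;
        for (int i = 0; i < n; ++i)      /* data-parallel: vectorizable */
            sum += block[b * n + i] * block[b * n + i];
        /* sequential, data-dependent control deciding the next step */
        state = (sum > state) ? sum : state * 0.5f;
    }
    return state;
}
```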
You can't feed Haswell as efficiently as you can feed a GPU.
Sure you can.
I explained the limitations of AVX/AVX2 earlier.
Those weren't any significant limitations. And the ISA will keep evolving to optimize the CPU cores for high throughput.
I think I understand your problem. You think in terms of the theoretical performance you can get from the hardware.
Quite the contrary. In theory a heterogeneous architecture has higher peak performance. In practice a unified/homogeneous architecture proves to be superior due to load balancing, no migration overhead, and improved programmability.
GPUs are bad at executing branchy code if the input is mapped in SPMD-on-SIMD fashion (CPUs don't like this either), but if you don't do that, then it is easier to compile a pre-vectorized input for a GPU.
GCN does SPMD-on-SIMD. So what alternative are you talking about?
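For reference, this is what SPMD-on-SIMD does with a branch (a minimal AVX sketch): both sides are computed for all lanes and the results are blended under a per-lane mask, which is exactly why heavily divergent code wastes lanes on any SIMD machine, GPU or CPU.

```c
#include <immintrin.h>

/* SPMD source:   r = (x > 0) ? x * 2 : -x;
   SIMD mapping:  evaluate both arms for all 8 lanes, then blend by mask. */
static __m256 branch8(__m256 x) {
    __m256 mask = _mm256_cmp_ps(x, _mm256_setzero_ps(), _CMP_GT_OS);
    __m256 then = _mm256_mul_ps(x, _mm256_set1_ps(2.0f));
    __m256 els  = _mm256_sub_ps(_mm256_setzero_ps(), x);
    return _mm256_blendv_ps(els, then, mask);  /* lanes with mask set take 'then' */
}
```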