Interesting article on AMD Fusion.


Olikan

Platinum Member
Sep 23, 2011
2,023
275
126
That is not correct.

There's nowhere near even a single order of magnitude difference, and CPUs continue to catch up. In fact the mainstream desktop Haswell chip will be capable of close to 500 GFLOPS on the CPU cores and only 400 GFLOPS on the iGPU! So it starts to make a lot of sense to fully unify them and have twice the number of homogeneous cores.

....

You just showed us a graph where a GPU (Cypress @ 40nm) has 3 times the perf/mm² and ~50% better FLOPS/watt than a server CPU (SB-EP @ 32nm)
 
Last edited:

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
Don't compare CPU flops to GPU flops, or even 1 flop on one GPU to 1 flop on another. Biggest mistake ever.

One example is bitcoin mining, where AMD GPUs win simply because of 1 extra instruction (BIT_ALIGN_INT) compared to NVIDIA, not because of flops itself.
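To make that concrete, an editorial sketch (plain C++, ours, not from the post): SHA-256, the hash behind bitcoin, is dominated by 32-bit rotates. BIT_ALIGN_INT lets AMD GPUs do a rotate in one instruction, where hardware without it needs a shift/shift/or sequence:

Code:
#include <cstdint>

// Rotate-right, the hot operation in SHA-256 (for 0 < n < 32).
// Without a rotate or funnel-shift instruction this is three ops
// (two shifts and an OR); with BIT_ALIGN_INT it collapses to one.
static inline std::uint32_t rotr32(std::uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}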
 
Last edited:

zlatan

Senior member
Mar 15, 2011
580
291
136
I'm sorry but that's complete nonsense. Haswell will double the throughput with AVX2, with no impact on latency.

That is not true either. AVX can be extended to 1024-bit instructions, and by executing these on 256-bit units over four cycles the throughput stays the same but a lot of latency can be hidden. Also note that Xeon Phi is an in-order architecture, yet it only needs 4-way threading to hide latency, not dozens of threads. Having fewer threads helps cache coherence and thus hit rate, which in turn reduces the average latency that needs to be hidden.

Haswell is not optimized for high throughput. A single GCN CU has the same per-clock throughput as four Haswell cores. And GCN has more flexible vector execution units, so it executes branchy code more efficiently.

Xeon Phi will not scale as well as a modern GPU (Tahiti, for example). You will see when it releases. There is a reason NVIDIA and AMD are avoiding the Larrabee route: they have much more experience than Intel at creating data-parallel processors, and they know how to design a fully scalable multiprocessor for high throughput. They have been doing this for years now. Intel is just getting started, and Xeon Phi is the third Larrabee iteration. And I can tell you it's far from perfect. GK110 will be more useful for the HPC market.
 
Last edited:

zlatan

Senior member
Mar 15, 2011
580
291
136
....

You just showed us a graph where a GPU (Cypress @ 40nm) has 3 times the perf/mm² and ~50% better FLOPS/watt than a server CPU (SB-EP @ 32nm)
And the graph only represents DP performance. When it comes to SP, GPUs raise the bar even higher.
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
You just showed us a graph where a GPU (Cypress @ 40nm) has 3 times the perf/mm² and ~50% better FLOPS/watt than a server CPU (SB-EP @ 32nm)
Indeed, only 3 times better and 50% better. Haswell will double the peak throughput, with a negligible impact on area and power consumption.

My point is that GPUs are absolutely not "orders of magnitude" faster, and CPUs are rapidly catching up. AVX2 is not the end of this evolution, it's barely the beginning. Nothing is preventing them from creating a homogeneous architecture that achieves higher total performance and is easier to program. So the hard truth is that HSA has no long-term future.

Unification of the GPU's vertex and pixel pipeline has allowed much better load balancing and enabled more advanced graphics and general-purpose computing. But the GPU's high latency still severely limits the range of algorithms that can run efficiently. With a low-latency high-throughput homogeneous architecture there would be a world of new possibilities.

So the only question is, who will get there first?
 

Olikan

Platinum Member
Sep 23, 2011
2,023
275
126
Indeed, only 3 times better and 50% better. Haswell will double the peak throughput, with a negligible impact on area and power consumption.

Kinda funny... because GCN doubled the DP performance over Cypress...

So, yeah... GPUs will still be 3 times faster than AVX2, even with a node disadvantage...
 

piesquared

Golden Member
Oct 16, 2006
1,651
473
136
With GPUs having orders-of-magnitude faster performance in many workloads, and many times higher peak FLOPS, it seems AMD is now in the driver's seat. Of course Intel is hard at work patching on new instructions, desperate to keep the old, antiquated x86 paradigm, and consequently its monopoly, intact. But judging by the big industry players joining HSA, the industry may be ready to move on as well. Definitely looking forward to a heterogeneous computing future.
 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
AMD may be in the driver's seat, but there are no passengers. Everybody else wants to ride with Intel.
 
Aug 11, 2008
10,451
642
126
With GPUs having orders-of-magnitude faster performance in many workloads, and many times higher peak FLOPS, it seems AMD is now in the driver's seat. Of course Intel is hard at work patching on new instructions, desperate to keep the old, antiquated x86 paradigm, and consequently its monopoly, intact. But judging by the big industry players joining HSA, the industry may be ready to move on as well. Definitely looking forward to a heterogeneous computing future.

Yea, guess I will go run Excel tomorrow on my GPU.

They may be on to something and may gain significant market share. Or it may just be another pipe dream they are chasing that doesn't have the software to utilize the hardware and will never be adopted in a large percentage of the market. It just seems to me AMD should build something outstanding at utilizing the current environment, not always chase some new tech that may or may not be adopted in the future. It seems kind of like building a network of stations to sell hydrogen for fuel-cell cars: yes, it might be better theoretically, but the fact is people are still using gasoline.
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
Haswell is not optimized for high throughput.
That's like saying a GPU with unified shader cores is not optimized for vertex processing...

You have to look at the system as a whole to determine whether it is optimized for all the tasks. High throughput is just one aspect. GPUs are still completely helpless without a CPU to run the operating system, the application code, and the graphics driver. So it would be totally wrong to look at the GPU in isolation and try to compare it directly to a homogeneous CPU.

So considering that Haswell's CPU cores can do way more in total than any GPU, it is very much optimized for high throughput too. And more importantly, there's still lots of room for future improvement. AVX-1024 can dramatically increase the throughput and/or lower the power consumption.
And GCN has more flexible vector execution units, so it executes branchy code more efficiently.
GCN has 64-element wavefronts, and thus has very unfavorable branch granularity. Haswell's AVX-256 is only 8 elements wide, making it more efficient at branchy code. Even AVX-1024 would be 'only' 32 elements.
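A small editorial sketch of why lane count matters (not from the original post): vector code computes both sides of a branch and blends the results per lane, so the wider the vector, the more often lanes diverge and both paths burn cycles.

Code:
#include <immintrin.h>

// Vectorized "x = (x > t) ? x - t : 2*x": both paths run for all 8 lanes
// and a mask picks the result per lane. A 64-lane GCN wavefront blends
// the same way, but across eight times as many elements.
__m256 step(__m256 x, __m256 t)
{
    __m256 gt = _mm256_cmp_ps(x, t, _CMP_GT_OQ); // per-lane compare
    __m256 a  = _mm256_sub_ps(x, t);             // "then" path
    __m256 b  = _mm256_add_ps(x, x);             // "else" path
    return _mm256_blendv_ps(b, a, gt);           // mask-select per lane
}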
 

zlatan

Senior member
Mar 15, 2011
580
291
136
Indeed, only 3 times better and 50% better. Haswell will double the peak throughput, with a negligible impact on area and power consumption.

My point is that GPUs are absolutely not "orders of magnitude" faster, and CPUs are rapidly catching up. AVX2 is not the end of this evolution, it's barely the beginning. Nothing is preventing them from creating a homogeneous architecture that achieves higher total performance and is easier to program. So the hard truth is that HSA has no long-term future.

Unification of the GPU's vertex and pixel pipeline has allowed much better load balancing and enabled more advanced graphics and general-purpose computing. But the GPU's high latency still severely limits the range of algorithms that can run efficiently. With a low-latency high-throughput homogeneous architecture there would be a world of new possibilities.

So the only question is, who will get there first?

As I said, a single GCN CU has the same per-clock throughput as four Haswell cores. You can pack 12 CUs into roughly the same transistor budget as four Haswell cores (without LLC). Even accounting for clock differences, the GPU is still 3-4 times faster, with 40-60% lower power consumption.

No, GPUs are raising the bar. Your speculation only counts Moore's law; it ignores the utilization wall and the death of Dennard scaling. One of the main reasons GPUs are rising is the laws of physics. You get more transistors in a chip with every new node (Moore's law), but because of the utilization wall you can't lower the energy required to switch them as much as you need. The only way to avoid this is to design the chip for a much lower pJ/instruction rate.

Intel knows this, and they are also going the heterogeneous route. AVX2 is not for CPU cores. They are implementing it now because they must, but AVX2 has some problems:
- It only supports gather, not scatter. I know gather is more important, but scatter is also very useful (a sketch of the asymmetry follows below).
- To feed two 256-bit FMA units in a core you need 6 reads and 2 writes per clock. Haswell only has 4 ports, so this is a limitation.
GPUs are much more flexible in these respects, and you can reach their peak performance on far more workloads.
In a low-latency core Intel won't implement wider vector units. Skylake will have Larrabee cores (probably with 1024-bit vectors) alongside 2 or 4 main cores. That is a heterogeneous route.
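To make the gather/scatter point above concrete, a minimal editorial sketch with AVX2 intrinsics (the helper name is ours, not part of any API):

Code:
#include <immintrin.h>

// AVX2 gather: loads base[idx[i]] into lane i in one instruction
// (VGATHERDPS); the scale operand (4) is sizeof(float).
__m256 gather8(const float* base, __m256i idx)
{
    return _mm256_i32gather_ps(base, idx, 4);
}
// The reverse, base[idx[i]] = v[i] (scatter), has no AVX2 instruction
// and must be emulated lane by lane with scalar stores.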

Suppose HSA has no future. We would still need a virtual ISA with a good open infrastructure to make programming data-parallel code much easier than OpenCL-C/C++. HSA is ideal because it's open to every foundation member, it has a very formal execution and memory model, and it has a vendor extension mechanism, no predication...

This is what programmers want, and it's open. Breaking down a company's monopoly is not a reason to avoid this route; developers just don't care about that. You can see it now with OpenCL: WinZip 16.5 and some VLC functions are only accelerated on AMD hardware. If a company doesn't have the tools to create good support for its hardware, developers just don't care about it.
 
Last edited:

zlatan

Senior member
Mar 15, 2011
580
291
136
That's like saying a GPU with unified shader cores is not optimized for vertex processing...

You have to look at the system as a whole to determine whether it is optimized for all the tasks. High throughput is just one aspect. GPUs are still completely helpless without a CPU to run the operating system, the application code, and the graphics driver. So it would be totally wrong to look at the GPU in isolation and try to compare it directly to a homogeneous CPU.

So considering that Haswell's CPU cores can do way more in total than any GPU, it is very much optimized for high throughput too. And more importantly, there's still lots of room for future improvement. AVX-1024 can dramatically increase the throughput and/or lower the power consumption.

GCN has 64-element wavefronts, and thus has very unfavorable branch granularity. Haswell's AVX-256 is only 8 elements wide, making it more efficient at branchy code. Even AVX-1024 would be 'only' 32 elements.

You are talking nonsense about vertex processing and other things that are insignificant for GPGPU.
If the code is data-parallel, run it on the iGPU; if it's serial or task-parallel, run it on the CPU cores. This is the only aspect that should be considered.
You can't feed Haswell as efficiently as you can feed a GPU. I explained the limitations of AVX2 earlier.

I think I understand your problem. You think in theories of what you can get from the hardware. Programming is not that easy. There is a huge difference between what the hardware can do in theory and how you can program it. GPUs are bad at executing branchy code if the input was mapped in SPMD-on-SIMD fashion (CPUs don't like this either), but if you avoid that mapping it is easier to compile pre-vectorized input to a GPU.
 
Last edited:

happysmiles

Senior member
May 1, 2012
344
0
0
In conclusion:

AVX2 vs HSA.

The conclusion of the same arguments across multiple threads:

Wait until Haswell releases and see whether AVX2 or GPGPU becomes the de facto standard.
 

youshotwhointhe

Junior Member
Aug 23, 2012
11
0
0
That is not correct.

There's nowhere near even a single order of magnitude difference, and CPUs continue to catch up. In fact the mainstream desktop Haswell chip will be capable of close to 500 GFLOPS on the CPU cores and only 400 GFLOPS on the iGPU! So it starts to make a lot of sense to fully unify them and have twice the number of homogeneous cores.


This post is so laughable I had to register an account so I could respond:

1) This chart is only for double precision, a known weakness of GPUs. Also, generally not very important in consumer applications.

2) The Cypress card is the equivalent of a 5870 (released Sept. 2009). A fair comparison would be to Lynnfield (also Sept. 2009). That would look even worse than the Westmere chips shown in the chart (probably making the gap an order of magnitude even for double precision).

3) Theoretical FLOPS and real FLOPS are completely different. This hurts both low-latency and high-throughput designs, in different scenarios. Running branchy code on a GPU can often make it perform worse than a CPU; however, a CPU will never outperform a GPU on a data-intensive, embarrassingly parallel problem.

4) Why would Intel be investing huge amounts of money (and die area) into their own OpenCL compatible throughput architectures? Have they not heard how many FLOPS Haswell has?
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
As I said, a single GCN CU has the same per-clock throughput as four Haswell cores. You can pack 12 CUs into roughly the same transistor budget as four Haswell cores (without LLC).
That's meaningless because those compute units are completely helpless by themselves! It's like comparing an F1 engine against a family car. Sure, the former is small and powerful, but it's not going anywhere without a frame, wheels, steering wheel, etc. So again, you have to look at the system as a whole to be able to compare things.

And when you start doing that it becomes easy to see that CPUs can increase their vector throughput by a very large factor with only a modest increase in die size. FMA support barely has an impact but already doubles the peak throughput. Beyond that the increase will be noticeable but still relatively minor. Meanwhile GPUs can no longer increase their compute density. Sure, Kepler increased the theoretical compute density, but it has laughable compute performance in practice. So GPUs have reached their limit, while CPUs have lots of potential left.
You get more transistors in a chip with every new node (Moore's law), but because of the utilization wall you can't lower the energy required to switch them as much as you need. The only way to avoid this is to design the chip for a much lower pJ/instruction rate.
Which is exactly why I mentioned executing AVX-1024 instructions over multiple cycles! Today's CPUs spend the majority of their power consumption on fetching/decoding/scheduling instructions, not on their actual execution. So the trick is to use wider vectors and execute the instructions over multiple cycles. GCN does the exact same thing.
AVX2 is not for CPU cores.
You can't be serious. It's part of the x86 instruction set.
They are implementing it now because they must, but AVX2 has some problems:
- It only supports gather, not scatter. I know gather is more important, but scatter is also very useful.
That's a non-issue. Scatter operations are very rare in data parallel algorithms. For the few cases where they're useful, AVX2 features a versatile permutation instruction. And besides, they can still implement a fully generic scatter instruction later. So this really isn't an argument against homogeneous computing.
- To feed two 256-bit FMA units in a core you need 6 reads and 2 writes per clock. Haswell only has 4 ports, so this is a limitation.
That's just plain wrong. Operands can be read from the bypass network, register file, and cache ports. Agner Fog found no practical limit to the number of register reads for Sandy Bridge, meaning it can already sustain 6 reads and 3 writes per clock.
GPUs are much more flexible in these respects, and you can reach their peak performance on far more workloads.
No. Anyone who has ever done any GPGPU programming can tell you that more often than not you can't get anywhere near peak performance. Again, just look at these pathetic results. The HD 7970, which is rated at 3800 GFLOPS, is only three times faster than a quad-core CPU with 230 GFLOPS. It's insane to call GPUs flexible.
Skylake will have Larrabee cores (probably with 1024-bit vectors) alongside 2 or 4 main cores. That is a heterogeneous route.
Wrong again. Intel has declared that it will consolidate VEX and MVEX. That's a homogeneous route.
Suppose HSA has no future. We would still need a virtual ISA with a good open infrastructure to make programming data-parallel code much easier than OpenCL-C/C++.
No we don't. AVX2+ can be used by any programming language through auto-vectorization. No need for any new virtual ISA. AMD is chasing a pipe dream if they think another software layer will fix all the fundamental problems with heterogeneous computing.
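As an editorial illustration (whether a given compiler vectorizes this exact loop is compiler-dependent): a plain loop with an indexed load is the shape an AVX2-aware vectorizer can map onto gather and FMA, e.g. under gcc's -O3 -mavx2.

Code:
// Plain scalar C++, no intrinsics. An AVX2 auto-vectorizer can turn the
// indexed load into VGATHERDPS and the multiply-add into FMA, processing
// eight iterations per step. __restrict rules out aliasing that would
// otherwise block vectorization.
void saxpy_indexed(float* __restrict out, const float* __restrict a,
                   const int* __restrict idx, float s, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] += s * a[idx[i]];
}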
You are talking nonsense about vertex processing and other things that are insignificant for GPGPU.
I don't think you understood my analogy. You say heterogeneous computing is superior because the classic CPU and GPU are each more optimized for a specific task. But the same was true for vertex and pixel pipelines on old GPUs, and yet they unified them into homogeneous shader cores. So clearly something is wrong about your theory. What you lose in "optimization" for a more specific task is very minor compared to the advantages of unification!

Nothing is preventing future CPU architectures from achieving high enough throughput to become superior to a heterogeneous architecture.
If the code is data-parallel, run it on the iGPU; if it's serial or task-parallel, run it on the CPU cores. This is the only aspect that should be considered.
It's not quite that simple. First of all code is never completely data-parallel or sequential. There's a complex mix of ILP, DLP and TLP, which varies over time. Secondly, data transfers between heterogeneous cores take precious bandwidth, and synchronization between them has high latency. So basically developers are forced to 'categorise' code and ensure minimal interaction between those sections of code. This is not an easy task (read: takes lots of time and thus money), and you always lose performance either by switching between the core types too often or by running code on a core type that is less optimal for it. This is inherent to heterogeneous computing.

With a homogeneous architecture these fundamental issues disappear. They can efficiently switch between parallel or sequential code from one cycle to the next and can deal with any level of ILP/DLP/TLP.
You can't feed Haswell as efficiently as you can feed a GPU.
Sure you can.
I explained the limitations of AVX2 earlier.
Those weren't any significant limitations. And it will continue evolving to optimize the CPU cores for high throughput.
I think I understand your problem. You think in theories of what you can get from the hardware.
Quite the contrary. In theory a heterogeneous architecture has higher peak performance. In practice a unified/homogeneous architecture proves to be superior due to load balancing, no migration overhead, and improved programmability.
GPUs are bad at executing branchy code if the input was mapped in SPMD-on-SIMD fashion (CPUs don't like this either), but if you avoid that mapping it is easier to compile pre-vectorized input to a GPU.
GCN does SPMD-on-SIMD. So what alternative are you talking about?
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
In conclusion:

AVX2 vs HSA.
Close. It's more about AVX2+ vs HSA.

Developers won't adopt technology unless it delivers sufficient ROI at low risk. Which also means that this technology has to scale, so it will stick around long enough. AVXn most definitely has a bright future. Haswell's AVX2 doubles the peak floating-point and integer throughput, and brings gather support (which used to be a GPU-specific feature). AVX3 can easily increase the throughput even more (AVX can be extended to 1024-bit), and other GPU compute features can be implemented into the CPU cores too.

HSA cannot possibly have a long-term future. Ever since the inception of GPGPU and multi-core CPUs, the GPU and CPU have grown closer together (both physically and in terms of capabilities). If heterogeneous computing was the future then they'd remove vector processing from the CPU and go back to single-core, and make the GPU non-uniform again. Obviously the contrary is happening. The CPU is gaining in throughput, and the GPU continues to become more programmable. So it's inevitable that sooner or later all computing will be done by homogeneous cores.
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
This post is so laughable I had to register an account so I could respond:

1) This chart is only for double precision, a known weakness of GPUs. Also, generally not very important in consumer applications.
I fully realize that. But it still adequately illustrates that GPUs are nowhere near "orders of magnitude" faster than CPUs. Now that was hilarious!

Also note that this chart used the complete die size for Ivy Bridge, including the iGPU, while only the performance of the CPU cores was counted. Keep in mind that the system agent isn't part of the CPU cores either. And again, the GPU is worthless without the CPU core too. Finally, Haswell's FMA support will practically double those peak ratings. So double precision or not, the results for Haswell's CPU cores will be very impressive.
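For reference, the arithmetic behind such peak ratings (the 3.5 GHz clock here is our assumed figure for a quad-core desktop part, not one from the thread):

peak SP GFLOPS = cores × FMA units per core × vector lanes × 2 FLOPs per FMA × clock (GHz)
Haswell, assuming 4 cores @ 3.5 GHz with two 256-bit FMA units: 4 × 2 × 8 × 2 × 3.5 ≈ 448 GFLOPS

which is where the "close to 500 GFLOPS on the CPU cores" figure comes from.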

And the real killer is that things can/will continue to improve in favor of the CPU. It can't be a coincidence that the AVX encoding already reserved bits to extend it up to 1024-bit. With all due respect you'd have to be a moron to think this isn't a very severe threat to HSA.
2) The Cypress card is the equivalent of a 5870 (released Sept. 2009). A fair comparison would be to Lynnfield (also Sept. 2009). That would look even worse than the Westmere chips shown in the chart (probably making the gap an order of magnitude even for double precision).
Please read the article to understand the choice of chips that were compared. But don't fixate on it. I only used it to debunk the orders of magnitude myth. If you start looking beyond that it's blatantly obvious that HSA is in trouble and AVX2 is the beginning of homogeneous high-throughput computing.
3) Theoretical FLOPS and real FLOPS are completely different. This hurts both low-latency and high-throughput designs, in different scenarios. Running branchy code on a GPU can often make it perform worse than a CPU; however, a CPU will never outperform a GPU on a data-intensive, embarrassingly parallel problem.
That's rubbish. First of all, by executing AVX-1024 on 256-bit units in four cycles, combined with Hyper-Threading and out-of-order execution, it's going to be practically impossible to get higher utilization with a GPU. So never say never. Anything a GPU architecture can do, a CPU architecture can be made capable of too, and that's exactly the direction Intel is heading.

Furthermore, there's far more that will make a GPU choke than just branchy code. For instance they can only cope with a certain number of cache misses. Beyond that ratio, they run out of work and stall until the data is fetched. And given that the tiny caches are shared by many threads, it doesn't take a lot to get into a situation like that. CPUs have out-of-order execution to continue executing the same thread when a miss occurs, which means they get by with fewer threads which in turn means they have lots of cache space and a lower miss rate. Secondly, GPUs slow down badly when using lots of registers. They have to lower the number of threads to make more registers available but that means they get worse at hiding cache misses. CPUs don't face that problem because they have very fast and large caches for spilling and restoring registers. And then there's a slew of architecture specific bottlenecks, such as Kepler's insanely low L1D cache bandwidth.

Writing GPGPU code which runs efficiently on the majority of GPU architectures can be hell, which is a very big limiting factor on the adoption of heterogeneous computing. Even if AMD can create a reasonably efficient HSA implementation, lots of people will still have GPUs such as Kepler which focus more on graphics performance. So AVX2+ will be a far more reliable source of throughput computing performance.
4) Why would Intel be investing huge amounts of money (and die area) into their own OpenCL compatible throughput architectures? Have they not heard how many FLOPS Haswell has?
I don't see them investing huge amounts of money into OpenCL on the iGPU. Ivy Bridge was the first to support it at all, only months ago. Mainstream desktop Haswell will be limited to GT2, and mobile Haswell will clearly only get GT3 because of retina displays, not because of compute capabilities. And I have yet to hear about any compute-specific enhancements to the iGPU architecture. So it seems pretty low on their list of priorities. Meanwhile AVX2 is clearly all about achieving high throughput from the CPU cores.
 

sefsefsefsef

Senior member
Jun 21, 2007
218
1
71
How does a lack of scatter support affect the abilities of auto-vectorizing compilers? Doesn't it pretty much make it so you can't auto-vectorize code that works on non-regular data?
 

zlatan

Senior member
Mar 15, 2011
580
291
136
In conclusion:

AVX2 vs HSA.

The conclusion of the same arguments across multiple threads:

Wait until Haswell releases and see whether AVX2 or GPGPU becomes the de facto standard.
There's no need to compare these; they are not created for the same problem.
AVX2 is an ISA extension for Intel's heterogeneous approach.
HSA is a platform infrastructure for easy heterogeneous and data-parallel programming. With HSA Bolt your code won't be longer than serial C code, but it will run almost as fast as much longer OpenCL-C/C++ code.
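For a sense of scale, a rough sketch in the style of AMD's Bolt library (written from memory of Bolt's documentation; treat the header path and the BOLT_FUNCTOR macro as assumptions):

Code:
#include <bolt/cl/transform.h>
#include <vector>

// Bolt mirrors the STL; the macro wraps the functor so Bolt can also
// compile it as OpenCL and dispatch it to the GPU.
BOLT_FUNCTOR(Saxpy,
    struct Saxpy
    {
        float a;
        Saxpy(float a_) : a(a_) {}
        float operator()(const float& x, const float& y) const
        {
            return a * x + y;
        }
    };
);

int main()
{
    std::vector<float> x(1 << 20, 1.0f), y(1 << 20, 2.0f), z(1 << 20);
    // An STL-sized call site, versus pages of host code in raw OpenCL-C.
    bolt::cl::transform(x.begin(), x.end(), y.begin(), z.begin(), Saxpy(2.0f));
}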
 

zlatan

Senior member
Mar 15, 2011
580
291
136
BenchPress: First of all, I'm a programmer, so I know how GPGPU programming works. But your pro-Intel glasses blind you, so I won't try to convince you. In the next few years you will see how Intel goes the heterogeneous route with Skylake.
 

Olikan

Platinum Member
Sep 23, 2011
2,023
275
126
There's no need to compare these; they are not created for the same problem.
AVX2 is an ISA extension for Intel's heterogeneous approach.
HSA is a platform infrastructure for easy heterogeneous and data-parallel programming. With HSA Bolt your code won't be longer than serial C code, but it will run almost as fast as much longer OpenCL-C/C++ code.

I am with you...
I don't see AVX2 and HSA killing each other... this is not amd64 vs IA64.

HSA is about putting to use powerful hardware that sits idle 90% of the time, while AVX2 is a good feature for CPUs.

Truth is, HSA is kinda pointless for desktops; even AMD folks say this...

I said "kinda" because if console games do use HSA in the next generation, using an APU will be almost required.
 

aj654987

Member
Feb 11, 2005
117
14
81
I found the article to be more hype than anything else, sort of PH1 and BD type hype. XtremeSystems has a topic on the same article. I love how AMD tries to go back to 2002 as the foundation, when AMD didn't buy ATI till '06. Intel had already chosen their road in 2004 after they bought Elbrus and picked between 3 test projects; what we know as Larrabee was chosen in 2004. That morphed into Knights Ferry and finally into Knights Corner. AMD is just saying "we came up with fusion first" when in fact Intel had already made its choice in 2004. The GPU part of the project failed for the time being. What the future holds is anyone's guess. But AMD's assertion that Intel is going in the wrong direction is just empty words until we hear the fat lady sing.


I agree, and Intel's last several integrated GPUs are underrated. The driver issues were fixed years ago and almost any game will run on them. Not like they will run on a $200 GPU, of course, but competing products like ION and Fusion don't make as big a difference as they used to. Especially with all of them using system RAM rather than dedicated VRAM, it is only going to get so fast.
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
How does a lack of scatter support affect the abilities of auto-vectorizing compilers? Doesn't it pretty much make it so you can't auto-vectorize code that works on non-regular data?
No, a scatter operation can still be implemented with a series of scalar instructions. Since scatter operations are rare, this has a very low impact on performance. Gather is used far more frequently, and thus AVX2's gather support will provide a dramatic improvement.
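Concretely, the scalar fallback described above looks something like this (an editorial sketch; the helper name is ours):

Code:
#include <immintrin.h>

// Emulated 8-wide scatter: spill values and indices to stack arrays,
// then store lane by lane -- 8 scalar stores standing in for the single
// scatter instruction AVX2 lacks.
static inline void scatter8(float* base, __m256i idx, __m256 v)
{
    alignas(32) float val[8];
    alignas(32) int   ind[8];
    _mm256_store_ps(val, v);
    _mm256_store_si256(reinterpret_cast<__m256i*>(ind), idx);
    for (int i = 0; i < 8; ++i)
        base[ind[i]] = val[i];
}

An auto-vectorizer can emit exactly this pattern, which is why missing scatter slows down scatter-heavy loops but doesn't prevent vectorization outright.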
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
There's no need to compare these; they are not created for the same problem.
They are created for exactly the same problem: general-purpose throughput computing.
AVX2 is an ISA extension for Intel's heterogeneous approach.
That's just blatantly wrong. AVX2 is an extension of the x86 ISA. And AVX execution units are part of the CPU cores. Hence it's a fully homogeneous approach.
With HSA Bolt your code won't be longer than serial C code, but it will run almost as fast as much longer OpenCL-C/C++ code.
With AVX2 the serial C/C++ code will be vectorized and run up to eight times faster.
I'm a programmer, so I know how GPGPU programming works.
With all due respect, that doesn't impress me much. Not because I'm a computer engineer myself, but because I've met plenty of programmers who couldn't imagine the benefits of a unified GPU architecture. In hindsight they now realize they were idiots.

Today we face the exact same kind of situation. A few years from now it will be perfectly feasible to make a mainstream homogeneous CPU with throughput nearly as high as a competing heterogeneous one, but with unlimited new possibilities in programmability. But somehow it's still stuck in many people's heads that we should stick with what has worked before, namely a separate CPU + GPU, without realizing the bottlenecks and limitations that brings...

So again, knowing how today's GPGPU programming works doesn't mean a thing. You have to look beyond AVX2 and HSA to see whether homogeneous or heterogeneous computing will survive. It's obvious that the CPU and GPU architectures continue to converge, so one day they will inevitably become one, and it looks like Intel is leading the way with AVX2.
But your pro-Intel glasses blind you, so I won't try to convince you.
I'm sorry but that's the kind of thing people say when they no longer have any counter-arguments. Unfortunately for you I'm completely impartial. For starters, I've used and continue to use a fair share of AMD components in my systems. But more importantly, I hope Intel gets some renewed competition so the innovation continues and the prices drop. So I hope AMD employees read these forums and realize in time that the future isn't heterogeneous but homogeneous.
In the next few years you will see how Intel goes the heterogeneous route with Skylake.
Based on what? Intel engineers have stated that they'll consolidate VEX and MVEX. Most probably the resulting extension from this unification will be dubbed AVX3.
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
I don't see AVX2 and HSA killing each other... this is not amd64 vs IA64.

HSA is about putting to use powerful hardware that sits idle 90% of the time, while AVX2 is a good feature for CPUs.
You're looking at it all wrong. HSA isn't just another GPU feature and AVX2 isn't just another CPU feature, which can eternally live together in harmony. They both want to significantly increase the vector throughput for general-purpose applications. HSA tries to do that the heterogeneous way, while AVX2 does it the homogeneous way by bringing GPU features inside of the CPU cores. Exceptions aside, developers will adopt only one of these. And unfortunately for AMD, AVX2 is straightforward to use by auto-vectorizing compilers and it doesn't have to deal with any of the heterogeneous overhead or the GPU's unpredictable behavior. So from a ROI perspective, it's a no-brainer.
Truth is, HSA is kinda pointless for desktops; even AMD folks say this...
Could you find me a quote for that?
I said "kinda" because if console games do use HSA in the next generation, using an APU will be almost required.
Certainly not. First of all, HSA does not imply an APU (which is a big part of why it will have unpredictable behavior and thus why developers will opt for AVX2 instead). And secondly, CPUs with AVX2 will have plenty of throughput to power those games. So consoles won't save HSA.
 
Last edited: