So I was thinking... it seems to be a concern that it takes 8 AMD BD (Bulldozer) cores to match/beat 4 SNB (Sandy Bridge) cores.
How so? The concern should be whether a single BD core can match up, not whether double the cores can, because doubling won't help enough in the areas where AMD needs to turn a profit. They are only going for 50% on servers, after all.
But my question is: why is this a big deal? In the GPU world, it takes 1536 AMD "cores" to match 480 NVIDIA "cores", but nobody cares, because when we choose our GPUs we care about the bottom-line performance of the part, not the per-core performance.
GPUs universally handle embarrassingly parallel workloads: the kind that scale linearly, or so close that they may as well. For their common work, ILP and TLP are typically limited only by the data size (2 million pixels of output means a minimum of 2 million small tasks). This is not so on the desktop, server, or mobile for most programs, though servers can sometimes come close. GPUs can reach such high core densities in large part because they can process many different data streams with the same instruction at once. That kind of operation tends not to work well for CPUs, even when the workload scales out very well, because CPU threads branch a lot and need to be executing different instructions at any given time, even when running the same binary across several threads; each thread needs to run independently.
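To make the "2 million pixels, 2 million tasks" point concrete, here is a hedged C++ sketch (function and variable names are mine, not from anyone's actual code): every output pixel depends only on its own input, never on a neighbor or an earlier result, which is exactly why this kind of work splits across however many lanes or cores exist.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-pixel operation: each output pixel depends only on
// the matching input pixel, never on neighbors or earlier iterations.
static std::uint8_t brighten(std::uint8_t in) {
    int v = in + 40;
    return v > 255 ? 255 : static_cast<std::uint8_t>(v);
}

// A 1920x1080 frame is ~2 million independent tasks: iteration i never
// reads anything iteration j wrote, so the loop scales to any core count.
std::vector<std::uint8_t> process_frame(const std::vector<std::uint8_t>& in) {
    std::vector<std::uint8_t> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = brighten(in[i]);  // same instruction, different data each time
    return out;
}
```

Contrast that with typical desktop code, where iteration i usually does depend on what came before it.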
GPGPU basically allows free added performance, free in the sense that the processing power is already going to be there, for situations where the CPU would previously do little bits of highly parallel work; but it is not going to work as a general-purpose CPU replacement. Theoretically, speculation could be replaced with predication, but even if the binary size stayed small enough (which should be possible if the hardware is trusted to manage the parallelism, so the compiler only has to describe it and insert some software preloads), the memory bandwidth needed would be far too high, leading to higher costs, even if we designed CPUs, and the software infrastructure for them, around high per-task parallelism.
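As a rough illustration of that speculation-vs-predication tradeoff, here is a hedged C++ sketch (names and the toy arithmetic are mine): the predicated form has no control flow to diverge, which suits GPU-style lockstep execution, but it computes and reads both sides of the condition every time, which is where the extra memory bandwidth comes from.

```cpp
#include <cstddef>

// Branchy version: the CPU speculates on the condition, so only the
// predicted side runs. A mispredict flushes the pipeline, but each
// iteration only needs to touch one of a[i] / b[i].
void select_branchy(const int* a, const int* b, const bool* cond,
                    int* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (cond[i])
            out[i] = a[i] * 2;
        else
            out[i] = b[i] + 1;
    }
}

// Predicated version: no divergence, so it maps well to lockstep
// execution, but BOTH sides are computed and BOTH arrays are read on
// every iteration, roughly doubling the memory traffic.
void select_predicated(const int* a, const int* b, const bool* cond,
                       int* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        int taken     = a[i] * 2;               // always computed
        int not_taken = b[i] + 1;               // always computed
        out[i] = cond[i] ? taken : not_taken;   // select, not a branch
    }
}
```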
So why can't we see the CPU in a similar fashion? Software is, in general, moving toward being multi-threaded, so we should see increasing the number of execution cores the same way we see increasing cache, widening or shortening the pipeline, improving branch predictors, and so on: just one aspect of the final product.
What do you all think?
Once software catches up and allows easy programming across hundreds or thousands of cores, we will. Each loop would be given either an iteration count or a timer, after which it must check back in with a scheduler (trust me, actually making this happen is much harder than it reads, and we're years away from good language and IDE support; see the sketch at the end of this post). In this way, any number of cores, up to however many can theoretically be used, can actually be used, be it 8, 16, or 500, by letting each small task wake a currently sleeping thread, or by running n small tasks per thread when fewer are available. It is going to take time, though, because everyone who was shouting about this at the top of their lungs years ago was considered fringe and ahead of their time. Even when this happens, you won't get double the performance for double the cores; that's Amdahl's law at work (if 90% of the work parallelizes, 8 cores top out at roughly a 4.7x speedup). With the kind of work CPUs generally do, having the extra cores available for short utilization periods will be what matters. Many things you want to get done depend on previous things getting done, so those things must be done in order, one after the other, not at the same time. There's no way around that, easy or hard.
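Here is a minimal sketch of that "check back in with a scheduler" idea, assuming the scheduler is just a shared atomic counter handing out chunks (everything here, names included, is hypothetical): each worker claims n iterations, runs them, then checks back for more, so the same binary adapts whether 8, 16, or 500 cores are present.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Minimal chunked scheduler: a shared atomic counter hands out work in
// chunks of `chunk` iterations. Each worker claims a chunk, runs it,
// then checks back in, so the same code drains the work on any core count.
void parallel_for(std::size_t total, std::size_t chunk, unsigned workers,
                  const std::function<void(std::size_t)>& task) {
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&] {
            for (;;) {
                std::size_t begin = next.fetch_add(chunk); // "check in" with the scheduler
                if (begin >= total) return;                // nothing left to claim
                std::size_t end = begin + chunk < total ? begin + chunk : total;
                for (std::size_t i = begin; i < end; ++i)
                    task(i);                               // one small task per iteration
            }
        });
    }
    for (auto& t : pool) t.join();
}

// Usage: with few cores each worker simply claims more chunks; with many
// cores the chunks spread out. That is the "n small tasks per thread" idea.
// parallel_for(2'000'000, 4096, std::thread::hardware_concurrency(),
//              [](std::size_t i) { /* small independent task i */ });
```

Note what this sketch cannot fix: if task i needs the result of task i-1, no scheduler helps, which is the serial-dependency point above.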