Is that die size increase due to a larger iGPU, or is it due to the "rumored" dedicated L4$ for the iGPU?
L4 is on package, not on die. And not all Haswell parts implement it.
That die is only a tiny bit larger, so it's probably just the CPU cores with AVX2 that are slightly larger, and the iGPU having Direct3D 11.1 support. That makes it GT2, with the same EU count as HD 4000. GT3 is said to be substantially larger and would need on-package eDRAM to feed it sufficient bandwidth.
Haswell GT2 actually has 20 EUs, 4 more than HD 4000 on Ivy Bridge. That's probably the die in the pictures.
Will the EUs be identical, or will they be a tweaked design?
Haswell's 3 iGPU variations are said to be: GT1 with 6 EUs, GT2 with 20 EUs, and GT3 with 40 EUs.
To tell the truth, Trinity is more or less as capable as a Radeon 5570, so the low end is really already fading away.
These more powerful iGPUs are simply going to re-define what the low end space is. Today's mid-level cards will become the new low end and everything else will shift down accordingly.
What I find more interesting is what we will be able to do with these new iGPUs. IE Quick Sync and other similar applications.
The only thing you'll be able to do with them is run graphics faster. For everything else, AVX2 is vastly superior.
Desktop Haswell will only feature GT2. And that means the peak floating-point performance of the iGPU would be about 400 GFLOPS. The quad-core CPU with AVX2 can do 500 GFLOPS. But it also doesn't suffer from round-trip heterogeneous processing overhead, has faster caches, has more cache space per thread, has out-of-order execution, has hardware transactional memory, has no API overhead, etc. A CPU achieves more per FLOP than a GPU does, and Haswell even has more processing power on the CPU end than on the GPU end.
It's not even unlikely that the CPU will help out the GPU, instead of the other way around. CPUs have become vastly more powerful in the last few years, to the point where we no longer know what to do with all the cores and vector units. So having them do graphics is starting to make perfect sense. The fact that AVX2 includes LRBni's gather instruction and can be extended to 1024-bit can't be a coincidence...
Where do those FLOP numbers everybody is throwing around come from?
Each core has three AVX2 units (just like Sandy/Ivy Bridge have three AVX units), two of which will be capable of FMA operations, and each of them is 8 x 32 = 256-bit wide. So that's 4 cores x 2 units x 8 elements x 2 floating-point operations x ~3.9 GHz = 500 GFLOPS.
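For what it's worth, here's that arithmetic spelled out as a tiny C program. The core count, unit count, lane count and the ~3.9 GHz clock are the assumptions from the post above, not confirmed specs.

#include <stdio.h>

/* Back-of-envelope peak FLOPS for a quad-core Haswell with AVX2,
 * using the assumptions stated above (not confirmed specs). */
int main(void) {
    double cores         = 4.0;  /* quad-core desktop part          */
    double fma_units     = 2.0;  /* two 256-bit FMA ports per core  */
    double lanes         = 8.0;  /* 256 bits / 32-bit floats        */
    double flops_per_fma = 2.0;  /* multiply + add                  */
    double clock_ghz     = 3.9;  /* assumed near-peak clock         */

    double gflops = cores * fma_units * lanes * flops_per_fma * clock_ghz;
    printf("Peak single-precision throughput: ~%.0f GFLOPS\n", gflops);
    return 0;
}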
And how applicable are those peak performance numbers? Peak GPU performance generally isn't applicable to most applications, so how useful will AVX2 actually be?
AVX2 will be far more applicable because any code loop with independent iterations can be parallelised in an SPMD fashion. It really combines the best of both worlds into one, because you no longer have to move work over to the GPU, work within its limitations, and then move stuff back. It avoids the latency and bandwidth bottleneck of heterogeneous processing by being able to switch between sequential and vector code from one cycle to the next, while leaving all data local in the caches. And you don't lose features like deep recursive calls or function pointers.
The reason two units will be capable of FMA can be deduced from the fact that any other configuration would either compromise legacy performance, or let code with even a modest amount of FMA instructions congest the execution port needed for other operations.
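To make the SPMD-on-AVX2 point concrete, here's a minimal sketch of the kind of loop this enables, using the gather and FMA instructions mentioned above. The function and array names are made up for illustration, and it assumes a compiler with AVX2/FMA support and n being a multiple of 8.

#include <immintrin.h>

/* Hypothetical example: out[i] = a[i] * table[idx[i]] + c[i].
 * Every iteration is independent, so 8 of them run per 256-bit vector:
 * the AVX2 gather feeds the indexed loads, FMA does the math. */
void fma_gather_loop(const float *a, const float *table, const int *idx,
                     const float *c, float *out, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256  va  = _mm256_loadu_ps(a + i);
        __m256i vix = _mm256_loadu_si256((const __m256i *)(idx + i));
        __m256  vt  = _mm256_i32gather_ps(table, vix, 4); /* AVX2 gather  */
        __m256  vc  = _mm256_loadu_ps(c + i);
        __m256  vr  = _mm256_fmadd_ps(va, vt, vc);        /* FMA: a*t + c */
        _mm256_storeu_ps(out + i, vr);
    }
}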
Only because AMD hasn't put their low end on 28nm yet. Is the gap closing? Yeah, but the low end still has life to it.
Quicksync is a regressive feature, IMO.
And, until DDR4, we won't do much more than Trinity does, which kinda sucks. Even going up to DDR3-2400 and higher speeds, it will remain far too easy to become starved for RAM I/O, and more channels (pins) are expensive.
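As a rough illustration of why RAM I/O is the wall: the back-of-envelope bandwidth math for a dual-channel DDR3-2400 setup looks like this (the channel count and 64-bit channel width are assumptions about a typical desktop platform, not a specific part).

#include <stdio.h>

/* Rough peak-bandwidth math for an assumed dual-channel DDR3-2400 setup. */
int main(void) {
    double transfers_per_sec = 2400e6; /* DDR3-2400: 2400 MT/s          */
    double bytes_per_xfer    = 8.0;    /* 64-bit channel = 8 bytes      */
    double channels          = 2.0;    /* typical desktop dual channel  */

    double gbps = transfers_per_sec * bytes_per_xfer * channels / 1e9;
    printf("Peak DRAM bandwidth: ~%.1f GB/s, shared by CPU cores and iGPU\n",
           gbps);
    return 0;
}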
Note that 2 FMA units per core is now certain based on the IDF Spring disclosures; see the AVX2 paper BJ12_ARCS002_102_ENGf.pdf, downloadable from intel.com/go/idfsessionsBJ.
I can see your point (fixed hardware in the days where anything is programmable) but I really think that Quicksync-like features have their reasons.
Yes, this is true: we need some radical changes (e.g. true, full tile-based rendering, Intel eDRAM L4, etc.) to get way higher performance without waiting for DDR4.
As a note: in this document, Intel cites non-destructive operations as a new "key feature" of AVX instructions.
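For anyone unfamiliar with the term: legacy SSE instructions have two operands, so the destination is also one of the sources and gets overwritten, while the VEX encoding used by AVX adds a separate destination operand. A trivial illustration (the exact registers the compiler picks will vary; the point is the encodings):

#include <immintrin.h>

/* sse_add typically compiles to "addps xmm0, xmm1": two operands,
 * dst = dst + src, so one input is destroyed.
 * avx_add compiles to a VEX-encoded "vaddps": three operands,
 * dst = src1 + src2, so both inputs survive. */
__m128 sse_add(__m128 a, __m128 b) { return _mm_add_ps(a, b); }
__m256 avx_add(__m256 a, __m256 b) { return _mm256_add_ps(a, b); }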
I don't see eDRAM L4 cache or whatever solving any bandwidth related issues in terms of the iGPU. It might be good for basically free AA modes and such. But for regular gaming the working set with textures is too big and needs real DDR4 bandwidth. Unless they go nuts and add a 256MB+ eDRAM or something.
You could save memory I/O for working with buffers that can fit inside it, and deterministically prefetch for GPU and AVX2 work, without polluting the rest of the CPU's caches. Going to and from buffers in memory is rather wasteful, and both bandwidth and power are scarce. Other than that, and as a dedicated cache for streaming (not directly part of the L1-L3 hierarchy), I just don't see such a thing being that much help outside of servers, and Intel has basically the best SRAM out there.
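A quick back-of-envelope on why render targets might fit in such a cache while the texture working set doesn't; the resolution, pixel format, buffer count and per-frame texture figure are purely illustrative assumptions.

#include <stdio.h>

/* Illustrative working-set sizes at an assumed 1920x1080, 4 bytes/pixel. */
int main(void) {
    double mb           = 1024.0 * 1024.0;
    double target_bytes = 1920.0 * 1080.0 * 4.0;   /* one RGBA8 buffer                         */
    double render_set   = 4.0 * target_bytes / mb; /* colour, depth, a couple of intermediates */
    double texture_set  = 512.0;                   /* assumed MB of textures touched per frame */

    printf("Render targets: ~%.0f MB (could live in a large eDRAM cache)\n",
           render_set);
    printf("Texture working set: ~%.0f MB (still needs DRAM bandwidth)\n",
           texture_set);
    return 0;
}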
I don't doubt they have their reasons. For anything WORM (such as video), however, efficiency on the writing side can take a back seat, IMO. What I'd like to see would be a compromise, exposing a highly-programmable DSP, so that developers could use bits and pieces as desired, managing the performance v. quality trade-off.
If it happens, and it's plain L4, I would honestly wonder. Would a large cache like that really improve performance that much? A large local memory, and write cache, for AVX2 and the GPU, absolutely.
With caches as big as they are today, it's only in the lowest levels that kicking out data you'll need again soon is a real problem (and not much of one, with fast enough L2 and good load/store re-ordering). Most of the time, the problem is the set of addresses that could not be predicted, which cache size increases don't help much with. A big eDRAM cache might be smaller, but would it really help desktop/mobile performance all that much, used as an additional cache level?