VR-Zone article on Intel Haswell server CPUs - DDR4 and higher TDPs

alyarb · Jun 6, 2012

Edrick said:
Is that die size increase due to a larger iGPU, or is it due to the "rumored" dedicated L4$ for the iGPU?

L4 is on package, not on die. And not all Haswell parts implement it.

BenchPress · Jun 6, 2012

Edrick said:
Is that die size increase due to a larger iGPU, or is it due to the "rumored" dedicated L4$ for the iGPU?

That die is only a tiny bit larger, so it's probably just the CPU cores with AVX2 that are slightly larger, and the iGPU having Direct3D 11.1 support. That makes it GT2, with the same EU count as HD 4000. GT3 is said to be substantially larger and would need on-package eDRAM to feed it sufficient bandwidth.

Edrick · Jun 6, 2012

BenchPress said:
That die is only a tiny bit larger, so it's probably just the CPU cores with AVX2 that are slightly larger, and the iGPU having Direct3D 11.1 support. That makes it GT2, with the same EU count as HD 4000. GT3 is said to be substantially larger and would need on-package eDRAM to feed it sufficient bandwidth.

Thanks. Good to know.

IntelUser2000 · Jun 6, 2012

Haswell GT2 actually has 20EUs, 4 more than HD 4000 on Ivy Bridge. That's probably the die in the pictures.

exar333 · Jun 6, 2012

IntelUser2000 said:
Haswell GT2 actually has 20EUs, 4 more than HD 4000 on Ivy Bridge. That's probably the die in the pictures.

Will the EUs be identical, or will they be a tweaked design?

ShintaiDK · Jun 6, 2012

ExarKun333 said:
Will the EUs be identical, or will they be a tweaked design?

Tweaked, but not completely new design.

IntelUser2000 · Jun 7, 2012

ExarKun333 said:
Will the EUs be identical, or will they be a tweaked design?

The architecture is "Gen 7 based", and Gen 7 is Ivy Bridge. The EUs probably will remain similar but they might rearchitect the circuits around it. BTW that slide is relatively old. Beta Windows 8 drivers show OpenGL 4.0 support for Ivy Bridge, and therefore Haswell.

Arrandale/Clarkdale's GMA HD was named Gen 5.75 but they fixed a big performance bottleneck which allowed greater performance leaps than previous "new" generations brought.

Haswell's 3 iGPU variations are said to be: GT1-6 EUs, GT2-20EUs, GT3-40EUs

Ajay · Jun 7, 2012

IntelUser2000 said:
Haswell's 3 iGPU variations are said to be: GT1-6 EUs, GT2-20EUs, GT3-40EUs

Ah. thanks! I thought I had read Haswell would have up to 40EUs, and was getting a bit confused by the above conversation.

podspi · Jun 7, 2012

IntelUser2000 said:
Haswell's 3 iGPU variations are said to be: GT1-6 EUs, GT2-20EUs, GT3-40EUs

Assuming they can feed it, that 40EU unit should be a very impressive performer. I know we've been talking about this for a while, but at what point does the lower-end discrete market really start going away?

kernelc · Jun 7, 2012

Hi,
with IB/Llano/Trinity the lower end cards (AMD 5450 / 6450 and NVIDIA GT 220 / 210) begin to be really questionable already: http://www.anandtech.com/show/4263/amds-radeon-hd-6450-uvd3-meets-htpc/5

To tell the truth, Trinity is more or less equally capable that a Radeon 5570, so the low end is really already fading away.

Sure, the discrete cards can push its low limit well above what we have now, but this don't change the fact that current integrated GPU begin to be more or less adequate for many, many users.

Regards.

Homeles · Jun 7, 2012

kernelc said:
To tell the truth, Trinity is more or less equally capable that a Radeon 5570, so the low end is really already fading away.

Only because AMD hasn't put their low end on 28nm yet. Is the gap closing? Yeah, but the low end still has life to it.

Denithor · Jun 7, 2012

These more powerful iGPUs are simply going to re-define what the low end space is. Today's mid-level cards will become the new low end and everything else will shift down accordingly.

What I find more interesting is what we will be able to do with these new iGPUs. IE Quick Sync and other similar applications.

Cerb · Jun 7, 2012

Denithor said:
These more powerful iGPUs are simply going to re-define what the low end space is. Today's mid-level cards will become the new low end and everything else will shift down accordingly.

What I find more interesting is what we will be able to do with these new iGPUs. IE Quick Sync and other similar applications.

Quicksync is a regressive feature, IMO.

And, until DDR4, we won't be much more than Trinity does, which kinda sucks. Even going up to 2400 and higher DDR3 speeds, it will remain far too easy to become starved for RAM I/O, and more channels (pins) are expensive.

BenchPress · Jun 7, 2012

Denithor said:
What I find more interesting is what we will be able to do with these new iGPUs. IE Quick Sync and other similar applications.

The only thing you'll be able to do with them is run graphics faster. For everything else, AVX2 is vastly superior.

Desktop Haswell will only feature GT2. And that means the peak floating-point performance of the iGPU would be about 400 GFLOPS. The quad-core CPU with AVX2 can do 500 GFLOPS. But it also doesn't suffer from round-trip heterogeneous processing overhead, has faster caches, has more cache space per thread, has out-of-order execution, has hardware transactional memory, has no API overhead, no language limitations, etc. A CPU achieves more per FLOP than a GPU does, and Haswell even has more processing power on the CPU end than on the GPU end.

It's not even unlikely that the CPU will help out the GPU, instead of the other way around. CPUs have become vastly more powerful in the last few years, to the point where we no longer know what to do with all the cores and vector units. So having them do graphics is starting to make perfect sense. The fact that AVX2 includes LRBni's gather instruction and can be extended to 1024-bit can't be a coincidence...

podspi · Jun 7, 2012

BenchPress said:
The only thing you'll be able to do with them is run graphics faster. For everything else, AVX2 is vastly superior.

Desktop Haswell will only feature GT2. And that means the peak floating-point performance of the iGPU would be about 400 GFLOPS. The quad-core CPU with AVX2 can do 500 GFLOPS. But it also doesn't suffer from round-trip heterogeneous processing overhead, has faster caches, has more cache space per thread, has out-of-order execution, has hardware transactional memory, has no API overhead, etc. A CPU achieves more per FLOP than a GPU does, and Haswell even has more processing power on the CPU end than on the GPU end.

It's not even unlikely that the CPU will help out the GPU, instead of the other way around. CPUs have become vastly more powerful in the last few years, to the point where we no longer know what to do with all the cores and vector units. So having them do graphics is starting to make perfect sense. The fact that AVX2 includes LRBni's gather instruction and can be extended to 1024-bit can't be a coincidence...

Where are those FLOP numbers everybody is throwing around come from? And, how applicable are those peak performance numbers? Peak GPU performance generally isn't applicable to most applications, how useful will AVX2 actually be?

BenchPress · Jun 7, 2012

podspi said:
Where are those FLOP numbers everybody is throwing around come from?

Each core has three AVX2 units (just like Sandy/Ivy Bridge have three AVX units), two of which will be capable of FMA operations, and each of them is 8 x 32 = 256-bit wide. So that's 4 cores x 2 units x 8 elements x 2 floating-point operations x ~3.9 GHz = 500 GFLOPS.

The reason two units will be capable of FMA can be deduced from the fact that any other configuration would compromise legacy performance or code with a modest amount of FMA instructions would congest the execution port for other operations.

And, how applicable are those peak performance numbers? Peak GPU performance generally isn't applicable to most applications, how useful will AVX2 actually be?

AVX2 will be far more applicable because any code loop with independent iterations can be parallelised in an SPMD fashion. It really combines the best of both worlds into one, because you no longer have to move work over to the GPU, work within its limitations, and then move stuff back. It avoids the latency and bandwith bottleneck of heterogeneous processing by being able to switch between sequential and vector code from one cycle to the next, while leaving all data local in the cashes. And you don't lose features like deep recursive calls or function pointers.

bronxzv · Jun 7, 2012

BenchPress said:
The reason two units will be capable of FMA can be deduced from the fact that any other configuration would compromise legacy performance or code with a modest amount of FMA instructions would congest the execution port for other operations.

note that 2 FMA units per core is now certain based on the IDF Spring disclosures, see the AVX2 paper BJ12_ARCS002_102_ENGf.pdf downloadable from intel.com/go/idfsessionsBJ

kernelc · Jun 8, 2012

Homeles said:
Only because AMD hasn't put their low end on 28nm yet. Is the gap closing? Yeah, but the low end still has life to it.

Yeah, this is true.

However, the lower end chips generally remain with very few units anyway. Think to the Radeon 2400XT-5450: four different chip generation and three process nodes after, they come from 40 sp / 4 TMU to 80 sp / 8 TMU only (and ROPs remained the same).

It's not a case that the new AMD low-end, the 6450, doubled on SP numbers: their lower end brothers were performance outclassed by Sandy Bridge also (albeit with very low quality).

I think that the true lower-end cards have few days remaining; however, as Denithor said, some chips from slighly higher performance class will replace them in the low-end space, until this will become non-economical (in the following 5+ years?)

Regards.

kernelc · Jun 8, 2012

Cerb said:
Quicksync is a regressive feature, IMO.

I can see your point (fixed hardware in the days where anything is programmable) but I really think that Quicksync-like features have their reasons.

Fixed hardware can be tenfold more efficient that programmable one, obviously with the correct workload. And while our Teraflop, programmable cards do only a little bit better then pure-CPU encoding, the very small Quicksync block do wonders.

The fun thing is that this design principle is the exact opposite of Larrabee (which tried to to texture mapping also via CPU cores, and found it to be 20 times less efficient that a real TMU array).

And, until DDR4, we won't be much more than Trinity does, which kinda sucks. Even going up to 2400 and higher DDR3 speeds, it will remain far too easy to become starved for RAM I/O, and more channels (pins) are expensive.

Yes, this is true: we need some radical changes (eg: true, full tile-based rendering, Intel eDram L4, ecc.) to get way higher performance without waiting for DDR4.

Anyway, I think Trinity-class GPUs are quite adequate for many users, the casual gamers ones also.

Regards.

kernelc · Jun 8, 2012

bronxzv said:
note that 2 FMA units per core is now certain based on the IDF Spring disclosures, see the AVX2 paper BJ12_ARCS002_102_ENGf.pdf downloadable from intel.com/go/idfsessionsBJ

Great paper, thank you.

As a note: in this document, Intel claimed non-destructive operations as a new "key feature" of AVX instructions. Maybe non-destructive semantic can be useful in some case (apart the simplified assembly code generation)?

Thanks.

Cerb · Jun 8, 2012

kernelc said:
I can see your point (fixed hardware in the days where anything is programmable) but I really think that Quicksync-like features have their reasons.

I don't doubt they have their reasons. For anything WORM (such as video), however, efficiency on the writing side can take a back seat, IMO. What I'd like to see would be a compromise, exposing a highly-programmable DSP, so that developers could use bits and pieces as desired, managing the performance v. quality trade-off.

Yes, this is true: we need some radical changes (eg: true, full tile-based rendering, Intel eDram L4, ecc.) to get way higher performance without waiting for DDR4.

If it happens, and it's plain L4, I would honestly wonder. Would a large cache like that really improve performance that much? A large local memory, and write cache, for AVX2 and the GPU, absolutely.

With caches as big as they are today, it's only in the lowest levels that kicking out data you'll need again soon is a real problem (and not much of one, with fast enough L2 and good load/store re-ordering). Most of the time, the problem is the set of addresses that could not be predicted, which cache size increases don't help much with. A big eDRAM cache might be smaller, but would it really help desktop/mobile performance all that much, used as an additional cache level?

ShintaiDK · Jun 8, 2012

I dont see eDRAM L4 cache or whatever to solve any bandwidth related issues in terms of the iGPU. It might be good for basicly free AA modes and such. But for the regular gaming the working set with textures is too big and need the real DDR4 bandwidth. Unless they go nuts and add a 256MB+ eDRAM or something.

On the other hand, in the future I could easily imagine ondie/onpackage iGPU memory and the sharing of main memory gone.

bronxzv · Jun 8, 2012

kernelc said:
As a note: in this document, Intel claimed non-destructive operations as a new "key feature" of AVX instructions.

these are basically the same slides that they update at each IDF so it's probably simply some old text that remains, it's still true (besides FMA) but not so new

you can see at slide 23 what I was refering to when we were discussing FMA3 vs FMA4 the other day, FMA3 is very flexible with the 132/213/231 variants but it introduces 60 new mnemonics, alright for a compiler not so for a human coder, now I suppose nobody will use assembly for FMA code anyway so it's anecdotal at best

Cerb · Jun 8, 2012

ShintaiDK said:
I dont see eDRAM L4 cache or whatever to solve any bandwidth related issues in terms of the iGPU. It might be good for basicly free AA modes and such. But for the regular gaming the working set with textures is too big and need the real DDR4 bandwidth. Unless they go nuts and add a 256MB+ eDRAM or something.

You could save memory I/O for working with buffers that can fit inside it, and deterministically prefetch for GPU and AVX2 work, without polluting the rest of the CPU's caches. Going to and from buffers in memory is rather wasteful, and both bandwidth and power are scarce. Other than that, and as a dedicated cache for streaming (not directly part of the L1-L3 hierarchy), I just don't see such a thing being that much help, outside of servers, and Intel has basically the best SRAM out there.

kernelc · Jun 8, 2012

Cerb said:
I don't doubt they have their reasons. For anything WORM (such as video), however, efficiency on the writing side can take a back seat, IMO. What I'd like to see would be a compromise, exposing a highly-programmable DSP, so that developers could use bits and pieces as desired, managing the performance v. quality trade-off.

While it will be exciting, I think that we will never see anything similar on a CPU: an high programmable DSP will be similar to a CPU with large amount of dedicated fixed logic, eating into programmable one (cores) and interconnects. Fixed function logic is great, but only with a relatively small die estate (eg: TMUs).

If it happens, and it's plain L4, I would honestly wonder. Would a large cache like that really improve performance that much? A large local memory, and write cache, for AVX2 and the GPU, absolutely.

With caches as big as they are today, it's only in the lowest levels that kicking out data you'll need again soon is a real problem (and not much of one, with fast enough L2 and good load/store re-ordering). Most of the time, the problem is the set of addresses that could not be predicted, which cache size increases don't help much with. A big eDRAM cache might be smaller, but would it really help desktop/mobile performance all that much, used as an additional cache level?

A large L4 cache or, better still, a embedded directly addressable memory, will improve ROPs performance by a very noticeable amount, and ROPs are always short of bandwidth. For example, Trinity GPU has 8 ROPs: at 600 MHz, they are capable of 4.8 GP/s, with a required maximum bandwidth (w/o antialiasing) of 57.6 - 76.8 GB/s (1x z-read+write, 1x color read+write, for 32bit x 4 = 16 byte per pixel). Sure, color/Z compression will help a lot, but its clear that more bandwidth you have, the better is. When L4 framebuffer is transferred to main memory, bandwidth requirements are lower: you don't need to read/compare Z or color samples, but only to flush them to memory. So, required bandwidth is in the range of 38.4 GB/s.

It is difficult to say how much this will improve performance... however, I think we are in the range of ~30% for non-AA frames and >60% for AA-enabled ones.

On the other side, TMUs are fine with very low caches: a relatively small 8 KB texture cache (as on RV870) can map textels on many pixels.

Thanks.

VR-Zone article on Intel Haswell server CPUs - DDR4 and higher TDPs

Platinum Member

Senior member

Golden Member

Elite Member

Diamond Member

Lifer

Elite Member

Lifer

Golden Member

Member

Platinum Member

Diamond Member

Elite Member

Senior member

Golden Member

Senior member

Senior member

Member

Member

Member

Elite Member

Lifer

Senior member

Elite Member

Member