VR-Zone article on Intel Haswell server CPUs - DDR4 and higher TDPs

kernelc · Jun 8, 2012

ShintaiDK said:
I dont see eDRAM L4 cache or whatever to solve any bandwidth related issues in terms of the iGPU. It might be good for basicly free AA modes and such. But for the regular gaming the working set with textures is too big and need the real DDR4 bandwidth. Unless they go nuts and add a 256MB+ eDRAM or something.

On the other hand, in the future I could easily imagine ondie/onpackage iGPU memory and the sharing of main memory gone.

As stated above, textures don't require huge caches: when a group of pixels is being rendered, the required texture space is often in the range of only some kilobyte.

As an example, see the R600 vs RV770 case: the first used a relatively large, shared texture cache, while the latter use a number of small, private 16 KB caches, yet it performs very well. RV870 use 8 KB of texture caches and still performs quite well.

In other word: a large L4 would be, for the most part, not useful for texturing, as the required texture space is very small. However, it will be useful to cache framebuffer operations.

Regards.

kernelc · Jun 8, 2012

One more thing: I don't think that AMD will use an embedded cache/memory for graphics in the near/medium terms. AMD (as Nvidia) has an immense knowledge of graphics workload, so I think that they can improve performance without a such expensive (from a die space standpoint) strategy.

Intel, on the other side, has way lower graphics know-how and need an immediate mode to improve their performance: hence we see a 40 EUs array (while EUs scaling don't seems so good) and a large L4/embedded memory die.

Just my two euro-cents

Cerb · Jun 8, 2012

kernelc said:
A large L4 cache or, better still, a embedded directly addressable memory, will improve ROPs performance by a very noticeable amount, and ROPs are always short of bandwidth.

If used for that sort of purpose, rather than a large LLC, I think it would be great. As a true level 4 cache, I just have a hard time seeing large benefits, outside of servers, unless there's novel work we aren't yet aware of.

CPUarchitect · Jun 8, 2012

kernelc said:
As a note: in this document, Intel claimed non-destructive operations as a new "key feature" of AVX instructions. Maybe non-destructive semantic can be useful in some case (apart the simplified assembly code generation)?

You mean FMA4 instead of FMA3? The problem is that it would be the only single-uop instruction with four operands. So it wastes space in the uop cache.

As noted before though, Intel implemented three permutations of the operands, largely eliminating the chances of destructing a value you need later. Call it FMA3.5 if you want. Combined with the ability to copy registers at the renaming stage, which was added by Ivy Bridge, there's really no reason to want a wasteful FMA4 implementation instead.

CPUarchitect · Jun 8, 2012

Cerb said:
I don't doubt they have their reasons. For anything WORM (such as video), however, efficiency on the writing side can take a back seat, IMO. What I'd like to see would be a compromise, exposing a highly-programmable DSP, so that developers could use bits and pieces as desired, managing the performance v. quality trade-off.

What's wrong with AVX2? In Haswell it offers three 256-bit integer vector units per core, plus gather support. The latter can do what we previously needed 18 instructions for in the general case. A lot of video processing essentially consists of gather operations.

And if you're concerned about power consumption, that will likely be fixed with AVX-1024.

kernelc · Jun 8, 2012

CPUarchitect said:
You mean FMA4 instead of FMA3? The problem is that it would be the only single-uop instruction with four operands. So it wastes space in the uop cache.

As noted before though, Intel implemented three permutations of the operands, largely eliminating the chances of destructing a value you need later. Call it FMA3.5 if you want. Combined with the ability to copy registers at the renaming stage, which was added by Ivy Bridge, there's really no reason to want a wasteful FMA4 implementation instead.

Yes, bronxzv already posted me the realworldtech thread were this was discussed

What I means is that, in the linked PDF, Intel itself consider non-destructive instructions a "key feature", so I wonder if it is really useful for some.

Thank you for your clarifications.

kernelc · Jun 8, 2012

CPUarchitect said:
What's wrong with AVX2? In Haswell it offers three 256-bit integer vector units per core, plus gather support. The latter can do what we previously needed 18 instructions for in the general case. A lot of video processing essentially consists of gather operations.

And if you're concerned about power consumption, that will likely be fixed with AVX-1024.

I agree: gather capability is a very exciting one, I must say.

Cerb · Jun 8, 2012

CPUarchitect said:
What's wrong with AVX2?

It's not an exclusive argument. AVX2 can compliment Quick Sync, performing work on the input, if nothing else, and would be a great fit to complement a DSP with more modular functionality, as well.

CPUarchitect · Jun 8, 2012

kernelc said:
One more thing: I don't think that AMD will use an embedded cache/memory for graphics in the near/medium terms. AMD (as Nvidia) has an immense knowledge of graphics workload, so I think that they can improve performance without a such expensive (from a die space standpoint) strategy.

Exactly how do you expect them to work around the glaring bandwidth issue of APUs then? They're already aggressively pushing for higher frequency RAM, and adding more pins likely increases cost considerably. Intel must have also evaluated those options and concluded that eDRAM was cheaper and more power efficient.

Intel, on the other side, has way lower graphics know-how and need an immediate mode to improve their performance: hence we see a 40 EUs array (while EUs scaling don't seems so good) and a large L4/embedded memory die.

What makes you think EU scaling itself isn't good and it's not just a bandwidth or latency issue? Keep in mind that graphics workloads are quite "bursty". At one time you could be dealing with a very short shader that saturates the ROPs, then a shader with lots of texture operations is used, and then an arithmetic heavy shader is used, etc. And latency is relevant because waiting for a RAM access requires lots of register space to keep processing more threads, and sometimes tasks simply aren't infinitely parallel.

AMD very much relies on high arithmetic throughput, but it's quickly becoming a problem of diminishing returns due to bandwidth and latency limits. Only eDRAM can offer a cost effective solution.

bronxzv · Jun 8, 2012

kernelc said:
I agree: gather capability is a very exciting one, I must say.

in related news I have started a thread at RWT concerning the recent MIC ISA disclosure:

http://www.realworldtech.com/forums/index.cfm?action=detail&id=129578&threadid=129578&roomid=2

we can expect some more stuff to be migrated to AVX in the future like wider vectors, scatter instructions, predication masks (to save power)

CPUarchitect · Jun 8, 2012

kernelc said:
What I means is that, in the linked PDF, Intel itself consider non-destructive instructions a "key feature", so I wonder if it is really useful for some.

Non-destructive instructions are very useful for operations with two input operands and one output operand, which is the majority of operations. Before AVX's VEX encoding we only had two operand instructions, meaning it frequently destructed values needed later on. Ivy Bridge's register copying at the renaming stage only helps lower execution port contention. It doesn't eliminate the register copying instructions themselves. VEX eliminates them entirely and thus reduces decoder pressure, frees up uop cache slots for more useful instructions, etc.

It's not really necessary for FMA since it has three input operands which allows creating useful permutations of the instructions to avoid destructing an operand that is needed later. Hence Intel's FMA implementation could be considered non-destructive for most cases you'd care about.

kernelc · Jun 8, 2012

CPUarchitect said:
Exactly how do you expect them to work around the glaring bandwidth issue of APUs then? They're already aggressively pushing for higher frequency RAM, and adding more pins likely increases cost considerably. Intel must have also evaluated those options and concluded that eDRAM was cheaper and more power efficient.

If bandwidth will become an insurmountable problem, I think AMD can push a more tile-based approach to somewhat work around bandwidth limitation.

Intel use a partial zone-rendering approach in their chipset/GMA, but true tile-rendering is very hard to do without artifact.

So, due to its greater graphic know-how, I just imagine AMD more comfortable to work with this hypotetical solution, nothing more.

What makes you think EU scaling itself isn't good and it's not just a bandwidth or latency issue? Keep in mind that graphics workloads are quite "bursty". At one time you could be dealing with a very short shader that saturates the ROPs, then a shader with lots of texture operations is used, and then an arithmetic heavy shader is used, etc. And latency is relevant because waiting for a RAM access requires lots of register space to keep processing more threads, and sometimes tasks simply aren't infinitely parallel.

Oh yes, the scaling is surely limited by bandwidth (see 3DMark pixel fill test).
What I want to say is that, while current IB does not scale so well going from 6 to 16 EUs, future Hashwell, thanks to its massive L4/eDRAM, will use up to 40 EUs.

I was not clear, sorry.

AMD very much relies on high arithmetic throughput, but it's quickly becoming a problem of diminishing returns due to bandwidth and latency limits. Only eDRAM can offer a cost effective solution.

I am skeptical regard AMD use of eDRAM: they have ZRAM and MSPACE RAM patents, yet they never used them. I better expect them to better suite their graphic arch for bandwidth constrained scenario, but I can go wrong

Thanks.

kernelc · Jun 8, 2012

CPUarchitect said:
Non-destructive instructions are very useful for operations with two input operands and one output operand, which is the majority of operations. Before AVX's VEX encoding we only had two operand instructions, meaning it frequently destructed values needed later on. Ivy Bridge's register copying at the renaming stage only helps lower execution port contention. It doesn't eliminate the register copying instructions themselves. VEX eliminates them entirely and thus reduces decoder pressure, frees up uop cache slots for more useful instructions, etc.

It's not really necessary for FMA since it has three input operands which allows creating useful permutations of the instructions to avoid destructing an operand that is needed later. Hence Intel's FMA implementation could be considered non-destructive for most cases you'd care about.

Thank you

CPUarchitect · Jun 8, 2012

Cerb said:
It's not an exclusive argument. AVX2 can compliment Quick Sync, performing work on the input, if nothing else, and would be a great fit to complement a DSP with more modular functionality, as well.

I agree AVX2 would complement (fixed-function) Quick Sync functionality (or the other way around), but what I meant is, why would you expect a (programmable) DSP to offer anything over AVX2?

Keep in mind that while AVX2 is a huge milestone, it's not the end of the road. I'd rather have them extend AVX instead of adding a less generic DSP which introduces heterogeneous bottlenecks and complicates development. Personally I believe that extending the BMI instructions to 256-bit vector operations would be incredibly useful for a wide range of applications.

And AVX-1024 would solve the power consumption issue. By executing one 1024-bit uop over four clock cycles the instruction rate would be dramatically reduced while keeping the throughput the same. This allows clock gating the decoders and reduces switching activity in other parts of the pipeline as well.

Hence I'm really not convinced that any other programmable unit would offer a benefit. Ease of programmability is a critically important feature often neglected by hardware designers. Homogeneous programmable computing can be highly efficient both in theory and practice.

bronxzv · Jun 8, 2012

clumsy comment of mine removed

bronxzv · Jun 8, 2012

CPUarchitect said:
Personally I believe that extending the BMI instructions to 256-bit vector operations would be incredibly useful for a wide range of applications.

indeed, btw weren't you the one who claimed that "all scalar instruction have a vector equivalent" in AVX2, actually there is more scalar instructions without any vector equivalent now

Cerb · Jun 8, 2012

CPUarchitect said:
I agree AVX2 would complement (fixed-function) Quick Sync functionality (or the other way around), but what I meant is, why would you expect a (programmable) DSP to offer anything over AVX2?

For the same reason AVX2 may offer advantages over plain old scalar, and prior vector. A single instruction to produce some transformation on a [packed] chunk of data, v. dozens of instructions that make up a vector loop v. hundreds of instructions to make a scalar loop.

CPUarchitect · Jun 8, 2012

kernelc said:
If bandwidth will become an insurmountable problem, I think AMD can push a more tile-based approach to somewhat work around bandwidth limitation.

Intel use a partial zone-rendering approach in their chipset/GMA, but true tile-rendering is very hard to do without artifact.

So, due to its greater graphic know-how, I just imagine AMD more comfortable to work with this hypotetical solution, nothing more.

No, tile-based approaches really are not a solution. It's only an option when the polygon count is low, like with yesterday's mobile devices. Direct3D 11 requires tesselation support, which creates many tiny triangles. Even if it was feasible, using another rendering approach would also mean that AMD would require additional design teams and driver teams, which is a very big long-term investment.

The only Intel chips which used tile-based rendering are the PowerVR based ones, which are limited to Direct3D 9 and for embedded systems only.

I am skeptical regard AMD use of eDRAM: they have ZRAM and MSPACE RAM patents, yet they never used them. I better expect them to better suite their graphic arch for bandwidth constrained scenario, but I can go wrong

eDRAM and 1T SRAM are different technologies. Using or not using one should have no effect on the other. In fact AMD has used eDRAM before: it's in every single Xbox 360!

1T SRAM technologies are all still in the research phase. It can easily take a decade from patent to product. But that doesn't have to stop AMD from using eDRAM in its APU products in shorter term. I'm just afraid that they've decided to keep parity between their discrete products and focus on increasing bandwidth the brute force way. Too bad for them Intel is ahead in DDR4 technology as well. Although AMD's own brand of RAM could be a sign of them having more in the labs...

Cerb · Jun 8, 2012

CPUarchitect said:
No, tile-based approaches really are not a solution. It's only an option when the polygon count is low, like with yesterday's mobile devices. Direct3D 11 requires tesselation support, which creates many tiny triangles. Even if it was feasible, using another rendering approach would also mean that AMD would require additional design teams and driver teams, which is a very big long-term investment.

Not only that, but ATI/AMD and NV invested into R&D to make up for their failures against the old Kyro chips (remember Serious Sam: TSE? Kyros whooped Geforces and Radeons left and right due to no overdraw in that game, back in the day), specifically to get around using TBDR, while still gaining most of the memory bandwidth savings.

CPUarchitect · Jun 8, 2012

bronxzv said:
indeed, btw weren't you the one who claimed that "all scalar instruction have a vector equivalent" in AVX2, actually there is more scalar instructions without any vector equivalent now

I believe I said every relevant scalar instruction has a vector equivalent in AVX2 (explicitly or implicitly). In any case in that context I was talking about vectorizing loops in common programming languages. BMI instructions are new additions and only exposed through assembly or intrinsics, or some of them correspond to combinations of more elementary operations. In any case, AVX2's addition of gather and vector-vector shift gather undeniably represent a leap in vector ISA completeness. I can't think of any loop with independent iterations, written in a high-level programming language, that can't be vectorized. The only exception would be the lack of a scatter instruction, but in pretty much every case that would make the loop iterations dependent so it doesn't apply anyway.

Anyway, BMI instructions are definitely an interesting class of new scalar instructions worth promoting to vector instructions. But I guess it makes sense for Intel to wait until AVX2 has sufficient uptake, and first introduce them to developers as scalar instructions.

CPUarchitect · Jun 8, 2012

Cerb said:
For the same reason AVX2 may offer advantages over plain old scalar, and prior vector. A single instruction to produce some transformation on a [packed] chunk of data, v. dozens of instructions that make up a vector loop v. hundreds of instructions to make a scalar loop.

Can you give me an example of such a transformation that can't be done efficiently either as an extension of AVX2 (including BMI and AVX-1024 for starters) or as a piece of fixed-function Quick Sync logic?

Also keep the cost/gain ratio in mind... I'm doubtful a DSP makes sense over the alternatives, but I'd love to be proven wrong.

CPUarchitect · Jun 8, 2012

bronxzv said:
in related news I have started a thread at RWT concerning the recent MIC ISA disclosure:

http://www.realworldtech.com/forums/index.cfm?action=detail&id=129578&threadid=129578&roomid=2

we can expect some more stuff to be migrated to AVX in the future like wider vectors, scatter instructions, predication masks (to save power)

Thanks for the link, this is extremely interesting!

Is it just me or has the ISA been converged toward AVX2?

BenchPress · Jun 8, 2012

CPUarchitect said:
Is it just me or has the ISA been converged toward AVX2?

It's borrowing VEX encoding from AVX1/2, and the registers have been renamed from v0/31 to zmm0/31 (similar to ymm0/15 for AVX). So yeah they clearly intend on emphasizing the similarity between both technologies, and might converge them closer in the future.

bronxzv · Jun 8, 2012

CPUarchitect said:
Thanks for the link, this is extremely interesting!

Is it just me or has the ISA been converged toward AVX2?

it looks indeed like there is a few new instructions like VBLENDMPS which may due to a cross fertilization with AVX, also a lot of the multi-instructions "utilities" from the Abrash paper are gone so there isn't much missing instructions in AVX2 vs MIC after all, most notably scatter

as I write at RWT it's pretty difficult to see a difference with the previous LRBni disclosure, it will be a very boring task to compare the new detailed specs with the old not so detailed one, I'm far more interested by the future of AVXn anyway since it's such a mainstream target

bronxzv · Jun 8, 2012

BenchPress said:
It's borrowing VEX encoding from AVX1/2, and the registers have been renamed from v0/31 to zmm0/31 (similar to ymm0/15 for AVX). So yeah they clearly intend on emphasizing the similarity between both technologies, and might converge them closer in the future.

VEX is only for some instructions (mostly the Kxxx mask instructions), 512-bit vector instructions are with the MVEX prefix (1st byte == 0x62)

VR-Zone article on Intel Haswell server CPUs - DDR4 and higher TDPs

Member

Member

Elite Member

Senior member

Senior member

Member

Member

Elite Member

Senior member

Senior member

Senior member

Member

Member

Senior member

Senior member

Senior member

Elite Member

Senior member

Elite Member

Senior member

Senior member

Senior member

Senior member

Senior member

Senior member