VR-Zone article on Intel Haswell server CPUs - DDR4 and higher TDPs


Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Also keep the cost/gain ratio in mind... I'm doubtful a DSP makes sense over the alternatives, but I'd love to be proven wrong.
I don't think it does, either. I just hate that truly fixed function is the other option.
 

kernelc

Member
Aug 4, 2011
77
0
66
www.ilsistemista.net
No, tile-based approaches really are not a solution. It's only an option when the polygon count is low, like with yesterday's mobile devices. Direct3D 11 requires tesselation support, which creates many tiny triangles. Even if it was feasible, using another rendering approach would also mean that AMD would require additional design teams and driver teams, which is a very big long-term investment.
Triangle list generation can be accelerated by special hardware. I think this was the case with some Naomi arcade boards, but I could be wrong.

Anyway, with no new DRAM standard in the near term (DDR4 is still a long way off) and no eDRAM, I think that, if AMD wants to significantly increase its graphics performance, a less bandwidth-demanding rendering approach should be considered. After all, today's accelerators are already not simple immediate-mode renderers...

The only Intel chips which used tile-based rendering are the PowerVR based ones, which are limited to Direct3D 9 and for embedded systems only.

I was speaking about zone-rendering: http://download.intel.com/design/chipsets/applnots/30262503.pdf
If I remember correctly, while it is not true TBR, it uses a tiled framebuffer to increase cache/bandwidth efficiency.

eDRAM and 1T SRAM are different technologies. Using or not using one should have no effect on the other. In fact AMD has used eDRAM before: it's in every single Xbox 360!

1T SRAM technologies are all still in the research phase. It can easily take a decade from patent to product. But that doesn't have to stop AMD from using eDRAM in its APU products in shorter term. I'm just afraid that they've decided to keep parity between their discrete products and focus on increasing bandwidth the brute force way. Too bad for them Intel is ahead in DDR4 technology as well. Although AMD's own brand of RAM could be a sign of them having more in the labs...

Based on Wikipedia, I thought that 1T SRAMs were commercial products...
What I want to say, however, is that AMD does not seem very prone to innovating with new cache/SRAM technologies. So, although I hope to be proven wrong, I doubt that we will see eDRAM on AMD APUs anytime soon. Xbox 360's graphics don't count here: it was designed when ATI was ATI, not AMD

Thanks.
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
Triangle list generation can be accelerated by special hardware.
Generating it is not the issue. When you have lots of small polygons, it simply starts consuming more memory to store all the data for each tile, than the actual color and depth buffer!
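Just to put rough numbers on it, here is a back-of-envelope sketch; the triangle count, per-entry size and buffer formats are illustrative assumptions, not figures from any real GPU:

Code:
#include <stdio.h>

/* Back-of-envelope comparison of tile-binning storage versus a plain
 * color + depth buffer. All figures below are assumed for illustration. */
int main(void)
{
    /* Framebuffer: 1920x1080, RGBA8 color + 32-bit depth. */
    const long width = 1920, height = 1080;
    const long framebuffer = width * height * (4 + 4);

    /* Tessellation-heavy scene: with tiny triangles each one lands in
     * roughly one tile's bin, so count one binned entry per triangle. */
    const long triangles       = 4L * 1000 * 1000;   /* assumed triangle count      */
    const long bytes_per_entry = 3 * 16 + 4;         /* 3 post-transform verts + ref */
    const long bin_storage     = triangles * bytes_per_entry;

    printf("color + depth buffer: %ld MB\n", framebuffer / (1024 * 1024));
    printf("binned geometry     : %ld MB\n", bin_storage / (1024 * 1024));
    return 0;
}

With millions of tessellated triangles the binned geometry easily dwarfs the full-resolution color and depth buffers, which is exactly the problem.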
I was speaking about zone-rendering: http://download.intel.com/design/chipsets/applnots/30262503.pdf
If I remember correctly, while it is not true TBR, it uses a tiled framebuffer to increase cache/bandwidth efficiency.
Yes, which is used by exactly the PowerVR architecture licensed by Intel that I linked to before.
Anyway, with no new DRAM standard in the near term (DDR4 is still a long way off) and no eDRAM, I think that, if AMD wants to significantly increase its graphics performance, a less bandwidth-demanding rendering approach should be considered.
Such as?

eDRAM is a perfectly good solution, and ATI/AMD has used it before. I don't see why you rule it out.
Xbox 360's graphics don't count here: it was designed when ATI was ATI, not AMD
Why would that acquisition change anything?
After all, today's accelerators are already not simple immediate-mode renderers...
Sure they are.
 

kernelc

Member
Aug 4, 2011
77
0
66
www.ilsistemista.net
Generating it is not the issue. When you have lots of small polygons, it simply starts consuming more memory to store all the data for each tile, than the actual color and depth buffer!

I see your point here: I was not thinking of the storage needed for very complex tiles. Maybe true TBR accelerators will remain a PowerVR affair... :whistle:

Yes, which is used by exactly the PowerVR architecture licensed by Intel that I linked to before.

The linked PDF refers to the i915G chipset with the Intel GMA 900, not to the GMA 500 (licensed from PowerVR) used on Poulsbo and similar chipsets.


I really don't know; mine was just a consideration based on the fact that if they really want to increase APU performance without paying the die real estate that Intel is going to pay with eDRAM, they need a more radical approach with reduced bandwidth needs.

The other, more probable, option is that AMD is going to increase APU performance by simply increasing the compute-to-TMU/ROP ratio, but this cannot be done without limits (and we already see a ceiling here).

Any news on GCN-based APUs?

eDRAM is a perfectly good solution, and ATI/AMD has used it before. I don't see why you rule it out.


Why would that acquisition change anything?

I have been a big fan of eDRAM since the BitBoys affair, so I really like solutions based on it. The only thing is that it seems to me that AMD, in the CPU space, is resistant to such innovations. So I think that from AMD's side we will not see anything eDRAM-based for at least two more APU generations. I hope to be proven wrong though!

Sure they are.

No, they are immediate renderers with early-z rejection, and this makes a lot of difference. For example, ATI's R300 and later use a per-polygon tiled z-buffer approach that, while quite different from TBR, recalls some of the HSR techniques used by PowerVR.

Thank you for sharing your data with me.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
die real estate that Intel is going to pay with eDRAM

it's highly unlikely that Intel will use eDRAM in the foreseeable future (according to previous discussions I have seen at RWT and IDF presentations); the most likely scenario is conventional DRAM stacked on the CPU

my understanding is that eDRAM is more expensive (a more complex process to have both logic and DRAM on the same die, lower yields for a single big die) and less dense than conventional DRAM, hence stacking being the favored solution going forward

see for example slide 17 of the IDF Spring Exascale presentation BJ12_ACAS003_100_ENGf.pdf downloadable from intel.com/go/idfsessionsBJ
 
Last edited:

bronxzv

Senior member
Jun 13, 2011
460
0
71
1T SRAM technologies are all still in the research phase. It can easily take a decade from patent to product

EDIT: it looks like you are talking about 1T DRAM here (future technology), 1T SRAM is based on standard DRAM (old technology)
 
Last edited:

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
The linked PDF refers to the i915G chipset with the Intel GMA 900, not to the GMA 500 (licensed from PowerVR) used on Poulsbo and similar chipsets.
Yes, but they share the same technology. And either way the point was that they're all limited to Direct3D 9 and they're not suitable for high polygon counts. Hence any form of tile based rendering is not a solution for AMD's APU scaling problem.
I really don't know; mine was just a consideration based on the fact that if they really want to increase APU performance without paying the die real estate that Intel is going to pay with eDRAM, they need a more radical approach with reduced bandwidth needs.
I'm afraid such an approach simply doesn't exist. There's an inherent limit to temporal and spatial data access locality which dictates how much bandwidth is required. AMD is quite good at tweaking all the parameters that are involved so that a good balance between performance, cost and power consumption is achieved. There's always some room for improvement, but it's bound by the law of diminishing returns as you approach the theoretical limit. It's also not worth it to create a smaller chip if the R&D cost makes it more expensive per part, and the delay in release makes them lose sales.

Intel has a huge advantage in volume, which would justify a higher R&D cost, and yet they appear to opt for eDRAM, and I suspect the primary reason is time to market.

Also, eDRAM really doesn't add a lot of cost. Again look no further than the Xbox 360. Lastly, it affects the access latencies which means that other on-die storage can become smaller and the efficiency goes up when dealing with small tasks with dependencies between them. So it gains them several things which offset the cost.
No, they are immediate renderers with early-z rejection, and this makes a lot of difference. For example, ATI's R300 and later use a per-polygon tiled z-buffer approach that, while quite different from TBR, recalls some of the HSR techniques used by PowerVR.
That still makes them immediate renderers. The z-pyramid has very little to do with PowerVR's technology. Geometry drawn back to front still causes overdraw. It merely speeds up the front to back approach and avoids some RAM accesses.

It's also ironic that modern graphics engines often first do a pre-z pass to eliminate shading pixels more than once. So early-z and pre-z have pretty much eliminated the need for hardware-based deferred rendering. What's happening now is that the available bandwidth is no longer wasted on overdraw but is consumed by more complex shading, higher precision, higher resolutions, etc. In fact in Unreal Engine 4 "the majority of the GPU’s FLOPS are going into general compute algorithms, rather than the traditional graphics pipeline"!
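To make the pre-z idea concrete, here is a minimal toy sketch (a single pixel and a handful of hand-picked fragments, illustrative only and nothing like a real engine): the first pass resolves depth only, the second pass shades just the surviving fragment, so submission order no longer causes redundant shading.

Code:
#include <float.h>
#include <stdio.h>

/* Toy depth pre-pass over a single pixel. Pass 1 writes only depth;
 * pass 2 shades only the fragment matching the resolved depth, so the
 * pixel is shaded exactly once no matter how the geometry is ordered. */
typedef struct { float z; unsigned color; } Fragment;

int main(void)
{
    Fragment frags[] = { { 0.9f, 0xFF0000u }, { 0.3f, 0x00FF00u }, { 0.6f, 0x0000FFu } };
    const int n = (int)(sizeof frags / sizeof frags[0]);

    float    zbuf   = FLT_MAX;  /* depth buffer cleared to "far" */
    unsigned cbuf   = 0;        /* color buffer */
    int      shaded = 0;

    for (int i = 0; i < n; i++)             /* pass 1: depth only */
        if (frags[i].z < zbuf)
            zbuf = frags[i].z;

    for (int i = 0; i < n; i++)             /* pass 2: shade only what survived */
        if (frags[i].z == zbuf) {
            cbuf = frags[i].color;
            shaded++;
        }

    printf("fragments shaded: %d of %d (color 0x%06X)\n", shaded, n, cbuf);
    return 0;
}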

Graphics is quickly evolving toward software-oriented approaches. So AMD is losing its grip on the rendering process. Hence implementing the most advanced hardware TBDR on the planet wouldn't help much. Looking at GCN, they clearly understand this. And it also means there's no substitute for bandwidth other than big caches.
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
EDIT: it looks like you are talking about 1T DRAM here (future technology), 1T SRAM is based on standard DRAM (old technology)
Sorry for the confusion, I wasn't very specific about the technology I meant to refer to. DRAM already always has only one transistor per cell, so calling something 1T DRAM seems redundant to me. MoSys' 1T-SRAM (note the hyphen) is really based on traditional DRAM, but it hides the DRAM aspects by building in a controller which takes care of refreshing the cells so that it behaves more like SRAM to external logic. It would be more correct to classify it as (embedded) pseudostatic RAM (ePSRAM). This is "old" technology.

The 1T SRAM I was referring to should probably be called "capacitorless" DRAM, which relies on the floating body effect. But since it always needs a built-in controller, it always behaves like SRAM, and is six times denser than 6T SRAM. This is "new" technology.

I hope this clarifies what I meant. I expect large caches using existing high density technology to be used in the relatively short term, while longer term it will probably evolve into even denser capacitorless types.
 

kernelc

Member
Aug 4, 2011
77
0
66
www.ilsistemista.net
Yes, but they share the same technology. And either way the point was that they're all limited to Direct3D 9 and they're not suitable for high polygon counts. Hence any form of tile based rendering is not a solution for AMD's APU scaling problem.

Yes, you are right: after further research, I found an Intel article claiming that the GMA X3000 and up don't use zone rendering. Interesting find.

I'm afraid such an approach simply doesn't exist. There's an inherent limit to temporal and spatial data access locality which dictates how much bandwidth is required. AMD is quite good at tweaking all the parameters that are involved so that a good balance between performance, cost and power consumption is achieved. There's always some room for improvement, but it's bound by the law of diminishing returns as you approach the theoretical limit. It's also not worth it to create a smaller chip if the R&D cost makes it more expensive per part, and the delay in release makes them lose sales.

Something can be done (see later)...

Intel has a huge advantage in volume, which would justify a higher R&D cost, and yet they appear to opt for eDRAM, and I suspect the primary reason is time to market.

Also, eDRAM really doesn't add a lot of cost. Again look no further than the Xbox 360. Lastly, it affects the access latencies which means that other on-die storage can become smaller and the efficiency goes up when dealing with small tasks with dependencies between them. So it gains them several things which offset the cost.

This makes a lot of sense, I must say...

That still makes them immediate renderers. The z-pyramid has very little to do with PowerVR's technology. Geometry drawn back to front still causes overdraw. It merely speeds up the front to back approach and avoids some RAM accesses.

I have often found this kind of accelerator called a "hybrid" or "advanced" immediate renderer. I think that the difference between a basic immediate-mode chip (e.g. NV10) and an early-z enabled chip is too great to consider them to be in the same class. Anyway, terminology apart, we are speaking about the same thing...

It's also ironic that modern graphics engines often first do a pre-z pass to eliminate shading pixels more than once. So early-z and pre-z have pretty much eliminated the need for hardware-based deferred rendering. What's happening now is that the available bandwidth is no longer wasted on overdraw but is consumed by more complex shading, higher precision, higher resolutions, etc. In fact in Unreal Engine 4 "the majority of the GPU’s FLOPS are going into general compute algorithms, rather than the traditional graphics pipeline"!

Graphics is quickly evolving toward software-oriented approaches. So AMD is losing its grip on the rendering process. Hence implementing the most advanced hardware TBDR on the planet wouldn't help much. Looking at GCN, they clearly understand this. And it also means there's no substitute for bandwidth other than big caches.

While this is all correct, TBR-based designs still have a huge advantage from a bandwidth perspective, as they need a minimal amount of z-traffic.

ATI's HyperZ was inspired by the same need: to limit the bandwidth wasted on the z-buffer. Given ATI/AMD's huge know-how, I would not be surprised if they developed new, more aggressive z and color compression schemes, somewhat reminiscent of a TBR approach (e.g. using higher-resolution on-chip z caches and/or operating on larger tiles).
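To illustrate the kind of coarse z-test I have in mind, here is a small sketch of the general hierarchical-Z idea with made-up numbers; it is not ATI's actual HyperZ implementation, just the principle of rejecting whole tiles without touching the full-resolution z-buffer:

Code:
#include <stdio.h>

/* Sketch of a coarse, per-tile z-test in the spirit of hierarchical-Z.
 * Tile count and depth values are illustrative assumptions only. */
#define TILES 4

int main(void)
{
    /* Farthest depth already stored in each tile (smaller z = nearer). */
    float tile_max_z[TILES] = { 0.2f, 0.9f, 0.5f, 0.1f };

    /* Nearest depth of an incoming primitive that covers all four tiles. */
    float incoming_min_z = 0.6f;

    int rejected = 0;
    for (int t = 0; t < TILES; t++) {
        /* If the primitive's nearest point is still behind everything
         * already stored in the tile, the whole tile is skipped: no
         * per-pixel z-buffer reads or writes at all. */
        if (incoming_min_z > tile_max_z[t])
            rejected++;
    }

    printf("tiles rejected after reading only %d coarse z values: %d of %d\n",
           TILES, rejected, TILES);
    return 0;
}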

Maybe something is already at work in current AMD chips, but with all the attention given to the shader core, I cannot find much information on these more "pedestrian" tasks. Do you have any interesting news in this regard?

Thanks.
 
Last edited:

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
sorry for the nitpick but it's commonly known as 1T *D*RAM
This doesn't make sense. There's nothing special about 1T DRAM. Like he said, calling it 1T DRAM is redundant.

DRAM is the relatively slow stuff that's used in system memory. SRAM is the very fast memory used in cache, but because of the amount of transistors it uses (6x that of DRAM), it is very expensive.

D stands for dynamic, S stands for static. They are complete opposites of each other. There's absolutely no way to confuse the two, yet you've managed to do it.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
There's nothing special about 1T DRAM

it looks like you are confusing it with 1T1C DRAM

http://ekv.epfl.ch/files/content/sites/ekv/files/mos-ak/wroclaw/MOS-AK_JMS.pdf

also try googling "1T DRAM"


and hey, please, don't kill the messenger! I'm not the one who chose this nomenclature


D stands for dynamic, S stands for static. They are complete opposites of each other. There's absolutely no way to confuse the two

nope, there is clearly some confusion going on since we are discussing it here; for example, "1T SRAM" is actually DRAM

yet you've managed to do it.

wow what a polite guy! man, you are great at teaching things to people: in 5 seconds I have learned that you have absolutely no clue
 
Last edited:

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
it looks like you are confusing it with 1T1C DRAM
Ironically even SRAM relies on capacitance to hold its state. Each cell simply "recharges" continuously. And it only takes a very small charge to flip the bit value. In fact it's the main cause of soft errors in microprocessors. Mission critical SRAM designs therefore use a capacitor not unlike those found in DRAM! And at the other end of the spectrum even "capacitorless" DRAM still relies on the small capacitance of the floating body effect.

So it's tricky to categorize them based on capacitance. Furthermore, memory based on the floating body effect can be read without destroying the bit value, unlike classic DRAM, but does need periodic refreshing. And the charge is used to influence current, not voltage.

Suffice to say that we shouldn't jump at each other over some nomenclature. It's all very different from the legacy meanings of DRAM and SRAM. We'll probably have to wait several more years before some dominant technology which combines the qualities of both sets a new standard, and a new name.

Anyway, to get back on topic: I do believe that even though DDR4 will offer higher bandwidth, there's a need for a large L4 cache on the chips aimed at competing with mid-range discrete graphics. Whether that's done using an MCM package or stacked dies or whatever is beside the point. In any case it seems that Intel is ahead both on DDR4 and L4 cache technology.
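A quick back-of-envelope shows why; the iGPU traffic figure and the hit rate below are purely illustrative assumptions:

Code:
#include <stdio.h>

/* Illustration of how a large on-package cache cuts external DRAM
 * traffic. The iGPU demand and L4 hit rate are assumed numbers. */
int main(void)
{
    const double igpu_demand_gbs = 60.0;  /* assumed bandwidth a mid-range iGPU wants */
    const double l4_hit_rate     = 0.6;   /* assumed fraction of traffic served by L4 */
    const double dram_gbs        = 25.6;  /* dual-channel DDR3-1600 */

    double external = igpu_demand_gbs * (1.0 - l4_hit_rate);
    printf("traffic reaching DRAM: %.1f GB/s against a %.1f GB/s budget (%s)\n",
           external, dram_gbs, external <= dram_gbs ? "fits" : "does not fit");
    return 0;
}

The exact numbers don't matter; the point is that every byte served from the L4 is a byte the memory controller never has to deliver.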

And I really don't think AMD has some secret graphics technology which will allow their APUs to scale without more bandwidth or cache memory.
kernelc said:
Given ATI/AMD's huge know-how, I would not be surprised if they developed new, more aggressive z and color compression schemes, somewhat reminiscent of a TBR approach (e.g. using higher-resolution on-chip z caches and/or operating on larger tiles).
But that would require lots of on-chip storage. And I'm afraid you overestimate how "huge" their graphics know-how lead is. GPUs nowadays mostly consist of programmable logic and relatively generic memory subsystems. Intel has plenty of experience in those fields, and they've hired many of the most brilliant minds in computer graphics over the last few years to make up for whatever they lacked in graphics knowledge.

Also keep in mind that while AMD's APUs have a notable lead in graphics performance, they've made huge sacrifices in CPU performance to make that happen. Intel's process lead allows them to easily catch up in raw iGPU performance with Haswell, but AMD has a very long way to go to catch up in CPU performance!

So basically I fear that AMD won't have any advantages left in the near future. I hope I'm wrong and they actually have some alternative to Z-RAM ready for production in the Haswell timeframe...
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
I do believe that even though DDR4 will offer higher bandwidth, there's a need for a large L4 cache on the chips aimed at competing with mid-range discrete graphics.

indeed, and it's a given for Haswell AFAIK, at least for some SKUs; note that it will also be really great for CPU-based pure software 3D renderers


In any case it seems that Intel is ahead both on DDR4 and L4 cache technology.

pretty much everybody is moving to stacked DRAM (see 3D-ICs at JEDEC, for example) and DDR4 is an industry standard, so I'm not sure what lead Intel can have here; my understanding is that they will simply use commodity RAM for the L4$ as well
 
Last edited:

kernelc

Member
Aug 4, 2011
77
0
66
www.ilsistemista.net
But that would require lots of on-chip storage. And I'm afraid you overestimate how "huge" their graphics know-how lead is. GPUs nowadays mostly consist of programmable logic and relatively generic memory subsystems. Intel has plenty of experience in those fields, and they've hired many of the most brilliant minds in computer graphics over the last few years to make up for whatever they lacked in graphics knowledge.

I understand that Intel can be a fierce competitor, but it seems to me that the IGP in IB is the first acceptable one from them. SB, albeit decent from a performance standpoint, has terrible image quality.

I think that for another one or two IGP generations AMD will be faster. However, they risk seeing their speed margin quickly vanish within a couple of years.

Also keep in mind that while AMD's APUs have a notable lead in graphics performance, they've made huge sacrifices in CPU performance to make that happen. Intel's process lead allows them to easily catch up in raw iGPU performance with Haswell, but AMD has a very long way to go to catch up in CPU performance!

Actually, I think the situation is worse: regarding CPU speed, AMD seems incapable of competing with Intel, with or without IGPs thrown into the equation. Maybe Piledriver can speed up Bulldozer a little, but pure CPU speed will remain lower than Intel's.

Fortunately, with these kinds of CPU monsters, absolute maximum speed no longer matters so much in a lot of workloads, but that is another story...

So basically I fear that AMD won't have any advantages left in the near future. I hope I'm wrong and they actually have some alternative to Z-RAM ready for production in the Haswell timeframe...

From a graphics hardware standpoint, I think they still have some advantages, but not very large ones. However, graphics is not only hardware but also software: at the moment, AMD's driver advantage is huge. Tomorrow it will surely shrink, but Intel will have a hard time trying to surpass them.
 
Last edited:

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
indeed, and it's a given for Haswell AFAIK, at least for some SKUs; note that it will also be really great for CPU-based pure software 3D renderers
You might be interested in this video, in particular at the 19:30 mark: Andrew Richards talks about OpenCL’s future. With Haswell a versatile high performance software renderer might be closer than he realizes, although I'm not convinced OpenCL is the right API/language for this.

By the way, do you know for certain that Haswell's L4 cache will also be shared with the CPU cores? If they want to store specific graphics data in it they may not want to pollute it with CPU data.

Note that AMD might try to address its APU bandwidth issue by adding separate GDDR5 memory channels. How this will help them unify the address space is questionable though.
pretty much everybody is moving to stacked DRAM (see 3D-ICs at JEDEC, for example) and DDR4 is an industry standard, so I'm not sure what lead Intel can have here; my understanding is that they will simply use commodity RAM for the L4$ as well
Several rumors suggest that it will be an MCM package. Keep in mind that the "pretty much everybody" rule doesn't apply to Intel. In particular, stacked DRAM makes sense for mobile devices where heat isn't an issue, but as proven by Ivy Bridge, Haswell will require a good thermal interface to the heat sink, and wedging DRAM in between would not do it any good.

On the other hand a large amount of copper through-silicon interconnects may actually improve heat transfer...
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
You might be interested in this video, in particular at the 19:30 mark: Andrew Richards talks about OpenCL’s future.

yeah, I watched it already: "more languages will come" (read: it's not ready for wide adoption), "there is no killer application yet" (read: it's a solution in search of a problem at the moment). All in all I found him very bad at selling OpenCL; he's probably not convinced himself at all

With Haswell a versatile high performance software renderer might be closer than he realizes, although I'm not convinced OpenCL is the right API/language for this.
exactly, native AVX2 will clearly be faster; doing graphics on GPGPU would be what? GGPGPU? What's next? GPGGPGPU? The idea looks very bad to me

By the way, do you know for certain that Haswell's L4 cache will also be shared with the CPU cores?
not sure for desktops/laptops, but it's much needed for servers since IBM already has MCMs with L4 caches

see for example slide 17 of the IDF Spring Exascale presentation BJ12_ACAS003_100_ENGf.pdf downloadable from intel.com/go/idfsessionsBJ

If they want to store specific graphics data in it they may not want to pollute it with CPU data.
it looks like something that can be solved by simple arbitration logic, and why not make it user-configurable with some flags (like the flags for adjacent line prefetch, Hyper-Threading, etc.)
 