Haswell to include a L4 cache?

Idontcare · Mar 21, 2012

IntelUser2000 said:
Yes, you may well be on to something. A mere framebuffer for GPU rather than seperate caching level to simplify things. That would be just a start.

So kinda like how we have seperate I and D L1$'s now, for data and instructions?

Marketing might call it an L4$ but functionally it will really be tied to the GPU as a L2$ or something for the GPU itself then?

HexiumVII · Mar 21, 2012

Wow, thats going to be some crazy cache hierarchy. We already have L2, L3, Main Memory, SSD DRAM cache, SRT, SSD, HDD.

Mr. Pedantic · Mar 21, 2012

Isn't that what system memory is supposed to be? So why can't they just save die space and improve RAM?

BFG10K · Mar 21, 2012

AFACT this will be a cache for the GPU and will be outside the CPU. Think something similar to Xenos eDRAM.

Of course you cant really store any assets in there because its too small, so youll still be crippled by system bandwidth in most cases.

IntelUser2000 · Mar 21, 2012

Mr. Pedantic said:
Isn't that what system memory is supposed to be? So why can't they just save die space and improve RAM?

You can't just take only the benefits of system memory and embedded/package DRAM and make something out of it. Idontcare, you have given a good example of balancing in this post: http://forums.anandtech.com/showpost.php?p=33158287&postcount=40

First, I'd like to add to his post that it only applies when we are looking technologies that are at cutting edge. For example, we can only say RAM is faster and greater capacity than SSD when comparing relatively recent examples(5 years or less). It may not be true in the case the RAM is from 1980's and SSD is from 2012.

If you can make RAM that's really really fast, then you can make that into embedded memory and make it even faster. So that advantage is always there. You can't beat physics, so its impossible to make memory that's farther away faster than one that's closer.

Idontcare said:
So kinda like how we have seperate I and D L1$'s now, for data and instructions?

Hmm.

So think of it like this. Rather than being a true hierarchy between L3 cache and RAM and be fast at everything, its specialized for bandwidth. It's still general purpose, but useless for most consumer CPU usage as there's zero latency reduction. And naturally it would be a totally different story for servers and workstations.

Topweasel · Mar 21, 2012

Meaker10 said:
Applications don't manage the cache, nor does the OS, this has no effect on the programming at all.

Point was that as applications get more needlessly bloated the larger the requirement for system memory, the more likely the CPU will have to access system memory when computing a thread from that application. More cache would always help, but in the end, a cpu is still going to access system memory regularly. You would think that increasing the latency to that to add another cache layer, would eventually cause the latency to be so high to system memory that any performance increase you have with on die cache would be mitigated by the loss in performance of the CPU when having to access system memory.

Mr. Pedantic · Mar 21, 2012

IntelUser2000 said:
You can't just take only the benefits of system memory and embedded/package DRAM and make something out of it. Idontcare, you have given a good example of balancing in this post: http://forums.anandtech.com/showpost.php?p=33158287&postcount=40

First, I'd like to add to his post that it only applies when we are looking technologies that are at cutting edge. For example, we can only say RAM is faster and greater capacity than SSD when comparing relatively recent examples(5 years or less). It may not be true in the case the RAM is from 1980's and SSD is from 2012.

If you can make RAM that's really really fast, then you can make that into embedded memory and make it even faster. So that advantage is always there. You can't beat physics, so its impossible to make memory that's farther away faster than one that's closer.

How much faster does it need to be?

zephyrprime · Mar 21, 2012

With the advent of integrated GPU's, it makes a lot of sense to have a cache layer shared between the CPU & GPU. Primarily, this cache layer is to improve GPU performance. The additional latency is a problem but only a minor one as the existing L1-3 caches already provide a very high (95%+ no doubt) hit rate.

GPU's are very starved for bandwidth on ddr3 and even probably ddr4 memory systems on modern motherboards. Something like a big L4 could really help gpu performance.

It's interesting to speculate what form an L4 could take. As others have already said, the xbox 360 already includes an edram cache. I think an L4 cache is likely to be similar to this and not be like the sdram L3 caches we have now. If an L4 cache were included that were big enough hold uncompressed textures for texturing, it could save a lot of bandwidth. The main memory would only be hit to read the compressed texture and then the L4 would be hit to read the actual texture pixel data. Since the compressed size of a texture is a fraction of the uncompressed size, main memory would only have to serve up a fraction of the normal bandwidth needed. GPUs already have texture caches but they are tiny compared to how big an L4 would be.

Ajay · Mar 22, 2012

Idontcare said:
So kinda like how we have seperate I and D L1$'s now, for data and instructions?

Marketing might call it an L4$ but functionally it will really be tied to the GPU as a L2$ or something for the GPU itself then?

That's the only thing that makes sense to me. Maybe in the next tick (14nm) Intel will be able to bring a respectable amount of GDDR RAM onto the die.

Khato · Apr 2, 2012

And at last an article that includes a claim for the "L4$" size at 64MB - http://semiaccurate.com/2012/04/02/haswells-gpu-prowess-is-due-to-crystalwell/ I guess the question of whether or not that's 'enough' for graphics purposes really depends upon exactly how it's used. The more promising aspect is that its inclusion implies that Haswell's GT3 graphics is actually fast enough for memory bandwidth to become a concern.

Edrick · Apr 2, 2012

Interesting article.

IntelUser2000 · Apr 2, 2012

Mr. Pedantic said:
How much faster does it need to be?

There are inherent benefits to on package/on die setups. The reason system memory channels are practically limited to 2 channels on desktop and 4 on servers is because of the routing required. Every bit of data line becomes a wire. Intel tried to change that using FB-DIMMs, but that's long gone now.

By bringing it on package, the wires become shorter(allowing higher frequencies for lower bandwidth and/or latency) and its easier to implement them. I think the need for speed is unquenchable, but the problem always have been doing it in a practical manner.

Go look at a motherboard and see the position of the memory controller(whether its on CPU or chipset) and the position of the DIMM slots. You'll notice there are bunch of wires that connect them. Imagine taking a pencil and drawing a line for the shortest, most optimal direction for all those wires. You need more than 100 of them. The real trick is keeping them at similar length.

Mr. Pedantic · Apr 3, 2012

IntelUser2000 said:
There are inherent benefits to on package/on die setups. The reason system memory channels are practically limited to 2 channels on desktop and 4 on servers is because of the routing required. Every bit of data line becomes a wire. Intel tried to change that using FB-DIMMs, but that's long gone now.

By bringing it on package, the wires become shorter(allowing higher frequencies for lower bandwidth and/or latency) and its easier to implement them. I think the need for speed is unquenchable, but the problem always have been doing it in a practical manner.

Go look at a motherboard and see the position of the memory controller(whether its on CPU or chipset) and the position of the DIMM slots. You'll notice there are bunch of wires that connect them. Imagine taking a pencil and drawing a line for the shortest, most optimal direction for all those wires. You need more than 100 of them. The real trick is keeping them at similar length.

Bringing memory on-die also creates a problem of die size; 1 long copper wire embedded in the PCB may be expensive, but so is one short copper wire built right into the die. And then there's the actual size of the cache.

Also, you didn't really answer my question. How much faster does RAM-to-CPU transfer have to be, especially for consumer chips?

IntelUser2000 · Apr 3, 2012

Mr. Pedantic said:
Bringing memory on-die also creates a problem of die size; 1 long copper wire embedded in the PCB may be expensive, but so is one short copper wire built right into the die. And then there's the actual size of the cache.

The problem with having too many wires for more memory channels is because it necessiates having more motherboard layers, and that increases the production cost. Also the complexity of having to trace longer wires makes it more costly than on-package.

Anyway the on-package memory is merely a temporary solution to a ultimate one. And that is stacked die, with CPU on one die and memory on another. Then not only you can have thousands of wires for unimaginable speeds, the integration further lowers cost. That also avoids having one large die, with big portion taken up by the memory.

Also, you didn't really answer my question. How much faster does RAM-to-CPU transfer have to be, especially for consumer chips?

You may see insignificant gains on the CPU side with a on-package solution. Once they move to a stacked RAM with large capacities like 512MB-1GB, we may even see substantial latency improvement akin to moving to a integrated memory controller.

I think bandwidth can't really be quenched for HPC(workstation) and graphics/media applications, and if we see a future where CPU and GPU is fully merged, there would be even more importance of integrated DRAM.

Tuna-Fish · Apr 3, 2012

Mr. Pedantic said:
Also, you didn't really answer my question. How much faster does RAM-to-CPU transfer have to be, especially for consumer chips?

Talking only of bandwidth, for the CPU not at all really. For the GPU, at least an order of magnitude faster, preferably two.

Mr. Pedantic · Apr 3, 2012

IntelUser2000 said:
The problem with having too many wires for more memory channels is because it necessiates having more motherboard layers, and that increases the production cost. Also the complexity of having to trace longer wires makes it more costly than on-package.

Anyway the on-package memory is merely a temporary solution to a ultimate one. And that is stacked die, with CPU on one die and memory on another. Then not only you can have thousands of wires for unimaginable speeds, the integration further lowers cost. That also avoids having one large die, with big portion taken up by the memory.

And stacking dies is alright for heat and stuff? Because it just seems strange that it's been done for memory but nothing else. Surely someone would have figured out a way to do it.

Tuna-Fish said:
Talking only of bandwidth, for the CPU not at all really. For the GPU, at least an order of magnitude faster, preferably two.

Oh, right. So why is it possible for dedicated GPUs to get that much bandwidth, but CPUs can't?

Edrick · Apr 3, 2012

Mr. Pedantic said:
Oh, right. So why is it possible for dedicated GPUs to get that much bandwidth, but CPUs can't?

I am sure CPUs could, if Intel wanted them to. But they would be much more money and offer very little benefit to 99% of the users. SB-E memory bandwidth with quad channel RAM is pushing 50GB/s bandwidth, which is more than enough for almost anything thrown at it. Where GPUs with 150GB/s (+) need that plus more. It is the nature of the beast.

Tuna-Fish · Apr 3, 2012

Mr. Pedantic said:
And stacking dies is alright for heat and stuff? Because it just seems strange that it's been done for memory but nothing else. Surely someone would have figured out a way to do it.

Heat and power distribution are precisely the reasons it isn't presently being done outside of very low power environments. Nearly everyone who does semiconductor integration is presently heavily investing into 3d stacking methods, and there are promising solutions to the existing problems looming in the horizon.

DRAM memory is a good initial candidate because it is typically much lower power than the same silicon area dedicated to logic.

Oh, right. So why is it possible for dedicated GPUs to get that much bandwidth, but CPUs can't?

It's about the workloads. The typical cpu is interested in low-latency accesses to a small subset of the memory, and is thus well served with a good cache hierarchy. The SNB cache system has a total hitrate well in excess of 95%, which means you get some 20 times more realized bandwidth than what your memory provides.

The typical GPU workload consists of rapidly streaming through large data sets. This is essentially uncacheable, as accessing an item of memory makes it the least likely one to be accessed again in the near future. So what you want is just raw bandwidth.

Vesku · Apr 3, 2012

My guess is that not only is this an eDram like graphics solution, which we were talking about when Llano first arrived. But that this will be Intel's way of sharing data sets between CPU and GPU for purposes of GPGPU.

Haswell should be the first generation of Intel OpenCL GPU that can actually do some useful work without relying on fixed function blocks (Quicksync). Given that Llano GPU was just a bit short in terms of compute.

Edrick · Apr 3, 2012

Vesku said:
Haswell should be the first generation of Intel OpenCL GPU that can actually do some useful work without relying on fixed function blocks (Quicksync). Given that Llano GPU was just a bit short in terms of compute.

I think IB is capable of OpenCL.

Tuna-Fish · Apr 3, 2012

Edrick said:
I think IB is capable of OpenCL.

Yes, I've head similar things from people working on OSS computing things.

Vesku · Apr 3, 2012

Yes, but will it be useful in IB other than the same way a $30 discrete card is "DirectX 11" capable? Llano was not quite powerful enough to encourage a lot of development. My guess is Trinity and Haswell will offer enough potential to see a real uptick in OpenCL development.

Fjodor2001 · Apr 3, 2012

But will the L4 cache and GT3 IGP only be available in the mobile Haswell CPUs?

This article seems to suggest that:

http://www.legitreviews.com/news/12656/

But not this one:

http://vr-zone.com/articles/mystery...up-the-graphics-ante-further-again/15272.html

IntelUser2000 · Apr 3, 2012

Mr. Pedantic said:
Oh, right. So why is it possible for dedicated GPUs to get that much bandwidth, but CPUs can't?

Simply because everything is a trade off in one way or another. GPUs have 1-2GB memory in the very high end parts, with CPUs you can get 32GB if you wanted to. Another example is that the EX Xeon chips feature slower memory so it can support massive capacities.

And stacking dies is alright for heat and stuff? Because it just seems strange that it's been done for memory but nothing else. Surely someone would have figured out a way to do it.

Maybe that's why the technology is still few years away. They are trying to solve those problems.

Fjdor2001 said:
But will the L4 cache and GT3 IGP only be available in the mobile Haswell CPUs?

I'm pretty sure the VR-Zone piece is nothing more than speculation. I wouldn't be surprised to see a 4 core SKU with GT3 graphics, its just that it'll be on mobile. It's less worth it making a powerful iGPU on desktops because a) its not thermally bound b) due to thermals desktops GPUs are far more powerful c) everybody is increasingly buying more mobile systems.

Ajay · Apr 3, 2012

Interesting. It would appear that Intel considered the threat from AMD's evolving line of 'Fusion" CPU/GPUs to be significant enough to forward this more complex design.

Haswell to include a L4 cache?

Elite Member

Senior member

Diamond Member

Lifer

Elite Member

Diamond Member

Diamond Member

Diamond Member

Lifer

Golden Member

Golden Member

Elite Member

Diamond Member

Elite Member

Golden Member

Diamond Member

Golden Member

Golden Member

Diamond Member

Golden Member

Golden Member

Diamond Member

Diamond Member

Elite Member

Lifer