Intel "Haswell" Speculation thread

mrob27 · Sep 21, 2012

AtenRa said:
Have a look at the memory controllers below the L3 cache: the empty space is smaller than on the Ivy Bridge GT2 (HD 4000) in the pic above. That can only mean the die is much smaller, and thus not GT3 but probably GT1.

About that empty space in Ivy Bridge: Refer to the Ivy Bridge die sizes article again.

The "biggest" Ivy Bridge is the HE-4, at 8.141 x 19.361 mm, with 8MB of cache. The Ivy Bridge HM-4 still has 4 cores, but only 6MB of cache, and is only 7.656 mm wide, about half a millimeter narrower. Clearly they slice a bit out of each piece of L3 cache, which makes the chip narrower. It also has GT1 graphics, so, as this post suggested, the GPU gets a lot smaller. For the dual-core H-2 and M-2 variants, the empty space is used to accommodate the dual-channel memory controller.
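For anyone who wants to check the numbers, here is a minimal Python sketch of the arithmetic. It uses only the dimensions and cache sizes quoted above; the function names are made up for illustration, and the HM-4 length is left out because it isn't given here.

```python
# Die dimensions and cache sizes quoted in the post, in mm and MB.
HE4_WIDTH_MM = 8.141
HE4_LENGTH_MM = 19.361
HM4_WIDTH_MM = 7.656      # HM-4 length is not given in the post
HE4_L3_MB = 8
HM4_L3_MB = 6


def width_saving_mm(wide, narrow):
    """How much narrower the smaller die is."""
    return wide - narrow


def die_area_mm2(width, length):
    """Rectangular die area."""
    return width * length


def cache_cut_fraction(full_mb, reduced_mb):
    """Fraction of the L3 cache removed going from full_mb to reduced_mb."""
    return 1 - reduced_mb / full_mb


print(f"HM-4 is {width_saving_mm(HE4_WIDTH_MM, HM4_WIDTH_MM):.3f} mm narrower than HE-4")
print(f"HE-4 area: {die_area_mm2(HE4_WIDTH_MM, HE4_LENGTH_MM):.1f} mm^2")
print(f"L3 cut from 8MB to 6MB: {cache_cut_fraction(HE4_L3_MB, HM4_L3_MB):.0%}")
```

The width difference comes out at 0.485 mm, and going from 8MB to 6MB removes a quarter of the cache.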

I don't completely agree with the photos in that post: IntelUser2000 clearly cut the L3 cache section in half, when it actually needs to be cut by 25%. If you look at the Sandy Bridge image from AtenRa's post:

You can see pretty easily the "dead space" underneath the ring bus stop in each of the four L3 cache slices, and that dead space is exactly 1/4 the height of the cache section. For Ivy it's similar, most easily seen in this image from Intel Sweden:

But the rest of IntelUser2000's analysis is pretty accurate and he predicted the die sizes pretty well.

Anyway...

...for Haswell it's a different game. They'll definitely cut the die in different ways and use different layouts, but suppose they might have a variant with less cache and the same GPU? Then they could use that "dead space" to accommodate the GPU.

I believe that in the maybe-Haswell-wafer image we're discussing, the GPU occupies the full width of the die, and that's why the visible dead space is so small.

mrob27 · Sep 21, 2012

There is one other Haswell die photo that appeared in news posts at about the same time. Here it is:

I found it in this article on newelectronics.co.uk

My analysis of this image is the same as the one in the BBC article: 4 cores with a new layout, GPU with 20 shader cores and a different fixed-function layout.

The nice thing about this image is that it makes it a bit easier to see the GPU layout is different from Ivy Bridge, and it's easier to see the dead space below the ring-bus stops in the L3 cache area.

We can also see more clearly that the space allocation within the L3 cache slices is different: only about 55% of each L3 section is taken up by the actual 2MB memory array, whereas in Ivy Bridge the memory takes up 65 or 70%. It's hard to see what that extra stuff is, but I suppose others might speculate... :whistle:

Idontcare · Sep 21, 2012

Not sure why but Wired seems to think this image of Haswell that we have been talking about is actually just an IB die?

www.wired.com/wiredenterprise/wp-content/uploads//2012/09/intel-ivy-brige-closeup.jpg

mrob27 · Sep 21, 2012

I forgot how many of these I scraped up last week. Here's another one:

which is in several blogs and re-posts apparently originating with this post by Intel on their Google+ page and dated 2012 Sep 13th.

This has the same new core layout, the GPU with 20 execution units, L3 cache with the memory banks taking up only about half of the space, etc.

Another new feature seen in this photo and in both of the others, and not seen in IVB and SNB dies, is the prominent "lines" between the cores. In the first "pin" image they're black; in the in-focus part of this image they're orange/yellow/green.

The odd-numbered lines (those between cores 0-1 and between cores 2-3) are a bit longer, extending almost all the way to the DDR3 controller. No such lines are seen in any of the Ivy or Sandy die photos I've looked at.

mrob27 · Sep 21, 2012

Idontcare said:
Not sure why but Wired seems to think this image of Haswell that we have been talking about is actually just an IB die?

(image here)

Wow... that image is available in a 12-megapixel version! It is at this Flickr post by "IntelFreePress" dated 2012 Sep 10th. On the Flickr page, use the "View all sizes" link to access the huge version.

image removed

Let's take pity on the bandwidth challenged, shall we?
admin allisolm

Sorry! - mrob27

This is obviously not a Sandy Bridge or Ivy Bridge. There is a lot to look at here... I'll try to upload a labeled collage soon...

mrob27 · Sep 21, 2012

Okay, here is a preliminary labeling of the Haswell die shown in this Flickr post by "IntelFreePress" dated 2012 Sep 10th. Click the yellow bar to embiggen.

Perhaps you'd like to identify more of the parts.

TuxDave · Sep 21, 2012

Am I the only person zooming in on the core to figure out which chip that is? The last set of pictures makes it pretty clear what I'm looking at.

Idontcare · Sep 21, 2012

TuxDave said:
Am I the only person zooming in on the core to figure out which chip that is? The last set of pictures makes it pretty clear what I'm looking at.

It's Steamroller, isn't it?

IntelUser2000 · Sep 21, 2012

mrob27 said:
I don't completely agree with the photos in that post: IntelUser2000 clearly cut the L3 cache section in half, when it actually needs to be cut by 25%. If you look at the Sandy Bridge image from AtenRa's post:

You are right. It's cut more than I wanted. But I figured it conveys the general idea of how the cores are derived. I also agree it's a GT2 part for the Haswell GPU. The GT3 replicates almost EVERYTHING, so it can't be that small.

mrob27 · Sep 22, 2012

IntelUser2000 said:
[...] I figured it conveys the general idea of how the cores are derived. I also agree it's a GT2 part for the Haswell GPU. The GT3 replicates almost EVERYTHING, so it can't be that small.

Yep. Your analysis was great, by the way, and gave us a lot to think about months before we had the hard numbers on Ivy Bridge. Also, you gave great Haswell predictions which still look pretty good (we have nothing solid yet on the 4c+GT3 variant, but the square die seems likely).

AtenRa · Sep 22, 2012

5 rows, 20 EUs, that seems to be GT2.

Also, it seems that this layout was not designed for scaling down by removing cores. The iGPU is stretched from top to bottom without leaving empty space below it like in SB and IB. This is one reason why the iGPU is narrower than the iGPU in Ivy Bridge although it has more EUs.

Yuriman · Sep 22, 2012

Does that mean GT3 chips will be cut from a different wafer?

ShintaiDK · Sep 22, 2012

Yuriman said:
Does that mean GT3 chips will be cut from a different wafer?

Si.

IntelUser2000 · Sep 22, 2012

AtenRa said:
5 rows, 20 EUs, that seems to be GT2.

Also, it seems that this layout was not designed for scaling down by removing cores. The iGPU is stretched from top to bottom without leaving empty space below it like in SB and IB. This is one reason why the iGPU is narrower than the iGPU in Ivy Bridge although it has more EUs.

Interesting. So it looks like at least GT1 variants will be GT2 with disabled parts.

mrob27 · Sep 22, 2012

IntelUser2000 said:
Interesting. So it looks like at least GT1 variants will be GT2 with disabled parts.

If true, that would agree with the CPU World article listing Haswell die and package variants which was met with some skepticism in this thread last month over the large die needed for 4 cores + GT3 and the inclusion of a single-channel memory controller variant. (I however believe it's pretty accurate).

BenchPress · Sep 22, 2012

Ajay said:
If someone here is actually Nicolas Capens, please note I mean no disrespect.

That would be me. But why would I feel disrespected? It's all just theories and I would rather be wrong so that Haswell has no noteworthy compromises!

However, my response to David and the many other discussions in the last couple of days might clarify that it's not exactly obvious what the implications of Haswell's wide architecture are. There could still be some surprises. Most people at RWT didn't anticipate gather support and dual FMA to be feasible or likely... I sure don't mean to disrespect anyone for bringing that up. It's just that when you discuss things at this level of detail, being wrong happens to the best of us. Lastly, it's also easier to argue against someone else's theories than to come up with your own and defend them. But I've learned a lot by thinking outside the box. So again, I won't feel bad about being wrong.

I wish RWT had nice forums like AT. Digging through them is a PITA!

Amen to that.

One thing is clear to me after reading many of these discussions: I'll have a heart attack if I ever need to understand the x86-64 ISA well enough to program at the assembly level. No wonder AVX was underutilized.

Actually the hardware's out-of-order scheduling is insanely powerful nowadays. So you typically don't have to worry about reordering your instructions, and you can just concentrate on using as few as possible and avoiding the expensive ones. Especially with vector instructions it's easy to still beat the compilers. With AVX2 the compilers can successfully auto-vectorize a lot more code though. So the days of being able to speed up code by writing assembly are probably numbered, unless indeed you really delve into the finer details of how specific CPU architectures behave.

That said, knowing how to program in assembly is still incredibly useful to know how to write fast code in a high-level language. It's also invaluable for debugging.

Now I'm almost rooting for ARM, so we eventually get some nice high powered RISC CPUs again in the mainstream** :thumbsup:

For the sake of getting some competition, or because you think RISC is that much easier to program?

**I may go back to embedded development after completing my M.S.C.S. next year; fortunately most of that is still RISC based!

Awesome, good luck!

mrob27 · Sep 23, 2012

I noticed that the Intel Sweden version of the Ivy Bridge die photo shows a slightly different layout (for example, in the size of the gap between the cores and the L3 cache) and also has fewer interconnect layers. Compare it to the official Ivy Bridge die photo from April 2012, and you can see that the L3 cache memory blocks are much more symmetrical in the "Intel Sweden" photo. In the April photo, parts of the L3 cache are covered by something else: metal interconnect layers. My guess is that the yellow section in each L3 cache slice is part of the connection between the core's L2 cache and the L3 cache's controller.

To my eye, these recent Haswell wafer photos look more like the earlier Ivy Bridge photo, in that the cache blocks and other architectural features are more recognizable. So I've made a new labeled die photo comparing it to both of the Ivy Bridge photos:

craiggloyd · Jun 7, 2013

Intel Ivy Bridge still isn't over 50% faster clock-for-clock than a 3MB/core Penryn CPU.

For instance, the X9100 in my G50VT laptop, overclocked to 3.5 GHz at 1.325 V (100% stable):
Super Pi: 1M: 14.7s, 2M: 34s

Ivy Bridge i5-3210, at 3.1 GHz stock turbo:
Super Pi: 1M: 12.7s, 2M: 29s

So my 2009 laptop is still decent in real-world responsiveness. Rough estimate: Ivy Bridge is ~33-50% faster clock-for-clock in floating point than Penryn. Of course, with multitasking/SMP it would be pwned.
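As a rough check on the clock-for-clock figure, here is a minimal Python sketch that normalizes the Super Pi times by clock speed. It assumes both chips actually held the stated clocks (3.5 GHz and 3.1 GHz) for the whole run and ignores memory speed and everything else; the function name is made up for illustration.

```python
def per_clock_speedup(old_time_s, old_ghz, new_time_s, new_ghz):
    """Ratio of clock cycles spent on the same job.

    seconds * GHz is proportional to cycles, so a ratio of 1.30 means
    the new chip needs about 30% fewer... rather, does about 30% more
    work per cycle than the old one.
    """
    return (old_time_s * old_ghz) / (new_time_s * new_ghz)


# (Penryn X9100 seconds, Ivy Bridge i5 seconds) from the runs above.
runs = {"1M": (14.7, 12.7), "2M": (34.0, 29.0)}

for name, (penryn_s, ivy_s) in runs.items():
    ratio = per_clock_speedup(penryn_s, 3.5, ivy_s, 3.1)
    print(f"Super Pi {name}: Ivy Bridge ~{(ratio - 1) * 100:.0f}% faster per clock")
```

Under those assumptions the two runs work out to roughly 31% and 32%, which is near the low end of the ~33-50% estimate.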

From reading some of the Haswell reviews, it is only an improvement of about 10% clock-for-clock versus Ivy Bridge, so that means a 46 to 65% clock-for-clock single-threaded improvement versus Penryn.
Meh, still not worth the upgrade.
I understand the new instructions are a benefit, but in the real world, with mainstream software, responsiveness comes down to clock-for-clock computational power.
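The 46 to 65% range comes from compounding the two gains rather than adding them. A minimal Python sketch, taking the ~33-50% and ~10% figures above as given (the function name is made up for illustration):

```python
def compound(gain_a, gain_b):
    """Combine two fractional speedups applied one after the other."""
    return (1 + gain_a) * (1 + gain_b) - 1


HASWELL_OVER_IVY = 0.10  # ~10% per clock, as quoted from the reviews

for ivy_over_penryn in (0.33, 0.50):
    total = compound(ivy_over_penryn, HASWELL_OVER_IVY)
    print(f"Ivy +{ivy_over_penryn:.0%} over Penryn -> Haswell +{total:.0%} over Penryn")
```

So 1.33 x 1.10 gives about 46%, and 1.50 x 1.10 gives 65%.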

Tasks such as viewing and editing large, complex Word documents or large PDFs, and even explorer.exe when deleting a lot of files off an SSD or searching for files, often peg one CPU core. There are lots more examples.

I guess I'll wait for Skymont with crazy integrated graphics

MrDudeMan · Jun 7, 2013

craiggloyd said:
Intel Ivy Bridge still isn't over 50% faster clock-for-clock than a 3MB/core Penryn CPU.

For instance, the X9100 in my G50VT laptop, overclocked to 3.5 GHz at 1.325 V (100% stable):
Super Pi: 1M: 14.7s, 2M: 34s

Ivy Bridge i5-3210, at 3.1 GHz stock turbo:
Super Pi: 1M: 12.7s, 2M: 29s

So my 2009 laptop is still decent in real-world responsiveness. Rough estimate: Ivy Bridge is ~33-50% faster clock-for-clock in floating point than Penryn. Of course, with multitasking/SMP it would be pwned.

From reading some of the Haswell reviews, it is only an improvement of about 10% clock-for-clock versus Ivy Bridge, so that means a 46 to 65% clock-for-clock single-threaded improvement versus Penryn.
Meh, still not worth the upgrade.
I understand the new instructions are a benefit, but in the real world, with mainstream software, responsiveness comes down to clock-for-clock computational power.

Tasks such as viewing and editing large, complex Word documents or large PDFs, and even explorer.exe when deleting a lot of files off an SSD or searching for files, often peg one CPU core. There are lots more examples.

I guess I'll wait for Skymont with crazy integrated graphics

The vast majority of "real-world" users have a very different opinion than you. 33-50% faster isn't enough for you, but that's a huge improvement for the majority of users.

craiggloyd · Jun 7, 2013

Not very much improvement for 4 to 5 years of time Penryn<->Haswell

MrDudeMan · Jun 7, 2013

craiggloyd said:
Not very much improvement for 4 to 5 years of time Penryn<->Haswell

If you really believe that's the only improvement, then there's nothing left to discuss.

craiggloyd · Jun 7, 2013

I know the power consumption is much better, multitasking is much better, and it has some new instructions and a good onboard GPU, whereas a Penryn has no integrated GPU, NB, or memory controller, no 2nd die VRM, and higher heat. But Haswell's weakness is how little single-core performance has improved over the generations.
Maybe after Broadwell, when the SB goes into the CPU package, we'll see more improvements, because they'll be done integrating more components into the package, unless they keep aggressively increasing integrated graphics performance by taking up more space on the die.

Revolution 11 · Jun 7, 2013

craiggloyd said:
I know the power consumption is much better, multitasking is much better, and it has some new instructions and a good onboard GPU, whereas a Penryn has no integrated GPU, NB, or memory controller, no 2nd die VRM, and higher heat. But Haswell's weakness is how little single-core performance has improved over the generations.
Maybe after Broadwell, when the SB goes into the CPU package, we'll see more improvements, because they'll be done integrating more components into the package, unless they keep aggressively increasing integrated graphics performance by taking up more space on the die.

Prepare to be disappointed, as Intel is not giving up on integrated graphics. Also, you can't really increase clocks much more, as we hit a power/heat wall more than a decade ago with the Pentium 4. Core size can't be increased much more without hitting diminishing returns. Core count can't be increased without running into Amdahl's Law. IPC is already into diminishing returns unless you use new instructions.

New instructions have been the primary form of innovation from Penryn to Haswell. Core size is not much bigger, core count is almost the same, clocks are not much faster, IPC is up by a good deal (but we can't count on continuous increases), and memory bandwidth is better (a one-time trick).

But we do have AVX, AVX2, FMA, and TSX (the latter only on certain Haswell SKUs, sadly).
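For anyone unfamiliar with Amdahl's Law, here is a minimal Python sketch of it. The 50% and 90% parallel fractions are made-up illustrative values, not measurements of any real workload.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's Law: speedup = 1 / ((1 - p) + p / n).

    parallel_fraction (p) is the share of the work that can be spread
    across cores; the remaining (1 - p) stays serial.
    """
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)


for p in (0.5, 0.9):  # hypothetical parallel fractions
    for n in (2, 4, 8, 16):
        print(f"p={p:.0%}, {n:2d} cores: {amdahl_speedup(p, n):.2f}x")
```

With half the work serial, no number of cores can get past 2x, which is why piling on cores does little for lightly threaded desktop software.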

craiggloyd · Jun 7, 2013

" Core size can't be increased much more without hitting diminishing returns."
But luckily transistors are still shrinking, so we can have more powerful execution units with each die shrink? Except all these components being integrated onto the die and into separate dies in the package (NB, memory controller, GPU, VRM, SB) are slowing this growth.

Maybe you could help me understand something. Why can't they take all the execution units from a quad-core die and put them into 2 larger cores? Yes, you might have problems keeping all the execution units busy, but then each core would be much more powerful.

Am I right that they have increased the number of cores in order to keep execution units more fully utilized when CPU usage is at 100%, along with hyperthreading to help?

The reason why they don't make dual cores that are as powerful as quad cores is because there is a lack of efficiency?

Because I wouldn't mind giving up a few cores for much better single threaded performance.

VirtualLarry · Jun 7, 2013

craiggloyd said:
" Core size can't be increased much more without hitting diminishing returns."
But luckily transistors are still shrinking, so we can have more powerful execution units with each die shrink? Except all these components being integrated onto the die and into separate dies in the package (NB, memory controller, GPU, VRM, SB) are slowing this growth.

Maybe you could help me understand something. Why can't they take all the execution units from a quad-core die and put them into 2 larger cores? Yes, you might have problems keeping all the execution units busy, but then each core would be much more powerful.

Am I right that they have increased the number of cores in order to keep execution units more fully utilized when CPU usage is at 100%, along with hyperthreading to help?

The reason why they don't make dual cores that are as powerful as quad cores is because there is a lack of efficiency?

Because I wouldn't mind giving up a few cores for much better single threaded performance.

Making the cores "wider" is limited by the amount of ILP (instruction-level parallelism) in the code. Haswell already has 8 execution ports (up from 6 in Sandy Bridge and Ivy Bridge).

So what you are asking, is already somewhat true in Haswell. They did make the cores "wider".
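To illustrate why width alone stops helping, here is a toy Python sketch. It is not a model of Haswell: it assumes every instruction takes one cycle, an unlimited out-of-order window, and a made-up dependency list; the function name is hypothetical.

```python
def cycles_needed(deps, width):
    """Cycles to run a small program on a core issuing `width` ops/cycle.

    deps[i] lists the earlier instructions that instruction i must wait
    for. Every instruction has a latency of one cycle.
    """
    done_cycle = {}                      # instruction index -> cycle it ran in
    remaining = list(range(len(deps)))
    cycle = 0
    while remaining:
        cycle += 1
        issued = []
        for i in remaining:
            if len(issued) == width:
                break
            # Ready only if every input finished in an earlier cycle.
            if all(d in done_cycle for d in deps[i]):
                issued.append(i)
        for i in issued:
            done_cycle[i] = cycle
        remaining = [i for i in remaining if i not in done_cycle]
    return cycle


# Two independent chains of four dependent instructions each (ILP of 2).
program = [[], [0], [1], [2], [], [4], [5], [6]]

for width in (1, 2, 4, 8):
    print(f"width {width}: {cycles_needed(program, width)} cycles")
```

In this example, going from width 1 to width 2 halves the cycle count, but width 4 or 8 gains nothing more, because the code only ever has two independent instructions ready at a time.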
