Intel "Haswell" Speculation thread


BenchPress

Senior member
Nov 8, 2011
392
0
0
That's the keyword. In practice we'll get nowhere near that, or do you think otherwise?
I do. Haswell doesn't just add FMA support, but will also double the bandwidth, add gather support (replacing 18 legacy instructions with 1), add 256-bit integer support (replacing 3 instructions with 1), and add a whole bunch of other useful instructions. So in practice the effective throughput for parallel algorithms should easily double.

Between Nehalem and Haswell there will be a fourfold increase in peak throughput, but Sandy Bridge (which doubled the peak floating-point throughput) was severely held back by low bandwidth and a lack of many other 256-bit operations. So even though there won't be a fourfold increase in performance over Nehalem in practice, it's easy to see that since Haswell solves all of Sandy Bridge's shortcomings, it has lots of headroom to at least double the throughput in practice.
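As a rough sanity check of the peak numbers in this post, here is a minimal Python sketch. It assumes single-precision lanes, one MUL plus one ADD vector unit per core for Nehalem and Sandy Bridge, and the two 256-bit FMA units speculated in this thread for Haswell. The function name and the per-core unit counts are illustrative assumptions, not published specifications.

```python
def peak_sp_flops_per_cycle(vector_bits, units, flops_per_lane):
    """Peak single-precision FLOPs per cycle for one core.

    vector_bits    -- width of each vector unit in bits
    units          -- vector arithmetic units that can issue each cycle
    flops_per_lane -- 1 for a plain MUL or ADD, 2 for a fused multiply-add
    """
    lanes = vector_bits // 32  # 32-bit single-precision lanes per unit
    return lanes * units * flops_per_lane


# Assumed per-core configurations (illustrative only).
nehalem = peak_sp_flops_per_cycle(128, units=2, flops_per_lane=1)       # MUL + ADD
sandy_bridge = peak_sp_flops_per_cycle(256, units=2, flops_per_lane=1)  # MUL + ADD
haswell = peak_sp_flops_per_cycle(256, units=2, flops_per_lane=2)       # FMA + FMA (speculated)

print(nehalem, sandy_bridge, haswell)
print("Haswell vs. Nehalem:", haswell / nehalem)
```

These are peak figures only; whether real code gets close depends on the bandwidth and instruction-set points discussed in the thread.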
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
So let me get this straight, you are going to wait another year, until intel releases a chip with FMA, then praise them for it. Meanwhile AMD will have had a chip out with FMA4 for 1 1/2 years, and one with FMA3 for well over 6 months....
You seem to forget that Sandy Bridge increased the peak floating-point throughput too. In practice a Sandy Bridge core is more powerful than a Bulldozer module, since it can execute a 256-bit multiplication and 256-bit addition each cycle, while Bulldozer can only execute either a 256-bit multiplication, addition, or fused multiply-add.

So when I said AMD is stagnating I meant after Bulldozer (and that generation isn't in AMD's favor either), while Intel keeps pushing forward with 2 x 256-bit FMA, 256-bit integer, gather, TSX, etc. So I don't see any need to talk about AMD in this thread unless they present a true adversary to Haswell.
 

inf64

Diamond Member
Mar 11, 2011
3,864
4,546
136
Haswell is very nice on paper but let's see those FMA units in practice before we make "double throughput" statements. Also let's see if each Haswell core can sustain 2x256bit load and 1x256bit store operations per second and by which magic will intel extend the datapaths without blowing up the die size compared to IB (note that Haswell will also have more die area dedicated to the new GPU compared to IB; the leaked die shots of SB/IB/Haswell chips from Chiphell only show that the pictured version of Haswell with the new GPU is just slightly larger than the QC IB chip. By which magic, I might ask?)

 

RussianSensation

Elite Member
Sep 5, 2003
19,458
765
126
inf64, that's not slightly larger, but a lot. I measured roughly with a ruler, and the Haswell and IVB dies have similar height, with IVB having only 86% of the length of Haswell. So it looks like the extra area for the GPU is there. What is more interesting is that Intel was selling the Nehalem i7 920 for as low as $284 or so, and its die is huge compared to Haswell. No wonder Intel's gross margins remain so high. We are getting less die area from Intel/AMD and NV. We used to get more die area per dollar (perhaps Intel could have fit a 6-core Haswell into Nehalem's die size and sold it to us for $284), but not with so little competition from Bulldozer/Piledriver.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
inf64, that's not slightly larger, but a lot. I measured roughly with a ruler, and the Haswell and IVB dies have similar height, with IVB having only 86% of the length of Haswell. So it looks like the extra area for the GPU is there. What is more interesting is that Intel was selling the Nehalem i7 920 for as low as $284 or so, and its die is huge compared to Haswell. No wonder Intel's gross margins remain so high. We are getting less die area from Intel/AMD and NV. We used to get more die area per dollar (perhaps Intel could have fit a 6-core Haswell into Nehalem's die size and sold it to us for $284), but not with so little competition from Bulldozer/Piledriver.



It's missing a QC 32nm Westmere die though, something that would have been about the size of IB.
 

Lonbjerg

Diamond Member
Dec 6, 2009
4,419
0
0
no i understand what it means.

But can you really say we went up double since the highest line last year?

I don't think you understand.
What does the number of transistors on die tell about performance?
You are confusing the two metrics... make your own law... don't try to bend Moore's law to fit your views...
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
Haswell is very nice on paper but let's see those FMA units in practice before we make "double throughput" statements. Also let's see if each Haswell core can sustain 2x256bit load and 1x256bit store operations per second and by which magic will intel extend the datapaths without blowing up the die size compared to IB (note that Haswell will also have more die area dedicated to the new GPU compared to IB; the leaked die shots of SB/IB/Haswell chips from Chiphell only show that the pictured version of Haswell with the new GPU is just slightly larger than the QC IB chip. By which magic, I might ask?)


Oh I'm pretty sure it can do more than 2x256bit loads and 1x256bit stores per second.

/just trolling ya.
 

Mars999

Senior member
Jan 12, 2007
304
0
0
From what I have run across, Haswell is only going to be 10% faster per clock vs. IB. IMO that sucks balls, I was hoping for more like a 25-30% bump... Ah well, have to wait and see... I know AVX2 (and TSX?) is going to be a nice jump, but the 10% is the actual CPU-to-CPU speed increase at the same clock speeds.
 

StrangerGuy

Diamond Member
May 9, 2004
8,443
124
106
From what I have run across, Haswell is only going to be 10% faster per clock vs. IB. IMO that sucks balls, I was hoping for more like a 25-30% bump... Ah well, have to wait and see... I know AVX2 (and TSX?) is going to be a nice jump, but the 10% is the actual CPU-to-CPU speed increase at the same clock speeds.

Methinks you are the only one who would say a +10% IPC gain sucks.

Anyway I'm more interested to see how simplified the Haswell mobos are rather than the chip itself. <$50 mobos baby!
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
@ TuxDave
Obviously that was a typo, I meant per cycle.

Shintai, the estimated die size for QC Haswell is 20mm^2 larger than QC IB. The difference in size is mostly due to the much larger GPU.

There are no QC ES0 samples, and ES1 didn't arrive before late May this year. That pic is from what, Sept 2011?

Even if we say the ES0 was a QC with 2 cores disabled, it wasn't here until Feb 2012. Before that, only DC existed.



Leaked Haswell screenshot is also a dualcore:
 

inf64

Diamond Member
Mar 11, 2011
3,864
4,546
136
Then how big will the quad core be in the version with the fastest (GT3?) GPU? Is it viable for Intel to make a desktop performance part which will be almost 2x bigger (2x more cores than ES0 plus more die area for the GT3) than the IB QC we have today? Even in this ES0 version the die area is 15% larger than QC IB.

I still think, judging by the wording Intel has used in the slide, that ES0 is a QC die with the 2nd fastest GPU and 2 cores disabled.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
Even then, it didn't exist in Sept 2011.

If we judge by SB: dual-cores were 149mm2, quad-cores 216mm2.

Why would it be 2x of IB QC today? You make zero sense in your estimates.
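A quick sketch of the scaling argument, taking the Sandy Bridge die sizes quoted in this post at face value. The variable names are illustrative, and treating Sandy Bridge as a guide for Haswell is an assumption.

```python
# Sandy Bridge die areas as quoted in the post (mm^2).
SB_DUAL_CORE_MM2 = 149
SB_QUAD_CORE_MM2 = 216

# Area actually added by going from two cores to four.
extra_area = SB_QUAD_CORE_MM2 - SB_DUAL_CORE_MM2

# What a naive "twice the cores, twice the die" guess would have predicted.
naive_doubling = 2 * SB_DUAL_CORE_MM2

ratio = SB_QUAD_CORE_MM2 / SB_DUAL_CORE_MM2

print("extra area for two more cores:", extra_area, "mm^2")
print("naive doubling estimate:", naive_doubling, "mm^2")
print("actual quad/dual ratio: %.2f" % ratio)
```

The GPU, memory controller, and I/O are shared, so only the cores (and their cache slices) get duplicated, which is why the quad-core die lands well short of double.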
 

inf64

Diamond Member
Mar 11, 2011
3,864
4,546
136
OK, not exactly 2x larger, but plenty larger. After all, if what you say is true and that is a 2C part, then the 4C part will have 2x more cores (larger than IB cores) and the top-of-the-range GPU (again, larger die area). This is only speculation based on the premise that you are correct in saying the 185mm^2 die is a native dual-core part.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
OK, not exactly 2x larger, but plenty larger. After all, if what you say is true and that is a 2C part, then the 4C part will have 2x more cores (larger than IB cores) and the top-of-the-range GPU (again, larger die area). This is only speculation based on the premise that you are correct in saying the 185mm^2 die is a native dual-core part.

If the DC part is a GT3, then a QC GT3 won't be much bigger.

Haswell basically reuses IB's EUs. They're generation 7+ EUs.

As shown before, Intel's high-end desktop chips tend to swing between 150mm2 and 300mm2.
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
Haswell is very nice on paper but let's see those FMA units in practice before we make "double throughput" statements.
There simply is no other option. They announced that Haswell's FMA support "significantly increases peak flops". Since they currently have a MUL + ADD unit, this could mean FMA + ADD, FMA + MUL, or FMA + FMA units for Haswell. But the first two options can actually lower performance due to port contention. Hence two FMA vector units is the only viable configuration.
Also let's see if each Haswell core can sustain 2x256bit load and 1x256bit store operations per second and by which magic will intel extend the datapaths without blowing up the die size compared to IB...
It won't blow up the die size at all. Sandy Bridge already added 256-bit floating-point data paths, wider (and more) registers, VEX decoding support, a uop cache, more ROB and scheduler entries, etc. and yet the core size only increased by a measly 3.5% versus Westmere.

Also note that while a quad-core Haswell CPU would have a total of 64 scalar FMA 'compute units', a Xeon Phi can have up to 1000 such units. That's on a larger chip and at a lower clock frequency, but still, it shows that these units are quite tiny and so expecting Haswell to have 2 x 256-bit FMA units and other throughput computing features really doesn't require any "magic".
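The lane count in this post can be reproduced with a few lines of Python. The Haswell configuration is the two 256-bit FMA units per core speculated above; the Xeon Phi configuration (60 cores, one 512-bit vector unit each) is an assumption chosen for illustration and simply lands near the "up to 1000" figure.

```python
def scalar_fma_lanes(cores, fma_units_per_core, vector_bits, element_bits=32):
    """Count single-precision FMA lanes across a whole chip."""
    lanes_per_unit = vector_bits // element_bits
    return cores * fma_units_per_core * lanes_per_unit


# Quad-core Haswell with 2 x 256-bit FMA units per core (speculated).
haswell_quad = scalar_fma_lanes(cores=4, fma_units_per_core=2, vector_bits=256)

# Assumed Xeon Phi configuration, for comparison only.
xeon_phi_assumed = scalar_fma_lanes(cores=60, fma_units_per_core=1, vector_bits=512)

print(haswell_quad, xeon_phi_assumed)
```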
 

inf64

Diamond Member
Mar 11, 2011
3,864
4,546
136
Intel managed to get by in SB/IB by dual-purposing the existing load/store address ports in order to achieve the L/S bandwidth needed to feed the execution units. Moreover, they reused the existing integer SIMD units (a neat trick) in order to not blow up die size while delivering "true" 256-bit AVX functionality. They will most definitely do the same for AVX2 and 256-bit integer SIMD execution, but what will they do with the FMA units? How will they achieve 2x256-bit load bandwidth with Haswell? They cannot reuse the existing load ports again, so they need to expand the L/S unit (SB can do 3 memory requests per cycle: 2 loads of up to 16 bytes each and 1 store of up to 16 bytes; Haswell will need 2x32 bytes from L1 for true 256-bit FMA units, if I'm not mistaken).
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
From what I have run across, Haswell is only going to be 10% faster per clock vs. IB.
It is claimed that IPC will be 10% higher. That doesn't mean it will only be 10% faster. IPC (instructions per clock) is just one factor which determines performance. Not so long ago people were judging CPUs by clock frequency alone, but with all due respect it's equally stupid to judge them by IPC alone.

Haswell will double the vector throughput, reduce the multi-threading overhead, and add nifty bit manipulation instructions. Sure, each of these requires updated software to benefit from it, but the kind of software that can benefit from a speedup (on top of the 10% IPC improvement) is frequently updated anyway!
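To illustrate why a 10% IPC gain is not the whole story, here is a minimal Amdahl-style sketch. It assumes the IPC gain applies to all of the work and that a separate speedup applies only to the fraction of run time spent in vectorizable code. Combining the two multiplicatively is a simplifying assumption, and the fractions used are made-up inputs, not measurements.

```python
def estimated_speedup(ipc_gain, vector_fraction, vector_speedup):
    """Estimate overall speedup over the previous generation.

    ipc_gain        -- general IPC improvement, e.g. 0.10 for 10%
    vector_fraction -- share of baseline run time in vectorizable code (0..1)
    vector_speedup  -- extra speedup for that share, e.g. 2.0 for doubled throughput
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    # Amdahl's law for the part that benefits from the wider/fused vector units.
    amdahl = 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_speedup)
    return (1.0 + ipc_gain) * amdahl


for fraction in (0.0, 0.5, 1.0):
    print(fraction, round(estimated_speedup(0.10, fraction, 2.0), 3))
```

With no vectorizable work the result is just the 10% IPC gain; the more time a program spends in updated vector code, the further it moves toward the peak figure.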
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
Intel managed to get by in SB/IB by dual-purposing the existing load/store address ports in order to achieve the L/S bandwidth needed to feed the execution units. Moreover, they reused the existing integer SIMD units (a neat trick) in order to not blow up die size while delivering "true" 256-bit AVX functionality. They will most definitely do the same for AVX2 and 256-bit integer SIMD execution...
Indeed.
...but what will they do with the FMA units? How will they achieve 2x256-bit load bandwidth with Haswell? They cannot reuse the existing load ports again, so they need to expand the L/S unit (SB can do 3 memory requests per cycle: 2 loads of up to 16 bytes each and 1 store of up to 16 bytes; Haswell will need 2x32 bytes from L1 for true 256-bit FMA units, if I'm not mistaken).
First of all, benefiting from FMA doesn't strictly depend on increased L/S bandwidth. You can for instance evaluate a long polynomial where all the coefficients are stored in registers and so you can get twice the performance with FMA without requiring more bandwidth.

That said, lots of algorithms of course benefit from higher memory bandwidth when the arithmetic throughput increases, so it is highly expected that AVX2 will be accompanied by twice the bandwidth. That can relatively easily be achieved by doubling the width of the L/S ports. They've done it before, namely with Core 2 (going from 64-bit to 128-bit). That was at 65 nm, so with Haswell at 22 nm it shouldn't be much of a problem to have 256-bit load/store ports.

The real question is whether or not they'll increase the cache line size, and if they do, whether they'll double the number of banks or double the width of each bank. These trade-offs affect unaligned accesses and bank conflicts. I also wonder if they'll use three AGUs or stick with two...
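The register-resident polynomial case mentioned earlier in this post can be sketched with Horner's scheme, where every step has the multiply-add shape `acc * x + c`. This Python version only illustrates the structure and the instruction counts; Python itself does not issue a fused instruction here, and both function names are hypothetical.

```python
def horner(coeffs, x):
    """Evaluate a polynomial; coeffs are ordered from highest to lowest degree."""
    acc = coeffs[0]
    for c in coeffs[1:]:
        # One multiply and one add per step. Hardware with FMA can fuse
        # this into a single instruction; no memory access is needed if
        # the coefficients already sit in registers.
        acc = acc * x + c
    return acc


def arithmetic_instructions(degree):
    """Arithmetic instruction count per evaluation, without and with FMA."""
    return {"separate_mul_add": 2 * degree, "fused": degree}


# 2x^2 - 3x + 1 evaluated at x = 4
print(horner([2.0, -3.0, 1.0], 4.0))
print(arithmetic_instructions(degree=10))
```

Halving the instruction count only turns into twice the throughput when enough independent evaluations are in flight (for example, one per vector lane), since a single Horner chain is limited by latency rather than by issue width.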
 

CyborgNewtype

Junior Member
Aug 20, 2012
1
0
0
If my estimates are correct and the rumors are true, Haswell's GT3 iGPU sporting 40 EUs will be faster than the Trinity APU and probably good competition for the next-gen APU, Kaveri.

HD 4000 and Llano were close on some benchmarks, and Trinity raised the bar over Llano by 20 to 50%. In Intel terms that would mean a 28~30 EU iGPU, correct? Which means that the next APU must make the same or a greater graphics jump than Trinity did to be competitive with GT3 (and Haswell in the graphics department)?

Am I right or wrong? Just looking to get a new laptop in the future and Haswell seems to be the right choice at this moment in time.
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
Haswell is very nice on paper but let's see those FMA units in practice before we make "double throughput" statements.

it was disclosed at IDF Spring 2012 that peak fp throughput will double with FMA, see slide 4 of BJ12_ARCS002_102_ENGf.pdf that you can download from intel.com/go/idfsessionsBJ
there is also an actual speedup value for FMA vs SSE at slide 62
 

bronxzv

Senior member
Jun 13, 2011
460
0
71
expected that AVX2 will be accompanied by twice the bandwidth. That can relatively easily be achieved by doubling the width of the L/S ports.

and adding a new port

doubling the width of the existing ports will increase the peak L1D cache bandwidth by only about 1.33x (from 48B/clock to 64B/clock) with VEX.256 code
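The bytes-per-clock figures can be reproduced with a small sketch. It rests on one reading of the posts above, which is an assumption: Sandy Bridge sustains three 16-byte accesses per clock (two loads, one store), a design with only two address-generation ports can start at most two 32-byte accesses per clock, and adding a third port restores three accesses per clock at the wider size. The names are illustrative.

```python
def l1d_bytes_per_clock(accesses_per_clock, bytes_per_access):
    """Peak L1D bandwidth in bytes per clock."""
    return accesses_per_clock * bytes_per_access


# Sandy Bridge as described in the thread: 2 loads + 1 store, 16 bytes each.
current = l1d_bytes_per_clock(accesses_per_clock=3, bytes_per_access=16)

# Assumed: ports widened to 32 bytes, but still only two address ports.
widened_only = l1d_bytes_per_clock(accesses_per_clock=2, bytes_per_access=32)

# Assumed: ports widened to 32 bytes and a third address port added.
widened_plus_port = l1d_bytes_per_clock(accesses_per_clock=3, bytes_per_access=32)

print(current, widened_only, widened_plus_port)
print(widened_only / current, widened_plus_port / current)
```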
 

RussianSensation

Elite Member
Sep 5, 2003
19,458
765
126
From what I have run across, Haswell is only going to be 10% faster per clock vs. IB. IMO that sucks balls, I was hoping for more like a 25-30% bump... Ah well, have to wait and see... I know AVX2 (and TSX?) is going to be a nice jump, but the 10% is the actual CPU-to-CPU speed increase at the same clock speeds.

Why would you even imagine that? Maybe if it has massive overclocking headroom.

IPC

1) Core 2 Duo Conroe/Kentsfield (E6600/Q6600) --> Nehalem/Lynnfield (i7 920/i7 860) = 15-17.5% on average (I think Anandtech tested this during one of their i7 920 launches)
2) Nehalem/Lynnfield --> Sandy Bridge/IVB (i5 2500k/2600k/3770k) = 15-17.5% on average (maybe less sometimes) (I am giving IVB 2-3% increase in IPC over SB, and using 14% increase from i5 760 vs. i5 2500k at the same clocks)

The only time we've seen a major increase was Netburst to Conroe/Merom.

Maybe if you account for IPC + overclocking + new instructions, then it might be 25-30% faster than a 4.4GHz IVB. I would say 15% would be amazing, since that's what Intel has been able to get in the last 2 major generations (Conroe --> Nehalem --> Sandy Bridge). Anything on top is just insane.

I just hope they fix the thermal interface between the IHS and the CPU die, which limited IVB overclocking and caused high temperatures.
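To put the per-generation IPC figures listed in this post in perspective, here is a minimal sketch of how such gains compound. The 15% and 17.5% inputs are the estimates from the post, the 10% Haswell figure is the rumor discussed in this thread, and the function name is illustrative.

```python
def compound_gain(per_generation_gains):
    """Total fractional gain after applying each generation's gain in turn."""
    total = 1.0
    for gain in per_generation_gains:
        total *= 1.0 + gain
    return total - 1.0


# Conroe -> Nehalem -> Sandy Bridge/IVB, low and high ends of the estimate.
low = compound_gain([0.15, 0.15])
high = compound_gain([0.175, 0.175])

# Adding the rumored 10% for Haswell on top of the low estimate.
with_haswell = compound_gain([0.15, 0.15, 0.10])

print("two generations: %.1f%% to %.1f%%" % (low * 100, high * 100))
print("with rumored Haswell gain: %.1f%%" % (with_haswell * 100))
```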
 