AVX2, FMA, TSX in Haswell

SunRe

Member
Dec 16, 2012
Ok, so actually what is the performance gain once programs start taking advantage of the new instructions Haswell brings?

I am aware it's a broad question, but let's say we limit ourselves to:
- rendering
- image processing software (Photoshop)
- cryptography
- video encoding

Are there any resources or figures on the web about it? I couldn't find much.

This is not purely academic: I am trying to decide whether to jump to a second-hand Ivy/Sandy Bridge or wait and buy a Haswell. Given the performance delta leaked so far I'd go with Ivy, so it comes down to the new instruction sets.

Thanks!
 

Exophase

Diamond Member
Apr 19, 2012
AVX2 should help all of these things a fair amount, if properly coded for. FMA will be able to help at least the first two. The latter two are likely to be more integer-centric and probably unaffected by the FMA units.

There's a good writeup of the impact AVX2 optimizations have had so far on x264:

http://www.scribd.com/doc/137419114/Introduction-to-AVX2-optimizations-in-x264

The gains are pretty huge in a lot of the functions, with the caveat that about half the run-time is spent in other scalar-ish code that won't be improved by it. That caps the overall improvement at something probably under 50%. Note that this code base involves a lot of hand-optimized assembly, at a level most projects won't undertake. It remains to be seen how much compilers will be able to leverage the new instructions on their own.
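To put rough numbers on that cap, Amdahl's law is the usual tool: if only a fraction of the run-time gets faster, the untouched remainder limits the whole program. Below is a minimal Python sketch, assuming exactly half the run-time is vectorizable (a round figure taken from the paragraph above). The per-kernel speedups are made-up illustrative values, not x264 measurements, and `overall_speedup` is a hypothetical helper name.

```python
def overall_speedup(accelerated_fraction, local_speedup):
    """Amdahl's law: whole-program speedup when only part of the run-time gets faster.

    accelerated_fraction: share of the original run-time that benefits (0..1).
    local_speedup: how many times faster that share becomes.
    """
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (untouched + accelerated_fraction / local_speedup)


# Assumption: 50% of the run-time is in AVX2-friendly kernels.
# The kernel speedups below are illustrative, not measured.
for local in (1.5, 2.0, 4.0, float("inf")):
    total = overall_speedup(0.5, local)
    saved = (1.0 - 1.0 / total) * 100.0
    print(f"kernels {local}x faster: program {total:.2f}x faster, run-time down {saved:.0f}%")
```

Even in the limit where the vectorized half costs nothing, the run-time can only drop by 50%. With more plausible kernel speedups of 1.5x to 2x, the whole program ends up 1.2x to 1.33x faster under these assumptions.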

I don't have a good idea of how much TSX will help with these tasks, if at all, but I doubt you'll see a lot of this sort of code using it any time soon, especially if it's not available on some of the common higher-end products.
 

ShintaiDK

Lifer
Apr 22, 2012
I don't think TSX will benefit these at all, since those apps already scale well across threads. But as said, AVX2 especially will be a big one, and Haswell has 256-bit data paths to handle it at full speed.
 

A5

Diamond Member
Jun 9, 2000
It's hard to throw out overall numbers. Certain parts of the code will benefit massively from these instructions, but the overall gain depends heavily on how much of the program consists of that kind of code.
 

mikk

Diamond Member
May 15, 2012
Around +10% is a realistic target for x264 AVX2 encoding according to some x264 devs.
 

Olikan

Platinum Member
Sep 23, 2011
ShintaiDK said:
I don't think TSX will benefit these at all, since those apps already scale well across threads. But as said, AVX2 especially will be a big one, and Haswell has 256-bit data paths to handle it at full speed.

Well, modern games are using 6-8 cores, and consoles will be using more than 6 all the time.
IMO TSX will shine.
 

psyq321

Junior Member
Jun 18, 2012
I managed to extract a ~22% clock-for-clock speedup over the Ivy Bridge generation using AVX2 (including both FMA and gather instructions).

Some benchmarks (neural network simulation) are available here:

http://www.digicortex.net/node/41 - Test #2 is relevant here

Please note that I had nothing other than a 3720QM (15" Retina MacBook Pro) to compare against the 4770 (non-K), so even with CPU frequency and DRAM speed equalized, there is still a difference in LLC size (6 MB vs. 8 MB).

However, the benchmark is heavily memory bound, so cache size should not matter too much. Even though the CPU spends most of its time waiting for data to arrive from DRAM, clock-for-clock gains are approx. 22%.

22% in these conditions is nothing short of impressive - I would say that Haswell delivers considerable gains when the software is able to use AVX2 features.
 

cytg111

Lifer
Mar 17, 2008
psyq321 said:
22% in these conditions is nothing short of impressive - I would say that Haswell delivers considerable gains when the software is able to use AVX2 features.

While 5-10% will come from generic IPC gains, it is still pretty amazing, and only the tip of the iceberg. Nice! (Especially given the context of the implementation.)
 

bronxzv

Senior member
Jun 13, 2011
psyq321 said:
22% in these conditions is nothing short of impressive - I would say that Haswell delivers considerable gains when the software is able to use AVX2 features.

Were the baseline scores on the Ivy Bridge setup obtained with AVX code, or only with SSEx code?
 

grimpr

Golden Member
Aug 21, 2007
Olikan said:
Well, modern games are using 6-8 cores, and consoles will be using more than 6 all the time.
IMO TSX will shine.

Don't tell him about more than 4 cores and the benefits of TSX, look at this CPU. :biggrin:
 

psyq321

Junior Member
Jun 18, 2012
bronxzv said:
Were the baseline scores on the Ivy Bridge setup obtained with AVX code, or only with SSEx code?

The baseline on Ivy Bridge was already running AVX code, so the 22% increase is due only to the jump to Haswell plus the AVX2 optimizations (and the 2 MB of extra LLC, as I said, but that is probably not affecting the tests significantly).

I will also benchmark against SSE4 code-path tonight.
 

NTMBK

Lifer
Nov 14, 2011
Olikan said:
Well, modern games are using 6-8 cores, and consoles will be using more than 6 all the time.
IMO TSX will shine.

Sadly Intel has disabled TSX on the chips aimed at hardcore gamers, so I doubt many game developers will bother with it.
 

SunRe

Member
Dec 16, 2012
psyq321 said:
I managed to extract a ~22% clock-for-clock speedup over the Ivy Bridge generation using AVX2 (including both FMA and gather instructions).

Some benchmarks (neural network simulation) are available here:

http://www.digicortex.net/node/41 - Test #2 is relevant here

Impressive, nice work, thanks for sharing this.

NTMBK said:
Sadly Intel has disabled TSX on the chips aimed at hardcore gamers, so I doubt many game developers will bother with it.

This makes me think that I'd really consider a Xeon with eDRAM in it and, of course, all the features activated, unlike the K parts.
 

ShintaiDK

Lifer
Apr 22, 2012
SunRe said:
Impressive, nice work, thanks for sharing this.

This makes me think that I'd really consider a Xeon with eDRAM in it and, of course, all the features activated, unlike the K parts.

There are no Xeons with eDRAM.
 

SunRe

Member
Dec 16, 2012
I know, it's just that looking at some of the 4950HQ benchmarks got me thinking. I think a Xeon with eDRAM would make sense for certain tasks and would convince me not to go with the K part.

However, an eDRAM part with clocks within 100-200 MHz of the 4770K is highly unlikely.
 

SunRe

Member
Dec 16, 2012
6 MB cache, 300 MHz lower base clock, still no TSX. What's this craze with disabling TSX?

Nice find anyway, I forgot about it. Looking forward to seeing a motherboard with this on it.
 

NTMBK

Lifer
Nov 14, 2011
SunRe said:
6 MB cache, 300 MHz lower base clock, still no TSX. What's this craze with disabling TSX?

Nice find anyway, I forgot about it. Looking forward to seeing a motherboard with this on it.

Actually, the 6 MB L3 cache is common to all parts with Crystalwell. I'm not sure if the "missing" 2 MB is being used as tags for the eDRAM, or something like that. (Or it could be Intel playing silly buggers as per usual; see also TSX.)
 

bronxzv

Senior member
Jun 13, 2011
psyq321 said:
The baseline on Ivy Bridge was already running AVX code, so the 22% increase is due only to the jump to Haswell plus the AVX2 optimizations.

Thank you. It will be interesting to see how much of your 22% speedup is due to the increased IPC when running the very same AVX code path on Ivy Bridge and Haswell, and then how much extra speedup AVX2 gives over AVX on Haswell.
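One way to frame that split: if the two effects are independent, they multiply, so total speedup = (IPC factor) x (AVX2 factor). Here is a rough Python sketch, assuming the ~22% total reported above and the 5-10% generic IPC range mentioned earlier in the thread. Neither component has been measured separately here, so the outputs are illustrative only, and `remaining_factor` is a hypothetical helper name.

```python
def remaining_factor(total_speedup, known_factor):
    """Divide a known multiplicative factor out of a total speedup.

    Assumes the two contributions are independent and therefore multiply:
    total_speedup = known_factor * remaining_factor.
    """
    return total_speedup / known_factor


total = 1.22  # ~22% clock-for-clock gain quoted in the thread

# Assumption: generic IPC gain somewhere between 5% and 10% (not measured here).
for ipc in (1.05, 1.10):
    avx2 = remaining_factor(total, ipc)
    print(f"IPC x{ipc:.2f} leaves x{avx2:.3f} attributable to AVX2 over AVX")
```

Under those assumptions, AVX2 itself would account for roughly 11% to 16% on top of the IPC gain. Running the same AVX code path on both CPUs, as suggested above, would replace the assumed IPC factor with a measured one.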

I would also be interested to learn whether the gather instructions provide a speedup in your case, since in my own code they are slightly slower than an optimized software-synthesized gather. I have now commented out all my gather code paths.
 

TuxDave

Lifer
Oct 8, 2002
bronxzv said:
I would also be interested to learn whether the gather instructions provide a speedup in your case, since in my own code they are slightly slower than an optimized software-synthesized gather. I have now commented out all my gather code paths.

Out of curiosity, what software do you develop and do you regularly compile on every Intel release or just on the tocks?
 