AVX2 and FMA3 in games

Page 4
May 11, 2008
20,260
1,150
126
Well, you have me convinced, but not on one thing. It makes sense to me that Intel would only put effort into adding an optimized code path if AMD came up with the best-performing sequences of instructions and paid for the hours Intel has to put into updating the compiler.
I like AMD very much, but I think it is unreasonable to ask Intel to support competitors like AMD and VIA out of the blue.
Everybody is still in the business of making money, so hours spent must be paid.
Now, if AMD came up with the money and the optimization manual and Intel still deliberately produced inferior code that did not comply with what AMD requested, that would be a whole different story. That would be illegal.
 
May 11, 2008
20,260
1,150
126
And, really, the bottom line of all this is the apparent fact that Intel eventually chose to make its compiler efficient for AMD CPUs instead of artificially hobbling it with the GenuineIntel check, after the outcry happened and the damage to AMD was done.

Intel's compiler now apparently produces the best-performing code for AMD processors. Intel didn't do that out of the goodness of its heart: it sells that compiler for significant money, and companies tend to change bad policy when it comes into the spotlight enough.

It won the war, though, of course. It doesn't look like AMD is going to try to innovate in the instruction-set area anytime soon after being burned with SSE5.

Well, AMD did come up with x86-64 (17 years ago already), and it has been highly successful.
https://en.wikipedia.org/wiki/X86-64

They also came up with GCN and with Mantle.
 

MajinCry

Platinum Member
Jul 28, 2015
2,495
571
136
Well, you have me convinced, but not on one thing. It makes sense to me that Intel would only put effort into adding an optimized code path if AMD came up with the best-performing sequences of instructions and paid for the hours Intel has to put into updating the compiler.
I like AMD very much, but I think it is unreasonable to ask Intel to support competitors like AMD and VIA out of the blue.
Everybody is still in the business of making money, so hours spent must be paid.
Now, if AMD came up with the money and the optimization manual and Intel still deliberately produced inferior code that did not comply with what AMD requested, that would be a whole different story. That would be illegal.

They don't optimize for specific CPUs, except in the case of bugs (does the latter even happen?). The compiler optimizes for instruction sets. SSE2 is SSE2, whether it be on a Core 2 Duo or an FX-8350.

What Intel did was use deliberately slow SSE2 (or did it fall back to plain x86/x87?) code whenever the CPUID vendor string wasn't GenuineIntel.
 
May 11, 2008
20,260
1,150
126
They don't optimize for specific CPUs, except in the case of bugs (does the latter even happen?). The compiler optimizes for instruction sets. SSE2 is SSE2, whether it be on a Core 2 Duo or an FX-8350.

What Intel did was use deliberately slow SSE2 (or did it fall back to plain x86/x87?) code whenever the CPUID vendor string wasn't GenuineIntel.
Well, yes, but I read somewhere that not all instructions are supported equally well, and that some AVX instructions are slower on the construction cores (the Bulldozer family) than their rough SSE2 equivalents.
I can imagine that a safe default is chosen.
I don't remember where it was anymore; some forum.
There was a person with a Prime95 logo stating that to get the best performance from construction-core AMD CPUs, one needs to mix SSE2 and AVX instructions. I don't know if that is true, and I cannot test it: I have Visual Studio, but I have zero experience in x86 assembly.
If true, that means the programmer must optimize by hand and not let the compiler decide.
 

knutinh

Member
Jan 13, 2006
61
3
66
They don't optimize for specific CPUs, except in the case of bugs (does the latter even happen?). The compiler optimizes for instruction sets. SSE2 is SSE2, whether it be on a Core 2 Duo or an FX-8350.
How can you be so confident? The Atom line of processors supports a given instruction set that may be similar to the big cores'.

But due to a lack of re-ordering, smaller caches, etc., "optimal" code might be quite different.

I think that Intel does whatever its resources and ingenuity allow to make your code fast on one or more of its CPUs. And the results suggest it is doing a better job than Microsoft or gcc.

-k
 

knutinh

Member
Jan 13, 2006
61
3
66
Wrong. If a company sells a piece of software that produces binaries that run on others' products, that piece of software needs to either drop support altogether or provide it properly.
"Needs to" in what sense? Legally? Morally? Market-wise?

I disagree. People (even Intel) get to make compilers. They get to target whatever CPU they like. If they choose not to spend any time optimizing for competing hardware manufacturers, that is fine.

I believe that AMD has used ICC in PR material, i.e. at least once ICC was deemed the best compiler for a given set of code and a given AMD release. That is interesting. Apparently Microsoft is happy crippling MSVC with C89 to push users to their high-level code, while gcc is happy doing... whatever it is that gcc does (#pragmas? We don't do that).

-k
 

knutinh

Member
Jan 13, 2006
61
3
66
...I have Visual Studio, but I have zero experience in x86 assembly.
If true, that means the programmer must optimize by hand and not let the compiler decide.
Optimal assembly is always going to be as fast as, or faster than, intrinsics; the same relationship holds between intrinsics and code peppered with pragmas and the like. The higher up the abstraction ladder, the more opportunities are off limits, and (best case) speed can only get worse.
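As a toy illustration of that ladder (my example, not knutinh's): the same reduction written as plain C, which the compiler may or may not vectorize, and as explicit SSE2 intrinsics, which pin the instruction choice but still leave register allocation and scheduling to the compiler:

```c
/* Two rungs of the abstraction ladder for one simple reduction.
 * Requires SSE2 (baseline on any x86-64 compiler). */
#include <emmintrin.h>

/* Plain C: the compiler decides whether and how to vectorize. */
float sum_plain(const float *a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* SSE2 intrinsics: four lanes at a time; assumes n is a multiple of 4. */
float sum_sse2(const float *a, int n) {
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);  /* spill and horizontally add the 4 lanes */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Hand-written assembly would be one rung lower still, controlling scheduling and register use directly, with all the maintenance costs described here.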

Now, writing optimal anything is hard, and assembly more so than most. Getting the skills to outdo a compiler by a significant margin without introducing bugs, updating those skills for each new piece of hardware, continually re-measuring against new compilers (and settings) because you never know whether you hit the limit or just lack the imagination... You could easily spend two months on a small piece of code that a compiler handles in seconds. And maintaining the assembly afterwards is a pain.

For most software it simply does not make sense. Decide what your software should do. Write code. Fix bugs. Let your users use it.

For small, well-structured code used in hardware performance rating, it might make sense to go through this pain, in order to rate "true" hardware performance, especially if the hardware is new and compilers are immature. But I'd argue that for "general" software performance, the availability of good compilers is part of the "performance".

-k
 

MajinCry

Platinum Member
Jul 28, 2015
2,495
571
136
How can you be so confident? The Atom line of processors supports a given instruction set that may be similar to the big cores'.

But due to a lack of re-ordering, smaller caches, etc., "optimal" code might be quite different.

I think that Intel does whatever its resources and ingenuity allow to make your code fast on one or more of its CPUs. And the results suggest it is doing a better job than Microsoft or gcc.

-k

Baba-ping! http://www.agner.org/optimize/blog/read.php?i=49
 

Nothingness

Diamond Member
Jul 3, 2013
3,054
2,021
136
I think that Intel does whatever its resources and ingenuity allow to make your code fast on one or more of its CPUs. And the results suggest it is doing a better job than Microsoft or gcc.
Except for programs that vectorize well, or for benchmarks that Intel specifically tuned their compiler for, gcc is very competitive. Even on SPEC, which Intel broke with their compiler, gcc is as good as icc on half of the integer tests. What I call "as good" is ±5%.

Also, icc is buggier than gcc (as an example, last time I tried, a wrong command line resulted in a crash; come on, Intel), so as far as I'm concerned icc is useless (I insist: that's true for *my* usage).
 

MajinCry

Platinum Member
Jul 28, 2015
2,495
571
136
You said:
"They don't optimize for specific CPUs, except in the cases of bugs"

From your own link:
"the compiler or library can make multiple versions of a piece of code, each optimized for a certain processor and instruction set,"

-k

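That multiple-versions approach doesn't have to be vendor-keyed, for what it's worth. GCC's function multiversioning, for instance, emits one clone per listed target plus a resolver that picks by CPUID feature bits at load time. A sketch, assuming GCC on x86-64 Linux (the `target_clones` attribute and target names are real; the function itself is my example):

```c
/* GCC function multiversioning: the compiler emits an AVX2 clone, an SSE2
 * clone, and a baseline clone of the same source, plus a resolver that
 * selects one at load time based on CPU features, not the vendor string. */
__attribute__((target_clones("avx2", "sse2", "default")))
double dot(const double *a, const double *b, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}
```

The selection happens once, via an ifunc resolver, so there is no per-call dispatch cost after loading.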
 

knutinh

Member
Jan 13, 2006
61
3
66
Except for programs that vectorize well...
I admit that I am heavily biased towards problems that feature deep nested loops and that can execute really well on SIMD hw.

Being able to write C code using icc, instead of having to resort to inline assembly with gcc, means being more productive, having fewer bugs, and having code that can be maintained afterwards.

It seems that open-source projects with a similar profile that cannot rely on users having a proprietary compiler fall back on assembly (x264 being an example).

Of course, many (most) applications do not feature vector-friendly code and/or outsource their vector-friendly code to the GPU.

-k
 
May 11, 2008
20,260
1,150
126
Say, do you guys have a link handy with all the results combined for the different builds of Blender with gcc, icc, and MSVC? Under both Windows and Linux? Running on different processors?

I cannot find the post. I've been traversing and searching a bit, but I don't see anything like a summed-up post.
 

superstition

Platinum Member
Feb 2, 2008
2,219
221
101
It makes sense to me that Intel would only put effort into adding an optimized code path if AMD came up with the best-performing sequences of instructions and paid for the hours Intel has to put into updating the compiler.
Because Intel's compiler is free or inexpensive, instead of being the most expensive compiler on the market, right? And because x86 isn't a standard, which includes the fact that Intel licensed it to AMD, right?
I like AMD very much, but I think it is unreasonable to ask Intel to support competitors like AMD and VIA out of the blue.
Think about it from the point of view of the customer. The customer who pays big money for that compiler needs to know that it is going to produce effective code for whatever it claims to be able to compile for. That rules out stealth crippling of non-Intel x86 CPUs.

And, as I said, it hardly seems onerous for AMD to just tell Intel which instructions to use in its compiler: "Hey, Intel, our testing has found that AVX is slower than FMA4 and XOP, so just go ahead and call those latter instructions instead of using AVX."

Or Intel could just prevent its expensive compiler, and the binaries it produces, from running at all on non-Intel CPUs.

A product is either supported or it isn't.
Well, AMD did come up with x86-64 (17 years ago already), and it has been highly successful.
Was that before or after the SSE5 debacle? Intel clearly didn't want AMD to lead again, nor did it want AMD's CPUs to be optimizable for an industry-standard instruction set.
 

superstition

Platinum Member
Feb 2, 2008
2,219
221
101
They don't optimize for specific CPUs, except in the case of bugs (does the latter even happen?). The compiler optimizes for instruction sets. SSE2 is SSE2, whether it be on a Core 2 Duo or an FX-8350.

What Intel did was use deliberately slow SSE2 (or did it fall back to plain x86/x87?) code whenever the CPUID vendor string wasn't GenuineIntel.
It didn't use SSE2 at all.

As far as I know, an intelligent compiler will have binaries execute specific instruction sets depending on the CPU. It makes very little sense to charge that kind of money for a compiler that's too primitive to have binaries use FMA4 and XOP.
 

superstition

Platinum Member
Feb 2, 2008
2,219
221
101
"Needs to" in what sense? Legally? Morally? Market-wise?
I already explained that in detail.
People (even Intel) get to make compilers. They get to target whatever CPU they like. If they choose not to spend any time optimizing for competing hardware manufacturers, that is fine.
Just restating the original argument, the one I rebutted, is not a rebuttal.
I believe that AMD have used ICC in PR material. I.e at least one time ICC was deemed the most optimal compiler for a given set of code and a given AMD release. That is interesting.
Yes, it is interesting considering the SSE2 GenuineIntel scandal.
 

superstition

Platinum Member
Feb 2, 2008
2,219
221
101
Say do you guys have a link ready with all the results combined for different builds of blender with gcc and icc and msvc ? And under windows and linux ? And running on different processors ?

I cannot find the post. Been traversing and searching a bit but i do not see something like a summed up post.
Just search The Stilt's posting history and you'll find the builds. He doesn't post a lot, so it's not hard to find the posts linking to the builds.

As for Linux, you can download the binary from the Blender site. The Ryzen Blender demo file can be found easily with a Google search.

In terms of results, I tested on Skylake, Lynnfield, and Piledriver (with CMT on and off). I also tested stock Blender back through quite a few versions. The results were clear.

• The SIMD build is the fastest build on all, except on Lynnfield, where it trailed the outdated 2.75a build by a tiny bit.

• The AVX2 build is the second-fastest on all, except on Lynnfield, where it was the worst but only trailed the official 2.78a (latest) Blender build by a tiny bit.

• The stock Blender build is vastly slower on Skylake and Piledriver and a little slower on Lynnfield. On Lynnfield there isn't much speed difference between any of the builds, because it's a 2009 processor that doesn't have AVX or other recent instructions.

The bottom line is that the stock Blender builds are useless for comparing performance now: they don't use modern instructions and leave far too much performance on the table to be relevant. Really, the only thing they do is show how much modern instructions can improve performance over 2009/2010 hardware, and why it's important to use them.
 

Nothingness

Diamond Member
Jul 3, 2013
3,054
2,021
136
The bottom line is that the stock Blender builds are useless for comparing performance now because they don't use modern instructions and leave way too much performance on the table to be relevant. Really the only thing they do is point to how much modern instructions can improve performance over 2009/2010 — why it's important to use them.
Has it been proven that the Windows build runs few AVX instructions? Because when I disassemble the Blender executable I see AVX and FMA instructions... And looking at the source and various tracked issues, AVX kernels have been in Blender since 2014 and are compiled in when VS 2012 or later is used; the Windows build is indeed built with VS 2012:
Code:
~/work/Benchmarks/zen/blender-2.78a-windows64$ strings blender.exe | grep -i 'microsoft visual'
Microsoft Visual C++ version 12.0
 
May 11, 2008
20,260
1,150
126
Just search The Stilt's posting history and you'll find the builds. He doesn't post a lot, so it's not hard to find the posts linking to the builds.

As for Linux, you can download the binary from the Blender site. The Ryzen Blender demo file can be found easily with a Google search.

In terms of results, I tested on Skylake, Lynnfield, and Piledriver (with CMT on and off). I also tested stock Blender back through quite a few versions. The results were clear.

• The SIMD build is the fastest build on all, except on Lynnfield, where it trailed the outdated 2.75a build by a tiny bit.

• The AVX2 build is the second-fastest on all, except on Lynnfield, where it was the worst but only trailed the official 2.78a (latest) Blender build by a tiny bit.

• The stock Blender build is vastly slower on Skylake and Piledriver and a little slower on Lynnfield. On Lynnfield there isn't much speed difference between any of the builds, because it's a 2009 processor that doesn't have AVX or other recent instructions.

The bottom line is that the stock Blender builds are useless for comparing performance now: they don't use modern instructions and leave far too much performance on the table to be relevant. Really, the only thing they do is show how much modern instructions can improve performance over 2009/2010 hardware, and why it's important to use them.

So, the default Blender is built as I expected: to run on as many CPUs as possible.
That is understandable, to reach a wide public, but it would be nice if the Blender developers also hosted a SIMD-optimized version.
Is the default Blender build FPU-only, or does it use the first SSE version?

And by the SIMD build, do you mean SSE2 or AVX?
 

Nothingness

Diamond Member
Jul 3, 2013
3,054
2,021
136
Has it been proven that the Windows build runs few AVX instructions? Because when I disassemble the Blender executable I see AVX and FMA instructions... And looking at the source and various tracked issues, AVX kernels have been in Blender since 2014 and are compiled in when VS 2012 or later is used; the Windows build is indeed built with VS 2012:
Code:
~/work/Benchmarks/zen/blender-2.78a-windows64$ strings blender.exe | grep -i 'microsoft visual'
Microsoft Visual C++ version 12.0
I dug further, and the >SSE2 kernels are only built for native builds; there's no CPU auto-detection.

Now it remains to be understood why there's AVX and FMA in the binary. Perhaps some linked-in library?

So, the default Blender is built as I expected: to run on as many CPUs as possible.
That is understandable, to reach a wide public, but it would be nice if the Blender developers also hosted a SIMD-optimized version.
Or even better: add code that detects the CPU it runs on and selects the best paths...

Is the default Blender build FPU-only, or does it use the first SSE version?

And by the SIMD build, do you mean SSE2 or AVX?
SSE2 is always enabled for 64-bit builds, so I guess he means AVX.
 

knutinh

Member
Jan 13, 2006
61
3
66
I already explained that in detail.
It was not at all clear to me.
Just restating the original argument, the one I rebutted, is not a rebuttal.
I have not seen much in the way of arguments from you, mostly normative claims?

Please elaborate on why a compiler manufacturer _must_ offer optimal performance on all platforms it supports, and on how this squares with that clearly not being the case for most products, be they compilers or office applications.
Yes, it is interesting considering the SSE2 GenuineIntel scandal.
If this is common (I don't know that it is), I would argue that it means Intel probably makes a better compiler for AMD than MS or gcc does.

-k
 

superstition

Platinum Member
Feb 2, 2008
2,219
221
101
Has it been proven that the Windows build runs few AVX instructions? Because when I disassemble the Blender executable I see AVX and FMA instructions...
The best people to ask are The Stilt and Blameless, I suppose.

The drastic performance enhancement is not seen just because Intel's compiler is special. The AVX2 build was done with Microsoft's compiler, as I recall, not Intel's, and it also shows a drastic improvement over the stock Blender builds, except on Lynnfield. The "SIMD" build is just a bit better than the AVX2 build.
I have not seen much in the way of arguments from you
I'm not going to restate them yet again.
Please elaborate why
I've done that.
 

superstition

Platinum Member
Feb 2, 2008
2,219
221
101
This bit, from ExtremeTech, gives a clue, though.

As Anandtech wrote, “With the Pentium D, we had to give up a noticeable amount of single threaded performance…in order to get better multithreaded/multitasking performance, but with AMD, you don’t have to make that sacrifice. Everything from gaming to compiling performance on the Athlon 64 X2 4400+ was extremely solid.”

2003-2006 was a golden age for the company. Unfortunately, it didn’t last.

David vs. Goliath: It’s a nice story

One thing we know more about now than we did then is just how much pressure Intel brought to bear on everyone, behind the scenes. There were always off-the-record conversations with nervous motherboard vendors about why their AMD product samples shipped in plain white boxes, or why the motherboards lacked brand names. When SuperMicro introduced an Opteron motherboard in 2005, the company refused to acknowledge its existence. Intel’s own compilers refused to run SSE or SSE2 code on compatible AMD processors; applications would check for the “GenuineIntel” string when running these programs rather than simply checking to see if SSE2 was supported on the processor. That’s a particularly low blow considering AMD paid Intel for licenses.

In its 500-plus-page findings of fact, the European Union laid out repeated demonstrations of how Intel used predatory rebate practices to keep companies from carrying more than certain percentage of AMD hardware. The basic scheme worked like this: If an Intel chip normally cost $100, but you bought 90% Intel processors, Intel would cut you a $25 rebate check per chip at the end of the quarter. If, however, you sold 85% Intel processors, you got nothing. A company that sold 100,000 chips in a quarter and kept 90% Intel volume could expect a $22.5 million dollar rebate check. In order to compete with Intel’s rebates, AMD had to offer an equivalent price savings, but on a vastly smaller number of chips. In one situation, AMD offered to give HP a million processors, for free, if it would use them to build systems. HP responded that it couldn’t afford to do so, because the total value of a million free processors was smaller than the value of Intel’s rebates.
 

Nothingness

Diamond Member
Jul 3, 2013
3,054
2,021
136
The best people to ask are The Stilt and Blameless, I suppose.
As I later found and posted, Blender is configured at compile time for a given ISA extension, so this more or less answers my questions.

The drastic performance enhancement is not just seen because Intel's compiler is special. The AVX2 build was done with Microsoft's as I recall, not Intel's. It also shows a drastic performance improvement over the stock Blender builds, except on Lynnfield. The "SIMD" build is just a bit better than the AVX2 build.
If you want to measure the impact of new instructions, you have to use the same compiler for all of the builds or you are creating distortions.
 