AVX2 and FMA3 in games

Ken g6 · Jan 12, 2017

Carfax83 said:
I'm no programmer, but I was under the impression that extensions such as AVX2 were backward compatible with older extensions. For example, a new CPU like Haswell or Skylake would run the fastest codepath with AVX2, while a CPU like Sandy Bridge would use the same codepath but with less throughput/performance due to lacking AVX2..

That's more or less the way AVX and AVX2 should have been done. Get info from the chip on the width of its AVX and work from there. (And maybe have an instruction to limit that width, but I'm not sure if that makes sense.)

But that's not how it works.

Tuna-Fish · Jan 12, 2017

knutinh said:
As ShintaiDK said, it is possible to distribute binaries that follow different code-paths depending on hardware. I think that makes a lot of sense. Now, should there be 2 or 10 code paths, what is the "sweet spot"? Would users accept that their game binary download is 2 GB instead of 512MB only in order to increase performance by 5% on half of the worlds computers? I don't know.

The binary is rarely more than a few (think 10) megabytes, and the actually varying parts are probably less than a kilobyte. Space usage isn't the issue.

However, binaries with multiple code paths are extremely rare in practice. Some of the very most optimized software, like x264, do this, but in practice almost everyone has decided that the multiple codepaths are not worth the hassle, specifically as they only give more speed where it's not really needed. If you are writing a game or something, what you optimize for is for as many people as possible being able to experience your game at the minimum level where the CPU doesn't hamper it, typically 60fps or so. You spend time optimizing the game for are the low end, so you can sell more games. After you have done that, even without any special optimization the high-end machines will probably run your game way faster than they need to. It makes no sense to add the deployment and testing headaches of multiple builds when it only helps those machines that didn't need it anyway.

DrMrLordX · Jan 12, 2017

Ken g6 said:
That's more or less the way AVX and AVX2 should have been done. Get info from the chip on the width of its AVX and work from there. (And maybe have an instruction to limit that width, but I'm not sure if that makes sense.)

But that's not how it works.

It does if you use a SIMD-aware JVM or similar. Yay Java!

NTMBK · Jan 12, 2017

Good job that Zen is focusing on FP128 throughput, not FP256.

knutinh · Jan 12, 2017

William Gaatjes said:
Oh, i agree. Image and video processing. Sound processing. Simulation programs. There are enough applications besides games.

Agreed.

But in these situations also, a program must query the cpu what it is capable of. The issue with different hardware configurations is that you cannot assume that a given option is present.

But that is merely a question of technical convenience. Do you explicitly detect hw and code different paths? Do you rely on ICC to do everything for you?

-k

knutinh · Jan 12, 2017

Ken g6 said:
That's more or less the way AVX and AVX2 should have been done. Get info from the chip on the width of its AVX and work from there. (And maybe have an instruction to limit that width, but I'm not sure if that makes sense.)

But that's not how it works.

Is that not how ARMs scalable vector extensions are designed? Write for a 2048-bit hypothetical target, get the execution of whatever the hw is capable of:

https://www.community.arm.com/proce...or-extension-sve-for-the-armv8-a-architecture

Someone with more knowledge than me told me (some years ago) that ARM had a beautiful instruction set, while Intel had a messy one. From a programmer-functional point of view. While from a performance PoV, it was the other way around, Intel had just the parts that allowed you to get the thing done fast.

I think there is a danger that when SW developers dream of HW functionality, we do not (usually) understand the trade-offs done when offering said functions.

-k

knutinh · Jan 12, 2017

Tuna-Fish said:
The binary is rarely more than a few (think 10) megabytes, and the actually varying parts are probably less than a kilobyte. Space usage isn't the issue.

However, binaries with multiple code paths are extremely rare in practice. Some of the very most optimized software, like x264, do this, but in practice almost everyone has decided that the multiple codepaths are not worth the hassle,

If the compiler does this automatically, it is not that much more hassle. You need to get a decent compiler and set it up but that is pretty much a given if you want performance anyway.

A different twist is offered by the FFTW library. Say that you want to run FFTs a million times a second for ten years. Then it makes sense to get the fastest implementation possible for your hardware setup. Problem is, you might not be a optimizer guru, and the developers of that library can't visit every user. So they have made a flexible solution that sort of self-modifies based on actual profiling at the user:
http://www.fftw.org/faq/section4.html#whyfast

specifically as they only give more speed where it's not really needed. If you are writing a game or something, what you optimize for is for as many people as possible being able to experience your game at the minimum level where the CPU doesn't hamper it, typically 60fps or so. You spend time optimizing the game for are the low end, so you can sell more games. After you have done that, even without any special optimization the high-end machines will probably run your game way faster than they need to. It makes no sense to add the deployment and testing headaches of multiple builds when it only helps those machines that didn't need it anyway.

If we take your argument to the extreme, games should be written for 80286 with 8MB ram. Clearly, they are not, so there must be some "sweetspot". Where a sufficient number of users will have a sufficiently good experience, while a minority will have a really good experience.

-k

William Gaatjes · Jan 12, 2017

knutinh said:
Agreed.

But that is merely a question of technical convenience. Do you explicitly detect hw and code different paths? Do you rely on ICC to do everything for you?

-k

Mind you, i am not a real programmer, just for a hobby. I know some stuff but not all tricks and i have no experience with ICC.
But ICC is just a compiler like all compilers.
You cannot rely on ICC if you have physically different cpu models. If an older model for example does not have avx or avx2, and you want to run a program on that cpu you can choose to let the compiler create machine code that does not use avx.
Now you have the problem solved that the older cpu model can run the code. And the code will run on newer cpu models.
But now you have the problem that a newer model with avx, the avx cannot be utilized because the machine code you created just does not have instructions for it.
So you have to compile functions especially with avx and the same functions without avx. And during runtime you select the functions based if the cpu supports avx or not.
I assume it works for all compilers the same. You write a for example c module. Only you compile the c module 2 times for two object files. One while telling the compiler that avx must be used.
And one where you tell the compiler to use generic x86 instructions. Now you have two object files with functions that you can turn into a library. And then write the code that calls the functions in the library.
That is the way of letting the compiler and linker handle it.
Or you go hardcore and use inline assembly code and optimize by hand. But that is the hard way.

C code > assembly code > machine code (real instructions for the cpu).

nismotigerwvu · Jan 12, 2017

Well it isn't all that surprising ARM has a cleaner ISA. Beyond the whole RISC versus CISC thing (I'm not coming anywhere near that can of worms) x86 (and by association x64) has a much older heritage and places a huge premium on essentially complete backwards compatibility. Additionally, the types of devices you find ARM chips in (and the OS on them) don't really penalize breaking comparability with older models the same way x86 does. Look at the fuss kicked up when some recent games launched without support for Phenom II due to lacking support for some SIMD instructions. The Phenom II launched 9 years ago now (and the last new models were introduced ~6 years ago). In about that same time frame we've seen parts running 3 major revisions of the ARM ISA.

Ken g6 · Jan 12, 2017

knutinh said:
Is that not how ARMs scalable vector extensions are designed? Write for a 2048-bit hypothetical target, get the execution of whatever the hw is capable of:

Yeah, that sounds like what I wanted.

DrMrLordX said:
It does if you use a SIMD-aware JVM or similar. Yay Java!

Does Java have an API that allows you to specify arrays, apply functions to those arrays, and perform all the functions in a loop perfectly sized for your SIMD implementation? (Meaning to do all the functions serially, one SIMD-sized chunk at a time.) Probably not. It probably just guesses at SIMD-capable areas, like a C compiler does.

That would be a neat API to develop, though, in any language.

Tuna-Fish · Jan 12, 2017

knutinh said:
If the compiler does this automatically, it is not that much more hassle. You need to get a decent compiler and set it up but that is pretty much a given if you want performance anyway.

The only compiler that does this well in a no-hassle way is ICC. For whatever reason, the games industry uses MSVC on windows instead.

If we take your argument to the extreme, games should be written for 80286 with 8MB ram.
-k

No, that's not what I'm saying at all. My argument is not that you are willing to throw away everything in order to run on the very lowest systems, it's that optimization only makes sense when expands the systems you can target. Simply because if your design document says that it has to run on a Pentium G-series, you tweak it until it does, and at that point any beefier systems don't really need any optimization.

William Gaatjes · Jan 12, 2017

nismotigerwvu said:
Well it isn't all that surprising ARM has a cleaner ISA. Beyond the whole RISC versus CISC thing (I'm not coming anywhere near that can of worms) x86 (and by association x64) has a much older heritage and places a huge premium on essentially complete backwards compatibility. Additionally, the types of devices you find ARM chips in (and the OS on them) don't really penalize breaking comparability with older models the same way x86 does. Look at the fuss kicked up when some recent games launched without support for Phenom II due to lacking support for some SIMD instructions. The Phenom II launched 9 years ago now (and the last new models were introduced ~6 years ago). In about that same time frame we've seen parts running 3 major revisions of the ARM ISA.

My opinion:
I can understand that people would love to see their ancient hardware running forever.
At a certain moment, hardware is just ancient. I can fully understand that at a certain moment a given processor is just too old for modern software.
I mean, some people complain over phenom II while the gpu ages at a rapid rate, leaving games unplayable anyway.
Why should a cpu last forever and may a gpu live for a much shorter time like for example 3 years. That is IMHO just weird.

EDIT:
I looked it up.
Then again, phenom II is barely 5 years old. I would think that different code paths in the game would be relevant in this case.

knutinh · Jan 12, 2017

William Gaatjes said:
You cannot rely on ICC if you have physically different cpu models.

My recollection is that you write your code once, tell ICC what set of targets you want to optimize for, and it will generate a binary that automatically choose the right code for you.

-k

knutinh · Jan 12, 2017

Tuna-Fish said:
The only compiler that does this well in a no-hassle way is ICC. For whatever reason, the games industry uses MSVC on windows instead.

ICC is integrated with Visual Studio, so that your project is still MS, but parts of it will just run faster.

Now, ICC costs money and equipping each project member with that in order to build your code is cumbersome. Setting up and comprehending compilers is unpleasant.

My guess is that games devs focus on function (they should) and on GPU performance. VS has some nice debugging tools, probably integrates nicely with DirectX SDKs, libraries,...

-k

William Gaatjes · Jan 12, 2017

knutinh said:
My recollection is that you write your code once, tell ICC what set of targets you want to optimize for, and it will generate a binary that automatically choose the right code for you.

-k

Oke, but how does the game code choose the correct code for a given cpu in run time ?
I am curious.
Because would that not make it slow if that has to be done every time avx is called that a test must be done what cpu it is ?

knutinh · Jan 12, 2017

William Gaatjes said:
Oke, but how does the game code choose the correct code for a given cpu in run time ?
I am curious.
Because would that not make it slow if that has to be done every time avx is called that a test must be done what cpu it is ?

I would assume that a cpu identification is carried out when the binary is executed, and the result is kept in some state until it is done.

https://computing.llnl.gov/?set=code&page=intel_vector

The Intel compiler can generate a single executable with multiple levels of vectorization with the -ax flag, which takes the same options as the -x flag (i.e., AVX, ..., SSE2). This flag will generate run-time checks to determine the level of vectorization support on the processor and will then choose the optimal execution path for that processor. It will also generate a baseline execution path that is taken if the -ax level of vectorization specified is not supported. The baseline can be defined with the -x flag, with -xSSE2 recommended. Multiple -ax flags can be specified to create several options. For example, compile with -axAVX -axSSE4.2 -xSSE2. In this case, when run on an AMD Opteron processor, the baseline SSE2 execution path will be taken. When run on an Intel Westmere processor, the SSE4.2 execution path will be taken. When run on an Intel Sandy Bridge processor, the AVX execution path will be taken.

superstition · Jan 12, 2017

NTMBK said:
You know this thread is a year old, right?

superstition already said:
Anyway, the reason I'm bringing back this topic is because now it has been a year and having AVX2 code seems to make more sense for high-end games, provided it can be useful enough to warrant inclusion.

Although, I also brought up two other issues of interest.

Nothingness · Jan 13, 2017

nismotigerwvu said:
Well it isn't all that surprising ARM has a cleaner ISA. Beyond the whole RISC versus CISC thing (I'm not coming anywhere near that can of worms) x86 (and by association x64) has a much older heritage and places a huge premium on essentially complete backwards compatibility. Additionally, the types of devices you find ARM chips in (and the OS on them) don't really penalize breaking comparability with older models the same way x86 does. Look at the fuss kicked up when some recent games launched without support for Phenom II due to lacking support for some SIMD instructions. The Phenom II launched 9 years ago now (and the last new models were introduced ~6 years ago). In about that same time frame we've seen parts running 3 major revisions of the ARM ISA.

ARM has become old enough that legacy matters. It's less encumbered than x86, but you still have to support the 3 ISA: ARM, Thumb and AArch64. I wonder if Apple will switch to AArch64 only in the future, after all since they have a good control over all of their ecosystem it's likely doable. OTOH for Android, and hence most ARM chip makers, I have little hope...

Carfax83 · Jan 13, 2017

NTMBK said:
Good job that Zen is focusing on FP128 throughput, not FP256.

I'm curious as to why you think this is a good thing?

Nothingness · Jan 13, 2017

Carfax83 said:
I'm curious as to why you think this is a good thing?

There are many more SW using SSE than AVX (thank Intel for segmentation ) and sometimes 256-bit vectors are not directly usable, so you go to 128-bit vectors. So you'd better have 4 128-bit FP ALU (Zen) rather than 2 256-bit FP ALU (Intel).

Carfax83 · Jan 13, 2017

Ah, thanks for the clarification!

William Gaatjes · Jan 13, 2017

knutinh said:
I would assume that a cpu identification is carried out when the binary is executed, and the result is kept in some state until it is done.

https://computing.llnl.gov/?set=code&page=intel_vector

The Intel compiler can generate a single executable with multiple levels of vectorization with the -ax flag, which takes the same options as the -x flag (i.e., AVX, ..., SSE2). This flag will generate run-time checks to determine the level of vectorization support on the processor and will then choose the optimal execution path for that processor. It will also generate a baseline execution path that is taken if the -ax level of vectorization specified is not supported. The baseline can be defined with the -x flag, with -xSSE2 recommended. Multiple -ax flags can be specified to create several options. For example, compile with -axAVX -axSSE4.2 -xSSE2. In this case, when run on an AMD Opteron processor, the baseline SSE2 execution path will be taken. When run on an Intel Westmere processor, the SSE4.2 execution path will be taken. When run on an Intel Sandy Bridge processor, the AVX execution path will be taken.

If i read about that it raises some red flags for me.
Maybe real x86 programmers can give me a clue if i am right or wrong when it comes to performance about avx,sse2 and sse4.2.

Because opteron is like a catch all name for several different versions of AMD cpu.
I have to assume here that avx code runs faster than sse2 on piledriver. But that may not be the case in all avx<> sse2 situations and then it is understandable.
Let say we have a piledriver based opteron. It supports AVX, i do not know how fast and well it runs, but it works.
Because we cannot ask from Intel to do validation of machine code generation of their ICC compiler for a cpu from another company, this choice will most likely present reduced performance for AMD or VIA processors in comparison to Intel cpus.

And then there is also SSE4.2. But again, i do not know how well it runs on piledriver cpus.
But it is available. Even on the AMD jaguar cpu. Might come in handy for games
Me as a noob and outsider would see that SSE4.2 would be beneficial over sse2 when it can be used.

It is perhaps the wisest choice to do it as i described before. Create separate binaries and create game code that does the choice by cpu query and makes the decision to load the appropriate library functions.
It is not as if the whole game is riddled with avx or sse4.2 or sse code. It is just several functions that are called over and over again.
(When you optimize for speed, you take the functions that are used most often first and see if it applies.)
It creates a bit more effort of course but when done once, this tactic can always be used every time over and over again.
Plus when comparing performance, it can just be switch over from one supported mode into another supported mode while debugging and see which path speeds up most. Great for testing and profiling.
And customers will be happy.

https://en.wikipedia.org/wiki/SSE4

DrMrLordX · Jan 13, 2017

Ken g6 said:
Does Java have an API that allows you to specify arrays, apply functions to those arrays, and perform all the functions in a loop perfectly sized for your SIMD implementation?

I don't think it can adjust loop size dynamically. If it can, that's news to me.

superstition · Jan 13, 2017

William Gaatjes said:
I have to assume here that avx code runs faster than sse2 on piledriver. But that may not be the case in all avx<> sse2 situations and then it is understandable.

The original Bulldozer (8150) had bugs with AVX that made it not worth pursuing as far as I know. But, I haven't seen the same thing said about Piledriver (not with any data to back it up). However, it may be that AVX is still worse than other options in Piledriver such as FMA3, FMA4, and XOP.

The key here may be to utilize the best-performing bits in Piledriver instead of just using whatever Intel's CPUs support as if that's all there is. I assume that the reason The Stilt's SIMD build runs so well on Piledriver is because it takes advantage of things other than AVX, like FMA4 and/or XOP. A smart compiler will get the instructions that offer the best performance to run rather than inferior ones.

William Gaatjes said:
Let say we have a piledriver based opteron. It supports AVX, i do not know how fast and well it runs, but it works.

PD also supports FMA3 (which Intel does as well), FMA4 (which only Piledriver/Steamroller/Excavator support as far as I know), and XOP (another AMD-only thing).

All this nonsense is because Intel apparently pulled a fast one on AMD and made a moving target out what was supposed to be SSE5. Instead of getting an industry standard SSE5 we got a mess. Read Agner Fog's blog about it. He knows more about it than I do.

William Gaatjes said:
Because we cannot ask from Intel to do validation of machine code generation of their ICC compiler for a cpu from another company

Wrong. If a company sells a piece of software that produces binaries that run on others' products that piece of software needs to get rid of support altogether or provide it. Think about all the other examples. Apple sold printers at one time. Did it write versions of Mac OS that caused 3rd party printers to run much much slower than they should? Apple also sold monitors. Did they write the Mac OS so that only their monitors would display a good-quality image?

And then we come to general software, like word processors. If a software company produces a piece of software and also sells printers, scanners, and other pieces of hardware should that piece of software not run correctly on anything but that company's hardware? Mass market products have to be compatible. If a company is going to be brazen enough to create a walled garden then it needs to explicitly drop support. In the case of a compiler it would fail to run without GenuineIntel.

Or we can have word processors that only print on one company's printers and nonsense like that. x86 is a standard that Intel licensed to AMD. It is not Intel's little baby. It is a licensed cross-company standard.

A product is either supported or it isn't. If your compiler produces binaries that run on AMD then you have the responsibility to make sure they leverage the best-performing instructions. People are trying to argue having one's cake and eating it, too. Intel gets to sell an expensive "industry standard" compiler that produces "AMD-compatible" binaries but which cripple performance on AMD CPUs just because AMD is a competitor (even though it has a license for x86)? No. If you sell a product to a customer you owe them explicit support for anything that seems supported. You don't pull a fast one on them. At the very least Intel could have asked AMD which instruction set the compiler should tell the CPU to run.

William Gaatjes said:
this choice will most likely present reduced performance for AMD or VIA processors in comparison to Intel cpus.

If a compiler instructs a CPU to run instructions it's up to the CPU vendor to make those instructions run well. AMD wouldn't have had things like FMA4 and XOP if not for Intel's nonsense regarding SSE5. We would have a single SSE5 standard.

It's also up to the compiler vendor to make sure the best-performing instructions are used.

William Gaatjes said:
And then there is also SSE4.2. But again, i do not know how well it runs on piledriver cpus. But it is available. Even on the AMD jaguar cpu. Might come in handy for games

Take a look at how much better the SIMD build for Blender runs the Ryzen demo — on Intel and AMD CPUs. Games are definitely not the only thing that can benefit from improved compilation efficiency.

superstition · Jan 13, 2017

And, really, at the bottom line of all this is the apparent fact that Intel eventually chose to make its compiler efficient for AMD cpus instead of artificially hobbling it with GenuineIntel — after the outcry over it happened and damage was done to AMD.

Intel's compiler now apparently produces the best-performing code for AMD processors. Intel didn't do that out of the goodness of its heart. It is selling that compiler for significant money and companies tend to change bad policy when it comes into the spotlight enough.

It won the war, though, of course. It doesn't look like AMD is going to be trying to innovate in the instruction set area anytime soon after being burned with SSE5.

AVX2 and FMA3 in games

Programming Moderator, Elite Member

Golden Member

Lifer

Lifer

Member

Member

Member

Lifer

Golden Member

Programming Moderator, Elite Member

Golden Member

Lifer

Member

Member

Lifer

Member

Platinum Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Lifer

Lifer

Platinum Member

Platinum Member