Depends on how important single-thread, general application performance is.
They _could_ settle for "fast enough" single-thread performance (a souped-up iPhone core), then add multi-threaded/SIMD performance in number-cruncher models by way of simple multi-core/SIMD/GPU/ML units (scaling...
Given that clock speed has stagnated and that most people are perfectly happy with 10-year-old PCs (or tablets) for updating their FB profile or writing Office documents, what are Intel to do?
The biggest obstacle for SIMD usefulness is programmers bothering to use it - through libraries...
Some applications are compute bound, some bandwidth.
Increasing the compute capability of a «well-rounded» cpu by 2x without increasing bandwidth will give better performance in some applications, while a larger percentage of applications will be bandwidth limited.
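The compute-vs-bandwidth trade-off can be put into a roofline-style bound. This is only a sketch; the function name and all numbers are made up for illustration:

```c
/* Roofline-style sketch: attainable throughput is the lesser of the
 * compute peak and what the memory system can feed. Doubling
 * peak_gflops only helps code whose arithmetic intensity (flops per
 * byte moved) already puts it on the compute side of the ridge. */
double attainable_gflops(double peak_gflops, double bw_gb_s,
                         double flops_per_byte)
{
    double mem_bound = bw_gb_s * flops_per_byte;    /* memory roof */
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}
```

E.g. a stream-like kernel at 0.1 flop/byte on a 50 GB/s machine caps out around 5 GFLOP/s no matter how high the compute peak goes, which is exactly the "larger percentage of applications" case above.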
I find it more interesting...
Anyone aware of somewhat sensible comparisions of compilers for eg HPC workloads for x86 and ARM?
I am most interested in the speed of open source compilers (gcc) vs the cpu manufacturers compiler.
-k
I think that you underestimate how much SIMD is used for number crunching. Either because the application programmer used assembler/intrinsics/a vectorizing compiler (intel) or because they rely on some library (blas, fftw,...) that is vectorized.
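For concreteness, this is the kind of loop that ends up SIMD one of those three ways (it is saxpy, the textbook BLAS level-1 routine; icc at -O3 will auto-vectorize it, or a library ships a hand-tuned version):

```c
#include <stddef.h>

/* y := a*x + y. A vectorizing compiler turns this into 4-16
 * multiply-adds per instruction instead of one. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```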
If we expect our hardware to do function «A» is...
I'd suggest that the Intel compiler is a pretty good reason to go with Intel hardware if the OP wants to write "clean" code and not mess with dirty optimization.
I don't know much about Fortran, but I assume that the situation is similar to C.
Also, no-one mentioned Xeon Phi? They are being...
My experience with icc and gcc suggests that I would use icc if compiling numerically heavy code to run on Intel hardware that I had to pay for.
The alternative with gcc might be to pepper the code with inline assembly in order to get vectorization. Bug-prone, non future-proof and resource...
I believe that SIMD calculation amounts to a minute fraction of the cpu area, and a larger (but still small) fraction of the cpu power budget.
I have seen energy breakdowns on fetching 64 bits from memory, doing a double-precision multi-acc, and storing the result back to memory. Turns out that...
I would be curious to know what the potential would be for Adobe products (Photoshop, Lightroom) if they hired competent programmers/optimizers and targeted AVX512 + high core counts due to the iMac pro being used by "media professionals".
Is "pixel processing" the bottleneck in those products...
So Intel has 2x the peak AVX FMA throughput of AMD. Even with a memory bandwidth of 2x, I would not necessarily expect a 2x speedup of even something like "professional rendering". Perhaps something really streamlined and FMA-centric like matrix multiply or convolution.
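The peak-throughput arithmetic behind that 2x figure is simple enough to write down (all the concrete numbers below are hypothetical):

```c
/* Back-of-envelope peak FP32 rate: cores x clock x FMA units x SIMD
 * lanes x 2, since an FMA counts as two flops. */
double peak_gflops(int cores, double ghz, int fma_units, int lanes)
{
    return cores * ghz * fma_units * lanes * 2.0;
}
```

E.g. 8 cores at 3.0 GHz with 2 FMA units and 8 fp32 AVX lanes gives 768 GFLOP/s; halve the FMA units and the peak halves, which is the 2x gap in question. Actually hitting it requires something FMA-dense like matmul, as noted above.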
For maximum performance...
So how much better performance-per-watt does a state-of-the-art ARM core offer vs a state-of-the-art x86 core, say at an operating point of 0.5 W? My gut feeling is that they should be quite similar, and that other factors are more relevant. Such as:
1. Does Apple like to have a credible bargaining...
It was not at all clear to me.
I have not seen much in the way of arguments from you, mostly normative claims?
Please elaborate why a compiler manufacturer _must_ offer optimal performance on all platforms it supports, and how this relates to clearly not being the case for most products, be it...
I admit that I am heavily biased towards problems that feature deep nested loops and that can execute really well on SIMD hw.
Being able to write C code using icc, instead of having to resort to inline assembly using gcc, means being more productive, having fewer bugs and that your code can be...
You said:
"They don't optimize for specific CPUs, except in the cases of bugs"
From your own link:
"the compiler or library can make multiple versions of a piece of code, each optimized for a certain processor and instruction set,"
-k
Optimal assembly is always going to be as fast or faster than intrinsics. The same relationship holds between intrinsics and code peppered with pragmas etc. The higher up the abstraction ladder you go, the more opportunities are off limits, and (best case) speed can only get worse.
Now, writing optimal...
"Needs to" in what sense? Legally? Morally? Market-wise?
I disagree. People (even Intel) gets to make compilers. They get to target whatever cpu they like. If they choose to not spend any time optimizing for competing hw manufacturers that is fine.
I believe that AMD have used ICC in PR...
How can you be so confident? The Atom line of processors supports a given instruction set that may be similar to the big guys.
But due to a lack of re-ordering, cache size etc, "optimal" code might be quite different.
I think that Intel does whatever their resources and ingenuity allow them...
I would assume that a cpu identification is carried out when the binary is executed, and the result is kept in some state until it is done.
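A sketch of what that identification step could look like, written with the GCC/Clang builtins (ICC generates an equivalent dispatcher automatically from its -ax flags; the path names here are made up):

```c
/* Probe the CPU once, then route to the widest code path it supports.
 * __builtin_cpu_init/__builtin_cpu_supports are GCC/Clang builtins;
 * a multi-path binary does the same thing behind the scenes. */
const char *pick_kernel(void)
{
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return "avx2 path";
    if (__builtin_cpu_supports("sse4.2"))
        return "sse4.2 path";
    return "scalar path";
}
```

In a real dispatcher the result would be cached in a function pointer so the check is paid only once per run.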
https://computing.llnl.gov/?set=code&page=intel_vector
ICC is integrated with Visual Studio, so that your project is still MS, but parts of it will just run faster.
Now, ICC costs money and equipping each project member with that in order to build your code is cumbersome. Setting up and comprehending compilers is unpleasant.
My guess is that games...
My recollection is that you write your code once, tell ICC what set of targets you want to optimize for, and it will generate a binary that automatically chooses the right code path for you.
-k
If the compiler does this automatically, it is not that much more hassle. You need to get a decent compiler and set it up but that is pretty much a given if you want performance anyway.
A different twist is offered by the FFTW library. Say that you want to run FFTs a million times a second for...
Is that not how ARM's scalable vector extensions are designed? Write for a 2048-bit hypothetical target, get the execution of whatever the hw is capable of:
https://www.community.arm.com/processors/b/blog/posts/technology-update-the-scalable-vector-extension-sve-for-the-armv8-a-architecture...
Agreed.
But that is merely a question of technical convenience. Do you explicitly detect hw and code different paths? Do you rely on ICC to do everything for you?
-k
I think that Photoshop/image processing should be an excellent candidate for applications that matter for a reasonable number of users (i.e. quite a lot of people own it, many would like it to be faster). x264. Encryption. Machine learning.
I think it is more interesting to list the...
As ShintaiDK said, it is possible to distribute binaries that follow different code-paths depending on hardware. I think that makes a lot of sense. Now, should there be 2 or 10 code paths, what is the "sweet spot"? Would users accept that their game binary download is 2 GB instead of 512MB only...
Either ICC or assembly (typically open-source projects). Or pushing high-complexity work into 3rd party libraries that may have been compiled this way or another.
-k
Having twice the vector width (AVX vs SSE, or AVX512 vs AVX) should, provided the actual hw resources behind the scenes scale with it, more than offset the slight clock reduction that Intel now more or less automatically applies to AVX code, given that the problem solved by the code maps well to wide...
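With hypothetical clocks plugged in, the arithmetic looks like this:

```c
/* Net gain from a wider vector unit once the AVX clock reduction is
 * accounted for. All inputs are made-up illustration numbers. */
double net_speedup(double width_ratio, double wide_clk_ghz,
                   double narrow_clk_ghz)
{
    return width_ratio * (wide_clk_ghz / narrow_clk_ghz);
}
```

E.g. 2x the width at an AVX clock of 3.5 GHz against a 4.0 GHz base is still a 1.75x net gain, so the clock penalty only erodes, never erases, the width advantage for code that vectorizes well.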
Quirks like that are already difficult on one platform (e.g. Windows on x86 using a single compiler). Even switching to Linux (using the same hardware), default memory philosophy could expose nasty assumptions made by the programmers on the basis of "works for me". An emulator that lacks the...
I would be interested in:
1. Simple low-level tests. How many float32 adds or mults or multiply-accumulate or divs can be carried out per second when data is hot in the cache. Code should be hand-optimized for the architecture.
2. Representative cache/memory efficiency tests. I don't know what...
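A minimal sketch of test 1, assuming nothing beyond the C standard library (a real harness would pin the thread, unroll several independent accumulator chains and read the TSC instead of clock()):

```c
#include <time.h>

/* Time a long run of float adds with the working set hot. The volatile
 * accumulator stops the compiler from folding the loop away, at the
 * cost of forcing a load/store per iteration. Returns millions of
 * adds per second. */
double madds_per_sec(long iters)
{
    volatile float acc = 0.0f;
    clock_t t0 = clock();
    for (long i = 0; i < iters; i++)
        acc += 1.0f;
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    return secs > 0.0 ? iters / secs / 1e6 : 0.0;
}
```

The same skeleton with mult/FMA/div bodies gives the per-operation comparison; the hand-optimized versions per architecture are the part that takes real effort.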
Having a "unified" instruction set would benefit purchasers of the expensive units as well. More software would be optimized for the fancy instructions, and performance would probably be better on a high end cpu for generic software.
This all assumes that Intel can somehow...
I think that Microsoft/Apple/Google are concentrating on higher-level languages/libraries, and that the bulk of "Apps" (as seen by consumers at large) have transitioned from projects where much of the complexity lies in low-level things (i.e. printer drivers, extended memory management), to...
I have two computers:
1. An office computer used (among other things) for Adobe Lightroom. It is an Intel i7 2600 w/12GB of DDR3 and 120GB SSD.
2. A living room computer/HTPC with a core2 duo and 2GB of ram, spinning drive.
Both running Windows 7 64.
I have thought about updating both...
The more I learn about software optimization, the more sceptical I am about cpu benchmarks.
Usually, you are testing a particular software implementation, a compiler and some piece of hardware jointly. Trying to compare two pieces of hardware this way is hard.
My experience is that software...
I think that your thought is interesting.
If my hardware has an overall efficiency boost of 10%, then I expect it to apply to all of my applications (on average). If any one of my applications has a 10% speedup, then that will be only for this single application. Thus, while it might be worth...
I fully agree that for 99% of the applications, 99% of the time, ASM is not the solution. Lots of pain, lots of bugs, and the overall speedup may not be "enough".
The fact that these see performance improvements tells us nothing about how much ASM would have mattered? If some JavaScript app...
Then you are arguing against straw men. I said that _optimal_ ASM will be at least as fast as compiled code. That fact is evident: ASM is a strict superset of compiled C code; it occupies a larger "space". Whatever a compiler does with C code, a team of monkeys and lots of time could (in...
I see that compilers mention all kinds of features, yet it is still possible to beat them in some cases using simple inline asm. Thus the compiler will not always, and cannot always, outperform a dedicated programmer.
Sure, realistically, one cannot expect "optimal" assembler code ala "guaranteed...
I have experienced that gcc chose to pepper my intrinsics with useless memory transfers back and forth. Simply stripping away the nonsense and reusing the output as inline assembly caused a significant speedup. How is that not a point in favour of ASM?
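For reference, the kind of intrinsics code in question (SSE, so x86 only; this is an illustrative stand-in, not the actual code from that experience):

```c
#include <immintrin.h>

/* Four float adds in one SSE instruction. With older gcc at low
 * optimization levels, each intrinsic could end up bracketed by
 * redundant loads/stores; at -O2 this body compiles down to a
 * single addps plus the unaligned load/store pair. */
void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```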
Well, yeah, but I have a harder time...