Does anyone even know what MASM is anymore?

Page 4 - Seeking answers? Join the AnandTech community: where nearly half-a-million members share solutions and discuss the latest tech.

Schmide

Diamond Member
Mar 7, 2002
5,590
724
126
Downloading NASM is a lot of work for a lazy programmer. Since we don't care about Linux performance we'll save them some work and just let them use MASM.

Did it yesterday while playing with FLAC. A couple min tops including the path change and reboot. How lazy can we be?

@SIMD

With regard to vectorizing compilers: they can only go so far. The true gains come not from ASM but from better algorithm design. Those who get to the ASM level usually take the time to properly evaluate their stream and path.

For example: if one interleaves swizzled data, an entire routine can push multiple data elements through the pipeline in one code pass. In addition to guaranteeing independent data, costly operations such as transcendentals and divides are mitigated by this one-to-many relationship. This, however, is really limited to linear fixed-path routines (FFTs, matrix operations, interpolation, etc.).
 

Mike64

Platinum Member
Apr 22, 2011
2,108
101
91
...just been reading through this thread and much of it made sense until I got to this one. ...what on earth are you talking about?

regards, Richard
Pay no attention, it's just an odd bug he has up his butt. For someone who thinks the forum is on its last legs, he sure spends a lot of time posting here.
 

selni

Senior member
Oct 24, 2013
249
0
41
Instead of guessing, let us introduce some example figures:

http://shervinemami.info/armAssembly.html


http://hilbert-space.de/?p=22
grayscale conversion for images


See this for x86:
http://www.agner.org/optimize/

Now, the examples above are surely cherry-picked and hardly representative of a broad application suite, but they do indicate that certain tasks can be done a lot faster by going low-level. This also makes me sceptical about benchmarks applied to new hardware. Without knowing how the developers of 7zip or x264 or whatever wrote and compiled their binary, it is hard to extrapolate from one performance comparison to other applications or to some kind of future performance limit for any given platform.

If your computer is going to do the same limited task a gazillion times, it might be worth the time and effort to speed up that task for the particular hardware. Assembly might be needed, but there is a list of other options that should be tried before jumping to assembly, because assembly takes many man-hours, is error-prone, platform-specific, hard to maintain, etc.

Be sure to know the task that needs to be done and the capabilities of your hardware, and estimate how close to the hardware bounds your current implementation is. If memory throughput is the bottleneck, fiddling with micro-optimization is probably not the solution.

-k

It's important to understand what you're measuring in these sorts of examples.

Autovectorisation was terrible when that first link was written and is only slightly less terrible now - if that's what was done in the first example (there's not much detail), it's not surprising. Consider intrinsics if you want to use vector instruction sets.

The second appears to be due to a GCC bug that's been fixed for years (check the comments).
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Consider intrinsics if you want to use vector instruction sets.

Back when I was first testing it (this was around four years ago, mind you) GCC wasn't even good with NEON intrinsics. I mean, a lot better than what the auto-vectorization would give you, but it'd insert completely unnecessary extra instructions and work to make the scheduling worse than how you naturally ordered it.

Intrinsics are good since they save the programmer from having to perform register allocation and scheduling. But this can sometimes be a valuable step in refining the algorithm/instruction selection. And sometimes the compiler isn't really that good at it.

Granted, you could change what uarch it's optimized for with a different switch vs rewriting it, but I'm not convinced that it makes that much of a difference with modern compilers. And this isn't something you'll want to do very readily unless you're targeting distribution to a fixed or embedded platform.

Although there is one nice extra in using intrinsics for NEON, which is the header file Intel provides to convert it to SSE intrinsics. But this probably results in some pretty suboptimal code a lot of the time, maybe even worse than auto-vectorization sometimes.
 


KCfromNC

Senior member
Mar 17, 2007
208
0
76
This number gets repeated over and over even though it has been proven wrong. In the best case you do not get 10% speedup for a specific section. In the best case reported so far in this thread they got 30x speedup.

I just repeated that same test using a compiler released in this decade. The compiled code is now about 2x slower than pure ASM rather than 30x. Still a win for the ASM code I guess, if you exclude development time, the fact it doesn't actually do the same thing as the C code, automatically getting fast 64-bit code by simply changing a compiler flag, ease of development and maintenance, and so on.
 

Cogman

Lifer
Sep 19, 2000
10,278
126
106
I just repeated that same test using a compiler released in this decade. The compiled code is now about 2x slower than pure ASM rather than 30x. Still a win for the ASM code I guess, if you exclude development time, the fact it doesn't actually do the same thing as the C code, automatically getting fast 64-bit code by simply changing a compiler flag, ease of development and maintenance, and so on.

Frustrating, right? Yet he still spouts off the "30x" improvement like it is a fact.

Compilers are evolving beasts, you can't point at a 10 year old benchmark and say "See, 30x improvement!".

I haven't yet gotten to the point where breaking out ASM would be a good use of time for me. In pretty much all of my performance problems, a smarter algorithm has provided me with the 10x improvements that I'm seeking. Generally, after getting those sorts of gains, doing more work on that particular section of code doesn't result in faster execution. Usually some other piece of code becomes the dominating performance problem, so my time is better spent looking at it rather than wasting more time trying to get a 15x improvement.
 

knutinh

Member
Jan 13, 2006
61
3
66
Frustrating, right? Yet he still spouts off the "30x" improvement like it is a fact.
Well, it is a fact that the guy linked by me did get a 30x improvement.

If you think that I have not provided honest disclaimers, then I would appreciate concrete advice on my disclaimers:
knutinh said:
*the examples above are surely cherry-picked and hardly representative of broad application suite
*In the best case reported so far in this thread they got 30x speedup.
*If you want to argue some kind of "average" case then that number is going to be somewhere between 1.0x (no speedup) and "pick some large number". It all depends. On the hardware, the compiler, the algorithm and the programmer.
*there is a list of other options that should be tried before jumping to assembly. Because assembly takes many man-hours, is error-prone, platform-specific, hard to maintain, etc.

Compilers are evolving beasts, you can't point at a 10 year old benchmark and say "See, 30x improvement!".
Sure you can.

Compilers are evolving, but so is hardware. As long as assembly is a superset of what can (realistically) be expected from a compiler given C code, (_perfectly written_) assembly will AFAICT always be at least as fast as, and usually faster than, compiler-generated code. The speedup may be 1.0x or it may be 100x; that depends on a number of factors. What we are seeing is that hw speed and compiler intelligence are improving to the point of "good enough" for many applications, while software developer salary is more or less constant. Thus, doing large parts of code in assembly makes _less and less_ sense. That cannot be extended to saying that doing asm _never_ makes sense.

In pretty much all of my performance problems, a smarter algorithm has provided me with the 10x improvements that I'm seeking. Generally, after getting those sorts of gains, doing more work on that particular section of code doesn't result in faster execution. Usually some other piece of code becomes the dominating performance problem, so my time is better spent looking at it rather than wasting more time trying to get a 15x improvement.
That generalization may be correct for you and what you (or even most people) are doing, but surely cannot be extrapolated to everyone else. If your computer is spending 90% of its time in an inner-loop forecasting weather or whatever, going to great pains to optimize that loop _can_ be worth it. Doing ASM should (practically speaking) be tried only after a range of other options have been tried.

-k
 
Last edited:

knutinh

Member
Jan 13, 2006
61
3
66
Back when I was first testing it (this was around four years ago, mind you) GCC wasn't even good with NEON intrinsics. I mean, a lot better than what the auto-vectorization would give you, but it'd insert completely unnecessary extra instructions and work to make the scheduling worse than how you naturally ordered it.
That has been my experience as well. GCC does not seem to "understand" that NEON is some sort of coprocessor and that transferring data back and forth can be very expensive in some cases.

There is also the issue of a very large list of compiler switches. Some control generic optimization, while others hint to the compiler about the hardware capabilities. I find that searching this list is quite unintuitive at times. It seems that I am not the only one; these guys are running a genetic algorithm to search for switches:
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4625477&tag=1

An impressive feat that I have seen in others is the ability to work out the algorithm's behavior at the hardware level (either actually or using pen and paper), then fiddle with the C code and compiler options until the compiler-generated code matches (or exceeds) your manual effort. Then you have performance that is in line with your analysis of hw capabilities, you'll have functionally generic code (it will port to other platforms, unknown performance though), and the code will hopefully be readable (although possibly not as clean as a straightforward implementation). It requires a level of sophistication in both C standards and compiler inner workings that I do not possess myself, though.

-k
 
Last edited:

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
One thing I find interesting is comparing common attitudes towards software efficiency improvements vs hardware efficiency improvements.

At this point, new generations of desktop/laptop processors are increasing performance on average roughly 10% per uarch update. Perf/W may be increased somewhat more, but not a lot more. While many people do find something like Ivy Bridge good enough, there was still enough demand for something like Haswell - at least enough to justify what had to have been hundreds of millions of dollars revising the core uarch (as opposed to uncore, packaging, etc). You would think there'd have to be at least some set of programs where that relatively low performance and perf/W improvement was justified. Should it not also be the case that there's some set of software where such incremental improvements would be justified? But for most people, the idea of improving any software's performance by 10% every other year or so is viewed as absurd. Even improving things by 40-50% isn't very interesting, while in the hardware world that's enough to make AMD look like a joke in single-threaded performance in the eyes of many.

Granted, improving software performance doesn't always mean improving perf/W. There could even be cases where perf/W goes down. But generally perf/W will probably improve too, and if that's more the interesting optimization point you're probably not going to get something with great perf/W by accident either.
 

Schmide

Diamond Member
Mar 7, 2002
5,590
724
126
What most of us are saying with regard to the (more than trivial) x speedup is, it just isn't possible anymore. The reason previous generations of software, hardware combinations were able to see such a speedup was because the penalties for missing certain things were so great.

Back in the day, a finely crafted FP-stack algorithm could use the fxch instruction to get operations down to one per tick; now, with out-of-order execution and register renaming, this is more than par for the course.

The margins between small code and data were also enough to alleviate certain penalties that could cost hundreds of ticks. Now the chances of a few bytes here and there overflowing something are just about nil. This isn't going to go farther than the L2, and certainly not to main memory. If that does happen, it certainly isn't the instruction encoding that pushes things to the brink; it's the algorithm.

Nowadays there really isn't an algorithm that is exclusive to ASM, and most, if not all, things can be crafted in a higher-level language without a measurable penalty.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
What most of us are saying with regard to the (more than trivial) x speedup is, it just isn't possible anymore.

Except for the programs where it has been and continues to be, demonstrably. Depending on your definition of trivial, of course.

The compilers themselves also continue to improve in generated code quality, if slowly. You don't see compiler developers today taking the stance that it's no longer possible to do better than what's currently out there and therefore no longer worth trying. If compilers can still beat other compilers, surely there are assembly authors who can too.
 

Schmide

Diamond Member
Mar 7, 2002
5,590
724
126
Depending on your definition of trivial, of course.

I prob could have phrased that better.

In this context and the previous arguments, I was implying less than 2x as trivial, and more than trivial as 3-100x.
 
Last edited:

KCfromNC

Senior member
Mar 17, 2007
208
0
76
That has been my experience as well. GCC does not seem to "understand" that Neon is some sort of coprocessor and that transfering data back and forth can be very expensive in some cases.

The penalties vary depending on the chip you're running on. The compiler might not handle it well, but neither does the ASM - there's only a single code path rather than one ASM function tuned for each microarchitecture. Not sure how this is a point in favor of the ASM code.

There is also the issue of a very large list of compiler switches.

-Ofast is pretty simple to remember.
 
Last edited:

KCfromNC

Senior member
Mar 17, 2007
208
0
76
As long as assembly is a superset of what can (realistically) be expected from a compiler given C code, (_perfectly written_) assembly will AFAICT always be at least as fast, and usually faster than compiler-generated code.

Unless your assembler can automatically do inlining, loop unrolling, function specialization, whole-program optimization, and all of the other tricks modern compilers do, this is not going to generally be the case. Sure, I guess you can do this by hand, but are you going to manually re-do e.g. register allocation each time you inline a function, to fit the structure of the calling code? If not, then the compiler can and will outperform you.

As code gets to any reasonable size, this problem quickly becomes intractable for a human to handle without falling back on patterns which will reduce performance of the ASM code.
 

knutinh

Member
Jan 13, 2006
61
3
66
The penalties vary depending on the chip you're running on. The compiler might not handle it well, but neither does the ASM - there's only a single code path rather than one ASM function tuned for each microarchitecture. Not sure how this is a point in favor of the ASM code.
I have experienced that gcc chose to pepper my intrinsics with useless memory transfers back and forth. Simply stripping away the nonsense and reusing the output as inline assembly caused a significant speedup. How is that not a point in favour of ASM?
-Ofast is pretty simple to remember.
Well, yeah, but I have a harder time remembering this:
-mfloat-abi=name
Specifies which floating-point ABI to use. Permissible values are: ‘soft’, ‘softfp’ and ‘hard’.
Specifying ‘soft’ causes GCC to generate output containing library calls for floating-point operations. ‘softfp’ allows the generation of code using hardware floating-point instructions, but still uses the soft-float calling conventions. ‘hard’ allows generation of floating-point instructions and uses FPU-specific calling conventions.

The default depends on the specific target configuration. Note that the hard-float and soft-float ABIs are not link-compatible; you must compile your entire program with the same ABI, and link with a compatible set of libraries.

-march=name
This specifies the name of the target ARM architecture. GCC uses this name to determine what kind of instructions it can emit when generating assembly code. This option can be used in conjunction with or instead of the -mcpu= option. Permissible names are: ‘armv2’, ‘armv2a’, ‘armv3’, ‘armv3m’, ‘armv4’, ‘armv4t’, ‘armv5’, ‘armv5t’, ‘armv5e’, ‘armv5te’, ‘armv6’, ‘armv6j’, ‘armv6t2’, ‘armv6z’, ‘armv6zk’, ‘armv6-m’, ‘armv7’, ‘armv7-a’, ‘armv7-r’, ‘armv7-m’, ‘armv7e-m’, ‘armv7ve’, ‘armv8-a’, ‘armv8-a+crc’, ‘iwmmxt’, ‘iwmmxt2’, ‘ep9312’.
-march=armv7ve is the armv7-a architecture with virtualization extensions.

-march=armv8-a+crc enables code generation for the ARMv8-A architecture together with the optional CRC32 extensions.

-march=native causes the compiler to auto-detect the architecture of the build computer. At present, this feature is only supported on GNU/Linux, and not all architectures are recognized. If the auto-detect is unsuccessful the option has no effect.

-mtune=name
This option specifies the name of the target ARM processor for which GCC should tune the performance of the code. For some ARM implementations better performance can be obtained by using this option. Permissible names are: ‘arm2’, ‘arm250’, ‘arm3’, ‘arm6’, ‘arm60’, ‘arm600’, ‘arm610’, ‘arm620’, ‘arm7’, ‘arm7m’, ‘arm7d’, ‘arm7dm’, ‘arm7di’, ‘arm7dmi’, ‘arm70’, ‘arm700’, ‘arm700i’, ‘arm710’, ‘arm710c’, ‘arm7100’, ‘arm720’, ‘arm7500’, ‘arm7500fe’, ‘arm7tdmi’, ‘arm7tdmi-s’, ‘arm710t’, ‘arm720t’, ‘arm740t’, ‘strongarm’, ‘strongarm110’, ‘strongarm1100’, ‘strongarm1110’, ‘arm8’, ‘arm810’, ‘arm9’, ‘arm9e’, ‘arm920’, ‘arm920t’, ‘arm922t’, ‘arm946e-s’, ‘arm966e-s’, ‘arm968e-s’, ‘arm926ej-s’, ‘arm940t’, ‘arm9tdmi’, ‘arm10tdmi’, ‘arm1020t’, ‘arm1026ej-s’, ‘arm10e’, ‘arm1020e’, ‘arm1022e’, ‘arm1136j-s’, ‘arm1136jf-s’, ‘mpcore’, ‘mpcorenovfp’, ‘arm1156t2-s’, ‘arm1156t2f-s’, ‘arm1176jz-s’, ‘arm1176jzf-s’, ‘generic-armv7-a’, ‘cortex-a5’, ‘cortex-a7’, ‘cortex-a8’, ‘cortex-a9’, ‘cortex-a12’, ‘cortex-a15’, ‘cortex-a17’, ‘cortex-a53’, ‘cortex-a57’, ‘cortex-a72’, ‘cortex-r4’, ‘cortex-r4f’, ‘cortex-r5’, ‘cortex-r7’, ‘cortex-m7’, ‘cortex-m4’, ‘cortex-m3’, ‘cortex-m1’, ‘cortex-m0’, ‘cortex-m0plus’, ‘cortex-m1.small-multiply’, ‘cortex-m0.small-multiply’, ‘cortex-m0plus.small-multiply’, ‘exynos-m1’, ‘marvell-pj4’, ‘xscale’, ‘iwmmxt’, ‘iwmmxt2’, ‘ep9312’, ‘fa526’, ‘fa626’, ‘fa606te’, ‘fa626te’, ‘fmp626’, ‘fa726te’, ‘xgene1’.
Additionally, this option can specify that GCC should tune the performance of the code for a big.LITTLE system. Permissible names are: ‘cortex-a15.cortex-a7’, ‘cortex-a17.cortex-a7’, ‘cortex-a57.cortex-a53’, ‘cortex-a72.cortex-a53’.

-mtune=generic-arch specifies that GCC should tune the performance for a blend of processors within architecture arch. The aim is to generate code that run well on the current most popular processors, balancing between optimizations that benefit some CPUs in the range, and avoiding performance pitfalls of other CPUs. The effects of this option may change in future GCC versions as CPU models come and go.

-mtune=native causes the compiler to auto-detect the CPU of the build computer. At present, this feature is only supported on GNU/Linux, and not all architectures are recognized. If the auto-detect is unsuccessful the option has no effect.

-mcpu=name
This specifies the name of the target ARM processor. GCC uses this name to derive the name of the target ARM architecture (as if specified by -march) and the ARM processor type for which to tune for performance (as if specified by -mtune). Where this option is used in conjunction with -march or -mtune, those options take precedence over the appropriate part of this option.
Permissible names for this option are the same as those for -mtune.

-mcpu=generic-arch is also permissible, and is equivalent to -march=arch -mtune=generic-arch. See -mtune for more information.

-mcpu=native causes the compiler to auto-detect the CPU of the build computer. At present, this feature is only supported on GNU/Linux, and not all architectures are recognized. If the auto-detect is unsuccessful the option has no effect.

-mfpu=name
This specifies what floating-point hardware (or hardware emulation) is available on the target. Permissible names are: ‘vfp’, ‘vfpv3’, ‘vfpv3-fp16’, ‘vfpv3-d16’, ‘vfpv3-d16-fp16’, ‘vfpv3xd’, ‘vfpv3xd-fp16’, ‘neon’, ‘neon-fp16’, ‘vfpv4’, ‘vfpv4-d16’, ‘fpv4-sp-d16’, ‘neon-vfpv4’, ‘fpv5-d16’, ‘fpv5-sp-d16’, ‘fp-armv8’, ‘neon-fp-armv8’, and ‘crypto-neon-fp-armv8’.
If -msoft-float is specified this specifies the format of floating-point values.

If the selected floating-point hardware includes the NEON extension (e.g. -mfpu=‘neon’), note that floating-point operations are not generated by GCC's auto-vectorization pass unless -funsafe-math-optimizations is also specified. This is because NEON hardware does not fully implement the IEEE 754 standard for floating-point arithmetic (in particular denormal values are treated as zero), so the use of NEON instructions may lead to a loss of precision.

-mfp16-format=name
Specify the format of the __fp16 half-precision floating-point type. Permissible names are ‘none’, ‘ieee’, and ‘alternative’; the default is ‘none’, in which case the __fp16 type is not defined. See Half-Precision, for more information.
 

knutinh

Member
Jan 13, 2006
61
3
66
Unless your assembler can automatically do inlining, loop unrolling, function specialization, whole program optimizations and all of the other tricks modern compilers do this is not going to generally be the case. Sure, I guess you can do this by hand but are you going to manually re-do e.g. register allocation each time you inline it a function to fit the structure of the calling code? If not, then the compiler can and will out perform you.
I see that compilers mention all kinds of features; still, it is possible to beat them in some cases using simple inline asm. Thus the compiler does not always, and cannot always, outperform a dedicated programmer.
As code gets to any reasonable size, this problem quickly becomes intractable for a human to handle without falling back on patterns which will reduce performance of the ASM code.
Sure, realistically, one cannot expect "optimal" assembly code, à la "guaranteed to be the fastest execution possible for the given hardware". If the inner loop is small enough and represents a high enough percentage of the total application execution time, it might still be worth studying it real hard, reading the compiler output, messing with compiler switches, pragmas and intrinsics, and (as a last resort) attempting to write your own.

As long as you have a good/relevant timing testbench, you can have some confidence that if your asm measures 2x faster than the compiler, then it really is 2x faster than the compiler.

People can and will outperform the compiler for certain tasks, no matter what you call the strategies they use. This is my main point: asm is a superset of what compilers can generate (given infinite time and wisdom, anything is possible in asm, while compilers are limited by the language and by limited knowledge about the job that needs to be done). Thus, the best-case performance for asm is >= the best-case performance for compiled code.

You are right that best-case performance may not be relevant for a given problem, as it may mean using more resources on a problem than makes economic sense. For many real-world problems, the more relevant metric is the expected performance for finite time, finite wisdom implementations. In that case, programmers have uniformly voted that higher level languages are to be preferred for the vast majority of cases (but not all).


Look, let's agree that asm is _nearly always_ not the right solution. I am simply reacting to those who say that it is _exactly always_ not the right solution. That is patently wrong.

-k
 
Last edited:

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
I prob could have phrased that better.

In this context and the previous arguments, I was inferring less than 2x as trivial. More than trivial 3-100x.

Right, okay. So a ~2x performance improvement is trivial if it comes from software changes, but absolutely earth shattering for a hardware update.
 

Schmide

Diamond Member
Mar 7, 2002
5,590
724
126
Right, okay. So a ~2x performance improvement is trivial if it comes from software changes, but absolutely earth shattering for a hardware update.

Yup. Please take what I said in context. Do not pull one figure and try and (ironically) trivialize it.

I was explaining why the really high improvements were available in the past and would not really be available today. If you want to argue this please comment on the points I made in the original post. Do not just focus on the numbers.
 

zir_blazer

Golden Member
Jun 6, 2013
1,184
459
136
Actually, I would believe that today there is a lot of untapped performance. As I stated before, you have a whole bunch of APIs and HALs to deal with, none of which existed back in the DOS era. And even if compilers are smarter, you also have a whole bunch of new instruction set extensions (SSE1-4, AVX1-2, etc.). Complexity has also increased drastically, and compilers can only be as good as the humans that wrote them. You are expecting too much out of the compiler, and too little of the human brain.

And yes, I agree with the previous comment. If someone says that the next processor generation has 10-15% more performance, it is considered a great achievement, but if someone says that he can achieve those results using ASM, people think that it's a waste of time - even if you could get that speedup in your current system. Considering how little performance we get from each generation and how complex the hardware itself has become, I would actually think that by going back to the metal, we could get vastly increased performance without the need to wait for faster hardware. As single-threaded performance seems to be hitting a diminishing-returns hard cap, it's the programmers' turn to optimize the software. Those things that can't scale by throwing more cores at them and need the highest possible per-core performance should eventually be hand-tuned in ASM.
 

Cogman

Lifer
Sep 19, 2000
10,278
126
106
Actually, I would believe that today there is a lot of untapped performance. As I stated before, you have a whole bunch of APIs and HALs to deal with, none of which existed back in the DOS era. And even if compilers are smarter, you also have a whole bunch of new instruction set extensions (SSE1-4, AVX1-2, etc.). Complexity has also increased drastically, and compilers can only be as good as the humans that wrote them. You are expecting too much out of the compiler, and too little of the human brain.

And yes, I agree with the previous comment. If someone says that the next processor generation has 10-15% more performance, it is considered a great achievement, but if someone says that he can achieve those results using ASM, people think that it's a waste of time - even if you could get that speedup in your current system. Considering how little performance we get from each generation and how complex the hardware itself has become, I would actually think that by going back to the metal, we could get vastly increased performance without the need to wait for faster hardware. As single-threaded performance seems to be hitting a diminishing-returns hard cap, it's the programmers' turn to optimize the software. Those things that can't scale by throwing more cores at them and need the highest possible per-core performance should eventually be hand-tuned in ASM.

Meh, network, disk, and memory access have a much larger negative impact on performance than various APIs and abstractions do. If you are looking for higher performance, your first and best bet is to optimize one or all of those things before even thinking about ASM. Most applications have little to no optimizations that take those things into consideration. Heck, many programmers don't even know how to run a profiler. It often surprises me how little horse sense many programmers have when it comes to performance.

ASM isn't important when it comes to performance. It isn't even the 10th step when you are looking at optimizing something. Don't believe me? Then look at how little ASM there is in performance-critical things such as V8, SpiderMonkey, the Linux kernel, and most VMs. Yet these things are constantly seeing performance improvements and gains.

These are things that have performance at the top of their priority lists, yet they don't write things in ASM. Why? Because the gains are minimal/nonexistent for most application logic. On top of that, it excludes the application from doing compiler optimizations on the ASM block.
 

Schmide

Diamond Member
Mar 7, 2002
5,590
724
126
On top of that, it excludes the application from doing compiler optimizations on the ASM block.

Even more, inserting an ASM block often requires a strict calling convention that forces the compiler to push variables onto the stack, while routines in the compiler's native language can retain register precedence over the function's call scope.
 

KCfromNC

Senior member
Mar 17, 2007
208
0
76
I have experienced that gcc chose to pepper my intrinsics with useless memory transfers back and forth. Simply stripping away the nonsense and reusing the output as inline assembly caused a significant speedup. How is that not a point in favour of ASM?

I'm pretty sure that bug has been fixed for years. And I certainly wasn't seeing the problem in vectorized C code, at least in the example in this thread.

Well, yeah, but I have a harder time remembering this:

I didn't need to worry about any of these to get the C code to within 2x of the ASM code.

And are you seriously using half-precision FP types in your code? Or switching floating point ABIs in different binaries?

Are you tuning your ASM code for each of the microarchitectures listed in the docs? If not, then the fact you can by changing a single command line option is a point in favor of high level code. If you are, I guarantee it is a lot more work than changing cortex-a53 to cortex-a57 on the command line.

I see that compilers mentions all kinds of features, still it is possible to beat them in some cases using simple inline asm. Thus the compiler will not always and cannot always out perform a dedicated programmer.

Never said it could. I was just disputing the claim that ASM will always be at least as fast as higher level languages. Many of the cool tricks compilers play are simply too difficult to do by hand at any significant scale.
 
Last edited:

Cogman

Lifer
Sep 19, 2000
10,278
126
106
Even more inserting an ASM block often requires a strict calling convention that forces the compiler to push variables onto the stack while routines in the compilers native language can retain register precedence over the function's call scope.

Yup. I was thinking more along the lines of loop unrolling, function inlining, and many const deductions.

Beyond that, the compiler can do a whole bunch of neat tricks with memory and register allocation that are really hard to do as a programmer (and easy to get wrong). Using profile-guided optimizations, the compiler can even do crazy things like reordering conditionals to lay out branching logic optimally for the CPU branch predictor.
 