Does anyone even know what MASM is anymore?


Cogman

Lifer
Sep 19, 2000
10,278
126
106
Instead of guessing, let us introduce some example figures:

http://shervinemami.info/armAssembly.html


Now, the example above is surely cherry-picked and hardly representative of a broad application suite, but it does indicate that certain tasks can be done a lot faster by going low-level.

-k

SIMD is certainly a place where compilers tend to fall on their faces. However I would point out that the post is several years old now. Code vectorization has only gotten better in the last few years.
 

knutinh

Member
Jan 13, 2006
61
3
66
SIMD is certainly a place where compilers tend to fall on their faces. However I would point out that the post is several years old now. Code vectorization has only gotten better in the last few years.

I see that the very latest version of GCC (5.1) adds "basic autovectorization" for Skylake Server, contributed by people from Intel:
https://gcc.gnu.org/gcc-5/changes.html

The general auto-vectorizer in gcc seems to have seen little attention since 2011:
https://gcc.gnu.org/projects/tree-ssa/vectorization.html

It seems to me that open-source projects with lots of performance-sensitive multimedia code, built with gcc, tend to use some ASM. I take this as an indicator that those developers do it for a (good) reason:
http://git.videolan.org/?p=x264.git...dd97ba9d4548a9e5ec0763076233da9c561cb;hb=HEAD

-k
 

Cogman

Lifer
Sep 19, 2000
10,278
126
106
I see that the very latest version of GCC (5.1) adds "basic autovectorization" for Skylake Server, contributed by people from Intel:
https://gcc.gnu.org/gcc-5/changes.html

The general auto-vectorizer in gcc seems to have seen little attention since 2011:
https://gcc.gnu.org/projects/tree-ssa/vectorization.html

You are pointing at the project page for the branch that initially introduced autovectorization. That page hasn't seen much activity because the work was folded into the GCC compiler suite itself.
4.3, 4.5, and 4.7 all look to have at least a few significant changes to the autovectorizer, all of which happened after the original author's code. Beyond that, it looks like autovectorization wasn't enabled by default until 4.3 (it wouldn't have been turned on by his -O3 flag).

It seems to me that open-source projects with lots of performance-sensitive multimedia code, built with gcc, tend to use some ASM. I take this as an indicator that those developers do it for a (good) reason:
http://git.videolan.org/?p=x264.git...dd97ba9d4548a9e5ec0763076233da9c561cb;hb=HEAD

-k

I'm not trying to say that GCC has a perfect autovectorizer. In fact, I said the opposite: it is a sore spot. What I am saying is that a 30x improvement probably isn't representative, especially since the C code was compiled with a now rather old compiler without autovectorization enabled.

x264 is a good example of where breaking out the ASM is necessary. Those guys optimized the hell out of x264 and worked to get every last ounce of performance out of it. I wish it were still up, but one of the main authors wrote in depth about doing SSE2 optimizations that beat the pants off what GCC could do, using some really clever register handling.

For x264, where the name of the game is video encoding, it does make sense to break out the asm. That, however, is a corner-case application.

The vast majority of applications aren't spending time in code that would benefit from vectorization improvements. You need to be very math-heavy before everything starts getting bottlenecked by vector performance.
 

WhoBeDaPlaya

Diamond Member
Sep 15, 2000
7,414
401
126
Performance is cheap. I used to program in assembly, but now mainly do Scala, ANTLR, Tcl and Perl for my day job.
 

knutinh

Member
Jan 13, 2006
61
3
66
For x264, where the name of the game is video encoding, it does make sense to break out the asm. That, however, is a corner-case application.

The vast majority of applications aren't spending time in code that would benefit from vectorization improvements. You need to be very math-heavy before everything starts getting bottlenecked by vector performance.
I fully agree that assembly is not the first thing one should consider in a general application. Not the second either. One man's corner case is another's ... though. Video is a pretty common performance concern for many consumers, I should think.

Note that assembly (and intrinsics) is not only about vectorization. The C language lacks expressions that map well to some hardware capabilities (like popcount and saturated arithmetic). And some compilers (gcc) just "randomly" dump registers to/from memory and co-processors.

Compilers have to be very conservative about correctness (as constrained by the language) at the cost of performance. Some compilers offer #pragmas or switches (e.g. -ffast-math) where the programmer explicitly sacrifices generality (e.g. the loop count will always be 2, 4, 6, ...) or precision (e.g. full IEEE floating-point is not needed; give me what you have).

My experience is that GCC lags behind the Intel and ARM compilers when it comes to performance.

-k
 
Last edited:

beginner99

Diamond Member
Jun 2, 2009
5,223
1,598
136
I'm going to speculate, but games usually use one of the pre-existing game engines like Unreal, Frostbite or Unity. I find it plausible that, at least in those engines, the most critical paths found to be bottlenecks were in fact hand-coded in assembly. But no assembly at all would not surprise me either.

But yeah, I once read an article that showed that compilers are often superior to even ASM programmers. In the best cases, if you are lucky, you get a 10-20% increase in speed for a specific section. However, over the total execution time of the program it will maybe be 0.01%, and not at all worth the hassle. Really small critical sections that account for a huge share of the total execution time are IMHO very rare.

And lastly, to the guy that mentioned Java being slow: in fact it isn't. That was true 20 years ago, but now it's often just as fast as C++ (or even faster, due to the JIT). The main problem of Java AFAIK is that there is almost no auto-vectorization (there is in some situations, but it depends on the JVM), or you have to have already optimized your code so the JIT can better identify possible uses of SIMD. But then you're probably faster just using JNI...
http://developers.opengamma.com/articles/DGEMV.pdf
 
May 11, 2008
20,068
1,292
126
Well, it is much easier to program in a higher-level programming language like, for example, C. As an amateur, I like assembly, but the benefits and the reduction in development time when using a higher-level language are clear. Also, compilers are really good at creating optimized code; only in rare cases will hand-written, hand-optimized assembly be better. Usually the best of both worlds can be had by writing in C and then hand-optimizing, in assembly, some specific functions or algorithms that are called often.

Also, programming style can severely handicap performance: badly written C code can really hurt because the compiler is not able to optimize it properly, or a really bad algorithm is used in the C function.
 
May 11, 2008
20,068
1,292
126
I should note that it can be very handy to be able to read assembly even if you only program in C or another language. Sometimes the C compiler has difficulty translating a high-level function into assembly, or truncates values that should not be truncated even when a cast is applied properly. In the embedded world, it has happened that a compiler will not produce the expected code. These kinds of bugs are very rare, but they do happen.
 

knutinh

Member
Jan 13, 2006
61
3
66
But yeah I once read an article that showed that compilers are often superior to even ASM programmers. And in the best cases you get if you are lucky a 10-20% increase in speed for a specific section.
This number gets repeated over and over even though it has been proven wrong. In the best case you do not get a 10% speedup for a specific section; in the best case reported so far in this thread, they got a 30x speedup.

If you want to argue some kind of "average" case, then that number is going to be somewhere between 1.0x (no speedup) and "pick some large number". It all depends: on the hardware, the compiler, the algorithm and the programmer.

If all of the world's software were implemented as "perfect" assembly, the average speedup would probably be moderate, while development time would sky-rocket. There is the 80/20 rule of thumb, saying that 80% of the execution time is spent in 20% of the code lines. If you decide to do something, do it to those lines (and again: assembly should be the last resort).
However in the total execution time of the program it will maybe be 0.01% an not at all worth the hassle. Really small, critical sections that use a huge amount of the total execution time are IMHO very rare.
If you are transcoding a video, the bulk of CPU cycles may be spent in a small section of code. If you are photoshopping, the same may be the case. In a game engine, something similar might be true.

Now, office applications are a very different beast.
And last is to tge guy that mentioned Java being slow. In fact it isn't.
I love it when people throw out "in fact...". AFAIK, Java can be both fast and slow, depending. Choose your tools wisely and all will be good.
That was true 20 years ago but now it's often just as fast (or even faster due to JIT) than C++. Main problem of Java AFAIK is there is almost no auto-vectorization (there is in some situations but depends on the JVM) or you have to already optimized your code so the JIT can better identify possible use of SIMD. But then your probably faster just using JNI...
http://developers.opengamma.com/articles/DGEMV.pdf
My understanding is that Intel writes the very best compilers for Intel hardware. My understanding is (again) that they target Fortran first, then C, then C++, then the rest.

-k
 
Last edited:

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Pac-Man was coded in 24 kilobytes.

If you want to see some really tiny programs check out the Hugi size coding competitions:

http://www.hugi.scene.org/compo/compoold.htm

I like these versus traditional demo coding competitions because they make it all about size rather than about making something artistically interesting or impressive. A reference program is provided for the competition, and the goal is to write one that's functionally equivalent (to within whatever parameters are specified, usually everything but perhaps run time) but as small as possible.
 
Last edited:

Headfoot

Diamond Member
Feb 28, 2008
4,444
641
126
Only people that don't know how to write software think that "write this in assembly" is the solution
 

Ajay

Lifer
Jan 8, 2001
16,094
8,106
136
Only people that don't know how to write software think that "write this in assembly" is the solution

Yeah, and there are at least tens of thousands of firmware developers out there who thank you for this absurdly inapplicable comment.

Sure the vast majority of software engineers have no need for assembly code, but that statement is just silly.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
I gave some examples earlier about how assembly has helped improve overall performance in performance sensitive applications. Here are two examples of embedded programming where writing in assembly was necessary for the program to work reliably at all:

1) Device has an EEPROM with a strict requirement that writes are no more than a certain number of microseconds apart. If this is violated the programming will fail. On the other hand, interrupts coming in at a rate of 10,000 per second plus another at a rate of once per second and UART interrupts that can come in randomly must be serviced. And because of real time requirements a failure to service any of them almost immediately would be catastrophic to the system. Because of this, the worst case for the interrupts had to not exceed the EEPROM write limits. I doubt I would have been able to develop within this constraint without writing in assembly (the processor has a cycle trace buffer that made it a lot easier to verify). Fun fact: we observed the EEPROM failures later when code was running from a PROM, causing the cache miss times to be much higher; this had to be altered by preloading the PROM code into SRAM for execution.

2) A SPI-like protocol was being bit-banged because the processor used lacked a SPI or similar controller. The first clock pulse ended up being way too long because of instruction cache misses (issues like this are a good reason why microcontrollers for realtime applications tend to execute code from uncached embedded memories, but our options for this application were very limited). Unfortunately, the processor had no instructions to preload the instruction cache. What it did have was a diagnostic interface to manually set cache lines and cache tags, which I was able to use to manually construct a preload. This involved writing things in assembly - both to know what instructions to write to the cache and for the diagnostic instructions themselves.
 

Hulk

Diamond Member
Oct 9, 1999
4,377
2,256
136
I have only had experience programming in assembly back in high school on my Atari 800 computer. There were things that could only be done in assembly - like changing color registers between scan lines to "create" more colors that weren't actually addressable outside of the assembly environment. At least to my knowledge. In this example, the problem to be worked around in assembly was that in the second-highest resolution mode - the one all (or nearly all) games were written in - you only had two bits/pixel, or 4 colors. Using this technique you could fake more colors, but only in horizontal bands across the screen, so if the game design fit this template it could have a lot of colors. Also, each of the 4 player-missiles could have a color as well.

Anyway, my point is this: is it possible that many of these types of hardware limitations don't exist today, and that, coupled with the improvement of compilers, this has made assembly kind of obsolete?

I remember marveling at the first IBM computers that could put up 16 colors simultaneously at the then insanely high resolution of 640x480. I had been struggling with 240x192 at 2bits/pixel and there it was 16 colors at VGA resolution with no tricks!
 

knutinh

Member
Jan 13, 2006
61
3
66
Only people that don't know how to write software think that "write this in assembly" is the solution
I don't understand why there are so many stupid comments in this thread. Why don't you approach the subject with a little curiosity, I am sure that someone with a better understanding than yourself could teach you a thing or two?

-k
 

knutinh

Member
Jan 13, 2006
61
3
66
Is it possible that many of these types of hardware limitations don't exist today, coupled with the improvement of compilers that has made assembly kind of obsolete?
I don't know of any current examples that fit with your Atari example.

But things like:
*Count leading zeros (count the '0' bits before the first '1' in an integer)
*Popcount (count the number of '1' bits in an integer)
*Bit-reversal (popular in FFT calculations)

can be done very efficiently in hardware, while there may be no straightforward way to express them in normal programming languages. The solution is to throw in intrinsics that map directly to the instruction in question.

GCC offers "builtins" that try to abstract this capability across platforms, but then you introduce compiler dependencies instead, and there is no guarantee that GCC won't (stupidly) fall back to a generic implementation (possibly slower than your own) on some platform, even though the platform supports the given function in hardware.
http://hardwarebug.org/2010/01/14/beware-the-builtins/

ARM has scalar/vectorized instructions that do intricate combinations of add/subtract, multiply, shift and saturate on integers in a single instruction. If you want those operations to fly (i.e. they are a bottleneck), then you either have to write some assembly/intrinsics, or (effectively) reverse-engineer the compiler's behaviour in order to write the C statements that "happen" to produce the corresponding instructions. Of course, when your code is ported to a new platform, you have to do the work all over.

In "system code", you might come across inline assembly that sets hardware/OS properties that are system specific and typically not available from plain C. If you want to start the cycle-timer on ARM, you need to do stuff like
Code:
    // program the performance-counter control-register:
    asm volatile ("MCR p15, 0, %0, c9, c12, 0\t\n" :: "r"(value));
http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
http://stackoverflow.com/questions/...ram-execution-time-in-arm-cortex-a8-processor
-k
 
Last edited:

Dufus

Senior member
Sep 20, 2010
675
119
101
Why MASM and not ASM in general?

IMO it basically comes down to the right tool for the right job. ASM has its place, as do other programming languages.
 

lamedude

Golden Member
Jan 14, 2011
1,206
10
81
Downloading NASM is a lot of work for a lazy programmer. Since we don't care about Linux performance, we'll save them some work and just let them use MASM.
 

Dufus

Senior member
Sep 20, 2010
675
119
101
There are other assemblers, YASM and FASM for example. FASM is a small download.
 