Trying to outsmart yourself can make things worse. For example, using conditional moves to eliminate a branch that is going to be predicted correctly 99 times out of 100 (and so would barely penalize you anyway) makes your code a few bytes bigger and may push it across a cache line boundary.
Yes, that would be a big fail; the whole point of the cache would be lost.
The idea of using part of the cache as local scratchpad memory is also very handy. I remember that long ago, on the Nintendo 64, all interrupt handlers were hand-optimized and permanently kept in part of the cache for fast access. The advantage of having some manual control over part of the cache would be to conserve power: no logic would be needed to check whether that part of the cache (where all the interrupt handlers are stored) holds useless data or instructions and can be refilled with useful ones until it gets flushed again. Thus part of the cache serves as local storage with fast access compared to slow main memory.
I think the Cell processor in the PS3 also has some of these scratchpad RAM abilities.
The biggest limitation of compilers is that they tend to work serially, one statement or block at a time, as independent units. They can't grasp the bigger picture of an algorithm or the abstract purpose of the code they are compiling, think outside the box, and say "oh, this is really what the developer is trying to do; I'm going to completely rewrite this to run fast on CPU X" the way a human can. A compiler that could work like that and see the whole picture would be on par with computer vision and AI.
This indeed. Imagine: whoever finds the secret ingredient to AI will also be able to write perfect compilers. How many possible scenarios can be created depends on how many input variables a system can hold and process in parallel. It is all about parallel input processing, not raw serial processing power. This can also be seen in humans when comparing "normal" people with savant geniuses who can calculate faster than a modern CPU.
@Ross Ridge:
Indeed you are right.
One of the strangest tricks of x86 is register renaming.
Instead of having enough registers to work with, the CPU uses internal swapping of registers to compensate for the lack of architectural registers in the code it is running. With IA-64 this was solved by providing a lot of registers (128, I think). But there seems to be a sweet spot in the number of usable registers: 16 seems to be that magic number (an economy/efficiency trade-off), although it all depends on the instruction set and the compiler.
Here is an idea (which I am sure Mr. Seymour Cray already figured out 50 years earlier):
When thinking about the stack of an MCU or CPU targeted by a C compiler, I would think it a good idea if there were some enhanced capability for shadowing registers during calls: instead of storing registers on a serial stack in memory, the CPU would use a parallel stack, shadow-copying registers by swapping register banks in and out rather than pushing them one by one. In reality both stack systems would be used, but a program with control of these features could, when programmed correctly, gain a large speed advantage while saving a lot of power.
IMHO:
This would give two advantages that conserve power but cost real estate on the die:
1. When enough registers are present, no register renaming is necessary (with x86-64's 16 registers this is largely solved, I think); no renaming also means a simpler pipeline.
2. Swapping register sets would take less time and less energy, because the shadow register banks sit directly on the die; nothing has to be transferred to the cache (also on die, but behind more control lines). Switching is just a matter of toggling internal address/bank-select enable lines.
The power saving comes purely from not having to copy data; the trade-off is that a lot of registers sit idle. If the OS and applications are built with a compiler that takes advantage of such features, there would be both a speed-up and a power saving.
But it is expensive to do, because a lot of transistors stand by doing nothing (static power consumption only, which is already largely solved). With current-generation processors carrying billions of transistors, however, it is not really an issue. Speed and power binning can be done as well: defective registers simply mean less parallel stacking and more serial stacking.