@CTho9305: Good post - the x86 architecture is complicated enough that you can find strange things everywhere you look (although opcode prefixes and variable-length instructions of up to 15 bytes [theoretically, at least] still win in my book)
But I think you'd agree that you can handle most of it during decode (and by making sure that modern compilers emit only instructions that are heavily optimized, which sidesteps most of the compatibility overhead), and I suspect Intel handles most of it that way as well. One thing's for sure: I wouldn't want to be the person who has to verify their decoders.
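To give a feel for why that verification job is so unenviable, here's a toy Python sketch (my own simplification - nobody's real decoder) of just the length-finding part of the problem: prefixes, the opcode, ModRM, SIB, displacement and immediate all have to be examined before you even know where the next instruction begins.

```python
# Toy sketch (my own, heavily simplified) of x86 instruction-length decoding.
# Real decoders also handle two/three-byte opcode maps, REX/VEX prefixes and
# the 15-byte overall length limit -- this models only a tiny subset.

LEGACY_PREFIXES = {0x66, 0x67, 0xF0, 0xF2, 0xF3,        # operand/address size, lock, rep
                   0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65}  # segment overrides

def insn_length(code: bytes) -> int:
    i = 0
    while i < len(code) and code[i] in LEGACY_PREFIXES:  # any number of prefixes
        i += 1
    has_66 = 0x66 in code[:i]        # operand-size override changes immediate width
    op = code[i]; i += 1             # opcode (one-byte map only in this toy)
    if op == 0x90:                   # NOP
        return i
    if op == 0x05:                   # ADD EAX, imm32 (imm16 when 0x66 is present)
        return i + (2 if has_66 else 4)
    if op == 0x01:                   # ADD r/m32, r32: ModRM, maybe SIB, maybe disp
        modrm = code[i]; i += 1
        mod, rm = modrm >> 6, modrm & 7
        disp = 0
        if mod != 3 and rm == 4:     # SIB byte follows
            sib = code[i]; i += 1
            if mod == 0 and (sib & 7) == 5:
                disp = 4             # SIB with no base register -> disp32
        if mod == 1:
            disp = 1                 # disp8
        elif mod == 2 or (mod == 0 and rm == 5):
            disp = 4                 # disp32
        return i + disp
    raise NotImplementedError(f"opcode {op:#x} not modeled here")

# insn_length(bytes([0x01, 0xD8]))                   -> 2 (add eax, ebx)
# insn_length(bytes([0xF0, 0x01, 0x44, 0x24, 0x08])) -> 5 (lock add [esp+8], eax)
```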
I actually disagree. The decoder can't protect the backend from any of the issues I mentioned, because there are things you can't figure out at decode time.
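To make that concrete with a toy example of my own (not something from the original discussion): the cost of a single load can depend on whether the address it ends up using straddles a cache-line boundary, and that address simply doesn't exist yet at decode time.

```python
# Illustrative toy, not a real model: the "same" load instruction can cost more
# or less depending on the address it uses, which only exists at execution time.
LINE = 64                                        # assume 64-byte cache lines

def load_cost(addr: int, size: int = 8) -> int:
    first_line = addr // LINE
    last_line = (addr + size - 1) // LINE
    return 1 if first_line == last_line else 2   # line-splitting loads cost extra

print(load_cost(0x1000))   # 1 -- nicely aligned
print(load_cost(0x103C))   # 2 -- straddles a line boundary, so it's slower
```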
You could conceptually limit compilers to hardware-friendly operations (emulate the rest, 10X slower), but the risk is that your competitor's compiler will suddenly start emitting operations your chip is deoptimized for and you'll look really bad on the benchmarks they're involved with. People here are welcome to argue that a compiler writer would never be evil enough to do that...but in the real world that concern comes up.
I think the problem in the desktop/laptop space comes down to what you prefer: a 5% more expensive CPU that can run all the programs you already have, or a marginally cheaper CPU for which only a handful of programs are available? I - and I'm pretty sure the majority of people - would take the first. On the other hand, who would write a program for an ARM CPU that doesn't target a smartphone?
With the advance of bytecode interpreters that problem should keep shrinking, but I don't think we're there yet.
The reason people can port games from consoles to PCs is that there are already frameworks that handle most of the complexity - for other kinds of applications those frameworks don't exist.
I don't think everybody can jump to ARM tomorrow, but I see a trend. I didn't even consider bytecode.
As for "programs I already have", much of what I do is in a web browser; outside the browser, I watch TV (well, when it's Hulu that's in the browser too), play games, and use TuxGuitar and Audacity (both are already cross-platform). Browsers exist on every platform, and as previously discussed many games are developed on PowerPC. I think the growing power of web browsers is actually a relatively large threat to Windows and x86. Once it works in Firefox, OS and instruction set become irrelevant.
Power users here may have more applications that are problematic, but as people like me jump to new platforms (e.g. a $200 impulse-buy tablet/netbook/whatever), some will be developers, and they'll solve the "missing application" problems. Really old legacy applications can be addressed with emulation. Heck, Microsoft already uses that solution even for x86 on x86 because software compatibility is imperfect.
Does anyone have information on how x86 decoding scales for smaller cores like Atom and Bobcat?
Would it be safe to assume that less x86 decoder area is needed for a smaller CPU core?
I don't think I can give a satisfying answer because I don't think there's sufficiently-detailed public information to work from. However, I stand by my statement that there's overhead beyond the decoder (and a lot of that doesn't scale down much).
I keep wondering why Atom has such poor performance per watt compared to other laptop and desktop processors. Is it I/O power budget scaling differences, x86 decoder scaling differences, or other differences?
Given the perf-per-watt of Bobcat versus Atom, it would seem that Intel did something wrong for Atom. It's plausible that they started the design aiming at sub-watt operation, realized they blew the budget, and decided to crank the frequency and push it into netbooks in hopes of making money until they could fix the design. When you take an architecture well beyond its design target (either by pushing a server chip into a laptop, or a laptop chip into a server), you end up with horrible perf/watt.
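A back-of-envelope illustration (made-up numbers, not Intel's or AMD's) of the frequency-cranking part: dynamic power goes roughly as C*V^2*f, and hitting a higher clock usually needs a higher voltage, so perf/watt falls off with the square of the voltage.

```python
# Back-of-envelope only, illustrative numbers: dynamic power ~ C * V^2 * f.
def perf_per_watt(freq_ghz: float, volts: float, cap: float = 1.0) -> float:
    perf = freq_ghz                       # pretend performance tracks clock 1:1
    power = cap * volts ** 2 * freq_ghz   # ignore leakage/static power entirely
    return perf / power                   # reduces to 1 / (C * V^2)

print(perf_per_watt(0.8, 0.9))   # design point:        ~1.23 (arbitrary units)
print(perf_per_watt(1.6, 1.2))   # pushed well past it: ~0.69, roughly 45% worse
```

It ignores leakage and assumes performance scales one-for-one with clock, but it shows the direction of the effect: a core pushed well beyond its design point ends up with noticeably worse perf/watt.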