Let's try this another way: what does NVIDIA's PR material say their optimizer actually does?
Yup, sounds like pretty much the exact same things that a compiler for Itanium's IA-64 does. Oh, and the most important thing given what you're portraying it as:
ALL of these things are influenced by dynamic context! You can't just say "well, both X and Y compilers do this" and conclude they're going to produce the same result. Profiling inputs matter.
Besides that, stuff like "Sinks uncommonly executed computation" doesn't even really make sense in a static context. Another thing you're totally missing is that directly executing (or only lightly optimizing) colder code while applying much more space-expensive optimizations to hot code balances out icache use a lot better than optimizing everything to the same extent.
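To make that concrete, here's a rough C-level sketch (my own made-up example, not anything out of NVIDIA's docs) of what sinking uncommonly executed computation looks like, and why the decision depends on knowing which path is actually rare:

```c
/* Hypothetical illustration of sinking uncommonly executed computation.
 * Suppose profiling shows the error path is taken ~0.1% of the time, so
 * the expensive formatting work is moved (sunk) into that cold path
 * instead of being computed unconditionally on every call. */

#include <stdio.h>

/* Before: the diagnostic string is built on every call, hot or not. */
int process_before(int value) {
    char msg[64];
    snprintf(msg, sizeof msg, "bad value: %d", value);  /* always executed */
    if (value < 0) {                                     /* rarely taken   */
        fputs(msg, stderr);
        return -1;
    }
    return value * 2;
}

/* After: the same work is sunk below the branch, so the hot path only
 * pays for the compare.  A static compiler can only do this if it knows
 * (or guesses) that value < 0 is rare; a dynamic optimizer watches it. */
int process_after(int value) {
    if (value < 0) {
        char msg[64];
        snprintf(msg, sizeof msg, "bad value: %d", value);
        fputs(msg, stderr);
        return -1;
    }
    return value * 2;
}

int main(void) {
    printf("%d %d\n", process_before(21), process_after(21));
    return 0;
}
```

A static compiler has to guess, or lean on stale PGO data, to know the error path is cold; a dynamic optimizer sees the branch behave on the user's actual inputs.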
There are a couple of really big things here that nVidia hasn't said a lot about, but that can make a big difference in trace-based dynamic VLIW approaches like this:
1) Dealing with memory disambiguation/alias prediction in some low-overhead fashion, where heavily faulting blocks are translated back into less aggressively reordered ones (rough sketch after this list)
2) Tying the trace execution mechanism in with branch prediction so traces that cross branches are fetched ahead of time
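For 1), here's roughly the kind of thing I mean, sketched in C rather than translator internals (the counter and threshold are invented for illustration): hoist the load on the bet that it doesn't alias the store, guard it with a cheap check, and fall back to the conservative ordering for blocks that keep faulting:

```c
/* Rough sketch of item 1: the translator hoists a load above a store it
 * believes doesn't alias, guards it with a cheap address check, and counts
 * failures so a heavily faulting block can be retranslated conservatively.
 * The counter and threshold here are made up for illustration. */

#include <stdio.h>

static unsigned alias_misses;          /* per-block misprediction counter */
#define RETRANSLATE_THRESHOLD 16       /* made-up trip point for fallback */

/* Conservative ordering: store, then load, exactly as the source code said. */
int block_conservative(int *dst, const int *src, int v) {
    *dst = v;
    return *src + 1;
}

/* Aggressive ordering: the load is hoisted above the store on the bet
 * that dst and src don't alias; a runtime check repairs the rare case. */
int block_aggressive(int *dst, const int *src, int v) {
    int speculated = *src;             /* hoisted load                    */
    *dst = v;
    if (dst == src) {                  /* alias check: speculation failed */
        alias_misses++;
        speculated = *src;             /* reload the now-correct value    */
    }
    return speculated + 1;
}

int run_block(int *dst, const int *src, int v) {
    if (alias_misses >= RETRANSLATE_THRESHOLD)  /* too many faults: back off */
        return block_conservative(dst, src, v);
    return block_aggressive(dst, src, v);
}

int main(void) {
    int a = 5, b = 0;
    printf("%d\n", run_block(&b, &a, 7));   /* no alias: fast path works */
    printf("%d\n", run_block(&a, &a, 7));   /* alias: check catches it   */
    return 0;
}
```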
There's some good potential for synergy between hardware features and code that's translated in hot traces, things that go well beyond "let's just do the compilation later."
Not "adapt to the nuances of how someone uses a program as they're using it", just optimize once and then use it. Which is typically how I'd define compilation.
If you'd read the documents in depth beyond clipping out a buzzphrase, you'd see that it keeps re-running the optimizer long after new ARM instructions are introduced and repeatedly applies more aggressive optimizations to the hottest code. And if you go back and look at the Transmeta docs (which Denver very clearly evolves from; some key people worked on both), you'll see it had iterative recompilation as well.
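The iterative part is easy to picture. Here's a toy sketch (the tier names and thresholds are mine, not NVIDIA's or Transmeta's) of execution counters driving repeated, progressively heavier optimization of the same block:

```c
/* Toy sketch of iterative recompilation: each translated block carries an
 * execution counter, and crossing successive thresholds triggers another,
 * more aggressive optimization pass over that block.  Thresholds and tier
 * names are invented for illustration. */

#include <stdio.h>

enum tier { INTERPRETED, QUICK_TRANSLATION, OPTIMIZED_TRACE };

struct block {
    unsigned long exec_count;
    enum tier     tier;
};

static const unsigned long PROMOTE_TO_QUICK = 50;      /* made-up numbers */
static const unsigned long PROMOTE_TO_TRACE = 10000;

/* Called every time the block is (re)entered by the dispatcher. */
void maybe_reoptimize(struct block *b) {
    b->exec_count++;
    if (b->tier == INTERPRETED && b->exec_count >= PROMOTE_TO_QUICK) {
        b->tier = QUICK_TRANSLATION;   /* cheap pass: decode + linear schedule */
    } else if (b->tier == QUICK_TRANSLATION && b->exec_count >= PROMOTE_TO_TRACE) {
        b->tier = OPTIMIZED_TRACE;     /* expensive pass: trace formation,     */
    }                                  /* unrolling, aggressive scheduling     */
}

int main(void) {
    struct block b = { 0, INTERPRETED };
    for (unsigned long i = 0; i < 20000; i++)
        maybe_reoptimize(&b);
    printf("final tier: %d after %lu executions\n", b.tier, b.exec_count);
    return 0;
}
```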
When nVidia says "optimize once, use many times" the point they're trying to get across is that for each optimization pass the code will be executed many times, not that any given piece of code will only be translated once.
And even if it DID only translate code once it'd STILL be better than what you're saying because it'd be warmed with profiling data specific to the user's run.
Okay, so patents hint that it might use run-ahead execution on memory stalls, which makes it nothing like Itanium? Sure, we can just ignore all the ways in which they look the same. (Of course, I would be interested to read the patents hinting at that, seeing as memory stalls are a bane of in-order architectures.)
Have you ever used TI's C6x? Does it look the same as Itanium to you too because it's VLIW and in-order? It's nothing even close to the same.
As for your analysis that the 'hard parts' are already done on the ARM binary, I'd tend to disagree, at least in the context of keeping all of Denver's execution resources busy, which I'd expect is the goal, because otherwise there wouldn't be much cause for having that many execution resources. Of course most modern cores do something similar with instruction reordering, but at a finer-grained level than what NVIDIA appears to be describing here.
Look at what any compiler does: most of the work isn't spent in scheduling. It's simply misleading to say that Itanium will have the compilation done up front while Denver will have the compilation done at run time.
And I don't see how you can make the leap from ideal pre-compiled VLIW code being at worst equal to a run-time-generated version, to OoO execution having no value. Sure, ideal pre-compiled code would take care of the instruction-reordering benefit of OoO execution that makes it easier to run multiple instructions in parallel, but it does nothing to protect against cache misses. (Which is the reason Itanium 2 onwards made use of some manner of multi-threading.)
It can provide some latency hiding from cache misses: given a non-blocking cache you can push loads further back from where their results are needed. But there's a disadvantage to doing this uniformly everywhere, because it increases register pressure and code size, especially when you start pushing loads above branches, or duplicating them on both sides of a branch. It can still be worthwhile for blocks that statistically miss a lot, though.
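Sketching that tradeoff in C (made-up example): the hoisted version overlaps the miss with independent work, but the loaded value stays live across the whole function and gets fetched even on the path that never uses it, which is exactly the register-pressure and code-size cost I mean:

```c
#include <stdio.h>

/* Stand-in for a chain of work that doesn't depend on the load. */
static int heavy_independent_work(int x) {
    for (int i = 0; i < 1000; i++)
        x = x * 31 + 7;
    return x;
}

/* Baseline: the load sits right before its use, so a miss stalls here
 * with nothing left to overlap it with. */
int baseline(const int *p, int x, int taken) {
    int t = heavy_independent_work(x);
    if (taken)
        return t + *p;          /* miss latency fully exposed */
    return t;
}

/* Hoisted: the load is pushed back well before its use, so a miss can
 * overlap with heavy_independent_work.  The cost: the value stays live
 * (one more register occupied) for the whole function, and it's fetched
 * even on the not-taken path that never needed it. */
int hoisted(const int *p, int x, int taken) {
    int v = *p;                 /* early, possibly wasted, load */
    int t = heavy_independent_work(x);
    if (taken)
        return t + v;
    return t;
}

int main(void) {
    int data = 42;
    printf("%d %d\n", baseline(&data, 3, 1), hoisted(&data, 3, 0));
    return 0;
}
```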
Retraining blocks against changing trends in branch behavior, even at a coarse grain, is also better than just going with static PGO or, worse, nothing.
I think maybe you should go read some more articles about dynamic optimization, like Dynamo for instance, then start thinking about how much it can be extended when a uarch is designed around these principles.