My big question: How did virtualization and per core server licensing factor into that strategy?
Not all that much, at least not together. More cores for virtualization without oversubscription is not a bad way to go. However, I doubt per-processor licensing has affected, does affect, or ever will affect how either AMD or Intel designs their CPUs.
The market has choices, so companies that balk at the cost will begin to move away from vendors with extortionate licensing. Such licensing is a managerial decision, an attempt to extract higher margins from companies with deeper pockets, who have more to lose by rocking the boat than they stand to gain by avoiding higher software costs.
So (using Oracle) the old 12-core MC Opterons were twice as expensive to license as hex-core Intel Xeons.
Meanwhile, if you were even mildly worried about that sort of thing, you would never have done more than politely accept the HPOC's business card. To want Oracle in the first place, you need software licensing costs to be pretty low on your priority list. Microsoft also wants to extract high margins from those types of customers, yet they need to avoid pissing off the smaller customers who find coupling hardware performance to software cost a strange idea.
All in all, software licensed per processor core is a medium-to-big-business, vertical-market thing, not the common case. Where it is common, the software in question will tend not to account for a large portion of the servers' TCO. What's $100k/yr for software when the employees supporting it cost you a few million, the customers using it make you a few million, or moving to something new would cost you a few million?
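To put rough shape on that point, here's a toy calculation; every number in it is invented purely for illustration:

```
# Back-of-the-envelope TCO sketch with made-up numbers, just to
# illustrate the proportions described above.
license_cost_per_year = 100_000      # hypothetical per-core licensing bill
staff_cost_per_year   = 3_000_000    # a handful of admins/DBAs/developers
migration_cost        = 3_000_000    # rough cost of moving off the platform

total_run_cost = license_cost_per_year + staff_cost_per_year
print(f"License share of yearly run cost: {license_cost_per_year / total_run_cost:.1%}")
# -> roughly 3%: doubling the license fee barely moves the needle,
#    while a migration would take decades of license savings to pay back.
print(f"Years of license fees to equal one migration: {migration_cost / license_cost_per_year:.0f}")
```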
The way I see things now, the "per core" server software licensing issue wasn't an issue with MS SQL at the time.
MS said they wouldn't go per-core, so it wasn't a problem until they announced licensing changes that would include per-core options. Some businesses will be corner cases under the new licensing, after expecting no worse a licensing option than per-socket plus CALs. Per-socket is generally accepted, because more than 2 sockets per server is quite uncommon for a smaller business to bother with.
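A quick sketch of how the two models diverge as core counts climb; all prices below are invented, not actual Oracle or Microsoft list prices:

```
# Hypothetical prices, purely to show how per-socket and per-core
# licensing diverge as cores per socket double.
per_socket_price = 7_000    # per-socket license (invented)
per_core_price   = 2_000    # per-core license (invented)
cal_price        = 200      # per client access license (invented)
clients          = 50

for sockets, cores_per_socket in [(2, 6), (2, 12)]:
    socket_model = sockets * per_socket_price + clients * cal_price
    core_model   = sockets * cores_per_socket * per_core_price
    print(f"{sockets}x{cores_per_socket} cores: "
          f"per-socket+CAL = ${socket_model:,}, per-core = ${core_model:,}")
# Per-socket cost stays flat as cores per socket double;
# per-core cost doubles right along with the core count.
```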
But could virtualization have helped a "High IPC CPU core" strategy regain any losses suffered from Pollack's rule?
No. Virtualization lets you gain from any performance improvements when you have servers that, on average, don't need all the hardware you can give them, or for which the manageability is useful. Why buy 10 servers when you can have a single server do all the same work? In the process, you also gain the ability to take live system backups, and the ability to bring any of those servers back online on another physical computer in a very short time frame, either as part of load balancing or to get back online after TSHTF.
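The consolidation math is easy to sketch; the wattages and electricity price below are assumptions, not measurements:

```
# Toy consolidation math with assumed wattages: ten lightly loaded
# boxes vs. one bigger box running the same ten workloads as VMs.
idle_heavy_server = 150   # watts, assumed draw of a mostly idle 1U box
big_server_loaded = 600   # watts, assumed draw of one consolidated host
hours_per_year    = 24 * 365
price_per_kwh     = 0.15  # assumed electricity price

before = 10 * idle_heavy_server * hours_per_year / 1000 * price_per_kwh
after  = big_server_loaded * hours_per_year / 1000 * price_per_kwh
print(f"Ten boxes: ${before:,.0f}/yr, one host: ${after:,.0f}/yr")
# And that's before counting the nine sets of fans, PSUs, rack slots,
# and failure domains you no longer own.
```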
As you can see from the chart, a much higher IPC AMD core design would have suffered a performance per watt penalty.
Only compared to an ideal narrower core, or with code that perfectly scales out to many cores, would an extremely wide core designed to extract very high ILP suffer that penalty. Reality keeps those from being common, so the moderately bigger core, with moderately better IPC, will end up with better performance per watt across a whole task's execution (which may, especially in servers, encompass many programs, hence not limiting the statement to WPP), with modest increases in core counts over time. It's a compromise, and there is some point where a beefier core becomes more detrimental than beneficial. Scaling is not perfect, so single threads matter, and will continue to matter.
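For a rough feel of that compromise, here's a sketch pitting Pollack's rule (single-thread performance scales about as the square root of core area) against Amdahl's law; the area budget and parallel fractions are illustrative assumptions:

```
import math

# Pollack's rule sketch: single-thread performance ~ sqrt(core area),
# power (very roughly) ~ area. Amdahl's law supplies the many-core side.
def speedup(parallel_fraction: float, n_cores: int) -> float:
    return 1.0 / ((1 - parallel_fraction) + parallel_fraction / n_cores)

area_budget = 4.0  # normalized: one jumbo core vs four baseline cores
jumbo_core_perf = math.sqrt(area_budget)          # ~2x per thread
for p in (0.5, 0.9, 0.99):
    quad_perf = speedup(p, 4)                     # four 1x cores
    winner = "jumbo" if jumbo_core_perf > quad_perf else "quad"
    print(f"parallel fraction {p:.2f}: jumbo {jumbo_core_perf:.2f}x "
          f"vs quad {quad_perf:.2f}x -> {winner}")
# At p=0.5 the big core wins outright; only nearly perfect scaling
# lets the four small cores pull ahead -- hence "moderately bigger".
```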
This is accompanied by a decrease in frequency due to the increased width.
Except that frequency has of late been dominated by power consumption, so there's an effective speed ceiling. Or, more accurately, there are points on the speed-versus-power curve that they just can't cross if they want people to buy the chip. Bulldozer, for instance, has already been overclocked very high, but it is too hot for normal users at those speeds, especially in areas with high electricity costs.
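A crude model of that ceiling: dynamic power goes as C·V²·f, and near the top of the curve voltage has to rise roughly with frequency, so power grows roughly as f³. The baseline numbers below are invented, not Bulldozer's actual specs:

```
# Rough f^3 power scaling near the top of the voltage/frequency curve.
base_freq_ghz, base_power_w = 3.6, 125.0   # assumed baseline chip
for target in (3.6, 4.2, 4.8, 5.4):
    scale = target / base_freq_ghz
    est_power = base_power_w * scale ** 3   # crude f^3 model
    print(f"{target:.1f} GHz -> ~{est_power:.0f} W")
# A ~50% overclock lands in the 400+ W range: fine for an enthusiast
# with exotic cooling, a non-starter for a normal user's power bill.
```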
However, it would not have suffered an Amdahl's law penalty, and probably not much of a turbo penalty (due to a greater amount of TDP already being concentrated in each CPU core).
That brings us to the effect of virtualization on wider, higher-IPC cores. Could running more "OS instances" on each of these hypothetical AMD Jumbo cores have improved CPU utilization... helping to regain the performance per watt lost to Pollack's rule?
Each virtual instance gets its own processes, which in turn get their own memory spaces and their own threads. So, depending on workload and hypervisor OS, you'll take a minor hit in performance (or not), and otherwise get the effect of several separate servers running in one box, which uses less power than many boxes and adds useful features for your sysadmins. Virtualization is about consolidating and managing computer resources. It adds its own set of performance quirks, which are worth worrying about, but a faster CPU is a faster CPU is a faster CPU. You can't get more out of a processor's functional units with virtualization.
For example, I have read that 10 OS instances can be run on a Quad core Server CPU.
You can run about as many as you have RAM for. It's just that each virtual server you add beyond the point where each can have cores to itself effectively reduces the peak performance of every virtual server, should they all be highly active at the same time.
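As a sketch of that sizing rule, with assumed RAM figures:

```
# Rough VM-sizing arithmetic under the "RAM is the hard limit" rule.
host_ram_gb, host_cores = 64, 4
vm_ram_gb = 4                      # assumed per-guest footprint
hypervisor_reserve_gb = 4          # assumed overhead for the host itself

max_vms = (host_ram_gb - hypervisor_reserve_gb) // vm_ram_gb
print(f"RAM fits ~{max_vms} guests on {host_cores} cores")
# 15 guests on 4 cores is fine while they're mostly idle; if all of
# them get busy at once, each sees roughly 4/15 of a core's peak.
```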
What would happen if those CPU cores became even wider, with even greater IPC? Could a greater number of "OS instances"... say 16 to 20... be run on that hypothetical high-IPC quad core? (increasing "CPU core saturation" <-- a term I just made up for lack of a better one)
Scheduling many instructions effectively is hard (i.e., more R&D costs, more die space, more power consumption, and beyond some point a miracle is needed on top of all that). As the core is widened, it only gets harder. As the core has to run ever more cycles out from memory, it will need deeper instruction windows, making it harder again; then it will need deeper structures to handle instruction completion and writing data out to cache, then the same for main memory; then more cache is needed, which must be made faster to be effective, or the core must be beefed up even further to handle the longer latencies of a larger, farther cache. And such features tend to induce penalties when things don't go perfectly, or when your program's data accesses don't conform to the CPU designers' targeted common cases.
All that while, nothing gets done about the times when the CPU is sitting around waiting for something to do, which happens pretty much all the time. At those times, the wider and deeper structures get you nothing but more leakage power to deal with, and they add more cycles of latency all over the place, potentially increasing the chance of the CPU waiting around, which is exactly what you were trying to prevent! The answer is to add complexity to other parts of the CPU to reduce those unhidable latencies (IMC, fast on-die networking, cache coherency protocol improvements, larger/better prefetcher and branch predictor histories, etc.), which, well, still makes the thing bigger, hotter, and costlier all around.
After all that, IPC returns diminish for scalar processing, such that even idealistic simulations with extremely large instruction windows and perfect caches have trouble getting higher than ~6-8 IPC, even on what would seem like ideal high-ILP program loops. Much code simply does not have high ILP to begin with, and when that is the case, all you can do is try to reduce latencies. You can't increase IPC when the ILP is not there. Realistically, an IPC of 2 has historically been a good spot to be at, and 4 a good brick wall to try to reach.
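A toy illustration of that ILP ceiling: IPC on any machine, however wide, is bounded by instruction count divided by the length of the longest dependency chain. The five-instruction "program" below is invented:

```
# Why you "can't increase IPC when the ILP is not there": each entry
# is an instruction and the instructions whose results it waits on.
deps = {
    "a": [],            # a = load x
    "b": ["a"],         # b = a + 1
    "c": ["b"],         # c = b * 2      <- one long serial chain
    "d": [],            # d = load y     <- independent work
    "e": ["d"],         # e = d + 3
}

def path_len(i):  # longest dependency chain ending at instruction i
    return 1 + max((path_len(d) for d in deps[i]), default=0)

critical_path = max(path_len(i) for i in deps)
print(f"ILP upper bound: {len(deps) / critical_path:.2f} IPC")
# 5 instructions / 3-long chain = 1.67 IPC max, even on an
# infinitely wide machine with perfect caches.
```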
If you used SMT to try to fill all of those execution units with more threads, you'd still have all the complexity (power consumption, R&D costs) and latencies to deal with, and would still be fighting the negatives of SMT. If you implement SMT ideally, to prevent that situation (i.e., Niagara), then each thread suffers too much compared to faster CPUs made primarily to run single threads.
All put together, TANSTAAFL.
Big, wide RISC cores with many GPRs and perfect compilers, running nothing but loopy code, were going to rule the world. In the end, it just doesn't work for general-purpose computing. Which flows right into...
In reference to post #133, I just wonder if Itanium is part of the reason Intel doesn't want to provide a large increase in IPC or single threaded performance to x86? (Too much overlap between x86 and Itanium CPU performance could lead to an even stronger reduction in Itanium sales)
No. Itanium was destined to be a mass-market failure. It succeeded in helping to kill off PA-RISC and Alpha, due in part to management being convinced that Itanium was going to be good, which it is not. Everything bad about RISC went in, and perfect compilers did not come out. Intel has, over the last several years, been implementing useful RAS features from IA64 in high-end x86 parts. If you were building your software systems today from nothing, x86 offers all the RAS you'll need (and what it doesn't have, you can make up for in software).
Intel doesn't want to provide a large increase in IPC or single-threaded performance because even they, the mighty giant Intel, lack the infinite funds, manpower, and creativity it would take (unsolved engineering and math problems often need more than money and time), and, as a business, they like making money. They have to choose a point within the good/fast/cheap triangle, just like everybody else, and try to make what customers want without bankrupting themselves, just like everybody else (well, maybe not AMD). As it stands now, the fastest general-purpose computer processors in the world are x86 CPUs designed and manufactured by Intel.