AMD: It Won't Be About 'AMD vs. Intel' Anymore


cbn

Lifer
Mar 27, 2009
12,968
221
106
95W CPU in CPU intensive applications = 95W all to CPU

It would be interesting to find out how many MHz that would net.

I have read that dynamic power consumption is proportional to the square of the operating voltage and scales linearly with the clock speed.

With these little CPU cores already pushing high clocks (in stock configuration), how much clock speed is really left in that TDP budget?
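For reference, the usual approximation for dynamic power is P ~ C*V^2*f, and since higher clocks generally need more voltage, power ends up growing closer to the cube of frequency. A rough back-of-the-envelope sketch of the headroom question (every wattage and clock number below is a made-up placeholder, not real chip data):

Code:
# Rough back-of-the-envelope: dynamic power ~ C * V^2 * f.
# If voltage has to rise roughly linearly with frequency, power grows ~f^3.
# All numbers are made-up illustrations, not real chip data.

def max_freq(p_budget_w, p_base_w, f_base_ghz, v_scales_with_f=True):
    """Estimate the clock a given power budget allows, relative to a baseline."""
    ratio = p_budget_w / p_base_w
    exponent = 3.0 if v_scales_with_f else 1.0   # V^2*f -> ~f^3, else f only
    return f_base_ghz * ratio ** (1.0 / exponent)

# Hypothetical baseline: 77 W at 3.5 GHz with all of it going to the cores.
print(max_freq(95, 77, 3.5))          # ~3.75 GHz if voltage must rise too
print(max_freq(95, 77, 3.5, False))   # ~4.3 GHz in the (unrealistic) f-only case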
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
More cores? Why do we need more cores?
To increase performance by a means that is affordable and known to be possible. It's not like anyone can magically make faster single-threaded processors by making them bigger and hotter. Those properties are the consequences of making them faster, because with slow memory, faster basically means being able to perform more complicated state changes in parallel over some period of time.

While they could surely make them a bit faster per clock than today, if power use weren't such a concern...well, power use is a major concern, so you can forget about that. High thermal density increases failure rates (in theory, you can design around it, but in reality computers blow capacitors and VRMs), people like quiet, people like long battery life, and most of the world pays a pretty penny for their kilowatts. On top of that, excessive heat to exhaust from servers finally reached a point where processor and storage density started taking a back seat to making air conditioning easier and cheaper, on the medium/big-business side of things.

One thing that concerns me about Intel and AMD is that they promote and/or use additional non-CPU technologies to do tasks that were originally performed on the CPU. In the case of AMD, it is the GPU that will replace CPU cores for transcoding.....with Intel it is "Quick Sync".

Yet they want to sell us on the idea of more CPU cores?
Quick Sync is an ASIP. It's only good for transcoding quickly. The GPGPU is good for performing vector operations over a large serial data stream. The CPU needs to be good for anything that fits on the tape. They are not swappable. Fixed-function hardware is faster at what it is made for, being simpler due to not having to do much else. GPGPU is somewhere in between the two.

This makes me wonder what is going to happen with future Intel enthusiast sockets (i.e., LGA 2011 successors)? Are we going to see those turn into "many core" arrangements without an IGP? <----Just wondering how many people are going to want "many core" ATX towers if other (non-CPU) techs are trying to obsolete this concept?

Surely Intel has a much higher-IPC "non mobile" CPU core on the horizon (specifically for the enthusiast socket)? With these small nodes they certainly have the silicon die space to pull this off if the processor design does not include an IGP.
Sure, but what about paying all the engineers to do the R&D? R&D costs a lot of money not just because implementing the final design is tedious, but because they will also need to verify what good ideas won't work for them, and what ideas that look good on paper suck in reality. The cost and time needed keeps going up and up and up, because it is not easy to do. It can be done, and Intel is making their CPUs ever faster per clock, but they can't just make a bigger one and watch it go really fast. Rather, they can spend a ton of money, and make one a bit faster, and watch it be bigger and maybe hotter. But, hotter means slower, today, which only adds to the difficulty.

Before we say the market won't pay more, I'd like to see some analysis done on the potential for adding an additional level of AMD x86 Server products.
AMD can't sell the levels they have now (not much, anyway). They probably have one too many, and if they stick with the x86 server market (please do, please do, please do), they should probably drop down to one server socket and be done with it.

In the past, people would pay more for AMD server CPUs, and Intel's, for that matter. AMD was able to command high prices for their faster Opterons until the Core 2 ones came out, and then were still able to keep somewhat of a lead until the Nehalem ones, due to the Core 2s being limited by the FSB. Today, people hardly want free Opterons.

One thing I noticed in my brief reading of "per core" server licensing is that adding additional threads with SMT does not count as additional cores. For example, a large/wide high IPC AMD CPU core with four way SMT would still count as only one core.
SMT offers very little, thus won't count as anywhere near a whole set of execution units. Even now that Intel has gotten HT working very well, typical gains from using it are <20%, and OLTP and OLAP are just the sorts of work where it may be eschewed in favor of per-thread performance. Also, again, per-core licensing doesn't matter to the overwhelming majority of the market.
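To put rough numbers on the licensing angle (everything here is made up, just to show the scale):

Code:
# Rough numbers only: per-core licensing counts physical cores, SMT threads
# ride along "free", but they add only a modest fraction of a core's work.

smt_uplift = 0.20          # generous upper bound for HT gains, per the above
cores = 4                  # hypothetical licensed core count

without_smt = cores * 1.0
with_smt = cores * (1 + smt_uplift)

print(f"{cores} licensed cores, SMT off: {without_smt:.1f}x")
print(f"{cores} licensed cores, SMT on:  {with_smt:.1f}x (not {cores * 2}x)")
# ~4.8x vs 4.0x: a nice freebie under per-core licensing, but nowhere near
# counting each hardware thread as another core.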

What I feel is needed is a higher-IPC design with greater width (with the appropriate compensations engineered into the CPU to make the design work).
http://en.wikipedia.org/wiki/Alpha_21464

Before compilers got good enough not to need many registers, that was the future (oh, the irony: the prediction was that compilers would make better use of many GPRs; the predicted advancements did occur, but had the opposite result: wasted GPRs, and faster x86!).
Before electricity costs drove CPU development, that was to be the future.
Before Prescott got a less than warm welcome, and the market at large told Tejas it wasn't welcome, that was to be the future.

The fact is, it's too hard to do well, will run too hot, and not perform well enough to be worth it. Separate narrower execution units work better, all else considered.

With Poulson Itanium increasing to a whopping 12-wide, maybe Intel will permit this change to happen?
I could have sworn it was 6, which is closer to 2, but still, no. AMD brought the Hammer down, and COTS won. X86 got the RAS it needed to hang with the big boys, Linux became a serious server OS (Linux/FOSS greatly removes hardware/software lock-in, benefiting X86 and ARM), and now it is too late*.

* On top of that, x86 was already good enough for software built to tolerate faulty hardware (about anything made with Erlang, for instance), putting IA64 in a position of lower actual reliability, just due to the amount of work put into quality support for x86.
 

denev2004

Member
Dec 3, 2011
105
1
0
SMT is just a transitional design on the way from ILP to TLP.
You can't use it to fix all the problems that happen in a wide ILP processor.

X86 got the RAS it needed to hang with the big boys

I once read a passage about Nehalem-EX RAS that said Nehalem-EX supports some RAS features that once belonged only to Itanium. But it does not support ALL RAS features. I don't know whether Westmere-EX is still the same.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
I once read a passage about Nehalem-EX RAS that said Nehalem-EX supports some RAS features that once belonged only to Itanium. But it does not support ALL RAS features. I don't know whether Westmere-EX is still the same.
I think explicit support for logic fault checking is still IA64-only. Every other feature that even might be useful is on the high-end Xeons, and most of those are supported by RHEL.
 

denev2004

Member
Dec 3, 2011
105
1
0
I think explicit support for logic fault checking is still IA64-only. Every other feature that even might be useful is on the high-end Xeons, and most of those are supported by RHEL.
Oh really. That sounds like IA64 is not that useful now.

Maybe I'm a little bit off topic
 

beginner99

Diamond Member
Jun 2, 2009
5,229
1,603
136
Enterprise is for businesses who are happy as can be to pay a mint, as long as it works, and has every feature their DBAs asked for. Smaller businesses, who can't afford that, but might want to scale out, are going to go FOSS from the start, if they have any intelligence, so they will be irrelevant to SQL Server's licensing schemes.

Yeah, but certain applications require a certain database, meaning they will run only on Oracle and not on MySQL. And in certain niche markets there aren't any alternatives besides reinventing the wheel yourself.
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
Actually, the one who told me that the 4-core version of Haswell will go back to 95W claims that the 18W difference from Ivy Bridge is caused by the new IGP in Haswell...

That makes sense to me.....

At 45nm, a 95 watt power budget for a "4-wide" quad core was not unreasonable.

Even at 32nm, 95 watts is not unreasonable for a "4-wide" quad core (although even with the iGPU included in that budget, we are starting to see Sandy Bridge push some pretty high stock clocks for the CPU).

At 22nm Tri-Gate, I feel that 95 watts for a "4-wide" quad core is probably too much, resulting in a less-than-desirable performance-per-watt trade-off if all that TDP were focused purely on the CPU cores. Thus, we are seeing Intel invest even more silicon die space in the iGPU....in an effort to boost the 77 watt TDP of Ivy Bridge back up to 95 watts with Haswell.
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
Sure, but what about paying all the engineers to do the R&D? R&D costs a lot of money not just because implementing the final design is tedious, but because they will also need to verify what good ideas won't work for them, and what ideas that look good on paper suck in reality. The cost and time needed keeps going up and up and up, because it is not easy to do. It can be done, and Intel is making their CPUs ever faster per clock, but they can't just make a bigger one and watch it go really fast. Rather, they can spend a ton of money, and make one a bit faster, and watch it be bigger and maybe hotter. But, hotter means slower, today, which only adds to the difficulty.

I completely agree with you about paying engineers to design x86 CPUs. That is money well spent!

So this is why I brought up MS SQL 2012 Enterprise licensing changes....as a start.

Hopefully the necessary experts, engineers and number crunchers can get together and tame this dragon that is the high-end x86 CPU budget.

With that being said, I'd really like to see AMD implement some new type of x86 plan. Maybe they can change strategy from being a "little Intel" to one that forges its own cpu destiny....a company that operates more like a "big start-up."
 

Vesku

Diamond Member
Aug 25, 2005
3,743
28
86
It would make me a bit sad if billions in R&D were spent trying to eke out gains within some artificial database licensing restrictions rather than pursuing much more productive avenues of development. Not saying it couldn't happen, but it seems licensing is the more malleable of the two.
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
It would make me a bit sad if billions in R&D were spent trying to eke out gains within some artificial database licensing restrictions rather than pursuing much more productive avenues of development. Not saying it couldn't happen, but it seems licensing is the more malleable of the two.

Well, you make a really good point.

But I wonder if other types of artificial factors aren't already involved with the politics of x86 decision making.

Case in point---> Itanium.

Intel has stated in the past that they wanted this processor design to be the flagship, with x86 being the budget server CPU.

Therefore I just wonder how much influence Intel's Itanium directive has on Intel's x86 development budget plans?

I noticed in the Wikipedia link here that back in 2006, $10 billion would be allocated to the Itanium budget by 2010. That seems like some serious money and effort to keep Itanium on top. I just wonder where the desktop would be today if just some of that budget (and other past Itanium money) were allocated to x86 instead?
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,786
136
AFAIK, the general concept I am thinking about shouldn't make for a less efficient desktop CPU core.

Oh, no it does. Sandy Bridge had a limited gain because, rather than expanding all execution resources without thought, they decided to enhance those that brought big gains, to increase efficiency. Just increasing the number of decoders doesn't gain you anything, because other parts are holding you back. You need to expand everything, like memory execution units and out-of-order resources, and then you are back at the mercy of Pollack's rule.
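Pollack's rule, roughly stated: single-thread performance grows with about the square root of the core's area, while power grows roughly linearly with it. A toy illustration, assuming the rule holds exactly:

Code:
import math

# Pollack's rule (rough empirical observation): single-thread performance
# scales ~sqrt(core area / transistor budget); power scales ~linearly with area.

def pollack_perf(area_ratio):
    return math.sqrt(area_ratio)

for area in (1.0, 2.0, 4.0):
    print(f"core area x{area:.0f} -> perf ~x{pollack_perf(area):.2f}, power ~x{area:.0f}")

# Doubling a core's area buys only ~1.41x performance for ~2x power,
# which is why several smaller cores usually beat one giant core on throughput.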

What I feel is needed is a higher-IPC design with greater width (with the appropriate compensations engineered into the CPU to make the design work). With Poulson Itanium increasing to a whopping 12-wide, maybe Intel will permit this change to happen?

They don't cripple their x86 chips to artificially make Itanium look better anymore. That's why Xeon has been kicking ass, and getting most of the RAS features.

BTW, Poulson is still a 6-issue core on the front end, you know, the "decoder side".

The main goal would be a higher end x86 quad core server. The E-class desktops are just a sub market.

Why go for a higher-end quad-core server, when multiples of what they have are way better? HPC, cloud, databases. If anything, it's the E-class desktops that have the most to gain from increasing IPC at the cost of everything else.
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
Oh, no it does. Sandy Bridge had a limited gain because, rather than expanding all execution resources without thought, they decided to enhance those that brought big gains, to increase efficiency. Just increasing the number of decoders doesn't gain you anything, because other parts are holding you back. You need to expand everything, like memory execution units and out-of-order resources, and then you are back at the mercy of Pollack's rule.

That makes sense to me, and I did mention that caveat back in post #175.

What I feel is needed is a higher-IPC design with greater width (with the appropriate compensations engineered into the CPU to make the design work).

Okay, now with that out of the way......I have no idea how they (Intel/AMD) will fix the problem.

But...... from what I gather, x86 is entering a period where heat density will be a major concern.

Therefore it would seem to me that spreading some of that heat (and work) out over a larger amount of silicon area could make for a higher-performance and more energy-efficient core (even if Pollack's rule is invoked), rather than trying to push up or even maintain frequencies to focus the same amount of work through the same narrow core.....on the smaller node.

Now how will Intel/AMD accomplish this goal?

1. Maybe larger cores running at slower speeds with higher IPC is one part of the answer? (to help spread the watts out over a greater amount of silicon) In the case of Intel, I'd imagine the answer and plan for this inevitable future would be Itanium. But what will AMD's answer to this be?

2. Maybe 3D xtors are part of the answer? Wouldn't they help increase surface area?

3. In a quad core format, maybe the OS can schedule threads so "CPU core 0" isn't always the first one to get used? Same for Turbo Boost. To help explain this concept, let's use the example of five single-threaded tasks arriving in series. In this scenario, core 0 would be the first core used, then core 1, then core 2, then core 3, then back to core 0 (rather than those five serial single-threaded tasks repeatedly pounding core 0). A rough sketch of this idea follows below.
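A very rough sketch of idea 3, just to make the round-robin concept concrete. It assumes Linux (it pins each job with os.sched_setaffinity), and a real OS scheduler weighs far more than heat, so treat it as an illustration only:

Code:
import itertools
import os
from multiprocessing import Process

CORES = [0, 1, 2, 3]                 # a hypothetical quad core
next_core = itertools.cycle(CORES)   # the round-robin "firing order"

def run_pinned(task, core):
    """Pin this worker process to one core, then run the task (Linux-only)."""
    os.sched_setaffinity(0, {core})
    task()

def dispatch(task):
    """Hand each incoming single-threaded task the next core in rotation,
    instead of letting every job land on core 0 and concentrate the heat."""
    p = Process(target=run_pinned, args=(task, next(next_core)))
    p.start()
    return p

if __name__ == "__main__":
    def busy_work():
        sum(i * i for i in range(10**6))
    # Five tasks in quick succession land on cores 0, 1, 2, 3, 0.
    procs = [dispatch(busy_work) for _ in range(5)]
    for p in procs:
        p.join()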
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
Emulating X86 is in our future.

Emulating vs. clock gating front-end components? Which approach is more efficient? (I have no clue, but I thought I would include this post from CPU architect for the sake of discussion)

http://forums.anandtech.com/showpost.php?p=32630722&postcount=17

CPU architect said:
The next step is to do even more work per instruction. This can be achieved by executing AVX-1024 instructions on the existing 256-bit units, in four cycles. The throughput remains the same but the instruction rate is reduced so the power hungry front-end can be clock gated for 3/4 of the time.

Clock gating the front end.....for power savings. Essentially, with this type of strategy we begin to see the beginnings of micro-managing heat within a CPU core.

Maybe this could be taken one step further? Maybe each time only a single decoder is needed.....that decoder would not be used again until the other decoders got their fair share of the next tasks?

Furthermore, with respect to work per instruction being increased....I think that idea meshes very nicely with the idea of micro-managing heat within the CPU, but don't we still have the problem of doing the work in a silicon space about half the size of the previous one when a new node is used? Maybe a workaround is to over-provision various decoders, execution units, etc. <----Then program the CPU so each one doesn't get overused in comparison to other identical units?

Do you see what could be happening with our CPUs as these nodes progress? Silicon area decreases by 50%, but power efficiency only increases by 25%. Essentially we run into the problem where just keeping the same single-threaded performance (as the old node) becomes an accomplishment in itself! <----This is yet another reason why some kind of strategy involving wider cores for these small nodes (14nm and beyond) makes a degree of sense to me.
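To put numbers on that last paragraph, using exactly the assumptions stated there (area halves per shrink, power efficiency improves only 25%):

Code:
# Assumptions straight from the paragraph above: each node shrink halves the
# silicon area of the same design but improves energy efficiency by only 25%.
# Hold performance constant and watch the heat density climb.

area_scale = 0.5          # new area / old area
efficiency_gain = 1.25    # work per joule, new / old

power_scale = 1 / efficiency_gain          # 0.8x the power for the same work
density_scale = power_scale / area_scale   # W/mm^2 relative to the old node

print(f"power: x{power_scale:.2f}, heat density: x{density_scale:.2f} per shrink")
# -> power x0.80, heat density x1.60: merely matching the old node's
#    single-threaded performance gets harder to cool each generation.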
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Emulating X86 is in our future.
Emulating vs. clock gating front-end components? Which approach is more efficient? (I have no clue, but I thought I would include this post from CPU architect for the sake of discussion)

http://forums.anandtech.com/showpost.php?p=32630722&postcount=17



Clock gating the front end.....for power savings. Essentially, with this type of strategy we begin to see the beginnings of micro-managing heat within a CPU core.

Maybe this could be taken one step further? Maybe each time only a single decoder is needed.....that decoder would not be used again until the other decoders got their fair share of the next tasks?

Furthermore, with respect to work per instruction being increased....I think that idea meshes very nicely with the idea of micro-managing heat within the CPU, but don't we still have the problem of doing the work in a silicon space about half the size of the previous one when a new node is used? Maybe a workaround is to over-provision various decoders, execution units, etc. <----Then program the CPU so each one doesn't get overused in comparison to other identical units?

Do you see what could be happening with our CPUs as these nodes progress? Silicon area decreases by 50%, but power efficiency only increases by 25%. Essentially we run into the problem where just keeping the same single-threaded performance (as the old node) becomes an accomplishment in itself! <----This is yet another reason why some kind of strategy involving wider cores for these small nodes (14nm and beyond) makes a degree of sense to me.

I think it's a questionable tradeoff to widen the vector x86 extensions much more. You're spending a lot of energy running operations speculatively out of order in an x86 core...if you have a task that parallelizes to 1024 bits, does it really not fit much better on a proper vector machine like a GPU, which can hide latency in an energy-efficient way? (What I'm asking is, are there that many tasks that parallelize to more than 256 bits that don't go all the way to GPU widths? Each time you double the CPU width you're helping fewer and fewer tasks). I wouldn't dispute a claim that Intel is going to do it anyway, but they have their own motivations that don't always lead them to the best technical solution.
 

wlee15

Senior member
Jan 7, 2009
313
31
91
Also remember that Sandy Bridge was able to reuse the integer SIMD datapaths to elegantly expand to 256-bit. Going to 512-bit and 1024-bit won't have that advantage, so while the option is there to expand the length of the AVX units, there's no guarantee that will actually happen.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Yeah, but certain applications require a certain database, meaning they will run only on Oracle and not on MySQL. And in certain niche markets there aren't any alternatives besides reinventing the wheel yourself.
Reinventing the wheel happens all the time, for just that sort of reason (we can do it cheaper, we can do it better, we won't be able to afford many per-server licenses if we start growing, we don't want to worry about future upgrade costs, etc.). Generally, though, such applications are very expensive, just like the commercial DBs, and people with appropriate certifications for said applications and/or DBs will be more expensive than not (or the expense of the application includes good support), so it will work out. If the money is there, and spending it has a high likelihood of resulting in greater (or just quicker) profit, then it is likely worth paying for.

I completely agree with you about paying engineers to design x86 CPUs. That is money well spent!

So this is why I brought up MS SQL 2012 Enterprise licensing changes....as a start.

Hopefully the necessary experts, engineers and number crunchers can get together and tame this dragon that is the high-end x86 CPU budget.
It's time-consuming and expensive not merely to design all the common x86 CPUs, but then to extend that to a good server chip. Some of the lower Xeons are just desktops w/ ECC, but the upper end have started integrating other useful low-level Itanium features, like multi-level ECC for registers, and logic paths much more resistant to soft errors. Now, is there a lot of profit there? Sure. Too much? Maybe, but AMD would need to have high-end competition again to make a difference in the prices, and it doesn't look like they are on their way towards doing that again. Personally, I'd rather see the RAS from the EX trickle down to every Xeon (maybe not all of the A and S parts--hot-pluggable CPUs, RAM, and cards can stay in the high end), than worry about paying $2k+ per CPU for a 4P system that is already likely to be $10k+ before the CPU, just in purchase costs (operating costs are likely to significantly exceed that).

The prices are higher than they could be with competition, but they're really not all that bad, and are much better than in the past.
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
I think it's a questionable tradeoff to widen the vector x86 extensions much more. You're spending a lot of energy running operations speculatively out of order in an x86 core...if you have a task that parallelizes to 1024 bits, does it really not fit much better on a proper vector machine like a GPU, which can hide latency in an energy-efficient way? (What I'm asking is, are there that many tasks that parallelize to more than 256 bits that don't go all the way to GPU widths? Each time you double the CPU width you're helping fewer and fewer tasks). I wouldn't dispute a claim that Intel is going to do it anyway, but they have their own motivations that don't always lead them to the best technical solution.

I'm not a computer scientist and my understanding is extremely limited, but I see the point you are making. The general idea you propose is "Why spend time running floating point on a CPU when the GPU is better provisioned to do the task?"

Now with respect to heat density....

1. How about "dark silicon strategies" within the CPU core applied to raising integer performance (rather than floating point)?

The way I see things, smaller manufacturing nodes are great. But we are nearing a point where everyday people are receiving more CPU cores than they know what to do with. Using the GPU to take load off the FPU makes sense, but what about the rest of the CPU core that has to feed those ever-increasing GPGPU units? How will Intel, AMD and ARM speed up single-threaded performance (for that purpose) while keeping heat density manageable?

2. I have seen it mentioned that "dark silicon" is an unfortunate workaround to the problem of heat density. Maybe there is a creative way or future vision to convert that dark silicon back into fully active silicon later on? Maybe a plan, scheme, or layout that allows room for future instruction set extensions to make use of those extra decode/execution units as the silicon matures?
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
Does the possibility exist that AMD could do something unique with the order in which Bulldozer's cores handle single-threaded tasks at 22nm?

Maybe a concept similar to a V8 engine's firing order: http://en.wikipedia.org/wiki/Firing_order

With this type of scheme, could an "eight core" end up being better than a "quad core" when a series of eight or more single-threaded tasks is released in quick succession? (i.e., each core could dissipate a greater amount of heat, because the rest time for each core doubles with a move from "quad core" to "octa core")

Then maybe at 14nm, AMD could add various "dark silicon" heat density management strategies within the Bulldozer CPU core itself? (further enhancing the ability of the design to handle more stress)
 

cbn

Lifer
Mar 27, 2009
12,968
221
106


Here is the Bulldozer floor plan.

I would almost say that looks like a V-8 with the two banks of four cores.

Then maybe the 14nm "dark silicon" strategy (within the cpu cores/other parts of the module) could be called "spark-plug" (or "glow plug" in the case of a diesel analogy) <----to emphasize the low power electrical nature of the technology

To sum things up, these would be the technologies:

Turbo Core (this is already present in Bulldozer)
V-8 (future firing order of the cores @ 22nm for maximum efficiency and speed with serial single-threaded tasks)
Spark plug/glow plug (dark silicon tech @ 14nm or 16nm within the core and other parts of the modules)
 