AMD Realizes Significant Reduction in Power Consumption by Implementing Cyclos Resona

GaiaHunter · Feb 22, 2012

Arzachel said:
It needs a more mature process, less cache with tighter latencies, clocks somewhere around 4GHz, and no bouncing between P-states like crazy, if it still does that. The reworked threading in Win 8 would also be nice.

Considering some "noises" that AMD is re-thinking their scheduler approach, I think a better scheduler might do more.

Idontcare · Feb 22, 2012

Arkadrel said:
If this tech works on CPUs.... any chance AMD can use it on their graphic cards as well?

Maybe 8xxx series of card? or the 9xxx ones?

I would say there is no technical reason not to, but consider that the same could be said of AMD's other power-reducing technology, which they limited to CPUs only: SOI.

It always surprised me that neither AMD nor Nvidia migrated to using SOI for their 200-300W GPUs given how power-constrained the clocking and cooling challenges are.

But it never happened, and presumably the reason it never happened comes down squarely on cost-effectiveness and profitability.

So when you factor that into the equation, not so sure this technique would show up in GPUs. It could, but it might not.

Arzachel · Feb 22, 2012

Does TSMC even provide the option of using SOI?

exar333 · Feb 22, 2012

The quickness of the implementation of this tech is a little suspect IMHO. I doubt we will see a huge power decrease, but any decrease helps.

It's very rare that you spend years and years on something and then, out of the blue, plug in a new tech and magically save 25% power. To be realistic, we are probably looking at <5% real-world power savings. Again, anything helps of course.

pelov · Feb 22, 2012

Idontcare said:
I would say there is no technical reason not to, but consider that the same could be said of AMD's other power-reducing technology, which they limited to CPUs only: SOI.

It always surprised me that neither AMD nor Nvidia migrated to using SOI for their 200-300W GPUs given how power-constrained the clocking and cooling challenges are.

But it never happened, and presumably the reason it never happened comes down squarely on cost-effectiveness and profitability.

So when you factor that into the equation, not so sure this technique would show up in GPUs. It could, but it might not.

If it is a cost issue, then wouldn't the current climate where low end GPUs are being cannibalized by CPUs/APUs offer a better opportunity for that to be implemented? The cost of high end GPUs shouldn't be decreasing significantly and with their rise in HPC you'd figure making them even more efficient and powerful despite increase in cost would actually be profitable? NV/AMD can't rely on volume to make up a good portion of their profits anymore so I'd figure it may be time to invest quite a bit on GPUs, particularly with their HSA agenda.

Another quick question,

AMD is going to 20nm gate-last after their 28nm gate-first SOI. Intel decided to skip SOI on 22nm and opt for tri-gate instead, but was that due to cost effectiveness? What would the difference have been if they had stuck with SOI for Ivy, other than higher cost? The last one might be difficult to answer considering the chips aren't out yet. If so, just ignore that one.

IntelUser2000 · Feb 22, 2012

blckgrffn said:
Haha, 2x the cores to level the playing field btw, isn't that what Cinebench etc. show right now?

On the contrary, Cinebench is the benchmark where AMD performs best in relative terms. R11.5 favors AMD even more than R10, which is what the following link shows.

http://www.anandtech.com/bench/Product/289?vs=80

The i3-2100 places itself fairly well between the 3.0GHz Phenom II 940 and the 3.2GHz Phenom II 955.

It's not 2x cores vs 1x core. It's 2x cores vs 1x core + Hyperthreading. After Bulldozer it's AMD core + Multithreading vs Intel core + Multithreading.

CPUarchitect · Feb 22, 2012

pelov said:
Another quick question,

AMD is going to 20nm gate-last after their 28nm gate-first SOI. Intel decided to skip SOI on 22nm and opt for tri-gate instead, but was that due to cost effectiveness? What would the difference have been if they had stuck with SOI for Ivy, other than higher cost?

Tri-Gate achieves pretty much the same goal as SOI. Because the channel forms a fin structure, the parasitic capacitance of the bulk silicon is greatly reduced.

But it achieves more than that. Because the gate area is larger for the same top-down area, they can make the gate dielectric slightly thicker, reducing leakage due to tunneling. So Intel will have an advantage over AMD's planar SOI.

Eventually AMD is expected to use FinFET on an SOI substrate, which has interesting properties of its own (mind you that this is a presentation by the SOI consortium so there could be some talking up of the technology). And by that time Intel will have taken things to the next level as well...

pelov · Feb 22, 2012

Thanks for the quick response and shedding some light. I figured FinFET was inevitable but coupled with SOI it should certainly be interesting to see what results.

What do you think the cost difference is between SOI and SOI-less design and manufacturing, all other things being equal? If the cost isn't that detrimental, you'd figure AMD may look to make the same transition IBM introduced into GPUs as well, if the performance benefits outweigh the costs.

Phynaz · Feb 22, 2012

ExarKun333 said:
The quickness of the implementation of this tech is a little suspect IMHO. I doubt we will see a huge power decrease, but any decrease helps.

It's very rare that you spend years and years on something and then, out of the blue, plug in a new tech and magically save 25% power. To be realistic, we are probably looking at <5% real-world power savings. Again, anything helps of course.

Just because the announcement was unexpected doesn't make implementation quick.

Considering chip design cycle times, this has been in the works for years.

pm · Feb 22, 2012

One of the funny or ironic things about this is that I distinctly remember talking about a scheme similar to this in the hallway about 8-10 years ago with a couple of co-workers.

We were talking about it today - same guys - in the hallway again and were talking about issues. Like you'd have a hard time tuning it based on process since capacitance can vary so much, how do you tune it? The test program to tune the thing - or the microcontroller on-chip - must be complex. And there must be real problems doing any kind of stop-clock based scan testing. I guess tATPG is out of the question... but even ATPG must be a bit of a challenge. And how do they handle things like overshoot due to the "Q" - do you clamp it? use some sort of analog automatic gain control? And then what about things like "turbo-mode" or schemes that slow/speed up the clock... you could retune the LC network, but then you'd need to wait a couple of clocks. I bet the guys who worked on the post-silicon debug of it had fun (and no, that's not a sarcastic "fun". I bet it was great).

Overall, it's a really neat thing. I'll be especially interested in reading the paper when it's available.

exar333 · Feb 22, 2012

Phynaz said:
Just because the announcement was unexpected doesn't make implementation quick.

Considering chip design cycle times, this has been in the works for years.

The way the blurb was stated, it sounded like this component was not there from the beginning. Makes me think this option was being exercised somewhat later-on in the project to mitigate power issues they were experiencing.

AMD's 4+ GHz x86-64 core code-named Piledriver employs resonant clocking to reduce clock distribution power up to 24% while maintaining the low clock-skew target required by high-performance processors. Fabricated in a 32nm CMOS process, Piledriver represents the first volume production-enabled implementation of resonant clock mesh technology. "We were able to seamlessly integrate the Cyclos IP into our existing clock mesh design process so there was no risk to our development schedule," said Samuel Naffziger, Corporate Fellow at AMD. "Silicon results met our power reduction expectations, we incurred no increase in silicon area, and we were able to use our standard manufacturing process, so the investment and risk in adopting resonant clock mesh technology was well worth it as all of our customers are clamoring for more energy efficient processor designs."

blckgrffn · Feb 22, 2012

ExarKun333 said:
The way the blurb was stated, it sounded like this component was not there from the beginning. Makes me think this option was being exercised somewhat later-on in the project to mitigate power issues they were experiencing.

AMD's 4+ GHz x86-64 core code-named Piledriver employs resonant clocking to reduce clock distribution power up to 24% while maintaining the low clock-skew target required by high-performance processors. Fabricated in a 32nm CMOS process, Piledriver represents the first volume production-enabled implementation of resonant clock mesh technology. "We were able to seamlessly integrate the Cyclos IP into our existing clock mesh design process so there was no risk to our development schedule," said Samuel Naffziger, Corporate Fellow at AMD. "Silicon results met our power reduction expectations, we incurred no increase in silicon area, and we were able to use our standard manufacturing process, so the investment and risk in adopting resonant clock mesh technology was well worth it as all of our customers are clamoring for more energy efficient processor designs."

Those highlighted comments made me think this was more like buddy backscratching than anything meaty (when it comes to scope/schedule/budget). Hell, it sounds like a press release prepped for Cyclos; it might even have been part of their contract.

Sweepr · Feb 22, 2012

SLK said:
I'll take a power efficient 8 core that has 15% better IPC than Bulldozer over a 4 core (i5 based) Ivy. I'd rather have the multithreaded performance personally, because if applications start getting written for more than 4 threads, all of the 4 core Intel CPUs are going to tank in performance.

AMD says 10-15% better performance, 1/3 of it coming from IPC improvements (3-5%).

http://www.tomshardware.com/reviews/fx-8150-zambezi-bulldozer-990fx,3043-9.html

pelov · Feb 22, 2012

Sweepr said:
AMD says 10-15% better performance, 1/3 of it coming from IPC improvements (3-5%).

http://www.tomshardware.com/reviews/fx-8150-zambezi-bulldozer-990fx,3043-9.html

That was back in October of '11. Since then they've revised their estimates and stated that it should be 20%

http://www.xbitlabs.com/news/cpu/di...ce_Projections_for_Next_Gen_Trinity_APUs.html

The 10-15% just wouldn't be enough to allow Trinity to compete with IB, but a 20% makes it far more appealing and a worthy upgrade from Llano. It still won't match Ivy with regards to CPU performance, nor does it have to, but it should be respectable while the graphical performance should be top notch. Intel has miles to go to make their graphics respectable, and I can tell you from personal experience from using my SB laptop. It plays 1080p video just fine, but the settings are lackluster, it's even slow to load menus and it can't game for crap. Then there's the skeleton that's waiting to come out of the closet called "Intel graphics drivers" which have a well deserved reputation of being absolutely horrendous.

If Trinity improves even slightly upon Llano's CPU performance while making a significant stride in graphics and can do that while providing enough chips at a low cost and great power savings, they'll certainly have a winner.

frostedflakes · Feb 22, 2012

That's performance improvement over the old Stars cores in Llano. The 10-15% is projected performance per watt (not absolute performance) improvement of Piledriver over Bulldozer.

I'm also pretty skeptical they can improve x86 performance 25% in Trinity compared to Llano, but we'll see, these Trinity chips should be coming out in the next couple months. Maybe this new clock mesh design they licensed allowed AMD to clock them a lot higher and still keep power consumption pretty low.

Abwx · Feb 22, 2012

pm said:
Like you'd have a hard time tuning it based on process since capacitance can vary so much, how do you tune it? The test program to tune the thing - or the microcontroller on-chip - must be complex. And there must be real problems doing any kind of stop-clock based scan testing. I guess tATPG is out of the question... but even ATPG must be a bit of a challenge. And how do they handle things like overshoot due to the "Q" - do you clamp it? use some sort of analog automatic gain control? And then what about things like "turbo-mode" or schemes that slow/speed up the clock... you could retune the LC network, but then you'd need to wait a couple of clocks. I bet the guys who worked on the post-silicon debug of it had fun (and no, that's not a sarcastic "fun". I bet it was great).

Overall, it's a really neat thing. I'll be especially interested in reading the paper when it's available.

Varying frequency is not a problem, as this circuit is used to increase the efficiency of high-frequency clock buffering.

At low frequencies it's likely that the CPU can work without the help of resonance, so a few buffers will be enough to charge and discharge the parasitic capacitance with few losses, while the resonant circuit, which is in fact a kind of bootstrap, will likely be disabled.

Overshoots can be eliminated by design, since the buffers have intrinsic resistance. The real circuit is rather an RL + RC circuit, so the inductance and capacitance values can be optimized to provide a critically damped circuit that has a quasi-first-order transfer function.

Of course, for such designs, very precise knowledge of the component parameters is mandatory for the simulators to give usable results.

Agreed that such an idea, despite the apparently simple principle, surely required quite a lot of FP number crunching....

Sweepr · Feb 22, 2012

pelov said:
That was back in October of '11. Since then they've revised their estimates and stated that it should be 20%

http://www.xbitlabs.com/news/cpu/di...ce_Projections_for_Next_Gen_Trinity_APUs.html

The 10-15% just wouldn't be enough to allow Trinity to compete with IB, but a 20% makes it far more appealing and a worthy upgrade from Llano. It still won't match Ivy with regards to CPU performance, but it should be respectable while the graphical performance should be top notch.

That's Trinity vs Llano. Llano works at very low clocks on notebooks. Their fastest quads with 35W TDP run at ~1.5/1.6GHz (even dual-cores don't pass 2GHz). I wouldn't be surprised to see a 2.2-2.8GHz 2M/4C Trinity, easily beating Llano even with lower IPC. I was talking about their performance estimates for Bulldozer vs Piledriver though.

http://www.anandtech.com/show/5488/amds-2012-2013-server-roadmap-abu-dhabi-seoul-delhi-cpus

frostedflakes said:
That's performance improvement over the old Stars cores in Llano. The 10-15% is projected performance per watt (not absolute performance) improvement of Piledriver over Bulldozer.

I'm also pretty skeptical they can improve x86 performance 25% in Trinity compared to Llano, but we'll see, these Trinity chips should be coming out in the next couple months. Maybe this new clock mesh design they licensed allowed AMD to clock them a lot higher and still keep power consumption pretty low.

Me too. The FX4100 (2M/4C BD @ 3.6GHz; 4Mb L3) has a hard time beating the A8 3850 (2.9GHz) in many CPU tasks. I don't think the new A10 5800K (2M/4C Piledriver @ 3.8GHz) will do much better without L3 cache.

http://www.legitreviews.com/article/1766

Arzachel · Feb 22, 2012

frostedflakes said:
That's performance improvement over the old Stars cores in Llano. The 10-15% is projected performance per watt (not absolute performance) improvement of Piledriver over Bulldozer.

I'm also pretty skeptical they can improve x86 performance 25% in Trinity compared to Llano, but we'll see, these Trinity chips should be coming out in the next couple months. Maybe this new clock mesh design they licensed allowed AMD to clock them a lot higher and still keep power consumption pretty low.

Llano was clocked dog slow. Even if they only match Llano's IPC, they should easily hit 25% faster clock speeds.
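For anyone who wants to play with the numbers, here is a rough first-order sketch that treats performance as IPC times clock speed and nothing else. The 1.6GHz and 2.2GHz clocks are the notebook figures speculated about earlier in the thread, not confirmed specs, and the function names are just made up for this example.

```python
def relative_performance(ipc_ratio, clock_ratio):
    """First-order estimate: performance scales with IPC times clock speed."""
    return ipc_ratio * clock_ratio


def required_ipc_ratio(target_speedup, clock_ratio):
    """IPC ratio (new/old) needed to reach target_speedup at a given clock ratio."""
    return target_speedup / clock_ratio


# Clocks mentioned in the thread; the Trinity figure is speculation, not a spec.
LLANO_GHZ = 1.6
TRINITY_GHZ = 2.2

clock_ratio = TRINITY_GHZ / LLANO_GHZ
needed = required_ipc_ratio(1.25, clock_ratio)

print(f"clock ratio: {clock_ratio:.3f}")
print(f"IPC ratio needed for a 25% speedup: {needed:.3f}")
print(f"speedup at equal IPC: {relative_performance(1.0, clock_ratio):.3f}")
```

This ignores cache, memory and turbo behaviour entirely, so it only shows why a large clock bump can hide a modest IPC regression.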

guskline · Feb 22, 2012

I took the time to read the Cyclos white paper. Some of the info was way over my head, but I think I have the basic theory down. The resonant clock meshing will add some cost to the CPU, but one of the big drawbacks of Bulldozer was that to get acceptable performance that neared Sandy Bridge you had to OC into the high 4GHz range (4.6 to 4.9), and this produced too much heat. The new process for Piledriver, if it addresses this, will allow the higher OCs without the excessive power draw. I'm hopeful that this works out. Maybe Piledriver will be like the Phenom II. No need to buy a Bulldozer. I'll wait for the Piledriver.

pm · Feb 22, 2012

Abwx said:
Varying frequency is not a problem, as this circuit is used to increase the efficiency of high-frequency clock buffering.

It's a tuned parallel LC tank resonating at the harmonic frequency of the clock network. I don't understand why you would think that varying the frequency wouldn't be a problem.
See page 10+ of this: http://www.cyclos-semi.com/pdfs/time_to_change_the_clocks.pdf

The resonant formula is on page 10. f=1/(2*pi*((LC)^0.5)). Since PI is a constant and L and C are usually fixed values - the capacitance and inductance of the clock mesh are normally constant values - then you are stuck with one resonant frequency.

The only way to fix this is to change L or C or else to suffer a power penalty when not driving the network at the tuned resonant frequency. And if they are changing L or C (if it were me, I'd try to change L), then I wouldn't think they have such a huge range of control, and the whole thing would need to be characterized pretty carefully to know how to tune it. Unless the designers are ok with an efficiency loss.
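To put rough numbers on that formula, here is a minimal Python sketch. The 1 nF mesh capacitance and the 3 to 4 GHz range are made-up illustration values, not figures from the Cyclos paper or from AMD, and the function names are hypothetical.

```python
import math


def resonant_frequency_hz(inductance_h, capacitance_f):
    """Resonant frequency of an ideal LC tank: f = 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))


def inductance_for_frequency_h(target_hz, capacitance_f):
    """Inductance that puts the tank's resonance at target_hz for a fixed C."""
    return 1.0 / ((2.0 * math.pi * target_hz) ** 2 * capacitance_f)


# Hypothetical clock-mesh capacitance, chosen only for illustration.
C_MESH_F = 1e-9

for f_ghz in (3.0, 3.5, 4.0):
    l_h = inductance_for_frequency_h(f_ghz * 1e9, C_MESH_F)
    check_ghz = resonant_frequency_hz(l_h, C_MESH_F) / 1e9
    print(f"{f_ghz:.1f} GHz needs L = {l_h * 1e12:.2f} pH (check: {check_ghz:.3f} GHz)")
```

Because f scales with 1/sqrt(L), the inductance has to change with the square of the frequency ratio, which is why covering a wide clock range with one fixed tank looks hard.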

At low frequencies it's likely that the CPU can work without the help of resonance, so a few buffers will be enough to charge and discharge the parasitic capacitance with few losses, while the resonant circuit, which is in fact a kind of bootstrap, will likely be disabled.

How do you disable resonance in a wire mesh?

Overshoots can be eliminated by design, since the buffers have intrinsic resistance. The real circuit is rather an RL + RC circuit, so the inductance and capacitance values can be optimized to provide a critically damped circuit that has a quasi-first-order transfer function.

But this affects the efficiency of the system. In a true parallel LC tank circuit, you don't want any R at all. While you are correct that it's effectively an RL+RC circuit, for maximum efficiency you want as little R as possible. The more resistance there is, the less effect this resonance will have, and the less efficiency improvement there is over a standard non-resonant clock mesh. And I don't agree that overshoots are eliminated by the resistance of the clock drivers. I'd still think that it would be a problem. For clock drivers you want crisp edges, and that means large/fast drivers with low resistance.
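A small sketch of that trade-off, assuming the simplest possible model: a lumped series RLC loop, where Q = sqrt(L/C)/R and critical damping occurs at R = 2*sqrt(L/C). The component values are hypothetical, and a real clock mesh is a distributed structure, so treat this as an illustration only.

```python
import math


def series_rlc_q(resistance_ohm, inductance_h, capacitance_f):
    """Quality factor of a lumped series RLC loop: Q = sqrt(L/C) / R."""
    return math.sqrt(inductance_h / capacitance_f) / resistance_ohm


def critical_resistance_ohm(inductance_h, capacitance_f):
    """Series resistance at which the loop is critically damped: R = 2*sqrt(L/C)."""
    return 2.0 * math.sqrt(inductance_h / capacitance_f)


# Hypothetical values, picked only so the loop resonates in the GHz range.
L_H = 1.6e-12
C_F = 1e-9

r_crit = critical_resistance_ohm(L_H, C_F)
for scale in (1.0, 0.5, 0.1):
    r = scale * r_crit
    print(f"R = {scale:.1f} x R_crit -> Q = {series_rlc_q(r, L_H, C_F):.2f}")
```

In this model critical damping pins Q at 0.5, so very little energy rings back into the network. Lower driver resistance raises Q and recovers more energy, but the same low damping is what allows overshoot.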

Of course, for such designs, very precise knowledge of the component parameters is mandatory for the simulators to give usable results.

It's been my experience - having been one of the key designers of an h-tree clock network for an extremely large microprocessor - that measuring capacitance on a clock network isn't super precise. Beyond just standard process variation, you also run into crosstalk-like effects from neighboring metal, and in particular upper and lower metals that are very hard to measure. One of the key advantages of a clock mesh is the fact that you don't need a very precise measurement of anything. You just lay it all down and tie it all together.

Arzachel · Feb 22, 2012

They're probably using something similar to this: http://www.faqs.org/patents/app/20110260819 (props to Kedas from the SA forums for digging this up).

grimpr · Feb 22, 2012

Here is the final submission paper to ISSCC 2012, with more technical details.

http://www.eecs.umich.edu/eecs/about/articles/2012/ISSCC_2012_Piledriver_final_submission.pdf

pm · Feb 22, 2012

grimpr said:
Here is the final submission paper to ISSCC 2012, with more technical details.

http://www.eecs.umich.edu/eecs/about/articles/2012/ISSCC_2012_Piledriver_final_submission.pdf

Thanks, grimpr.

CTho9305 · Feb 22, 2012

Skurge said:
All of this is over my head. Could someone explain this to me in layman's terms?

A water analogy works fairly well (a more complex water analogy works really well, but I'd need to draw pictures for that one).

Pipes are wires
- You can fill a pipe up with high pressure water by connecting it to a pump (your power supply).
- You can empty a pipe by dumping the water out (open a valve that lets the water drain out).
Anybody who wants to receive the value on a wire inserts a "T" junction and sticks a balloon over the base of the T (let's put the T upside down so gravity helps). When the balloon fills up, they read a 1, and when it drains, they read a 0.

The clock is a pipe that gets distributed over a large area, with many receivers connected to it (e.g. all your flip flops - the clocked elements in a digital logic circuit). Each cycle, it goes to 1 for a while, then 0 for a while.

In a normal system, you fill the clock pipe up from your high-pressure water pump, wait half a cycle, drain all the water out, wait half a cycle, and repeat. The system consumes a LOT of water (in particular, it consumes (volume_of_pipe_in_gallons+volume_of_all_balloons_in_gallons) * cycles_per_second gallons per second). The pump takes a lot of power to provide that much pressurized water.

In a resonant system, it works a little differently. Instead of filling the pipe from a pump and dumping the water on the ground, you fill the pipe from a special extra-large balloon. When you connect that balloon to the pipe, it fills up the pipe and all the other balloons. Now, because water has inertia, the big balloon will empty completely - it won't stop when the pressure in the big balloon is balanced with the pressure in the little balloons. At this point, all the little balloons are full, and the big balloon is empty, so the little balloons shoot the water back into the big balloon.

If you design the system carefully, you can balance the inertia of the water against the capacity of the balloons, and set it up so the water will slosh back and forth many times all on its own before the sloshing dies down. You're not repeatedly pumping in large amounts of water and then dumping it all out. Instead, you just need to give the system little nudges to make up for inefficiencies (in practice you do that by pumping just a small amount of water into the pipe at the right time, or dumping a small amount out at the right time).

The inertia of the water is quite similar to inductance in electrical systems (if you ignore inductive coupling, for which I don't have a great water analogy).
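Translating the water-consumption formula above back to electronics, volume becomes capacitance and pressure becomes voltage, so a fill-and-dump clock burns roughly C * V^2 * f. Here is a minimal sketch of that, with the resonant case modelled as simply recycling some fraction of the energy each cycle. The capacitance, voltage and recycled fraction below are placeholder values for illustration, not measurements of any real chip.

```python
def conventional_clock_power_w(capacitance_f, vdd_v, freq_hz):
    """Fill-and-dump clocking: each cycle draws C * V^2 from the supply."""
    return capacitance_f * vdd_v ** 2 * freq_hz


def resonant_clock_power_w(capacitance_f, vdd_v, freq_hz, recycled_fraction):
    """Same network, but a fraction of the energy sloshes back each cycle."""
    if not 0.0 <= recycled_fraction <= 1.0:
        raise ValueError("recycled_fraction must be between 0 and 1")
    baseline = conventional_clock_power_w(capacitance_f, vdd_v, freq_hz)
    return (1.0 - recycled_fraction) * baseline


# Placeholder numbers: 1 nF of total clock load, 1.0 V swing, 4 GHz.
C_TOTAL_F = 1e-9
VDD_V = 1.0
FREQ_HZ = 4.0e9

plain = conventional_clock_power_w(C_TOTAL_F, VDD_V, FREQ_HZ)
resonant = resonant_clock_power_w(C_TOTAL_F, VDD_V, FREQ_HZ, recycled_fraction=0.24)
print(f"conventional: {plain:.2f} W, resonant: {resonant:.2f} W")
```

The "little nudges" in the analogy are whatever is left after the recycled fraction, which is why losses in the loop decide how much you actually save.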

Hope that makes sense.

If anyone wants to have a go at a more detailed explanation of FET-like water analogies, I made this a long time ago. The one on the left acts like pmos; when you apply pressure to the top, it'll close. The one on the right acts like nmos; when you apply pressure to the top, it'll open. The weird curvy cylinders are supposed to be springs, to keep the sluice gates in the right position when no pressure is applied to the top. An air model works too - the compressibility changes things, but I find it harder to visualize air sloshing.

grimpr said:
Here is the final submission paper to ISSCC 2012, with more technical details.

http://www.eecs.umich.edu/eecs/about/articles/2012/ISSCC_2012_Piledriver_final_submission.pdf

Thanks.

Phynaz · Feb 22, 2012

So assuming we take their numbers at face value - a 25% reduction in clock power consumption results in a 10% reduction in chip power consumption... Elementary math then says that clocks account for up to 40% of a CPU's power consumption.

True?
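The arithmetic behind that estimate can be sketched as follows, assuming the only thing that changed is clock power and using the 25% and 10% figures exactly as quoted in this post (neither is verified here).

```python
def implied_clock_share(chip_saving, clock_saving):
    """Share of total chip power spent on clocks, if cutting clock power by
    clock_saving lowers total chip power by chip_saving and nothing else changes."""
    if clock_saving <= 0.0:
        raise ValueError("clock_saving must be positive")
    return chip_saving / clock_saving


share = implied_clock_share(chip_saving=0.10, clock_saving=0.25)
print(f"implied clock share of chip power: {share:.0%}")

# Sanity check in the other direction: share * clock_saving gives the chip saving.
print(f"chip saving recovered: {share * 0.25:.0%}")
```

If either percentage is an "up to" figure measured under different conditions, the ratio is only a loose bound rather than a measured clock share.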
