AMD Realizes Significant Reduction in Power Consumption by Implementing Cyclos Resona

Dresdenboy · Mar 5, 2012

Idontcare said:
And what specific instruction is being referred to?

We've got a lot to choose from, and the execution latencies vary widely across them in any given microarchitecture.

[pic removed]

So we'd need to define the instruction under consideration, or if it is to be more than one instruction then we must define the instruction mix (and weightings).

In short, it's not simply an academic matter; that would actually be easier than the errand we are setting ourselves upon here.

We would be talking about defining our own BAPCo SYSmark or PassMark with which "effective IPC" would be characterized, and it would be "workload class" dependent. IPC for office apps is different than IPC for MATLAB apps because the instruction mix, and its weightings, are so different.

That's the definition problem in more detail.
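To make the instruction-mix point concrete, here is a minimal Python sketch. The per-class IPC figures and the two mixes are made-up numbers for illustration, not measurements of any real CPU, and `effective_ipc` is a hypothetical helper. It also assumes the classes do not overlap in execution, which real out-of-order cores violate.

```python
def effective_ipc(mix, per_class_ipc):
    """Return instructions per cycle for a workload.

    mix: fraction of executed instructions in each class (must sum to 1).
    per_class_ipc: IPC each class would sustain on its own (hypothetical).
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix fractions must sum to 1")
    # Average cycles spent per instruction, weighted by the mix.
    cycles_per_instruction = sum(
        fraction / per_class_ipc[name] for name, fraction in mix.items()
    )
    return 1.0 / cycles_per_instruction


# Made-up numbers, purely for illustration.
PER_CLASS_IPC = {"int_alu": 2.0, "load_store": 1.0, "fp": 0.5}
OFFICE_MIX = {"int_alu": 0.6, "load_store": 0.4, "fp": 0.0}
MATH_MIX = {"int_alu": 0.2, "load_store": 0.4, "fp": 0.4}

office_ipc = effective_ipc(OFFICE_MIX, PER_CLASS_IPC)
math_ipc = effective_ipc(MATH_MIX, PER_CLASS_IPC)
```

With these assumed inputs the same hypothetical core gives a noticeably different "effective IPC" for the two mixes, which is why the workload has to be pinned down before the number means anything.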

I assume AMD internally measures IPC while running SPEC benchmarks, since they mostly refer to them with regard to performance. This would include recompilation, which in turn means lower performance for unchanged "legacy" code. Since SPEC results have been published, we might even be able to verify their claim.

Idontcare said:
What the heck would "constant IPC" mean? To me it means "steady-state IPC", you process the exact same execution loop indefinitely and measure the average IPC that comes from doing so. But what instructions? And perhaps more importantly, to what end?

The ISSCC paper "40-Entry Unified Out-of-Order Scheduler and Integer" by M. Golden et al. states:

Compared to previous AMD x86-64 cores [3-6], project goals reduce the number of FO4 inverter delays per cycle by more than 20%, while maintaining constant IPC, to achieve higher frequency and performance in the same power envelope, even with increased core counts.

So in the context of a comparison with previous cores, one of their goals was to maintain IPC. This should at least make clear that it's not about "steady-state IPC".

Idontcare · Mar 5, 2012

Dresdenboy said:
The ISSCC paper "40-Entry Unified Out-of-Order Scheduler and Integer" by M. Golden et al. states:

So in the context of a comparison with previous cores, one of their goals was to maintain IPC. This should at least make clear that it's not about "steady-state IPC".

Ah, makes more sense now.

Did they succeed? (Has anyone published Everest instruction latency numbers for a clock-equivalent Zambezi and PhII?)

intangir · Mar 5, 2012

Dresdenboy said:
So in the context of a comparison with previous cores, one of their goals was to maintain IPC. This should at least make clear that it's not about "steady-state IPC".

Thanks Dresdenboy, that's exactly what I was telling Abwx. Also, in multiple quotes of Mike Butler around the web, the phrase "hold the line" was used to express the same goal.

http://techreport.com/articles.x/21813/1

TechReport said:
Yet Chief Architect Mike Butler told us the engineering team's goal with Bulldozer was to "hold the line" on instructions per clock (presumably at about the same rate as the Phenom II) and to "aggressively pursue higher frequencies."

http://arstechnica.com/gadgets/news/2011/10/can-amd-survive-bulldozers-disappointing-debut.ars

ArsTechnica said:
For Bulldozer specifically, additional design influences came into play. In the words of Chief Architect Mike Butler, AMD's goal was to "hold the line" on IPC (presumably meaning to keep it at around the same level as in Phenom II) but to increase the clock speed, thereby achieving improved single-threaded performance, too.

http://www.tomshardware.co.uk/fx-8150-zambezi-bulldozer-990fx,review-32295-6.html

TomsHardware said:
With its Bulldozer architecture, AMD's architects say it was their goal to “hold the line” on IPC and create hardware that’d scale to much higher frequencies. Given what we already know about the FX-8150's specifications, significantly higher frequencies aren’t being realized today, so before we even run any benchmarks, we have to assume similar IPC throughput, fairly comparable clocks, and then cross our fingers for better scaling across multiple cores if Bulldozer has any hope at all of outperforming the 3.7 GHz Phenom II X4 980 or Turbo Core-equipped Phenom II X6 1100T.

"Hold the line" cannot be interpreted in any way as "reduce variability". "Hold the line" has the meaning of "not giving up ground", which in this context means not giving up the IPC levels of their previous chip. The quotes above all agree with AMD's stated goals in Golden's ISSCC 2011 paper, which were to increase the frequency while keeping the same IPC. Of course, Mike Butler admits they failed:

http://www.hardocp.com/article/2011/11/29/hardocp_readers_ask_amd_bulldozer_questions/

HardOCP said:
3. It seems that the idea of modules and cores sharing parts is brilliant, but the idea of increasing frequency while lowering IPC seems like a step backwards. Why was this decided on?

Mike Butler, Senior Fellow Design Engineer, AMD - Clearly, IPC is an important factor in processor performance, and IPC has decreased slightly in this first instantiation of "Bulldozer."

taltamir · Mar 5, 2012

Even if AMD had succeeded in holding the line on IPC while aggressively increasing clocks, they would have ended up with a dud.
The resulting chip would be faster than its predecessor in single- and multi-threaded performance, but not by enough to dislodge Intel, and potentially not by enough to matter.
From a laptop and server standpoint it would be a disaster (just not as big as it is now) due to the exponential increase in power consumption with clock speed increase (even if they held IPC, they would lose IPW).

I think the real purpose was exactly the same as Intel's P4. Marketing dictated that they need "higher MHz" because "higher MHz sells for more money to ignorant buyers".
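On the power point above, a rough Python sketch of why holding IPC while raising clocks costs instructions per watt. It assumes the textbook dynamic power relation (power proportional to C·V²·f) plus the simplifying assumption that voltage must rise linearly with frequency. Neither the model nor the numbers come from AMD, the function names are hypothetical, and under this model the growth is cubic rather than literally exponential.

```python
def relative_power(freq_scale):
    """Dynamic power relative to baseline, assuming P ~ C * V^2 * f."""
    # Assumption: voltage has to rise linearly with frequency.
    voltage_scale = freq_scale
    return voltage_scale ** 2 * freq_scale


def relative_perf_per_watt(freq_scale, ipc_scale=1.0):
    """Performance per watt relative to baseline (performance ~ IPC * f)."""
    performance = ipc_scale * freq_scale
    return performance / relative_power(freq_scale)


# Hypothetical case: 20% higher clock at unchanged IPC.
power_at_1_2x = relative_power(1.2)
ipw_at_1_2x = relative_perf_per_watt(1.2)
```

Under these assumptions a 20% clock increase at constant IPC raises power by roughly 73% and drops performance per watt to about 69% of the baseline.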

pelov · Mar 5, 2012

The next paragraph states:

The new CPU core delivers higher frequency while maintaining IPC, improved multi-thread (parallel computing) performance, instructions per watt, advanced boost functionality, new x86 instruction sets, and over-clock capabilities never seen in previous microarchitectures. We believe these enhancements will show positive lift for end-users as new operating systems and software applications take advantage of the new features inside "Bulldozer." And, looking forward, as process technology matures over time, the core is well structured for potential increased frequencies in the future.

So while Butler gave the answers to some of the questions, it's also unfortunate that they weren't his answers.

2. Why are the integer operation benchmarks so low compared to even previous AMD 4 cores?

Mike Butler, Senior Fellow Design Engineer, AMD - "Bulldozer" is a new microarchitecture that differs in several ways from previous generations. The "Bulldozer" architecture uses both dedicated and shared resources, allowing for a more efficient design, improved instructions per watt and maintaining IPC over our widest operating range – from top boost frequencies in unlocked desktops to throughput server workloads at lower voltage, a range unmatched in prior AMD architecture generations.

Despite AMD stating that IPC stayed the same, then that it didn't, and then that it did again, I think the general public (reviewers in this case) claimed a ~10% dip.

Abwx · Mar 5, 2012

intangir said:
Thanks Dresdenboy, that's exactly what I was telling Abwx. Also, in multiple quotes of Mike Butler around the web, the phrase "hold the line" was used to express the same goal.

Yet maintaining constant IPC relative to the previous
architecture mandates a more constant IPC in the new
architecture, given that its integer execution units were reduced
from 3 in K10 to 2 in a Bulldozer half-module.

Even with only 80% of the K10 core's integer throughput,
the new uarch has higher sustained IPC relative
to the available execution resources.

intangir · Mar 5, 2012

pelov said:
So while Butler gave the answers to some of the questions, it's also unfortunate that they weren't his answers.

Despite AMD stating that IPC stayed the same, then that it didn't, and then that it did again, I think the general public (reviewers in this case) claimed a ~10% dip.

Heh, yes. The text does seem to give the impression of schizophrenia based on conflicting editorial goals. Still, it seems to me that "IPC has decreased slightly" is the least obscured fragment of the statements there, and the rest of the "maintain IPC" verbiage shows a lot of redactional fatigue, so is less to be trusted as communicating Butler's exact words.

intangir · Mar 5, 2012

Abwx said:
Yet maintaining constant IPC relative to the previous
architecture mandates a more constant IPC in the new
architecture, given that its integer execution units were reduced
from 3 in K10 to 2 in a Bulldozer half-module.

Even with only 80% of the K10 core's integer throughput,
the new uarch has higher sustained IPC relative
to the available execution resources.

That's obviously not what was intended. Microarchitects have a perfectly fine term for the concept you are using, which is "utilization". Since they didn't refer to it in those terms, that indicates that what they meant was "performance per clock".

pelov · Mar 5, 2012

intangir said:
Heh, yes. The text does seem to give the impression of schizophrenia based on conflicting editorial goals. Still, it seems to me that "IPC has decreased slightly" is the least obscured fragment of the statements there, and the rest of the "maintain IPC" verbiage shows a lot of redactional fatigue, so is less to be trusted as communicating Butler's exact words.

Yeah. The impression on [H] was that you've got to sift through marketing jargon to find the bits of actually informative statements that were embedded within some of those answers. The fact that they stated it was designed with the server in mind is a prime example of them saying outright that it's not designed with the desktop in mind. I think we'd all have liked to see a bit more honesty, though. I guess some verbal dancing is to be expected. It's like discussing policy with a politician.

Abwx · Mar 5, 2012

intangir said:
That's obviously not what was intended. Microarchitects have a perfectly fine term for the concept you are using, which is "utilization". Since they didn't refer to it in those terms, that indicates that what they meant was "performance per clock".

Call it utilization if you want; that doesn't change the fact
that with fewer resources a core has to have higher utilization,
which means more constant IPC, doesn't it?

Maintaining performance per clock mandates such a requirement.

A simple example: let's say that over three cycles
a K10 can, for example, do 1 + 2 + 3 = 6 instructions / 3 cycles = 2 IPC.

To perform the same 6 instructions in 3 cycles, a BD core has
to stick as closely as possible to its max of 2 IPC, which
mandates a sustained IPC, since the max IPC is 33% lower.
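The arithmetic in this example can be checked with a small Python sketch. The peaks of 3 and 2 instructions per cycle are the simplified per-core limits used in the post (one instruction per integer execution unit per cycle). The per-cycle traces and the helper names are hypothetical.

```python
def average_ipc(instructions_per_cycle):
    """Average IPC over a trace of per-cycle instruction counts."""
    return sum(instructions_per_cycle) / len(instructions_per_cycle)


def fits_peak(instructions_per_cycle, peak):
    """True if no cycle in the trace exceeds the core's peak IPC."""
    return all(0 <= count <= peak for count in instructions_per_cycle)


K10_PEAK = 3  # simplified: three integer execution units
BD_PEAK = 2   # simplified: two integer execution units per half-module

k10_trace = [1, 2, 3]  # bursty: 6 instructions in 3 cycles
bd_trace = [2, 2, 2]   # flat: the only 3-cycle way to reach 6 at a peak of 2

peak_reduction = (K10_PEAK - BD_PEAK) / K10_PEAK  # about 33%
```

Both traces average 2 IPC, but the bursty K10 trace does not fit under a peak of 2, which is the sense in which the narrower core has to run flat out to match it.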

intangir · Mar 5, 2012

Abwx said:
Call it utilization if you want; that doesn't change the fact
that with fewer resources a core has to have higher utilization,
which means more constant IPC, doesn't it?

Not necessarily. It is more than possible to have higher utilization, but more variable IPC. Utilization is a measure of how occupied a CPU's resources are; IPC is more relevant to customer needs in that it measures only useful work done. Two examples of ways in which a CPU could keep its execution units busy while doing no useful work are branch mispredictions, and replay.

The Pentium 4 kept its execution resources excessively busy, but since most of the work being done was actually thrown away, due to mispredicted branches or instructions whose data were not yet ready but were repeatedly sent through the pipeline again on the off-chance that they would be ready the next time around, the IPC did not trend in the same direction as the utilization.

For more detail, see X-bit labs' article here: http://www.xbitlabs.com/articles/cpu/display/replay.html
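A minimal Python sketch of the distinction drawn above between utilization and IPC. The counter values are invented for illustration, `utilization_and_ipc` is a hypothetical helper, and for simplicity it assumes one uop per instruction so that issued-but-not-retired uops stand in for discarded or replayed work.

```python
def utilization_and_ipc(issued_uops, retired_instructions, cycles, issue_slots):
    """Return (utilization, ipc) for a run.

    utilization: fraction of issue slots that were occupied.
    ipc: retired (useful) instructions per cycle.
    """
    utilization = issued_uops / (cycles * issue_slots)
    ipc = retired_instructions / cycles
    return utilization, ipc


# Invented counters for two hypothetical cores over 100 cycles, 2 issue slots each.
# Core A wastes little work; core B replays and discards a lot of it.
core_a = utilization_and_ipc(issued_uops=120, retired_instructions=110,
                             cycles=100, issue_slots=2)
core_b = utilization_and_ipc(issued_uops=180, retired_instructions=80,
                             cycles=100, issue_slots=2)
```

With these assumed counters, core B is busier (90% against 60% utilization) yet retires fewer instructions per cycle (0.8 against 1.1 IPC).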

Dresdenboy · Mar 5, 2012

@Idontcare:
Everest latency/throughput numbers:
http://instlatx64.atw.hu/

I also did an analysis of general latency/throughput trends here (for different subsets of the ISA):

http://www.planet3dnow.de/vbulletin/showthread.php?t=399118&garpg=2#content_start

translated: http://translate.google.com/transla....de/vbulletin/showthread.php?t=399118&garpg=2

The first diagram shows average latency and the second shows throughput (as IPC).

Abwx said:
Call it utilization if you want; that doesn't change the fact
that with fewer resources a core has to have higher utilization,
which means more constant IPC, doesn't it?

Maintaining performance per clock mandates such a requirement.

A simple example: let's say that over three cycles
a K10 can, for example, do 1 + 2 + 3 = 6 instructions / 3 cycles = 2 IPC.

To perform the same 6 instructions in 3 cycles, a BD core has
to stick as closely as possible to its max of 2 IPC, which
mandates a sustained IPC, since the max IPC is 33% lower.

That's correct. But the average IPC for x86 code still hovers around 1. There are code loops which will reach higher levels, but keep in mind that IPC is influenced by many factors, for example:

L1 I$
branch prediction
avg. decoder throughput
branch misprediction penalties
L1+L2 D$ latencies, throughputs, hit rates, way of handling misses (hit-under-miss etc.)
scheduler efficiency
issue width and efficiency
available EUs
load/store forwarding capabilities
partial register handling penalties
retirement throughput
unit sharing
memory access organisation (MMU, x86 page table walking, TLBs)
memory prefetch efficiency
and so on...

So the number of available EUs only limits IPC when none of the other possible limitations come into play. That could be during short phases of a few cycles, or even a few hundred in the case of an unrolled loop.
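One crude way to picture this is as a bottleneck model, where the achievable IPC in a given phase is capped by whichever factor is tightest. The Python sketch below is only that, a sketch: the limit names and values are hypothetical, and real cores do not reduce to a simple minimum.

```python
def ipc_upper_bound(limits):
    """Return (bottleneck_name, ipc_cap) for a dict of per-factor IPC caps."""
    bottleneck = min(limits, key=limits.get)
    return bottleneck, limits[bottleneck]


# Hypothetical caps for typical code: memory behaviour dominates.
typical_phase = {
    "decode": 4.0,
    "execution_units": 2.0,
    "retire": 4.0,
    "memory": 1.1,
}

# Hypothetical caps for an unrolled loop that hits in L1: the EUs become the limit.
unrolled_loop_phase = dict(typical_phase, memory=3.0)

typical_result = ipc_upper_bound(typical_phase)
unrolled_result = ipc_upper_bound(unrolled_loop_phase)
```

In the first assumed phase the EU count is irrelevant because memory caps IPC first. Only in the second does going from 3 to 2 EUs actually show up.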

TuxDave · Mar 5, 2012

Abwx said:
Call it utilization if you want; that doesn't change the fact
that with fewer resources a core has to have higher utilization,
which means more constant IPC, doesn't it?

Maintaining performance per clock mandates such a requirement.

A simple example: let's say that over three cycles
a K10 can, for example, do 1 + 2 + 3 = 6 instructions / 3 cycles = 2 IPC.

To perform the same 6 instructions in 3 cycles, a BD core has
to stick as closely as possible to its max of 2 IPC, which
mandates a sustained IPC, since the max IPC is 33% lower.

Not necessarily, because they can put in hardware that reduces the number of uops per instruction and therefore accomplishes the same number of instructions per cycle with the same utilization but with less hardware.
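As a back-of-the-envelope Python sketch of this argument: instructions per cycle equal uops executed per cycle divided by uops per instruction. The uop ratios below are hypothetical and are not figures for K10 or Bulldozer.

```python
def instructions_per_cycle(uops_per_cycle, uops_per_instruction):
    """IPC given execution throughput in uops and the average uop expansion."""
    return uops_per_cycle / uops_per_instruction


# Hypothetical wide core: 3 uops/cycle, but each instruction averages 1.5 uops.
wide_core_ipc = instructions_per_cycle(uops_per_cycle=3.0, uops_per_instruction=1.5)

# Hypothetical narrow core: 2 uops/cycle, with instructions averaging 1.0 uop.
narrow_core_ipc = instructions_per_cycle(uops_per_cycle=2.0, uops_per_instruction=1.0)
```

Both assumed cores come out at 2 IPC with their execution units fully busy, so fewer units do not by themselves force a lower IPC.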
