AMD Realizes Significant Reduction in Power Consumption by Implementing Cyclos Resona


Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
And what specific instruction is being referred to?

We've got a lot to choose from, and the execution latencies vary widely across them in any given microarchitecture.

So we'd need to define the instruction under consideration, or if it is to be more than one instruction then we must define the instruction mix (and weightings).

In short, it's not simply an academic matter; that would actually be easier than the errand we are setting ourselves upon here.

We would be talking about defining our own BAPCo SYSmark or PassMark with which "effective IPC" would be characterized, and it would be "workload class" dependent. IPC for office apps is different from IPC for MATLAB apps because the instruction mixes, and their weightings, are so different.
That's the definition problem in more detail.
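
To illustrate why "effective IPC" depends on the instruction mix, here is a minimal sketch. It assumes each instruction class can be given a single average cycles-per-instruction figure, which is a simplification, and every number in it is hypothetical rather than measured on any real CPU. The function name `effective_ipc` and the two example mixes are made up for illustration.

```python
def effective_ipc(mix):
    """mix maps instruction class -> (fraction of dynamic instructions, average CPI)."""
    total_fraction = sum(fraction for fraction, _ in mix.values())
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("instruction fractions must sum to 1")
    # Cycles add up, so average the CPI first and invert at the end.
    average_cpi = sum(fraction * cpi for fraction, cpi in mix.values())
    return 1.0 / average_cpi


# Hypothetical mixes and CPIs, chosen only to show the effect of weighting.
office_mix = {"int_alu": (0.50, 0.5), "load_store": (0.35, 1.5), "branch": (0.15, 2.0)}
math_mix = {"int_alu": (0.20, 0.5), "load_store": (0.30, 1.5), "fp": (0.50, 2.0)}

print("office-like IPC:", round(effective_ipc(office_mix), 3))
print("math-like IPC:  ", round(effective_ipc(math_mix), 3))
```

The same core gets a different "IPC" for each mix, which is why any single IPC figure only means something once the workload is named.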

I assume AMD internally measures IPC while running SPEC benchmarks, since they mostly refer to them with regard to performance. This would include recompilation, which in turn means lower performance for unchanged "legacy" code. Since SPEC results have been published, we might even be able to verify their claim.

What the heck would "constant IPC" mean? To me it means "steady-state IPC", you process the exact same execution loop indefinitely and measure the average IPC that comes from doing so. But what instructions? And perhaps more importantly, to what end?
The ISSCC paper "40-Entry Unified Out-of-Order Scheduler and Integer" by M. Golden et al. states:
Compared to previous AMD x86-64 cores [3-6], project goals reduce the number of FO4 inverter delays per cycle by more than 20%, while maintaining constant IPC, to achieve higher frequency and performance in the same power envelope, even with increased core counts.
So in the context of a comparison with previous cores, they had one goal to maintain IPC. This should at least make clear that it's not about "steady-state IPC".
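
As a rough back-of-the-envelope reading of that goal: if performance is taken as IPC times frequency, and if a single FO4 delay is assumed to stay the same (an assumption, since the cores being compared are built on different processes), then cutting the FO4 delays per cycle by 20% shortens the cycle to 0.8x. The function below is a hypothetical sketch of that arithmetic, not anything from the paper.

```python
def relative_performance(fo4_reduction, ipc_ratio=1.0):
    """Performance relative to the old core, assuming perf = IPC * frequency."""
    if not 0.0 <= fo4_reduction < 1.0:
        raise ValueError("fo4_reduction must be in [0, 1)")
    # Fewer FO4 delays per cycle means a shorter cycle, hence a higher clock.
    frequency_ratio = 1.0 / (1.0 - fo4_reduction)
    return ipc_ratio * frequency_ratio


print(relative_performance(0.20))                  # constant IPC
print(relative_performance(0.20, ipc_ratio=0.90))  # hypothetical 10% IPC loss
```

Under these assumptions, constant IPC would be worth about 25% more single-thread performance, and a hypothetical 10% IPC loss would eat roughly half of that gain.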
 

Idontcare

Elite Member
Oct 10, 1999
21,118
59
91
The ISSCC paper "40-Entry Unified Out-of-Order Scheduler and Integer" by M. Golden et al. states:

So in the context of a comparison with previous cores, they had one goal to maintain IPC. This should at least make clear that it's not about "steady-state IPC".

Ah, makes more sense now.

Did they succeed? (anyone publish Everest instruction latency numbers for a clock-equivalent zambezi and PhII?)
 

intangir

Member
Jun 13, 2005
113
0
76
So in the context of a comparison with previous cores, they had one goal to maintain IPC. This should at least make clear that it's not about "steady-state IPC".

Thanks Dresdenboy, that's exactly what I was telling Abwx. Also, in multiple quotes of Mike Butler around the web, the phrase "hold the line" was used to express the same goal.

http://techreport.com/articles.x/21813/1
TechReport said:
Yet Chief Architect Mike Butler told us the engineering team's goal with Bulldozer was to "hold the line" on instructions per clock (presumably at about the same rate as the Phenom II) and to "aggressively pursue higher frequencies."

http://arstechnica.com/gadgets/news/2011/10/can-amd-survive-bulldozers-disappointing-debut.ars
ArsTechnica said:
For Bulldozer specifically, additional design influences came into play. In the words of Chief Architect Mike Butler, AMD's goal was to "hold the line" on IPC (presumably meaning to keep it at around the same level as in Phenom II) but to increase the clock speed, thereby achieving improved single-threaded performance, too.

http://www.tomshardware.co.uk/fx-8150-zambezi-bulldozer-990fx,review-32295-6.html
TomsHardware said:
With its Bulldozer architecture, AMD's architects say it was their goal to “hold the line” on IPC and create hardware that’d scale to much higher frequencies. Given what we already know about the FX-8150's specifications, significantly higher frequencies aren’t being realized today, so before we even run any benchmarks, we have to assume similar IPC throughput, fairly comparable clocks, and then cross our fingers for better scaling across multiple cores if Bulldozer has any hope at all of outperforming the 3.7 GHz Phenom II X4 980 or Turbo Core-equipped Phenom II X6 1100T.

"Hold the line" cannot be interpreted in any way as "reduce variability". "Hold the line" has the meaning of "not giving up ground", which in this context means not giving up the levels of IPC of their previous chip. The quotes above all agree with AMD's stated goals in Golden's ISSCC 2011 paper, which was to increase the frequency while keeping the same IPC. Of course, Mike Butler admits they failed:

http://www.hardocp.com/article/2011/11/29/hardocp_readers_ask_amd_bulldozer_questions/
HardOCP said:
3. It seems that the idea of modules and cores sharing parts is brilliant, but the idea of increasing frequency while lowering IPC seems like a step backwards. Why was this decided on?

Mike Butler, Senior Fellow Design Engineer, AMD - Clearly, IPC is an important factor in processor performance, and IPC has decreased slightly in this first instantiation of "Bulldozer."
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
Even if AMD had succeeded in holding the line on IPC while aggressively increasing clocks, they would have ended up with a dud.
The resulting chip would be faster than its predecessor in single- and multi-threaded performance, but not by enough to dislodge Intel and potentially not by enough to matter.
From a laptop and server standpoint it would be a disaster (just not as big as it is now), because power consumption rises much faster than linearly with clock speed once voltage has to go up too (even if they held IPC they would lose IPW, instructions per watt).
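
The power argument can be sketched with the usual first-order model for dynamic CMOS power, P ∝ C·V²·f. The snippet below assumes switched capacitance stays constant and ignores leakage entirely; the function names and the 20% / 10% ratios are hypothetical and only show the shape of the trade-off.

```python
def relative_dynamic_power(freq_ratio, voltage_ratio):
    """Dynamic power relative to baseline, using P ~ C * V^2 * f with C held constant."""
    return voltage_ratio ** 2 * freq_ratio


def relative_perf_per_watt(freq_ratio, voltage_ratio, ipc_ratio=1.0):
    """Performance per watt relative to baseline, with perf = IPC * frequency."""
    performance = ipc_ratio * freq_ratio
    return performance / relative_dynamic_power(freq_ratio, voltage_ratio)


# Hypothetical case: 20% higher clock that needs 10% more voltage, IPC unchanged.
print(relative_dynamic_power(1.2, 1.1))
print(relative_perf_per_watt(1.2, 1.1))
```

In this model the frequency term cancels out of performance per watt, so the loss comes entirely from the extra voltage needed to reach the higher clock.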

I think the real purpose was exactly the same as with Intel's P4. Marketing dictated that they needed "higher MHz" because "higher MHz sells for more money to ignorant buyers".
 

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
The next paragraph states

The new CPU core delivers higher frequency while maintaining IPC, improved multi-thread (parallel computing) performance, instructions per watt, advanced boost functionality, new x86 instruction sets, and over-clock capabilities never seen in previous microarchitectures. We believe these enhancements will show positive lift for end-users as new operating systems and software applications take advantage of the new features inside "Bulldozer." And, looking forward, as process technology matures over time, the core is well structured for potential increased frequencies in the future.

So while Butler gave the answers to some of the questions, it's also unfortunate that they weren't his answers.

2. Why are the integer operation benchmarks so low compared to even previous AMD 4 cores?

Mike Butler, Senior Fellow Design Engineer, AMD - "Bulldozer" is a new microarchitecture that differs in several ways from previous generations. The "Bulldozer" architecture uses both dedicated and shared resources, allowing for a more efficient design, improved instructions per watt and maintaining IPC over our widest operating range – from top boost frequencies in unlocked desktops to throughput server workloads at lower voltage, a range unmatched in prior AMD architecture generations.

Despite stating it stayed the same and then not and then that it did again, I think the general public (reviewers in this case) claimed a ~10% dip.
 

Abwx

Lifer
Apr 2, 2011
11,172
3,869
136
Thanks Dresdenboy, that's exactly what I was telling Abwx. Also, in multiple quotes of Mike Butler around the web, the phrase "hold the line" was used to express the same goal.

Yet maintaining constant IPC relative to the previous architecture mandates a more constant IPC in the new architecture, given that its integer execution units were reduced from 3 in K10 to 2 in a Bulldozer half-module.

Even with only 80% of the K10 core's integer throughput, the new uarch has higher sustained IPC relative to its available execution resources.
 

intangir

Member
Jun 13, 2005
113
0
76
So while Butler gave the answers to some of the questions, it's also unfortunate that they weren't his answers.

Despite stating it stayed the same and then not and then that it did again, I think the general public (reviewers in this case) claimed a ~10% dip.

Heh, yes. The text does seem to give the impression of schizophrenia based on conflicting editorial goals. Still, it seems to me that "IPC has decreased slightly" is the least obscured fragment of the statements there, and the rest of the "maintain IPC" verbiage shows a lot of redactional fatigue, so is less to be trusted as communicating Butler's exact words.
 

intangir

Member
Jun 13, 2005
113
0
76
Yet maintaining constant IPC relative to the previous architecture mandates a more constant IPC in the new architecture, given that its integer execution units were reduced from 3 in K10 to 2 in a Bulldozer half-module.

Even with only 80% of the K10 core's integer throughput, the new uarch has higher sustained IPC relative to its available execution resources.

That's obviously not what was intended. Microarchitects have a perfectly fine term for the concept you are using, which is "utilization". Since they didn't refer to it in those terms, that indicates that what they meant was "performance per clock".
 

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
Heh, yes. The text does seem to give the impression of schizophrenia based on conflicting editorial goals. Still, it seems to me that "IPC has decreased slightly" is the least obscured fragment of the statements there, and the rest of the "maintain IPC" verbiage shows a lot of redactional fatigue, so is less to be trusted as communicating Butler's exact words.

Yea. The impression on [H] was that you've got to sift through marketing jargon to find the bits of actual information embedded within some of those answers. The fact that they stated it was designed with the server in mind is a prime example of them saying outright that it's not designed with the desktop in mind. I think we'd have all liked to see a bit more honesty, though. I guess some verbal dancing is to be expected. It's like discussing policy with a politician.
 

Abwx

Lifer
Apr 2, 2011
11,172
3,869
136
That's obviously not what was intended. Microarchitects have a perfectly fine term for the concept you are using, which is "utilization". Since they didn't refer to it in those terms, that indicates that what they meant was "performance per clock".

Call it utilization if you want; that doesn't change the fact that with fewer resources a core has to have higher utilization, which means more constant IPC, doesn't it?

Maintaining performance per clock mandates that requirement.

A simple example: say that over three cycles a K10 does, for example, 1 + 2 + 3 = 6 instructions in 3 cycles = 2 IPC.

To perform the same 6 instructions in 3 cycles, a BD core has to stick as closely as possible to its maximum of 2 IPC, which mandates a sustained IPC, since its maximum IPC is 33% lower.
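
The three-cycle example can be written out as a small sketch. It takes the post's simplification at face value (a peak of 3 instructions per cycle for K10 and 2 for a BD core); the helper `summarize` and its field names are made up for illustration.

```python
from statistics import mean, pstdev


def summarize(per_cycle_instructions, peak_per_cycle):
    """Return average IPC, utilization of the peak rate, and cycle-to-cycle spread."""
    if max(per_cycle_instructions) > peak_per_cycle:
        raise ValueError("a cycle exceeds the core's peak rate")
    average_ipc = mean(per_cycle_instructions)
    return {
        "ipc": average_ipc,
        "utilization": average_ipc / peak_per_cycle,
        "spread": pstdev(per_cycle_instructions),
    }


# Per-cycle instruction counts from the example above.
k10 = summarize([1, 2, 3], peak_per_cycle=3)
bd = summarize([2, 2, 2], peak_per_cycle=2)
print("K10:", k10)
print("BD: ", bd)
```

Both traces average 2 IPC, but the narrower core only gets there at full utilization with no cycle-to-cycle variation, which is the point being argued here.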
 

intangir

Member
Jun 13, 2005
113
0
76
Call it utilization if you want; that doesn't change the fact that with fewer resources a core has to have higher utilization, which means more constant IPC, doesn't it?

Not necessarily. It is more than possible to have higher utilization, but more variable IPC. Utilization is a measure of how occupied a CPU's resources are; IPC is more relevant to customer needs in that it measures only useful work done. Two examples of ways in which a CPU could keep its execution units busy while doing no useful work are branch mispredictions, and replay.

The Pentium 4 kept its execution resources excessively busy, but since most of the work being done was actually thrown away, due to mispredicted branches or instructions whose data were not yet ready but were repeatedly sent through the pipeline again on the off-chance that they would be ready the next time around, the IPC did not trend in the same direction as the utilization.
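
A minimal sketch of that distinction, with made-up counter values: utilization counts every micro-op sent down the pipes, including replayed and wrong-path work, while IPC counts only retired instructions. For simplicity it assumes one micro-op per instruction; the function name and all numbers are hypothetical.

```python
def ipc_and_utilization(issued_uops, retired_instructions, cycles, issue_width):
    """Return (IPC, utilization) from hypothetical performance-counter totals."""
    utilization = issued_uops / (cycles * issue_width)
    ipc = retired_instructions / cycles
    return ipc, utilization


# Design A: little wasted work. Design B: lots of replayed and wrong-path micro-ops.
ipc_a, util_a = ipc_and_utilization(issued_uops=1800, retired_instructions=1500,
                                    cycles=1000, issue_width=3)
ipc_b, util_b = ipc_and_utilization(issued_uops=2700, retired_instructions=900,
                                    cycles=1000, issue_width=3)
print(f"A: IPC={ipc_a:.2f} utilization={util_a:.0%}")
print(f"B: IPC={ipc_b:.2f} utilization={util_b:.0%}")
```

Design B keeps its execution units busier yet retires fewer instructions per cycle, so the two metrics move in opposite directions.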

For more detail, see X-bit labs' article here: http://www.xbitlabs.com/articles/cpu/display/replay.html
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
@Idontcare:
Everest latency/throughput numbers:
http://instlatx64.atw.hu/

I also did an analysis of general latency/throughput trends here (for different subsets of the ISA):

http://www.planet3dnow.de/vbulletin/showthread.php?t=399118&garpg=2#content_start

translated: http://translate.google.com/transla....de/vbulletin/showthread.php?t=399118&garpg=2

The first diagram shows average latency and the second shows throughput (as IPC).

Call it utilization if you want; that doesn't change the fact that with fewer resources a core has to have higher utilization, which means more constant IPC, doesn't it?

Maintaining performance per clock mandates that requirement.

A simple example: say that over three cycles a K10 does, for example, 1 + 2 + 3 = 6 instructions in 3 cycles = 2 IPC.

To perform the same 6 instructions in 3 cycles, a BD core has to stick as closely as possible to its maximum of 2 IPC, which mandates a sustained IPC, since its maximum IPC is 33% lower.

That's correct. But the average IPC for x86 code still hovers around 1. There are code loops which will reach higher levels, but keep in mind that IPC is influenced by many factors, for example:
  • L1 I$
  • branch prediction
  • avg. decoder throughput
  • branch misprediction penalties
  • L1+L2 D$ latencies, throughputs, hit rates, way of handling misses (hit-under-miss etc.)
  • scheduler efficiency
  • issue width and efficiency
  • available EUs
  • load/store forwarding capabilities
  • partial register handling penalties
  • retirement throughput
  • unit sharing
  • memory access organisation (MMU, x86 page table walking, TLBs)
  • memory prefetch efficiency
  • and so on...

So the number of available EUs only limits IPC when none of the other possible limitations comes into play. That could be short phases of a few cycles, or even a few hundred in the case of an unrolled loop.
 

TuxDave

Lifer
Oct 8, 2002
10,572
3
71
Call it utilization if you want; that doesn't change the fact that with fewer resources a core has to have higher utilization, which means more constant IPC, doesn't it?

Maintaining performance per clock mandates that requirement.

A simple example: say that over three cycles a K10 does, for example, 1 + 2 + 3 = 6 instructions in 3 cycles = 2 IPC.

To perform the same 6 instructions in 3 cycles, a BD core has to stick as closely as possible to its maximum of 2 IPC, which mandates a sustained IPC, since its maximum IPC is 33% lower.

Not necessarily, because they can put in hardware that reduces the number of uops per instruction and thereby accomplish the same number of instructions per cycle at the same utilization, but with less hardware.
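
A small sketch of that relationship, treating IPC as micro-op throughput divided by micro-ops per instruction. The pipe counts, utilization, and uops-per-instruction figures are hypothetical and are not claims about what K10 or Bulldozer actually do.

```python
def ipc(execution_pipes, utilization, uops_per_instruction):
    """Instructions per cycle = (pipes * utilization) uops per cycle / uops per instruction."""
    uops_per_cycle = execution_pipes * utilization
    return uops_per_cycle / uops_per_instruction


# Same utilization on both designs; the narrower one needs fewer uops per instruction.
wide = ipc(execution_pipes=3, utilization=2 / 3, uops_per_instruction=1.5)
narrow = ipc(execution_pipes=2, utilization=2 / 3, uops_per_instruction=1.0)
print(round(wide, 3), round(narrow, 3))
```

With these numbers the two-pipe design matches the three-pipe design's IPC at identical utilization, purely because each instruction costs it fewer micro-ops.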
 