AMD Barcelona Thoughts/Questions


Kuzi

Senior member
Sep 16, 2007
572
0
0
Originally posted by: IntelUser2000
The problem with Anand's tests that screws with everybody's minds is that the Barcelona tests used 1024x768 to show performance that's "CPU bound", while they did not do that with the Core 2 Duo tests. The Core 2 Duo tests used 1600x1200. Tom's Hardware showed the pure CPU power Core 2 Duo had over the Athlons by using 1024x768 resolution and a GeForce 8800 GTX. It doesn't happen at 1600x1200.

You can see in: http://www.anandtech.com/cpuch...owdoc.aspx?i=3038&p=15

that Core 2 Duos over 2.66GHz are too GPU bound to really show performance differences. You can see excellent scaling by comparing the 2.33GHz to the 2.66GHz.

Oblivion: 83.2%
HL2: 67.3%

By looking at those two points alone, it's better than Barcelona's scaling from 2.0 to 2.5GHz:
Oblivion: 66.4%
HL2: 65.6%

You are comparing a dual-core Intel with a quad-core AMD, and comparing different frequency increases (about 14% for Intel and 25% for AMD), so all your numbers are incorrect.

Check out the QX6850 3GHz scaling compared to the Q6600 2.4GHz; that's a better comparison, since those are quad cores with a 25% clock difference. You can check the link below and add up the numbers yourself:

xbitlabs review of QX6850

The testing was not done on games only; they compare many applications there, so no worries about the resolution. After adding up all the numbers and taking the average, you get about an 18% improvement in performance for the 25% clock increase from Q6600 to QX6850. From there I arrived at 17%, because the QX6850 uses a 1333MHz FSB while the Q6600 was at 1066MHz FSB (the 1333MHz FSB advantage gives about a +1% increase in most apps, so 18% - 1% = 17%).

The Barcelona tested here at AnandTech scaled a bit better, at 19% for a 25% clock increase.
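
To make the arithmetic explicit, here's a quick C sketch using the numbers above (the 1% FSB adjustment is an estimate, not a measured figure):

#include <stdio.h>

/* Sanity check of the scaling arithmetic above. Inputs are the numbers
   quoted in the post; the FSB adjustment is an estimate. */
int main(void)
{
    double clock_gain = 0.25;   /* Q6600 2.4GHz -> QX6850 3.0GHz */
    double perf_gain  = 0.18;   /* average gain across the xbitlabs apps */
    double fsb_bonus  = 0.01;   /* estimated benefit of 1333 vs 1066 FSB */

    double intel = (perf_gain - fsb_bonus) / clock_gain;   /* 0.17/0.25 */
    double amd   = 0.19 / 0.25;   /* Barcelona: 19% perf for 25% clock */

    printf("Core 2 quad scaling efficiency: %.2f\n", intel);   /* 0.68 */
    printf("Barcelona scaling efficiency:   %.2f\n", amd);     /* 0.76 */
    return 0;
}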

Again, it's too early to tell, but these are some numbers at least.
 

Accord99

Platinum Member
Jul 2, 2001
2,259
172
106
From AnandTech's review of the QX6850, you can calculate the scaling from the Q6600 to the QX6800 to avoid the FSB bump. The C2D quad scales at an 8.6% performance increase for every 10% increase in clock speed, vs Barcelona's 7.9%, with games not considered due to the difference in resolution.

Another interesting thing is to compare the difference in performance of the 3GHz QX6850 vs the 3GHz FX-74 with the difference between the 2GHz Barcelona and a simulated 2GHz K8. A single C2D quad-core beats the dual-socket dual-core FX-74 by (both equipped with fast memory):

Sysmark: 36.9%
DivX: 42.7%
WME: 18.5%
3Dsmax 9: 33.9%
Cinebench: 21.4%
Oblivion: 34.1%
HL2: 24.8%
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: dmens
i was just looking at realworldtech

http://www.realworldtech.com/p...ID=RWT051607033728&p=5

ROB writes 3 per cycle from the front

Ok, I wrote my testcase. Since I can't find the "attach code" button (is that only in the programming forum?), I put it here. Basically, it's a loop, unrolled 10 times, that does a push, an inc, and a pop, and it runs 268 million times (0xfffffff). Push requires 2 "uops": a store and a subtraction. Inc is one operation. Pop is 2: a load and an addition. I unrolled it 10 times because all reasonable x86 CPUs other than Core 2 have a one-cycle taken-branch fetch bubble and looping requires a couple extra instructions, so I wanted to amortize the overhead.
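
The original listing didn't survive in this archive, so here is a minimal reconstruction of the testcase as described -- a sketch assuming GCC extended inline asm on x86-64, not necessarily the original code. Build and time it with: gcc -O2 bench.c -o bench && time ./bench

/* Loop of push/inc/pop, unrolled 10x, run 0xfffffff times. */
int main(void)
{
    unsigned long n = 0xFFFFFFFUL;   /* 268435455 loop iterations */
    __asm__ volatile (
        "1:\n\t"
        ".rept 10\n\t"               /* the 3-instruction block, 10x */
        "push %%rax\n\t"             /* 2 ops: store + stack-pointer sub */
        "inc  %%rax\n\t"             /* 1 ALU op */
        "pop  %%rax\n\t"             /* 2 ops: load + stack-pointer add */
        ".endr\n\t"
        "dec  %0\n\t"                /* loop overhead, amortized by */
        "jnz  1b"                    /* the 10x unrolling */
        : "+r" (n)
        :
        : "rax", "cc", "memory");
    return 0;
}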

The program took 3.531 seconds on my 2.3GHz AMD64 CPU. 268435455 iterations * 10 unrollings / 3.5 seconds = 2,300,875,328 iterations of the 3-instruction block per second, which means it's doing all 3 instructions in a single cycle. That's 5 execution-unit (i.e. "unfused") uops and it's 3 macro-ops.
 

dmens

Platinum Member
Mar 18, 2005
2,275
960
136
interesting. from what i'm reading, the pack buffer before the rename has a throughput of three uops per cycle. could some uops skip that pipestage entirely? or perhaps the frontend recognized something in the stream and optimized it. i'll try to look it up in the optimization handbook. i do know c2d performs that type of optimization.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: dmens
interesting. from what i'm reading, the pack buffer before the rename has a throughput of three uops per cycle. could some uops skip that pipestage entirely? or perhaps the frontend recognized something in the stream and optimized it. i'll try to look it up in the optimization handbook. i do know c2d performs that type of optimization.

I don't know what pack does. If a "pack" "unit" can handle 3 uops, maybe whatever it is gets replicated 3 times. I don't think the K8 frontend can do any interesting optimizations on that instruction sequence.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: Viditor
CTho9305 and dmens...I really appreciate the dialogue, but could you "dumb this down" for the rest of us?
BTW, thanks very much for the input!

I can try... hopefully this post ends up being useful despite the serious oversimplifications. I'm going to arbitrarily ignore a very large number of very significant things in order to keep this understandable to people with minimal comp-arch background. If you understand the "classic 5-stage pipeline", you may end up either confused or thinking I'm stupid. I revised parts of this a bunch of times, so hopefully it still "flows" reasonably well.

MOTIVATION FOR OUT OF ORDER EXECUTION
A CPU executes an instruction in a series of steps:
1. Fetch the instruction from memory.
2. Decode the instruction.
3. Get the inputs to the instruction.
4. Execute the instruction in one of the execution units (let's say we have an ALU, a multiplier, a divider, a memory-read unit, and a memory-write unit).
5. Write the result back to the register file (or memory or the screen).

If you build your machine just like this one, you can have one instruction in each of those stages and have a throughput of 1 instruction each cycle. Note that this means only 1 of the 5 execution units will ever be in use at a time. This design looks something like a line of people handing a package down the line, and in the middle of the line there's a spot where there are 5 people next to each other and you can hand the package to one of them. The multiplier and divider are slow, and they hold on to packages for a while before passing them on, and while they're holding something, everyone else has to wait (packages stay in order).

* = person, | = directions a package can be handed down to.

*
|
*
|\
* * (except 5 instead of just 2 here)
|/
*

Now, you'll often find that one instruction takes a while to execute and holds up the rest of the machine, which is bad for performance. For example, if you had a multiply instruction, there would be about 3 cycles that the machine spends waiting for the multiply to finish while everything else is stalled. To get around this, you can use "out of order execution".
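
To put rough numbers on that, here's a toy cycle-count model (entirely an illustration, not from the original post):

#include <stdio.h>

/* Toy model of the in-order pipeline above: each op occupies the execute
   stage for its full latency, stalling everything behind it. An ideal
   out-of-order machine overlaps the multiply's extra cycles with the
   independent ops that follow. */
int main(void)
{
    int latency[] = {1, 1, 4, 1, 1, 1};   /* adds around one 4-cycle mul */
    int n = sizeof latency / sizeof latency[0];

    int in_order = 0;
    for (int i = 0; i < n; i++)
        in_order += latency[i];           /* stalls add up: 9 cycles */

    int ideal_ooo = n;                    /* one completion per cycle: 6 */

    printf("in-order: %d cycles, ideal out-of-order: %d cycles\n",
           in_order, ideal_ooo);
    return 0;
}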

BUFFERS REQUIRED FOR OUT OF ORDER EXECUTION
That changes the story a little:

1. Fetch the instruction from memory.
2. Decode the instruction.
3. Send the instruction to a queue, where it waits until the inputs become available.
4. Execute the instruction in one of the execution units (let's say we have an ALU, a multiplier, a divider, a memory-read unit, and a memory-write unit).
5. Wait in a queue until the instruction is the oldest instruction.
6. Write the result back to the register file (or memory or the screen).

A quick note: an input might not be available if it came from something slow. If your input came from an add, it'll be ready the next cycle, but if it came from a multiply you'll have to wait around for a while.

Now, instead of having a line where each person holds one package at a time and passes it on with every tick of a clock, there's a buffer part of the way down. The front-end of the CPU (fetch and decode) crunches through instructions as fast as it can and tosses them into this buffer. The back-end (execute) takes instructions as they become ready, executes them, and puts them into a second buffer. The instructions finish ("retire") when they leave the second buffer.

You might wonder why we have to have step 5 - why do instructions have to wait? Well, the problem is that program flow doesn't always go smoothly. What happens if a computation produces a number too big to store? Let's consider this example:
a = 999
b = 999
c = 1
z = 0
c = a * b
d = z + 1

First interesting cycle: z becomes 0
Next cycle: the multiplication starts
Next cycle: the multiplication is still going. d becomes 1
Next cycle: the multiplication is still going. we discover that we can't hold a number that big.
Next cycle: we dump the values of all of the variables so the programmer can figure out what happened.

At this point, we have a problem. The programmer will want to see the values of all the variables to figure out what his mistake was, and he'll see a=999, b=999, c=garbage, z=0, and d=1. He'll be very confused, because even though the program crashed on the 5th instruction, the 6th instruction already executed!

To solve this, we use that second queue. The real registers aren't written out of order as their values are calculated; they're written in order. By waiting to do writeback until the instruction is the oldest in the machine, we can ensure that the "program crashed" variable dump can't show the updated value of d until the multiplication has actually finished. This queue is called the "re-order buffer" (ROB) because it puts the instructions back into the right order.

Note that in reality, the mistakes that happen are caused by the branch predictor 99.99...% of the time, and by exceptions like divide-by-zero / invalid memory access / etc. very infrequently.

I made some serious simplifications here which may have confused you (whoever's reading this post). If you had no background in this stuff, you probably got the message, but if you have just a little background you might have figured out some things that made the examples confusing. I'm going to ignore that for now. The message was: "there's a queue at the front that holds instructions until they're ready, and there's a queue at the back that holds them so they finish in order".

BUFFER SIZES AND UOPS
The sizes of the two queues are very important. Let's say a division takes 100 cycles. If the queues hold 10 entries, what happens when you encounter a division? Well, the re-order buffer can only track 10 instructions and they have to be handled in order, so once the division instruction is the oldest instruction in the machine, it probably hasn't finished yet. Over the next few cycles, the other execution units finish some more instructions and fill up the remaining 9 ROB slots. At this point, the machine has to stall until the division finishes, because if it executes any more instructions it won't know how to put them back in order. The bigger the queues are, the longer the machine can stay busy when some units are encountering long delays.
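
The division example works out like this (a toy calculation, assuming one instruction dispatched per cycle; the constants are from the paragraph above):

#include <stdio.h>

/* A 100-cycle divide enters a 10-entry ROB; younger single-cycle ops
   keep completing behind it, but nothing retires past the unfinished
   divide, so the ROB fills and the machine stalls until it's done. */
int main(void)
{
    const int rob_size    = 10;
    const int div_latency = 100;

    int busy_cycles  = rob_size - 1;              /* 9 younger ops fill the ROB */
    int stall_cycles = div_latency - busy_cycles; /* then: nothing to do */

    printf("busy %d cycles, then stalled %d of the divide's %d cycles\n",
           busy_cycles, stall_cycles, div_latency);
    return 0;
}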

The size of the first queue comes into play when reading instructions from memory occasionally takes extra long (a cache miss). If the front-end can get far enough ahead of the execution units, then even when it gets held up for a few cycles because of a cache miss, the execution units can keep crunching instructions that are waiting in that buffer. If the buffer is small, the execution units will quickly run out of work to do.

There's another constraint besides just size: how many entries you can add to them each cycle ("dispatch"), and how many you can remove from the second one each cycle ("retire"). There's a lot of book-keeping type work involved, and the complexity grows drastically as you increase this number. If you go back to that slow-division example above, real CPUs will be able to empty the ROB at a rate of about 3 entries per cycle, even though as soon as the division finishes, all entries are theoretically ready to retire.

Now, many real x86 instructions aren't as simple as add, multiply, read, write. A single x86 instruction, like "push", can do a combination of things ("push" writes a value to memory and also decrements a number).

Older Intel CPUs broke that "push" into 2 "micro-operations" ("µops" or "uops"), one for each piece of work, and each uop takes its own slot in the queues. If you had a 100-entry ROB, you could actually only hold 50 "push" instructions. This is pretty intuitive.

AMD's architecture, on the other hand, tracks the "push" in one slot, but each queue entry has a little extra information that indicates what operations are required by the entry (an entry holding a "push" instruction indicates that both a subtraction and a memory write are required). Both the ALU and the memory-write unit know that they'll have to do an operation for a "push". At the cost of this added complexity, the 100-entry ROB can hold 100 "push" instructions (making it effectively twice as big!). One annoying thing about this style is that the term "uop" is now ambiguous: is a uop something that takes one entry in the ROB, or something that an execution unit views as one piece of work? When someone says K8 is 3-wide, it's 3 ROB-entries wide, but more than 3 execution-unit pieces of work wide.
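
A toy data-structure sketch of that difference (the field names are invented for illustration, not AMD's or Intel's terminology):

#include <stdio.h>

/* P6-style: each piece of execution-unit work gets its own ROB slot,
   so a "push" consumes two entries. */
struct p6_rob_entry {
    unsigned op;                /* one piece of execution-unit work */
    unsigned done : 1;
};

/* K8-style: one slot per macro-op, with flags recording which units
   still owe work, so the same number of slots tracks twice the work. */
struct k8_rob_entry {
    unsigned needs_alu   : 1;   /* e.g. the push's stack-pointer subtract */
    unsigned needs_store : 1;   /* e.g. the push's memory write */
    unsigned needs_load  : 1;
    unsigned done_alu    : 1;
    unsigned done_store  : 1;
    unsigned done_load   : 1;
};

int main(void)
{
    /* a "push" fits in a single K8-style entry */
    struct k8_rob_entry push = { .needs_alu = 1, .needs_store = 1 };
    printf("push: alu=%u store=%u -- one ROB slot\n",
           push.needs_alu, push.needs_store);
    return 0;
}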

With Pentium M, Intel picked up some of the benefit of the AMD-style architecture with "uop fusion". uop fusion allows one queue entry to represent more than one execution-unit piece of work. Core 2's "macro-op fusion" improves on this further, so there are some things that take less queue space in Core 2 and some things that take less space in K8.

WHAT DMENS AND I WERE ARGUING ABOUT
dmens and I were arguing about exactly how wide K8 is. He thought that some parts of the pipeline were limited to 3 execution-unit pieces of work wide, most likely because a lot of people use "uop" inconsistently when talking about AMD chips.
 

dmens

Platinum Member
Mar 18, 2005
2,275
960
136
Originally posted by: CTho9305
Ok, I wrote my testcase. Since I can't find the "attach code" button (is that only in the programming forum?), I put it here. Basically, it's a loop, unrolled 10 times, that does a push, an inc, and a pop, and it runs 268 million times (0xfffffff). Push requires 2 "uops": a store and a subtraction. Inc is one operation. Pop is 2: a load and an addition. I unrolled it 10 times because all reasonable x86 CPUs other than Core 2 have a one-cycle taken-branch fetch bubble and looping requires a couple extra instructions, so I wanted to amortize the overhead.

The program took 3.531 seconds on my 2.3GHz AMD64 CPU. 268435455 iterations * 10 unrollings / 3.5 seconds = 2,300,875,328 iterations of the 3-instruction block per second, which means it's doing all 3 instructions in a single cycle. That's 5 execution-unit (i.e. "unfused") uops and it's 3 macro-ops.

looked at the instruction tables, your code is using register operands, so POP is the only one that is broken into two uops, for a total of 4 uops per chunk. but that is still greater than the 3 uop limit i thought was the case.

anyways, the hardware optimization i was thinking of was the fact that the push/pop dealt with the same register and there is no dependency chain, it is possible for the machine to remove the instruction altogether. i don't know if the K8 does such a thing. maybe that is part of the functionality of the pack buffer.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: dmens
Originally posted by: CTho9305
Ok, I wrote my testcase. Since I can't find the "attach code" button (is that only in the programming forum?), I put it here. Basically, it's a loop, unrolled 10 times, that does a push, an inc, and a pop, and it runs 268 million times (0xfffffff). Push requires 2 "uops": a store and a subtraction. Inc is one operation. Pop is 2: a load and an addition. I unrolled it 10 times because all reasonable x86 CPUs other than Core 2 have a one-cycle taken-branch fetch bubble and looping requires a couple extra instructions, so I wanted to amortize the overhead.

The program took 3.531 seconds on my 2.3GHz AMD64 CPU. 268435455 iterations * 10 unrollings / 3.5 seconds = 2,300,875,328 iterations of the 3-instruction block per second, which means it's doing all 3 instructions in a single cycle. That's 5 execution-unit (i.e. "unfused") uops and it's 3 macro-ops.

looked at the instruction tables, your code is using register operands, so POP is the only one that is broken into two uops, for a total of 4 uops per chunk. but that is still greater than the 3 uop limit i thought was the case.

AMD's optimization guide for K8 confirms POP RAX is a directpath double, which means 2 macro-operations. Interesting. Did I miscalculate the performance?

edit: I wonder if there's something wrong with the "time" command on linux on dual-core CPUs.
edit2: I think I dropped a factor of 3
 

dmens

Platinum Member
Mar 18, 2005
2,275
960
136
Originally posted by: CTho9305
AMD's optimization guide for K8 confirms POP RAX is a directpath double, which means 2 macro-operations. Interesting. Did I miscalculate the performance?

the reason i am skeptical of a macro-op level handling of code beyond decode (other than retirement) is because it seems unnecessary to burden the backend of the machine with x86, except for the parts of the machine that absolutely demand it (and there's only a few of them i can think of). even if the width of the machine from decode to rename is 3 macro-ops, which as you mentioned can be far more uops, can rename even handle that many uops per cycle? it would be pointless to widen one pipestage so much only to have it run into an immediate bottleneck the next cycle.

that and it would be (imho) monstrously difficult to make the decode to rename interface so wide, at least from my experience.
 

OneEng

Senior member
Oct 25, 1999
585
0
0
TechReport has a new article up here: http://techreport.com/articles.x/13224

In application server benchmarks, a 2.5GHz K10 is faster than a 3.0GHz Clovertown and only 4% slower than the 3.0GHz Penryn (if you look at the 4-warehouse score, it actually equals it).

In application server performance, K10 is around 15% faster per clock than Penryn and 25% faster than Clovertown.

It is still notable that the cache memory bandwidth of K10 is about 25% lower than that of Core 2 at the same clock speed. You have to hand it to those Intel engineers. They really know how to make fast cache.

This isn't to belittle K10's cache improvements over K8. It slaughters K8's pathetic bandwidth (more than doubles it at the same clock). It is just that Intel's cache is simply stupendous.

It is also interesting to note that K10's power efficiency bests Intel's Clovertown and even Penryn in most cases. Intel is gaining ground here, and if they ever abandon FB-DIMMs, things may change quickly. The K10 is still a very cool operator, showing that it can hang with an Intel 45nm part even at 65nm.

From the AnandTech article, the database performance of K10 is higher than Clovertown by 25% and higher than Penryn by 15% clock for clock.

Now if AMD can just raise the clock... things may not look so dismal for them after all.
 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
Originally posted by: OneEng
TechReport has a new article up here: http://techreport.com/articles.x/13224

In application server benchmarks, a 2.5GHz K10 is faster than a 3.0GHz Clovertown and only 4% slower than the 3.0GHz Penryn (if you look at the 4-warehouse score, it actually equals it).

In application server performance, K10 is around 15% faster per clock than Penryn and 25% faster than Clovertown.

It is still notable that the cache memory bandwidth of K10 is about 25% lower than that of Core 2 at the same clock speed. You have to hand it to those Intel engineers. They really know how to make fast cache.

This isn't to belittle K10's cache improvements over K8. It slaughters K8's pathetic bandwidth (more than doubles it at the same clock). It is just that Intel's cache is simply stupendous.

It is also interesting to note that K10's power efficiency bests Intel's Clovertown and even Penryn in most cases. Intel is gaining ground here, and if they ever abandon FB-DIMMs, things may change quickly. The K10 is still a very cool operator, showing that it can hang with an Intel 45nm part even at 65nm.

From the AnandTech article, the database performance of K10 is higher than Clovertown by 25% and higher than Penryn by 15% clock for clock.

Now if AMD can just raise the clock... things may not look so dismal for them after all.

I have a hard time finding these benchmarks where it's 25% faster. Could you provide the names of them?

Thanks!

 

davegraham

Senior member
Jun 25, 2004
241
0
0
wow, i'd love to join in here. of course, i'd probably be re-hashing everything I've been talking about at DT, XS, 2cpu, etc. but...we'll see.

dave
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
Hi Dave. Was waiting for ya to post at XS yesterday. Didn't see ya, tho I may have missed it.

Here is a very good review, the most comprehensive one I have seen so far.

http://www.hardware.info/en-UK...a_vs_Intel_Harpertown/


I keep hearing people say that much hangs on K10's ability to scale.

Say what you want, I find this a bit puzzling. If I used simple logic, I would think that Intel's 45nm high-k metal gate process has a 100% chance of scaling much higher than AMD's 65nm SOI process, based on the fact that SOI is old tech.

Whereas Intel's 45nm high-k metal gate tech is brand spanking new. We all know what Fugger did yesterday, so we all know Penryn can scale very high. So if Intel gets a good handle on energy efficiency, I find most of this K10 scaling argument somewhat lame.
I am not cutting down K10, it seems to be a good CPU. But logically speaking, does it have a chance of scaling up against Intel's 45nm process?

Everyone is saying that if K10 can scale up, then AMD has a chance. This includes webmasters.

I think most people try to be fair and unbiased. But when I see people say that if K10 scales they're in the game...

Well, I say if Penryn can match K10's upscaling, then we keep the same status quo. Is this not true, or am I reaching?
 

bryanW1995

Lifer
May 22, 2007
11,144
32
91
I think that they mean that they expect the K10 processors to get more of an improvement from a given rise in clock speed. Even the craziest AMD fanboy is unlikely to expect Phenom X4 to reach a higher clock speed than Penryn in the foreseeable future.
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
Ya, I thought about that, 1995. But I don't see that much difference in scaling between the 2 processors, at least so far I haven't. With the events of the past year and the K10 hype, a grain of salt is no longer required. It's at salt-block status now.
 

Munky

Diamond Member
Feb 5, 2005
9,372
0
76
Originally posted by: zsdersw
Originally posted by: OneEng
Of course Core 2 gains big time when more L2 is added.

Then why do most applications not show "big time" gains from additional cache?
That's simply not true. Take a look at this article (several pages comparing the difference cache size makes). Games in particular show a very significant gain from extra cache, as do some office applications. Notably, the 1MB-cache CPU suffers a bigger drop in performance going from 2MB to 1MB than the drop going from 4MB to 2MB.
 

OneEng

Senior member
Oct 25, 1999
585
0
0
Phynaz,
I have a hard time finding these benchmarks where it's 25% faster. Could you provide the names of them?

Thanks!

It is 25% faster clock for clock, but here is the link:

From here: http://www.techreport.com/articles.x/13224/4

If you look at the SPECjbb2005 scores, you can derive the 25% IPC advantage. The 2.5GHz K10 lands right between the 3.0GHz Clovertown and the 3.0GHz Penryn. Not bad for a 2.5GHz processor.
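
The derivation is just a per-clock normalization. Here's a sketch with placeholder scores (the real ones have to be read off the TechReport graphs, so these inputs are hypothetical; the formula is the point):

#include <stdio.h>

/* Per-clock comparison: divide each score by its clock, then compare. */
int main(void)
{
    double k10_score  = 104.0, k10_ghz  = 2.5;   /* hypothetical score */
    double clov_score = 100.0, clov_ghz = 3.0;   /* hypothetical score */

    double per_clock = (k10_score / k10_ghz) / (clov_score / clov_ghz);
    printf("K10 per-clock advantage: %.0f%%\n", (per_clock - 1.0) * 100);
    return 0;
}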

From the AnandTech article here: http://www.anandtech.com/IT/showdoc.aspx?i=3099&p=6

At the load5 point, I have the 2.0GHz K10 at 4300, the 3.0GHz Clovertown at 4800, and the 3.0GHz Penryn at 5500.

Again, you can see the relative IPC advantage that K10 has in this very real world database exercise.

Again, I must stipulate that IPC alone doesn't bring down the house for K10. It is presently at a 50% clock speed disadvantage.



 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,786
136
Again, you can see the relative IPC advantage that K10 has in this very real world database exercise.

Those benchmarks aren't too relevant to the majority of AnandTech users, because those are dual-processor-capable machines, and we all know that the present Opteron already has a vast advantage against Intel there. However, go to single-threaded/PC apps and it changes everything. Those benchmarks are really showing the advantages AMD's K10 platform has against Intel's.

AMD has their own presentation saying K10 has 15% improved IPC over Opteron. That's likely referring to performance in single-threaded and PC apps.
 

nonameo

Diamond Member
Mar 13, 2006
5,902
2
76
Originally posted by: IntelUser2000
Again, you can see the relative IPC advantage that K10 has in this very real world database exercise.

Those benchmarks aren't too relevant to the majority of AnandTech users, because those are dual-processor-capable machines, and we all know that the present Opteron already has a vast advantage against Intel there. However, go to single-threaded/PC apps and it changes everything. Those benchmarks are really showing the advantages AMD's K10 platform has against Intel's.

AMD has their own presentation saying K10 has 15% improved IPC over Opteron. That's likely referring to performance in single-threaded and PC apps.

How do you go from 40% better than Conroe to 15% better than K8?
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
CTho9305...

Beautifully written!! I can't thank you enough for all of that effort, mate! :thumbsup:

Do you mind if I reprint that at another website (it's a pay site)? There are quite a few people there who understand investing but are having a difficult time with the basics on the chips. Your post is an excellent explanation.

Many humble thanks...
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,786
136
How do you go from 40% better than Conroe to 15% better than K8?

Mmm, a little bit of a misinformed comment, but I'll tell you why: because Barcelona was never 40% better than Conroe. Barcelona was 40% better than Clovertown:
-At SPECfp_rate (very memory bandwidth sensitive; what do K8s have in abundance in multi-CPU systems compared to Clovertown??)
-Using older results for Intel
-In multi-processor systems

Put Barcelona in as a desktop chip and the significant advantages go away. Seriously, please do more research; I would research before posting if something contrary to what I knew were posted.
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Originally posted by: IntelUser2000
How do you go from 40% better than Conroe to 15% better than K8?

Mmm, a little bit of a misinformed comment, but I'll tell you why: because Barcelona was never 40% better than Conroe. Barcelona was 40% better than Clovertown:
-At SPECfp_rate (very memory bandwidth sensitive; what do K8s have in abundance in multi-CPU systems compared to Clovertown??)
-Using older results for Intel
-In multi-processor systems

Agreed...

Put Barcelona in as a desktop chip and the significant advantages go away. Seriously, please do more research; I would research before posting if something contrary to what I knew were posted.

I mostly disagree here...
1. I think the jury's still out for the desktop
2. As my 6th grade Physics teacher told us, the only stupid question is the one you don't ask. I think that uninformed questions are terrific, because they get someone an answer and they make those who do know the answer think about the question. JMHO
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: Viditor
CTho9305...

Beautifully written!! I can't thank you enough for all of that effort, mate! :thumbsup:

Do you mind if I reprint that at another website (it's a pay site)? There are quite a few people there who understand investing but are having a difficult time with the basics on the chips. Your post is an excellent explanation.

Many humble thanks...

I do mind. Feel free to link to it here, but I think it's wrong to increase the value of somebody else's site without compensation. If you don't want to link to this forum from another, I can put it on a page on my own site or my wikipedia user page and you can link to that.
edit: To clarify, I realize that posting it here may increase the quality of AT forums and make Anand more money, but I think it's ok because everybody can read it for free.
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Originally posted by: CTho9305
Originally posted by: Viditor
CTho9305...

Beautifully written!! I can't thank you enough for all of that effort, mate! :thumbsup:

Do you mind if I reprint that at another website (it's a pay site)? There are quite a few people there who understand investing but are having a difficult time with the basics on the chips. Your post is an excellent explanation.

Many humble thanks...

I do mind. Feel free to link to it here, but I think it's wrong to increase the value of somebody else's site without compensation. If you don't want to link to this forum from another, I can put it on a page on my own site or my wikipedia user page and you can link to that.
edit: To clarify, I realize that posting it here may increase the quality of AT forums and make Anand more money, but I think it's ok because everybody can read it for free.

Fair enough, and I'll comply...again, thanks for the effort! (BTW, Wikipedia would be a great idea...)
 

coldpower27

Golden Member
Jul 18, 2004
1,676
0
76
Originally posted by: OneEng
It is also interesting to note that K10's power efficiency bests Intel's Clovertown and even Penryn in most cases. Intel is gaining ground here, and if they ever abandon FB-DIMMs, things may change quickly. The K10 is still a very cool operator, showing that it can hang with an Intel 45nm part even at 65nm.

I will say that it's primarily due to the FB-DIMMs that the Stoakley platform posts such high power consumption.

http://techreport.com/articles.x/13224/9

As seen there, in terms of power consumed at high load, the E5472 system already beats the 2360 SE system, and that is with the handicap of using FB-DIMM modules, which consume considerable amounts of power.

The primary reason the AMD parts can hang in there is Intel's use of FB-DIMMs, not that the processors are particularly more efficient in terms of energy usage. The difference between the 2350 and the E5472 is only 43W, and there are 8 DIMMs there, so even with a conservative figure like 5W per DIMM we are looking at a 3W difference, which would be within the margin of error.

K10 is very competitive in terms of power usage primarily because of the platform, and not necessarily because of the processors themselves.
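
Making the DIMM arithmetic explicit (the 5W-per-DIMM figure is a conservative estimate, not a measurement):

#include <stdio.h>

/* Subtract an estimated FB-DIMM power budget from the measured
   system-level difference to see what's left for the CPUs. */
int main(void)
{
    double system_delta = 43.0;   /* watts: 2350 system vs E5472 system */
    int    dimm_count   = 8;
    double w_per_dimm   = 5.0;    /* conservative FB-DIMM estimate */

    double cpu_delta = system_delta - dimm_count * w_per_dimm;
    printf("CPU-attributable difference: ~%.0f W\n", cpu_delta);   /* ~3 W */
    return 0;
}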
 