What do you expect from Barcelona in reality?


Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
Isn't that a 2x 64-bit instruction pipeline vs. Intel's 1x 128-bit instruction pipeline?

Is K10 a 3-issue CPU or is it 4-issue like C2D?

I would say that K10 will be 10% faster than C2D clock for clock at the most, which will put it even with Penryn.

Then take into account the extraordinary speed at which DDR3 is developing. When the X38 chipset is released shortly, I believe the benchies using C2D or Penryn (which is what I believe it will be reviewed with) are going to be eye-opening. Low-latency DDR3 running on an FSB of 500 x 4 = 2000 should really be eye-opening.
 

apoppin

Lifer
Mar 9, 2000
34,890
1
0
alienbabeltech.com
Originally posted by: bryanW1995
yeah, don't get me wrong, I HOPE that barcelona/phenom is stronger than garlic because I feel that competition is great for all of us and I think that it's 50/50 or better that there will be no amd in 12 mos without some seriously strong product releases. I just think that it makes absolutely no sense for them to withhold benchmarks right now if they could possibly get people like us to ignore intel's price cuts and hold out for amd's new offerings. Every day that they don't release info is one more day that droves of people ditch their am2 and s939 mobos and march over to the intel camp.

the holiday season is critical ... far more so than now

and AMD's purported reason for "no benchmarks" is so as to not tip intel off as what to counter with.

Personally, i think AMD is also hoping for a miracle with Barcelona speeds.
Does God care about silicon purity and overclockability?
 

Keysplayr

Elite Member
Jan 16, 2003
21,209
50
91
Originally posted by: nyker96
I believe that the Barcelona core is about the same as C2D but will be more energy efficient than C2Quad. However, there may be a few spots where it shines, like workstation workloads, but not on the desktop. This is because Barcelona is designed to replace Opty and is aimed at enterprise first and desktop second. This is a prediction only, nothing concrete of course.

Yes, but what is an Opty but a rebranded A64/X2? Same as a C2D and a Xeon. Is there really any difference besides price and pin count (sometimes)? I don't really know.
 

dmens

Platinum Member
Mar 18, 2005
2,274
959
136
Originally posted by: RussianSensation
So if AMD were to double A64's efficiency which is easily realistic

Double? Yeah right. It ain't 1990.

That kind of improvement may exist for some specialized FP/vector codes where the only real limiting factor is machine bandwidth. Like that SpecFP demo the other day, pretty slick. For general code? No chance.
 

RussianSensation

Elite Member
Sep 5, 2003
19,458
765
126
Originally posted by: TuxDave
So if AMD were to double A64's efficiency which is easily realistic given A64's dual stage 64-bit instruction pipeline, lack of L3 cache, and less efficient memory controller, it'd be roughly 50% more efficient than C2D per clock cycle.

I'll have to quote you on that.... but due to lack of information I can't make a guess on my own.

Sure

"However, Barcelona is far more than a quad-core K8 with an L3 cache. We estimate the number of non-cache transistors in a dual-core Athlon 64 X2 to be approximately 94M, and the Barcelona core is around 247M; even doubling the dual-core K8 figure won't get you close to Barcelona. Note that simply doubling the 94M number also isn't an accurate comparison as Barcelona only features a single on-die Northbridge. In essence, there are more than 60M additional transistors (or more than 15M per core) that went into architectural enhancements outside of more cores and cache in Barcelona."
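
The arithmetic behind the quoted figures can be checked with a quick sketch. The 94M and 247M estimates come from the quoted article; the variable names are made up for illustration, and the naive doubling deliberately ignores the article's caveat about the single Northbridge.

```python
# Estimates quoted from the article, in millions of non-cache transistors.
K8_DUAL_CORE_M = 94
BARCELONA_M = 247
CORES = 4

# Naive quad-core K8: simply double the dual-core figure. This counts the
# Northbridge twice, which is why the article calls the comparison inexact.
naive_quad_k8_m = 2 * K8_DUAL_CORE_M

extra_m = BARCELONA_M - naive_quad_k8_m
extra_per_core_m = extra_m / CORES

print(f"naive quad K8: {naive_quad_k8_m}M")
print(f"extra transistors: {extra_m}M total, {extra_per_core_m:.2f}M per core")
```

The naive subtraction gives 59M (14.75M per core). Since the doubled figure counts the Northbridge twice, the real surplus is somewhat larger, which is consistent with the article's "more than 60M".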

Originally posted by: dmens
Originally posted by: RussianSensation
So if AMD were to double A64's efficiency which is easily realistic

Double? Yeah right. It ain't 1990.

That kind of improvement may exist for some specialized FP/vector codes where the only real limiting factor is machine bandwidth. Like that SpecFP demo the other day, pretty slick. For general code? No chance.

Core 2 Duo is 90-100% more efficient per clock cycle than the P4 NetBurst architecture. Here is why doubling for AMD is realistic:

AMD Architecture Comparison

K8 vs. Barcelona
1. SSE Execution Width: 64-bit vs. 128-bit (double)

2. Instruction Fetch Bandwidth: 16 bytes/cycle vs. 32 bytes/cycle (double)

3. Data Cache Bandwidth: 2 x 64-bit loads/cycle vs. 2 x 128-bit loads/cycle (double)

4. L2/Northbridge Bandwidth: 64 bits/cycle vs. 128 bits/cycle (double)

5. FP Scheduler Depth: 36 dedicated x 64-bit ops vs. 36 dedicated x 128-bit ops (double)

6. Barcelona adds a 512-entry indirect predictor. In the 253.perlbmk test of SPEC CPU2000, the reduction in mispredicted branches with Prescott was significant, reaching almost 55%.

7. The inclusion of an indirect predictor wasn't the only crystal ball improvement in Barcelona; the size of the return stack in the new core is double what it was in K8.

8. One major aspect of Intel's Core micro-architecture advantage is its ability to allow load instructions to bypass previous load and store instructions. On average, about 1/3 of all instructions in a program end up being loads, thus if you can improve load performance you can generally impact overall application performance pretty significantly. AMD's K8 architecture had no equivalent scheme for allowing the out of order execution of loads ahead of other loads and stores!!! Barcelona can now re-order loads ahead of other loads, just like Core 2 can. It can also execute loads ahead of other stores. Barcelona can generate up to three store addresses per clock as it has three AGUs (Address Generation Units) compared to Intel's one for stores.

9. The K8 core featured a single memory controller that was 128-bits wide, but in Barcelona AMD has split up the DRAM controller into two separate 64-bit controllers. Each controller can be operated independently and thus you get some improvements in efficiency, especially when dealing with quad core implementations where the individual cores working on independent threads all have their own memory access patterns. Now, instead of executing writes as soon as they show up, writes are stored in a buffer and once the buffer reaches a preset threshold the controller bursts the writes sequentially. What this avoids is the costly read/write switch penalty, helping improve bandwidth efficiency and reduce latency. [ Each Barcelona core gets its own set of data and instruction prefetchers, but the major improvement is that there's a new prefetcher in town - a DRAM prefetcher. The DRAM prefetcher doesn't pull data into the CPU's L2 or L3 caches either; instead it features its own buffer to avoid polluting the caches ]
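
The write-buffering idea in item 9 can be sketched as a toy model. This is only an illustration under simplifying assumptions: the function name, the request stream, and the threshold are all hypothetical, it is not AMD's actual controller logic, and it ignores the case where a read targets an address that is still sitting in the write buffer.

```python
def count_turnarounds(requests: str, write_threshold: int = 0) -> int:
    """Count read/write bus direction switches for a request stream.

    requests is a string of 'R' and 'W'. With write_threshold == 0, requests
    are issued in arrival order. Otherwise writes wait in a buffer and are
    issued back to back once the buffer holds write_threshold entries.
    """
    issued = []
    buffered = 0
    for req in requests:
        if req == "W" and write_threshold > 0:
            buffered += 1
            if buffered == write_threshold:
                issued.extend("W" * buffered)  # burst the buffered writes
                buffered = 0
        else:
            issued.append(req)
    issued.extend("W" * buffered)  # drain leftover writes at the end
    # A turnaround happens wherever two adjacent requests differ in direction.
    return sum(1 for a, b in zip(issued, issued[1:]) if a != b)


stream = "RWRWRWRWRRWRWRWR"  # made-up interleaving of reads and writes
print(count_turnarounds(stream))                     # in-order issue
print(count_turnarounds(stream, write_threshold=4))  # buffered writes
```

On this made-up stream the in-order policy switches direction far more often than the buffered one, which is the read/write switch penalty the article says the buffer avoids.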

I am not even touching on major changes in power management, such as Barcelona's Northbridge now running on a separate power plane and the ability of each core to run at a different clock speed depending on load [AMD's first quad core part to operate within the same thermal envelope as current Opterons.]

Source: AMD on the Counterattack
 

RussianSensation

Elite Member
Sep 5, 2003
19,458
765
126
Originally posted by: zephyrprime
Originally posted by: RussianSensation
Dual stage 128-bit instruction pipeline vs. single stage for C2D,
What is that?

I got my terms mixed up here but a better clarification is this:

From AnandTech's analysis of Barcelona:

"In the K8 (A64) architecture AMD can execute two SSE operations in parallel; however the SSE execution units are only 64-bits wide. For 128-bit SSE operations, the K8 had to handle them as two 64-bit operations. This also means that when a 128-bit SSE instruction is fetched, it is first decoded into two micro-ops (one for each 64-bit half of the instruction), thus taking up an extra decode port for a single instruction. Barcelona widens the execution units that handle SSE operations from 64-bits to 128-bits, so now 128-bit SSE operations don't have to be broken up into two 64-bit operations" (Note: same as Core 2 Duo).

"Now that you can fetch and decode more instructions, you need to be able to get more data to the execution core and thus AMD widened the interface between the L1 data cache and Barcelona's SSE registers. Barcelona can now perform two 128-bit SSE loads per cycle from the L1-D cache compared to two 64-bit loads per cycle in K8 (A64)" (Note: Core 2 Duo can only perform 1 128-bit SSE load per cycle).
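
As a toy model of the decode cost described in the quotes above, assume every SSE instruction is split into as many micro-ops as it takes to fit the execution unit width. The function name and the instruction count are hypothetical, and the model ignores everything else in the pipeline.

```python
def sse_micro_ops(num_instructions: int, op_width_bits: int, unit_width_bits: int) -> int:
    """Micro-ops needed when each op is split to fit the execution unit width."""
    pieces_per_instruction = -(-op_width_bits // unit_width_bits)  # ceiling division
    return num_instructions * pieces_per_instruction


# 1000 hypothetical 128-bit SSE instructions on 64-bit vs. 128-bit units.
k8 = sse_micro_ops(1000, op_width_bits=128, unit_width_bits=64)
barcelona = sse_micro_ops(1000, op_width_bits=128, unit_width_bits=128)
print(k8, barcelona)  # 2000 1000
```

This only says the micro-op count for 128-bit SSE code is halved; it says nothing about how much faster a whole program runs.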
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
Originally posted by: RussianSensation

*snip*

Wow, that's a lot of numbers... but here goes my opinion.

I doubt the method of estimating the number of transistors on each chip and I really don't see how there's any strong correlation between transistor count and chip performance. There's a good correlation of leakage power and transistor count but without knowing how those transistors are being used, I'm not sure if you can derive any performance numbers from it.

As for the C2D v P4 : Phenom v K8. You have a point there but the big problem was that P4 was a frequency beast and K8 wasn't designed with the same idea that frequency is the end all of everything.

1,2,3,4,5,7)
As for the rest, it's too late to wrap my head around anything besides the fact that there are some new specs that are double the old specs. Unfortunately I could've done the same by saying "The old P4 was 32-bit and the new P4 was 64-bit (and of course we do all the something x 32-bit ops vs something x 64-bit ops to get a couple more numbers)" and then concluding that we should've gotten twice the performance.

6)
Not sure why you're comparing against Prescott when you should be looking for K8 numbers.

8)
You lost me here. First you talk about the importance of loads and then you talk about the number of AGUs for stores.

9)
It's a good idea that will increase performance.... not sure exactly why you're highlighting some lines for emphasis. A 128-bit controller was divided into two 64-bit controllers. The fact that there's a 'two' in there supports your doubling theory?
 

dmens

Platinum Member
Mar 18, 2005
2,274
959
136
Originally posted by: RussianSensation
Core 2 Duo is 90-100% more efficient per clock cycle than the P4 NetBurst architecture. Here is why doubling for AMD is realistic...

1. I already mentioned SSE, it is a special case and only limited by machine data width.
2-5. Were those items even bottlenecks to begin with? If not, doubling them wouldn't do a thing for the throughput. If they were the bottleneck in some cases, then doubling the width will alleviate the problem to the point where some other structure is the new bottleneck. 2x width never equals 2x throughput.
6. Even if the P4 or C2D had oracle predictors across the board, the performance will not get anywhere close to 2X improvement.
7. Um... OK? That will help a bit I guess.
8-9. That's nice and all, but how do those features double throughput? All signs point to an even higher average latency from memory, relative to clock speed, on K10 vs K8.
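
The "2x width never equals 2x throughput" point can be illustrated with Amdahl's law. This is a generic sketch; the fractions below are invented for illustration and are not measurements of K8 or K10.

```python
def overall_speedup(fraction_improved: float, local_speedup: float) -> float:
    """Amdahl's law: speedup of the whole when only a fraction gets faster."""
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / local_speedup)


# Hypothetical shares of run time that benefit from a doubled structure.
for fraction in (0.1, 0.3, 0.5, 1.0):
    speedup = overall_speedup(fraction, 2.0)
    print(f"{fraction:.0%} of time sped up 2x -> {speedup:.2f}x overall")
```

Only when the doubled structure accounts for all of the run time does the overall speedup reach 2x.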

K10 is still 3-wide issue. With the memory subsystem held equivalent to K8, the only way throughput can be doubled for general code is if the machine somehow managed to get microcode to do twice as much work as before, or something equivalently magical. In regards to the memory system, even if the latency is significantly decreased (and all signs point to the contrary), it will not yield twice the throughput, unless the machine had no depth or something crazy like that.

AMD wouldn't release any information worth a damn to the public about how they tuned the machine before the product is released. Intel still has not released much information about how c2d is tuned. All the items you mentioned from the AT article about K10 don't really give any useful information to deduce performance. I'm going to wait for real benchmarks from AMD. The meaningless crap from their PR jockeys right now is beyond useless.
 

Game Boy

Member
Jul 18, 2007
32
0
0
My specific prediction for the best desktop x86 processors(December 15, 2007):

1st. 3.33GHz Core 2 Extreme X6950 (120W TDP, 4MB L2, 333MHz FSB) $999
2nd. 2.6GHz Phenom FX-xx (120W TDP, 4x512KB L2, 2MB L3) $599

My specific prediction for the best desktop x86 processors(March 15, 2008):

1st. 2.8GHz Phenom FX-yy (120W TDP, 4x512KB L2, 2MB L3) $599
2nd. 3.33GHz Core 2 Extreme X6950 (120W TDP, 4MB L2, 333MHz FSB) $530

 

Schmeh

Member
Jun 25, 2004
29
0
0
Originally posted by: Game Boy
My specific prediction for the best desktop x86 processors(December 15, 2007):

1st. 3.33GHz Core 2 Extreme X6950 (120W TDP, 4MB L2, 333MHz FSB) $999
2nd. 2.6GHz Phenom FX-xx (120W TDP, 4x512KB L2, 2MB L3) $599

My specific prediction for the best desktop x86 processors(March 15, 2008):

1st. 2.8GHz Phenom FX-yy (120W TDP, 4x512KB L2, 2MB L3) $599
2nd. 3.33GHz Core 2 Extreme X6950 (120W TDP, 4MB L2, 333MHz FSB) $530

So not only are you predicting that 200Mhz (the difference between the two Phenoms) will make a big difference, but you are also predicting that Intel will release no new processors between December 2007 and March 2008.

I am going to doubt it on both counts.
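
For scale, here is the size of that clock bump, as a quick sketch. It assumes performance scales at best linearly with clock speed, which is an upper bound rather than a measurement.

```python
old_ghz, new_ghz = 2.6, 2.8  # the two predicted Phenom FX clocks

clock_gain = new_ghz / old_ghz - 1.0
print(f"clock increase: {clock_gain:.1%}")  # about 7.7%
```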
 

Game Boy

Member
Jul 18, 2007
32
0
0
Yes, 200MHz will make a large difference because of the performance-per-clock (maybe not as above, but going from 2.6 to 2.8 will put it very close to the best Intel comes out with). Also, I think Intel won't release any new ones because the prediction was 20% better clock speeds over Kentsfield and they'll be saving the 3.66GHz for later.

Um, and there's a reason it's a personal prediction and not a statement of fact: so people won't take it seriously.

My prediction is not insane or unreasonable, just not immediately obvious from the available data. I believe upcoming benchmarks will approximate the figures, and if they don't it doesn't matter.

Once again, feel free to post your own equally unsupported predictions and then we'll see who was lucky in eight months' time.
 

RussianSensation

Elite Member
Sep 5, 2003
19,458
765
126
Originally posted by: TuxDave


6)
Not sure why you're comparing against Prescott when you should be looking for K8 numbers.

K8 doesn't have an indirect branch predictor, so it is impossible to make the same comparison. Having said that, Core Duo already had some form of indirect prediction, and the performance difference from CD to C2D with the improved predictor was significant. Based on this, the introduction of an indirect branch predictor in Barcelona should have an even more significant effect on performance than it did in C2D (however, you could argue that the shorter pipeline of Athlon processors doesn't suffer as much to begin with, as it can refill quickly compared with Prescott's long pipeline).

8)
You lost me here. First you talk about the importance of loads and then you talk about the number of AGUs for stores.

Once you load instructions, there could be a time span before they can be actually used. If Barcelona can store more 'ready' instructions, it can access them much quicker. Hence the comparison - if you can load more AND store more, you reduce latency in the calculation process.

9)
It's a good idea that will increase performance.... not sure exactly why you're highlighting some lines for emphasis. A 128-bit controller was divided into two 64-bit controllers. The fact that there's a 'two' in there supports your doubling theory?

No, this doesn't support the doubling theory. But all of the points above + further help to explain most of the performance enhancements. Although the separation of the memory controller won't necessarily double its performance, all these factors together add up to better efficiency. Remember the internal memory controller is SEPARATE from the Barcelona efficiency changes above. Think of it as an improved northbridge memory controller on an Intel motherboard. So now not only do you have a far more efficient processor in theoretical terms, but also improved memory access/latency.
 

amenx

Diamond Member
Dec 17, 2004
4,090
2,361
136
I hope Barcelona is a kick-ass CPU. A big price war should then ensue and bring down $1000 CPUs like the upcoming 3.33ghz Penryn and its equivalent AMDs to more affordable levels.
 

RussianSensation

Elite Member
Sep 5, 2003
19,458
765
126
Originally posted by: dmens

1. I already mentioned SSE, it is a special case and only limited by machine data width.

It might be a special case or not, but it had a significant improvement in the floating data calculations of c2d processor relative to P4. Having a more robust SSE width is important for longer instruction sets, especially with the advancements of 64-bit OS.

2x width never equals 2x throughput.

So you are saying that if you have 128-bit instruction set which can be completed in 1 cycle on a 128-bit instruction capable pipeline, it won't be twice as fast relative to a 64-bit pipeline which has to separate the instruction set into 2 parts and do 2 cycles?

6. Even if the P4 or C2D had oracle predictors across the board, the performance will not get anywhere close to 2X improvement.

All points above together will account for the increase. Hence the reasoning for listing further items. I am not saying a branch predictor will double the performance as a stand-alone item.


All the items you mentioned from the AT article about K10 don't really give any useful information to deduce performance. I'm going to wait for real benchmarks from AMD. The meaningless crap from their PR jockeys right now is beyond useless.

Obviously, as I said before it's not 100% guaranteed Barcelona will double the performance, but it's possible as C2D did relative to P4. If you guys think AMD engineers can't accomplish the same feat, consider that in the last 6 years or so AMD was more efficient per clock cycle. Historically, their processor designs focus on efficiency. A64 introduced so many revolutionary changes and easily produced 50% better efficiency per clock cycle than the mighty Intel. Athlon XP 2.2ghz 3200+ easily competed with P4 2.8ghz. When it comes to making efficient processors, AMD is far far ahead of Intel and has been for a long time. To say that Barcelona won't be more efficient than C2D is certainly possible, but most likely not probable given the historical efficiencies of XP and A64 relative to Intel.

1) At the end of the day, when Radeon 9700Pro was introduced, did ATI shout at the top of their lungs we have a killer gpu? You can't just assume that because AMD didn't release benches, their processor is crap.

2) If things were so bad on the cpu front, would AMD go into massive debt to purchase ATI for $5.4 billion? They do need to repay the cost of the acquisition with future cash flows generated from their products.

 

zsdersw

Lifer
Oct 29, 2003
10,505
2
0
Originally posted by: RussianSensation
but it's possible as C2D did relative to P4.

That's not a comparison you can sensibly use to demonstrate your point. The P4 and C2D are totally different designs with totally different design philosophies. The K8 and Barcelona are not as different from each other as the P4 was from C2D.

A64 introduced so many revolutionary changes and easily produced 50% better efficiency per clock cycle than the mighty Intel. Athlon XP 2.2ghz 3200+ easily competed with P4 2.8ghz. When it comes to making efficient processors, AMD is far far ahead of Intel and has been for a long time. To say that Barcelona won't be more efficient than C2D is certainly possible, but most likely not probable given the historical efficiencies of XP and A64 relative to Intel.

The fact that the A64 was more efficient than the P4 is to be expected, not a demonstration of any particular brilliance on the part of AMD. The P4 was a low-IPC/high-clock design. The A64 was not.

And actually, no, AMD hasn't been "far ahead of Intel" and "for a long time" in the making-efficient-processors department. Remember the Pentium M and Core Solo/Duo? They were highly efficient processors and, obviously, the basis for C2D. Intel had two different approaches in processor design out on the market at the same time. So when you say "historical efficiencies of XP and A64 relative to Intel" you're not telling the whole story. The A64 was more efficient compared to the P4, not to "Intel".

If things were so bad on the cpu front, would AMD go into massive debt to purchase ATI for $5.4 billion? They do need to repay the cost of the acquisition with future cash flows generated from their products.

A less-than-perfect processor design isn't going to ruin AMD's future as much as a longer-term lack of vision, so the ATI purchase (or something similar to it) would've happened anyway.

 

RussianSensation

Elite Member
Sep 5, 2003
19,458
765
126
Originally posted by: zsdersw

That's not a comparison you can sensibly use to demonstrate your point. The P4 and C2D are totally different designs with totally different design philosophies. The K8 and Barcelona are not as different from each other as the P4 was from C2D.

Yes, you are right. At the same time after taking 4-5 years to develop Barcelona, it would be a major disappointment to me if it wasn't 2x as efficient because in technology terms, 5 years is an eternity. If you can't double the efficiency per clock cycle in 5 years based on the advancements in technology and the time frame, your firm's engineers need retraining at MIT.

The fact that the A64 was more efficient than the P4 is to be expected, not a demonstration of any particular brilliance on the part of AMD. The P4 was a low-IPC/high-clock design. The A64 was not.

The fact that Intel was gunning for clock speed vs. efficiency while AMD's forte is focusing on efficiency first and not clock speed is what makes them brilliant. If it wasn't for the MHz / heat / power barrier encountered by Intel, they'd have a 6.6GHz P4 right now. You have one company's management and engineers who understand they cannot compete on manufacturing technology. Their last 2 desktop processor designs focused on maximizing efficiency per clock cycle to offset their clock speed disadvantage against their competitor. There is no question that AMD's engineers' core competency is their ability to maximize efficiency per clock cycle.

A less-than-perfect processor design isn't going to ruin AMD's future as much as a longer-term lack of vision, so the ATI purchase (or something similar to it) would've happened anyway.

I somewhat agree on your point here. Even if AMD doesn't have the top dog processor, as long as they do well in the <$300 price markets, they'll do well. The problem now though is that with E6850 at 3.0ghz and Q6600 below the $300 price level, it leaves very little room for AMD to compete on margins. Therefore, a less than stellar processor might not be enough because it won't offer enough incentive for consumers to consider AMD over the far more recognized and respected Intel brand in the average joe's mind. And we know AMD cannot engage in a price war with Intel in the long-term. If AMD doesn't deliver, it doesn't look good for them as IIRC the videocard division only generates 15%-20% of the net income of AMD as a whole.
 

dmens

Platinum Member
Mar 18, 2005
2,274
959
136
Originally posted by: RussianSensation
It might be a special case or not, but it had a significant improvement in the floating data calculations of c2d processor relative to P4. Having a more robust SSE width is important for longer instruction sets, especially with the advancements of 64-bit OS.

OK? It's good for SSE, but it doesn't help anything else. How is it important for "longer instruction sets" or "advancements in 64-bit OS"? I have no idea what you're talking about.

So you are saying that if you have 128-bit instruction set which can be completed in 1 cycle on a 128-bit instruction capable pipeline, it won't be twice as fast relative to a 64-bit pipeline which has to separate the instruction set into 2 parts and do 2 cycles?

If the backend resources aren't equivalently lengthened, then data flow will just stall at the next chokepoint. Of course it won't be twice as fast.
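
The chokepoint argument can be sketched as a toy pipeline whose sustained throughput is the rate of its slowest stage. The stage names and rates are hypothetical and are not real K8 or K10 figures.

```python
def sustained_throughput(stage_rates: dict[str, float]) -> float:
    """A pipeline can only sustain the rate of its slowest stage."""
    return min(stage_rates.values())


# Hypothetical ops/cycle per stage.
before = {"fetch": 3.0, "sse_execute": 1.0, "retire": 1.5}
after = dict(before, sse_execute=2.0)  # double only the SSE execution width

print(sustained_throughput(before))  # 1.0
print(sustained_throughput(after))   # 1.5, now limited by retire
```

Doubling one stage moves the limit to the next-slowest stage, so in this made-up case the gain is 1.5x rather than 2x.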

All points above together will account for the increase. Hence the reasoning for listing further items. I am not saying a branch predictor will double the performance as a stand-alone item.

And I already said all the stuff you mentioned combined won't give double the throughput. The real reasons for speedup have not been revealed.

Obviously, as I said before it's not 100% guaranteed Barcelona will double the performance, but it's possible as C2D did relative to P4. If you guys think AMD engineers can't accomplish the same feat, consider that in the last 6 years or so AMD was more efficient per clock cycle. Historically, their processor designs focus on efficiency. A64 introduced so many revolutionary changes and easily produced 50% better efficiency per clock cycle than the mighty Intel. Athlon XP 2.2ghz 3200+ easily competed with P4 2.8ghz. When it comes to making efficient processors, AMD is far far ahead of Intel and has been for a long time. To say that Barcelona won't be more efficient than C2D is certainly possible, but most likely not probable given the historical efficiencies of XP and A64 relative to Intel.

Here's a free hint: high clock speeds aren't necessarily bad, and throughput is still a function of clock speed. Don't let the PR machines brainwash you.

I really don't feel like entering a retarded intel vs amd argument, talking about "historical focuses". I will tell you that since c2d is based on p-m, that is a uarch that has been developing high ROI on throughput vs power since p6. So much for being "far far ahead of intel".
 