Rumour: Bulldozer 50% Faster than Core i7 and Phenom II.

Martimus · May 23, 2011

Ben90 said:
Lol, 36% hit from HTT. What a joke. Maybe in Windows 95 running Linpack. Remember us fanboys use consumer processors with consumer applications. There is a reason why you never hear Power7 on this forum. Anandtech as a whole cares more about i7s and Phenoms instead of Xeons and Opterons.

I'm not saying that the server space isn't important. Just remember where you are at, before you start pointing fingers and calling fanboy.

No offense, but you should really read through that again. You obviously don't understand what is being said at all about hyperthreading.

Phynaz · May 23, 2011

He understands it fine as what it is, which is bunk.

Martimus · May 23, 2011

Phynaz said:
He understands it fine as what it is, which is bunk.

Prove it.

I bet you can't substantiate any claim you have ever made, because you sure as hell never do. Yet you continue to attack any claim anyone ellse ever makes, all the while making unsupported claims of your own. I can't stand such hypocritical behavior.

podspi · May 23, 2011

Hyperthreading almost always (assuming the task can scale) increases throughput by at least a little bit, but will cause a decrease in singlethread performance.

This is mostly irrelevant because in workloads that benefit from HT, singlethread performance is not important, throughput is. In singlethreaded applications, there is only one thread, and so by definition the performance hit doesn't exist

The supposed "magic" of Bulldozer is this: The ability to run 8-cores without TOO much degradation in singlethread performance (compared to a 4C/8T Intel CPU).

The issue (or potential issue) with Bulldozer is that CMT doesn't seem to get them as much die savings as I would have expected, given that most reasonable people are bounding singlethread performance between the PhII and SNB.

The entire point JFAMD was trying to make was that while AMD's method of executing 8-threads led to some decrease in singlethread performance, so does Intel's. He never said hyperthreading was bad -- just that it wasn't as good. Seeing as how this is AMD's official party line, and they are putting their money where their mouth is (they've had more than enough time to implement SMT if they wanted to) it isn't too surprising someone who works for AMD thinks this way.

We'll see. But what JFAMD said about HT is true, it just doesn't matter because it increases throughput. But that was JFAMD's point about CMT in the first place.

Voo · May 23, 2011

podspi said:
This is mostly irrelevant because in workloads that benefit from HT, singlethread performance is not important, throughput is.

An extremely bold statement considering that eg Google has even published papers that don't agree with it.

podspi · May 23, 2011

Voo said:
An extremely bold statement considering that eg Google has even published papers that don't agree with it.

I'd be the first to admit that I am not a computer science academic, but I am not familiar with the research you are talking about. The only paper that I've seen is this one:

http://www.eecs.harvard.edu/~vj/images/4/4e/Msr09search.pdf

Which is by Microsoft, but talks about how because searches must be latency constrained, so singlethread performance still matters (if the search gets returned by a deadline each time, slower singlethread performance degrades search quality results).

This is certainly true for some class of workloads, I was thinking more along the lines of transcoding and rendering -- which are purely throughput oriented.

Phynaz · May 23, 2011

Martimus said:
Prove it.

Do you really believe there is a 36% performance penalty for turning on HT?

Do you have it enabled on your i7?

Accord99 · May 23, 2011

podspi said:
I'd be the first to admit that I am not a computer science academic, but I am not familiar with the research you are talking about.

"Brawny cores still beat wimpy cores, most of the time

http://static.googleusercontent.com...esearch.google.com/en//pubs/archive/36448.pdf

Tuna-Fish · May 23, 2011

Phynaz said:
Do you really believe there is a 36% performance penalty for turning on HT?

Except that that's not in any way what he said. You managed to completely fail to understand what he said, interpreted it in the worst possible way, and instantly proceed to bash apart the strawmen you built up in your head. Project much?

His point was that compared to two threads running on two separate Intel cores, each thread gets cut 36% when you run on one core with HT. Which, frankly, is very courteous to Intel.

Accord99 · May 23, 2011

Tuna-Fish said:
Except that that's not in any way what he said. You managed to completely fail to understand what he said, interpret it in the worst possible way, and instantly proceed to bash apart the strawmen you built up in your head. Project much?

His point was that compared to two threads running on two separate Intel cores, each thread gets cut 36% when you run on one core with HT. Which, frankly, is very courteous to Intel.

The problem with JF-AMD is that he fails to acknowledge the difference between current Intel and AMD single-threaded performance. It's not simply CMT = 1.8X and SMT = 1.2X.

SMT on average gives may be only 20%, but combined with the much stronger Intel core, you end up with a single HT Sandy Bridge core conservatively having the throughput of 1.7 current AMD cores at the same clock speed.

Black96ws6 · May 23, 2011

I'm starting to think we're setting our hopes too high for Bulldozer.

If you look at AMD's leaked marketing slide: http://www.xbitlabs.com/news/cpu/di..._Range_Microprocessor_to_Cost_320_Report.html

You can see that, at least according to that slide, 4 core BD does not match i5-2500k.

Apparently it takes a FX-6110 6 core BD to match\slightly beat a 2500k, with the 8 core BD matching the 2600k.

If that slide is close to accurate, and the leaked pricing is also close to accurate, it basically means this:

2500k for $225, or 6 core BD for $240 (about the same performance)
2600k for $320, or 8 core BDp for $320 (about the same performance)

Again, speculation, but, if that leaked slide and those prices are true\close, anyone waiting on BD should just go buy a 2500k or 2600k now, at least from a gaming perspective, and be able to upgrade to Ivy later on without changing the MB...of course, it's so close now, might as well wait to see if it's better than hoped...

Black96ws6 · May 23, 2011

Also note in that marketing slide it only says "Superior Price\Performance" against Intel's Pentium architecture.

It does NOT say this vs the core i3,5,7xx chips.

The only advantages listed are "More Cores" and "Overclocked" (Turbo?)

To me that screams out it's NOT going to be faster than Sandy at the same price point, at least from an IPC perspective...

I sure hope I'm wrong however...

ElFenix · May 23, 2011

Black96ws6 said:
Also note in that marketing slide it only says "Superior Price\Performance" against Intel's Pentium architecture.

It does NOT say this vs the core i3,5,7xx chips.

The only advantages listed are "More Cores" and "Overclocked" (Turbo?)

To me that screams out it's NOT going to be faster than Sandy at the same price point, at least from an IPC perspective...

I sure hope I'm wrong however...

it's a marketing chart. they had a certain amount of room for bullet points and they wanted to get the strongest one out there for each identified segment. price/performance isn't as sexy as overclock/cores.

GlacierFreeze · May 23, 2011

Your two posts have already been talked about a handful of times.

Voo · May 23, 2011

podspi said:
This is certainly true for some class of workloads, I was thinking more along the lines of transcoding and rendering -- which are purely throughput oriented.

Well your MS paper is just as fine and comes to similar conclusions as those I had in mind, so you'll excuse me if I don't start crawling through archives - and Accord99 posted a nice and quick summary of some of the most important points anyways (although as has to be the case with a 1 1/2 page report some quite important and everything but obvious scenarios are mentioned in only one sentence).

But to keep it simple we can really just postulate that there's a large amount of RL problems where latency does indeed matter - and pretty much any server-client communication falls into that category. Even if the problem itself may be highly parallel the communication itself is limited to one thread and there's always some part that can't be parallelized efficiently.

Now if you're only interested in one particular kind of computation like rendering or pretty much any HPC oriented computation then the game changes, but it's far too simple to just say that "throughput is the only interesting metric for server". But sure, there's a large market for those applications - see Niagara or GPU rendering farms.

PS: The few bits known about googles data centers is quite fascinating

PreferLinux · May 23, 2011

Tuna-Fish said:
His point was that compared to two threads running on two separate Intel cores, each thread gets cut 36% when you run on one core with HT. Which, frankly, is very courteous to Intel.

Correct, but with one major assumption: each thread would fully utilize the core on its own. If that is the case, then there will be no benefit to throughput anyway. But that is almost never the case except with benchmarks/stress testers specifically designed to do that. So you don't actually get a 36% drop in single-threaded performance in that situation.

Phynaz · May 23, 2011

Tuna-Fish said:
Except that that's not in any way what he said. You managed to completely fail to understand what he said, interpreted it in the worst possible way, and instantly proceed to bash apart the strawmen you built up in your head. Project much?

His point was that compared to two threads running on two separate Intel cores, each thread gets cut 36% when you run on one core with HT. Which, frankly, is very courteous to Intel.

Okay, fine. Maybe I did misread it.

How much of a hit do you think a thread running on a BD core will take when a second thread is added to the same core?

Did you see what the marketing guy did? He's comparing two threads running on two BD cores against two threads running on one Intel core.

LucJoe · May 23, 2011

Phynaz said:
Okay, fine. Maybe I did misread it.

How much of a hit do you think a thread running on a BD core will take when a second thread is added to the same core?

Did you see what the marketing guy did? He's comparing two threads running on two BD cores against two threads running on one Intel core.

Think about this: the 8130P is positioned to compete against the 2600k. 2600k is marketed as "4C/8T" while 8130P is marketed as "4M/8T".

I'm assuming you want to know the performance hit when running two threads on a single...thread? That's silly and obvious... you know the answer.

The point he was making is Bulldozer's modular design which allows for higher "core" count (two cores sharing some circuitry) is a more efficient way to run highly threaded applications than adding hyperthreading.

Tuna-Fish · May 23, 2011

Phynaz said:
Okay, fine. Maybe I did misread it.

How much of a hit do you think a thread running on a BD core will take when a second thread is added to the same core?

Well, per core I'd expect that to be 50% + whatever the OS loses for switching.

Per module, weren't they advertising 10%?

Did you see what the marketing guy did? He's comparing two threads running on two BD cores against two threads running on one Intel core.

Oh yes. And I also hate cores as a way of counting "processors" -- IMHO, for throughput-oriented stuff, counting threads is the sanest approach, because that's what the software sees and has to live with. I honestly have no clue why Intel didn't start doing that, wouldn't it even have made them look better?

And wouldn't you agree that, given the leaked price points and die sizes, a BD core should be compared to an Intel thread, and a BD module to an Intel core?

Tuna-Fish · May 23, 2011

PreferLinux said:
Correct, but with one major assumption: each thread would fully utilize the core on its own. If that is the case, then there will be no benefit to throughput anyway. But that is almost never the case except with benchmarks/stress testers specifically designed to do that. So you don't actually get a 36% drop in single-threaded performance in that situation.

Isn't that already accounted for in the 36%? If a thread fully utilizes the core, HT gets you nothing. You get the good boost from HT when your software is written by monkeys and plays hide and seek with pointers.

Unless you meant when the other thread is completely idle? Because, in that case I don't care about performance -- it's when your queues are full and everything is firing on full cylinders when it counts. Every other time, I just want it to suck as little power as possible.

Cogman · May 23, 2011

Tuna-Fish said:
Isn't that already accounted for in the 36%? If a thread fully utilizes the core, HT gets you nothing. You get the good boost from HT when your software is written by monkeys and plays hide and seek with pointers.

Unless you meant when the other thread is completely idle? Because, in that case I don't care about performance -- it's when your queues are full and everything is firing on full cylinders when it counts. Every other time, I just want it to suck as little power as possible.

It is pretty dang hard to write software that fully utilizes a core of a cpu. You have to keep the FPU, ALU, branch predictors, etc, busy without ever touching L3 or lower cache.

It has nothing to do with being a monkey and everything to do with the fact that CPUs have a ton of parts to keep busy.

Voo · May 23, 2011

Tuna-Fish said:
You get the good boost from HT when your software is written by monkeys and plays hide and seek with pointers.

Yeah, because as we all know, the vast majority of software is written in assembly (obviously creating different codepaths to support all kinds of additions like SSE, SSE2, SSE3, AVX,..) and therefore can be easily influenced to get the most out of every CPU - that is, despite the fact that different architectures are different enough to warrant completely different code - but then who wouldn't write different code for different architectures anyways? And really, what monkey isn't able to know while compiling code how for every possible input the cache will behave and how to minimize cache hits thereby.

Now back here in reality those things are all done by the compiler, partially the CPU itself (both things hardly to be influenced by the average programmer) and depend obviously on the executed workload.

Phynaz · May 23, 2011

Tuna-Fish said:
And wouldn't you agree that, given the leaked price points and die sizes, a BD core should be compared to an Intel thread, and a BD module to an Intel core?

No, I don't. Neither does AMD. JF has said they are not going to market modules.

Now application performance at a specific price point I can get behind. But I really don't care if it takes 10 cores or 10 threads or 100.

Tuna-Fish · May 23, 2011

Cogman said:
It is pretty dang hard to write software that fully utilizes a core of a cpu. You have to keep the FPU, ALU, branch predictors, etc, busy without ever touching L3 or lower cache.

It has nothing to do with being a monkey and everything to do with the fact that CPUs have a ton of parts to keep busy.

My talk about code-monkeys is self-depreciating -- I'd call myself a code monkey. Also, just because my previous post seemed overly harsh, I absolutely love HT and other features like it that allow us programmers to be lazy. Features that let programmers do less work are features that saves us a lot of money.

Also, it's not necessary to keep all the parts of the core utilized to block the other thread -- you only need to fill up one part that the other thread also needs. On SNB, that would probably be the cache write unit. Also, as the FPU shares issue ports with the ALU's on Intel processors, using one of them also blocks the other.

Tuna-Fish · May 23, 2011

Voo said:
Yeah, because as we all know, the vast majority of software is written in assembly (obviously creating different codepaths to support all kinds of additions like SSE, SSE2, SSE3, AVX,..) and therefore can be easily influenced to get the most out of every CPU - that is, despite the fact that different architectures are different enough to warrant completely different code - but then who wouldn't write different code for different architectures anyways? And really, what monkey isn't able to know while compiling code how for every possible input the cache will behave and how to minimize cache hits thereby.

Now back here in reality those things are all done by the compiler, partially the CPU itself (both things hardly to be influenced by the average programmer) and depend obviously on the executed workload.

Umm, I completely agree with you. On all counts. I might need to take a communications lessons or something.

Rumour: Bulldozer 50% Faster than Core i7 and Phenom II.

Diamond Member

Lifer

Diamond Member

Golden Member

Golden Member

Golden Member

Lifer

Platinum Member

Golden Member

Platinum Member

Member

Member

Elite Member

Golden Member

Golden Member

Senior member

Lifer

Golden Member

Golden Member

Golden Member

Lifer

Golden Member

Lifer

Golden Member

Golden Member