Thoughts, Rumors, or Specs of AMD fx series steamroller cpu

kernelc · Jun 11, 2012

Soulkeeper said:
does anyone else think that the small L1 caches on bulldozer could be killing performance ?

per core (half module) there is 1/2 the instruction and 1/4 the data compared to stars (correct me if i'm wrong).
Granted the instruction cache is shared, but even then it must feed 2 threads.

Was there any justification for decimating the L1 cache sizes other than saving die space ?
Have they sufficiently compensated (if possible) ?

Looking at these benchmarks, which compare Bulldozer with a 200 MHz clock advantage against the X6: not only are the L1 cache sizes smaller, but the overall latency and bandwidth of both L1 and L2 are worse.

Based on Anandtech data (http://www.anandtech.com/show/5057/the-bulldozer-aftermath-delving-even-deeper/10) it seems that L1, while small, has decent hit rates.

Doubling it would surely be useful, but I think the real culprit is the write-through approach, coupled with the very small WCC (only 4 KB).

Regards.

ShintaiDK · Jun 11, 2012

Homeles said:
No. Have you not read this article? http://www.anandtech.com/show/5057/the-bulldozer-aftermath-delving-even-deeper/1

Bulldozer's definitely a great concept, but it's got some glaring flaws. Piledriver will make some significant strides towards hammering those out. AMD can only do so much on the 32nm node though. It appears Bulldozer wasn't really ready for 32nm, as there isn't much room for improvement (literally).

So was the P4. Failure for both? Can't clock high enough.

Bulldozer uarch is unfixable.

SocketF · Jun 11, 2012

Cpus said:
What do you think the clock speed, turbo, and max turbo will be of the flagship steamroller cpu? I'm almost positive the flagship piledriver will be 4.2ghz with 4.7ghz turbo core

The clocks for the APUs (Trinity) are already official:
http://www.amd.com/us/products/desktop/apu/mainstream/pages/mainstream.aspx#7
3.8 GHz / 4.2 Turbo on a 100W TDP.
I hope that they will get enough yields at 4.0/4.5 GHz for Vishera within a 125W TDP envelope. But this is a Steamroller thread, so this discussion is off topic here.

Soulkeeper said:
does anyone else think that the small L1 caches on bulldozer could be killing performance ?
Was there any justification for decimating the L1 cache sizes other than saving die space ? Have they sufficiently compensated (if possible) ?

Yes - clock speed. For example, the L2 has a latency of 18 cycles if it has a size of 1 MB, and 20 if it is 2 MB. A smaller cache can be accessed faster. However, I still wonder why they have kept a 4-cycle latency for the little 16 kB L1. Intel has 4 cycles too, but they have a 32 kB 8-way write-back design, i.e. much more complex. Well, I guess they have their reasons.

Furthermore the cache question was answered here:
http://www.anandtech.com/show/5057/the-bulldozer-aftermath-delving-even-deeper/12

Yes of course it doesn't help, but it is not the huge, big bottleneck.

Edit: kernelc was faster *G*

kernelc · Jun 11, 2012

Homeles said:
It was probably to save die space. AMD could use some increased associativity to counter the resources being shared and small cache size.

I don't think that the small L1 cache is a direct consequence of saving die space: an L1 cache would still be very small (in terms of die area) even at 32 or 64 KB.

I think that the small L1 and the relatively low associativity are due to the high clock speeds they planned to reach. The real problem is that these high clock speeds mostly failed to materialize with the current-gen Bulldozer chips...

Let's see the desktop version of Piledriver.

Thanks.

Vesku · Jun 11, 2012

Seems like they lost some of their core CPU engineers over the course of the CMT design. The design that would launch as Bulldozer - FX was originally targeted at 45nm, probably first scheduled to launch around the time we saw x6 CPUs.

Loss of institutional knowledge is the most likely explanation why they would apparently rely on the next fab process shrink for such a large proportion of final performance. Especially when their fab actually had a history of rocky roll outs.

SocketF · Jun 11, 2012

ShintaiDK said:
So was the P4. Failure for both? Can't clock high enough.

Bulldozer uarch is unfixable.

P4's problem was that Intel didn't foresee the steeply increasing leakage problem. AMD's engineers, however, had that in mind. Now, if you want to tell us some inside story, please feel free to enlighten us with the technical details. Statements without the "because" part, however, are useless and mere babbling.

kernelc said:
This is true, but modern operating systems tend not to move threads between cores without a very good reason: http://www.tomshardware.com/reviews/intel-core-i5,2410-8.html

With Windows Vista it was a major concern for AMD Phenom CPUs, as you correctly noted.

I have read about it, but I am still very skeptical. If I run, for example, the Cinebench single-thread benchmark and check the Task Manager, then I have ~13% CPU utilization spread evenly across *all* my 8 cores. I thought that it may be due to my Win2008R2 Server OS, but it seems it is a problem for normal W7 too; there are several people complaining about the core parking feature:
http://bitsum.com/about_cpu_core_parking.php
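
As a rough sanity check on the ~13% figure: if one fully busy thread is migrated evenly across 8 logical cores, each core would average 100/8 = 12.5%, which is about what Task Manager shows. A minimal Python sketch of that arithmetic follows; the function name is made up for illustration, and perfectly even migration is an assumption.

```python
def per_core_utilization(busy_threads: int, cores: int) -> float:
    """Average utilization of each core, in percent, when `busy_threads`
    fully busy threads are spread evenly across `cores` logical cores.
    Assumes perfectly even migration, which a real scheduler only approximates."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return 100.0 * busy_threads / cores


# One single-threaded benchmark hopping across 8 cores.
print(per_core_utilization(1, 8))  # 12.5
```

So an even ~13% on every core is consistent with a single thread being moved around, rather than with the benchmark actually using 8 threads.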

However, it seems to be disabled for Bulldozer CPUs with one of the scheduler hotfixes:
http://support.microsoft.com/kb/2646060/en-us

A Bulldozer-specific problem that remains even with the new W7 patch is that the ideal thread schedule depends heavily on the expected instruction mix and other difficult-to-predict factors. For example, FP-heavy threads should be scheduled on different modules (this maximizes FPU usage), while integer threads can be scheduled either on different modules (when they don't share data) or on the same module (when they share data, to let Turbo Core kick in aggressively and to use the L2 to quickly pass data between the two cores).

So, ideal scheduling on Bulldozer-class cores is quite a complex affair, and often the OS scheduler simply doesn't have enough information to make the best choice.

Oh yes, scheduling is very complex. I read a few articles about the Linux and BSD schedulers, and there is really a lot of work going on. However, it seems that Linux has chosen to implement one dedicated run queue per core, whereas Windows has, imo, one global run queue, hence the core hopping (I didn't find sources for Windows, but seeing the threads hopping around, I assume there is only one run queue). I wonder if something will change in Win8. At least it will bring the "true" Bulldozer scheduler. The current patches for Win7 are imo just reusing the features implemented for Intel's Hyper-Threading.
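
The placement rule described above can be sketched as a tiny decision function. This is only an illustration in Python, not how any real OS scheduler works: `ThreadHint` and `prefer_same_module` are hypothetical names, and the hints (FP-heavy, shares data) are exactly the information a real scheduler usually does not have. How to treat a mixed FP/integer pair is not covered by the rule above, so the sketch simply falls through to the data-sharing test as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreadHint:
    """Hypothetical per-thread hints; a real OS rarely knows these up front."""
    fp_heavy: bool      # thread mostly issues floating-point work
    shares_data: bool   # thread exchanges data with the other thread


def prefer_same_module(a: ThreadHint, b: ThreadHint) -> bool:
    """Return True if the two threads should share one Bulldozer module."""
    # Two FP-heavy threads would compete for the module's single shared FPU,
    # so spread them over different modules.
    if a.fp_heavy and b.fp_heavy:
        return False
    # Threads that share data can pass it through the module's common L2,
    # and packing them leaves other modules idle so Turbo has more headroom.
    if a.shares_data and b.shares_data:
        return True
    # Otherwise give each thread a module of its own.
    return False


fp = ThreadHint(fp_heavy=True, shares_data=True)
int_shared = ThreadHint(fp_heavy=False, shares_data=True)
int_private = ThreadHint(fp_heavy=False, shares_data=False)

print(prefer_same_module(fp, fp))                    # False
print(prefer_same_module(int_shared, int_shared))    # True
print(prefer_same_module(int_private, int_private))  # False
```

Even this toy version needs two facts per thread that only the application really knows, which is the core of the problem.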

If I remember correctly, the first Intel 32nm product, Westmere, had an L2 cache that didn't run at core clock. It is with Sandy Bridge that the L2 runs at core clock.

Doh, I was too lazy to write "sandy-bridge" and decided to go with the shorter 32nm, and look what happened, it came out wrong. You are right, of course.

regards

SF

kernelc · Jun 11, 2012

SocketF said:
I have read about it, but I am still very skeptical. If I run, for example, the Cinebench single-thread benchmark and check the Task Manager, then I have ~13% CPU utilization spread evenly across *all* my 8 cores. I thought that it may be due to my Win2008R2 Server OS, but it seems it is a problem for normal W7 too; there are several people complaining about the core parking feature:
http://bitsum.com/about_cpu_core_parking.php

Interesting. I should try with my Arrandale-based notebook...

However, it seems to be disabled for Bulldozer CPUs with one of the scheduler hotfixes:
http://support.microsoft.com/kb/2646060/en-us

Oh yes, scheduling is very complex. I read a few articles about the Linux and BSD schedulers, and there is really a lot of work going on. However, it seems that Linux has chosen to implement one dedicated run queue per core, whereas Windows has, imo, one global run queue, hence the core hopping (I didn't find sources for Windows, but seeing the threads hopping around, I assume there is only one run queue). I wonder if something will change in Win8. At least it will bring the "true" Bulldozer scheduler. The current patches for Win7 are imo just reusing the features implemented for Intel's Hyper-Threading.

Yes, this was needed to prevent the OS from assigning two threads to a single module too often. The downside is reduced Turbo Core efficiency and increased power consumption. And the results speak for themselves: with this patch, Bulldozer's improvements are very limited (for throughput at least; responsiveness can be improved quite a bit, it seems).

Doh, I was too lazy to write "sandy-bridge" and decided to go with the shorter 32nm, and look what happened, it came out wrong. You are right, of course.

regards

SF

Homeles · Jun 11, 2012

ShintaiDK said:
So was the P4. Failure for both? Can't clock high enough.

Bulldozer uarch is unfixable.

Bulldozer's pipeline is only slightly larger than Sandy Bridge's. You have absolutely no credibility to be making calls like that.

sm625 · Jun 11, 2012

I just don't get why AMD removes important things like INT execution clusters and then replaces them with crap like the WCC. And why go from write-back to write-through? How much die space did that save? And why not just cut the size of the L2/L3 by 20% if they wanted to save space? I just don't get why they cut and gut critical sections of the core and replace them with "nonintelligent" bits like L2/L3, and ill-conceived band-aid solutions like the WCC. I wonder how much of the BD die is composed of sloppy duct-tape type transistors...

Soulkeeper · Jun 11, 2012

Good posts everyone.
Nice to see some intelligent speculation/ideas.

ShintaiDK · Jun 11, 2012

Homeles said:
Bulldozer's pipeline is only slightly larger than Sandy Bridge's. You have absolutely no credibility to be making calls like that.

Interesting, considering neither of the two companies has disclosed the pipeline length. So can you show me where that information is published?

Personally I would guess that SB has around 16 stages and Bulldozer around 25 stages. Misprediction latency gives an estimate that it's around 50% longer than Phenom II.

My point was rather that the Bulldozer uarch is based on INT cores that have only 2 issue ports and focus on simple instructions. Plus the shared FPU has only 2 issue ports as well.

Don Karnage · Jun 11, 2012

Homeles said:
Bulldozer's pipeline is only slightly larger than Sandy Bridge's. You have absolutely no credibility to be making calls like that.

I could have sworn i read Bulldozer's pipeline is on par with Netburst's

iCyborg · Jun 11, 2012

ShintaiDK said:
Interesting, considering neither of the two companies has disclosed the pipeline length. So can you show me where that information is published?

Personally I would guess that SB has around 16 stages and Bulldozer around 25 stages. Misprediction latency gives an estimate that it's around 50% longer than Phenom II.

He's probably referring to this: http://www.anandtech.com/show/5057/the-bulldozer-aftermath-delving-even-deeper/2

Secondly, the Pentium 4's pipeline was 28 ("Willamette") to 39 ("Prescott") cycles. Bulldozer's pipeline is deep, but it's not that deep. The exact number is not known, but it's in the lower twenties. Really, Bulldozer's pipeline length is not that much higher than Intel's Nehalem or Sandy Bridge architectures (around 16 to 19 stages).

Homeles · Jun 11, 2012

ShintaiDK said:
Interesting, considering neither of the two companies has disclosed the pipeline length. So can you show me where that information is published?

Personally I would guess that SB has around 18 stages and Bulldozer above 25 stages.

My point was rather that the Bulldozer uarch is based on INT cores that have only 2 issue ports and focus on simple instructions.

Then you should have stated your point. We've only seen thorough benchmarks for one generation of Bulldozer. You cannot say the architecture is a failure solely based off of its first implementation. By your logic, I could say that Intel's current desktop uarch is doomed because the P6 architecture (which Conroe, Nehalem and Sandy Bridge are derived from) was a flop at the time of its introduction. And it was.

Regarding the "official" pipeline stage count, there hasn't been a number published for either architecture, but it's not difficult to get a ballpark figure.

Don Karnage said:
I could have sworn i read Bulldozer's pipeline is on par with Netburst's

How does that have anything to do with what I said? Regardless, the pipeline is shorter, and although the minimum branch prediction penalty is similar, the maximums are very different.

To both of you:
http://www.anandtech.com/show/5057/the-bulldozer-aftermath-delving-even-deeper/2

ShintaiDK · Jun 11, 2012

Homeles said:
Then you should have stated your point. We've only seen thorough benchmarks for one generation of Bulldozer. You cannot say the architecture is a failure solely based off of its first implementation. By your logic, I could say that Intel's current desktop uarch is doomed because the P6 architecture (which Conroe, Nehalem and Sandy Bridge are derived from) was a flop at the time of its introduction. And it was.

I thought we had seen 2 generations:
http://www.anandtech.com/show/5831/amd-trinity-review-a10-4600m-a-new-hope

Homeles · Jun 11, 2012

ShintaiDK said:
I thought we had seen 2 generations:
http://www.anandtech.com/show/5831/amd-trinity-review-a10-4600m-a-new-hope

We've only seen thorough benchmarks for one generation of Bulldozer.

I worded that specifically to imply that Piledriver hasn't seen thorough benchmarks yet. Mobile benches are too "muddy." Regardless, we haven't seen it run at desktop TDPs.

Arzachel · Jun 11, 2012

Don Karnage said:
I could have sworn i read Bulldozer's pipeline is on par with Netburst's

It's around 20 for Northwood, 31 for Prescott, 20 for Bulldozer and 17 for Sandy Bridge (sometimes less due to the micro-op cache) iirc.

SocketF · Jun 14, 2012

kernelc said:
Interesting. I should try with my Arrandale-based notebook...

Here are now some interesting numbers on the new Trinity Chips from THG:

http://www.tomshardware.com/reviews/a10-5800k-a8-5600k-a6-5400k,3224-13.html
(Turbo and power savings modes were switched on)

With the single-threaded iTunes and Lame, the supposedly inferior A6-5400K is up to ~10% faster than the slightly higher-clocked A8-5600K (which has 100 MHz more Turbo headroom). The little 5400K also has much less L2 cache, only 1x1MB, whereas the 5600K has 2x2MB.

But, as you can see from the cache numbers, the 5400K consists of one module only, thus thread switches won't leave the module and the L2 cache will always be hot. Furthermore, the module won't be able to switch into (deep) sleep modes.

The only question now is: how much of the penalty is due to the power-saving deep sleep mode switches, and how much is due to the cold L2 caches?

IntelUser2000 · Jun 14, 2012

ShintaiDK said:
Personally I would guess that SB has around 16 stages and Bulldozer around 25 stages. Misprediction latency gives an estimate that it's around 50% longer than Phenom II.

If it really is 50% longer than Phenom II, it would be 18. Every AMD CPU including Athlon 64, Phenom II and anything in between is at 12.
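
Spelling out that estimate in Python, as a minimal sketch: the 12-stage baseline and the 50% figure are simply the numbers quoted in this thread, not official data, and the names are made up for illustration.

```python
K10_STAGES = 12  # stage count claimed in this thread for Athlon 64 / Phenom II


def scaled_depth(base_stages: int, relative_increase: float) -> float:
    """Pipeline depth after lengthening `base_stages` by `relative_increase`
    (0.50 means 50% longer)."""
    return base_stages * (1.0 + relative_increase)


print(scaled_depth(K10_STAGES, 0.50))  # 18.0
```

Getting to 25 stages from a 12-stage baseline would take an increase of roughly 108%, so the "around 25" guess and the "50% longer" estimate can't both hold with that baseline.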

denev2004 · Jun 14, 2012

ShintaiDK said:
Interesting, considering neither of the two companies has disclosed the pipeline length. So can you show me where that information is published?

They have... I guess it's just that the term pipeline length has been more or less abandoned by them, since in a modern CPU it varies from one application to another.

ShintaiDK · Jun 14, 2012

denev2004 said:
They have... I guess it's just that the term pipeline length has been more or less abandoned by them, since in a modern CPU it varies from one application to another.

Interesting. Please link the documents.
