Thoughts, Rumors, or Specs of AMD fx series steamroller cpu

Soulkeeper · Jun 3, 2012

Fox5 said:
Just like AMD's professional video card lines (FireGL and compute cards), it's a way more profitable market per sale. Ideally, you get scale by selling your stuff for whatever you can on the consumer market, and then reap the benefits on the highly profitable professional market.

Plus, AMD had some good advantages in the server market until Nehalem, and was even pretty competitive up until bulldozer.

That said, I think more focus on the consumer market would have benefited them greatly, even if it was just with more appropriate product placement.
IE: Focus on low power designs for laptops and ultraportables, and claim that highly profitable segment from Intel. AMD never put much effort into low power designs, even when they've had good power efficiency.
Instead of Phenom being a quad core, it should have been a native dual core design similar to the core 2 duo and with no L3 cache. In many real world situations, they probably could have had equal to or better performance than what they offered, but with vastly lower costs per chip.
Really, AMD shouldn't have launched consumer lines with L3 cache at all (at least not on the low end), it bloats die size to a large amount and AMD's designs still under perform.
Instead of bulldozer, a die shrunk and up-clocked Phenom II based design would have been better. A Bulldozer module is nearly 2x the size of a Phenom II core, ie, their innovative design doesn't seem to have saved them very much die space.

I think you're spot on.
I really havn't understood the whole L3 decision by either intel or AMD.
The yorkfields had 12MB of fast L2 which gave them incredible performance characteristics.
Maybe i'm missing something, but why do we need/want L3 again ?

Ajay · Jun 4, 2012

podspi said:
Clearly modules aren't great for peak performance, but for perf/watt they are the way to go. The idea behind the module configuration is to achieve very high resource utilization.

Same design goal as HT (SMT). It too was of limited usefulness when it first came out - I think the average real world performance boost was ~30% for threaded apps. Now it's much, much better.

I'm hoping that they improve on the caches in Vishera (wouldn't be surprising, given that it will be based off of server designs), and then for steamroller improve the frontend.

Yeah, it would be nice if AMD could pull off some significant improvements. But they have two things running against them:
1) Amd laid off a bunch of uProc designers which has already delayed Vishera (and will thus set back other follow on designs in this generation).
2) There is no way they will be able to catch up with Intel on process tech (unless Intel fabbed their CPUs and gave AMD a cut rate price on wafers )

Still, hopefully by Excavator, they will be able to be in the same 'ballpark' as Intel, since it appears that sub 22nm fabs and designs are going to be really, really hard - even for Intel.

So Intel will likely be faced with diminishing returns wrt performance improvements, which could allow AMD to do a bit of catching up by ~2015, if AMD and it's fab partners execute very well.

We'll have to see how Vishera comes to b/4 we have some insight into AMD's ability to improve performance and power consumption going forward.

Olikan · Jun 4, 2012

Soulkeeper said:
I think you're spot on.
I really havn't understood the whole L3 decision by either intel or AMD.
The yorkfields had 12MB of fast L2 which gave them incredible performance characteristics.
Maybe i'm missing something, but why do we need/want L3 again ?

intel's L3 is more like a protection... if L2 cache misses data, the CPU will find the missing data it at L3.

for amd....i don't really remember

ShintaiDK · Jun 4, 2012

L3 was needed to to the increased amounts of access ports to the L2 as core amount went up. As far as I recall. The L2 would simply end up too slow if shared.

CPUarchitect · Jun 4, 2012

Soulkeeper said:
I really havn't understood the whole L3 decision by either intel or AMD.
The yorkfields had 12MB of fast L2 which gave them incredible performance characteristics.
Maybe i'm missing something, but why do we need/want L3 again ?

Yorkfield was not fast at all for synchronizing between threads on both chips. And its 6+6 MB cache was a waste due to having duplicate data in both halves. Nehalem's monolithic L3 cache is superior, despite being "only" 8 MB. It also uses more power efficient 8T cells. Furthermore, Nehalem's L2 cache latency (11 cycles) is lower than that of Penryn/Yorkfield (18 cycles). And even though it's smaller and thus there are more misses, the L3 cache and integrated memory controller makes it all result in a lower average access lantecy.

There are rumors that Haswell might even include an L4 cache consisting of eDRAM. This would allow the expensive SRAM caches to be smaller while saving power by going off-package less often.

Ajay · Jun 4, 2012

CPUarchitect said:
There are rumors that Haswell might even include an L4 cache consisting of eDRAM. This would allow the expensive SRAM caches to be smaller while saving power by going off-package less often.

Has this been confirmed yet? I know there have been some rumors that it's actually L2$ for the HD graphics core.

kernelc · Jun 4, 2012

Olikan said:
intel's L3 is more like a protection... if L2 cache misses data, the CPU will find the missing data it at L3.

for amd....i don't really remember

As ShintaiDK already wrote, a dedicated L2 cache was needed because sharing the L2 cache between 4 cores will be slow.

So, with Nehalem Intel went the reverse route: it create a small, very fast L2 cached backed by a slower and bigger L3 cache.

This inclusive L3 cache permits:
- fast data sharing between threads that run on different cores
- to efficiently probe cores on other sockets
- to let idling cores go to very low powers states in a very short time.

On the other side, as AMD's L3 cache is a victim cache (its data are not duplicates of L2 generally) it has lower efficienty for point n.2 and n.3, but it give Bulldozer a bigger total, effective cache size.

Regards.

Note: to efficiently proble cores on different sockets, Shangai and Bulldozer based chips include a "snoop filter" features.

soccerballtux · Jun 4, 2012

blckgrffn said:
@ kernelc - fantastic to have you posting here, welcome!

+1

Cpus · Jun 5, 2012

Just wondering what is L4 cache?

ShintaiDK · Jun 5, 2012

Cpus said:
Just wondering what is L4 cache?

A 4th level cache? Its as such no different than L1, L2 and L3.

Makaveli · Jun 5, 2012

ShintaiDK said:
A 4th level cache? Its as such no different than L1, L2 and L3.

Wouldn't it be slower since every level of cache get slower than the previous one.

ShintaiDK · Jun 5, 2012

Makaveli said:
Wouldn't it be slower since ever level of cache get slower than the previous one.

But it would still be faster than main memory.

The only reason we got L3 is due to the fact, that sharing the L2 with so many cores would make it too slow. The L4 is currently still a rumour and I guess it will stay so. But if we get an L4. It would be due to the above reason.

Dresdenboy · Jun 6, 2012

ShintaiDK said:
But it would still be faster than main memory.

The only reason we got L3 is due to the fact, that sharing the L2 with so many cores would make it too slow. The L4 is currently still a rumour and I guess it will stay so. But if we get an L4. It would be due to the above reason.

One common task in the future will be to work on streamed data, usually not fitting into any cache. In such cases, only prefetching helps lowering average latencies, but throughput wouldn't increase (as it would when repeatedly using cached data).

So it's likely (and some rumours point into this direction) that high bandwidth stacked memory will be used in the future to provide enough bandwidth to CPU+GPU cores.

Currently I see the uses of the cache levels as follows:
L1: low latency high bandwidth read cache per core, no probing, no sharing
L2: higher latency medium bandwidth read/write cache, shared (writes via WCC), probing (one cache to check per 2 cores)
L3: higher latency medium latency cache, inclusive (shared data) and exclusive, actually high total peak bandwidth according to ICC (307 GB/s at 2.4 GHz)

FlanK3r · Jun 8, 2012

L4 could be for primary cache iGPU in APU or so

peonyu · Jun 8, 2012

soccerballtux said:
these are really bad names. Bulldozer, steamroller,... very large, very heavy, and very slow vehicles. Not what I want in my CPU.

But think of how many corez you could cram into a steamroller! It may be slow but it has alot of the corez!

Cerb · Jun 8, 2012

Makaveli said:
Wouldn't it be slower since every level of cache get slower than the previous one.

That's why I'm especially skeptical. A local memory, or cache just for certain uses (make it exclusive where it would need to be CC, and the rest of cache management is made easier), would not terribly affect the rest of memory speeds. A huge L4, OTOH, added to L3 (if it's a local memory, or non-CPU-core cache, it's not really L4, is it?), would require very big cache lines, sequential cacheline extents, or some other method of keeping CAM lookups quick (but which could affect the hit rate negatively), or could add substantial time to lookups. Memory may be slow, but slower cache is not ideal, either. There is definitely a point at which making them faster for the sake of keeping the CPU busy when every request is a hit is worth it over making them bigger to reduce the need to go out to RAM on misses.

SocketF · Jun 10, 2012

kernelc said:
As ShintaiDK already wrote, a dedicated L2 cache was needed because sharing the L2 cache between 4 cores will be slow.

Yes and dedicated 4x1MB L2 caches (without L3) wont be able to share data at all.

Besides, as you mentioned, there were other benefits for small L2 caches, too. The biggest one imo was:

- to efficiently probe cores on other sockets

The L3 cache acts as the center of memory coherency of the whole CPU. With that, intel could use also the less complex MESIF protocol.
All in all, the small L2 caches are a result of the necessary L2-include-cache strategy of the L3(the bigger the L2 the more copy-overhead in the L3, hence L2s must be small); and that was in turn a direct consequence from the requirements for a scalable server-architecture.
In short: Nehalem and its descendants are full-blooded server designs.

Apart from your benefits' list, I would also mentioned the increased robustness against bad OS schedulers. If the operating system has a bad scheduling algorithm, then it does not matter much for a modern intel CPU. The cached data is still preserved in the L3. A thread which is scheduled to run on another core can load the data from L3,i.e. the caches are always hot. However, this is impossible on an AMD CPU, data has to be loaded from memory in that case, because the L3 does contain different data compared to the L2 (only exception would be a thread, changing clusters within one Bulldozer-module).

Note: to efficiently proble cores on different sockets, Shangai and Bulldozer based chips include a "snoop filter" features.

AFAIK it was just introduced with the first six core CPUs, i.e. Istanbul, not Shanghai.

Furthermore, here we see again the pros and cons of intel's decision for a small L2. Yes, AMD has the bigger, usable cache setup, but for multi-processor systems, AMD *also* has to cut down the usable cache-size.

Theoretically, that is the better way, because the 1P / desktop users do not have to suffer from the server-architecture-penalty, i.e. AMD user have the choice to sacrifice L3-cache size for a better MP scalability. Intel users do not have that choice. However, in reality, intel CPUs are still faster, due to other factors. Furthermore, since the 32nm generation, intel's L3 caches run at full core-clock speed, thus acting more like a "L2.5" cache. What I want to point out with that strange notation is the fact that intel's L3-cache of its 32/22nm CPUs is (nearly) as fast as AMD's L2 caches of Bulldozer.

Well let's wait and see if AMD will change something in the cache setup for Steamroller. I won't hold my breath, though.

Cpus · Jun 10, 2012

What do you think the clock speed, turbo, and max turbo will be of the flagship steamroller cpu? I'm almost positive the flagship piledriver will be 4.2ghz with 4.7ghz turbo core

ShintaiDK · Jun 10, 2012

+600Mhz core, +500Mhz Turbo with 4 modules on the same 32nm process? Thats optimistic.

The NB is still clocked at 2-2.2Ghz.

Sofar Steamroller aint on the 2013 roadmap for FX. So who knows, depends on processnode too.

Cpus · Jun 10, 2012

I know it is going to be onthe 28nm node unless something goes majorly wrong. I am pretty sure it is going to be release in early 2014.

SickBeast · Jun 10, 2012

Guys I've really gotta say that Steamroller is going to be really bad. It's still based off the original Bulldozer design. There's only so much they can do with this thing. It's just like the P4.

Homeles · Jun 10, 2012

SickBeast said:
Guys I've really gotta say that Steamroller is going to be really bad. It's still based off the original Bulldozer design. There's only so much they can do with this thing. It's just like the P4.

No. Have you not read this article? http://www.anandtech.com/show/5057/the-bulldozer-aftermath-delving-even-deeper/1

Bulldozer's definitely a great concept, but it's got some glaring flaws. Piledriver will make some significant strides towards hammering those out. AMD can only do so much on the 32nm node though. It appears Bulldozer wasn't really ready for 32nm, as there isn't much room for improvement (literally).

Soulkeeper · Jun 10, 2012

does anyone else think that the small L1 caches on bulldozer could be killing performance ?

per core (half module) there is 1/2 the instruction and 1/4 the data compared to stars (correct me if i'm wrong).
Granted the instruction cache is shared, but even then it must feed 2 threads.

Was there any justification for decimating the L1 cache sizes other than saving die space ?
Have they sufficiently compensated (if possible) ?

Looking at These benchmarks comparing bulldozer with a 200MHz clock advantage against x6. Not only are the L1 cache sizes smaller, but the overrall latency and bandwidth of both L1 and L2 are worse.

Homeles · Jun 10, 2012

Soulkeeper said:
does anyone else think that the small L1 caches on bulldozer could be killing performance ?

per core (half module) there is 1/2 the instruction and 1/4 the data compared to stars (correct me if i'm wrong).
Granted the instruction cache is shared, but even then it must feed 2 threads.

Was there any justification for decimating the L1 cache sizes other than saving die space ?
Have they sufficiently compensated (if possible) ?

It was probably to save die space. AMD could use some increased associativity to counter the resources being shared and small cache size.

kernelc · Jun 11, 2012

SocketF said:
Apart from your benefits' list, I would also mentioned the increased robustness against bad OS schedulers. If the operating system has a bad scheduling algorithm, then it does not matter much for a modern intel CPU. The cached data is still preserved in the L3. A thread which is scheduled to run on another core can load the data from L3,i.e. the caches are always hot. However, this is impossible on an AMD CPU, data has to be loaded from memory in that case, because the L3 does contain different data compared to the L2 (only exception would be a thread, changing clusters within one Bulldozer-module).

This is true, but modern operating systems tend to don't move threads between cores without a very good reason: http://www.tomshardware.com/reviews/intel-core-i5,2410-8.html

With Windows Vista it was a major concern for AMD Phenom CPU, as you correcly noted.

A Bulldozer-specific problem that remain with the new W7 patch also, is that ideal thread schedule heavily depend on the expected instruciton mix and others difficult to predict factors. For example, FP-heavy threads should be scheduled on a different modules (this maximize FPUs usage), while integer thread can be scheduled both on different modules (when they don't share data) or on the same module (when they share data, for let Turbo boost to aggressively kick in and use the L2 to quicky pass data between the two cores).

So, the ideal scheduling on Bulldozer-class cores is a quite complex affair, and often the OS scheduler simply don't has enought information to do the best choice.

Fortunately heavy workload are generally well threaded, at a point were they engage _all_ CPU cores. In this case scheduling is simpler because, as many desktop-server workload don't rely on data sharing between threads, the OS can distribute threads as it wants, without too much concerns.

AFAIK it was just introduced with the first six core CPUs, i.e. Istanbul, not Shanghai.

I double-checked and yes, you are right: it was introduced with Instanbul core: http://www.anandtech.com/show/2774/2

Furthermore, here we see again the pros and cons of intel's decision for a small L2. Yes, AMD has the bigger, usable cache setup, but for multi-processor systems, AMD *also* has to cut down the usable cache-size.

Theoretically, that is the better way, because the 1P / desktop users do not have to suffer from the server-architecture-penalty, i.e. AMD user have the choice to sacrifice L3-cache size for a better MP scalability. Intel users do not have that choice. However, in reality, intel CPUs are still faster, due to other factors. Furthermore, since the 32nm generation, intel's L3 caches run at full core-clock speed, thus acting more like a "L2.5" cache. What I want to point out with that strange notation is the fact that intel's L3-cache of its 32/22nm CPUs is (nearly) as fast as AMD's L2 caches of Bulldozer.

Well let's wait and see if AMD will change something in the cache setup for Steamroller. I won't hold my breath, though.

If I remember correctly, the first Intel 32nm product, westmere, had L2 cache that didn't run at core clock. It is with Sandy Bridge that L2 runs at core clock

For all other things, I agree

Thanks.

Thoughts, Rumors, or Specs of AMD fx series steamroller cpu

Diamond Member

Lifer

Platinum Member

Lifer

Senior member

Lifer

Member

Lifer

Senior member

Lifer

Diamond Member

Lifer

Golden Member

Senior member

Platinum Member

Elite Member

Senior member

Senior member

Lifer

Senior member

Lifer

Platinum Member

Diamond Member

Platinum Member

Member