[WCCF] AMD To Drop CMT, Welcome back SMT?

Ancalagon44 · May 5, 2014

itsmydamnation said:
K7 is Palomino, Barton, etc., and clock for clock they were much faster than a P4.

This is what I could find so far.

P4 2.4GHz beats Barton core running at 2.5GHz.

I seem to remember that the K7 dominated the P4 initially, but later on it couldn't keep up with it. I think this was mostly because AMD could not ramp up the frequency of the K7 as easily as Intel could ramp up the frequency of the P4, but I do remember cases where the above happened. It was pretty rare though - I think most of the time, the K7 performed better clock for clock.

itsmydamnation · May 5, 2014

Get complex code, like games for example, and it was very much in the K7's favor. Get code that doesn't branch much, or has only simple branches, and the P4 could stretch its legs.

see in your own link
http://www.tomshardware.com/reviews/barton,587-15.html

Ancalagon44 · May 5, 2014

Yeah, also true. This is all off topic, I really just posted to say that it is demonstrably false that the Pentium 4 has higher IPC than K8. Heck, as we have seen, K7 has higher IPC than P4 99% of the time - never mind K8.

Really interested to see what AMD will do regarding CMT, but I'm not convinced they are going to do away with it just yet.

R0H1T · May 5, 2014

The idea of CMT + SMT is intriguing. However, as we've seen with AMD in the recent past, it's the execution part that really matters & they've let most of us down in that regard. We'll have to wait & see whether Carrizo &/or Basilisk do a better job with this approach.

Ancalagon44 · May 5, 2014

R0H1T said:
The idea of CMT + SMT is intriguing. However, as we've seen with AMD in the recent past, it's the execution part that really matters & they've let most of us down in that regard. We'll have to wait & see whether Carrizo &/or Basilisk do a better job with this approach.

Only with Bulldozer based cores. Jaguar and Puma+ have performed as well as AMD promised they would.

R0H1T · May 5, 2014

Ancalagon44 said:
Only with Bulldozer based cores. Jaguar and Puma+ have performed as well as AMD promised they would.

Yes, but unfortunately the low power/price segment doesn't seem to be taking off for 'em either. This may change with the next round of Windows (pro) tablets, but AMD will have to revise their current offerings at least one more time to compete with Broadwell on 14nm.

itsmydamnation · May 5, 2014

Ancalagon44 said:
Only with Bulldozer based cores. Jaguar and Puma+ have performed as well as AMD promised they would.

It's only really been with Bulldozer itself. Piledriver offers a very solid perf gain over Bulldozer, and Steamroller has solid gains as well, while losing a bit of clock from the process change. If Bulldozer simply offered more "IPC" (as much as I hate saying that), AMD would be in a better place; most of the improvements they have made would only help feed more execution resources.

inf64 · May 5, 2014

Excavator will be a good chance to see what AMD decided to do with the BD core. If they opted to beef it up ("Intel-style", via resource increases), then this would imply the follow-up (brand new core) will continue that route, which is basically the perf./watt route. You beef the cores up while targeting much better performance at lower power. The bad news is that clocks will most likely go down in the process.

Ancalagon44 · May 5, 2014

Well, SteamRoller does have higher IPC than piledriver and bulldozer, but the problem is that AMD is still a long way away from Intel in terms of power consumption and performance, at least for large CPUs.

Enigmoid · May 5, 2014

AtenRa said:
As NostaSeronx has said earlier, BD can do 2x ALU (Arithmetic Logic Unit) and 2x AGU (Address Generation Unit) ops simultaneously, that is 4x ops per cycle. The AMD K10 architecture could do either 3x ALU or 3x AGU, thus 3x ops per cycle.
The problem is that in order to get 4x ops per cycle you need the appropriate program. That means you must have data movement, so that the AGUs have addresses to calculate for the data operands alongside the ALU work.
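A rough way to picture the point above is a toy issue model. This is only a sketch: the 2 ALU + 2 AGU and "3 ops per cycle" figures come from the post, while the function names, the instruction mixes, and the assumption that K10 issues any 3 ops per cycle with no dependencies or stalls are made up for illustration.

```python
from math import ceil

def cycles_bd(alu_ops: int, agu_ops: int) -> int:
    """Cycles for one BD integer core: 2 ALU ports + 2 AGU ports per cycle."""
    return max(ceil(alu_ops / 2), ceil(agu_ops / 2))

def cycles_k10(alu_ops: int, agu_ops: int) -> int:
    """Cycles for K10, assuming (as in the post) 3 ops per cycle of any kind."""
    return ceil((alu_ops + agu_ops) / 3)

# Hypothetical instruction mixes (ALU ops, AGU ops), not measured data.
mixes = {"balanced": (6, 6), "alu_heavy": (12, 0)}
for name, (alu, agu) in mixes.items():
    print(name, "BD:", cycles_bd(alu, agu), "K10:", cycles_k10(alu, agu))
```

In this model the balanced mix finishes sooner on BD (3 cycles vs 4), while the ALU-only mix finishes sooner on K10 (4 cycles vs 6), which is the "you need the appropriate program" caveat in miniature.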

That means that the program should be written for ILP/DLP and SIMD/MIMD(*), but the vast majority of desktop programs are more serial than parallel in nature. AMD knew this but they chose to go for the more ILP/DLP approach because BD's first target was and still is the Server segment. The 2x ALU + 2x AGU approach also helps with the APU Architecture because GPUs are highly parallel in nature. Also HSA greatly uses Data parallelism.
They were also expecting that desktop programs would become more parallel sooner, but that didn't happen.

Also, the integer execution units don't take as much die area as the front end. And it is the front end where they wanted to save space; they duplicated the Int cores, after all.

The only problem with BD was that it was one year late to market. Piledriver clearly showed that the architecture was very competitive against Intel's 32nm Sandy Bridge in throughput performance (server). If the market had not collapsed and Dirk Meyer had not left AMD in early 2012, then we could have seen a 22/20nm Steamroller server part in 2014.

One doesn't exclude the other; you can have strong ST on a CMT design. Depending on the program, even BD has higher single-thread performance than K10.

(*)
ILP = Instruction Level Parallelism
DLP = Data Level Parallelism
SIMD = Single Instruction Multiple Data
MIMD = Multiple Instruction Multiple Data

http://www.anandtech.com/show/5058/amds-opteron-interlagos-6200

http://www.anandtech.com/show/6508/the-new-opteron-6300-finally-tested

BD was a minor improvement over Magny-Cours in performance while using more power under load. It was barely competitive vs. Westmere, and SB completely blew it away. Piledriver was better, but the situation was worse against SB.

Let's not forget that the discussion isn't primarily about specific Phenom II or BD features but about the situation in general. For a modern architecture Phenom II is quite bad, lacking power gating and other features to lower energy use. The fact that BD used more power on a die shrink for similar performance is pretty astounding. A heavily modified K10 would still be competitive against Kaveri on performance (no clue about power).

BD being a year early would not have done much for the desktop (potentially so for servers).

Can you provide an example of BD's higher ST performance on a task not optimized for it (AVX, FMA, or any other extension PII doesn't support)?

Where would AMD have fabbed their 20 nm server part in 2014? 20 nm is barely ready and a die that size on an immature process would be prohibitively expensive.

parvadomus · May 5, 2014

CMT is only good for servers (if the IPC reached is good enough), as it may help reduce fetch/decode/schedule complexity and transistor count.
But SMT is the way to go for desktop. It will always be more focused on IPC by design.

Cogman · May 5, 2014

Idontcare said:
They are conflating CMT as a design methodology with the specific microarchitectural trade-offs that AMD chose in implementing their version of CMT.

Conceptually, CMT is not complicated, and in the limit of full duplication of hardware resources it provides the exact same benefits as adding an entirely new core to the chip.

It's when you choose to cut out too much, or hold back on duplicating too many items, that you create a CMT-based product that barely improves on a standalone core design.

AMD's CMT was surprisingly good, providing nearly 80% of the performance scaling of having two full-fledged cores. The problem was that those "full-fledged cores" themselves were quite weak in terms of IPC... that's a design trade-off made by AMD that had nothing to do with CMT.

As to why other high-profile companies don't pursue CMT themselves, I can tell you that the complications that come with robustly qualifying and validating a CMT-based microarchitecture are a strong reason many will avoid it.

Another good reason to avoid CMT is that it raises your design development expenses over those of designing a smaller (dimension-wise) single-threaded core (or one with SMT enabled) and then just copy-and-pasting that core until you have your desired thread-count capability.

Why bring on the complications unless you have to?

Bingo.

CMT requires a brand new set of hardware that has never existed before. That hardware is involved in two of the most complex things about hardware, parallel execution and scheduling. Is it any wonder hardware manufacturers would want to avoid that sort of a beast and look elsewhere for gains? (it is much easier to add more cores than it is to make those cores share parts).

Couple that with the fact that your CMT implementation is going to be pretty tightly coupled with the rest of the system design and you can see why people aren't jumping on board. (By being tightly coupled, I mean you have to specifically design a large portion of your architecture around it and with it in mind).

That doesn't mean CMT is a bad thing, just that it adds a lot of complexity to a system that is already very complex. Hardware designers like to avoid complexity if at all possible.
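To put numbers on the "nearly 80%" scaling figure in the quote above, here is a minimal sketch. The 0.8 factor is taken from the quote; the per-core performance values (1.0 for a strong core, 0.7 for a weak one) and the function names are hypothetical and only there to separate the two effects.

```python
def cmp_pair_throughput(core_perf: float) -> float:
    """Two fully independent cores (CMP): throughput simply doubles."""
    return 2 * core_perf

def cmt_module_throughput(core_perf: float, cmt_scaling: float = 0.8) -> float:
    """Two-thread CMT module delivering cmt_scaling of what two full cores would."""
    return cmt_scaling * cmp_pair_throughput(core_perf)

strong_core = 1.0  # hypothetical per-core performance
weak_core = 0.7    # hypothetical weaker (lower IPC) core

print("CMP, strong cores:", cmp_pair_throughput(strong_core))
print("CMT, strong cores:", cmt_module_throughput(strong_core))
print("CMT, weak cores:  ", cmt_module_throughput(weak_core))
```

The sharing penalty is the same 20% in both CMT cases. The rest of the gap in the last line comes from the weak core itself, which is the distinction the quoted post draws between CMT as a method and AMD's specific core design.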

AtenRa · May 5, 2014

Enigmoid said:
http://www.anandtech.com/show/5058/amds-opteron-interlagos-6200

http://www.anandtech.com/show/6508/the-new-opteron-6300-finally-tested

BD was a minor improvement over Magny-Cours in performance while using more power under load. It was barely competitive vs. Westmere, and SB completely blew it away. Piledriver was better, but the situation was worse against SB.

The Piledriver 6300 series is very competitive against 32nm SB servers. From the AT review you can clearly see that the AMD Opteron 6376 is even with or faster than the Intel Xeon E5 2630, but it also costs $783 less (entire server setup hardware).
The AMD Opteron 6380 is way faster than the Xeon 2630 and only $331 more expensive (entire server setup hardware). Also, the Opteron 6380 is very close to the Xeon 2660, which costs $1439 more than the Opteron setup.
In performance and performance per dollar, the AMD Opteron 6300 series is very competitive against Intel's 32nm SB Xeons.

Enigmoid said:
Let's not forget that the discussion isn't primarily about specific Phenom II or BD features but about the situation in general. For a modern architecture Phenom II is quite bad, lacking power gating and other features to lower energy use. The fact that BD used more power on a die shrink for similar performance is pretty astounding. A heavily modified K10 would still be competitive against Kaveri on performance (no clue about power).

There are only one or two applications where BD has lower performance per watt than K10; in the vast majority BD has the same or higher performance per watt. Piledriver's performance per watt is always higher than K10's.

Enigmoid said:
Can you provide an example of BD's higher ST performance on a task not optimized for it (AVX, FMA, or any other extension PII doesn't support)?

From the AT review, in Cinebench the Opteron 6276 (BD) has higher single-thread performance than the Opteron 6174 (K10).

Enigmoid said:
Where would AMD have fabbed their 20 nm server part in 2014? 20 nm is barely ready and a die that size on an immature process would be prohibitively expensive.

GloFo. AMD could have had a 22/20nm server part by the end of 2014 (H2). Yes, it would have been expensive, but server SKUs command a high margin, so no problem there.

Enigmoid · May 5, 2014

AtenRa said:
The Piledriver 6300 series is very competitive against 32nm SB servers. From the AT review you can clearly see that the AMD Opteron 6376 is even with or faster than the Intel Xeon E5 2630, but it also costs $783 less (entire server setup hardware).
The AMD Opteron 6380 is way faster than the Xeon 2630 and only $331 more expensive (entire server setup hardware). Also, the Opteron 6380 is very close to the Xeon 2660, which costs $1439 more than the Opteron setup.
In performance and performance per dollar, the AMD Opteron 6300 series is very competitive against Intel's 32nm SB Xeons.

There are only one or two applications where BD has lower performance per watt than K10; in the vast majority BD has the same or higher performance per watt. Piledriver's performance per watt is always higher than K10's.

From the AT review, in Cinebench the Opteron 6276 (BD) has higher single-thread performance than the Opteron 6174 (K10).

GloFo. AMD could have had a 22/20nm server part by the end of 2014 (H2). Yes, it would have been expensive, but server SKUs command a high margin, so no problem there.

When I say better I mean from a technical standpoint, not price. A poor product is always going to command a lower price. BD or PD basically cost AMD the server market; their share is almost nonexistent now, and their products are not competitive.

Sorry, I was unclear, I meant at the same clocks. Phenom II still does very well when comparing ST. And Piledriver had better use less power; it's much better power gated and on a much better process. There is no turbo on Magny-Cours.

inf64 · May 5, 2014

We already have Stars cores on 32nm, and they have worse perf./watt than PD. So nope, a Stars-based Opteron on 32nm would also have worse perf. and perf./watt than a PD-based Opteron.

AtenRa · May 5, 2014

Enigmoid said:
When I say better I mean from a technical standpoint, not price.

What do you mean by technical standpoint ??

Enigmoid said:
A poor product is always going to command a lower price. BD or PD basically cost AMD the server market; their share is almost nonexistent now, and their products are not competitive.

Sorry, but the Opteron 6376 has the same performance as the Intel Xeon E5 2630 and costs less. That doesn't make it a poor product. Also, server market share has nothing to do with Bulldozer as a product. AMD was losing server market share even before Bulldozer was released.

Enigmoid said:
Sorry, I was unclear, I meant at the same clocks. Phenom II still does very well when comparing ST. And Piledriver had better use less power; it's much better power gated and on a much better process. There is no turbo on Magny-Cours.

If you mean IPC then I don't have data to provide. But Piledriver's ST performance is higher than K10's.

NTMBK · May 5, 2014

Enigmoid said:
Sorry, I was unclear, I meant at the same clocks. Phenom II still does very well when comparing ST. And Piledriver had better use less power; it's much better power gated and on a much better process. There is no turbo on Magny-Cours.

Meh, IPC isn't that important to me. Choosing to go for lower IPC and higher clocks is just a different type of design- if a product has better performance/W and performance/$, I don't especially care how it got there.

mrmt · May 5, 2014

Enigmoid said:
When I say better I mean from a technical standpoint, not price. A poor product is always going to command a lower price. BD or PD basically cost AMD the server market; their share is almost nonexistent now, and their products are not competitive.

The server market is far more sensitive to power consumption than the desktop market. This is why AMD's share crashed with Bulldozer. Not only was performance not good, but power consumption turned the tables once you factor in TCO.

Maxima1 · May 5, 2014

if a product has better performance/W and performance/$

A lot of people don't seem to consider the electricity costs.
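As a back-of-the-envelope illustration of how electricity feeds into TCO, here is a small sketch. Every input (50 W extra draw, $0.12 per kWh, 3 years of 24/7 operation) is a hypothetical placeholder, not a figure from any review.

```python
HOURS_PER_YEAR = 24 * 365

def energy_cost(avg_watts: float, hours: float, price_per_kwh: float) -> float:
    """Cost of drawing avg_watts continuously for the given number of hours."""
    return avg_watts / 1000.0 * hours * price_per_kwh

# Hypothetical inputs: one server drawing 50 W more than a rival, running 24/7.
extra_watts = 50
price_per_kwh = 0.12
years = 3

extra_cost = energy_cost(extra_watts, HOURS_PER_YEAR * years, price_per_kwh)
print(f"Extra electricity over {years} years: ${extra_cost:.2f}")
```

Cooling overhead is ignored here, so a real datacenter would likely see a larger number per server.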

ElFenix · May 5, 2014

NTMBK said:
Meh, IPC isn't that important to me. Choosing to go for lower IPC and higher clocks is just a different type of design- if a product has better performance/W and performance/$, I don't especially care how it got there.

Yup, absolute performance, performance at a given power envelope, and performance for cost are what matter. IPC is useful because higher clocks mean higher power requirements, ceteris paribus. But it's only an indicator of what I actually want to know.
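The relationship being discussed can be sketched with the usual first-order formulas: performance is roughly IPC times clock, and dynamic power scales roughly with capacitance times voltage squared times frequency. The two designs below, including their IPC, clock and voltage values, are hypothetical, and the assumption that the higher-clocked design needs more voltage is mine.

```python
def performance(ipc: float, clock_ghz: float) -> float:
    """First-order throughput: instructions per cycle times cycles per second."""
    return ipc * clock_ghz

def dynamic_power(capacitance: float, volts: float, clock_ghz: float) -> float:
    """Relative dynamic power, using the standard C * V^2 * f approximation."""
    return capacitance * volts ** 2 * clock_ghz

# Two hypothetical designs that reach the same performance by different routes.
wide_slow = {"ipc": 2.0, "clock_ghz": 3.0, "volts": 1.0}
narrow_fast = {"ipc": 1.5, "clock_ghz": 4.0, "volts": 1.2}  # assumed to need more voltage

for name, d in (("wide_slow", wide_slow), ("narrow_fast", narrow_fast)):
    perf = performance(d["ipc"], d["clock_ghz"])
    power = dynamic_power(1.0, d["volts"], d["clock_ghz"])
    print(name, "perf:", perf, "relative power:", round(power, 2))
```

Both designs land on the same performance figure, but the high-clock one pays for it in power under these assumptions. A real comparison would also need leakage and the actual capacitance of each design, which is why measured performance per watt is the number that matters.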

Enigmoid · May 5, 2014

AtenRa said:
What do you mean by technical standpoint ??

Sorry, but the Opteron 6376 has the same performance as the Intel Xeon E5 2630 and costs less. That doesn't make it a poor product. Also, server market share has nothing to do with Bulldozer as a product. AMD was losing server market share even before Bulldozer was released.

If you mean IPC then I don't have data to provide. But Piledriver's ST performance is higher than K10's.

Technical not meaning price. A product that cannot compete with a competitor is going to get a price drop until it can.

But let's look at this again.

The higher CB score is because of turbo; without Turbo Core, CB 11.5 single-thread is around 0.57 (from the first link, comparing to the 62xx series). This is BD; Piledriver manages to match performance per clock.

Single-threaded performance is relatively poor when you do not enable Turbo Core: with that setting the Opteron 6276 scores only 0.57. So the single-threaded FP performance is about 10% lower, probably a result of the higher FP/SSE latencies of the Interlagos FPU. However, the 6276 Opteron can boost the clock speed to 3.2GHz. This 39% clock speed boost leads to a 37% (!) performance boost.
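The arithmetic in that quote can be worked through directly. This sketch uses only the quoted figures (0.57, 3.2GHz, 39%, 37%, about 10%); the variable names are mine, and reading "about 10% lower" as "10% below the K10 Opteron's score" is an assumption.

```python
base_score = 0.57        # Opteron 6276, Turbo Core disabled (quoted)
turbo_clock_ghz = 3.2    # quoted turbo clock
clock_boost = 0.39       # quoted clock speed boost
perf_boost = 0.37        # quoted performance boost
deficit_vs_k10 = 0.10    # "about 10% lower", assumed to be relative to K10

implied_base_clock_ghz = turbo_clock_ghz / (1 + clock_boost)
turbo_score = base_score * (1 + perf_boost)
implied_k10_score = base_score / (1 - deficit_vs_k10)
scaling_efficiency = perf_boost / clock_boost

print(f"implied base clock: {implied_base_clock_ghz:.2f} GHz")
print(f"score with turbo:   {turbo_score:.2f}")
print(f"implied K10 score:  {implied_k10_score:.2f}")
print(f"perf gain per unit of clock gain: {scaling_efficiency:.2f}")
```

On these figures the turbo score comes out around 0.78 against an implied K10 score around 0.63, so the win is consistent with being a clock win rather than a per-clock win.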

I don't know where you are getting pricing but the E5 2630 was/is cheaper at the 63xx series launch.

http://www.anandtech.com/show/6508/the-new-opteron-6300-finally-tested/2

E5-2630 - $649
6376 - $703

The 6376 is also using more power. It is slower than the E5-2660 (8 cores) and faster than the E5-2630 (6 cores).

As far as die sizes go, the 6376 is an MCM containing two 315 mm^2 dies. The E5-2630 is 412 mm^2 for the 8 core die. I'm not sure how much cost difference there is in that.
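For what it's worth, the price and die-area gaps from the figures above work out as follows. This is a sketch using only the numbers in this post; silicon area is not the same thing as manufacturing cost, since yield and process differ, so the ratio is only a rough indicator.

```python
price_e5_2630 = 649   # USD, from the list above
price_6376 = 703      # USD, from the list above

die_area_6376 = 2 * 315   # mm^2, two dies in the MCM
die_area_e5 = 412         # mm^2, single 8-core die

price_gap = price_6376 - price_e5_2630
price_gap_pct = 100 * price_gap / price_e5_2630
area_ratio = die_area_6376 / die_area_e5

print(f"6376 costs ${price_gap} more ({price_gap_pct:.1f}%)")
print(f"6376 total silicon: {die_area_6376} mm^2, {area_ratio:.2f}x the Xeon die")
```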

NTMBK said:
Meh, IPC isn't that important to me. Choosing to go for lower IPC and higher clocks is just a different type of design- if a product has better performance/W and performance/$, I don't especially care how it got there.

Higher clocks almost always indicate higher power and worse efficiency, especially when you break the ~3 GHz barrier.

NostaSeronx · May 5, 2014

Enigmoid said:
The higher CB score is because of turbo; without Turbo Core, CB 11.5 single-thread is around 0.57 (from the first link, comparing to the 62xx series). This is BD; Piledriver manages to match performance per clock.

Cinebench uses the ICC compiler, while the Cinema 4D software uses GCC/MSVC. There is a clear bias: FMACs without ADD offloaders will be slower than fixed ADD and MUL units.

Two flaws of Cinebench:
It does not represent the application it is benching for.
It does not do what benchmarks must be able to do, which is to display the performance of highly optimized code for any architecture.

Get a properly optimized benchmark of what Cinema 4D does, and you will observe Bulldozer (not Piledriver or Steamroller; the 8120) being 2x faster than Phenom II X6, while also being 1.1x faster than Westmere (990X). No overclocks or anything, and people still want the slowest architecture, K8.

--
https://imgur.com/a/Y8ro0
http://www.yeppp.info/benchmarks.html

Enigmoid · May 5, 2014

NostaSeronx said:
Cinebench uses the ICC compiler, while the Cinema 4D software uses GCC/MSVC. There is a clear bias: FMACs without ADD offloaders will be slower than fixed ADD and MUL units.

Two flaws of Cinebench:
It does not represent the application it is benching for.
It does not do what benchmarks must be able to do, which is to display the performance of highly optimized code for any architecture.

Get a properly optimized benchmark of what Cinema 4D does, and you will observe Bulldozer (not Piledriver or Steamroller; the 8120) being 2x faster than Phenom II X6, while also being 1.1x faster than Westmere (990X). No overclocks or anything, and people still want the slowest architecture, K8.

--
https://imgur.com/a/Y8ro0
http://www.yeppp.info/benchmarks.html

I'm comparing the two against each other, not against Intel.

http://www.anandtech.com/bench/product/434?vs=444

I'm seeing a thorough thrashing.

http://www.anandtech.com/bench/product/434?vs=203

We are nowhere close to 2x PII x6

NostaSeronx · May 5, 2014

Enigmoid said:
I'm comparing the two against each other, not against Intel.

With irrelevant bentmarks, nice job, very exceptional, much pride.

---
Again this forum provides a perfect example of very limited education in the actual real-world capabilities of architectures. Cinebench doesn't agree with you SO YOU'RE WRONG! Anand's half-decade-old bentmarks don't agree with your numbers SO YOU'RE WRONG! If Anandtech doesn't prove my point, I'll go to another review website that uses the same half-decade-old bentmarks to prove you wrong. NOT!

Even relevant benchmarks are nowhere near actual enterprise or consumer optimization. So benchmarks just show an estimated range for said product but never give actual performance. If you have results from a bentmark, that means the ESTIMATION of performance is wrong.

---
Orochi is 2x faster than Thuban and 1.1x faster than Gulftown. Enough said; that is exactly what the hardware performance counters return, not what some fictional workload in a benchmark says.

---
Heck, the bentmarks even affect Intel's APUs; Intel's Haswell is two times faster than Sandy Bridge and Ivy Bridge, while most bentmarks show it only having a marginal increase over them.

Is this poor judgement of the consumers? nope.
Is this poor judgement of the app devs? nope.
Is this poor judgement of the reviewer? yes.

---
Consumers/Enterprise don't buy new CPUs/GPUs for the same applications; they buy new CPUs/GPUs for better applications. If reviewers continue with the new-CPUs, same-old-application-versions methodology, then reviewers are simply alienating the community and the industry.

Users of Pirate Islands GPUs are not going to get that GPU family for pixel shaders. They are going to get that series of GPUs for compute shaders. New CPUs/GPUs mean that the old gets deprecated and the new gets faster.

---
Back to the topic of CMT: CMT is built for everything, not just servers, especially in Bulldozer/Piledriver/Steamroller/Excavator.

Dual-core CMT has two advantages over dual-core CMP.
Both cores get access to double the resources. That means if a single core in a dual-core CMT solution is active, it gets to hog those resources. If both cores are active, the doubled resources keep it at the same performance as a dual-core CMP processor.

So comparing dual-core CMT with single-core-active CMT is, well, iffy at best, since a single active core has access to more resources than each core does in dual-core mode. So all the benchmarks that look at four modules with the second core disabled, where it gets slightly higher scores, are just showing that the CMT model is working as planned.

The first advantage comes with single-core workloads, while the second comes with dual-core workloads. The second advantage I'm not well educated on, but it has to do with TLP.

CMT is not an alternative to SMT, it is an alternative to CMP.

SMT allows for two threads to use different resources per clock. (Thread A: ALU0/ALU1/FPU0/AGU1 ; Thread B: ALU2/ALU3/FPU1/FPU2/AGU0)
CMT allows for two threads to use the same resources per clock as the threads resources are duplicated. (Thread A: EX0/AGLU0/EX1/AGLU1 ; Thread B: EX0/AGLU0/EX1/AGLU1)

Intel's SMT core has the FPU included in the core, while AMD's CMT core does not include the FPU in the cores. The FPU in AMD's dual-core CMT processor is separate and uses the SMT model. (Thread A: P0/P2 ; Thread B: P1/P3)

For example if Intel decides to use CMT it would be like; (Thread A: ALU0/ALU1/AGU0/AGU1/FPU0/FPU1/MISC0 ; Thread B: ALU0/ALU1/AGU0/AGU1/FPU0/FPU1/MISC0)
The Intel implementation will have the Front-End and the L2 cache shared and doubled in size. So, if the SMT/CMP version was 2-way decode then the CMT version will be 4-way decode. If the SMT/CMP fetch was 16B then the CMT fetch will be 32B. If the SMT/CMP L2 cache is 256KB then the CMT L2 cache is 512KB.

If Intel saw that utilization was not as high as expected, then they could include SMT. (Thread A: ALU0/AGU1/FPU0/MISC0 ; Thread B: ALU1/AGU0/FPU2 ; Thread C: ALU1/AGU0/AGU1/MISC0 ; Thread D : ALU0/FPU0/FPU1)
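The SMT vs CMT unit assignments described above can be written down as plain sets. This is only a sketch of the post's description: the unit names and the 2-way/16B/256KB front-end figures come from the post, while the function names and the "one unit serves one thread per cycle" rule are simplifying assumptions.

```python
def smt_cycle_is_valid(pool, picks):
    """SMT: threads pick from one shared pool; a unit serves one thread per cycle."""
    used = set()
    for units in picks.values():
        if not units <= pool or used & units:
            return False
        used |= units
    return True

def cmt_units(core_units, threads):
    """CMT: every thread gets its own private copy of the core's units."""
    return {t: {f"{t}.{u}" for u in core_units} for t in threads}

def cmt_front_end(smt_front_end):
    """Shared front end and L2, doubled in size as in the post's Intel example."""
    return {name: 2 * size for name, size in smt_front_end.items()}

smt_pool = {"ALU0", "ALU1", "ALU2", "ALU3", "FPU0", "FPU1", "FPU2", "AGU0", "AGU1"}
smt_picks = {
    "A": {"ALU0", "ALU1", "FPU0", "AGU1"},
    "B": {"ALU2", "ALU3", "FPU1", "FPU2", "AGU0"},
}
print("SMT cycle valid:", smt_cycle_is_valid(smt_pool, smt_picks))

cmt = cmt_units(["EX0", "AGLU0", "EX1", "AGLU1"], ["A", "B"])
print("CMT threads share no units:", cmt["A"].isdisjoint(cmt["B"]))

print(cmt_front_end({"decode_width": 2, "fetch_bytes": 16, "l2_kb": 256}))
```

In the SMT case the split between threads can change every cycle, whereas in the CMT case the split is fixed by duplication and only the front end and L2 are shared.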

Ancalagon44 · May 6, 2014

Wow calm down NS. If BD really is better than Thuban, it shouldn't be that hard to find benchmarks proving that, right?
