[WCCF] AMD To Drop CMT, Welcome back SMT?

Arkaign · May 6, 2014

Ancalagon44 said:
Yeah, also true. This is all off topic, I really just posted to say that it is demonstrably false that the Pentium 4 has higher IPC than K8. Heck, as we have seen, K7 has higher IPC than P4 99% of the time - never mind K8.

Really interested to see what AMD will do regarding CMT, but I'm not convinced they are going to do away with it just yet.

Very much so.

The Northwood at 3.2Ghz+ left Athlon XP behind in raw performance, but it most certainly wasn't due to Northwood being faster clock for clock. It was due to the cache and clock speed hitting the sweet spot for P4, along with hyperthreading and the AXP not being able to reasonably hit the same clock speeds.

Even the first K8s were only about on par with P4 (3000+/3200+). But when AMD launched the 3500+ and beyond, combined with Prescott falling totally flat on it's face (lower IPC than Northwood in many cases!), that they ran away with it, all the way to Conroe with ease.

CPU Performance as a package =! IPC.

P4 3.2C handily beat AXP 3200+ in performance
AXP handily beat P4 3.2C in IPC

NostaSeronx · May 6, 2014

Ancalagon44 said:
Wow calm down NS. If BD really is better than Thuban, it shouldn't be that hard to find benchmarks proving that, right?

That's assuming that there are benchmarks that are properly optimized for AMD 15h. I'm not a reviewer and I'm way to lazy to learn how to compile just to prove that I am right.

If you use slow buns Open64 from AMD. AMD's Bulldozer is faster than Stars. If you use a much more recent compiler like it was just drafted three months ago. Then, you have AMD's Bulldozer that is still faster than Stars.

AMD 00h/10h is perfectly okay running Pentium 4/Core 2 optimized code.
AMD 15h hates running those, in fact it slows down execution time by two times.

So instead of finishing an operation in ~9 cycles(Log) or ~6 cycles(Exp) it takes ~18 to 30 cycles(Log) or ~13 to 21 cycles(Exp). In comparison for Stars it takes ~17 to 30 cycles(Log) or ~12 to 26 cycles(Exp).

So why bother trying to look for that one perfectly compiled/hand written application. When majority of them give 15h, code that takes ~>2x to finish.

Probably the reason why this happened is because AMD laid off the software optimization group. Well guess what they are back for bdver4, so don't be surprised when Excavator somehow is 4x the performance of Bulldozer/Piledriver.

AtenRa · May 6, 2014

Enigmoid said:
I don't know where you are getting pricing but the E5 2630 was/is cheaper at the 63xx series launch.

http://www.anandtech.com/show/6508/the-new-opteron-6300-finally-tested/2

E5-2630 - $649
6376 - $703

The 6376 is also using more power. It is slower than the E5-2660 (8 cores) and faster than the E5-2630 (6 cores).

From the same link you quoted, it has a Server price table bellow the CPU prices.

The Opteron 6376 Server costs $4225
The XEON E5 2630 Server costs $5008.

For the same performance you pay $753 more for the XEON, close to 19% more. Yes the Xeon has a little better energy efficiency but you pay upfront %19 more to acquire it.

From the same link,
The XEON 2660 use 84Wh, i will take 80wh for XEON 2630

80wh = 0.08Kwh * 24 hours * 365 days = 700,8Kwh per year
Lets assume you pay 0.2cents per Kwh

700,8Kwh * 0.2 = $140,16 per year

The Opteron 6367 use 95wh

95wh = 0.095Kwh * 24 hours * 365 days = 832,2Kwh per year
Lets assume you pay 0.2cents per Kwh

832,2Kwh * 0.2 = $166,44 per year

The difference is $26,28 per year.

Now, you paid $753 more for the XEON Server, then it will take you ~28 years to get even.

So, which of them is the better product ?? :whiste:

Now go to dell.com and custom build a Dell 720 with Intel Dual XEON E5 2640 and Dell 715 with AMD Dual Opteron 6367 and watch for the price.

Enigmoid said:
As far as die sizes go the 6376 is a MCM containing two 315 mm^2 dies. The E5-2630 is 412 mm^2 for the 8 core die. I'm not sure on how much cost different there is on that.

I have no idea what yields Intel has on that 412mm2 die or AMD for the 315mm2 die, but both of them have high margins in server parts so that is not of big consequence.

Enigmoid · May 6, 2014

NostaSeronx said:
With irrelevant bentmarks, nice job, very exceptional, much pride.

---
Again this forum provides a perfect example of very limited education in actual real world capabilities of architectures. Cinebench doesn't agree with you SO YOUR WRONG! Anand's half-decade old bentmarks don't agree with your numbers SO YOUR WRONG! If Anandtech doesn't prove my point; I'll go to another review website that uses the same half-decoade bentmarks. To prove you wrong, NOT!

Even with relevant benchmarks they are no where near actual enterprise or consumer optimization. So benchmarks just show an estimation range of said product but never give actual performance. If you have results from a bentmark that means the ESTIMATION of performance is wrong.

---
Orochi is 2x faster than Thuban and 1.1x faster than Gulftown. Enough said that is exactly what the hardware performance counter returns back not some fictional workload in a benchmark says.

---
Heck the bentmarks even effect Intel's APUs; Intel's Haswell is two times faster than Sandy Bridge and Ivy Bridge. While most bentmarks show it only having a marginal increase over them.

Is this poor judgement of the consumers? nope.
Is this poor judgement of the app devs? nope.
Is this poor judgement of the reviewer? yes.

I'm confused as to what you are even getting at. AT's server tests did not show 2x gains. Are you looking at purely theoretical numbers? Because that only matters if you can get the performance out of the system. Cell was very impressive on paper but real world, not so much. You may think all benchmarks are just 'estimates' but many of them simulate typical workloads. In fact, enterprise benchmarks are designed to be as 'real world' as possible. Your comment was in reference to what I believed to be consumer products and consumers aren't going to just jump ship on their software when they upgrade.

HW is nowhere close to 2x the speed of Ivy on average.

The scientific methodology is very results driven

Well what do you consider relevant? I already said that I am NOT looking at benchmarks were BD will be unduely optimized, including AVX, FMA, etc. as these are a measure of instruction set support, rather than architectural prowess. The fact also remains that a lot of companies are still using a lot of really old software.

I invite you to provide some proof to your claims.

Consumers/Enterprise don't buy new CPUs/GPUs for the same applications they buy new CPUs/GPUs for better applications. If reviewers continue to go with new CPUs same old version applications methodology. Then, reviewers are simply alienating the community and the industry.

Users of Pirate Islands GPUs are not going to get that GPU family for pixel shaders. They are going to get that series of GPUs for compute shaders. New CPU/GPUs mean that the old gets depreciated and the new gets faster.

Oh, lots of people buy the new equipment for the same programs (or newer versions). They buy a server to run visualization on or for rendering. Performance on 3dsmax 2012 is very indicative of performance on 3ds max 2014; much more so than looking at performance on something unrelated.

Back to the topic of CMT; CMT is built for everything not just servers especially Bulldozer/Piledriver/Steamroller/Excavator.

Dual-core CMT has two advantages of Dual-core CMP;
Both cores get access to double the resources. That means if a single core in a dual core CMT solution is active it gets to hog those resources. If both cores are active the doubled resources keep it at the same performance of a Dual-core CMP processor.

So comparing Dual-core CMT with a single core active CMT is well iffy at best. Since, single-core has access to more resources than dual-core. So, all benchmarks that are observing four modules with the second core disabled where it gets slightly higher benchmarks. Is just showing that the CMT model is working as planned.

I'm not sure if I am understanding you correctly (probably not). BD/PD/ and Kaveri don't get access to all the resources.

One thread has access to only ONE integer unit (2 ALUs). Never does one thread become 4-wide. Look at IDC's CB benchmarks, loading cores first improves performance but an additional 80% is achieved with the second thread. Note that AMD's implementations share the front end (on BD the decode was 4 wide whether one core or two cores were used, this has been fixed and now subsequent designs use two 4 wide decoders- a single thread never gets 8 wide decoding). This sharing of the front end is why singlethread performance is sometimes higher with one thread loaded, not because of execution resources.

The second advantage comes with dual core workloads where the first one is single core workloads. The second advantage I'm not well educated with but it has to do with TLP.

CMP is not an alternative to SMT, it is an alternative to CMP.

SMT allows for two threads to use different resources per clock. (Thread A: ALU0/ALU1/FPU0/AGU1 ; Thread B: ALU2/ALU3/FPU1/FPU2/AGU0)
CMT allows for two threads to use the same resources per clock as the threads resources are duplicated. (Thread A: EX0/AGLU0/EX1/AGLU1 ; Thread B: EX0/AGLU0/EX1/AGLU1)

Intel's SMT core has the FPU included in the core while AMD's CMT core does not include the FPU in the cores. The FPU in AMD's dual-core CMT processor is separate and uses the SMT model. (Thread A: P0/P2 ; Thread B; P1/P3)

For example if Intel decides to use CMT it would be like; (Thread A: ALU0/ALU1/AGU0/AGU1/FPU0/FPU1/MISC0 ; Thread B: ALU0/ALU1/AGU0/AGU1/FPU0/FPU1/MISC0)
The Intel implementation will have the Front-End and the L2 cache shared and doubled in size. So, if the SMT/CMP version was 2-way decode then the CMT version will be 4-way decode. If the SMT/CMP fetch was 16B then the CMT fetch will be 32B. If the SMT/CMP L2 cache is 256KB then the CMT L2 cache is 512KB.

If Intel saw utilization was not as high expected then they could include SMT. (Thread A: ALU0/AGU1/FPU0/MISC0 ; Thread B: ALU1/AGU0/FPU2 ; Thread C: ALU1/AGU0/AGU1/MISC0 ; Thread D : ALU0/FPU0/FPU1)

This is really going to depend on the implementation.

AtenRa said:
From the same link you quoted, it has a Server price table bellow the CPU prices.

The Opteron 6376 Server costs $4225
The XEON E5 2630 Server costs $5008.

They are running different setups. Find where they tried to normalize to get power numbers. Those are not even heavily loaded numbers.
The fact remains that the 6200 or 6300 series has not be successful in servers at all, despite its lower price (and selling more for less while good for the consumer is not good for AMD).

Idontcare · May 6, 2014

Enigmoid said:
This is really going to depend on the implementation.

I quote you out of context but only to say this is the bottom-line for CMT itself.

What you get is entirely dependent on what you designed to get, what tradeoffs you made and what sort of performance issues you were willing to accept in the pursuit of die-savings.

CMT is, and will always be, the poor man's CMP. No way around that.

NostaSeronx · May 6, 2014

Idontcare said:
What you get is entirely dependent on what you designed to get, what tradeoffs you made and what sort of performance issues you were willing to accept in the pursuit of die-savings.

CMT is, and will always be, the poor man's CMP. No way around that.

CMT isn't made to save die area, it is meant to increase throughput of two cores.

Bulldozer CMP Core: ~11 mm²
Steamroller CMP Core: ~13 mm²

2 * BCMPC/SCMPC => ~22 mm²/~26 mm²

CMP to CMT is not built for die area savings. In no way is it a poor mans implementation of CMP.

2 Bulldozer/Piledriver CMP Cores:
2 * 32KB L1i
2 * 8 Byte Fetches
2 * 2-way Decodes
2 * x86-64 Core (2 ALU + 2 AGLU, 16KB L1d)
2 * FPU (128-bit FMAC + 128-bit FMISC)
2 * 1 MB L2

1 Bulldozer/Piledriver CMT Module:
1 * 64KB L1i
1 * 16 Byte Fetch
1 * 4-way decode
2 * x86-64 Cores (2 ALU + 2 AGLUs, 16KB L1d)
1 * FPU (2 * (128-bit FMAC + 128-bit FMISC))
1 * 2 MB L2

2 Steamroller/Excavator CMP Cores:
2 * 48KB L1i
2 * 16 Byte Fetches
2 * 4-way Decodes
2 * x86-64 Core (4 ALU + 4 AGLU, 16KB L1d)
2 * FPU (256-bit FMAC + 256-bit FMISC)
2 * 1 MB L2

1 Steamroller/Excavator CMT2 Module:
1 * 96KB L1i
2 * 16 Byte Fetches
2 * 4-way Decodes
2 * x86-64 Cores (4 ALU + 4 AGLUs, 16KB L1d)
1 * FPU (2 * (256-bit FMAC + 256-bit FMISC))
1 * 2 MB L2

NTMBK · May 6, 2014

NostaSeronx said:
CMT isn't made to save die area, it is meant to increase throughput of two cores.

Bulldozer CMP Core: ~11 mm²
Steamroller CMP Core: ~13 mm²

2 * BCMPC/SCMPC => ~22 mm²/~26 mm²

CMP to CMT is not built for die area savings. In no way is it a poor mans implementation of CMP.

2 Bulldozer/Piledriver CMP Cores:
2 * 32KB L1i
2 * 8 Byte Fetches
2 * 2-way Decodes
2 * x86-64 Core (2 ALU + 2 AGLU, 16KB L1d)
2 * FPU (128-bit FMAC + 128-bit FMISC)
2 * 1 MB L2

1 Bulldozer/Piledriver CMT Module:
1 * 64KB L1i
1 * 16 Byte Fetch
1 * 4-way decode
2 * x86-64 Cores (2 ALU + 2 AGLUs, 16KB L1d)
1 * FPU (2 * (128-bit FMAC + 128-bit FMISC))
1 * 2 MB L2

2 Steamroller/Excavator CMP Cores:
2 * 48KB L1i
2 * 16 Byte Fetches
2 * 4-way Decodes
2 * x86-64 Core (4 ALU + 4 AGLU, 16KB L1d)
2 * FPU (256-bit FMAC + 256-bit FMISC)
2 * 1 MB L2

1 Steamroller/Excavator CMT2 Module:
1 * 96KB L1i
2 * 16 Byte Fetches
2 * 4-way Decodes
2 * x86-64 Cores (4 ALU + 4 AGLUs, 16KB L1d)
1 * FPU (2 * (256-bit FMAC + 256-bit FMISC))
1 * 2 MB L2

1. You just made up those numbers for a theoretical CMP core...
2. Steamroller doesn't have 256 bit FMACs.
3. Excavator and Steamroller aren't the same thing.
4. Steamroller doesn't have 4 ALUs and 4 AGUs per core.

EDIT: Added another point...

parvadomus · May 6, 2014

NostaSeronx said:
CMT isn't made to save die area, it is meant to increase throughput of two cores.

Bulldozer CMP Core: ~11 mm²
Steamroller CMP Core: ~13 mm²

2 * BCMPC/SCMPC => ~22 mm²/~26 mm²

CMP to CMT is not built for die area savings. In no way is it a poor mans implementation of CMP.

2 Bulldozer/Piledriver CMP Cores:
2 * 32KB L1i
2 * 8 Byte Fetches
2 * 2-way Decodes
2 * x86-64 Core (2 ALU + 2 AGLU, 16KB L1d)
2 * FPU (128-bit FMAC + 128-bit FMISC)
2 * 1 MB L2

1 Bulldozer/Piledriver CMT Module:
1 * 64KB L1i
1 * 16 Byte Fetch
1 * 4-way decode
2 * x86-64 Cores (2 ALU + 2 AGLUs, 16KB L1d)
1 * FPU (2 * (128-bit FMAC + 128-bit FMISC))
1 * 2 MB L2

2 Steamroller/Excavator CMP Cores:
2 * 48KB L1i
2 * 16 Byte Fetches
2 * 4-way Decodes
2 * x86-64 Core (4 ALU + 4 AGLU, 16KB L1d)
2 * FPU (256-bit FMAC + 256-bit FMISC)
2 * 1 MB L2

1 Steamroller/Excavator CMT2 Module:
1 * 96KB L1i
2 * 16 Byte Fetches
2 * 4-way Decodes
2 * x86-64 Cores (4 ALU + 4 AGLUs, 16KB L1d)
1 * FPU (2 * (256-bit FMAC + 256-bit FMISC))
1 * 2 MB L2

CMT was used by AMD to save FPU area, mainly to remain competitive at servers (where floating point is less used), but the shared nature just killed the IPC and CMT in the process too. It was just a very weak design. SMT is much much better in the long run.

NostaSeronx · May 6, 2014

NTMBK said:
1. You just made up those numbers for a theoretical CMP core...

Nope, that is what the CMP core of 15h is.

NTMBK said:
2. Steamroller doesn't have 256 bit FMACs.

Well it has 4 x 64-bit FMACs in the upper and lower datapaths. This is an upgrade from 2 x 64-bit FMACs in the upper and lower datapaths.

NTMBK said:
3. Excavator and Steamroller aren't the same thing.

Steamroller never came out, SteamrollerB did. SteamrollerB and Excavator use the same FEOL mask but different BEOL M0/M1 layers. Not all the units in SteamrollerB are connected to the BEOL.

parvadomus said:
CMT was used by AMD to save FPU area, mainly to remain competitive at servers (where floating point is less used)...

That is a guess on your part not the engineers of AMD.
CMT was used to give two cores a way to access double the resources they would not otherwise have. There is also an increase in TLP but I'm uneducated in the TLP enhancement.

parvadomus said:
...but the shared nature just killed the IPC and CMT in the process too. It was just a very weak design. SMT is much much better in the long run.

In real world workloads the cores do not share when both cores are active. While if one core is active majority of the time it gets double resources; Full Front-end Access and Full FPU access.

parvadomus · May 6, 2014

NostaSeronx said:
CMT was used to give two cores a way to access double the resources they would not otherwise have. There is also an increase in TLP but I'm uneducated in the TLP enhancement.In real world workloads the cores do not share when both cores are active. While if one core is active majority of the time it gets double resources; Full Front-end Access and Full FPU access.

The only double resources a "core" access there is the L1i cache (which wasnt double) and just got thrashed by the access of both cores.
The rest of the resources (ALUs) are per core, only the FPU was shared, and it was weaker than phenom's :|

NostaSeronx · May 6, 2014

parvadomus said:
The only double resources a "core" access there is the L1i cache (which wasnt double) and just got thrashed by the access of both cores.

Except fetches from the L1i is decoupled from the cores and temporally multithreaded. The only thrashing possible is if the code wasn't aligned to 16Bs.

Cycle A - 16B Fetch for Core A
Cycle B - 16B Fetch for Core B
Cycle C - 16B Fetch for Core A
Cycle D - 16B Fetch for Core B, up to 4 macro-op decode for Core A.
Cycle E - 16B Fetch for Core A, up to 4 macro-op decode for Core B.
Cycle F - 16B Fetch for Core B, up to 4 macro-op decode for Core A.
so on.

If this Fetching arrangement is so weak it is cataclysmically weak for KV-A1.

Cycle A - 16B Fetch for Core A
Cycle B - Nothing
Cycle C - 16B Fetch for Core B
Cycle D - up to 4 macro-op decode for Core A, up to 4 macro-op decode for Core B.
Cycle E - 16B Fetch for Core A
Cycle F - Nothing
so on.

parvadomus said:
The rest of the resources (ALUs) are per core, only the FPU was shared, and it was weaker than phenom's :|

The FPU was not weaker than K8's, you can do FMA's when MULs+ADDs are present. Also, the FMAC can be active just for ADDs or MULs. While the FPU in K8 can only peak out if there is an ADD and a MUL.

Vesku · May 6, 2014

Dirk just had AMD hatch the wrong egg with CMT. Should have spent those Bulldozer billions on fully integrating the GPU with their HSA initiative even if that meant limping along on K8 based CPU technology for a few years longer. Unfortunately, imo, Dirk's background with Alpha meant he was far more enthusiastic about turning the CPU aspect into a budget Power/Sparc solution than fully embracing the fully integrated CPU+GPU SoC. Still a bit mind boggling an experienced industry guy like that fell into the Big Iron/P4 mind trap. The mistake was compounded by the failure to ensure the individual cores were faster than their immediate predecessor.

The more seamlessly the GPU can handle FPU tasks the more useful AMD's Bulldozer+ designs become.

NostaSeronx · May 6, 2014

Vesku said:
Dirk just had AMD hatch the wrong egg with CMT.

CMT isn't from Dirk Meyer it is from Hector Ruiz. If we are going to name CEOs in a certain development cycle.

CEOs at R&D frames of time;
00h/10h architectures -> Jerry Sanders/Hector Ruiz
15h architectures (K9/BD) -> Hector Ruiz
14h architectures -> Dirk Meyer
16h architectures -> Dirk Meyer/Rory Read
15h architectures (K15/BD2-PD) -> Dirk Meyer
15h architectures (K15.5/SRB-XV) -> Rory Read

AtenRa · May 6, 2014

SMT and CMT are all about Throughput. Each implementation has its pros and cons. There is not straight forward answer to which of them is better in general.

NostaSeronx · May 6, 2014

AtenRa said:
SMT and CMT are all about Throughput. Each implementation has its pros and cons. There is not straight forward answer to which of them is better in general.

SMT is about Utilization and CMT is about Throughput.

Intel architectures and old AMD architectures had over provisioning which meant there was more units than one logical thread could handle. SMT allowed those resources to be utilized with positive or negative results by adding a second logical thread.

CMT1 is a mixed bag; You have the front-end being VMT, the cores being CMP, the FPU Front-end being VMT and the FPU Execution being SMT.
CMT2 is also a mixed bag; You have the front-end being SMT, the cores being CMP, the FPU Front-end being SMT and the FPU Execution being SMT.
CMP Modules(1 core) -> CMT1/2 Modules(2 cores)

Vesku · May 6, 2014

NostaSeronx said:
CMT isn't from Dirk Meyer it is from Hector Ruiz. If we are going to name CEOs in a certain development cycle.

CEOs at R&D frames of time;
00h/10h architectures -> Jerry Sanders/Hector Ruiz
15h architectures (K9/BD) -> Hector Ruiz
14h architectures -> Dirk Meyer
16h architectures -> Dirk Meyer/Rory Read
15h architectures (K15/BD2-PD) -> Dirk Meyer
15h architectures (K15.5/SRB-XV) -> Rory Read

Dirk Meyer was at the top of AMD's cpu design management pyramid prior to becoming CEO.

http://usatoday30.usatoday.com/tech/products/2008-07-17-4237049458_x.htm

The "whatif" AMD actually aggressively pursued ATI SoC integration with both hardware and software investments instead of pumping billions into Bulldozer 45nm and Round2: Bulldozer 32nm is only surpassed by the "whatif" AMD had merged with Nvidia.

Granted, can't change the past so they are best off embracing the CMT they are now very knowledgeable in since it will improve in value greatly the farther along they get in fully integrating CPU and GPU.

NostaSeronx · May 6, 2014

Vesku said:
Dirk Meyer was at the top of AMD's cpu design management pyramid prior to becoming CEO.

Mitch Alsup was the Chief Architect for the original Bulldozer. That was meant to compete with Pentium 4 Tejas and Pentium 4 Nehalem.

http://www.linkedin.com/pub/mitch-alsup/7/153/869

Vesku · May 6, 2014

NostaSeronx said:
Mitch Alsup was the Chief Architect for the original Bulldozer. That was meant to compete with Pentium 4 Tejas and Pentium 4 Nehalem.

http://www.linkedin.com/pub/mitch-alsup/7/153/869

Dirk was clearly the one in charge of overall direction when it came to CPU development:

June 4, 2004 - "Dirk Meyer is responsible for product development at AMD, and is therefore one of their key architects. He was the director of engineering in 1996 and before that was involved in the development of the Alpha 21064 and 212654 microprocessors. "

http://www.pcper.com/reviews/Shows-...-2004/Dirk-Meyer-AMD-and-John-Peddie-Research

Still a bit of a head-scratcher that they weren't taking lessons from Intel's P4 failures even as they were happening right in front of their eyes. Interesting to think of the "might have been" of AMD having already rolled out fully HSA enabled SoCs and now introducing CMT CPUs to them rather than vice versa.

NostaSeronx · May 6, 2014

Vesku said:
Dirk was clearly the one in charge of overall direction when it came to CPU development.

Head of Direction != Chief Architect. He simply made sure the Chief Architects were doing their work.

Mitch Alsup - Chief Architect 2003-2007; Bulldozer1
Mister X - Chief Architect 2008-2009; Downscaling Bulldozer1 from 5+ GHz to ~3.5 GHz.
Mike Butler - Chief Architect 2010-2011; Bulldozer2, Piledriver, Steamroller
Mister Y (from India) - Chief Architect 2012-2014; SteamrollerB, Excavator

Vesku · May 6, 2014

NostaSeronx said:
Head of Direction != Chief Architect. He simply made sure the Chief Architects were doing their work.

Mitch Alsup - Chief Architect 2003-2007; Bulldozer1
Mister X - Chief Architect 2008-2009; Downscaling Bulldozer1 from 5+ GHz to ~3.5 GHz.
Mike Butler - Chief Architect 2010-2011; Bulldozer2, Piledriver, Steamroller
Mister Y (from India) - Chief Architect 2012-2014; SteamrollerB, Excavator

You are the one who brought up Chief Architects, I'm saying Dirk was the major decision maker in pushing through CMT design before full CPU+GPU integration. Rather than pursue HSA first and riding the K8 based architectures a couple of years longer. Which his being head decision maker for AMD processors during the time period backs up.

Although it was done in the wrong order imo they are best off sticking with CMT. Should pay off if they can fully realize HSA. If they can't see through HSA it won't matter whether it's CMT or SMT because they'll be solely reliant on becoming a design company for others to stay afloat.

NostaSeronx · May 6, 2014

The HSA doesn't care about the design focus of the CPU or GPU. HSA doesn't care if the CPU is CMP or CMT or if the threading is VMT or SMT or ST. HSA doesn't care if the core is x86-64, ARM, Power, PowerPC, EPIC, etc.

---
The Chief Architect for FSAIL didn't come up with it till after he was done with Graphic Core Next. CMT was thought up way before FSAIL ever came up in discussions. The FSA concept didn't appear till 2011 while the CMT concept appeared in 2005.

---
The guy who made FSAIL and Graphic Core Next is now at Nvidia. He had creative input with Kepler and was/is the Chief for Maxwell.

Which could explain why AMD opened the specification up to other corporations to piggy back on their R&D.

Idontcare · May 6, 2014

AtenRa said:
SMT and CMT are all about Throughput. Each implementation has its pros and cons. There is not straight forward answer to which of them is better in general.

This.

The challenge is multi-fold, in part because the performance of the architecture is strongly dependent on the software that will be employed by the end-user, something which the CPU architects have zero control over.

NostaSeronx · May 6, 2014

Idontcare said:
...in part because the performance of the architecture is strongly dependent on the software that will be employed by the end-user, something which the CPU architects have zero control over.

This only applies to the floating point unit in the current design. If you use only the generic x86-64 instruction set without Vector Integer/Floating Point(x87 -> AVX/etc). Then, the CMT solution is the better design in a thread to thread throughput metric.

SMT you can't have two threads doing the same thing. While with CMT you can have two threads doing the same thing.

One is built for utilization.
One is built for throughput.
---
People seem to have an unrealistic expectation that Flops = IPC, or actual performance of a generic core.

TuxDave · May 7, 2014

NostaSeronx said:
...

This only applies to the floating point unit in the current design. If you use only the generic x86-64 instruction set without Vector Integer/Floating Point(x87 -> AVX/etc). Then, the CMT solution is the better design in a thread to thread throughput metric.

...

Let's not promote bad behavior.

(and yes, I know some customers are still in love with x87 but they've been dealt so many performance nerfs that I think thread to thread throughput is a secondary concern)

[WCCF] AMD To Drop CMT, Welcome back SMT?

Lifer

Diamond Member

Lifer

Platinum Member

Elite Member

Diamond Member

Lifer

Senior member

Diamond Member

Senior member

Diamond Member

Diamond Member

Diamond Member

Lifer

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Diamond Member

Elite Member

Diamond Member

Lifer