[WCCF] AMD To Drop CMT, Welcome back SMT?

NostaSeronx

Diamond Member
Sep 18, 2011
3,689
1,224
136
CMT is here to stay, it does not compete with SMT. The next CMT architecture is CCMT which is Concurrent Cluster Multithreading, which is CMT and SMT combined.

In the case of a 15h-like architecture, it would look like this:

ALU/AGLU cluster 0 + ALU/AGLU cluster 1 + ALU/AGLU cluster 2 + ALU/AGLU cluster 3. All of these clusters serve a single thread and share a unified scheduler, retire logic, etc.

Each cluster has its own registers and an L0d cache that is write-through to the L1d, which in turn is write-through to the L2 cache.
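
The write-through chain can be sketched in a few lines. This is a toy model: the class name and the addresses are made up for illustration, not taken from any AMD documentation. The point is that every store updates the small per-cluster L0d *and* the backing L1d immediately, so the L1d always holds current data and the L0d never needs a dirty-eviction path.

```python
class WriteThroughL0:
    """Toy per-cluster L0d that writes through to a shared backing L1d."""

    def __init__(self, backing):
        self.lines = {}         # addr -> value (tiny per-cluster L0d)
        self.backing = backing  # shared L1d (itself write-through to L2)

    def store(self, addr, value):
        self.lines[addr] = value
        self.backing[addr] = value  # write-through: L1d updated on every store

    def load(self, addr):
        if addr in self.lines:      # L0d hit
            return self.lines[addr]
        value = self.backing[addr]  # L0d miss: fill from L1d
        self.lines[addr] = value
        return value

l1d = {}
l0d = WriteThroughL0(l1d)
l0d.store(0x40, 7)
assert l1d[0x40] == 7  # visible in L1d immediately, no writeback needed
```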
 
Last edited:

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
2015 guys.

Mark it in your diaries, although I would not be surprised if it launched mid 2016.

Actually I don't care when it does, just get it right.

http://wccftech.com/amds-high-performance-processor-cores-coming-2015-giving-modular-architecture/

Godspeed, team red.

Not surprised. Bulldozer and associates were AMD's Pentium 4. Barely comparable with Phenom II (Kaveri finally catches up in IPC). Further proof is that even AMD's low-power architecture (Jaguar/Puma) boasts equivalent IPC, though at lower clocks.

Taking pieces from Jaguar and scaling them up, possibly using bits from Phenom, could be a nice increase.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
59
91
What's wrong with CMT? CMT is a nice microarchitectural approach.

I think CMT has a bad name only because AMD chose to implement CMT while simultaneously cutting down the core's single-threaded IPC way too much. (which was done to enable high clockspeeds at the expense of lowered IPC)

Also, CMT and SMT are not mutually exclusive. You can implement both in the same microarchitecture as they are addressing different opportunities.

Had AMD chosen to stick with beefier (and slower-clocked) cores, adding CMT as an added benefit, I think the enthusiast world would look upon CMT with a far different set of expectations.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,689
1,224
136
CMT has a bad name because everyone thinks that it is the cause for "bad performance." Sadly, the choice to go CMT was not one that reduced single-threaded IPC.
 

mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
I think CMT has a bad name only because AMD chose to implement CMT while simultaneously cutting down the core's single-threaded IPC way too much. (which was done to enable high clockspeeds at the expense of lowered IPC)

Not because of AMD only. Sun also tried to implement something along the CMT lines with Rock and also failed big time. They also got a big, hot, slow processor in the process. Coincidence?

Or maybe CMT just makes sense in clock-speed-uber-alles designs, which Sun's Rock also was. I remember reading a text from Andy Glew where he was advocating CMT for Intel around the time of the Willamette design. CMT would address some things he saw as shortcomings in that design.

Whatever the case, it seems that whatever benefits CMT brings are simply not enough to entice Qualcomm, Intel, Apple and the other successful chip designers to try the concept in an actual product.
 

mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
Here's Andy Glew's text. I wonder if Cogman, Tuxdave, Cto and others could chime in and say something.

Andy Glew said:
https://groups.google.com/forum/#!msg/comp.arch/dZvMy_oLiwc/VmpzO6m_0zwJ

BRIEF:

AMD's Bulldozer is an MCMT (MultiCluster MultiThreaded)
microarchitecture. That's my baby!

DETAIL:

(...)

I can't express how good it feels to see MCMT become a product. It's
been public for years, but it gets no respect until it is in a product.
It would have been better if I had stayed at Intel to see it through.
I know that I won't get any credit for it. (Except from some of the guys
who were at AMD at the time.) But it feels good nevertheless.

The only bad thing is that some guys I know at AMD say that Bulldozer is
not really all that great a product, but is shipping just because AMD
needs a model refresh. "Sometimes you just gotta ship what you got." If
this is so, and if I deserve any credit for CMT, then I also deserve
some of the blame. Although it might have been different, better, if I
had stayed.

I came up with MCMT in 1996-2000 while at the University of Wisconsin.
It became public via presentations.

I brought MCMT back to Intel in 2000, and to AMD in 2002.

I was beginning to despair of MCMT ever seeing the light of day. I
thought that when I left AMD in 2004, the MCMT ideas may have left with
me. Apparently not. I must admit that I am surprised to see that the
concept endured so many years - 5+ years after I left, 7+ years to
market. Apparently they didn't have any better ideas.

True, there were rumors. For example, Chuck Moore presented a slide
with Multicluster Multithreading on it to analysts in 2004 or 2005. But
things went quiet. There were several patents filed, with diagrams that
looked very much like the ones I drew for the K10 proposal. But, one
often sees patent applications for cancelled projects.

Of course, AMD has undoubtedly changed and evolved MCMT in many ways
since I first proposed it to them. For example, I called the set of an
integer scheduler, integer execution units, and an L1 data cache a
"cluster", and the whole thing, consisting of shared front end, shared
FP, and 2 or more clusters, a processor core. Apparently AMD is calling
my clusters their cores, and my core their cluster. It has been
suggested that this change of terminology is motivated by marketing, so
that they can say they have twice as many cores.

My original motivation for MCMT was to work around some of the
limitations of Hyperthreading on Willamette. E.g. Willamette had a very
small L0 data cache, 4K in some of the internal proposals, although it
shipped at 8K. Two threads sharing such a tiny L0 data cache thrash.
Indeed, this is one of the reasons why hyperthreading is disabled on
many systems, including many current Nhm based machines with much larger
closest-in caches.

At the time, the small L0s were a given. You couldn't build a
Willamette style "fireball" high frequency machine, and have a much
bigger cache, and still preserve the same small cache latency.

To avoid threads thrashing each other, I wanted to give each thread
their own L0. But, you can't do so, and still keep sharing the
execution units and scheduler - you can't just build a 2X larger array,
or put two arrays side by side, and expect to have the same latency.
Wires. Therefore, I had to replicate the execution units, and enough of
the scheduler so that the "critical loop" of Scheduler->Execution->Data
Cache was all isolated from the other thread/cluster. Hence, the form
of multi-cluster multi-threading you see in Bulldozer.

True, there are differences, and I am sure more will become evident as
more Bulldozer information becomes public. For example, although I came
up with MCMT to make Willamette-style threading faster, I have always
wanted to put SpMT, Speculative Multithreading, on such a substrate.
SpMT has potential to speed up a single thread of execution, by
splitting it up into separate threads and running the separate threads
on different clusters, whereas Willamette-style hyperthreading, and
Bulldozer-style MCMT (apparently), only speed up workloads that have
existing independent threads. I still want to build SpMT. My work at
Wisconsin showed that SpMT on a Willamette substrate was constrained by
Willamette's poor threading microarchitecture, so naturally I had to
first create the best explicit threading microarchitecture I could, and
then run SpMT on top of it.

If I received arrows in my back for MCMT, I received 10 times as many
arrows for SpMT. And yet still I have hope for it. Unfortunately, I am
not currently working on SpMT. Haitham Akkary, the father of DMT,
continues the work.

I also tried, and still continue, to explore other ways of speeding up
single threads using multiple clusters.

Although I remain an advocate of SpMT, I have always recognized the
value of MCMT as an explicit threaded microarchitecture.

Perhaps I should say here that my MCMT had a significant difference from
clustering in, say, the Alpha 21264,
http://www.hotchips.org/archives/hc10/2_Mon/HC10.S1/HC10.1.1.pdf
Those clusters bypass to each other: there is a fast bypass within a
cluster, and a slightly slower (+1 cycle) bypass of results between
clusters. The clusters are execution units only, and share the data
cache. This bypassing makes it easy (or at least easier) to spread a
single thread across both clusters. My MCMT clusters, on the other
hand, do NOT bypass to each other. This motivates separate threads per
cluster, whether explicit or implicit.

I have a whole taxonomy of different sorts of clustering:
* fast vs slow bypass clusters
* fully bypassed vs. partially bypassed
* mechanisms to reduce bypassing
* physical layout of clusters
* bit interleaved datapaths
* datapaths flowing in opposite directions,
with bypassing where they touch
* what's in the cluster
* execute only
* execute + data cache
* schedule + execute + data cache
* renamer + schedule + execute + datacache
...
* what gets shared between clusters
* front-end
* renamer?
* data-cache - L0? L1? L2?
* TLBs...
* MSHRs...
* FP...

Anyway: if it has an L0 or L1 data cache in the cluster, with or
without the scheduler, it's my MCMT. If no cache in the cluster, not
mine (although I have enumerated many such possibilities).

Motivated by my work to use MCMT to speed up single threads, I often
propose a shared L2 instruction scheduler, to load balance between the
clusters dynamically. Although I admit that I only really figured out
how to do that properly after I left AMD, and before I joined Intel.
How to do this is part of the Multi-star microarchitecture, M*, that is
my next step beyond MCMT.

Also, although it is natural to have a single (explicit) thread per
cluster in MCMT, I have also proposed allowing two threads per cluster.
Mainly motivated by SpMT: I could fork to a "runt thread" running in
the same cluster, and then migrate the runt thread to a different
cluster. Intra-cluster forking is faster than inter-cluster forking, and
does not disturb the parent thread.
But, if you are not doing SpMT, there is much less motivation for
multiple threads per cluster. I would not want to do that unless I was
also trying to build a time-switched lightweight threading system.
Which, as you can imagine if you know me, I have also proposed. In
fact, I hope to go to the SC'09 Workshop on that topic.

I will be quite interested to see whether Bulldozer's cluster-private L1
caches (in AMD's swapped terminology, core-private L1 caches) are write
through or write-back. Willamette's L0 was write-through. I leaned
towards write-back, because my goal was to isolate clusters from each
other, to reduce thrashing. Also, because write-back lends itself
better to a speculative versioning cache, useful for SpMT.

With Willamette as background, I leaned towards a relatively small L0
cache in the cluster. Also, such a small L0 can often be pitch-matched
with the cluster execution unit datapath. A big L1, such as Bulldozer
seems to have, nearly always has to lie out of the datapath, and
requires wire turns. Wire turns waste area. I have, from time to time,
proposed putting the alignment muxes and barrel shifters in the wire
turn area. I'm surprised that a large cluster L1 makes sense, but that's
the sort of thing that you can only really tell from layout.

Some posters have been surprised by sharing the FP. Of course, AMD's K7
design, with separate clusters for integer and FP, was already half-way
there. They only had to double the integer cluster. It would have been
harder for Intel to go MCMT, since the P6 family had shared integer and
FP. Willamette might have been easier to go MCMT, since it had separate FP.

Anyway... of course, for FP threads you might like to have
thread-private FP. But, in some ways, it is the advent of expensive FP,
like Bulldozer's 2 sets of 128 bit, 4x32 bit, FMAs, that justify integer
MCMT: the FP is so big that the overhead of replicating the integer
cluster, including the OOO logic, is a drop in the bucket.
You'd like to have per-cluster-thread FP, but such big FP workloads are
often so memory intensive that they thrash the shared-between-clusters
L2 cache: threading may be disabled anyways. As it is, you get good
integer threads via MCMT, and you get 1 integer thread and 1 FP thread.
Two FP threads may have some slowdown, although, again, if memory
intensive they may be blocking on memory, and hence allowing the other
FP thread to use the FP. But two purely computational FP threads will
almost undoubtedly block, unless the schedulers are piss-poor and can't
use all of the FP for a single thread (e.g. by being too small).

I certainly want to explore possibilities such as SpMT and other single
thread speedups. But I know that you can't build all the neat ideas in
one project. Apparently MCMT by itself was enough for AMD Bulldozer.
(Actually, I am sure that there are other new ideas in Bulldozer. Just
apparently not SpMT or spreading a single thread across clusters.) Look
at the time-lag: 10-15 years from when I came up with MCMT in
Wisconsin, 1996-2000. It is now 7-5 years from when I was at AMD,
2002-2004, and it will be another 2 years or so before Bulldozer is a
real force in the marketplace.

I don't expect to get any credit for MCMT. In fact, I'm sure I'm going
to get shit for this post. I don't care. I know. The people who were
there, who saw my presentations and read my proposals, know. But, e.g.
Chuck Moore wasn't there at start; he came in later. Even Mike Haertel,
my usual collaborator, wasn't there; he was hired in later, although
before Chuck. Besides, Mike Haertel thinks that MCMT is obvious.
That's cool, although I ask: if MCMT is obvious, then why isn't Intel
doing it? Companies like Intel and AMD need idea generating people like
me about once every 10 years. In between, they don't need new ideas.
They need new incremental improvements of existing ideas.

Anyway... It's cool to see MCMT becoming real. It gives me hope that my
follow-on to MCMT, M* may still, eventually, also become real.
 
Last edited:

NostaSeronx

Diamond Member
Sep 18, 2011
3,689
1,224
136
Not because of AMD only. Sun also tried to implement something along the CMT lines with Rock and also failed big time. They also got a big, hot, slow processor in the process. Coincidence?

Or maybe CMT just makes sense in clock-speed-uber-alles designs, which Sun's Rock also was. I remember reading a text from Andy Glew where he was advocating CMT for Intel around the time of the Willamette design. CMT would address some things he saw as shortcomings in that design.

Whatever the case, it seems that whatever benefits CMT brings are simply not enough to entice Qualcomm, Intel, Apple and the other successful chip designers to try the concept in an actual product.
Sun's CMT != AMD's CMT.

Sun's CMT is actually normal basic CMP. The closest AMD design to it is Kabini/Beema/etc.
 
Last edited:

rtsurfer

Senior member
Oct 14, 2013
733
15
76
What's wrong with CMT? CMT is a nice microarchitectural approach.

I think CMT has a bad name only because AMD chose to implement CMT while simultaneously cutting down the core's single-threaded IPC way too much. (which was done to enable high clockspeeds at the expense of lowered IPC)

Also, CMT and SMT are not mutually exclusive. You can implement both in the same microarchitecture as they are addressing different opportunities.

Had AMD chosen to stick with beefier (and slower-clocked) cores, adding CMT as an added benefit, I think the enthusiast world would look upon CMT with a far different set of expectations.

CMT has a bad name because everyone thinks that it is the cause for "bad performance." Sadly, the choice to go CMT was not one that reduced single-threaded IPC.


I am pretty sure you guys know more about this stuff than I do, but
here is an excerpt from the article that WCCFtech linked:

But if we look at the actual benchmarks, we see that the reality is different: AMD actually NEEDS those two dies to keep up with Intel’s single die. And even then, Intel’s chip excels in keeping response times short. The new CMT-based Opterons are not all that convincing compared to the smaller, older Opteron 6174 either, which can handle only 12 threads instead of 16, and just uses vanilla SMP for multithreading.

Let’s inspect things even closer… What are we benchmarking here? A series of database scenarios, with MySQL and MSSQL. This is integer code. Well, that *is* interesting. Because, what exactly was it that CMT did? Oh yes, it didn’t do anything special for integers! Each module simply has two dedicated integer cores. It is the FPU that is shared between two threads inside a module. But we are not using it here. Well, lucky AMD, best case scenario for CMT.

But let’s put that in perspective… Let’s have a simplified look at the execution resources, looking at the integer ALUs in each CPU.

The Opteron 6276 with CMT disabled has:

8 modules
8 threads
4 ALUs per module
2 ALUs per thread (the ALUs can not be shared between threads, so disabling CMT disables half the threads, and as a result also half the ALUs)
16 ALUs in total
With CMT enabled, this becomes:

8 modules
16 threads
4 ALUs per module
2 ALUs per thread
32 ALUs in total
So nothing happens, really. Since CMT doesn’t share the ALUs, it works exactly the same as the usual SMP approach. So you would expect the same scaling, since the execution units are dedicated per thread anyway. Enabling CMT just gives you more threads.

The Xeon X5650 with SMT disabled has:

6 cores
6 threads
3 ALUs per core
3 ALUs per thread
18 ALUs in total
With SMT enabled, this becomes:

6 cores
12 threads
3 ALUs per core
3 ALUs per 2 threads, effectively ~1.5 ALUs per thread
18 ALUs in total
So here the difference between CMT and SMT becomes quite clear: with single-threading, each thread has more ALUs with SMT than with CMT. With multithreading, each thread (effectively) has fewer ALUs than with CMT.

And that’s why SMT works, and CMT doesn’t: AMD’s previous CPUs also had 3 ALUs per thread. But in order to reduce the size of the modules, AMD chose to use only 2 ALUs per thread now. It is a case of cutting off one’s nose to spite one’s face: CMT is struggling in single-threaded scenarios, compared to both the previous-generation Opterons and the Xeons.

At the same time, CMT is not actually saving a lot of die-space: There are 4 ALUs in a module in total. Yes, obviously, when you have more resources for two threads inside a module, and the single-threaded performance is poor anyway, one would expect it to scale better than SMT.


Source: http://scalibq.wordpress.com/2012/02/14/the-myth-of-cmt-cluster-based-multithreading/
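
The article's ALU accounting can be captured in a toy model (the unit counts below are the article's; this is an illustration of the sharing argument, not anything cycle-accurate):

```python
def effective_alus(alus_per_unit, threads_per_unit, active_threads, dedicated):
    """ALUs available to each running thread in one module/core.

    dedicated=True  -> CMT: each thread owns a fixed slice of the ALUs.
    dedicated=False -> SMT: the active threads share all the unit's ALUs.
    """
    if dedicated:
        return alus_per_unit / threads_per_unit
    return alus_per_unit / active_threads

# Opteron 6276 module (CMT): 4 ALUs, up to 2 threads
assert effective_alus(4, 2, 1, dedicated=True) == 2.0   # 1 thread: still only 2
assert effective_alus(4, 2, 2, dedicated=True) == 2.0   # 2 threads: 2 each

# Xeon X5650 core (SMT): 3 ALUs, up to 2 threads
assert effective_alus(3, 2, 1, dedicated=False) == 3.0  # 1 thread: all 3
assert effective_alus(3, 2, 2, dedicated=False) == 1.5  # 2 threads: ~1.5 each
```

Which is exactly the article's point: SMT hands a lone thread everything, while CMT caps it at its cluster's dedicated share.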
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,689
1,224
136
A lot of people forget that with 00h/10h cores you could only do 3 ALU ops or 3 AGU ops. You could not do both per cycle, which limited you to 3 micro-ops per cycle.

With 15h cores you can do 2 ALU ops and 2 AGLU ops. You can do both per cycle, which gives you 4 micro-ops.

With Intel, since the Pentium 4 you could do several ALU and AGU ops in the same cycle. Which led to the Pentium 4 outrunning K8 clock-to-clock in IPC. High IPC doesn't always correlate with higher performance.
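
The issue-width claim above can be written out as a toy model. The `shared_lanes` flag is a stand-in for the K8/K10 arrangement of three lanes that each issue an ALU *or* an AGU op per cycle; real scheduling has many more constraints than this:

```python
def peak_int_uops(alu_pipes, agu_pipes, shared_lanes):
    """Peak integer micro-ops per cycle under the simplification above."""
    if shared_lanes:
        # K8/K10-style: each lane issues an ALU *or* an AGU op per cycle
        return max(alu_pipes, agu_pipes)
    # Family 15h-style: ALU and AGLU pipes issue independently
    return alu_pipes + agu_pipes

assert peak_int_uops(3, 3, shared_lanes=True) == 3   # 00h/10h: 3 uops/cycle
assert peak_int_uops(2, 2, shared_lanes=False) == 4  # 15h: 4 uops/cycle
```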
 
Last edited:

witeken

Diamond Member
Dec 25, 2013
3,899
193
106
CMT is here to stay, it does not compete with SMT. The next CMT architecture is CCMT which is Concurrent Cluster Multithreading, which is CMT and SMT combined.
The tweet in that article says CMT will be ditched.
 

dbcoopernz

Member
Aug 10, 2012
68
4
71
There doesn't seem to be anything (wccftech article aside) that actually says that CMT will be ditched.
 

PPB

Golden Member
Jul 5, 2013
1,118
168
106
What's wrong with CMT? CMT is a nice microarchitectural approach.

I think CMT has a bad name only because AMD chose to implement CMT while simultaneously cutting down the core's single-threaded IPC way too much. (which was done to enable high clockspeeds at the expense of lowered IPC)

Also, CMT and SMT are not mutually exclusive. You can implement both in the same microarchitecture as they are addressing different opportunities.

Had AMD chose to keep with beefier (and slower clocked) cores, adding CMT as an added benefit, I think the enthusiast world would look upon CMT with a far different set of expectations.

This, and this. CMT is bad-mouthed because of the design flaws in the Bulldozer uarch and is not judged on its own merits. You have to be blatantly ignorant in the matter to blame CMT for the handful of shortcomings and bad decisions lying around Bulldozer's design. Or is anyone going to imply that, for example, Bulldozer's extremely slow L3 is because of CMT, when K10's was as bad, if not worse?

One has to learn first what CMT actually affects in the rest of the design to judge its feasibility as a design decision by itself. The really lacking areas of the BD family design are just starting to be correctly addressed with Steamroller (and even there, other poor but inevitable design decisions had to be made to make the core HSA compliant). And even while departing a little from the purer CMT design that BD/PD was, the core design decision itself, streamlining the FP resources without losing performance, is still there. Instead of doubling the width of the FP pipelines as Sandy Bridge did over Nehalem, AMD went with even fewer FP resources from K10/K10.5 to BD, improving their efficiency to make them perform the same as prior designs.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
59
91
I am pretty sure you guys know more about this stuff than me, but
here is an excerpt from an article that Wccftech has linked in this article

Source: http://scalibq.wordpress.com/2012/02/14/the-myth-of-cmt-cluster-based-multithreading/

They are conflating CMT as a design methodology with the specific microarchitectural trade-offs that AMD chose in implementing their version of CMT.

Conceptually, CMT is not complicated and in the limit of full duplication of hardware resources provides the exact same benefits of adding an entirely new core to the chip.

It's when you choose to cut out too much, and hold back from duplicating too many items, that you create a CMT-based product that barely improves on a standalone core design.

AMD's CMT was surprisingly good, providing nearly 80% of the performance scaling of having two full-fledged cores. The problem was that those "full-fledged cores" themselves were quite weak in terms of IPC...that's a design trade-off made by AMD that had nothing to do with CMT.





As to why other high-profile companies don't pursue CMT themselves, I can tell you the complications that come with robustly qualifying and validating a CMT-based microarchitecture is a strong reason many will avoid it.

Another good reason to avoid CMT is that it raises your design and development expenses over those of designing a smaller (dimension-wise) single-threaded core (or one with SMT enabled) and then just copy-and-pasting that core until you have your desired thread-count capability.

Why bring on the complications unless you have to?
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,689
1,224
136
CMT is a 50% increase in die area for an 80% increase in throughput. A lot of people, including myself, have gotten confused by the language.

CMP core Bulldozer:
8B Fetch, 32 KB L1i, 1 MB L2
2-way Decode
2 ALU + 2 AGLU, 16 KB L1d
128-bit FMAC + 128-bit FMISC

For CMT, the FPU, x86-64 front end, and L2 are decoupled from the x86-64 core.

The x86-64 front end, FPU, and L2 are doubled, and so is the core count:

16B Fetch, 64KB L1i, 2 MB L2.
4-way decode
2 * (2 ALU + 2 AGLU, 16KB L1d)
2 * (128-bit FMAC + 128-bit FMISC)
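
The area/throughput trade-off stated at the top works out like this (both the +50% area and +80% throughput figures are the post's, not measurements of mine):

```python
def perf_per_area(area_factor, throughput_factor):
    """Throughput per unit die area, normalized to one baseline core."""
    return throughput_factor / area_factor

cmt_module = perf_per_area(1.5, 1.8)  # one module, two threads: +50% area, +80% perf
two_cores = perf_per_area(2.0, 2.0)   # plain CMP: duplicate the whole core

assert round(cmt_module, 3) == 1.2
assert two_cores == 1.0
assert cmt_module > two_cores  # CMT wins on throughput per area, by this accounting
```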
 
Last edited:

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
They are conflating CMT as a design methodology with the specific microarchitectural trade-offs that AMD chose in implementing their version of CMT.

Conceptually, CMT is not complicated and in the limit of full duplication of hardware resources provides the exact same benefits of adding an entirely new core to the chip.

It's when you choose to cut out too much, and hold back from duplicating too many items, that you create a CMT-based product that barely improves on a standalone core design.

AMD's CMT was surprisingly good, providing nearly 80% of the performance scaling of having two full-fledged cores. The problem was that those "full-fledged cores" themselves were quite weak in terms of IPC...that's a design trade-off made by AMD that had nothing to do with CMT.





As to why other high-profile companies don't pursue CMT themselves, I can tell you the complications that come with robustly qualifying and validating a CMT-based microarchitecture is a strong reason many will avoid it.

Another good reason to avoid CMT is that it raises your design and development expenses over those of designing a smaller (dimension-wise) single-threaded core (or one with SMT enabled) and then just copy-and-pasting that core until you have your desired thread-count capability.

Why bring on the complications unless you have to?

I completely get what you are saying, but a lot of the trade-offs that AMD made were because they went CMT, not choices they made independently. Cutting out the third ALU meant a reduction in IPC; keeping three ALUs per thread (6 per module) would not have made sense, as the module would have been too big and then you would lose the die savings. If you don't cut out enough, then there is no reason to go CMT, especially when you consider the higher overhead of managing shared resources.

CMT doesn't make sense where ST is important, as it will always favor greater aggregate performance over two threads rather than strong performance in a single thread.

I'm also looking at Jaguar, which is highly competitive with Kaveri at low clock speeds and manages the same IPC despite substantially fewer per-core resources.

It's also very easy to argue that an approach that costs more R&D money and takes longer to validate at the same performance level is inferior to a simpler, less costly approach (and thus a failed design).
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,361
136
I completely get what you are saying, but a lot of the trade-offs that AMD made were because they went CMT, not choices they made independently. Cutting out the third ALU meant a reduction in IPC; keeping three ALUs per thread (6 per module) would not have made sense, as the module would have been too big and then you would lose the die savings. If you don't cut out enough, then there is no reason to go CMT, especially when you consider the higher overhead of managing shared resources.

As NostaSeronx said earlier, BD can do 2x ALU (Arithmetic Logic Unit) and 2x AGU (Address Generation Unit) ops simultaneously, that is, 4 ops per cycle. AMD's K10 architecture could do either 3x ALU or 3x AGU ops, thus 3 ops per cycle.
The problem is that in order to reach 4 ops per cycle you need the appropriate program; that means you must have data movement, so your AGUs have addresses, data operands and registers to work on.

That means the program should be written for ILP/DLP and SIMD/MIMD(*), but the vast majority of desktop programs are more serial than parallel in nature. AMD knew this but chose the more ILP/DLP approach because BD's first target was, and still is, the server segment. The 2x ALU + 2x AGU approach also helps with the APU architecture, because GPUs are highly parallel in nature and HSA makes heavy use of data parallelism.
They were also expecting desktop programs to become more parallel sooner, but that didn't happen.

Also, the integer execution units don't take up nearly as much die area as the front end, and the front end is where they wanted to save space; they duplicated the integer cores, after all.

The only problem with BD was that it was one year late to market. As Piledriver clearly showed, the architecture was very competitive against Intel's 32nm Sandy Bridge in throughput performance (server). If the market had not collapsed and Dirk Meyer had not left AMD in early 2012, then we could have seen a 22/20nm Steamroller server part in 2014.

CMT doesn't make sense where ST is important as it will always result in greater aggregate performance over two threads rather then strong performance in a single thread.

One doesn't exclude the other; you can have strong ST on a CMT design. Depending on the program, even BD has higher single-thread performance than K10.


(*)
ILP = Instruction-Level Parallelism
DLP = Data-Level Parallelism
SIMD = Single Instruction, Multiple Data
MIMD = Multiple Instruction, Multiple Data
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,422
1,759
136
Cutting out the third ALU meant a reduction in IPC; keeping three ALUs per thread (6 per module) would not have made sense, as the module would have been too big and then you would lose the die savings.

No. The scalar ALUs themselves (with the exception of mul) take vanishingly little die area. Cutting the third ALU had nothing to do with die area, and everything to do with delay. Their idea was to use a smaller, simpler, lower-IPC core to get faster clock speeds. They managed the first part, but not the second. This is orthogonal to the decision to do CMT.
 

Ancalagon44

Diamond Member
Feb 17, 2010
3,274
202
106
Which led to the Pentium 4 outrunning K8 clock-to-clock in IPC. High IPC doesn't always correlate with higher performance.

Wrong, the K8 beat all Pentium 4s in IPC. Pentium 4 could only beat AMD K7 in IPC. In raw performance, the P4 could sometimes beat the K8. But never in clock for clock IPC.
 

pantsaregood

Senior member
Feb 13, 2011
993
37
91
Wrong, the K8 beat all Pentium 4s in IPC. Pentium 4 could only beat AMD K7 in IPC. In raw performance, the P4 could sometimes beat the K8. But never in clock for clock IPC.

If you can find any K7 - even a Duron - that is outperformed by a Pentium 4 at equal clock speed, I will be quite impressed.
 

inf64

Diamond Member
Mar 11, 2011
3,764
4,223
136
I posted what this will be a month ago:

Excavator is supposed to bring more-threads-per-module capability, some form of CMT/SMT hybrid (although Bulldozer already has it, since the FP unit is SMT) - so 4 threads per dual-"core" module. Oh, and AVX2+BMI2 capability.
The source for this is very reliable.

They are not ditching CMT per se. It seems they are just combining it with more SMT in this iteration of the BD core, since BD already does SMT in the FP unit. More resources are added to the core and are now allocated according to workload needs.

The *new*, from-the-ground-up post-Excavator core is in the works, and this one will most likely also contain some form of SMT. But since we know AMD is pushing for APU-like designs with the intention of moving the iGPU closer to the x86 cores, I'd expect them to integrate it into the FPU itself (assuming the trade-offs are worth it).
 
Last edited: