EETimes: ST plans for Dresden FDSOI production

itsmydamnation · Feb 2, 2013

again no it isn't

Code:

bench      bulldozer piledriver difference note
 
Windows 8 - x264 HD 5.0.1 - 1st Pass   52.3 62.2 18.93% 
Windows 8 - x264 HD 5.0.1 - 2nd Pass   12.7 15.1 18.90% 
Windows 8 - POV-Ray 3.7RC6 - Single Threaded  225.6 252.1 11.75% single thread
Windows 8 - POV-Ray 3.7RC6 - Multi Threaded  1259.2 1504.4 19.47% 
Windows 8 - Visual Studio 2012 - Firefox Compile 36.8 31.3 14.95% 
Windows 8 - Mozilla Kraken Javascript Benchmark  6905 5812 15.83% 
Windows 8 - Skyrim - 1680 x 1050   186.9 209.4 12.04% 
Windows 8 - Diablo 3 - 1680 x 1050   197 215.8 9.54% 
SYSMark 2012 - Overall     147 176 19.73% 
SYSMark 2012 - Office Productivity   125 147 17.60% 
SYSMark 2012 - Media Creation    134 158 17.91% 
SYSMark 2012 - Web Development    109 154 41.28% 
SYSMark 2012 - Data/Financial Analysis   212 241 13.68% 
SYSMark 2012 - 3D Modeling    200 239 19.50% 
SYSMark 2012 - System Management   129 146 13.18% 
Adobe Photoshop CS4 - Retouch Artists Speed Test 14.8 13.3 10.14% 
DivX 6.8.5 Encode (Xmpeg 5.0.3)    35.1 31.4 10.54% 
x264 HD Encode Test - 1st pass - x264 0.59.819  81.7 90.4 10.65% 
x264 HD Encode Test - 2nd pass - x264 0.59.819  37.8 44 16.40% 
Windows Media Encoder 9 x64    27 25 7.41% 
Cinebench R10 - Single Threaded Benchmark  3938 4319 9.67% single thread
Cinebench R10 - Multi-Threaded Benchmark  20254 23437 15.72% 
POV-Ray 3.7 beta 23 - SMP Benchmark   4512 5008 10.99% 
Par2 - Multi-Threaded par2cmdline 0.4   17.6 16.1 8.52% 
Blender 2.48a Character Render    50 44.6 10.80% 
Microsoft Excel 2007 SP1 - Monte Carlo Simulation 14.2 12.6 11.27% 
WinRAR 3.8 Compression - 300MB Archive   82.3 74.9 8.99% single thread
Cinebench 11.5 - Single Threaded   1.02 1.1 7.84% single thread
Cinebench 11.5 - Multi-Threaded 5.99   6.89 15.03% 
x264 HD Benchmark - 1st pass - v3.03   75.5 89.6 18.68% 
x264 HD Benchmark - 2nd pass - v3.03   35.8 41.9 17.04%
7-zip Benchmark      21041 23407 11.24%
Dragon Age Origins - 1680 x 1050 - Max Settings  118.4 139.2 17.57%
Dawn of War II - 1680 x 1050 - Ultra Settings  51.5 70.5 36.89%
World of Warcraft     77.7 91.5 17.76%
Starcraft 2      47.8 47.9 0.21%
 total average       15.36%
 average no single thread     16.11%
 average single thread      9.56%

the code here doesn't like tabs so i cant format it nicely.

mrmt · Feb 3, 2013

itsmydamnation said:
i think what people here are forgetting is that the 30% increase instruction throughput is going to remove a lot of the CMT penality, so in single thread you will see some benifit in multithread you will see a lot more. CMT penality is generally around 30%. The other core changes are what will drive single thread performance.

If it's just a matter of bring able to issue more instructions, why didn't AMD go for that since Bulldozer? They had plenty of time to do that since canning 45nm Bulldozer in 2009, they could have added a bigger decoder in 2011 already.

itsmydamnation · Feb 3, 2013

mrmt said:
If it's just a matter of bring able to issue more instructions, why didn't AMD go for that since Bulldozer? They had plenty of time to do that since canning 45nm Bulldozer in 2009, they could have added a bigger decoder in 2011 already.

why have instruction cache aliasing issues, why have such poor L1D write. the entire front end is shared; Instruction cache prediction/fetch and decode. scheduling , int execution, address gen, L/S and L1D isn't. So how about you tell me where the bottleneck lies that limits 2 thread performance but not one.

also they went from 3 wide decode in K10 per core to 4 wide per module. maybe the inital plan was to have both threads being able to decode in the same clock, but there does look to be a bottleneck there.

SocketF · Feb 3, 2013

AtenRa said:
32nm to 20nm will have a substantial increase in performance and lower power, it is after all a full node shrink unlike the half node that 28nm is over the 32nm. Also, price is not a number one problem because they will manufacture the big cores with it, that means server and high end desktop chips.

Yes it is full node, but the improvements of a bulk full node shrink is getting worse. I quoted it already earlier:

IC makers that moved from 40nm to 28nm have experienced a 35% average increase in speed and a 40% power reduction, said Jack Sun, vice president of R&D and chief technology officer at TSMC. In comparison, IC vendors that will move from 28nm to 20nm planar are expected to see a 15% increase in speed and 20% less power, Sun said.

In the special case of AMD&GF it would be even worse, bc. of the gate last <> first penalty. All in all I still would say that 20nm bulk is a node to skip. The lead time is also worse because the double pattering costs extra time. 14nm is around the corner, and I wasn't referring to GF's 14XM process, I meant 14nm FD-SOI ;-)

mrmt said:
Regardless on whether you are fond of AMD or not, everybody here can agree that the Bulldozer architecture is very inefficient from performance/area and performance/watt POV.

Depends on the POV ... have you see the performance per watt with FMA4 code? It is quite good, and probably the reason why there are several HPC-systems with Bulldozer CPUs. These guys code their own code, thus will use FMA. But for the normal end-user it is like you said.

mrmt said:
Those measures, dedicated decoders, bigger memory controller, bigger caches, are all measures that will demand *more* area and *more* thermal envelope, so AMD may be adding a bit for their inefficiency tax in order to get more raw performance from steamroller.

I beg to differ, there is one example against it, namely Intel's µcode cache since the SandyB-generation. If there is a hit in the µcode Cache the whole front end is shut down, thus saving power. However, the µcode's dispatch bandwith is higher, thus the performance even increases. All in all the performance/watt is better, the only disadvantage is that you need more area. Now if AMD would do sth similar ... They said that SR will get a loop-cache, I wonder if it will be more like the small old Nehalem cache or sth like SandyB's solution.

People tend to assume that AMD will improve simply because they need to. This is a false assumption,

I think they will improve, because it believe that it cant get worse *G*

Ed: One has to wonder what kind of market share and OEM relationship AMD will have in 2014 when Steamroller arrives.

Good question, it seems they want to have more high-volume OEMs, but I assume none of the current big players would risk their relationship with intel. I'll expect some previously unknown Chinese/Indian/Brazilian OEMs.

mrmt said:
Promises like costs comparable to non-SOI processes, things that they never could deliver.

If you are really interested have a peek on the SOI consortium site and look for the articles around 2009-2010. I'm sure you'll find plenty of interesting things, many of them soooooooo good to be true. And yet here we are in 2013 and the only relevant customer the consortium got is... AMD.

Decoder:

mrmt said:
I'm assuming that AMD engineers looked at the design and decided that the current iteration of Bulldozer could not deal with a dedicated decoder for the reasons I listed. The fact you are issuing instruction does not mean you'll have the available resources to deal with them down on the pipeline, and a bigger decoder wouldn't yield Bulldozer better performance than with the split decoder. This is a possibility that I think it is the most probable. The other possibility is that AMD architects are a bunch of screw-ups.

While this idea seems counter-intuitive, that's *exactly* the decision made by AMD engineers and I doubt they would leave a low-hanging fruit like that if it really was a low-hanging fruit.

I think the background with the decoder lies in the origin of its design, which is Glen Henry. He designed the basic BD-design at his stint at AMD in ~2002. His design needed the shared decoder, because he thought about the speculative execution of one INT cluster's thread at the other cluster. However he didn't finish that before he left AMD. So basically the shared decoders were useless for AMD, but nobody at AMD decided to change it and/or it probably didnt matter too much.

mrmt said:
Bulldozer lower IPC would be consequence of the designs trade offs of the architecture, not a CMT restriction.

Well there is a connection, you should use smaller, less complex cores for CMT, because there is generally not much code that has a IPC greater than 2. Thus in theory 2 INT-pipelines are enough. That this is more or less true could be also seen from Intel's SMT speedup. It is around 33% max for a 2nd thread on the same core, i.e. one of the 3 pipelines is not used. Problem is only that BD's 2 pipelines should be fed like a goose before Christmas, but 16kB L1 WT doesn't sound like that ... also there are no loop-caches like in the intel case.

erunion said:
Interestingly, before Vishera launched people built their claims that Vishera would improve IPC by 10-15% on this slide. Then they added things like clock bumps from RCM to declare Vishera would offer 20-25% more performance. Basically what we've seen in this thread.(Vishera did end up hitting the up to 10% total performance mark, but no better)

Vishera doesn't have the RCM, and Trinity was a A1 revision with lots of bugs. Seems it were so many bugs in the A1 revision, that limited its performance that AMD was now confident enough give the Rev. B a new name ("Richland") ;-)

The 30% number that is floating around for Steamroller doesn't seem to have anything to do with 30% IPC. Rather its relates to "Ops per Cycle" for the fetch/decode units.
(I'm not an EE so feel free to clue me on what exactly that means, anyone)

Well the IPC is about instructions, i.e. before the decoder, the Ops are the smaller entities after decoding. Lots of modern Instructions however are decoded into exactly 1 Op. With Micro/Makro-Op Fusion there could be also more than 1 instruction in one Op, but that is not the general case, also AMD's decoders have to generate 2 Ops sometimes e.g. for 256bit AVX instructions. I would expect a 1:1 translation in the very best cases, thus one can hope for a +30% higher IPC in the best cases BUT only when using 2 threads. This is a bit odd, because the IPC is normally used for single-thread tasks, only. I assume that was also a reason for AMD not to use IPC.

mrmt said:
The other part of your improvements come from FD-SOI. First AMD never stated they would go FD-SOI, they said they would go to standard 28nm technology and as of now this technology is bulk. There is no 28nm FD-SOI HVM, STM and GLF are still negotiating the deal to allow the process. When the two reache an agreement, AMD will have to port a big and acknowledged complex design to this process and both the design and the node will have to have to perform as advertised and have good yields.

What I see here is two *very* long shots. Too long to be deemed reasonable expectations, but much more a wish/hope that AMD succeeds.

Oh yes don't forget the topic ^^
In general I agree to your POV, however I got a bit more optimistic the last time, because AMD's CEO spoke about using cheaper process with less layers previously and in his last speech he was also referring to a faster Time To Market. The first two features could be accomplished with a simpler design@bulk, and he could mean Jaguar with it. But a faster time to market is the sole feature of FD-SOI as far as I know. Finfets are time-consuming, 20nm is, too, only the FD-SOI slides claim a faster TTM. But I am not an expert, thus I could be wrong. Any other ideas how to reduce the TTM?

mrmt said:
If it's just a matter of bring able to issue more instructions, why didn't AMD go for that since Bulldozer? They had plenty of time to do that since canning 45nm Bulldozer in 2009, they could have added a bigger decoder in 2011 already.

I guess they were busy enough with the resigned from SSE5 to AVX&FMA4. Let's not forgot that it was for the first time that AMD supported an Intel Instruction extension *before* Intel itself.

ShintaiDK · Feb 3, 2013

Ajay said:
The slides AtenRa posted where marked up after the above perf/watt graph from AMD. And it does say 30% outright (not per watt):

I bolded 'simulations/simulated' because what they are really saying is that this is an estimate of an estimate, also known as a blue sky guess in most industries

That 30% number is still not IPC.

Again, you and the rest ran with something you either dont understand or intentionally manipulate to be something its not.

NTMBK · Feb 3, 2013

ShintaiDK said:
That 30% number is still not IPC.

Again, you and the rest ran with something you either dont understand or intentionally manipulate to be something its not.

I was under the impression that Ops per cycle was equivalent to IPC? If they were talking on a lower scale, they would be referring to micro-ops per cycle.

Third_Eye · Feb 3, 2013

Ajay said:
GloFo could change their "Standard Process Technology" if it will help them win addition business (look at what ST-Ericsson was able to with their 28nm FD-SOI Quad A9 Arm SoC - it showed massive gains!). If 28nm FD-SOI become the standard process tech @ Dresden, then AMD will not have to pay R&D. Right now STM & GF have an MoU under which they may come to a deal to make STM's FD-SOI a standard process @ Dresden. This should be good for GF and AMD. We'll just have to wait and see. Sounds like the timeline is to get a deal done by the end of February - so, hopefully, we won't have to wait long.

Global Foundries wants to be a pure foundry attracting as much clientele as possible. Do you know that GF has not made any profit ever since it came into being?

Right now GF has AMD as its only client and its "paying customer". And this client uses 32nm PD-SOI process. If it wants to grow beyond its client base, it should offer a standard process too.

Soon IBM will become a 32nm PD-SOI customer. We know that IBM will first fab its server Power 7+ MPUs at that node first and likely to shrink its Wii U SOC(which currently in 2013 is still manufactured in 45nm SOI) to 32nm SOI.

Also 28nm bulk HKMG being the "official" 28nm from IBM Hardware Consortium, it means that all consortium members will work with each other to streamline the process, commercialize and improve it. So who-ever tries to implement SOI at 28nm has to do that on its own.

So STMicro which has never been a manufacturing powerhouse, comes up with its FD-SOI which sounds great and even has chips to highlight the improvements. But STM is now bleeding and unable to invest so as to commercialize its 28nm FD-SOI. So it tries to rope in the current SOI commercialization leader Global Foundries for the process. But GF might not get the ROI if ST-Ericsson alone is the client to manufacture chips in a "special" process.

Always remember, in practice for any technology, you got to worry about clients, costs and yields. How much trouble did GF have with its 32nm PD-SOI ramp-up where in AMD was unable to get Llano in enough quantities even though there was a heavy demand for the product. AMD only paid in 2011 for "GOOD 32nm DIES". Only in mid 2012 did the yields improve and now AMD is able to even bring RICHLAND the last hurrah of 32nm SOI.

And if you take the whole manufacturing history of GF including the time it was AMD's manufacturing arm we it was doing PD-SOI from 130nm->90nm->65nm->45nm->32nm. Inspite of close to a decade dabbling in PD-SOI, GF took more than a year since the first 32nm PD-SOI chip was released to tame the process. Imagine 28nm FD-SOI being totally new to GF?

AMD cannot afford to wait till 28nm FD-SOI is tamed.. Period! And GF cannot (once again)put its eggs in SOI basket when most of chip manufacturers are designing for Bulk.

Ajay · Feb 3, 2013

ShintaiDK said:
That 30% number is still not IPC.

Again, you and the rest ran with something you either dont understand or intentionally manipulate to be something its not.

I didn't notice that that the slide was talking about post decoder Ops. So it was neither, I just wasn't paying close enough attention - my bad. It is clear that that slide made no mention of performance/watt - which is very curious. It just makes that 30% number even more ambiguous wrt any actual performance increase on 28nm bulk.

Based on what I've posted most recently, I'm surprised that you'd claim I might be intentionally manipulating this information purposefully. I was totally over optimistic earlier in the thread, but based on information I've read about AMD/GF in the past couple of days, I've done a complete 180 on AMD. Plus it just not my MO here on AT.

Ajay · Feb 3, 2013

Third_Eye said:
AMD cannot afford to wait till 28nm FD-SOI is tamed.. Period! And GF cannot (once again)put its eggs in SOI basket when most of chip manufacturers are designing for Bulk.

Fair enough. Thanks for the info.

ShintaiDK · Feb 3, 2013

NTMBK said:
I was under the impression that Ops per cycle was equivalent to IPC? If they were talking on a lower scale, they would be referring to micro-ops per cycle.

Certainly not. A micro ops is simply a decoded x86 instruction.

ShintaiDK · Feb 3, 2013

Ajay said:
I didn't notice that that the slide was talking about post decoder Ops. So it was neither, I just wasn't paying close enough attention - my bad. It is clear that that slide made no mention of performance/watt - which is very curious. It just makes that 30% number even more ambiguous wrt any actual performance increase on 28nm bulk.

Based on what I've posted most recently, I'm surprised that you'd claim I might be intentionally manipulating this information purposefully. I was totally over optimistic earlier in the thread, but based on information I've read about AMD/GF in the past couple of days, I've done a complete 180 on AMD. Plus it just not my MO here on AT.

I said that you either didnt understand it, or manipualted it. So if you feel hit on manipulated, then its time to look inwards.

Ajay · Feb 3, 2013

ShintaiDK said:
I said that you either didnt understand it, or manipualted it. So if you feel hit on manipulated, then its time to look inwards.

WTH? I said that is not my MO here a AT? Or, if you missed that "So it was neither, I just wasn't paying close enough attention - my bad." Geez, last couple of days peeps have been acting like judge, jury and executioner, is it a full moon?

Fwiw, I think you are right about the ultimate trajectory of AMD, although I think odds are pretty good that they'll go through Chapter 11 before becoming a niche player like Via.

SocketF · Feb 3, 2013

This story is so funny ... another chapter of the "too good to be true" story, this time from a 3rd party, so I think is is rather credible.

http://www.goldstandardsimulations.com/index.php/news/blog_search/fd-soi/

These people here state that the voltage on 28nm GateLAST would be not only around 0.6-0.7 V like it is with STE's ARM SoC, produced in 28nm FD-SOI GateFirst, but below 0.5V ....

Really unbelievable...

Haserath · Feb 3, 2013

.5V is near threshold voltage... Switching speed is supposed to severely slow at that point.

Edit: I guess these may differ from Intel's transistors too much to compare.

Greenlepricon · Feb 3, 2013

SocketF said:
This story is so funny ... another chapter of the "too good to be true" story, this time from a 3rd party, so I think is is rather credible.

http://www.goldstandardsimulations.com/index.php/news/blog_search/fd-soi/

These people here state that the voltage on 28nm GateLAST would be not only around 0.6-0.7 V like it is with STE's ARM SoC, produced in 28nm FD-SOI GateFirst, but below 0.5V ....

Really unbelievable...

I could actually see this happening for a semi-sleep mode on the pc during heavy idle periods. Maybe it's not practical for doing much at all (maybe hardly or not even web browsing), but it would be cool to be able to look at something on your pc and pulling less than half the voltage of when your pc is working.

Haserath · Feb 3, 2013

ShintaiDK said:
That 30% number is still not IPC.

Again, you and the rest ran with something you either dont understand or intentionally manipulate to be something its not.

Can you describe to us what it is then?

ShintaiDK · Feb 3, 2013

Haserath said:
Can you describe to us what it is then?

I think the slide itself should be pretty clear, that this only applies to the fetch and decode. In other terms, macro(x86) to micro(risc) ops. The entire backend with execution units is unaccounted for.

Ajay · Feb 3, 2013

Haserath said:
Can you describe to us what it is then?

Reduced miss-prediction should help, but I can't recall what PD's branch prediction efficiency was, but given that modern CPUs are typically > 95%, that doesn't sound like a very big improvement.

The increased dispatch width means that 5 Macro-Ops can be dispatched to each integer core for each cycle (Max), but unless the the number of execution ports goes up from 4 to 5 ports, that Max will never be reached, even with highly optimized code.

So we've only been give a small part of the story. It sounds like big news, because the single 4 port decoder in BD/PD with a Max of 4 Mops/cycle per module severely crippled the two integer cores per module which has 8 execution ports between them (2 AGUs and 2 ALUs per core).

So it's looking like good news for some improvement in performance, but without knowing anything about:
1) The execution cores. and...
2) The memory subsystem (specifically, the L1$ & L2$ performance & whether or not the SWCC performance has been improved).

We are simply left guessing or, more to the point, we are left with nothing.

If you want to understand this better, start here: RealWorldTech - Bulldozer

Haserath · Feb 3, 2013

The front end bottlenecks the module, thus 30% can be seen with all resources in use.

This won't help as much for single thread, but it will be great for the module loads and will benefit everything a little bit.

I'm hoping that they at least pull 30% with the module.

But I've been hoping for a long, long time.

Haserath · Feb 3, 2013

Ajay said:
Reduced miss-prediction should help, but I can't recall what PD's branch prediction efficiency was, but given that modern CPUs are typically > 95%, that doesn't sound like a very big improvement.

The increased dispatch width means that 5 Macro-Ops can be dispatched to each integer core for each cycle (Max), but unless the the number of execution ports goes up from 4 to 5 ports, that Max will never be reached, even with highly optimized code.
If you want to understand this better, start here: RealWorldTech - Bulldozer

I thought the +25% max width dispatches was the increase of 4 op dispatches over the previous design.

And modern designs have close to 98% branch prediction most of the time, but any reduction in mispredictions can save a lot of energy and performance.

I've already looked through that at well but ty anyway.

Idontcare · Feb 3, 2013

SocketF said:
This story is so funny ... another chapter of the "too good to be true" story, this time from a 3rd party, so I think is is rather credible.

http://www.goldstandardsimulations.com/index.php/news/blog_search/fd-soi/

These people here state that the voltage on 28nm GateLAST would be not only around 0.6-0.7 V like it is with STE's ARM SoC, produced in 28nm FD-SOI GateFirst, but below 0.5V ....

Really unbelievable...

What does all this mean for the power dissipation and, particularly, the supply voltage of SRAM. Starting with bulk at 28 nm you need approximately 0.9V to secure the reliable operation of large SRAM arrays. With metal-gate-first FD-SOI you will be able to reduce this voltage to below 0.7V. However the technologist who that could develop and deliver metal-gate-last FD-SOI at 28nm will be able to offer you supply voltage below 0.5V.

I can believe it. In R&D though, the question isn't "can it be done?" because the answer is almost always "yes, physics says it can be done, but engineering says it can only be done for a price".

Rather in R&D the question is "can it be done within the need-timeline, within the allocated development budget, and will it meet the desired production metrics for yield, cycle-time, production cost, etc?"

I believe the gate-last simulated results. But is it a pathway to sub 0.7V operation that is superior in both development and production costs when compared to other alternative integration schemes? That is the real question.

Idontcare · Feb 3, 2013

Ajay said:
WTH? I said that is not my MO here a AT? Or, if you missed that "So it was neither, I just wasn't paying close enough attention - my bad." Geez, last couple of days peeps have been acting like judge, jury and executioner, is it a full moon?

Fwiw, I think you are right about the ultimate trajectory of AMD, although I think odds are pretty good that they'll go through Chapter 11 before becoming a niche player like Via.

Folks, this (the pedantic fanaticism afoot in going after Ajay's comments) is just not what I want to cultivate here in my threads. I can't control the rest of the forum, but in my threads I like to think the evolving conversation is one of exploration through dialogue, openly thinking aloud, a brain-storming session.

You know what kills a brain-storming session? The dude who just wants to jump on top of everyone elses ideas, shouting them down and telling them they are stupid ideas.

That is counter-productive, it creates an environment that is no longer conducive to people feeling comfortable sharing their ideas, it creates an environment where people will intentionally refrain from posting their thoughts simply out of a desire to not have to deal with one more flame-back post with a poster who seems to have an axe to grind over petty pedantic sticking points that are needless levels of detail in the first place.

Now I have no issue with the folks who want to get serious, the folks who aren't content with the flow of conversation being akin to pub talk and writings on the back of napkins. Break out those text books, take to the chalk board, and really hammer home your thoughts in bulletproof rigor.

But don't hold everyone else to your standard, that just isn't cool, its a buzz-kill. Ajay isn't here to piss on anyone's wheaties, I assure you, and there really isn't any reason to go after his posts just to piss on his wheaties.

It is clear to me that Ajay is speaking in wide generalizations, using terms and concepts in a fast-and-loose approach. There is no reason to scrutinize his posts as if this were a PhD dissertation defense. There is a time and place for that, but I would hope folks kinda recognize that is not my demeanor and as such I like to see my threads be a little less rigorous (and by extension, less contentious).

Joshing and friendly-ribbing of one another is expected, it is what colleagues do. But we cross a line when we start getting mean about it, rubbing noses in it by capturing passing comments out of context for sig lines and so forth. That is not conducive to friendly discussions, that is turning people against each other.

Lets please reconsider why we are here, as individuals, and what it is that we seek to leave in our wake when we logoff these forums every day.

Abwx · Feb 3, 2013

ShintaiDK said:
Certainly not. A micro ops is simply a decoded x86 instruction.

And ?...

If there s 30% more uops that means that 30% more macro ops
are decoded as well.

Abwx · Feb 3, 2013

NTMBK said:
I was under the impression that Ops per cycle was equivalent to IPC? If they were talking on a lower scale, they would be referring to micro-ops per cycle.

This slide does not tell the whole story as they purposely stated
the 30% higher IPC at the whole module level knowing that we
couldnt extract the IPC at single core level since we dont know
how much they reduced the CMT penalty.

According to Hardware.fr the CMT penalty is about 20% when
all four modules are used but only 15% with two modules since
it doesnt stress as much the CPU L3 cache bandwith , this
with a FX8150 Zambezi , wich is likely the reference used
by AMD for this slide.

If they manage to reduce to 0 the CMT penalty a mere 5-10% better
IPC at the core level would be enough to yield the 30% figure at the
module level , hardly a possibility given that even separate cores
do not have 0 CMT penalty.

If CMT penalty is reduced to 10% then single core IPC should
be increased by roughly 15% to get the same 30% at the module
level and this is the most probable scenario.

Ajay · Feb 3, 2013

Abwx said:
And ?...

If there s 30% more uops that means that 30% more macro ops
are decoded as well.

A macro-op (Mop) can consist of one of two (fused) micro-ops. Most well optimized compilers will produce x86 code that decodes into one Mop. So for code optimized for Core2 or later Intel x86 (SB improved uop fusion, don't know about BD exactly), a 30% increase in Mops should roughly equal 30% more x86 instructions being decoded. But as already indicated, we know nothing about the execution resources or the local memory hierarchy.

So what the actual performance increase? No way to tell at the moment. Since Kaveri uses SR modules, we will hopefully see something from Intel @ Hot Chips, that when Trinity's architecture was revealed.

uop fusion for several cpu's is covered here: http://www.agner.org/optimize/microarchitecture.pdf

EETimes: ST plans for Dresden FDSOI production

Platinum Member

Diamond Member

Platinum Member

Senior member

Lifer

Lifer

Member

Lifer

Lifer

Lifer

Lifer

Lifer

Senior member

Senior member

Senior member

Senior member

Lifer

Lifer

Senior member

Senior member

Elite Member

Elite Member

Lifer

Lifer

Lifer