EETimes: ST plans for Dresden FDSOI production


MisterMac

Senior member
Sep 16, 2011
777
0
0
Yeah, you have a point. Intel delivered Core2, Nehalem/Bloomfield and Sandy Bridge close to expectations (well, Core2 was probably above expectations). There is no point in being an optimist wrt AMD. I formally unsubscribe from this thread.


1. You're of course invited to my birthday.
Especially since you felt like writing me a personal message - and whining about me using your statements in a SIG.

The mature thing to do would be to backpedal and state that maybe you chose some dubious data as grounds for your statement - and that you'll be more careful.

Which I and anyone else would respect - but you didn't choose that.

2. If Runner A improves his lap time YoY - by his own expected goals -
that doesn't somehow mean Runner B will be able to.
Especially when Runner B actually got slower on some of his laps YoY.
And he's been known to waddle - unlike Runner A, who has continuously improved his lap time YoY.


3. PD was a great step - but it was also a step taken after taking two steps backwards. The bar was not very high, was it?

The problem is the fundamental design of AMD's CMT.
Not something they can magically fix.

Not something anyone believes they'll magically "fix".
(But kudos if they do).

You just read something on the internet and decided to spout it out as an accurate assessment - one your credibility should support.
Then you decided - hey, let's ADD 66% more to that estimate and claim that if AMD wanted to - they could!


Then we call you out on it - and you call me a child.
You'll fit in well on the Internet.
 

mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
Hmm, good question whether AMD had to pay or not. I guess STM and GF are discussing that right now.

Yes, I am also a bit cautious, everything just sounds soooo good ... suspicious ^^
By the way, what were the old promises that they failed to deliver? I wasn't interested in that topic back then I guess, don't know.

Promises like costs comparable to non-SOI processes, things that they never could deliver.

If you are really interested, have a peek at the SOI consortium site and look for the articles from around 2009-2010. I'm sure you'll find plenty of interesting things, many of them soooooooo good they couldn't possibly be true. And yet here we are in 2013, and the only relevant customer the consortium got is... AMD.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,868
3,419
136
SR == fixes: fixed instruction cache, fixed L1D/WCC, fixed decode, fixed branch miss penalty (loop buffer).

EX == not a clue
 

mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
Dedicated decoders, bigger caches, more dispatch bandwidth will surely do their job ;-)

Regardless of whether you are fond of AMD or not, everybody here can agree that the Bulldozer architecture is very inefficient from a performance/area and performance/watt POV.

Those measures - dedicated decoders, a bigger memory controller, bigger caches - will all demand *more* area and *more* thermal envelope, so AMD may be adding a bit to their inefficiency tax in order to get more raw performance out of Steamroller.

Take for example the dedicated decoder. People are expecting a lot from it, but it is one thing to issue an instruction into the pipeline and another to keep the units fed all the time, and here is the catch. It would be too asinine, even by AMD standards, not to add a decoder if the pipelines could handle an additional two instructions per clock. The reason they chose not to add that decoder was that the ability to issue two more instructions per clock wouldn't add significant performance. More instructions in the pipeline raise the demand on the caches and on the units and raise the number of stalls, and given the longer Bulldozer pipeline, this is an issue with the uarch. Improvements to the OoO window and better cache management might have changed the balance of the decoder trade-off, but to expect 30-50% improvements, well, that is really far-fetched. If AMD has this card in the deck, let them show it to us, and not assume they will deliver just because they say so.
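
A minimal toy model (Python) of that front-end/back-end balance argument; all of the widths and stall rates below are hypothetical illustrations, not AMD figures:

Code:
# Toy model: sustained throughput is bounded by the weakest stage, so
# widening decode alone buys little if the back end stalls. All numbers
# here are hypothetical, purely for illustration.
def sustained_ipc(decode_width, backend_width, stall_fraction):
    """Instructions per clock the pipeline can actually retire."""
    backend_ipc = backend_width * (1.0 - stall_fraction)  # stalls waste issue slots
    return min(decode_width, backend_ipc)

# Shared decode giving ~2 slots/clock per core vs. a dedicated 4-wide decoder,
# with the back end stalling 40% of the time (made-up figure):
print(sustained_ipc(decode_width=2.0, backend_width=4.0, stall_fraction=0.4))  # 2.0
print(sustained_ipc(decode_width=4.0, backend_width=4.0, stall_fraction=0.4))  # 2.4
# Doubling decode only helps up to the point where the stalled back end,
# not the decoder, becomes the limiter.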

People tend to assume that AMD will improve simply because they need to. This is a false assumption; someone who thinks this assumes that AMD is immune to becoming irrelevant or even going bankrupt. AMD isn't improving at the needed speed, AMD will become *more* irrelevant if they don't improve, and the market *is* pricing in liquidation at some point in the future.

Ed: One has to wonder what kind of market share and OEM relationship AMD will have in 2014 when Steamroller arrives.
 

MightyMalus

Senior member
Jan 3, 2013
292
0
0
SR == fixes: fixed instruction cache, fixed L1D/WCC, fixed decode, fixed branch miss penalty (loop buffer).

EX == not a clue

If memory does not fail me, when Anandtech reviewed the SR architecture, AMD themselves commented that the latency of the L3 cache would not be fixed in SR.

As for EXV, AMD has commented on using the automated design process - we have that, a probable fix to the L3, and a claimed 30% performance increase over SR.
 

NTMBK

Lifer
Nov 14, 2011
10,269
5,134
136
If memory does not fail me, when Anandtech reviewed the SR architecture, AMD themselves commented that the latency of the L3 cache would not be fixed in SR.

As for EXV, AMD has commented on using the automated design process - we have that, a probable fix to the L3, and a claimed 30% performance increase over SR.

Well, it'll certainly be fixed on Kaveri - given that Kaveri won't have an L3, if it's anything like the previous APUs.

Interesting how AMD seems to be shipping more and more L2-only designs, and their L2 caches are growing in size. Are they ditching the L3 entirely? Could be a worthwhile gamble, if doing so gives them die area for a better GPU.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,868
3,419
136
If memory does not fail me, when Anandtech reviewed the SR architecture, AMD themselves commented that the latency of the L3 cache would not be fixed in SR.

As for EXV, AMD has commented on using the automated design process - we have that, a probable fix to the L3, and a claimed 30% performance increase over SR.


L3 latency isn't an issue - it's an eviction cache. The L1D write path is an issue, and the L1I is an issue.

mrmt, I wrote a massive post about how I think you're wrong, then hit submit and lost it..... GGGRRRR. To sum up an hour of effort in one line: when you're spinning your wheels, you're still burning just as much juice.
 
Last edited:

Ajay

Lifer
Jan 8, 2001
16,094
8,106
136
You just read something on the internet and decided to spout it out as an accurate assessment - one your credibility should support.
Then you decided - hey, let's ADD 66% more to that estimate and claim that if AMD wanted to - they could!


Then we call you out on it - and you call me a child.
You'll fit in well on the Internet.

I apologize for belittling you, it was childish of me to engage in a 'tit for tat'. I guess my ego deserves what it got.


Aside from that, the numbers have better sources than I thought, as AtenRa points out. If AMD winds up being accurate in their estimates, SR on bulk should be 30% faster than BD/PD (the slides are not clear about which is the base). If STM winds up being accurate in its estimates for the transition to FD-SOI, then there is at least another 30% of possible performance boost. So let's just say that 1.3*1.3 is a good ballpark number. That's a ~70% boost, in an ideal world. But I admit that it isn't an ideal world (especially given what mrmt posted this morning about AMD's credit rating). AMD is dead to me for now, until they show a reason why they should get my attention. They have no product that interests me; only my intellectual curiosity about computer architecture held my interest till now.
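
A quick sanity check of that compounding (Python; the two 30% figures are the claimed estimates discussed above, the rest is plain arithmetic):

Code:
# Compounding the two claimed ~30% gains mentioned above.
sr_gain = 1.30      # claimed SR-on-bulk gain over BD/PD
fdsoi_gain = 1.30   # claimed gain from the FD-SOI transition
combined = sr_gain * fdsoi_gain
print(f"combined speedup: {combined:.2f}x (~{(combined - 1) * 100:.0f}% boost)")
# -> combined speedup: 1.69x (~69% boost), i.e. the ~70% 'ideal world' figure.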
 
Last edited:

ThePeasant

Member
May 20, 2011
36
0
0
Regardless of whether you are fond of AMD or not, everybody here can agree that the Bulldozer architecture is very inefficient from a performance/area and performance/watt POV.

Those measures - dedicated decoders, a bigger memory controller, bigger caches - will all demand *more* area and *more* thermal envelope, so AMD may be adding a bit to their inefficiency tax in order to get more raw performance out of Steamroller.

Would the marginal increase in utilization of execution resources outweigh the marginal increase in power and area from adding an additional decoder to each module? Is it more efficient, as you are suggesting, from a perf/power perspective to have execution resources available (clock-gated?) but underutilized?

BD's relatively high power consumption versus its IPC compared to SB seems to indicate that BD spends quite a bit more energy doing nothing, more often.

Anyway, duplicating the decoders should only help in multi-threaded scenarios, unless they have also improved the decoders. Also, if the loop buffer in SR is anything like the uop cache in SB, it should help mitigate mis-predict penalties and reduce power consumption.
 

mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
Would the marginal increase in utilization of execution resources outweigh the marginal increase in power and area from adding an additional decoder to each module? Is it more efficient, as you are suggesting, from a perf/power perspective to have execution resources available (clock-gated?) but underutilized?

I'm assuming that AMD engineers looked at the design and decided that the current iteration of Bulldozer could not make use of a dedicated decoder, for the reasons I listed. The fact that you are issuing instructions does not mean you'll have the resources available to deal with them further down the pipeline, so a bigger decoder wouldn't yield Bulldozer better performance than the split decoder. That is the possibility I think is most probable. The other possibility is that AMD's architects are a bunch of screw-ups.

While this idea seems counter-intuitive, that's *exactly* the decision AMD's engineers made, and I doubt they would leave a low-hanging fruit like that if it really were a low-hanging fruit.

BTW, this is a design trade-off, not a manufacturing problem. AMD could have an alien manufacturing process node and would have faced the same choice.
 

ThePeasant

Member
May 20, 2011
36
0
0
I'm assuming that AMD engineers looked at the design and decided that the current iteration of Bulldozer could not make use of a dedicated decoder, for the reasons I listed. That is the possibility I think is most probable. The other possibility is that AMD's architects are a bunch of screw-ups.

Yet they've now decided to provide two decoders. Does that seem like a validation of their previous strategy or a reneging?

The fact that you are issuing instructions does not mean you'll have the resources available to deal with them further down the pipeline, so a bigger decoder wouldn't yield Bulldozer better performance than the split decoder.

Except module sharing has a demonstrably non-trivial effect on per-core throughput (some 18% if we are to go by IDC's numbers). Dedicated execution resources that become less utilized when a module is shared indicate that there is some bottleneck in the shared resources feeding them. This is a natural consequence of sharing resources (caches, RAM), but BD is more prone to the effects. I suppose they expected this to matter less because the overall throughput of a module would have been higher than it is now.

While this idea seems counter-intuitive, that's *exactly* the decision AMD's engineers made, and I doubt they would leave a low-hanging fruit like that if it really were a low-hanging fruit.

BTW, this is a design trade-off, not a manufacturing problem. AMD could have an alien manufacturing process node and would have faced the same choice.

BD is sub-par in many regards even when compared to the previous generation. Was that a design "trade-off" or was it the result of a relatively poor/misguided design? I think they screwed up and now they're trying to salvage what they can.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,361
136
Regardless of whether you are fond of AMD or not, everybody here can agree that the Bulldozer architecture is very inefficient from a performance/area and performance/watt POV.

Not everyone here agrees with your evaluation. The Bulldozer architecture (CMT) is very efficient at performance/area and performance/watt when compared against a similar Intel CPU design (Core i7 3820) on the same 32nm node.

Intel Sandy Bridge-E (4C) (Core i7 3820) at 32nm = 1.27B transistors, 294mm2 die size, 2+8MB (10MB) L2/L3 cache and 130W TDP.

vs

FX8350 (4 modules) at 32nm = 1.2B transistors, 315mm2 die size, 8+8MB (16MB) L2/L3 cache and 125W TDP.

In highly multi-threaded applications (x264, CB11.5, Pov-Ray etc.) they trade blows.
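
For reference, the spec ratios implied by those numbers (Python; this is just arithmetic on the figures quoted above, not a performance claim):

Code:
# Spec arithmetic from the figures quoted above; no benchmark data involved.
i7_3820 = {"transistors_B": 1.27, "die_mm2": 294, "tdp_W": 130}
fx_8350 = {"transistors_B": 1.20, "die_mm2": 315, "tdp_W": 125}

for name, chip in (("Core i7 3820", i7_3820), ("FX8350", fx_8350)):
    density = chip["transistors_B"] * 1e3 / chip["die_mm2"]  # MTransistors per mm2
    print(f"{name}: {density:.2f} MTr/mm2, {chip['die_mm2']} mm2, {chip['tdp_W']} W TDP")

# Die area: 315/294 is ~7% larger for the FX8350; TDP: 125/130 is ~4% lower.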
 

mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
Yet they've now decided to provide two decoders. Does that seem like a validation of their previous strategy or a reneging?

I think reneging. They are putting power back into the core, making it beefier.

BD is sub-par in many regards even when compared to the previous generation. Was that a design "trade-off" or was it the result of a relatively poor/misguided design? I think they screwed up and now they're trying to salvage what they can.

IDC posted here a few days ago that CMT shouldn't impact single-threaded performance; it is when sharing resources that IPC should take a hit. If this assumption is correct, then we cannot rule out poor/misguided design.

Bulldozer's lower IPC would be a consequence of the design trade-offs of the architecture, not a CMT restriction.

Where I still think CMT has a negative impact is that you need to keep units sitting idle if you have no threads to feed them. While Intel's i3 chips need only two threads to have most of their units working, Trinity would need at least 4. This is what I see as one of the main sources of Bulldozer's inefficiencies.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,361
136
I'm assuming that AMD engineers looked at the design and decided that the current iteration of Bulldozer could not make use of a dedicated decoder, for the reasons I listed. The fact that you are issuing instructions does not mean you'll have the resources available to deal with them further down the pipeline, so a bigger decoder wouldn't yield Bulldozer better performance than the split decoder. That is the possibility I think is most probable. The other possibility is that AMD's architects are a bunch of screw-ups.

While this idea seems counter-intuitive, that's *exactly* the decision AMD's engineers made, and I doubt they would leave a low-hanging fruit like that if it really were a low-hanging fruit.

BTW, this is a design trade-off, not a manufacturing problem. AMD could have an alien manufacturing process node and would have faced the same choice.

1: SteamRoller will have a more optimized front end, which means it will feed the cores with more instructions per cycle than BD and PD. By having more instructions coming from the front end, you create a bottleneck at the decoding and execution units.

2: In order to alleviate that bottleneck, you simply double the decoders per module.

3: So the next step is to increase the execution units' capability to take advantage of the new front end and the second decoder.

Since AMD's engineers and CPU architects aren't screw-ups, they did all three steps and more.
You wouldn't be good as a CPU architect; stick to what you know.

http://www.anandtech.com/show/6201/amd-details-its-3rd-gen-steamroller-architecture



 

ThePeasant

Member
May 20, 2011
36
0
0
Not everyone here agrees with your evaluation. The Bulldozer architecture (CMT) is very efficient at performance/area and performance/watt when compared against a similar Intel CPU design (Core i7 3820) on the same 32nm node.

Comparing TDPs is disingenuous; where are the real-world power numbers? What about single-threaded performance/watt or performance/area? What about games and other less multi-threaded applications?

Also, why not compare it to an i7-3770K? Why does it matter, for comparison purposes or to a consumer, that the 8350 is more "similar" in size and transistor count to the i7-3820?
 

mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
Comparing TDPs is disingenuous; where are the real-world power numbers? What about single-threaded performance/watt or performance/area? What about games and other less multi-threaded applications?

Don't even try. Time to market is a lost dimension to Atenra.
 

ThePeasant

Member
May 20, 2011
36
0
0
IDC posted here a few days ago that CMT shouldn't impact single-threaded performance; it is when sharing resources that IPC should take a hit. If this assumption is correct, then we cannot rule out poor/misguided design.

I've already said that the additional decoder should only help in multi-threaded scenarios, unless they improved each decoder.

Bulldozer's lower IPC would be a consequence of the design trade-offs of the architecture, not a CMT restriction.

We mostly agree. CMT inherently reduces IPC because fewer resources are available to each thread at any time. However, I don't think the concept of CMT is fundamentally a bad one; I just think the implementation in BD is poor.

Where I still think CMT has a negative impact is that you need to keep units sitting idle if you have no threads to feed them. While Intel's i3 chips need only two threads to have most of their units working, Trinity would need at least 4. This is what I see as one of the main sources of Bulldozer's inefficiencies.

This is where we disagree. I believe BD was designed for multi-threaded workloads and some per-thread performance was sacrificed for overall throughput and efficiency per CMT module. So in that regard, needing 2 threads per module to fully utilize the resources was the intention and the expectation. The reasoning (which I find reasonable) is that of diminishing returns and increasing complexity. An 8-wide core is more complex than two 4-wide cores, and an 8-wide core offers diminishing returns to a single thread compared to a 4-wide core. What they probably didn't count on was BD's relatively weak performance even when it is saturated with threads.

Even at the same width, a Bulldozer module still needs a significant frequency advantage to outperform SB, and that tells me either that those ports are not being fed adequately or that scheduling restrictions limit the effective width of the core. I'm inclined to believe it's a mixture of both, and that with the split decoders they are attempting to address the former.
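
A minimal sketch (Python) of that frequency-advantage point; the IPC figures are hypothetical placeholders, not measured values:

Code:
# perf ~ IPC * frequency, so the clock needed to match a rival scales
# with the IPC deficit. The IPC numbers below are hypothetical.
def freq_needed_to_match(rival_ipc, rival_ghz, own_ipc):
    return rival_ipc * rival_ghz / own_ipc

# If a module sustained ~70% of SB's per-thread IPC (made-up figure),
# matching a 3.4 GHz SB part would need roughly:
print(f"{freq_needed_to_match(rival_ipc=1.0, rival_ghz=3.4, own_ipc=0.7):.1f} GHz")  # ~4.9 GHz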
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,361
136
Comparing TDPs is disingenuous; where are the real-world power numbers? What about single-threaded performance/watt or performance/area? What about games and other less multi-threaded applications?

Also, why not compare it to an i7-3770K? Why does it matter, for comparison purposes or to a consumer, that the 8350 is more "similar" in size and transistor count to the i7-3820?

1: I didn't compare the TDPs, I only provided the specs of each CPU.

2: In most MT applications, like x264, PovRay, 7zip, AES and more, the FX8350 is faster than the Core i7 3820. The FX8350 uses more power but it finishes the job quicker, which makes them almost equal in energy used (see the sketch after point 3). The FX will be faster while using a little more power, and the Core i7 3820 will be slower while consuming less power.

3: We compare architectures, not process nodes. Therefore we have to compare both at 32nm. At the same process node, with the same CPU characteristics (Core i7 3820), the FX8350 is equal - faster at some things, losing at others.
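
A minimal sketch (Python) of the energy point in 2; the power and runtime figures are hypothetical placeholders, only meant to show how "more power but faster" can even out:

Code:
# Energy = average power * runtime. The figures below are hypothetical,
# chosen only to illustrate the faster-but-hungrier trade-off.
def energy_wh(avg_power_w, runtime_s):
    return avg_power_w * runtime_s / 3600.0

fast_hot  = energy_wh(avg_power_w=140, runtime_s=100)  # finishes sooner, draws more
slow_cool = energy_wh(avg_power_w=115, runtime_s=122)  # draws less, runs longer
print(f"{fast_hot:.2f} Wh vs {slow_cool:.2f} Wh")       # ~3.89 Wh vs ~3.90 Wh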
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,361
136
Don't even try. Time to market is a lost dimension to Atenra.

You didn't compare products; you made a statement that BD as an architecture is not efficient in performance/size and performance/watt. Time to market has no relevance here.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,361
136
We mostly agree. CMT inherently reduces IPC because fewer resources are available to each thread at any time. However, I don't think the concept of CMT is fundamentally a bad one; I just think the implementation in BD is poor.

CMT doesn't reduce IPC; by your logic, HT reduces IPC in Intel CPUs. CMT or HT increase the performance/throughput of the CPU.
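
A minimal sketch (Python, hypothetical numbers) of the per-thread IPC versus total throughput distinction being argued here:

Code:
# Hypothetical illustration: sharing a module (CMT) or a core (HT) can
# lower *per-thread* IPC while still raising *total* instructions per clock.
single_thread_ipc = 1.0           # made-up baseline, one thread running alone
per_thread_ipc_shared = 0.8       # made-up: each thread slows a bit when sharing
total_shared = 2 * per_thread_ipc_shared

print(f"alone: {single_thread_ipc} IPC; shared: {per_thread_ipc_shared} IPC/thread, "
      f"{total_shared} IPC total")  # per-thread drops, aggregate rises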
 

mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
You didn't compare products; you made a statement that BD as an architecture is not efficient in performance/size and performance/watt. Time to market has no relevance here.

How do you take a niche product like the 3820, which exists only because of its huge cache (faster than AMD's, for that matter), and then reach the brilliant conclusion that Bulldozer is a very efficient architecture? And this after artfully cherry-picking the scenario - why else would you mention multi-threaded workloads?

This is raw intellectual dishonesty at work. Go find someone else to troll, not me or my posts.

Ed: And TTM has EVERYTHING to do with this. You have to compare IVB, not SNB, with Bulldozer/PD chips. BD's main competitor isn't a niche product like the 3820.
 
Last edited:

mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
I've already said that the additional decoder should only help in multi-threaded scenarios, unless they improved each decoder.

The point of a bigger decoder is to be able to issue more instructions per clock from a single core - how is this not aimed at improving the performance of a single core?
 