AMD on track for launch of Kaveri in February 2014


PPB

Golden Member
Jul 5, 2013
1,118
168
106
That's why I said "So the big IPC boost at the cost of clocks might be discarded already."

You are trying to dismiss something that wasn't even presented as a fact in the first place. But hey, it's not as if almost everyone on this very subforum isn't doing that already.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,361
136
I believe the primary concern has been that electrical performance would regress as a result of the move from 32nm PDSOI to 28nm bulk, resulting in lower clocks.

Not necessarily directed at you, but something to note as it is commonly misunderstood:
I should also point out that the majority of the IPC increases will be seen when loading the second core in a module, given Steamroller's design. The additional L1D cache and doubled decode hardware is there to keep the execution engines fed when more than one thread is loaded. Bulldozer and Piledriver experience poor scaling when moving from 1T -> 2T within the same module. The main goal of Steamroller is to address this.

What you won't see, based on the information given to us about Steamroller, is a large increase in single threaded performance. You likely won't see 8 decoders being tossed on a single thread, for example, as it would be extremely inefficient. I have no doubts that single threaded performance will improve, but it is important to note that the major changes there will be coming with Excavator.

To reiterate, the largest gains in the Steamroller architecture will be with multi-threaded performance, not single-threaded. At least based off the information we've been given, it will still have relatively weak lightly-threaded performance, however it should achieve the good heavily-threaded performance that AMD intended to gain when they switched from Phenom to Bulldozer.

All those enhancements will raise IPC as well as multi-threaded performance:

http://www.anandtech.com/show/6201/amd-details-its-3rd-gen-steamroller-architecture


 

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
All those enhancements will raise IPC as well as multi-threaded performance
No. Almost all of the information in the first slide applies only to the second thread. That's what they mean by "No Compromises Two Thread Performance."

They do have changes that will benefit single threaded performance as well, but they're not the focus of Steamroller.
 
Aug 11, 2008
10,451
642
126
http://www.linkedin.com/in/shardendushekhar


Discrediting something without even reading it... priceless :awe:

I was trying to be nice and admit I made a mistake. If you want to be snide about it, I will state that what is basically a personal resume is not a particularly reliable source of technical detail. What do you expect him to say, "I designed a chip that could not reach its desired clockspeed"?
 

Ajay

Lifer
Jan 8, 2001
16,094
8,106
136
That's why I said "So the big IPC boost at the cost of clocks might be discarded already."

You are trying to dismiss something that wasn't even presented as a fact in the first place. But hey, it's not as if almost everyone on this very subforum isn't doing that already.

Sorry, you lost me. Not that that would be very difficult to do today.
 

inf64

Diamond Member
Mar 11, 2011
3,765
4,223
136
No. Almost all of the information in the first slide applies only to the second thread. That's what they mean by "No Compromises Two Thread Performance."

They do have changes that will benefit single threaded performance as well, but they're not the focus of Steamroller.
You do realize that the title of that 2nd slide contradicts what you are saying?



"STEAMROLLER": improving single-core execution.

Everything in the slide is about improving how the core receives and handles data.

The "no compromise 2 thread performance" was obviously (in the other slide) a reference to the 8 ops/cycle decode capability, which is a 2x improvement over BD/PD.

SR should be a big improvement over BD and PD. Heck, PD brought on average 7-10% more IPC and it was just a souped-up Bulldozer. This thing is PD on steroids.
 

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
You do realize that the title of that 2nd slide contradicts what you are saying?
You do realize that the second slide in no way conflicts with what I am saying? The transistor count being dedicated to single-threaded performance is a pittance compared to the extra decode and instruction cache.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,106
136
You do realize that the title of that 2nd slide contradicts what you are saying?



"STEAMROLLER": improving single-core execution.

Everything in the slide is about improving how the core receives and handles data.

The "no compromise 2 thread performance" was obviously (in the other slide) a reference to the 8 ops/cycle decode capability, which is a 2x improvement over BD/PD.

SR should be a big improvement over BD and PD. Heck, PD brought on average 7-10% more IPC and it was just a souped-up Bulldozer. This thing is PD on steroids.

I do expect a significant bump in IPC from Steamroller; they'll likely need it to make up for a probable loss in clock speed. But that slide is self-contradictory: the contents do not address the title. That's marketing for ya :awe:

The funny thing, in a way, is that SR is finally becoming the low end server core that BD should have been (if they were releasing it with 6-8 cores on an SHP node).
 

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
Since you guys seem to be so lost, let me show you what I'm talking about:



See the crappy scaling you get when you load the second core in a module? That's what Steamroller's primary aim is to fix. You'll get single-threaded improvements in addition to the scaling fixes -- nowhere have I stated otherwise.

Even if I were somewhat generous and estimated that SR pushes single-threaded IPC up 20%, it's not much compared to the 50+% improvement you'd see when loading the second thread in a module (Disclaimer: these are arbitrary numbers to demonstrate a point).

Bulldozer's CMT was supposed to be the ultimate compromise between area and performance, and as everyone can recall, it was a total flop at achieving that. Steamroller fixes that, however it is at a significant areal cost.

AMD was hoping that they could get by with feeding two integer cores and a floating point unit with a single 4-wide decode unit and 64KB of instruction cache. Clearly their hopes did not come true, and Steamroller brings things back to reality. They are still shaving costs off a bit by sharing some hardware, like the BPU, and the hope is that Steamroller offers the best compromise between low costs and performance.
 

NTMBK

Lifer
Nov 14, 2011
10,269
5,134
136
No. Almost all of the information in the first slide applies only to the second thread. That's what they mean by "No Compromises Two Thread Performance."

They do have changes that will benefit single threaded performance as well, but they're not the focus of Steamroller.

How does a larger I-cache and better branch prediction not help single threaded performance?
 

NTMBK

Lifer
Nov 14, 2011
10,269
5,134
136
If you want to be snide about it, I will state that what is basically a personal resume is not a particularly reliable source of technical detail. What do you expect him to say, "I designed a chip that could not reach its desired clockspeed"?

Oddly enough you don't see the Phenom I crop up on many resumes :awe:
 

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
How does a larger I-cache and better branch prediction not help single threaded performance?
The instruction cache is in 48KB partitions nestled up to each decode. It's unlikely that you'd see 96KB feeding a single thread. It's possible, I'm sure, but I'm going to bet that the second partition is powered off when only a single thread is being run. As a result, there might be some slight performance regression from going from 64KB -> 48KB, but it'll obviously be masked by all of the other changes going into SR.

As far as branch prediction goes, that's why I said "almost all." Better branch prediction certainly would help for branches
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,361
136
Since you guys seem to be so lost, let me show you what I'm talking about:



See the crappy scaling you get when you load the second core in a module? That's what Steamroller's primary aim is to fix. You'll get single-threaded improvements in addition to the scaling fixes -- nowhere have I stated otherwise.

Even if I were somewhat generous and estimated that SR pushes single-threaded IPC up 20%, it's not much compared to the 50+% improvement you'd see when loading the second thread in a module (Disclaimer: these are arbitrary numbers to demonstrate a point).

Bulldozer's CMT was supposed to be the ultimate compromise between area and performance, and as everyone can recall, it was a total flop at achieving that. Steamroller fixes that, however it is at a significant areal cost.

AMD was hoping that they could get by with feeding two integer cores and a floating point unit with a single 4-wide decode unit and 64KB of instruction cache. Clearly their hopes did not come true, and Steamroller brings things back to reality. They are still shaving costs off a bit by sharing some hardware, like the BPU, and the hope is that Steamroller offers the best compromise between low costs and performance.

Are you sure? Second-thread scaling is shown to be close to the 80% that AMD claimed. I don't see why you say it was a flop.
 

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
So the flop thing was for Bulldozer in general and not the CMT scaling ??
Both. It doesn't matter if AMD achieved their scaling goal (that's not what I was even talking about in the part you highlighted, and remember that an 8150 would have fared worse). When you set the bar that low...

It's pretty obvious that AMD sacrificed far too much performance in the pursuit of cost savings. Their attempt to compete with Intel on performance/dollar blew up in their face. Regardless of whether or not they achieved their measly scaling goal, it severely tarnished their brand.

The end result was a chip that was decidedly less efficient than their previous architecture from an areal standpoint. AMD could have easily ported Thuban with updated ISA support to the 32nm process and would have come out well ahead of Bulldozer, from a perf/mm2 standpoint, and I guarantee they'd have come out significantly farther ahead on R&D costs as well. It took them an extra year to actually make some progress.

Piledriver definitely fixed a lot of the issues that Bulldozer had, but it's run into a wall. How do you fix single threaded performance if you're decode bound? You could drive up clock speed, which they've already done with Richland. You could increase the capability of the execution units, but then scaling suffers significantly when you load the second core. Obviously there are substantial inefficiencies still left in the pipeline that you could work on, but you're still going to run into that decode and I-cache wall.

The obvious answer is to expand the decode capability and increase the size of the instruction cache, which is exactly what Steamroller's doing. The work going into Steamroller is predominantly focused on the front end: branch prediction, pre-fetch, dispatch, a 50% larger instruction cache and doubled decode. The last two changes alone probably account for around half or more of the die real estate being spent in Steamroller, and they're solely being implemented to improve second-thread scaling -- they won't help single-threaded performance at all. This is why I've been stating that the focus of Steamroller is to improve second-thread scaling -- that "30% ops per cycle improvement" is all about the second thread.

That doesn't mean that Steamroller won't move single-threaded performance forward -- it absolutely will. There's so much garbage left to clean up from Bulldozer that it's virtually impossible not to improve things. The front end improvements alone will account for a sizable improvement in single-threaded performance, and there are some pretty significant changes going into the execution units as well. However, Excavator will bring the biggest changes there.

Even if I were optimistic and predicted that Steamroller gives a 30% improvement to single-threaded performance, it'd still be behind Intel by a good margin, and I'm also worried about the unfixed memory latency issue, which will only become a larger bottleneck with Steamroller. Where you will see Steamroller show its tenacity is with multithreading, but this is where we start getting into the land of unknowns. Loading up a second core will take up more power than it did with Piledriver, so in thermally constrained designs like mobile Kaveri, performance will be very much up to how well AMD is able to keep power draw down. You'd all better hope that RCM can deliver.
 

NTMBK

Lifer
Nov 14, 2011
10,269
5,134
136
The end result was a chip that was decidedly less efficient than their previous architecture from an areal standpoint. AMD could have easily ported Thuban with updated ISA support to the 32nm process and would have come out well ahead of Bulldozer, from a perf/mm2 standpoint, and I guarantee they'd have come out significantly farther ahead on R&D costs as well. It took them an extra year to actually make some progress.

I'm so sick of hearing this. If you want a comparison of (improved) Stars cores and Piledriver cores, take a look at Llano vs Trinity. On the same process, Llano was 228mm2 and Trinity was 246mm2- and Trinity had a considerably higher proportion of its die devoted to iGPU.

Llano:



Trinity:



And in CPU benchmarks within the same power envelope, Trinity generally performed significantly better than Llano. There were only a handful of performance regressions, such as Cinebench multi-threaded, but the majority of cases saw healthy performance increases. A shrunk Thuban was not the answer to AMD's problems- if it was, they would have kept iterating on the Llano core instead of pushing out Trinity and Richland.
 

PPB

Golden Member
Jul 5, 2013
1,118
168
106
CMT is not what made Bulldozer a flop. The crappy IPC is what made Bulldozer a flop. CMT is a great idea, because it actually worked as intended. SR is moving a little away from CMT, but by no means are they dropping it.

SMT, on the other hand, is a lot trickier with its scaling, and it might just hurt your performance in some applications. Luckily, Intel has such an efficient uArch that it can make us quickly forget about their SMT being mediocre, at best.
 

inf64

Diamond Member
Mar 11, 2011
3,765
4,223
136
Good post NTMBK :thumbsup:
Trinity (even more so Richland) is a very solid x86 improvement over the improved K10 core, with a much faster iGPU, all on the same process and with a die that is just ~8% larger.
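
For anyone who wants to check the ~8% figure, here is a quick sketch using the die sizes quoted earlier in the thread (228mm2 for Llano, 246mm2 for Trinity). The function and constant names are made up for this example, and the calculation only compares total die area; it says nothing about how that area is split between CPU and iGPU.

```python
def relative_area_increase(old_mm2: float, new_mm2: float) -> float:
    """Return the fractional die-area increase going from old to new."""
    return (new_mm2 - old_mm2) / old_mm2


# Die sizes as quoted in the thread (both chips on the same process).
LLANO_MM2 = 228.0
TRINITY_MM2 = 246.0

increase = relative_area_increase(LLANO_MM2, TRINITY_MM2)
print(f"Trinity die is {increase:.1%} larger than Llano")
# prints "Trinity die is 7.9% larger than Llano"
```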

SR should be a big perf. and perf./mm^2 jump from Trinity since the die is basically the same (or even smaller?) while the GPU and CPU parts will be faster by a good amount.
 

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
I'm so sick of hearing this. If you want a comparison of (improved) Stars cores and Piledriver cores, take a look at Llano vs Trinity. On the same process, Llano was 228mm2 and Trinity was 246mm2- and Trinity had a considerably higher proportion of its die devoted to iGPU.
"It took them an extra year to actually make some progress"
And in CPU benchmarks within the same power envelope, Trinity generally performed significantly better than Llano. There were only a handful of performance regressions, such as Cinebench multi-threaded, but the majority of cases saw healthy performance increases. A shrunk Thuban was not the answer to AMD's problems- if it was, they would have kept iterating on the Llano core instead of pushing out Trinity and Richland.
I never said it was the answer, however it would have certainly been a better solution than Zambezi was. The fact that they didn't go with Bulldozer cores in Llano only reinforces my point.
CMT is not what made Bulldozer a flop. The crappy IPC is what made Bulldozer a flop. CMT is a great idea, because it actually worked as intended. SR is moving a little away from CMT, but by no means are they dropping it.

SMT, on the other hand, is a lot trickier with its scaling, and it might just hurt your performance in some applications. Luckily, Intel has such an efficient uArch that it can make us quickly forget about their SMT being mediocre, at best.
You clearly did not read my post.
Good post NTMBK :thumbsup:
You think a post that completely missed my point was a good one?
 

inf64

Diamond Member
Mar 11, 2011
3,765
4,223
136
Nope I didn't miss anything. But I'm amazed how you think you are smarter than so many engineers that worked on Llano's successor. Maybe apply for a job at AMD?
 

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
Nope I didn't miss anything. But I'm amazed how you think you are smarter than so many engineers that worked on Llano's successor. Maybe apply for a job at AMD?
Good grief, you really are lost. Why is the sentence "it took them an extra year to actually make some progress" so difficult for the both of you to understand?
 

inf64

Diamond Member
Mar 11, 2011
3,765
4,223
136
Good that we have a "shepherd" like you to show us the "right" way... Nice personal attack there btw. Keep posting your dreams about what AMD should have done, I'm sure they are reading all this and taking very careful notes.
 