AMD Q4/2013 Desktop Roadmap


Ajay

Lifer
Jan 8, 2001
16,094
8,109
136
The "NB" is integrated into the APU die, and has been for years. Any extra memory bandwidth would need to be attached through the FM2+ socket... which doesn't have the pins for it.

D'oh! I thought AMD still used a north bridge. Thanks for making that clear.
 

mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
Since when has AMD been more interested in desktops? Could you provide a link to this hint Rory dropped? Or are you basing this on Kaveri launching on desktops first? I find it very hard to believe the desktop side of the business is showing more promise than mobile.

AMD Q313 Q&A said:
http://seekingalpha.com/article/175...anscript?all=true&find=advanced+micro+devices

Romit Shah - Nomura
And the competing results are my biggest concern looking at the quarter, and Rory, you talked [ph] to the weakness to just general softness in consumer notebooks. But if I look at your numbers, the computing business was down 6% sequentially, so desktops were up, which means notebooks are probably down more than 10% and I compare that to Intel whose PC business was up, I think mid single digits and are you seeing Gartner were also up. So how do we reconcile the difference between AMD’s consumer business, notebook business and Intel and just the general market?

Rory Read - President and CEO
I think you kind of summed it up properly. The consumer market is feeling more pressure, but all parts of the PC markets are down. This market, this industry is down 10% and at rates it’s never experienced before. It’s going to continue.

From our perspective, AMD is over index the client notebook. We have always been. We have had, just like we have to diversify our portfolio across high-growth segment, we need to diversify this core business. We need to move stronger into desktop and as we talked about a year ago, we worked on the inventory on the desktop segment and we built and repaired that and we have seen two quarters of consistent revenue growth in that segment and we believe that we have the right product stack to continue to make progress and that’s part of our business as well.


2M/4C makes more sense anyway with Steamroller removing the CMT penalty. The only real use for octo-core FXs is being able to load one execution unit per module to avoid said penalty.

To actually remove the CMT penalty AMD would have to get rid of CMT itself, no? If you are talking about mitigating the CMT penalty then that would mean beefier cores and less sharing, making it much more like a conventional architecture.

The L3 is useless for consumer workloads (from AMD's own mouth), so what's the point in producing those enormous dies when they'll only sell for $200?

I think it's correct to say that AMD's slow L3 cache is useless for consumer workloads.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,109
136
To actually remove the CMT penalty AMD would have to get rid of CMT itself, no? If you are talking about mitigating the CMT penalty then that would mean beefier cores and less sharing, making it much more like a conventional architecture.

Right. SR just removes the excess penalty being imposed on the CMT module. The CMT penalty is still there; it will now just be as it should have been, and module loading order shouldn't matter either with SR.

I think it's correct to say that AMD's slow L3 cache is useless for consumer workloads.

Only because AMD wants to move stream-processing apps to the iGPU, IMHO. Other than that, there are not a lot of apps in general use that are FP-heavy, but those that are do suffer from having no L3$. Hopefully AMD's L2$ didn't lose much performance going to 4 MB (though that's hard to believe, given cache performance wasn't one of AMD's strengths).

PS Thanks for the link.
 

NaroonGTX

Member
Nov 6, 2013
106
0
76
The only things "shared" by a module in CMT are the L2 cache, the decode unit (no longer, as of SR and beyond), and the fetch unit. SR also introduces dynamic resizing for the L2, I would assume to help with lowering power usage and such. So there aren't really any big "penalties" left in CMT; the biggest bottleneck -- the shared decode unit, which was too narrow -- has been removed with SR. MT performance will no longer be gimped so much, hence AMD listing Steamroller as having "greater parallelism" in the various slides.
 

NaroonGTX

Member
Nov 6, 2013
106
0
76
FPU is shared in a sense. The FPU is actually 2x 128-bit FMACs which can work independently of each other, or combine as 1x 256-bit FMAC for AVX instructions. In comparison, K10 had 1x128bit FPU per logical core.
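As a back-of-the-envelope illustration of what that sharing means for peak throughput (a sketch only, using the unit counts discussed in this thread and the usual convention that an FMA counts as two FLOPs):

```python
# Toy peak single-precision FLOPs/cycle comparison: one Bulldozer
# module (2x 128-bit FMACs) vs. two K10 cores (separate 128-bit
# FADD and FMUL pipes each). An FMA counts as 2 FLOPs (mul + add).
SP_LANES_128 = 128 // 32          # 4 SP floats per 128-bit unit

# Bulldozer module: 2 FMACs x 4 lanes x 2 FLOPs per fused mul-add.
bd_module = 2 * SP_LANES_128 * 2  # = 16 SP FLOPs/cycle

# Two K10 cores: per core, one ADD pipe + one MUL pipe, 1 FLOP/lane.
k10_two_cores = 2 * (SP_LANES_128 + SP_LANES_128)  # = 16 SP FLOPs/cycle

print(bd_module, k10_two_cores)  # 16 16
```

The peaks match, which is why the flexibility argument matters: the BD module hits its peak on any mix of ADD/MUL/FMA work, whereas K10 only reaches its peak with a balanced ADD/MUL instruction mix.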
 

Abwx

Lifer
Apr 2, 2011
11,543
4,327
136
K10 has two 128-bit FP pipelines, but they are not as flexible as BD/PD FP units: the BD/PD units can each execute all kinds of operations, while K10 has one unit for FP ADD and another for FP MUL, hence it takes two K10 cores to barely match a single BD FP unit.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,109
136
FPU and SIMDs are also shared.

Great Chart!

FPU is shared in a sense. The FPU is actually 2x 128-bit FMACs which can work independently of each other, or combine as 1x 256-bit FMAC for AVX instructions.

Depends on the decode and dispatch capabilities of SR. I don't know if SR can decode and dispatch up to 5+ uops simultaneously (the int cores have 4 ports). Also, do you know if the FPU can start 1 FP + 1 MMX op simultaneously?
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,706
1,232
136
Depends on the decode and dispatch capabilities of SR. I don't know if SR can decode and dispatch up to 5+ uops simultaneously (the int cores have 4 ports). Also, do you know if the FPU can start 1 FP + 1 MMX op simultaneously?
Bulldozer = 4+ macro-ops to one different core every other cycle
Steamroller = 8+ macro-ops to both cores every other cycle.

x86 Bulldozer cores have 4 ports, 2 EX and 2 AGLUs.
x86 Steamroller cores have 8 ports, 4 EX and 4 AGLUs.

Depending on the instructions used, you can have all four pipes in the Bulldozer/Piledriver FPU in use, while in Steamroller only deprecated and unit-specific instructions will prevent maximizing the FPU.
 
Last edited:

NostaSeronx

Diamond Member
Sep 18, 2011
3,706
1,232
136
What we know about PUMA+ cores right now:
IOMMU 2.0 (Windows-only HSA)
Switchable Graphics V7, up from V5.5

Massively improved power and thermal control, which led to a 2x increase in perf/watt.
CPU & GPU both have boost now.
 

mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
Only because AMD wants to move stream-processing apps to the iGPU, IMHO. Other than that, there are not a lot of apps in general use that are FP-heavy, but those that are do suffer from having no L3$. Hopefully AMD's L2$ didn't lose much performance going to 4 MB (though that's hard to believe, given cache performance wasn't one of AMD's strengths).

I'm rather curious to see whether AMD improved cache hit rates or if they went for yet another brute force solution, just throwing more transistors at the problem.
 

sniffin

Member
Jun 29, 2013
141
22
81
Only because AMD wants to move stream-processing apps to the iGPU, IMHO. Other than that, there are not a lot of apps in general use that are FP-heavy, but those that are do suffer from having no L3$. Hopefully AMD's L2$ didn't lose much performance going to 4 MB (though that's hard to believe, given cache performance wasn't one of AMD's strengths).

I haven't really seen any hard evidence of there being any real penalty. I don't remember Trinity vs. Zambezi comparisons showing any regression. AMD did mention that L3 is useful for servers, but wasn't specific about what kind of server workloads.

Also, Trinity/Richland already have 4MB of L2 (2MB/module), so they aren't losing anything there.

I'm rather curious to see whether AMD improved cache hit rates or if they went for yet another brute force solution, just throwing more transistors at the problem.

Aside from dynamic resizing of the L2, I don't think the cache has been touched.

From our perspective, AMD is over index the client notebook.
That is such a weird expression. I always took "over-index" to mean that you've got more than enough of something. If that's what he means, then he's saying they have their bases more than covered and can focus elsewhere, which is obviously wrong, as they aren't exactly killing it in mobile right now. AMD needs Kaveri in the mobile space pretty badly, actually. Speak English, Rory.
 
Last edited:

inf64

Diamond Member
Mar 11, 2011
3,864
4,546
136
x86 Steamroller cores have 8 ports, 4 EX and 4 AGLUs.
Where did you get this? It is not in agreement with the data we have so far (given publicly by AMD), nor do we have any GCC updates that support such a pipeline organization.

What would be correct to say is that x86 Steamroller modules have 8 ports, 4 EX and 4 AGLUs.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,923
3,549
136
FPU is shared in a sense. The FPU is actually 2x 128-bit FMACs which can work independently of each other, or combine as 1x 256-bit FMAC for AVX instructions. In comparison, K10 had 1x128bit FPU per logical core.

This is completely incorrect.

K10 had 3x 128-bit FPU/SIMD units per core: one ADD, one MUL, and one MISC.
Bulldozer has 2x 128-bit FMA FPUs and 2x 128-bit SIMD/MMX units per module.

So Bulldozer has fewer peak FP resources per core than K10, but more flexibility. Bulldozer is very strong in mixed FP/SIMD workloads, where it can get good execution across all 4 units.

Also, Bulldozer doesn't do 256-bit ops across both FMA units; that would create scheduling hell. It does 256-bit ops as two instructions over the same 128-bit unit. Remember, the only difference between 128-bit and 256-bit here is 4 vs. 8 32-bit floats.
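The point both sides agree on (per the Agner Fog quote later in the thread) is that Bulldozer splits each 256-bit AVX vector into two 128-bit halves internally. A toy model of that split, with invented names, purely for illustration:

```python
# Toy model of a 256-bit AVX add on Bulldozer: the 256-bit vector
# (8 SP floats) is split into two 128-bit halves (4 floats each),
# each half is handled as its own internal 128-bit op, and the
# results are concatenated. Function names here are made up.
def add_128(a4, b4):
    """One 128-bit internal op: lane-wise add of 4 floats."""
    return [x + y for x, y in zip(a4, b4)]

def avx_add_256(a8, b8):
    """A single 256-bit AVX add, issued as two 128-bit internal ops."""
    lo = add_128(a8[:4], b8[:4])   # lower 128-bit half
    hi = add_128(a8[4:], b8[4:])   # upper 128-bit half
    return lo + hi

a = [1.0] * 8
b = [2.0] * 8
print(avx_add_256(a, b))  # identical to a full-width 256-bit add
```

The split itself is uncontroversial; the disagreement above is only about whether the two halves occupy one 128-bit unit twice or both units at once.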
 

Abwx

Lifer
Apr 2, 2011
11,543
4,327
136
This is completely incorrect.

K10 had 3x 128-bit FPU/SIMD units per core: one ADD, one MUL, and one MISC.
Bulldozer has 2x 128-bit FMA FPUs and 2x 128-bit SIMD/MMX units per module.

So Bulldozer has fewer peak FP resources per core than K10, but more flexibility. Bulldozer is very strong in mixed FP/SIMD workloads, where it can get good execution across all 4 units.

BD units can each do both FMUL and FADD, while K10 units are capable of only one kind of operation per unit, not counting FMA, which doubles the throughput in some instances.

Also, Bulldozer doesn't do 256-bit ops across both FMA units; that would create scheduling hell. It does 256-bit ops as two instructions over the same 128-bit unit. Remember, the only difference between 128-bit and 256-bit here is 4 vs. 8 32-bit floats.

A 256-bit op is executed as a single instruction using the two 128-bit units, which can be used as a single 256-bit exe unit; besides, FMA has nothing to do with AVX.
 

mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
That is such a weird expression. I always took "over-index" to mean that you've got more than enough of something. If that's what he means, then he's saying they have their bases more than covered and can focus elsewhere, which is obviously wrong, as they aren't exactly killing it in mobile right now. AMD needs Kaveri in the mobile space pretty badly, actually. Speak English, Rory.

That's Rory's style; he never gives a straight answer. To be fair to him, there isn't much he can do about the mobile market with their current product portfolio, is there?

Intel is rolling out Haswell, and to counter it AMD only has the same Richland lineup that was already losing market share to Ivy Bridge, plus Jaguar, which wasn't making strides in the mobile market before BT launched and now faces strong competition from it. In the future they can show Steamroller and Puma, but Intel will counter with Airmont and Broadwell on a fancy 14nm node with 30% less power consumption than the already-good 22nm, while AMD will be stuck at 28nm.

So yes, they are in a pretty bad spot right now, and that will only get worse, which means they will lose market share. And instead of saying this loud and clear, Rory prefers to get away with "AMD is over index the client notebook" and point to desktop as a growth opportunity.

What he didn't say is that while the PC market is going down, the desktop market is also losing share to mobile, so even if AMD can post healthy growth in desktops for the next few quarters, they may end up selling fewer chips than before.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,923
3,549
136
BD units can each do both FMUL and FADD, while K10 units are capable of only one kind of operation per unit, not counting FMA, which doubles the throughput in some instances.
Which is why I said FMA and more flexible :awe: Being able to execute FMA but not MUL or ADD would be going back a long way in CPU history.

A 256-bit op is executed as a single instruction using the two 128-bit units, which can be used as a single 256-bit exe unit; besides, FMA has nothing to do with AVX.
One x86 op, two internal ops (MOPs): they are broken into two MOPs and executed as two 128-bit ops, just like SSE was done on A64 (2x64 for 128). Also, if you go back and watch the original Bulldozer Hot Chips presentation, the presenter specifically says that the FPU is double-pumped for 256-bit ops.

Also, I never said FMA had anything to do with AVX (not that I said anything about AVX); I was using it as a descriptor for the execution unit itself.

I don't know why you quoted me, because nothing you said has anything to do with what I said, and where you tried to tell me I was wrong, you were flat-out wrong.

But if you don't believe me, I will quote the guru:
http://www.agner.org/optimize/blog/read.php?i=187
Supports AVX instructions. Intel announced the AVX instruction set extension in 2008 and the AMD designers have had very little time to change their plans for the Bulldozer to support the new 256-bit vectors defined by AVX. The Bulldozer splits each 256-bit vector into two 128-bit vectors, as expected, but the throughput is still good because most floating point execution units are doubled so that the two parts can be processed simultaneously.
 

Abwx

Lifer
Apr 2, 2011
11,543
4,327
136
I don't know why you quoted me, because nothing you said has anything to do with what I said, and where you tried to tell me I was wrong, you were flat-out wrong.

But if you don't believe me, I will quote the guru:
http://www.agner.org/optimize/blog/read.php?i=187

Among other things, I disagreed with this:

Also, Bulldozer doesn't do 256-bit ops across both FMA units; that would create scheduling hell. It does 256-bit ops as two instructions over the same 128-bit unit. Remember, the only difference between 128-bit and 256-bit here is 4 vs. 8 32-bit floats.

A 256-bit op is executed as a single instruction using the two 128-bit units, which can be used as a single 256-bit exe unit; besides, FMA has nothing to do with AVX.

As per Agner Fog's very quote:

Supports AVX instructions. Intel announced the AVX instruction set extension in 2008 and the AMD designers have had very little time to change their plans for the Bulldozer to support the new 256-bit vectors defined by AVX. The Bulldozer splits each 256-bit vector into two 128-bit vectors, as expected, but the throughput is still good because most floating point execution units are doubled so that the two parts can be processed simultaneously.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,109
136
I haven't really seen any hard evidence of there being any real penalty. I don't remember Trinity vs. Zambezi comparisons showing any regression. AMD did mention that L3 is useful for servers, but wasn't specific about what kind of server workloads.

BD/PD partitioned the L3$ into a 2MB/module cache, the same size as the L2$, so the advantage of having a large L3$ was somewhat lessened. This makes their design choice even more confusing, since the L3$ was slow, as was the IMC. I can't recall the exclusivity/inclusivity of the L3$, but it wouldn't have been very effective if it was being used as a victim cache at that size ratio (1:1).

Also, Trinity/Richland already have 4MB of L2 (2MB/module), so they aren't losing anything there.

Aside from dynamic resizing of the L2, I don't think the cache has been touched.

Well, the cache design, at the very least, had to be ported to 28nm. If AMD didn't have, or didn't take, the time to improve it, then that is unfortunate. I must have read something recently about a 1-module Richland part, hence my error on cache size. Thanks for the correction.
 

sniffin

Member
Jun 29, 2013
141
22
81
The changes they've made are probably all they could do in the time they were given. The cache seems like a huge mess, probably too big a job. If Excavator is the last big core they plan to design, they probably won't bother attempting to fix it at all.

As for L3, it would be interesting to have on an APU if it was also addressable by the GPU
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,706
1,232
136
This makes their design choice even more confusing since L3$ was slow as was the IMC.
Bandwidth-wise it was slower than the IMC, while latency-wise it was faster than the IMC.
What would be correct to say is that x86 Steamroller cores have 8 ports, 4 EX and 4 AGLUs.
fixed.
If Excavator is the last big core they plan to design they probably won't bother attempting to fix it at all.
There will be two more 15h architectures after Excavator.
As for L3, it would be interesting to have on an APU if it was also addressable by the GPU
The two L2 caches from the CPU and GPU are coherent by a 256-bit interconnect.
 
Last edited:

sefsefsefsef

Senior member
Jun 21, 2007
218
1
71
BD/PD partitioned the L3$ into a 2MB/module cache, the same size as the L2$, so the advantage of having a large L3$ was somewhat lessened. This makes their design choice even more confusing, since the L3$ was slow, as was the IMC. I can't recall the exclusivity/inclusivity of the L3$, but it wouldn't have been very effective if it was being used as a victim cache at that size ratio (1:1).

Zambezi's and Vishera's L3 is shared between all cores and is a "mostly exclusive" cache, meaning it basically is a big victim cache for the L2s. What do you mean that a 1:1 size ratio is bad for a victim cache? Using a 1:1-sized L3 as anything but a victim cache would be absurd.
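For anyone unfamiliar with the term, a victim-cache L3 can be sketched in a few lines of Python (a deliberately simplified toy: fully associative, LRU, made-up sizes, nothing like AMD's actual implementation):

```python
# Toy victim-cache hierarchy: lines evicted from L2 land in L3
# ("mostly exclusive"), and an L2 miss that hits in L3 moves the
# line back into L2. OrderedDict insertion order serves as LRU.
from collections import OrderedDict

class VictimHierarchy:
    def __init__(self, l2_lines, l3_lines):
        self.l2 = OrderedDict()   # line address -> present, LRU order
        self.l3 = OrderedDict()
        self.l2_lines, self.l3_lines = l2_lines, l3_lines

    def access(self, addr):
        if addr in self.l2:                        # L2 hit
            self.l2.move_to_end(addr)
            return "L2"
        hit_l3 = self.l3.pop(addr, None)           # L3 holds only victims
        self._fill_l2(addr)                        # allocate into L2
        return "L3" if hit_l3 else "MEM"

    def _fill_l2(self, addr):
        if len(self.l2) >= self.l2_lines:
            victim, _ = self.l2.popitem(last=False)  # evict LRU line...
            self.l3[victim] = True                   # ...into the L3
            if len(self.l3) > self.l3_lines:
                self.l3.popitem(last=False)          # L3 evicts to memory
        self.l2[addr] = True

h = VictimHierarchy(l2_lines=2, l3_lines=2)
print([h.access(a) for a in [0, 1, 2, 0]])  # ['MEM', 'MEM', 'MEM', 'L3']
```

It also makes the 1:1 size-ratio complaint concrete: in an exclusive design the effective capacity is L2 + L3, so an L3 no bigger than the combined L2s only doubles the reach rather than adding a much larger backstop.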

On an unrelated note, the place where large caches help the most in server workloads is in caching instructions, not data. There is generally not a lot of data locality in server workloads.
 