[AT] AMD Kaveri APU Launch Details: Desktop, January 14th

Page 4
Status
Not open for further replies.

SiliconWars

Platinum Member
Dec 29, 2012
2,346
0
0
The interesting thing about this is that mature processes have something to gain. As we know, an immature process has worse characteristics and leads to more salvage parts.

I'm not quite sure where AMD/GloFo stands on this with 28nm. The 28nm process is at least a year old, but the HP variant is newish. One point worth noting is that so far AMD has only announced 3 Kaveri chips - 2 A10s and 1 A8. Those should all be quads, so that appears to be good news on the manufacturability front.

The lower-than-anticipated graphics clock speeds could simply be a case of hitting the sweet spot in terms of perf/W and the bandwidth wall. Some of the earlier graphics parts were clocked too high for too little gain, I believe. It could also just be the best way to increase yields; either way, it seems the sensible option for them.

But yeah, the lower-end salvage parts could be in trouble from the ARM guys below, and from Intel's Atom and their own Kabinis. This is probably going to hurt Intel badly at 14nm as well, with the issues they are currently having almost certainly leading to a situation where die quality is weighted toward the lower end. One thing that might save AMD's dual cores is that the graphics should at least be playable at lower settings. Any dual core with 256 or 384 SPs should be approaching old Llano A8-3870K levels in gaming terms, which isn't bad for such cheap ($30-$50) APUs.
 
Last edited:

Kallogan

Senior member
Aug 2, 2010
340
5
76
That would be so awesome if FM2+ supports both DDR3 and DDR4 modules. That way we could go DDR3 Kaveri from the start and then upgrade to DDR4 memory when the prices become decent and get a nice boost without changing rig.

But maybe I'm dreamin' too much.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
That would be so awesome if FM2+ supports both DDR3 and DDR4 modules. That way we could go DDR3 Kaveri from the start and then upgrade to DDR4 memory when the prices become decent and get a nice boost without changing rig.

But maybe I'm dreamin' too much.

Yep, no DDR4 until FM3.
 

Vesku

Diamond Member
Aug 25, 2005
3,743
28
86
I don't think DDR4 will be affordable until the end of 2014 or beginning of 2015, so the lack of it on the FM2+ platform is not a big deal, IMO.
 

ph2000

Member
May 23, 2012
77
0
61
I don't think there will be DDR4 support until Intel comes out with DDR4 support.
 
Last edited:

AtenRa

Lifer
Feb 2, 2009
14,003
3,361
136
That would only mean Intel's die space is used much more efficiently.

As I have said, a 32nm Sandy Bridge core is almost the same die size as a 32nm Piledriver module. And both have almost the same performance; I don't see how you come to the conclusion that Intel's die space is more efficiently used.


And again it shows the weakness of AMD's CMT, due to the scaling issue.

What scaling issue??? AMD's CMT gives you up to 80% more, when Intel's SMT only gives you up to 30%.

AMD's CMT only works in very high-scaling multithreaded cases without demanding main threads.

SMT also works when you are multithread-limited; you don't get SMT gains when you only have a single thread.

In short, you get all the good of Intel's SMT without being penalized, unlike AMD's CMT solution, which depends on very high scaling to yield the same efficiency.

In short, you are talking BS trying to make SMT out to be the best thing in the universe. SMT only scales when you have multiple threads, just like CMT. There is no performance penalty in either case; one will give you up to 30% and the other up to 80% EXTRA performance.

It also explains why AMD is slowly moving away from CMT one step at a time.

AMD isn't moving away from CMT; they are continuously upgrading the CMT design. Both Steamroller and Excavator are CMT architectures.
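As a back-of-envelope sketch of the numbers being argued over here (taking the quoted 30% and 80% uplifts at face value; these are the posters' figures, not measurements):

```python
# Toy model of the quoted uplifts: a second SMT thread adds up to ~30% extra
# throughput to a core, a second CMT core adds up to ~80% to a module.
def module_throughput(base: float, uplift: float) -> float:
    """Throughput of two threads relative to one thread, given the quoted uplift."""
    return base * (1.0 + uplift)

smt_pair = module_throughput(1.0, 0.30)  # Intel SMT core, 2 threads -> 1.3x
cmt_pair = module_throughput(1.0, 0.80)  # AMD CMT module, 2 threads -> 1.8x

# Per-thread efficiency once both threads are running:
smt_per_thread = smt_pair / 2  # each thread gets ~0.65 of a full core
cmt_per_thread = cmt_pair / 2  # each thread gets ~0.90 of a full core
```

Of course, this says nothing about the absolute performance of the underlying core, which is the other half of the argument.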
 

nehalem256

Lifer
Apr 13, 2012
15,669
8
0
As I have said, a 32nm Sandy Bridge core is almost the same die size as a 32nm Piledriver module. And both have almost the same performance; I don't see how you come to the conclusion that Intel's die space is more efficiently used.

What scaling issue??? AMD's CMT gives you up to 80% more, when Intel's SMT only gives you up to 30%.

Sandybridge gives you higher single thread performance and similar multithread performance. Overall this seems like a win for SB over PileDriver.
 

Vesku

Diamond Member
Aug 25, 2005
3,743
28
86
Sandybridge gives you higher single thread performance and similar multithread performance. Overall this seems like a win for SB over PileDriver.

The similar multithreaded performance is mainly due to its stronger individual core performance; SMT provides a smaller boost on top of that. In other words, a Sandy Bridge core can already run 1+X threads, where X is a significant fraction of 1, as fast as a Piledriver core can run 1 thread.
 
Last edited:

NTMBK

Lifer
Nov 14, 2011
10,270
5,136
136
SMT vs CMT is not the topic of this thread. If you want to argue any more about SMT vs CMT, go start a new topic, or I'm calling in the mods. (And yes, I was guilty of this too.)
 

Vesku

Diamond Member
Aug 25, 2005
3,743
28
86
Is that Java book search demo code available to the public yet? I'm assuming it will be part of the documentation section of the OpenCL aware Java SDK. We could run it and compare scores to Kaveri.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,361
136
I would say that AMD CMT does not work at all.

FUD much??? Do you even know how CMT works???

It's misleading to compare core per core, because all the support infrastructure that the core needs to work just isn't there.

Do you even know how a microprocessor works??? What infrastructure is not there??

32nm Sandy Bridge without the CPU part is 30% smaller than the CPU part of a 32nm APU.

Are you on pills??? What the hell are you talking about???


In servers we can see how CMT designs scale badly, because once AMD tried to go over 4 modules they could achieve only paltry clocks, while Intel could get much higher clocks out of their big-die parts.

You don't even know that 16-core Opterons are two 8-core (4-module) dies in the same package. And you talk about CMT scaling badly??

While here on the forums people are clamoring for an 8-core Steamroller, Intel has 12-core/24-thread parts that would mop the floor with AMD chips in whatever multithreaded task you can think of, and even the 8C parts would be enough to hold the line against whatever AMD throws at them. AMD can only blame the poor scaling of their designs for their server debacle.

Just like the $300+ Iris Pro against a $150 Richland; next time you will compare a single module to a 24-core/48-thread $4000+ server CPU.

And speaking about sharing... SMT is about sharing *all* resources of the core, while CMT is about sharing just a few of the resources. Intel (and IBM, and Sun, and everyone with SMT) can go huge on core resources and IPC because the resources will be used by more than one thread at a given time, while AMD cannot, because if they go huge on core resources they might end up with only added leakage while the core sits idle for lack of threads. This is the reason for the anemic core (which they tried to compensate for with high clocks), and this is why you cannot expect much IPC from AMD parts (I'll save that 90% post for posterity). CMT ends up delivering a much more inflexible processor than SMT, one that only shines when there are a lot of light threads in flight, and sucks at everything else.

I would strongly advise you to read about SMT and CMT, how they work, what they aim at, etc., and then talk about them. It would be good to educate yourself before you speak about technical stuff you don't have a clue about.
I also understand you have this endless urge to FUD and spread negative PR about everything AMD, but when it comes to technical things you should read first and talk later.
 

Khato

Golden Member
Jul 15, 2001
1,225
281
136
As I have said, a 32nm Sandy Bridge core is almost the same die size as a 32nm Piledriver module.

I'd be curious as to how you arrived at the Sandybridge core being comparable in die size to Piledriver? A quick comparison of the available die shots of each has the Piledriver core over 10% larger at 19.8 mm^2 vs 17.7 mm^2. Note that that comparison is, if anything, still slightly favorable to AMD as it's not including any of the 'blank' silicon space they have in their floor plans that's being used for signal routing. (Intel takes a slight hit to the core die size metric by routing all signals in the same areas as logic because it results in a smaller die size over all.)
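For what it's worth, the percentage in the comparison above works out as follows (the two area figures are the die-shot estimates quoted in the post, so treat them as rough):

```python
# Die-area figures measured from die shots, per the post above (estimates).
piledriver_module_mm2 = 19.8
sandybridge_core_mm2 = 17.7

# Absolute and relative difference, taking the Sandy Bridge core as baseline.
diff_mm2 = piledriver_module_mm2 - sandybridge_core_mm2
pct_larger = diff_mm2 / sandybridge_core_mm2 * 100.0

print(f"Piledriver module is {pct_larger:.1f}% larger ({diff_mm2:.1f} mm^2 more)")
```

That ~11.9% is where the "over 10% larger" claim comes from.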
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,361
136
And for people saying that CMT doesn't work or that it scales badly:

FX8150 vs Core i7 2600 [core scaling charts]
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,361
136
I'd be curious as to how you arrived at the Sandybridge core being comparable in die size to Piledriver? A quick comparison of the available die shots of each has the Piledriver core over 10% larger at 19.8 mm^2 vs 17.7 mm^2. Note that that comparison is, if anything, still slightly favorable to AMD as it's not including any of the 'blank' silicon space they have in their floor plans that's being used for signal routing. (Intel takes a slight hit to the core die size metric by routing all signals in the same areas as logic because it results in a smaller die size over all.)

I said almost the same size; 10% is very close.
 

Khato

Golden Member
Jul 15, 2001
1,225
281
136
I said almost the same size; 10% is very close.

No, it's really not. With core sizes in this range, 1% would be very close. 10% is a touch over 2 mm^2... even on 32nm I'd expect that one could almost fit a quad-core A7 in that space. Or half a Jaguar core.

Oh, and regarding those core scaling charts... They do an excellent job demonstrating the magic of AMD's module approach! Who would have ever expected that running Cinebench with 4 threads on 4 modules would be 4.5526 times faster than running 1 thread on 1 module? (See the problem there? If not, compare against the Intel scores which never show greater than perfect scaling...)
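A quick way to see the problem being pointed out: divide the measured speed-up by the thread count; anything meaningfully above 100% per core implies the single-thread baseline was too low. A sketch (the 4.5526 figure is the one quoted above; the 3.9 is a made-up plausible value for contrast):

```python
# Super-linear scaling check: speedup beyond the thread count is impossible
# without a measurement artifact, so per-core scaling > 100% flags a suspect
# single-thread baseline.
def per_core_scaling(speedup: float, n_threads: int) -> float:
    """Percent of ideal linear scaling achieved per thread."""
    return speedup / n_threads * 100.0

print(per_core_scaling(4.5526, 4))  # ~113.8% -> super-linear, suspicious
print(per_core_scaling(3.9, 4))     # ~97.5%  -> plausible sub-linear scaling
```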
 

Lepton87

Platinum Member
Jul 28, 2009
2,544
9
81
Oh, and regarding those core scaling charts... They do an excellent job demonstrating the magic of AMD's module approach! Who would have ever expected that running Cinebench with 4 threads on 4 modules would be 4.5526 times faster than running 1 thread on 1 module? (See the problem there? If not, compare against the Intel scores which never show greater than perfect scaling...)

In that test the Phenom II also has better-than-perfect scaling, and the Intel dual-core i7 has a smidgen over perfect scaling as well. The dual-core i7 has 200.75%, the dual-core PII has 204.26%, so it shows strange scaling on all the CPUs.
 

ChronoReverse

Platinum Member
Mar 4, 2004
2,562
31
91
The thing about CMT is that it's not surprising, nor even unusual, that it has good "scaling". Not that much of the CPU "core" is shared between the two cores of a module. AMD basically doubled up most of the resources, so if it DIDN'T scale at least 80%, it'd be worse than useless. This is reflected in the really large die (which ultimately doesn't give better performance per area compared to Intel).


SMT (Intel's name for it is HT) is something else entirely. It's not meant to scale to 80%, because it's not about throwing hardware at the problem. Instead, you add very little hardware to the CPU and then recapture performance that would otherwise be _wasted_.

That's why the scaling is only 30% if you're lucky! But you end up with a more efficient CPU, because you're using closer to 100% of it with HT. This is also why HT was used in Atom.

SMT isn't just an Intel thing either. If you look at other high-performance CPUs outside of x86, you'll find SMT used to even greater degrees than at Intel (POWER8 and its predecessors (SMT since POWER5), UltraSPARC T2 [Niagara], etc.). The research on the concept started in the 1970s, too.
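The "recapture wasted slots" point above can be put into a toy model (all numbers here are illustrative placeholders, not measurements of any real core):

```python
# Toy SMT model: one thread leaves some fraction of issue slots idle; a
# second SMT thread fills part of that leftover at near-zero area cost.
def smt_uplift(single_thread_util: float, fill_fraction: float) -> float:
    """Extra throughput from a second SMT thread filling idle issue slots."""
    idle = 1.0 - single_thread_util
    return idle * fill_fraction

# Hypothetical: one thread keeps 70% of issue slots busy, and the sibling
# thread manages to fill 80% of the leftover slots.
uplift = smt_uplift(0.70, 0.80)  # ~0.24, i.e. about a 24% gain, in the
                                 # "30% if you're lucky" ballpark
```

The model also shows why SMT gains shrink as single-thread utilization rises: a core that already keeps its units busy has little left to recapture.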
 

Lepton87

Platinum Member
Jul 28, 2009
2,544
9
81
So it's worthless? Got it.

I don't know, POV-Ray also shows better-than-perfect scaling on all CPUs, so all of those results seem questionable to me.

That's why the scaling is only 30% if you're lucky! But you end up with a more efficient CPU because you're using closer to 100% of the it with HT. This is also why HT is used with Atom.

It's only used in the old Atom, and the reason for it is in-order execution, which leaves a lot of computing resources idle with just one thread. The new out-of-order Atom does not have it.
 
Last edited:

inf64

Diamond Member
Mar 11, 2011
3,769
4,233
136
If you were to have an L3-less Orochi or Vishera die, you would cut the die size significantly, while performance might suffer only 5% or so in common desktop workloads. The L3 was there mostly for server workloads.
 

ChronoReverse

Platinum Member
Mar 4, 2004
2,562
31
91
It's only used in the old Atom, and the reason for it is in-order execution, which leaves a lot of computing resources idle with just one thread. The new out-of-order Atom does not have it.

I know the new Atom doesn't have it, but it ties into what I was talking about. It's for more efficient usage of the CPU. If the expected workload and the CPU design coincide such that few execution resources sit idle, then the trade-off to implement HT isn't worth it.

In the case of Silvermont, Intel was able to put in more cores thanks to the die shrink, so they spent the space that would have gone to HT on out-of-order execution instead.

In the future, there may be enough power budget to do both and extract greater utilization. Maybe not, if the usage efficiency is already high enough that HT doesn't give much. It's a low-power system, so you're not going to be able to simply throw more hardware at it.
 

Khato

Golden Member
Jul 15, 2001
1,225
281
136
In that test the Phenom II also has better-than-perfect scaling, and the Intel dual-core i7 has a smidgen over perfect scaling as well. The dual-core i7 has 200.75%, the dual-core PII has 204.26%, so it shows strange scaling on all the CPUs.

While true, the Intel results are easily within the margin of error, while those for the AMD CPUs are well beyond it. The only reason I can think of for such a discrepancy is that the baseline for the AMD CPUs is lower than it should be, and it should be noted that the results support this hypothesis reasonably well. (If you divide the obtained scaling by the number of cores for those cases where AMD is showing greater than 100% scaling, you get roughly 103 for the Phenom II on Cinebench, 113.8 and 109.2 for the FX on Cinebench, 106.5 for the Phenom II on POV-ray, and 108.8 and 105.5 for the FX on POV-ray.)
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,361
136
Oh, and regarding those core scaling charts... They do an excellent job demonstrating the magic of AMD's module approach! Who would have ever expected that running Cinebench with 4 threads on 4 modules would be 4.5526 times faster than running 1 thread on 1 module? (See the problem there? If not, compare against the Intel scores which never show greater than perfect scaling...)

So it's worthless? Got it.

Does the fact that in a single thread the Bulldozer core DOESN'T use 100% of its resources say anything to you???
I guess you thought that a single thread would use 100% of a single core's resources; well, that never happens. Also, having a large SHARED L2 cache between the two cores of the module helps with data from previous threads that are ready to be executed, and you gain cycles because you don't have to fetch from main memory or even L3.
That helps push the multithreaded scaling above 100% between one and two or more cores.
 
Last edited:

NostaSeronx

Diamond Member
Sep 18, 2011
3,689
1,224
136
If you were to have an L3-less Orochi or Vishera die, you would cut the die size significantly, while performance might suffer only 5% or so in common desktop workloads. The L3 was there mostly for server workloads.
It is the opposite: the L3 in Orochi Rev B and Orochi Rev C causes a five percent decrease in performance for common personal-use workloads.

Only in servers does the L3 give any performance benefit, as it can be shared across nodes.

64 cores -> 2 MB L2 per core -> 56 MB L3 for all cores across nodes (8 MB of L2 used if HT Assist is on)

For desktop usage, it would be better to use a unified L2 with fast ports between modules: 35-40 cycle latency across modules versus the 55-60 cycle latency of the L3.
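The latency argument above can be sketched as a weighted-average miss penalty (the 35-40 and 55-60 cycle figures are the post's estimates; the 70/30 hit split and the ~20-cycle own-L2 latency below are made-up placeholders purely for illustration):

```python
# Weighted-average cost of an L1 miss serviced by either a "near" cache
# level (the core's own L2) or a "far" one (cross-module L2 vs. shared L3).
def avg_l1_miss_cycles(near_fraction: float, near_cycles: float,
                       far_cycles: float) -> float:
    """Average cycles per L1 miss, split between a near and a far level."""
    return near_fraction * near_cycles + (1.0 - near_fraction) * far_cycles

# Hypothetical: 70% of L1 misses hit the core's own L2 (~20 cycles) in both
# designs; only the "far" level differs.
unified_l2 = avg_l1_miss_cycles(0.70, 20, 38)  # far = cross-module L2 (~35-40)
with_l3 = avg_l1_miss_cycles(0.70, 20, 58)     # far = shared L3 (~55-60)
```

Under these placeholder numbers the unified-L2 design averages roughly 25 cycles per miss versus roughly 31 for the L3 design, which is the shape of the claim, if not its proof.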
 