[Techpowerup] AMD "Zen" CPU Prototypes Tested, "Meet all Expectations"


cytg111

Lifer
Mar 17, 2008
23,560
13,120
136
Generally when you see IPC talked about here, it's measured and compared to other designs by measuring performance in a variety of different applications rather than something based on the number of pipelines in a core or similar. In that case, those external factors are already included in "IPC increase".
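As a rough illustration of that point, "IPC uplift" figures of this kind are usually derived from benchmark scores at a fixed clock, not from counting pipelines. A minimal sketch, with entirely made-up workload scores, might look like:

```python
from math import prod

def geomean(xs):
    """Geometric mean, the usual way to aggregate per-app speedup ratios."""
    return prod(xs) ** (1.0 / len(xs))

# per-GHz scores for each workload (hypothetical numbers for illustration)
old_core = {"compress": 10.0, "render": 8.0, "compile": 12.0}
new_core = {"compress": 14.5, "render": 11.0, "compile": 16.0}

ratios = [new_core[k] / old_core[k] for k in old_core]
uplift = geomean(ratios) - 1.0
print(f"measured IPC uplift: {uplift:.1%}")  # aggregate over the app mix
```

The headline number then depends as much on the chosen application mix as on the core itself, which is exactly why the same design can be quoted with different "IPC" gains.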

When AMD is putting up a slide saying 40% IPC increase, I am going to assume that's running on code fitting solely in L1 cache (trace cache?)/best-case scenario. I reserve the possibility of being pleasantly surprised, of course, but....
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
32nm PDSOI - 12 Track (High Performance Lib) vs.

14nm LPP - 10.5 Track (High Performance Lib)
-20% frequency
+50% lower power
+85% lower leakage
+55% area shrink

22nm FDSOI - 8 Track + ABB (Fast High Density Lib // FBB Focus)
+30% frequency(est.)
+45% lower power(est.)
+50% lower leakage(est.)
+65% area shrink(est. the shrink is bigger for Mixed-signal)

22nm FDSOI - 8 Track + ABB (Fast High Density Lib // RBB Focus)
+10% frequency(est.)
+65% lower power(est.)
+70% lower leakage(est.)
+65% area shrink(est. the shrink is bigger for Mixed-signal)


It should be noted that the numbers for all these are based on a reference design at a fixed voltage, under 0.8V, IIRC. They will all respond differently to extra voltage: adding 0.2V may do nothing for 22nm FDSOI, yet improve 14nm nicely. We lack the scaling details to know what to expect from a higher-voltage part.

And, of course, design itself is critical.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
How not? IPC is supposed to refer to some average measurement of real-world applications' work per MHz.

In its purest form IPC is a general average of the rate which all instructions are executed.

If we use this pure definition of IPC, Piledriver is massively faster than Bulldozer. Our best way of estimating this is via MEASURED instruction latencies.

http://looncraz.net/research/cpu/ipc/amd_lat/

In the overly simplified description lower average latencies will result in nearly identical improvements in performance. In the real world, it greatly depends on exactly which instructions a given application was seeing as its performance bottleneck.

From Bulldozer to Piledriver AMD focused on a general improvement that enabled pretty much every instruction to be retired 10% sooner. Not surprisingly, we saw a 10% improvement in performance from Bulldozer to Piledriver.

Steamroller, however, shows almost no improvement in average instruction latencies, but it is known to average about 6.7% faster than Piledriver. Its improvement came from increased ILP, as can be witnessed by the throughput latency reduction. Excavator, meanwhile, looks to be a better improvement, with mostly targeted instructions being improved to go along with some general improvement. On the latency graphs the difference looks much smaller than the Bulldozer to Piledriver improvement, but the benchmark changes are typically as large, or even larger. In addition, it is the first change that helped out Cinebench scores in any meaningful way (whereas Intel has focused on Cinebench for some obvious, and some not so obvious, reasons).
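The latency-based reasoning above can be sketched numerically. With a hypothetical instruction mix and per-class latencies (the values below are illustrative, not measured), a uniform ~10% latency cut yields a ~10% throughput gain when latency is the bottleneck:

```python
# hypothetical dynamic instruction mix and per-class latencies (cycles)
mix    = {"alu": 0.55, "mul": 0.15, "load": 0.30}
lat_bd = {"alu": 1.0, "mul": 4.0, "load": 4.0}          # Bulldozer-like
lat_pd = {k: v * 0.9 for k, v in lat_bd.items()}        # ~10% lower across the board

def avg_latency(mix, lat):
    """Mix-weighted average latency per retired instruction."""
    return sum(mix[k] * lat[k] for k in mix)

speedup = avg_latency(mix, lat_bd) / avg_latency(mix, lat_pd) - 1.0
print(f"estimated speedup: {speedup:.1%}")  # -> estimated speedup: 11.1%
```

Note the simplification: this only holds when the shortened instructions are actually on the critical path; ILP gains like Steamroller's don't show up in a model like this at all.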
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,689
1,224
136
It should be noted that the numbers for all these are based on a reference design at a fixed voltage, under 0.8V, IIRC.
I believe the Vref for the comparisons are 0.95V for 22FDX and 0.75V for 14LPP. 32nm PDSOI is higher than 1V, but lower than 1.2V for Vref.

FDSOI has the highest frequency for the lowest given leakage. Especially given that 95%+ efficiency AVFS FIVRs can be placed on FDSOI at low die-area cost.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
I believe the Vref for the comparisons are 0.95V for 22FDX and 0.75V for 14LPP. 32nm PDSOI is higher than 1V, but lower than 1.2V for Vref.

FDSOI has the highest frequency for the lowest given leakage. Especially given that 95%+ efficiency AVFS FIVRs can be placed on FDSOI at low die-area cost.

That makes it an even worse comparison. What was the common thread with these figures? (A source would be helpful).
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,689
1,224
136
That makes it an even worse comparison. What was the common thread with these figures? (A source would be helpful).
The common architecture is a Cortex A17. Note that the 22FDX is 8 track, not 12 track. Also made an oops, the 32nm PDSOI is 0.85V for Vref.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
The common architecture is a Cortex A17. Note that the 22FDX is 8 track, not 12 track. Also made an oops, the 32nm PDSOI is 0.85V for Vref.

Ah, okay, Cortex A17 makes sense.

The standard cell library difference would have a fairly meaningful impact, would it not?

Fewer tracks generally means higher density, lower cost, and lower performance, correct? And probably less power handling capabilities if I am understanding what "track" means in this instance (explanations are welcome!).
 

DrMrLordX

Lifer
Apr 27, 2000
21,812
11,165
136
So what's the exact reasoning you think we won't see 40% more perf per clock for int, but we will for FP?

It was my impression that Zen would feature a greater leap forward in floating point execution resources than int. Did I misinterpret the latest leaks?

The actual claim is a single Zen core{1C|2T} has 40% higher IPC over a single XV core{1C|1T}.

That's . . . more specific than anything I recall seeing from AMD. Regardless, nobody actually runs single-threaded code on an XV module except for odd cases such as SuperPi or when deliberately running a multi-threaded benchmark in single-threaded mode for . . . whatever reason.

If Zen is only going to be 40% faster than an XV module running in "one core per module/compute unit" mode, then color me massively unimpressed.

Case in point: My Kaveri (SR) @ 4.5 GHz, 2100 MHz NB, DDR3-2400 CL10 can put up a Cinebench R10 score of 16709 (+/- some amount, it can go a bit higher or lower depending on stuff). If I go into the UEFI and switch the CPU into "one core per compute unit" mode, which is the UEFI's way of disabling half of the cores so that the OS will assign one thread per module, the score takes a nosedive to around 9100 points (9186, I think). Enabling full threading on the CPU increases performance by 81%, despite the fact that an SR module only has a pair of 128-bit FMACs. Pretty interesting stuff.

Anyway, The Stilt's testing showed XV to be 5% faster than SR in Cinebench R10 at the same clockspeed with the same number of modules. So, assuming XV would take the same hit running in "one core per compute unit" mode, XV @ 4.5 GHz (lulz) would put up a total score of ~9640-9650 (let's say 9645) running as 2m/2t, which would amount to a per-GHz-per-thread score of 1072. If Zen is 40% faster than that AFTER taking SMT into account (?!?), then overall Zen would have 750.4 CB per GHz per thread (1072 * 1.4 / 2). That would be simply awful. A 4c/8t Zen would wind up being slower - much slower - than 4m/8t XV. Yuck.
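The back-of-the-envelope math above can be reproduced directly from the quoted scores (only the Cinebench numbers are from the post; everything else follows arithmetically):

```python
sr_full = 16709        # Kaveri @ 4.5 GHz, full threading (2m/4t), as posted
sr_1cpm = 9186         # same chip, one core per compute unit (2m/2t)
print(f"threading gain: {sr_full / sr_1cpm - 1.0:.0%}")   # ~82%

xv_1cpm = sr_1cpm * 1.05             # assume XV +5% over SR (per The Stilt)
per_ghz_thread = xv_1cpm / 4.5 / 2   # CB points per GHz per thread, ~1072
zen_est = per_ghz_thread * 1.4 / 2   # +40% with SMT folded in, per thread
print(f"XV 2m/2t estimate: {xv_1cpm:.0f}")   # ~9645
print(f"Zen CB/GHz/thread: {zen_est:.1f}")   # ~750
```

The pessimistic conclusion hinges entirely on the "(?!?)" assumption that the 40% figure already includes the second SMT thread, hence the divide by two.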
 
Mar 10, 2006
11,715
2,012
126
When AMD is putting up a slide saying 40% IPC increase, I am going to assume that's running on code fitting solely in L1 cache (trace cache?)/best-case scenario. I reserve the possibility of being pleasantly surprised, of course, but....

I don't think AMD will fall short of its IPC target. The question is what clocks will they hit with Zen, and that is something that I think they are going to soon learn.
 

krumme

Diamond Member
Oct 9, 2009
5,956
1,595
136
Even with IPC met, high freq and excellent perf/W, AMD has a huge K2-like mountain to climb to compete with Intel. It's not a crappy P4 solution, idiotic for server loads, that they are competing with. It's a damn fine overall product, support and organizational setup.

And they have to do the climbing without oxygen and support. If Zen is excellent, it's just the ticket to base camp.
 
Mar 10, 2006
11,715
2,012
126
Even with IPC met, high freq and excellent perf/W, AMD has a huge K2-like mountain to climb to compete with Intel. It's not a crappy P4 solution, idiotic for server loads, that they are competing with. It's a damn fine overall product, support and organizational setup.

And they have to do the climbing without oxygen and support. If Zen is excellent, it's just the ticket to base camp.

Good post, agree fully.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,867
3,418
136
It was my impression that Zen would feature a greater leap forward in floating point execution resources than int. Did I misinterpret the latest leaks?
You can pretty much consider both doubled (it varies from instruction to instruction; 0% more FP div, for example), but integer was significantly more bottlenecked to begin with.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
I don't think AMD will fall short of its IPC target. The question is what clocks will they hit with Zen, and that is something that I think they are going to soon learn.
I would add: What clocks at what power consumption will they achieve? Low power -> simply add core clusters (with shared L3 slice).
 
Feb 4, 2009
34,703
15,951
136
I really want this to be great, and I really want a successful AMD to add another choice. However, I personally am sick and tired of the excuses for AMD's overall poor performance.
 

carop

Member
Jul 9, 2012
91
7
71
Ah, okay, Cortex A17 makes sense.

The standard cell library difference would have a fairly meaningful impact, would it not?

Fewer tracks generally means higher density, lower cost, and lower performance, correct? And probably less power handling capabilities if I am understanding what "track" means in this instance (explanations are welcome!).

Taller cells allow larger transistor widths to be used, equating to higher performance:

Ultra High Density: 7 or 8-track
High Density: 9 or 10-track
High Performance: 12-track

OTOH, when designers compare their 12-track library at N28 (planar bulk) to their N16/N14 (FinFET bulk) 9-track library, they are observing that the 9-track FinFET library offers greater performance. So, they have to decide if they need a library with more tracks at N16/N14. (At any rate, TSMC 16FF+ has 7.5, 9, 10.5 and 12-track stdcells, whereas Samsung/GF 14LPP seem to have 9 and 10.5-track stdcells.)
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
FWIW VC 2013's C library uses them for a few things. We found out right away since MS didn't bother checking if AVX was enabled before using them (this didn't get fixed until VC 2015). You can do "bcdedit /set xsavedisable 1" and see if anything crashes.
OK, so it might matter for arbitrary apps. Unfortunately this doesn't tell much about how much the application's performance depends on that. What's the perf hit w/ lower FMA throughput?


When AMD is putting up a slide saying 40% IPC increase, I am going to assume that's running on code fitting solely in L1 cache (trace cache?)/best-case scenario. I reserve the possibility of being pleasantly surprised, of course, but....
There is some data about a given SR 10% IPC increase and measured 11% (app mix dependent):
http://forums.anandtech.com/showthread.php?t=2365766
That includes the whole system. So a given 40% might mean a similar scenario.


Even with IPC met, high freq and excellent perf/W, AMD has a huge K2-like mountain to climb to compete with Intel. It's not a crappy P4 solution, idiotic for server loads, that they are competing with. It's a damn fine overall product, support and organizational setup.

And they have to do the climbing without oxygen and support. If Zen is excellent, it's just the ticket to base camp.
AMD might compete on the $/perf or $/perf/W fronts or in different/niche markets. Nobody expects them to push Intel out of the markets, but maybe take 5% of Intel's share during the first years. What growth would that be for AMD's server business?


The actual claim is a single Zen core{1C|2T} has 40% higher IPC over a single XV core{1C|1T}.
Is the "2T" for sure? This would mean a 7.7% improvement in ST over XV if SMT adds 30%. Doesn't look plausible given the units.
 

krumme

Diamond Member
Oct 9, 2009
5,956
1,595
136
TCO is what matters in the server market. CPU cost is utterly uninteresting.
AMD either needs to:
A. Flat out beat Intel on single-thread performance and gain access to the market where software license cost is set by core count.
B. Beat or be similar in perf/W.
Go look at what e.g. cooling costs. Redundancy here too. The cost list is huge.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
TCO is what matters in the server market. CPU cost is utterly uninteresting.
AMD either needs to:
A. Flat out beat Intel on single-thread performance and gain access to the market where software license cost is set by core count.
B. Beat or be similar in perf/W.
Go look at what e.g. cooling costs. Redundancy here too. The cost list is huge.

A might be an option (chances growing with later Zen cores) but more as a side effect, not a primary goal.
B is not an improbable option.

@IPC discussion:
The per clock analysis of XV vs. SR vs. PD on Planet3DNow!
https://translate.google.com/transl...stungsvergleich-der-architekturen/&edit-text=
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
It was my impression that Zen would feature a greater leap forward in floating point execution resources than int. Did I misinterpret the latest leaks?

In most ways Zen's FPU is wider than Haswell's (except for fmul, where it is apparently at a 50% disadvantage). Unfortunately, we don't really know how well this compares with the internal workings of the FlexFPU in Excavator (due to it being hidden behind another scheduler).

From the gcc patches (I've been examining them very carefully) it would appear that AMD has floating point advantages (vs Integer)... in fact, it looks like it would, in theory, lay the smack-down on Haswell in both areas, but there are more considerations than just these for performance:

INTEGER

Zen Integer advantages over Haswell:
Double the shift or rotates (4 vs 2)
Double the LEA instruction throughput (4 vs 2)
1/4 Division Port Usage (Haswell locks 4 ports for division)*

Zen Integer same as Haswell:
2x branch
1x indirect branch
4x mov, movx, add, cmp, etc.
1x mul, imul, mulx

Zen Integer disadvantage vs Haswell:
4x indirect branch pipeline usage vs Haswell's*

* These are ILP issues that prevent other instructions from executing/being scheduled on ANY ALU at the same time.

FLOATING POINT

Zen FPU advantage over Haswell
33% more FPU pipelines
2x fdiv
50% more mmx_add and sse_add
2x mmx_cvt
33% more sse_logic
Averages nearly twice as wide (much better ILP)

Zen FPU Identical with Haswell:
fcmp
fop
fsgn
mmxshift

Zen FPU Disadvantages vs Haswell:
1/2 ssemuladd (FMA?)
- Zen pairs with FP3 for every ssemuladd

On the whole, if the instruction latencies and the cache system were up to par, we'd expect Zen to beat Haswell in most cases per clock. However, Haswell has 50% more AGUs and a dedicated store data unit, which is ganged with one of the three AGUs for stores to double the potential bandwidth (AFAICT).

I see a few low hanging fruit in the design to give it the 15% extra performance with Zen+, but they're hypothetical (the core should be able to handle it, I'm betting the front or back ends are not up to the task).

Please note, all of my information comes from the gcc source code, and I made assignment spreadsheets for Zen and Haswell:

http://looncraz.net/ZenAssignments.html
http://looncraz.net/HswAssignments.htm

Regardless, nobody actually runs single-threaded code on an XV module except for odd cases such as SuperPi or when deliberately running a multi-threaded benchmark in single-threaded mode for . . . whatever reason.

Most of the time, the extra multi-threaded performance is all you need. Games and many browser benchmarks, however, do not scale well, if at all, so higher IPC is certainly more desirable.

Anyway, The Stilt's testing showed XV to be 5% faster than SR in Cinebench R10 at the same clockspeed with the same number of modules.

I saw a 35W Excavator being 9.85% better. I think The Stilt's numbers are TDP-limited, so they aren't valuable for direct comparison to higher-power Steamroller parts.

A 4c/8t Zen would wind up being slower - much slower - than 4m/8t XV. Yuck.

Not gonna happen. I don't think AMD could make Zen slower than Excavator if they tried. To be honest, I'm trying to figure out how they are only claiming 40% higher IPC, though I base my math exclusively on 40% - which puts Zen in Haswell territory on average, with an FPU deficit.

I'm thinking the 40% IPC is integer only, and the FPU is closer to 60~80%. If the caches can keep pace, then Zen will be a great alternative to Intel's current lineup, per clock. Of course, we have a year to wait and we have no idea where the clocks will fall.

If Zen hits 3.5GHz and overclocks to 4GHz, their SMT performance will need to be impressive or they will need to add extra cores.

Up until examining the pipeline assignments I was certain AMD would have an inferior SMT design, but I seriously think they have the potential to match, or even exceed, Hyperthreading.
 

mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
AMD might compete on the $/perf or $/perf/W fronts or in different/niche markets. Nobody expects them to push Intel out of the markets but maybe take 5% of Intels share during the first years. What growth would that be for AMD's server business?

TCO costs on datacenters far exceed acquisition costs, so whenever you put AMD in the same phrase along with $/perf and servers you are implying that AMD will somehow have a sizable advantage in perf/watt, or at least be close enough that the tie-break would happen on the acquisition costs. I don't even know how to point out how pie-in-the-sky these assumptions are, especially when we have SUN, IBM and the rest of the ARM ecosystem being bloodied when trying to reach the same goals.

Btw, did AMD release any info about their interconnect? It seems that Intel's rings were exhausted at about 18 cores, and QPI was far more advanced than anything AMD had in the wings. If we are talking about 32-core Zen, AMD had better have something good for their interconnect, otherwise we'll just watch Bulldozer part 2.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
TCO costs on datacenters far exceed acquisition costs, so whenever you put AMD in the same phrase along with $/perf and servers you are implying that AMD will somehow have a sizable advantage in perf/watt, or at least be close enough that the tie-break would happen on the acquisition costs. I don't even know how to point out how pie-in-the-sky these assumptions are, especially when we have SUN, IBM and the rest of the ARM ecosystem being bloodied when trying to reach the same goals.
That's the reason why I brought this up and mentioned $/perf/W, too. As posted earlier, HSW doing heavy HPC stuff (Livermore loops) has a power consumption comprised of ~75% fixed cost/static power (burnt anyway if not power gated) and 25% caused by instruction execution itself. A large part of that fixed cost is caused by the big core: 256b-wide datapaths, AVX2 PRFs, wide decoders, lots of heavily mixed issue ports, etc.

If AMD created a small core (I'm right in the analysis phase of this hypothesis), this mix could shift to their advantage.
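A toy version of that shift: taking the ~75/25 static/dynamic split cited above, and assuming (purely hypothetically) that a leaner core halves the fixed cost while sustaining similar throughput:

```python
static, dynamic = 0.75, 0.25          # split cited for HSW under Livermore loops
small_core_static = static * 0.5      # assumed: leaner core halves the fixed cost
small_total = small_core_static + dynamic
print(f"small-core power: {small_total:.1%} of baseline")  # -> 62.5%
```

So even a modest cut in the always-on portion moves total power (and hence perf/W) substantially, which is the whole argument for the small-core hypothesis.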

Btw, did AMD release any info about their interconnect? It seems that Intel's rings were exhausted at about 18 cores, and QPI was far more advanced than anything AMD had in the wings. If we are talking about 32-core Zen, AMD had better have something good for their interconnect, otherwise we'll just watch Bulldozer part 2.
Besides having seen ring buses mentioned somewhere in AMD patents/research publications, I think they might do a hierarchical approach + maybe rings somewhere. They might create 4C + L3-slice blocks and combine them + GPU + MCs + NB + SB via a new XBar or a ring bus.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,689
1,224
136
Besides having seen ring buses mentioned somewhere in AMD patents/research publications, I think they might do a hierarchical approach + maybe rings somewhere. They might create 4C + L3-slice blocks and combine them + GPU + MCs + NB + SB via a new XBar or a ring bus.
The Coherent Data Fabric is a hybrid of a hierarchical ring interconnect and a 2D Mesh/Torus interconnect.
 