AMD Zen Features Double the Per-core Number Crunching Machinery to Predecessor

csbin

Senior member
Feb 4, 2013
http://www.techpowerup.com/216541/a...umber-crunching-machinery-to-predecessor.html


AMD's "Zen" CPU micro-architecture focuses on significantly increasing per-core performance, particularly per-core number-crunching performance, according to a 3DCenter.org report. It sees a near doubling of the number of decoders, ALUs, and floating-point units per core compared to its predecessor. In essence, a Zen core is AMD's idea of "what if a Steamroller module of two cores was just one big core, and supported SMT instead."

In the micro-architectures from "Bulldozer," which debuted with the company's first FX-series socket AM3+ processors, up to "Excavator," which will debut with the company's "Carrizo" APUs, AMD's approach to CPU cores involved modules, each packing two physical cores with a combination of dedicated and shared resources between them. It was intended to take Intel's Core 2 idea of combining two cores into an indivisible unit further.


AMD's approach was less than stellar and suffered from implementation problems: software loaded cores sequentially on a multi-module processor, so two threads ended up competing for the shared resources of one module. That performed worse than loading one core per module first and then spreading additional threads across modules. AMD's workaround tricked software (particularly OS schedulers) into thinking that a "module" was a "core" with two "threads" (e.g. an eight-core FX-8350 would be seen by software as a 4-core processor with 8 threads).
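A toy sketch of the scheduling issue described above, assuming a hypothetical core numbering for an FX-8350-style chip where logical cores 2k and 2k+1 share module k. It only counts how many modules a given thread placement touches; it does not model the real Windows scheduler or AMD's workaround.

```python
# Hypothetical core numbering (assumption): logical cores 2k and 2k+1 share module k,
# as on a 4-module, 8-core FX-8350-style chip.
MODULES = 4
CORES_PER_MODULE = 2


def sequential_order(n_threads):
    """Fill cores 0, 1, 2, ... in order, so both cores of a module fill up first."""
    return list(range(n_threads))


def module_first_order(n_threads):
    """Put one thread on each module first, then use the second core of each module."""
    order = [module * CORES_PER_MODULE + core
             for core in range(CORES_PER_MODULE)
             for module in range(MODULES)]
    return order[:n_threads]


def modules_in_use(cores):
    """Count distinct modules touched by a list of core indices."""
    return len({core // CORES_PER_MODULE for core in cores})


for n in (2, 4):
    print(f"{n} threads: sequential uses {modules_in_use(sequential_order(n))} modules, "
          f"module-first uses {modules_in_use(module_first_order(n))} modules")
```

With 4 threads, sequential placement packs them onto 2 modules, where pairs share a front end and FPU, while module-first placement gives each thread a module of its own.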

In AMD's latest approach with "Zen," the company did away with the barriers that separated two cores within a module. It's one big monolithic core, with 4 decoders (parts which tell the core what to do), 4 ALUs ("Bulldozer" had two per core), and four 128-bit wide floating-point units, clubbed into two 256-bit FMACs. This approach nearly doubles the per-core number-crunching muscle. AMD has also implemented an Intel-like SMT technology, which works very similarly to Hyper-Threading.


 

csbin

Senior member
Feb 4, 2013
The Patch Allows Us To Get A Glimpse Into The Inner-Workings Of AMD’s Next Generation High Performance x86 CPU Core “Zen”

Today, with the information that we’ve learned from the patch, we can get a better idea of what Zen looks like from a high-level design standpoint.
So let’s dive straight into the new details that made their way into the patch, but first I’d like to give a shout-out to “Dresdenboy” aka Matthias Waldhauer, who spotted the patch and reported on it in his blog.


Read more: http://wccftech.com/amd-zen-cpu-core-microarchitecture-detailed/#ixzz3ngUPwdPF


 

csbin

Senior member
Feb 4, 2013
The wider floating point unit also means that Zen will be able to process less complex instructions at double the rate of Steamroller. That would mean a massive boost in floating point performance, an area where AMD historically excelled with Phenom II and other microarchitectures prior to Bulldozer.
I should mention that AVX-512 support was not listed for Zen in the Linux patch that was released in March, which revealed the new instruction set extensions that Zen will support. This is slightly odd but could be explained by a possible lack of 512-bit integer support in Zen, which is required for the AVX-512 extension.
There was also one particularly important improvement with Zen that Mr. Waldhauer has managed to spot in a number of patents filed by AMD CPU engineers working on Zen.

Patents have been filed for a lot of the new functionality. For example, there was a mention of checkpointing, which is good for quickly reverting mispredicted branches and for other reasons to restart the pipelines. Some patents suggest that Zen might use a slightly modified version of Excavator's branch prediction.
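Checkpointing is a generic out-of-order technique, so here is a hedged sketch of the idea: a toy register rename table that snapshots its mapping at each predicted branch and restores it in one step on a misprediction. The class, register names and dict-based table are hypothetical illustrations, not AMD's patented design.

```python
class RenameMap:
    """Toy register rename table with per-branch checkpoints (generic technique, hypothetical)."""

    def __init__(self):
        self.mapping = {}      # architectural register -> physical register number
        self.checkpoints = {}  # branch id -> snapshot of self.mapping
        self.next_phys = 0     # naive allocator; no free list in this sketch

    def rename(self, arch_reg):
        """Give a write to arch_reg a fresh physical register."""
        self.mapping[arch_reg] = self.next_phys
        self.next_phys += 1
        return self.mapping[arch_reg]

    def checkpoint(self, branch_id):
        """Snapshot the whole mapping when a predicted branch is renamed."""
        self.checkpoints[branch_id] = dict(self.mapping)

    def recover(self, branch_id):
        """On a misprediction, restore the snapshot in one step.

        A real design would also free the squashed physical registers and drop
        checkpoints taken after this branch; both are omitted here.
        """
        self.mapping = self.checkpoints.pop(branch_id)


rm = RenameMap()
rm.rename("rax")        # rax -> p0
rm.checkpoint("br1")    # branch br1 predicted, snapshot taken
rm.rename("rax")        # wrong-path write: rax -> p1
rm.rename("rbx")        # wrong-path write: rbx -> p2
rm.recover("br1")       # br1 turns out mispredicted
print(rm.mapping)       # {'rax': 0}
```

The point of the snapshot is that recovery cost does not depend on how many wrong-path instructions were renamed, unlike walking the reorder buffer back one entry at a time.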




 

h4rm0ny

Member
Apr 23, 2015
Nice info. One thing I read in another article is that it's going to have 46 PCI-e lanes. I don't know what their source for that is but as someone obsessed with high-speed storage and lots of it, that's another strong positive.
 

JoeRambo

Golden Member
Jun 13, 2013
It's one big monolithic core, with 4 decoders (parts which tell the core what to do), 4 ALUs ("Bulldozer" had two per core), and four 128-bit wide floating-point units, clubbed in two 256-bit FMACs.

Nice spin on numbers, but according to the scheme it only has 2 pairs of 128-bit add/mul hardware. That is not the same as 2x256-bit FMACs. Intel can execute 2x256-bit FMA per cycle (with 5-cycle latency on Haswell). You can't do that with 256 bits of hardware.

Also, BD already had a respectable number of resources, but performance was disastrous. They really need to fix cache bandwidth/latencies to compete with Intel in FP.
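To put the 2x128-bit vs 2x256-bit point in numbers, here is a back-of-the-envelope peak double-precision FLOPs-per-cycle calculation. The Zen unit counts are the speculative ones from the patch discussed in this thread; it counts an FMA as 2 FLOPs per lane and assumes perfect scheduling, which real code never gets.

```python
DOUBLE_BITS = 64  # one double-precision lane


def lanes(width_bits):
    """Number of double-precision lanes in a vector of the given width."""
    return width_bits // DOUBLE_BITS


def peak_separate(n_add, n_mul, width_bits):
    """Peak DP FLOPs/cycle with separate add and mul pipes (1 FLOP per lane per op)."""
    return (n_add + n_mul) * lanes(width_bits)


def peak_fma(n_fma, width_bits):
    """Peak DP FLOPs/cycle with FMA pipes (an FMA counts as 2 FLOPs per lane)."""
    return n_fma * lanes(width_bits) * 2


zen_add_mul = peak_separate(n_add=2, n_mul=2, width_bits=128)  # speculative Zen figures
zen_fma = peak_fma(n_fma=2, width_bits=128)                    # MUL+ADD pairs fused into FMAs
haswell = peak_fma(n_fma=2, width_bits=256)                    # 2x 256-bit FMA per cycle
print(zen_add_mul, zen_fma, haswell)  # 8 8 16
```

Even with a perfectly balanced add/mul mix, 2 add + 2 mul pipes at 128 bits reach half of Haswell's 2x256-bit FMA peak.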
 

inf64

Diamond Member
Mar 11, 2011
Nice spin on numbers, but according to the scheme it only has 2 pairs of 128bit add/mul hardware. That is not the same as 2x256 FMACs. Intel can execute 2x256FMA per cycle ( with 5 lat on hsw). Can't do that with 256bits of hw.

Also BD already had respectable number of resources, but performance was disastrous. They really need to fix cache BW/latencies to compete with Intel in FP.
From Dresdenboy's blog:
Interestingly, as there are two 128b FP mul and two 128b FP add units (with only 3 cycles latency for these ops), the FMA instructions will be executed by combining one FP MUL and one FP ADD unit, resulting in 2 issues and 5 cycles latency (like the Bulldozer family). This saves some register file ports, increases throughput and reduces latencies of the more common FP ops. It even reminds me of the bridged FMA unit.
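A rough sketch of why the 3-cycle MUL/ADD vs 5-cycle FMA latencies matter for dependent code, using Horner-style polynomial evaluation (acc = acc * x + c) as the example. The latencies are the ones quoted above from the patch; the chain length is an arbitrary assumption.

```python
FMUL_LAT = 3  # cycles, as quoted from the patch discussion
FADD_LAT = 3
FMA_LAT = 5   # FMA formed by pairing a MUL and an ADD unit


def horner_chain_cycles(steps, use_fma):
    """Latency of a dependent chain acc = acc * x + c, repeated `steps` times."""
    per_step = FMA_LAT if use_fma else FMUL_LAT + FADD_LAT
    return steps * per_step


print(horner_chain_cycles(8, use_fma=True), horner_chain_cycles(8, use_fma=False))  # 40 48
```

For latency-bound chains the fused op wins (5 vs 6 cycles per step). For throughput-bound code, an FMA that occupies both a MUL and an ADD pipe delivers no more FLOPs than issuing the mul and add separately.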
 

Burpo

Diamond Member
Sep 10, 2013
Nice spin on numbers, but according to the scheme it only has 2 pairs of 128bit add/mul hardware. That is not the same as 2x256 FMACs. Intel can execute 2x256FMA per cycle ( with 5 lat on hsw). Can't do that with 256bits of hw.

Also BD already had respectable number of resources, but performance was disastrous. They really need to fix cache BW/latencies to compete with Intel in FP.

"AMD Zen Core Microarchitecture (with some speculated parts)"
 

jones377

Senior member
May 2, 2004
haha I knew this would end up in an article on one of these so called "tech" sites, rewritten from speculation to truth.
 

Dresdenboy

Golden Member
Jul 28, 2003
citavia.blog.de
Nice spin on numbers, but according to the scheme it only has 2 pairs of 128bit add/mul hardware. That is not the same as 2x256 FMACs. Intel can execute 2x256FMA per cycle ( with 5 lat on hsw). Can't do that with 256bits of hw.

Also BD already had respectable number of resources, but performance was disastrous. They really need to fix cache BW/latencies to compete with Intel in FP.
You nailed it. Most media got it wrong. I hope to get an answer from the AMD guy who created the patch, because some information in that patch (latencies, non-pipelined multiplier (?), FMA pairing of MUL0 + ADD1 (not ADD0?) and MUL1 + ADD1) looks strange.

If the multiplier is pipelined, then SuperPi, Cinebench single-threaded and games should look good on Zen.
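A sketch of why "is the multiplier pipelined?" matters so much: for a stream of independent multiplies on one unit, a pipelined unit starts a new op every cycle, while a non-pipelined one stays busy for the whole latency. The 3-cycle latency is the figure quoted from the patch; the op count is arbitrary.

```python
def independent_mul_cycles(n_ops, latency, pipelined):
    """Cycles to finish n_ops independent multiplies on a single multiplier unit.

    Pipelined: a new multiply starts every cycle, so only the last one adds full latency.
    Non-pipelined: each multiply occupies the unit for its whole latency.
    """
    if n_ops == 0:
        return 0
    if pipelined:
        return latency + (n_ops - 1)
    return n_ops * latency


print(independent_mul_cycles(100, latency=3, pipelined=True),   # 102
      independent_mul_cycles(100, latency=3, pipelined=False))  # 300
```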
 

JoeRambo

Golden Member
Jun 13, 2013
You nailed it. Most media got it wrong. I hope to get an answer of the AMD guy who created the patch, because some informations in that patch (latencies, non-pipelined multiplier (?), FMA pairing of MUL0 + ADD1 (not ADD0?) and MUL1 + ADD1) look strange.


I have noticed it as well.
"znver1-double,(znver1-fp0+znver1-fp3)|(znver1-fp1+znver1-fp3)")
Probably a mistake, since several lines earlier fp2/fp3 work in tandem on 256 bits. Though with AMD you never know for sure; they might shoot themselves in the foot with some crazy stuff like trying to execute FMA with 5-cycle latency and some crazy bandwidth that depends on the moon phase.
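In GCC's scheduler descriptions, `,` starts the next cycle, `+` reserves units together in the same cycle, and `|` lists alternatives. A small sketch checking how many FMAs could issue per cycle under the quoted reservation: both alternatives need fp3, so only one can go at a time. The second unit assignment is a hypothetical "fixed" pairing for comparison, not something from the patch.

```python
from itertools import combinations

# Pipe sets each FMA alternative reserves in the same cycle.
AS_PATCHED = [{"fp0", "fp3"}, {"fp1", "fp3"}]        # from the quoted reservation line
HYPOTHETICAL_FIX = [{"fp0", "fp3"}, {"fp1", "fp2"}]  # assumed pairing, for comparison only


def disjoint(pipe_sets):
    """True if no pipe is reserved by more than one set."""
    seen = set()
    for pipes in pipe_sets:
        if seen & pipes:
            return False
        seen |= pipes
    return True


def max_fmas_per_cycle(alternatives):
    """Most FMAs that can issue together, each using a different, non-conflicting alternative."""
    for k in range(len(alternatives), 0, -1):
        if any(disjoint(combo) for combo in combinations(alternatives, k)):
            return k
    return 0


print(max_fmas_per_cycle(AS_PATCHED), max_fmas_per_cycle(HYPOTHETICAL_FIX))  # 1 2
```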
 

Shehriazad

Senior member
Nov 3, 2014
With a real launch not being anytime soon... I still question whether the architecture will be able to keep up with the times or end up being another FX-type failure.


If only they had the ability to push it out in H1 2016...Q1 2017 seems more likely.

I really wonder if AMD can even stay relevant until then...(not that I wouldn't like them to...but even the best news about the future doesn't help if that future is really far away)
 

Boze

Senior member
Dec 20, 2004
With a real launch not being anytime soon...I still question if the architecture will be able to go with the times or end up being another FX type failure.

If only they had the ability to push it out in H1 2016...Q1 2017 seems more likely.

I really wonder if AMD can even stay relevant until then...(not that I wouldn't like them to...but even the best news about the future doesn't help if that future is really far away)

That's the $64,000 question right there. I, like you, hope AMD can get their house in order and launch early 2017.

To be quite honest, this processor really needed to come out around the end of this year to the middle of next year, not late 2016 / sometime in 2017. AMD's going to go through hell staying afloat, but maybe they can pull it off.
 

Blitzvogel

Platinum Member
Oct 17, 2010
I'm thinking that Zen will have great floating point performance, but there is no doubt that AMD will have to sell 6-core variants to match quad-core i7s in all metrics. There is also the AVX-512 issue, but I don't see games making use of it anytime soon. I doubt anything runs AVX at the moment; probably just 2x 64-bit or 4x 32-bit floats at once (SSE2 and later).
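For reference, how many values fit in one register at each SIMD width; this is plain arithmetic, register width divided by element size:

```python
def lanes(register_bits, element_bits):
    """How many elements of element_bits fit in a register of register_bits."""
    return register_bits // element_bits


for name, bits in (("SSE", 128), ("AVX", 256), ("AVX-512", 512)):
    print(f"{name} ({bits}-bit): {lanes(bits, 64)} doubles or {lanes(bits, 32)} floats")
```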
 

Borealis7

Platinum Member
Oct 19, 2006
if AMD is just going to implement "Intel-style XYZ" in their new architectures, what's the point of having competition? it's like everyone is selling the same Asetek coolers with their sticker on it.

the point of having 2 kinds of processors is to have differentiation. if both companies implement the same features (resource counts, SMT etc.) what's the advantage of buying an AMD CPU over an Intel CPU?

not IGP - that's for sure - these FXs aren't meant for the IGP crowd.
power consumption? AMD have a bad track record there.
price? the FXs have never been "cheap", they are flagship products.

Intel is unlikely to make another NetBurst in the foreseeable future, someone at AMD needs to get creative and fast.
 

Dresdenboy

Golden Member
Jul 28, 2003
citavia.blog.de
if AMD is just going to implement "Intel-style XYZ" in their new architectures, whats the point of having competition? it's like everyone is selling the same Asetek coolers with their sticker on it.
Like this?


the point of having 2 kinds of processors is to have differentiation. if both companies implement the same features (resource counts, SMT etc') whats the advantage of buying an AMD CPU over an Intel CPU?

not IGP - that's for sure - these FXs aren't meant for the IGP crowd.
power consumption? AMD have a bad track record there.
price? the FXs have never been "cheap", they are flagship products.

Intel is unlikely to make another NetBurst in the foreseeable future, someone at AMD needs to get creative and fast.
The x86 CPUs are and continue to be very different. You get different feature sets, prices, power consumption values, core counts, etc. But there are still some common denominators.
 

sm625

Diamond Member
May 6, 2011
The Bulldozer/Excavator design looks good on paper. But the performance just wasn't there in practice. We have zero reason to believe the same won't be true here. They are putting the right pieces in place, but have they addressed cache latency? If this thing does not specifically outperform Intel in cache latency, then it is going to be slower. Really it is the latency of all operations. Intel could easily spend $10 million just to reduce one single operation by one clock cycle. Repeat that about 50 times and you get your 5% annual improvement in IPC. AMD has to pull out something big in order to compete with that.
 

Dresdenboy

Golden Member
Jul 28, 2003
citavia.blog.de
The bulldozer/excavator design looks good on paper. But the performance just wasnt there in practice. We have zero reason to believe the same wont be true here. They are putting the right pieces in place, but have they addressed cache latency? If this thing does not specifically outperform intel in cache latency, then it is going to be slower. Really it is the latency of all operations. Intel could easily spend $10 million just to reduce one single operation by one clock cycle. Repeat that about 50 times and you get your 5% annual improvement in IPC. AMD has to pull out something big in order to compete with that.
BD was a controversial case, as some AMD people insisted on increased IPC while patents revealed early on that it had only 2 ALUs per core, for example. Then there was the CMT design and its limitations vs. advantages (two threads on a module gained relatively more throughput over a single thread than two threads on an SMT core do). SR improved the shared front end. Final clock speeds and power consumption cannot be seen on paper, or even estimated that easily from the outside, as the error margin at the limits of a process gets rather large.

You're right about caches. They (esp. the L1 I$ and D$) have to provide the data. But the rate of mem operands (either as load, store or load+store) is lower than 1/instruction. So the efficiency of execution itself is important as well. There are many places to create something less efficient. But it's not just a matter of money, but also of the way the chip is being designed and what tradeoffs were made early in the design process. It's easy to reduce longer latencies, but that either costs time (complex design), area (cost per die) or power.
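A toy model of the point about memory operands being less than one per instruction: split cycles per instruction into an execution part plus the average memory stall per operand times the operand rate. All numbers below are made up for illustration, not measurements of any CPU.

```python
def cycles_per_instruction(exec_cpi, mem_ops_per_instr, stall_per_mem_op):
    """Toy model: execution cost plus average memory stall weighted by operand rate."""
    return exec_cpi + mem_ops_per_instr * stall_per_mem_op


# Hypothetical baseline: 0.5 memory operands per instruction.
baseline = cycles_per_instruction(0.5, 0.5, 1.0)
halved_mem_stall = cycles_per_instruction(0.5, 0.5, 0.5)
halved_exec_cost = cycles_per_instruction(0.25, 0.5, 1.0)
print(baseline, halved_mem_stall, halved_exec_cost)  # 1.0 0.75 0.75
```

With half a memory operand per instruction, halving the average memory stall buys exactly as much as halving the execution cost in this model, which is the sense in which execution efficiency matters alongside the caches.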

BTW, with XV AMD just reduced sqrt/fdiv and some other latencies and throughput rates significantly vs. previous gens.
 

MikeA65

Junior Member
May 16, 2015
The wider floating point unit also means that Zen will be able to process less complex instructions at double the rate of Steamroller. Which would mean a massive boost in floating point performance, an area where AMD had historically excelled in with Phenom II and other microarchitectures prior to bulldozer.
I should mention that AVX-512 support was not listed for Zen in the Linux patch that was released in March, which revealed the new instruction set extensions that Zen will support. This is slightly odd but could be explained by a possible lack of 512bit integer support in Zen, which is required for the AVX-512 extension.
There was also one particularly important improvement with Zen that Mr. Waldhauer has managed to spot in a number of patents filed by AMD CPU engineers working on Zen.







Where did you get the CPU comparison chart? It's wrong about Skylake, which has 5-wide decode not 4.
 

superstition

Platinum Member
Feb 2, 2008
if AMD is just going to implement "Intel-style XYZ" in their new architectures, whats the point of having competition?

AMD has been badly burned by benchmark cheese for a long time.

1) Having websites focus primarily on benchmarks that are favorable to Intel's designs.
2) Poison benchmarks so that they don't take full advantage of AMD chips.
3) Poison compilers so that they don't take full advantage of AMD chips.

Did I forget anything?

Plus, the software industry is going to cater to Intel's designs by default because it has more market share.

Mimicking Intel closely makes it hard to use benchmark cheese and makes it hard for the software industry to give Intel an advantage.

However, since Zen apparently isn't optimized for or capable of 512-bit AVX integer, that (and/or 512-bit AVX FP) may be a way benchmarks can be skewed in Intel's favor.

I think a higher level of differentiation would be good but it's hard enough for AMD to compete at all so it's going to have to compete on price.
 

inf64

Diamond Member
Mar 11, 2011
AMD has been badly burned by benchmark cheese for a long time.

1) Having websites focus primarily on benchmarks that are favorable to Intel's designs.
2) Poison benchmarks so that they don't take full advantage of AMD chips.
3) Poison compilers so that they don't take full advantage of AMD chips.

Did I forget anything?

Plus, the software industry is going to cater to Intel's designs by default because it has more market share.

Mimicking Intel closely makes it hard to use benchmark cheese and makes it hard for the software industry to give Intel an advantage.

However, since Zen apparently isn't optimized for or capable of 512-bit AVX integer, that may be (and/or with 512-bit AVX FP) be a way benchmarks can be skewed in Intel's favor.

I think a higher level of differentiation would be good but it's hard enough for AMD to compete at all so it's going to have to compete on price.
There are practically no commercial apps that utilize 256-bit AVX instructions, so this is a non-issue for Zen. Zen has to have good legacy SSE and AVX performance, which seems to be a given looking at the design (or what we know about it).
 

Boze

Senior member
Dec 20, 2004
Like this?


The x86 CPUs are and continue to be very different. You get different featuresets, prices, power consumption values, core counts, etc. But still there are some common denominators.

This image... I had to go to the emergency room for oxygen... couldn't stop laughing.
 