New Zen microarchitecture details

el etro · Apr 9, 2016

Kaveri underdelivered pretty much. Carrizo promised far less and ended up overdelivering a bit.

monstercameron · Apr 9, 2016

el etro said:
Kaveri underdelivered pretty much. Carrizo promised far less and ended up overdelivering a bit.

Having had a Kaveri desktop/laptop long term and a Carrizo laptop currently, I will tell you that Kaveri was a great piece of kit.

Carrizo, on the other hand, was so boring and underperforming. It has been so underwhelming that it has been hard to get motivated to update my blog and subreddit.

Those are just my thoughts on the matter.

Adored · Apr 9, 2016

There's only so much you can do with a fundamentally broken arch. Everything post Piledriver is just open research in preparation for Zen.

JDG1980 · Apr 9, 2016

The Stilt said:
I feel completely the opposite. For example the APUs released after Trinity are something I previously would never have thought someone would actually release to the market.

With Steamroller there is a huge number of features which are either not fully working or completely broken / missing, stuff that doesn't work as documented, etc. It's like someone pulled the plug on the project and many parts of the design were left incomplete, broken or untested. The situation improved quite a lot in Carrizo / Bristol Ridge, but even they still contain some silly errors.

IIRC Trinity was the last project which had John Bruno as the chief engineer. It might be a coincidence, but Trinity is IMO the very last AMD design which is fully functional.

I think it's more likely because Trinity is the last APU that AMD thought might actually be competitive. At some point after the Piledriver architecture, AMD's R&D pipeline was flushed; the original plans for Steamroller and Excavator were canned, and what we got under those names were basically hacked-together stopgaps. There's a reason why AMD never even bothered to make HEDT or server chips for anything past Piledriver.

In other words, AMD's best CPU/platform engineers have all been working on Zen for the past couple of years, and the holdover construction core products got the "B" team.

nismotigerwvu · Apr 9, 2016

JDG1980 said:
I think it's more likely because Trinity is the last APU that AMD thought might actually be competitive. At some point after the Piledriver architecture, AMD's R&D pipeline was flushed; the original plans for Steamroller and Excavator were canned, and what we got under those names were basically hacked-together stopgaps. There's a reason why AMD never even bothered to make HEDT or server chips for anything past Piledriver.

In other words, AMD's best CPU/platform engineers have all been working on Zen for the past couple of years, and the holdover construction core products got the "B" team.

Given their total R&D budget it was the right move to make. All of these sorts of plans play out on the "years" timescale, but AMD saw two key facts: 1) that they were "generations" ahead on the GPU side and 2) that they were "even more generations" behind on the CPU side. Fact number 1 promised them at least some sort of niche, however small it may be, by simply trotting out iterative APUs to keep the ship afloat. Fact number 2 meant it simply wasn't reasonable to keep pouring money into Bulldozer-derived cores, as the ROI was never going to be favorable. Zen represented the best case for higher ROI and, considering their budget limitations, could you really blame them for jumping in with both feet?

YBS1 · Apr 9, 2016

monstercameron said:
Having had a Kaveri desktop/laptop long term and a Carrizo laptop currently, I will tell you that Kaveri was a great piece of kit.

I agree, I couldn't be happier with my Kaveri. For the money I put into it, and the versatility it offers, I can't complain about any aspect of it. I wonder if there is any technical reason they couldn't have offered an 8 core, iGPU less version of this for the FM2+ platform? In theory that would have outperformed the FX line. I think it would have done well for them as a holdover until Zen.

deasd · Apr 9, 2016

Dresdenboy said:
Sorry for picking this up late, but I still had that tab open.
Did you see the different scores at different TDP settings? I think the consensus so far was that, while discussing CMT or SMT scaling, we left out power constraints and turbo modes. Constant clock frequency tests were ideal. With the P3DNow! data I get 73% (CB15) and 79% (CB11.5).

I agree with this; the TDP setting can affect the result a lot. At least the module penalty is heavier than most expected, especially those who still believe it's a dual core.

Abwx · Apr 9, 2016

deasd said:
At least the module penalty is heavier than most expected, especially those who still believe it's a dual core.

It's the other way around: a lot of people exaggerate the penalty. For instance, in CB R15, if a single thread is 100% then two threads will yield 188%, so the penalty is 6%...
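
The arithmetic behind that 6% can be sketched as below. This assumes the "penalty" is the shortfall against ideal 2x scaling when both cores of a module are loaded; the function name is made up for illustration and is not from any benchmark tool.

```python
def module_penalty(single_thread_score, two_thread_score):
    """Fraction of ideal 2x throughput lost when both cores of one module are busy.

    Assumption: 'ideal' means exactly twice the single-thread score.
    """
    ideal = 2 * single_thread_score
    return 1 - two_thread_score / ideal


# Figures quoted in the post: one thread = 100%, two threads = 188%.
penalty = module_penalty(100.0, 188.0)
print(f"{penalty:.0%}")
```

Note that the result depends on the definition: measuring how far the second thread falls short of a full extra thread (88% instead of 100%) gives 12% instead.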

looncraz · Apr 10, 2016

The Stilt said:
I would be happy to hear that people who worked on Hounds, Stars, Bulldozer, Piledriver and Cat cores also worked on Zen. It's the people who worked on Steamroller and Excavator who should be keelhauled, flogged and then banned from working on semiconductors for life.

It's interesting you should mention that, because the rumors I heard some years ago were that AMD had stripped their main team from the construction cores and put Steamroller and Excavator into the hands of a team mostly dedicated to power efficiency.

I believe the core designs for SR and XV were already figured out, so you had bits left dangling as the best talent went to work on Zen and Zen+ under Keller.

Sheep221 · Apr 10, 2016

looncraz said:
It's interesting you should mention that, because the rumors I heard some years ago were that AMD had stripped their main team from the construction cores and put Steamroller and Excavator into the hands of a team mostly dedicated to power efficiency.

I believe the core designs for SR and XV were already figured out, so you had bits left dangling as the best talent went to work on Zen and Zen+ under Keller.

I'm sure it was because of bad management and organization, not the engineers. Many brilliant ideas don't make it to design/volume production, or are left incomplete. The only reason that AMD is not making good CPUs is because they have a bad attitude towards things.

btw, your avatar is so scary

jhu · Apr 10, 2016

YBS1 said:
I agree, I couldn't be happier with my Kaveri. For the money I put into it, and the versatility it offers, I can't complain about any aspect of it. I wonder if there is any technical reason they couldn't have offered an 8 core, iGPU less version of this for the FM2+ platform? In theory that would have outperformed the FX line. I think it would have done well for them as a holdover until Zen.

There is no technical reason why this didn't happen. It was entirely for economic reasons.

DrMrLordX · Apr 10, 2016

monstercameron said:
Kaveri was a pretty good chip, don't understand how you came to this conclusion that it is "sad".

What they did a reasonably good job of doing was covering up the flaws in Kaveri. It did have some serious problems that still dog its users today. Pointless throttling and poor configurability of the memory controller (stuck @ DDR3-2400) are the two most glaring flaws.

Dresdenboy · Apr 10, 2016

jhu, let me quote you from another thread here:

jhu said:
Well, depends on the test and how it's run. From my own tests with Povray, I got a 5% IPC increase (1 thread per core) or 30% IPC increase (2 threads per core) over Core 2.

We've got the 40% IPC increase claim for Zen over XV. I often heard the question of whether this is about per-thread (ST) or per-core throughput. I think your example gives a good data point, as the microarchitectural + architectural changes seem to be much smaller than between XV and Zen. Yet there is a 30% IPC increase for the 2-threaded core over the 1-threaded one, without adding lots of execution resources, large increases in L1 B/W, etc. (even latency went up from 3 to 4).

More here and here.

So I question any claims that the 40% number is only achieved with SMT.
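
For reference, jhu's two Povray figures can be split into a single-thread gain and an SMT yield. This is only a sketch, assuming both percentages (5% with 1 thread per core, 30% with 2 threads per core) are measured against the same Core 2 baseline; `smt_yield` is a name made up here.

```python
def smt_yield(st_gain, mt_gain):
    """Extra per-core throughput from running a second thread.

    st_gain: per-core gain over the baseline with 1 thread per core (0.05 = 5%).
    mt_gain: per-core gain over the same baseline with 2 threads per core.
    """
    return (1 + mt_gain) / (1 + st_gain) - 1


# jhu's Povray numbers relative to Core 2: +5% (1 thread/core), +30% (2 threads/core).
gain = smt_yield(0.05, 0.30)
print(f"{gain:.1%}")
```

Under that assumption the second thread contributes roughly 24% on top of the single-threaded core, which is the part of the 30% that comes from SMT rather than from the core itself.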

looncraz · Apr 10, 2016

Sheep221 said:
I'm sure it was because of bad management and organization, not the engineers. Many brilliant ideas don't make it to design/volume production, or are left incomplete. The only reason that AMD is not making good CPUs is because they have a bad attitude towards things.

btw, your avatar is so scary

Products being released half-done is definitely an issue of management. I think the only reason we saw Steamroller and Excavator at all was because of prior promises AMD made to release them, back even before Bulldozer released. Legal issues abound when you abandon promised products.

Management otherwise didn't care about SR or XV - they didn't help market positioning, they were expenses that would not bring higher revenue, and so on. A six core SR or XV could have made a nice chip, but graphics performance was more important to maintain the existing market.

Thanks, I love my avatar

majord · Apr 10, 2016

Thanks Stilt for the detailed info!

DrMrLordX said:
What they did a reasonably good job of doing was covering up the flaws in Kaveri. It did have some serious problems that still dog its users today. Pointless throttling and poor configurability of the memory controller (stuck @ DDR3-2400) are the two most glaring flaws.

Well (And this applies to Stilt's complaints too) whilst these are valid complaints in isolation, I think for the target market it's a non issue to be honest.

Whilst AMD Marketing pushes some of these as semi-enthusiast chips, the reality is that MOST buyers are not enthusiasts buying APUs just for the 'fun' of pushing them to the limit, and therefore aren't ever going to want to pair fast RAM with these things, purely because it costs too much and is really poor value for the benefit. 2133MHz support was more than enough at the time Kaveri was released.

If anything, Stilt's comments regarding 2400 being the maximum multiplier on BR (and, from what we've heard, possibly also Zen) are more of a concern going forward. It could hurt enthusiasts' enthusiasm for sure, as 2400+ speeds are already starting to become more cost effective.

DrMrLordX · Apr 10, 2016

majord said:
Well (And this applies to Stilt's complaints too) whilst these are valid complaints in isolation, I think for the target market it's a non issue to be honest.

. . . and they covered up the problems by isolating Kaveri to markets where the throttling and RAM speeds would be non-issues. There was so much potential in Kaveri that was wasted . . . it's really quite sad. Don't get me wrong, I like my 7700k, but I've been fighting with it for a while to get the most I can out of it.

swilli89 · Apr 10, 2016

Dresdenboy said:
jhu, let me quote you from another thread here:

We've got the 40% IPC increase claim for Zen over XV. I often heard the question of whether this is about per-thread (ST) or per-core throughput. I think your example gives a good data point, as the microarchitectural + architectural changes seem to be much smaller than between XV and Zen. Yet there is a 30% IPC increase for the 2-threaded core over the 1-threaded one, without adding lots of execution resources, large increases in L1 B/W, etc. (even latency went up from 3 to 4).

More here and here.

So I question any claims that the 40% number is only achieved with SMT.

Dresdenboy, can you elaborate how this applies in the case of XV's 1 module versus Zen's 1 core and the implications of Zen's performance?

Dresdenboy · Apr 10, 2016

swilli89 said:
Dresdenboy, can you elaborate how this applies in the case of XV's 1 module versus Zen's 1 core and the implications of Zen's performance?

This will be one of my next topics to work on.

Exophase · Apr 10, 2016

Glo. said:
Again, bits & chips

http://www.bitsandchips.it/52-english-news/6815-speculations-about-zen-after-our-april-s-fool

They were right before on many occasions, but forum users will always neglect that. There is much more reason to believe the words of Fottemberg about this. You have to work out which is info he provides and which is his speculation (fabbing process).

I'm skeptical about their Zen speculation. The freq/W given is extremely optimistic. And the claim that 2x128-bit FPU is a simplification vs 1x256-bit makes no sense. Fewer but wider SIMD units are simpler and use less power, area, etc for the amount of work done. That's the entire point of SIMD.

What are some of these many occasions Bits and Chips was right before? nVidia's interposer? The impression I get is that they could be a lot like Charlie at SemiAccurate: some legitimate sources and leaks but poor interpretation and analysis.

swilli89 · Apr 10, 2016

Dresdenboy said:
This will be one of my next topics to work on.

Awesome! I very much look forward to reading your thoughts on that.

Abwx · Apr 10, 2016

Exophase said:
I'm skeptical about their Zen speculation. The freq/W given is extremely optimistic.

The frequency/watt improvement is about the one announced by GF for 14nm LPP LVT relative to the 28nm HPP used for Kaveri/Carrizo.

Such a 4C APU would fit within 25W/3.7GHz if shrunk to 14nm; an 8C Zen is logically 4x more power hungry for the expected 4x FP throughput.

itsmydamnation · Apr 11, 2016

Exophase said:
I'm skeptical about their Zen speculation. The freq/W given is extremely optimistic. And the claim that 2x128-bit FPU is a simplification vs 1x256-bit makes no sense. Fewer but wider SIMD units are simpler and use less power, area, etc for the amount of work done. That's the entire point of SIMD.

What are some of these many occasions Bits and Chips was right before? nVidia's interposer? The impression I get is that they could be a lot like Charlie at SemiAccurate: some legitimate sources and leaks but poor interpretation and analysis.

I think the Zen FPU can be viewed from a few different perspectives:

1. FPU design
2. Core/SIMD width
3. SIMD throughput
4. SIMD latency
5. X86 code

1. FPU design,
This looks anything but simple. It looks very much like a future derivative of the bridged FMA design that AMD released in a white paper pre-Bulldozer; its focus was to not increase the latency or power consumption of ADDs or MULs while still allowing FMA.

2. Designing a core for 256-bit datapaths vs 128-bit has a power cost, and not just in the execution units but from at least the L1D and possibly L2 (bandwidth) through the core and all the way back again. I guess you can call this a simplification, especially in the effort of optimizing your pipeline for power when executing ops that aren't 256 bits wide.

3. This is where it gets interesting, as Zen has 4 pipelines but only 256/128-bit load/store and mixed operations across those pipes. So if the data is "in the core" (FPU PRF) then a Zen core can be 512 bits wide for non-FMA operations, but obviously this would be very hard to sustain.

4. This is one area where Zen will hopefully be good, especially compared to Bulldozer: having "simple" non-SIMD FP instructions like add take 5 cycles, compared to 3 in STARS or 2 in Jaguar, isn't awesome.

5. In enterprise server software SIMD is rare, with 256-bit and FMA being rarer still. In consumer software that is also the case. Where it gets interesting is what popular software is using, and outside of things like encoders and other throughput workloads, thanks to Intel themselves, that is firmly SSE.

Putting all of that together, I think for most workloads the Zen team looks to have made very good choices, but a Zen core isn't going to look great in workloads that have a large amount of 256-bit FMA, where it would be something like 4*2*2 ops for Zen vs 8*2*2 ops for >= Haswell.

TL;DR: I don't think it's simple.
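
To unpack the 4*2*2 versus 8*2*2 comparison: one reading, which is an assumption here and not something the post spells out, is single-precision lanes per SIMD unit, times the number of FMA-capable pipes, times 2 operations per FMA (one multiply plus one add). A minimal sketch with a made-up helper name:

```python
def peak_sp_flops_per_cycle(simd_bits, fma_pipes, element_bits=32):
    """Peak single-precision FLOPs per cycle per core for FMA-heavy code.

    Assumed reading of the post's figures: lanes * FMA pipes * 2,
    where the 2 counts the multiply and the add of each FMA.
    """
    lanes = simd_bits // element_bits
    return lanes * fma_pipes * 2


zen = peak_sp_flops_per_cycle(simd_bits=128, fma_pipes=2)      # 4 * 2 * 2
haswell = peak_sp_flops_per_cycle(simd_bits=256, fma_pipes=2)  # 8 * 2 * 2
print(zen, haswell, haswell / zen)
```

On this reading the gap is a factor of two per core per clock, and it only shows up in code that actually sustains 256-bit FMA.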

Exophase · Apr 11, 2016

itsmydamnation, I'm not making an argument that there aren't reasonable tradeoffs between 128-bit and 256-bit SIMD, or that the latter is always preferable. Simply saying that what Bits and Chips stated, that 2x128-bit is simpler than 1x256-bit, is nonsense. You need a lot more than just twice as many data lines to make the former happen. Also, once you do have 2x128-bit symmetrical pipelines there isn't really that much you need to add to make it split AVX to both pipes simultaneously. The instruction set was even designed with this in mind, eg by not having full shuffles. So I never really got what Bulldozer's SIMD "fusing" was supposed to amount to.

itsmydamnation · Apr 11, 2016

Exophase said:
itsmydamnation, I'm not making an argument that there aren't reasonable tradeoffs between 128-bit and 256-bit SIMD, or that the latter is always preferable. Simply saying that what Bits and Chips stated, that 2x128-bit is simpler than 1x256-bit, is nonsense.

I was agreeing with you

Exophase said:
The instruction set was even designed with this in mind, eg by not having full shuffles. So I never really got what Bulldozer's SIMD "fusing" was supposed to amount to.

As far as I know, Bulldozer never fused together its 128-bit units, but executed the two 128-bit halves back to back for 256-bit ops, unless you mean something else.

Exophase · Apr 11, 2016

itsmydamnation said:
I was agreeing with you

Okay, nevermind then

itsmydamnation said:
As far as I know, Bulldozer never fused together its 128-bit units, but executed the two 128-bit halves back to back for 256-bit ops, unless you mean something else.

AFAIK, a fastpath double instruction was issued, and there was nothing stopping it from executing on two SIMD ports in parallel, so long as they were available.

But a lot of people said that two pipes "fused" together to become one AVX pipe. AFAIK AMD themselves used such language. It's fairly meaningless. But I think this concept is what the Bits and Chips article is alluding to.
