Argh, this bugs me so much (it's important to get the semantics right, dammit!)
It's not "AVX workloads", it's 256-bit ops, any 256-bit ops. Both AVX and AVX2 can target 128-bit vectors if they want, for example to use the extra instructions AVX/AVX2 add over SSE, or because console code is tuned for 3-operand 128-bit AVX, etc. (First thing that bugs me, out of the way.)
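To make the encoding point concrete, here's a minimal sketch (my own illustration, function names hypothetical; build with something like gcc -mfma): the same 128-bit math written SSE-style as two instructions, and as the single 3-operand form the VEX encoding enables (strictly the FMA extension, which shipped alongside AVX2). The second version is a "new era" instruction that never touches a 256-bit register:

```c
#include <immintrin.h>

/* SSE-style thinking: separate multiply and add, two instructions */
__m128 muladd_sse(__m128 a, __m128 x, __m128 y)
{
    return _mm_add_ps(_mm_mul_ps(a, x), y);
}

/* VEX-encoded 3-operand form: a single vfmadd on xmm registers,
 * i.e. a post-SSE instruction operating purely on 128-bit vectors. */
__m128 muladd_vex(__m128 a, __m128 x, __m128 y)
{
    return _mm_fmadd_ps(a, x, y);
}
```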
Also, it's not really the width of the units that's the limiting factor for Zen, because it has more FP units than Skylake. It's the load/store bandwidth in and out of the cores: Zen can load 256 bits and store 128 bits per cycle, vs 512/256 for Haswell and later.
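Back-of-envelope illustration (my sketch): one 256-bit SAXPY step needs two 256-bit loads and one 256-bit store per FMA. That's exactly one cycle of Haswell/Skylake load/store bandwidth, but two cycles of Zen's, so the 256-bit op stalls on the memory pipes while the FP units sit idle:

```c
#include <immintrin.h>
#include <stddef.h>

/* y[i] += a * x[i], 8 floats at a time. Per iteration: 2 x 256-bit
 * loads + 1 x 256-bit store feeding a single 256-bit FMA, so the
 * load/store ports, not the FPU, set the pace. */
void saxpy256(float a, const float *x, float *y, size_t n)
{
    __m256 va = _mm256_set1_ps(a);
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256 vy = _mm256_fmadd_ps(va, _mm256_loadu_ps(x + i),
                                        _mm256_loadu_ps(y + i));
        _mm256_storeu_ps(y + i, vy);
    }
}
```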
The point being: AMD's AVX and AVX2 performance is fine; there isn't some magical thing making AMD crap at those instruction sets (BD and PD had real 256-bit instruction issues). It's that Intel has an advantage on anything that is a 256-bit operation. At the same time, AMD/Zen has an advantage on 128-bit operations because it has more units.
If I was AMD I wouldn't go chasing 256-bit or 512-bit AVX performance, or SMT4*; I would be using the massive die and power budget those things cost to increase clocks and IPC. If you look at AMD GPUs (or NV's), they are becoming much better at being CPU-like. The more GPU compute capacity becomes flexible and general, the more a 512-bit CPU becomes a jack of all trades, master of none. If you have a master at both, you can just eat them from both sides.
The general server base doesn't care about really wide vectors, and thanks to Intel's own segmentation, neither does the consumer market.
*Those rumors from Fottemberg aren't worth the bits in the database they are stored on, unless AMD plans on basically copying a POWER9-style methodology of core design, which is really a more unified version of CMT.
Zen, with its 4x128 FPU pipeline, NOT SHARED with INT resources, can do (in 256-bit FP terms):
- 1 FMUL + 1 FADD, or 1 FMAC, per cycle
Skylake, with its 2x256 pipelines:
- 1 FMUL + 1 FADD, or 2 FMACs, per cycle
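Those counts cash out like this (my toy kernel, and only a sketch: real throughput also depends on dependency chains and compiler scheduling). With separate multiplies and adds, the two designs are at rough parity per the numbers above. An FMA-dense kernel with little memory traffic, like a polynomial evaluated per element, is where Skylake's second FMA pipe pays off:

```c
#include <immintrin.h>
#include <stddef.h>

/* Degree-7 polynomial per element (Horner form): 7 FMAs per single
 * 256-bit load, so memory traffic is not the limit. With enough
 * independent iterations in flight, Skylake can retire two 256-bit
 * FMAs per cycle to Zen's one-256-bit-equivalent. */
void poly7(const float *x, float *y, size_t n, const float c[8])
{
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 r  = _mm256_set1_ps(c[7]);
        for (int k = 6; k >= 0; --k)              /* r = r*x + c[k] */
            r = _mm256_fmadd_ps(r, vx, _mm256_set1_ps(c[k]));
        _mm256_storeu_ps(y + i, r);
    }
}
```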
If not using the FMACs (not all algorithms allow that), Zen can only lose on simple calculations that have lots of loads and stores and that are cache-friendly.
If there is an FDIV, an FSQRT, or more than 3-4 instructions per load or store, the bottleneck becomes the other units and not the load/store.
If the code is not cache-friendly (a stream of data to be added or multiplied), then the bottleneck becomes the RAM.
Even if the code is full of FMACs, if either of the two conditions above holds, the limiting factors are the same.
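Two toy loops to make that concrete (again my own sketch; neither one is limited by 256-bit FMA throughput on either chip):

```c
#include <immintrin.h>
#include <stddef.h>

/* Cache-unfriendly case: a pure stream over buffers far bigger than
 * the caches, one add per three 256-bit memory ops. DRAM bandwidth
 * sets the speed on Zen and Skylake alike. */
void stream_add(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i + 8 <= n; i += 8)
        _mm256_storeu_ps(c + i,
            _mm256_add_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i)));
}

/* Division/sqrt-heavy case: vsqrtps and vdivps are long-latency,
 * partially pipelined ops on a single port on both designs, so the
 * divider unit, not load/store width or FMA count, is the wall. */
void rsqrt_full(const float *x, float *y, size_t n)
{
    __m256 one = _mm256_set1_ps(1.0f);
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(x + i);
        _mm256_storeu_ps(y + i, _mm256_div_ps(one, _mm256_sqrt_ps(v)));
    }
}
```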
Think of Blender. Do you think that to do raytracing you don't need any division, sqrt, or complicated calculation (more than 3-4 instructions) for each piece of data fetched from memory?
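For a feel of the arithmetic density, here's a bare-bones ray-sphere hit test (my own sketch, not Blender's actual code; the ray direction is assumed pre-normalized, which itself costs a sqrt and a divide). It's a dozen-plus FP ops plus an FSQRT for every sphere record pulled from memory, well past the 3-4-instructions-per-load mark:

```c
#include <math.h>

/* Distance t to the first hit, or -1 on a miss.
 * ro = ray origin, rd = unit ray direction, c = sphere center. */
float ray_sphere(const float ro[3], const float rd[3],
                 const float c[3], float r)
{
    float oc[3] = { ro[0] - c[0], ro[1] - c[1], ro[2] - c[2] };
    float b  = oc[0]*rd[0] + oc[1]*rd[1] + oc[2]*rd[2]; /* dot(oc, rd) */
    float cc = oc[0]*oc[0] + oc[1]*oc[1] + oc[2]*oc[2] - r*r;
    float disc = b*b - cc;              /* quadratic discriminant */
    if (disc < 0.0f) return -1.0f;      /* ray misses the sphere */
    float t = -b - sqrtf(disc);         /* nearest root: an FSQRT */
    return t >= 0.0f ? t : -1.0f;
}
```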
Only simple BLAS (linear algebra) routines will see big gains from the SKL architecture...