New Zen microarchitecture details

Page 112

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
I've heard that the SMT implementation on Zen would have a very large switching penalty. Meaning that it is not ideal to utilize a single thread on each physical core, and then switch to two threads per each physical core (or vice versa). Any idea why would that be the case with Zen?
 

KTE

Senior member
May 26, 2016
478
130
76
https://twitter.com/Daniel_Bowers/status/768270633125806081?s=09

Question: Did 40% uplift on Zen IPC include boost from SMT? AMD: No, that was just 1-thread improvement. #HotChips
Nice!

Now that clears so many murky waters...

(@Daniel_Bowers): https://twitter.com/Daniel_Bowers?s=09

AMD: "One of our biggest improvements [in Zen] was adding a large op cache" #HotChips

AMD's Mike Clark now on stage talking Zen. "As lead architect, I got to name it" #HotChips

AMD: "We threw SMT on top [of Zen] to give you performance when you need it" #HotChips

@addisonsnell asked Intel "What about server Skylake?" but got a "no comment" response. #HotChips

"Best part about #HotChips is Q&A, when an Intel engineer stands up to pepper AMD for clarifications"



Sent from HTC 10
(Opinions are own)
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
I don't think anyone actually ever expected the additional performance from SMT to be included in the quoted 40%. If they did, that would have been a disaster. The only remaining question was / is the yield AMD's SMT implementation produces.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
I've heard that the SMT implementation on Zen would have a very large switching penalty. Meaning that it is not ideal to utilize a single thread on each physical core, and then switch to two threads per each physical core (or vice versa). Any idea why would that be the case with Zen?
No further info so far. 3 initial hypotheses:
1) They simply need to let that 1T finish w/o further fetch to clear internal resources.
2) There is some OS stuff involved.
3) To me less likely: power gating.

But a thread switch comes with a cost anyway.

I don't think anyone actually ever expected the additional performance from SMT to be included in the quoted 40%. If they did, that would have been a disaster. The only remaining question was / is the yield AMD's SMT implementation produces.
Sure, seeing XV and Zen side by side this is clear for many uarch enthusiasts. But official statements are better for the masses.
 

KTE

Senior member
May 26, 2016
478
130
76
I don't think anyone actually ever expected the additional performance from SMT to be included in the quoted 40%. If they did, that would have been a disaster. The only remaining question was / is the yield AMD's SMT implementation produces.
Sure, seeing XV and Zen side by side this is clear for many uarch enthusiasts. But official statements are better for the masses.
I may be in a minority, but I certainly doubted the claim of a 40% average single-thread IPC gain over EXC. To me, every impartial, scientific individual would probably do the same before hard performance data and the release are available. Or at least reserve some doubt.

I do tend to play devil's advocate before these releases without batting for either side. Talking about performance absolutes beforehand has a long history of going pear-shaped, especially with AMD's recent uarchs.

However, AMD answering the HC audience puts my doubt to rest. It's not much of a marketing event, unlike the rest of the platforms, and they would be seriously discredited for outright lying if they made such a claim dishonestly.

Re POWER9, I haven't had a chance to even read anything about it yet. Too busy at work. I am trying to get a look at the chip this following week tho.
 

DrMrLordX

Lifer
Apr 27, 2000
21,813
11,168
136
https://twitter.com/Daniel_Bowers/status/768270633125806081?s=09

Question: Did 40% uplift on Zen IPC include boost from SMT? AMD: No, that was just 1-thread improvement. #HotChips

Woulda been nice if they had said that earlier! So all it is, is 40% single-threaded performance over XV? That isn't very good actually . . . and certainly not what we saw on display with Blender.

I don't think anyone actually ever expected the additional performance from SMT to be included in the quoted 40%. If they did, that would have been a disaster. The only remaining question was / is the yield AMD's SMT implementation produces.

Yes, that opens up a lot of questions about AMD's implementation of SMT. Their SMT would have to be pretty beastly to catch up to Broadwell-E in Blender even if AVX2 is a non-factor there.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,361
136
Woulda been nice if they had said that earlier! So all it is, is 40% single-threaded performance over XV? That isn't very good actually . . . and certainly not what we saw on display with Blender.



Yes, that opens up a lot of questions about AMD's implementation of SMT. Their SMT would have to be pretty beastly to catch up to Broadwell-E in Blender even if AVX2 is a non-factor there.

Or IPC in Blender is higher than 40%
 

Abwx

Lifer
Apr 2, 2011
11,172
3,869
136
Or IPC in Blender is higher than 40%

It's not possible to draw a conclusion, given Blender's large gains with HT on Intel CPUs, and we don't know what Zen's SMT efficiency is.

It can be 40% IPC and 35% SMT gain, but they could just as well have 50% better IPC and 26% SMT. I suspect that they used this bench so as not to disclose the IPC and SMT numbers and to display only the throughput.

FTR, looking at XV and HW speed in a short Blender rendering, and taking into account BDW's slight gain over the latter, will yield almost 1.9x better throughput for Zen relative to a single XV core, and slightly more than the XV module given the latter's 10% CMT penalty.
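
A quick way to see why the two scenarios above cannot be told apart from throughput alone. This is only a sketch: it assumes the IPC gain and the SMT gain simply multiply at equal clocks, and the function name is made up for illustration.

```python
# Toy model (assumption): throughput relative to one XV core at the same clock
# is (1 + IPC gain) * (1 + SMT gain). Real workloads will not be this tidy.
def combined_throughput(ipc_gain: float, smt_gain: float) -> float:
    """Return throughput relative to the baseline core."""
    return (1.0 + ipc_gain) * (1.0 + smt_gain)

scenario_a = combined_throughput(0.40, 0.35)  # 40% IPC, 35% SMT
scenario_b = combined_throughput(0.50, 0.26)  # 50% IPC, 26% SMT

print(f"40% IPC + 35% SMT: {scenario_a:.2f}x")
print(f"50% IPC + 26% SMT: {scenario_b:.2f}x")
```

Both combinations land at about 1.89x, which matches the "almost 1.9x" figure, so a throughput-only demo does not pin down either number.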
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,689
1,224
136
Another die area calculation... 282.15 mm² to 319.0303872 mm² is the range I got for Summit Ridge. It really depends on whether AMD is using the long channel or extra long channel options for RVT. That would make Raven Ridge around 239.8275 mm² to 271.17582912 mm².
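
For anyone checking the numbers: reading the first range as Summit Ridge and the second as Raven Ridge, both ends imply the same scaling factor. The factor is not stated in the post; the sketch below just reads it off the four figures.

```python
# Die area estimates from the post, in mm^2 (low end, high end).
summit_ridge_mm2 = (282.15, 319.0303872)
raven_ridge_mm2 = (239.8275, 271.17582912)

# Ratio of Raven Ridge to Summit Ridge at each end of the range.
ratios = [rr / sr for rr, sr in zip(raven_ridge_mm2, summit_ridge_mm2)]

for label, ratio in zip(("low", "high"), ratios):
    print(f"{label} end: Raven Ridge / Summit Ridge = {ratio:.4f}")
```

Both ends come out at roughly 0.85, so the Raven Ridge range looks like the Summit Ridge range scaled by a single fixed factor.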

Leakage tends to be:

100 (sLVT 14nm?), 10 (LVT 18nm?), 1 (RVT 20nm?), 0.1 (RVT-LC 22nm?), 0.01 (RVT-ELC 28nm?), 0.001 (HVT (32nm length)), 0.0001 (HVT-LC (45nm length))
? - guesses based on HVT.
RVT tends to be the majority of transistors used in high performance designs, with LVT/HVT being relegated to critical speed/leakage paths. FinFETs push RVT, which is the majority, to a longer channel length than planar would. That is why SOI FinFETs 14HP (up to 70% lower leakage @ same channel length allows RVT to have true node scaling) and UTBB FDSOI 22FDX (lower voltage via FBB and same/smaller channel @ iso perf means up to ~70% lower leakage) would have been preferred.
 

KTE

Senior member
May 26, 2016
478
130
76
Another die area calculation... 282.15 mm² to 319.0303872 mm² is the range I got. It really depends on if AMD is using long channel or extra long channel options for RVT. For Summit Ridge, that would make Raven Ridge around 239.8275 mm² to 271.17582912 mm².
Didn't someone on here predict 180-200ish before?

I've glanced at the parts and can't see less than 230mm^2.

 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Woulda been nice if they had said that earlier! So all it is, is 40% single-threaded performance over XV? That isn't very good actually . . . and certainly not what we saw on display with Blender.
[...]
Yes, that opens up a lot of questions about AMD's implementation of SMT. Their SMT would have to be pretty beastly to catch up to Broadwell-E in Blender even if AVX2 is a non-factor there.
40% incl. SMT would be bad - except you'd expect 40% incl. SMT over a XV module. Now it looks more like a single core with SMT roughly equals or even beats a former module.

I'm sure, with a lot of experience and data, the avg. SMT yield could be estimated within 2-5% from known uarch and SMT implementation details.
 

hrga225

Member
Jan 15, 2016
81
6
11
@all:
BTW did anyone here get, that the stack engine + memfile reduces AGU pressure (remember the 3rd AGU discussions?). That memfile seems to be a small stack cache. A well known MPR editor will cover that topic soon. ^^
Any advances on this, especially the last sentence?
 

KTE

Senior member
May 26, 2016
478
130
76
40% incl. SMT would be bad - except you'd expect 40% incl. SMT over a XV module. Now it looks more like a single core with SMT roughly equals or even beats a former module.

I'm sure, with a lot of experience and data, the avg. SMT yield could be estimated within 2-5% from known uarch and SMT implementation details.
40% on average for ST could never be less than awesome. We're accustomed to 5-20%.

It poses the question of how this 40% is affected by MT accesses in DT/Mb software that scales well to 8 cores.

I mean, are we potentially looking at scenarios of P=EXC 1C+20-30%+SMT?

The converse side is, if SMT is done well, we could be looking at superlinear scaling for servers with the bigger L1/L2 in situations where the data starts fitting into the caches (esp. scientific).

Also CMT != SMT for performance comparisons. 8 BD cores will be compared to these 8 full cores. That's how they are marketed. SMT is an additional extraction from the same cores, rather than a core duplication, so it will be included.

 

TheELF

Diamond Member
Dec 22, 2012
3,993
744
126

NostaSeronx

Diamond Member
Sep 18, 2011
3,689
1,224
136
They raised the op cache and then went ahead and partitioned the op queue and retire/store queue...STATICALLY.
What sense does that make? I have no idea what that does, but it sounds bad to me; it sounds like you have to be running two threads if you want to get the most out of it.
Red, Blue, Indigo, Green is the color palette:
Red is full dynamic. (Any possible allocation)
Blue is full dynamic with static tags. (Each thread gets their own tags)
Indigo is full dynamic with thread driven priority. (Thread A is wider thus can utilize more resources than thread B. X cycles given to X thread over Y thread or X-entries given to X thread over Y thread.)
Green is full static. (Full resources (1T/1 cycle) or round-robin (2T/2 cycles))

Really, overall it isn't a big issue. Zen's 1000 cuts will not be from the above.
 

MajinCry

Platinum Member
Jul 28, 2015
2,495
571
136
Aye, 40% is pretty sweet. The average performance difference between Sandybridge -> Skylake is, what, 20%? Plenty of lads kicking about that say SB is long in the tooth.

+40% over Excavator is +60% over Piledriver. That's around Haswell level.

Not too shabby for a from-scratch CPU.
 
Mar 10, 2006
11,715
2,012
126
Aye, 40% is pretty sweet. The average performance difference between Sandybridge -> Skylake is, what, 20%? Plenty of lads kicking about that say SB is long in the tooth.

+40% over Excavator is +60% over Piledriver. That's around Haswell level.

Not too shabby for a from-scratch CPU.

Sandy Bridge to Skylake is more like 30% perf/clock.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
And the official IPC figure (avg, by AMD) between Piledriver and Excavator is 15.5% (PD to SR = 10%, SR to XV = 5%). Also there are cases where Excavator has up to 12% lower IPC than Steamroller and several cases where the two architectures perform the same. In certain workloads Skylake can have up to 108% higher IPC than Excavator.
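
The 15.5% figure is just the two steps compounded, and the same arithmetic can be used to sanity-check the "+60% over Piledriver" estimate quoted earlier. A minimal sketch, assuming the averages chain multiplicatively and taking AMD's claimed 40% for Zen over Excavator at face value; the helper name is made up.

```python
def compound(*gains: float) -> float:
    """Chain fractional gains multiplicatively and return the overall fractional gain."""
    total = 1.0
    for gain in gains:
        total *= 1.0 + gain
    return total - 1.0

pd_to_xv = compound(0.10, 0.05)         # PD -> SR -> XV
pd_to_zen = compound(0.10, 0.05, 0.40)  # ... -> Zen, using the claimed 40%

print(f"Piledriver -> Excavator: {pd_to_xv * 100:.1f}%")
print(f"Piledriver -> Zen:       {pd_to_zen * 100:.1f}%")
```

On these averages Zen would sit a little over 60% above Piledriver, so the rough "+60%" estimate holds up, with the caveat that per-workload results vary widely as noted above.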
 

inf64

Diamond Member
Mar 11, 2011
3,765
4,223
136
Sandy Bridge to Skylake is more like 30% perf/clock.

As per AT the IPC jump from SB to SKL is 25%.

Sandy Bridge to Ivy Bridge: Average ~5.8% Up
Ivy Bridge to Haswell: Average ~11.2% Up
Haswell to Broadwell: Average ~3.3% Up
Broadwell to Skylake (DDR3): Average ~2.4% Up
Broadwell to Skylake (DDR4): Average ~2.7% Up
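
For reference, chaining those per-generation averages gives the overall SB to SKL number. This is a sketch that assumes the averages compound multiplicatively, which is a simplification since each step was measured on its own benchmark mix.

```python
import math

# Per-generation average IPC gains quoted above, as fractions.
steps = [0.058, 0.112, 0.033]                 # SB->IVB, IVB->HSW, HSW->BDW
bdw_to_skl = {"DDR3": 0.024, "DDR4": 0.027}   # last step depends on memory

def chained_gain(fractions):
    """Multiply the (1 + gain) factors and return the overall fractional gain."""
    return math.prod(1.0 + f for f in fractions) - 1.0

totals = {mem: chained_gain(steps + [gain]) for mem, gain in bdw_to_skl.items()}

for mem, total in totals.items():
    print(f"SB -> SKL ({mem}): {total * 100:.1f}%")
```

Either way the total lands just under 25%, in line with the AT figure.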

I think Zen will be from 5 to 10% slower than Haswell on average in common desktop workloads. That would make it 10-15% slower than Skylake. In specific optimized workloads like FMA or AVX the gap will certainly be bigger but that is a different story and different segment.
 