> Which is why I am wondering about the wisdom of Lion Cove and Skymont dropping it.

AMD hasn't OoOE'd the SMT away. 8-wide decode/maximum instruction rate on Zen 5 is only really feasible with SMT enabled.
> AMD hasn't OoOE'd the SMT away. 8-wide decode on Zen 5 is only really feasible with SMT enabled.

True that. It seems like they have Zen 5 balanced pretty well with respect to decode, execution, and retire paths. It wouldn't be able to feed all those execution units in parallel without the wide decode front end.
Intel has too much OoOE optimization for SMT to yield a worthwhile massive uplift.
> Intel never got more than 10% for some reason. Their design must be garbage.

Ya know, I really don't quite understand this, but you are right (although I generally hear 10-15%). Intel was first to the desktop with SMT. Before P4, SMT was only used in high-end server/workstation chip designs. Strange that they haven't managed to get better at it in all these years.
> True that. It seems like they have Zen 5 balanced pretty well with respect to decode, execution, and retire paths. It wouldn't be able to feed all those execution units in parallel without the wide decode front end.

I'm really, really hoping AMD improves Zen 6 such that both sets of decoders can work on the same thread in SMT-off mode. Then for those who want maximum MT throughput, they can leave SMT on. For those who want a stronk core, they can turn SMT off.
> Given that Mike Clark himself was confused for a bit over whether both sets of decoders could be utilized without SMT, my guess is that it's just bugged in Zen 5.

If it’s a bug, I hope it can be addressed via some sort of microcode update.
> If it’s a bug, I hope it can be addressed via some sort of microcode update.

No, hard errata is hard errata.
> If it’s a bug, I hope it can be addressed via some sort of microcode update.

The problem is, we already know what to expect from Zen 6. If they are updating the clustered decode to work that way, a significant portion of the 10 or so percent gains will be entirely due to the decode change.
> The problem is, we already know what to expect from Zen 6. If they are updating the clustered decode to work that way, a significant portion of the 10 or so percent gains will be entirely due to the decode change.

Well, there’s also the improved connection to the IOD, which hopefully reduces memory access latency. All in all, it should lead to a core that’s better fed.
> The problem is, we already know what to expect from Zen 6. If they are updating the clustered decode to work that way, a significant portion of the 10 or so percent gains will be entirely due to the decode change.

That, or Zen 5 hasn't reached all its performance targets, and Zen 6's slated uptick in performance is rated against a not-bugged Zen 5.
Intel never got more than 10% for some reason. Their design must be garbage.
> You got it backwards. 2-way SMT scaling is exactly 100% when the workload is 100% memory access bound. So obviously SMT scaling is at its lowest when the memory system is fast and efficient. Intel has pretty good SMT scaling on the server side, where memory accesses are pretty slow.

Cinebench R11.5 to CB R23 doesn't rely on memory access, yet SMT scaling is 30%.
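The intuition behind the "SMT scaling approaches 100% when memory bound" claim above can be sketched with a toy model (my own simplification, not anything from the thread or from vendor docs): each thread alternates compute cycles with memory-stall cycles, and with 2-way SMT the second thread can execute during the first thread's stalls.

```python
def smt_scaling(compute_cycles, stall_cycles):
    """Throughput ratio of 2-way SMT vs. one thread in a toy model where
    each thread does `compute_cycles` of work, then stalls on memory for
    `stall_cycles`. 1.0 means no SMT gain; 2.0 means a +100% gain."""
    single = compute_cycles / (compute_cycles + stall_cycles)
    # The core can't exceed its peak (1.0), so two threads saturate it
    # once their combined compute demand fills every cycle.
    both = min(1.0, 2 * compute_cycles / (compute_cycles + stall_cycles))
    return both / single

# Heavily memory-bound: the second thread hides almost all stalls.
print(smt_scaling(10, 90))           # 2.0  -> +100% from SMT
# Compute-bound: the core is nearly saturated by one thread already.
print(round(smt_scaling(90, 10), 2))  # 1.11 -> ~+11% from SMT
```

Under this model, a fast and efficient memory system shrinks the stall term and drives SMT gains toward zero, which is the poster's point.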
> The problem is, we already know what to expect from Zen 6. If they are updating the clustered decode to work that way, a significant portion of the 10 or so percent gains will be entirely due to the decode change.

Here we go again. I will quote Chester from Chips&Cheese:

> My view is tunnel visioning on the decoders misses the elephant in the room. Backend memory access latency and frontend latency are holding back perf. You can find frontend bandwidth bound slots but there aren’t a lot of them. If the frontend was struggling to feed a 4-wide decoder due to BTB/iTLB/L1i miss latency, it’s not clear how much benefit you’d get from adding more decode slots that you also can’t feed. Also the uop cache covers most of the instruction stream.

In other words, they have more pressing issues to fix than clustered decode, because the uop cache is doing pretty well for now.
> That, or Zen 5 hasn't reached all its performance targets, and Zen 6's slated uptick in performance is rated against a not-bugged Zen 5.

Zen 6's 10+% IPC figure comes from the same slide that stated 10-15+% for Zen 5. Translated from marketing speak: Zen 6 10-14%, Zen 5 10-19%.
> The problem is, we already know what to expect from Zen 6. If they are updating the clustered decode to work that way, a significant portion of the 10 or so percent gains will be entirely due to the decode change.

Larger decode width brings better decode bandwidth. But according to various profiling results, that's not the issue for Zen 5. Zen 5 is frontend *latency*-bound. This means throwing wider decode or more bandwidth from the uop cache at it won't change the overall latency.
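The bandwidth-vs-latency distinction can be illustrated with a toy fetch model (my own sketch, with made-up numbers, not a claim about Zen 5's actual parameters): runs of instructions between frontend stalls (BTB/iTLB/L1i misses) are decoded at some width, then the frontend sits idle for the stall latency.

```python
import math

def frontend_ipc(run_len, width, stall_cycles):
    """Sustained frontend throughput when runs of `run_len` instructions,
    decoded `width` at a time, are separated by a stall of `stall_cycles`."""
    decode_cycles = math.ceil(run_len / width)
    return run_len / (decode_cycles + stall_cycles)

# Latency-bound (long stalls): doubling decode width barely helps.
print(round(frontend_ipc(32, 4, 20), 2))  # 1.14
print(round(frontend_ipc(32, 8, 20), 2))  # 1.33
# Bandwidth-bound (no stalls): doubling width doubles throughput.
print(frontend_ipc(32, 4, 0))             # 4.0
print(frontend_ipc(32, 8, 0))             # 8.0
```

When the stall term dominates the denominator, extra decode slots mostly sit idle, which is the Chips&Cheese argument in miniature.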
> No, hard errata is hard errata.

Has it been confirmed anywhere to be an errata though?
Like TSX in Zen3.
> Cinebench R11.5 to CB R23 doesn't rely on memory access, yet SMT scaling is 30%, which says that SMT scales when single-thread IPC is low and there are lots of resources left unused.

Which makes sense.
> No matter how you slice it and analyze it, Zen 5 ends up with faster single thread performance than Arrow Lake (obviously SMT isn't helping it here, and could conceivably be hurting it), and higher MT performance than Arrow Lake (with SMT helping quite a lot).

Nah, I don't think we have sufficient data to say Zen 5 is universally faster in ST workloads or that it does universally better at MT.
> It has already been confirmed we can expect 16-core CCDs on N3P.

Where? And was it made clear this will be a consumer-facing die? I mean, we already have 16-core CCDs in Turin D.
> understanding that the design changes will likely impact ST much more than any clock increase

Sometimes design changes are done to enable a clock increase.
> Turin D is even stomping ARM as I understand it

It's doing better than Ampere, which is using custom ARM cores.
> Has it been confirmed anywhere to be an errata though?

Even if it is, there will be no official confirmation. It'd send a bad message to customers (we're selling a "defective" product).
> Where? And was it made clear this will be a consumer-facing die? I mean, we already have 16-core CCDs in Turin D.

At some point in the past there was a preso floating around that showed both a 16- and a 32-core CCD for Zen 6/Zen 6c.
> View attachment 109634

I am wondering. I would have expected parallel compilation to benefit a lot from SMT.
What's the reason for this? Without SMT it is as good as 2P.
View attachment 109635
Cinebench R11.5 to CB R23 doesn't rely on memory access, yet SMT scaling is 30%, which says that SMT scales when single-thread IPC is low and there are lots of resources left unused.
> Intel never got more than 10% for some reason

What do you mean by "never"? Software that's well suited to SMT yields much more than that on Intel systems. I couldn't find newer benchmarks, but an 8700K had several uplifts of 30% and more here. The SMT uplift on Ryzen/Epyc is higher, but Intel's was not just 10%. Or is this just about Xeons, and can you provide some data for this?
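For what it's worth, the uplift percentages thrown around in this thread are just the ratio of multithreaded scores with SMT on versus off; a quick sketch of the arithmetic (the scores below are made-up placeholders, not real benchmark numbers):

```python
def smt_uplift_pct(score_smt_on, score_smt_off):
    """Percent gain in a multithreaded score from enabling SMT."""
    return (score_smt_on / score_smt_off - 1.0) * 100.0

# Hypothetical scores, purely to show how the figures are computed:
print(round(smt_uplift_pct(1300, 1000)))  # 30 -> the "30% uplift" case
print(round(smt_uplift_pct(1100, 1000)))  # 10 -> the "10% uplift" case
```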