ryanjagtap
Member
- Sep 25, 2021
I think that both the Zen 5 and Zen 5c cores in STX are designed to have double-pumped AVX-512 (2x256). We might see a bigger delta in performance between Granite Ridge and Strix Point.
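If "double pumped" here means each 512-bit op is cracked into two 256-bit halves, the peak-throughput gap is easy to sketch. A rough illustration only: the two-FMA-pipe count and the datapath widths below are assumptions made for the arithmetic, not AMD's published figures.

```python
# Peak FP32 lanes retired per cycle for a stream of 512-bit AVX-512 FMAs.
# "double_pumped" models a 2x256 datapath that spends two cycles per
# 512-bit op; "full_width" models a native 512-bit datapath.
DATAPATH_BITS = {"full_width": 512, "double_pumped": 256}
FP32_BITS = 32

def peak_fp32_lanes_per_cycle(impl: str, fma_pipes: int = 2) -> int:
    # Assumed: 2 FMA pipes. Real pipe counts may differ per core.
    return fma_pipes * (DATAPATH_BITS[impl] // FP32_BITS)

print(peak_fp32_lanes_per_cycle("full_width"))    # -> 32
print(peak_fp32_lanes_per_cycle("double_pumped")) # -> 16
```

Same ISA support either way; the double-pumped core just pays in throughput on 512-bit-heavy code, which is why a bigger Granite Ridge vs. Strix Point delta would be expected there.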
> It looks like the op cache contents changed from macro-op to inst/fused inst.

What is the significance of the change? Wouldn't those instructions need to be turned into macro-ops anyway?
> What is the significance of the change? Wouldn't those instructions need to be turned into macro-ops anyway?

They can get more use out of the same capacity. The raw capacity has been lowered from 6.75K to 6.0K entries, but given this, it may actually not be a regression.
> They can get more use out of the same capacity. The raw capacity has been lowered from 6.75K to 6.0K entries, but given this, it may actually not be a regression.

Yes, my understanding is that macro-ops are very large.
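A rough way to see when the smaller op cache stops being a regression: if an entry holding a fused inst stands in for what previously took two macro-op entries (an assumption made purely for illustration, not something AMD has stated), the break-even fusion rate follows directly from the two entry counts quoted above.

```python
# Entry counts from the discussion above; the "one fused entry covers
# two old entries" assumption is illustrative only.
OLD_MOP_ENTRIES = 6750    # 6.75K macro-op entries
NEW_INST_ENTRIES = 6000   # 6.0K inst/fused-inst entries

# Fraction of entries that must hold fused pairs for the new cache to
# cover at least as much work as the old one.
break_even = OLD_MOP_ENTRIES / NEW_INST_ENTRIES - 1
print(f"{break_even:.1%}")  # -> 12.5%
```

So if more than about one entry in eight holds a fused pair, the 6.0K-entry cache is effectively no smaller than the 6.75K one.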
> They can get more use out of the same capacity. The raw capacity has been lowered from 6.75K to 6.0K entries, but given this, it may actually not be a regression.

So,
- 4 to 8-wide decode/fetch and 33% increased dispatch
- Zero-bubble branch + larger L1 BTB (11x!)
- Wider OpCache associativity and 33% more bandwidth
- Much larger ROB, PRF, and scheduler entries
- 33% increased Load and 2x increased Store
- 32KB 8-way to 48KB 12-way L1 cache
- Doubled L2 cache associativity and bandwidth

16% gains. That's why people are disappointed.
> 16% gains. That's why people are disappointed.

To be fair, it does sound like a lot of design investment for that gain.
> So,
> - 4 to 8-wide decode/fetch and 33% increased dispatch
> - Zero-bubble branch + larger L1 BTB (11x!)
> - Wider OpCache associativity and 33% more bandwidth
> - Much larger ROB, PRF, and scheduler entries
> - 33% increased Load and 2x increased Store
> - 32KB 8-way to 48KB 12-way L1 cache
> - Doubled L2 cache associativity and bandwidth
> 16% gains. That's why people are disappointed.

It all seems fine to me, with the caveat that N4P isn't a major improvement over N5 and the die size remained the same.
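One note on the "32KB 8-way to 48KB 12-way L1 cache" line: scaling capacity and associativity by the same 1.5x keeps the number of sets (and hence the index bits) unchanged, which is what lets a VIPT L1 grow past 32KB with 4KiB pages without aliasing problems. A quick check, assuming the standard 64-byte x86 line size:

```python
LINE_BYTES = 64  # standard x86 cache-line size (assumed)

def num_sets(size_kib: int, ways: int) -> int:
    """sets = total size / (ways * line size)"""
    return size_kib * 1024 // (ways * LINE_BYTES)

print(num_sets(32, 8))   # -> 64 sets (old L1)
print(num_sets(48, 12))  # -> 64 sets (new L1, same index bits)
```

With 64 sets and 64-byte lines, the index plus offset still fits inside the 12 untranslated page-offset bits, so the lookup can keep starting before the TLB answers.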
> So, […the same list of front-end and back-end changes…] 16% gains. That's why people are disappointed.

That seems to be the first iteration of a new design (not a clean sheet, but still lots of changes). The next iteration(s) will pick the low-hanging fruit from that point, along with, hopefully, improved processes.
> That's how it works. Golden Cove had massive changes that brought <20% too. Big-ticket items do not necessarily translate to huge gains on their own.

See, this is where the few % matters. At 19%, there would be a lot fewer complaints, and the 16% includes the Geekbench SHA result, which is boosted by the AVX-512 enhancements.
> See, this is where the few % matters. At 19%, there would be a lot fewer complaints, and the 16% includes the Geekbench SHA result, which is boosted by the AVX-512 enhancements.

The disappointment of course goes to Intel, with 14%, as well.
> If there's one thing I have learned in the semi industry, it's that there is no such thing as a free lunch. AMD did a 15%-ish perf bump with no major shrink and seemingly no major increase in core power. That's a good thing.

Yeah, I was skeptical of large predictions in the beginning. It is a good advancement in itself.
> Reminder: http://www.portvapes.co.uk/?id=Latest-exam-1Z0-876-Dumps&exid=threads/zen-5-architecture-technical-discussion.2619688/post-41253043
>
> 2.05%: Zero-bubble branch + larger L1 BTB (11x!), doubled Fetch
> 4.29%: 4 to 8-wide decode; wider OpCache associativity and 33% more bandwidth
> 5.38%: Much larger ROB (448 entries), PRF, and scheduler entries; more ALUs; 33% increased dispatch, rename, retire
> 4.29%: 33% increased Load and 2x increased Store; 32KB 8-way to 48KB 12-way L1 cache; doubled L2 cache associativity and bandwidth

@Saylick's and your interpretation of AMD's "uplift breakdown" pie chart is over-simplified. It's not as if independent gains from here and there simply add up. Rather, the various changes to different µarch components interact, and the end effect on performance depends on the particular workload.
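For what it's worth, the quoted bucket numbers land on roughly the same ~16% figure whether you naively add them or compound them as if they were independent multipliers, so the arithmetic alone can't distinguish the two readings; the point about interacting components stands either way. A quick check:

```python
# Bucket uplifts from the quoted breakdown.
uplifts = [0.0205, 0.0429, 0.0538, 0.0429]

additive = sum(uplifts)       # gains treated as simply adding up
compounded = 1.0
for u in uplifts:
    compounded *= 1.0 + u     # gains treated as independent multipliers

print(f"additive:   {additive:.2%}")        # -> additive:   16.01%
print(f"compounded: {compounded - 1:.2%}")  # -> compounded: 16.97%
```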
> The (relatively) huge L1 BTB is very interesting. Hopefully they talk about this at Hot Chips or something.

I wonder if it's split between the two decoders. But even if that was the case, 8K entries at L1 is quite large.
> Zen 5 is weird.
>
> It is a core that was being worked on for over 6 years. The original Zen was done in about 5 years on a shoestring budget. The gains are not impressive given the 20+ month cadence.
>
> Sure, its "foundation" role as a >4-wide machine is evident from cases such as:
> * going back on NOP fusion due to increased complexity (but it *might* come back in the future, yeah)
> * the wording used for the unified int scheduler, "symmetry and simplifying pick"
>
> But still...
>
> I'm wondering whether the growing lineup doesn't contribute to the slowdown. AMD now has 4nm/3nm CCDs, narrow/wide FPU, large/small cache, etc. The same goes for the SoCs: 12ch server, 6ch server, 2ch desktop, APU, chiplet APU, MI300/MI400 APUs, etc.

Mike Clark said in the C&C interview that Zen 5 core performance will only get better once software is improved to the point where it can utilize all the expanded features properly. So what we have today is the baseline legacy IPC of this core in current software. I expect that once some time passes (probably after Zen 6 hits the shelves), we will have much better optimized software that will better utilize what is on paper a much better architecture.
> Mike Clark said in the C&C interview that Zen 5 core performance will only get better once software is improved to the point where it can utilize all the expanded features properly. […]

I think such claims are best viewed with a healthy skepticism. The history of claims of "just you wait! This will be so much faster when software is optimized!" is pretty grim.

> I think such claims are best viewed with a healthy skepticism. The history of claims of "just you wait! This will be so much faster when software is optimized!" is pretty grim.

Well, one thing is (almost) for sure: the baseline performance cannot get worse with time.
> Well, one thing is (almost) for sure: the baseline performance cannot get worse with time.

Intel enters the chat
> Intel enters the chat

I keep forgetting the 13th/14th gen issues, my bad.
> I keep forgetting the 13th/14th gen issues, my bad.

And AMD's TLB issue.