4x L1D per core (or 2x vs Excavator)
much lower L2 latency (also smaller L2)
wider execution resources
lower-latency FPU
much better L3
I don't disagree with you that there are said to be some major changes...
But the question is always: by how much?
Can we QUANTIFY this in numbers?
The design philosophy is one thing and the process another, and the two are separately controlled. Both have to be in sync to achieve the processor's design goals and a successful product at launch. Then there is the timing, which has to be right.
Do I think the Zen design is in the right direction? Yup. The process? It doesn't seem so. The timing? Nope.
L1/L2 latency hurt BD big time, along with poor branch misprediction rates on a deep pipeline, and then there was no uop cache or efficient BTBs to mask this. A small L1D and low cache associativity exacerbated these problems even more. Anything FP/SIMD ran into serious bottlenecks, especially when it required memory ops.
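To put rough numbers on why a deep pipeline plus a weak predictor hurts so much, here is a minimal back-of-envelope sketch. All figures are illustrative assumptions, not measured Bulldozer parameters:

```python
# Toy model of how misprediction rate and pipeline flush depth combine
# into an effective CPI penalty. Numbers are illustrative assumptions.

def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty):
    # Each mispredicted branch flushes the pipeline and pays roughly
    # pipeline-depth cycles before useful work resumes.
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty

# Shallow pipeline + good predictor vs deep pipeline + weak predictor
good = effective_cpi(1.0, 0.20, 0.03, 14)
bad  = effective_cpi(1.0, 0.20, 0.08, 20)
print(f"good: {good:.3f} CPI, bad: {bad:.3f} CPI")
# good: 1.084 CPI, bad: 1.320 CPI
```

The point is that the penalty term scales with both the mispredict rate and the flush depth, so a deeper pipeline magnifies every prediction miss.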
Wider execution only helps when your dispatch/issue, prefetch and fetch, schedulers, and all the pointers, trackers, and buffers in between are smart, big, and fast, with excellently tuned schemes and algorithms (including trace cache/fetch buffers), so that the EUs are not starved and stalls or structural hazards don't cause huge penalties; for example, a single stall downstream can leave the i-buffers upstream full while execution latencies stay high.
Small bottlenecks, like i-cache fetch being limited to selecting only one line per cycle, are the ones that tend to cause major performance penalties when your branch prediction isn't particularly efficient.
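A quick sketch of that fetch limit: if only one i-cache line can be selected per cycle, a taken branch ends the useful portion of the fetched line, so branchy code never gets near the line's instruction capacity. The line size and instruction size below are generic assumptions for illustration, not real BD parameters:

```python
# Toy model: fetch limited to one i-cache line per cycle means average
# fetch bandwidth is capped by the distance between taken branches.
# Line/instruction sizes are illustrative assumptions.

LINE_BYTES = 64
AVG_INST_BYTES = 4                            # assumed average x86 inst size
MAX_PER_LINE = LINE_BYTES // AVG_INST_BYTES   # 16 instructions per line

def fetch_bandwidth(insts_between_taken_branches):
    # Useful instructions fetched per cycle: whichever runs out first,
    # the cache line or the straight-line run of code.
    return min(MAX_PER_LINE, insts_between_taken_branches)

print(fetch_bandwidth(6))   # branchy code: capped at 6 per cycle
print(fetch_bandwidth(40))  # straight-line code: line-limited at 16
```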
Then you have the general improvements in prediction/prefetch/load-store.
Then there are also the knowns: stack cache, trace cache, checkpointing, etc.
These can add as little as 2-5% depending on what exactly is done and to what extent, or as much as 30-40%. Major unknowns, but we'll possibly have these details after Hot Chips 28 (Aug 21-23), like we did for BD back then.
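One way to see how individually small features reach that 30-40% range: independent per-feature speedups roughly multiply rather than add. The percentages here are purely illustrative, echoing the ranges above, not any leaked Zen numbers:

```python
# Rough compounding of independent per-feature speedups.
# Gains multiply: (1+g1)*(1+g2)*... - 1. Percentages are illustrative.

def combined_speedup(gains):
    total = 1.0
    for g in gains:
        total *= 1.0 + g
    return total - 1.0

low  = combined_speedup([0.02, 0.03, 0.05])   # a few modest tweaks
high = combined_speedup([0.10, 0.15, 0.12])   # a few major reworks
print(f"low: +{low:.1%}, high: +{high:.1%}")
# low: +10.3%, high: +41.7%
```

So three modest tweaks land near the bottom of the quoted range, while three major reworks compound past 40%.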
It is even more difficult to estimate the gain because the Zen design is not supposed to be a K7-based Thuban improvement, where we would know what to compare and estimate against, nor is it derived from BD; it is something completely different. Each of those had its own design-philosophy and process-based bottlenecks. A less-than-spectacular process, far behind parity, plus die size magnified each of their struggles. Who knows where the Zen design will bottleneck?
Just remember that BD also gained many theoretical improvements on paper in most departments (for instance, the huge BTBs, IBBs, the ROBs/PRFs and schedulers, the branch fusion, the overhauled branch predictors, 4-wide vs 3-wide decode, the repaired return stack, 1GB pages and a much bigger L1/L2 ITLB, loop detection, the predecode/pick buffers, RIP queues, etc.).