IMO # transistors per chiplet could have increased, but I doubt "real" density has changed much. What I mean by real density is transistor density in the L3$ or actively used core areas. I suspect it's about the same, for various reasons, including that they'd be constrained by the faster clocks and heat density. Increasing density while also increasing clocks, without a process shrink, would require serious thermal work, and I'm not sure they've changed a whole lot on that front.
I think a decent chunk of that transistor budget was spent on improving security. Zen3 supports shadow stack, SEV-SNP, and a whole bunch of hardening measures, which you can find in the latest manuals (updated since mid-October).
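In that spirit, one quick way to see which of these hardening features a Linux kernel advertises is to look at the cpuinfo flags. A minimal sketch, not any official AMD tooling; the flag names (`shstk`, `sev_snp`, etc.) are my assumption based on Linux cpuinfo naming and can vary by kernel version:

```python
# Hedged sketch: parse a /proc/cpuinfo-style "flags" line and report which
# security-related features the kernel advertises. Flag names are assumed
# from Linux naming conventions and may differ across kernel versions.

SECURITY_FLAGS = {
    "shstk": "shadow stack (CET)",
    "sev_snp": "SEV-SNP",
    "smep": "SMEP",
    "smap": "SMAP",
}

def advertised_features(flags_line: str) -> list[str]:
    """Return human-readable names of the known security flags present."""
    present = set(flags_line.split())
    return [name for flag, name in SECURITY_FLAGS.items() if flag in present]

if __name__ == "__main__":
    # On a real system you would read the live flags line, e.g. the first
    # line starting with "flags" in /proc/cpuinfo.
    example = "fpu msr sse2 smep smap sev sev_es sev_snp shstk"
    print(advertised_features(example))
```

On a machine whose kernel exposes these flags, feeding the real `/proc/cpuinfo` flags line through `advertised_features` shows at a glance what the platform claims to support.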
Disappointed that Zen3 isn't actually any wider in the front end, execution stages, or even the load/store pipeline.
Amazed they pulled off ~20% IPC and higher clocks at roughly the same transistor count, on the same process, at the same TDP.
I am more impressed by their pragmatic approach of finding a balance between manufacturability and performance.
Papermaster specifically mentioned this: they could have gotten more performance by throwing power and transistors at the problem, but that was not how they wanted to tackle it this generation.
If N5P matures over the next few quarters, even a conservative ~1.5x gain in density would give them a lot of transistor budget to play with.
I won't be surprised if they didn't make Zen3 any wider or add more execution units simply because they knew the bottleneck is elsewhere, like the DDR4 memory subsystem.
DDR5 or even X3D might be among the things that tackle that problem.
Zen4, imo, could be an even more radical uplift compared to Zen3.
This statement "The whole thing is supposed the brainiac vs speed demon approach" reveals his bias. There is no brainiac vs speed demon approach. ARM and Apple don't reach 4+ GHz and therefore they must design their pipelines to maximize throughput at slower speeds. Intel and AMD CAN reach those speeds and have found that the combination of higher speed, relatively lower SPEC per GHz, and the scalability to many cores works for them.
Having designed some buggy logic blocks professionally, I can say increasing clock speeds is far from being the dumb speed demon approach. In many cases you need more silicon to design high-speed blocks, and you have to spend so much effort on things like propagation delay, race conditions, and locking that it is often just easier to duplicate logic blocks to gain performance and keep working in a separate, lower-frequency clock domain.
I am either just not great at it or something, but thankfully I got promoted to work on "complete system design" now and haven't dealt with such low-level stuff in a long time.
Andrei keeps using this "oh, but Apple A13 is +64% vs +67%? that's what we're fighting over" line. No, the A13 loses to the 5600X by 20% in SPECint2006. Once Apple designs a chip that beats the 5600X, then we can talk. Apple probably CAN design a nice fast chip, but they haven't. Defending the A13 is a loser's battle when we're talking about desktop CPUs.
I think the article is not intended to cater to a professional audience. Even the Graviton article was quite cringey if you are a professional.
This article is simply a repeat of AMD's slides, with little new information you couldn't surmise already.
Regarding SPEC, I don't know how relevant SPECint2006 is, honestly. From a compute cluster/server perspective, I know the guys in my department don't use it. Some of the software that Phoronix runs in his PTS suite is also what we run in our evaluation labs. Do any of you use it professionally?
What about fp performance?
For the Graviton test it should be meaningful at least, not just for the usual rendering and compute/scientific tasks.
Some of the updates our software vendor made to our distro added AVX2 support to BPF and netfilter, and the performance gain for packets traversing netfilter chains is mind-blowing. If you have a proxy in front of your worker nodes, the difference it makes is night and day: something like 10x-15x less time spent in kernel space before packets get delivered/routed to the end process.
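As an illustration of the measurement itself (not the netfilter/BPF benchmark above), here is a hedged sketch of how you can separate user-space from kernel-space CPU time with Python's `os.times()`; the syscall-heavy workload is a stand-in of my own, assumed only for demonstration:

```python
# Hedged sketch: compare user vs. kernel (system) CPU time for a workload
# using os.times(). This shows the kind of measurement behind "time spent
# in kernel space" claims; it is not the netfilter benchmark itself.
import os

def kernel_vs_user(workload) -> tuple[float, float]:
    """Run workload() and return (user_seconds, system_seconds) consumed."""
    before = os.times()
    workload()
    after = os.times()
    return (after.user - before.user, after.system - before.system)

def syscall_heavy():
    # Many small syscalls, so time accrues mostly as system (kernel) time.
    for _ in range(20000):
        os.getpid()

if __name__ == "__main__":
    user_s, sys_s = kernel_vs_user(syscall_heavy)
    print(f"user={user_s:.3f}s system={sys_s:.3f}s")
```

Running the same workload before and after a kernel update and comparing the system-time column is a crude but quick way to sanity-check a "less time in kernel space" claim.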