Of course Graviton2 has lower IPC than Zen2; it's based on the weak A76 from 2018. But its core area is only 1.4 mm2, which lets you fit more than twice as many cores as Zen2 (3.6 mm2 per core). Wait for a 128-core Graviton3 based on the A78 (30% more IPC with 5% fewer transistors). And pray they don't use Cortex-X1 cores (60% higher IPC, which is 40% more than Zen2, at 2.1 mm2). How about that. Does x86 still look strong?
This claim was made by @LightningZ71. He is silent because he cannot prove his crazy claim. If you work with this stuff, I'd like to know the numbers from your company: just give us the number of machines and how many of them run with SMT off.
We could have had an 8xALU, 4xAGU, 4xFPU, SMT4 CPU core back in 2003 (the Alpha EV8); what a shame it was cancelled. If Zen3 isn't Keller's EV8 resurrection then AMD is in deep, deep trouble. I hope Zen3 is at least 6xALU, 3xAGU, 4xFPU, SMT4.
AMD needs to bring more tech features and keep moving forward. But we know they can also go backward, as with Bulldozer. So who knows :/
This stuff with the ALU count is ridiculous. If the number of ALUs were a bottleneck, they would have been increased quite a while ago. Integer ALUs are tiny; the scheduling hardware to keep them busy may not be, but the engineers working on these chips have register-transfer-level simulators to explore such design choices and determine where the bottlenecks actually are. Some enthusiast saying "Well, there's your problem, this one has 6 and yours only has 3" isn't going to change the reality that it isn't the bottleneck.
Also, comparing ALU counts across ISAs is not valid. It may not even be valid across different microarchitectures of the same ISA, since some of them split or combine instructions in different ways. ARM is more RISC-like, so we would expect it to need more, simpler instructions to accomplish the same work. The RISC/CISC distinction is mostly obsolete at this point, though. None of the current architectures are very RISC-like anymore; ARM has huge numbers of very specialized instructions. The original idea behind RISC was to use a much reduced instruction set to enable higher clocks and things like better pipelining and out-of-order execution; this is probably why Alpha processors ran at 500 MHz while the Pentium Pro was at 200. About the only thing that has survived is the use of fixed-width instructions and other things that simplify decoding, versus more CISC-like architectures that still have very complex decode.
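To make the split/combine point concrete, here is a tiny sketch. The function name is my own, and the assembly in the comments is illustrative of typical compiler output, not taken from any specific comparison; real output varies with compiler and flags:

```c
/* One C statement, different instruction counts per ISA (illustrative).
   x86-64 can often encode the whole read-modify-write as ONE instruction:
       add dword ptr [rdi], esi
   AArch64 is a load/store ISA, so it needs roughly THREE:
       ldr w8, [x0]
       add w8, w8, w1
       str w8, [x0]
   Internally, the single x86 instruction is typically cracked into separate
   load/ALU/store micro-ops anyway, so "how many ALUs" measures a different
   unit of work on each side. */
void bump(int *p, int x) {
    *p += x;  /* read-modify-write in one source statement */
}
```

So an ARM core retiring three instructions and an x86 core retiring one can be doing the same work, which is exactly why raw ALU-count comparisons across ISAs mislead.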
The current bottlenecks probably favor some more complex instructions, since everything is so memory bound. Complex instructions can act as a kind of instruction compression, so they take less cache space. AMD64 still suffers from higher instruction decode overhead, which is why it needs something like a trace cache (or micro-op cache) to save decoded instructions; that takes the place of the regular instruction cache to some extent. Also, how AMD64 is used has probably changed a lot, since instructions that did not perform well tend to get deprecated. They may still be available via microcode for compatibility, but modern compilers generally won't emit them. IMO, the distinction between RISC and CISC just doesn't really exist anymore.
As for why Apple processors perform so well, I don't think it has anything to do with the ALU count. Current processors are incredibly dominated by cache performance. In fact, given the die area devoted to cache, you are almost buying more of a memory chip than a processing chip. I have profiled applications that were essentially compute bound, and they often still only achieved an IPC near 1. The execution core can execute ridiculous numbers of instructions at 3 to 4 GHz with out-of-order, superscalar, speculative execution, so it almost always comes down to getting the data to the core. Apple still has a relatively low core count (was it 2 high-performance cores and 4 low-power cores?) and a very large shared L2 cache rather than an L3. Some applications do very well with large, low-latency L2 caches; it probably works exceptionally well for the small-memory-footprint applications that normally run on iPhones and iPads. This is also probably why some of the older Core 2 Quad processors still perform very well. I remember some of the old Core 2 Quad models being listed as good enough for compute-intensive VR games early on. People were surprised, but some models had 4 to 6 MB L2 caches; those were very expensive Extreme Edition parts at the time. Having a good cache design (including things like prefetch) is the most important part of modern CPU design. It is also a big source of the improvements in Zen versus previous AMD architectures.
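A rough sketch of the memory-bound effect described above. The helper names are my own, and to actually see the gap you would time these with a profiler (perf, VTune, etc.) on arrays much larger than the last-level cache:

```c
#include <stddef.h>

/* Same N additions, very different memory behavior.  Sequential access
   streams through cache lines and the hardware prefetcher keeps the ALUs
   fed; chasing a shuffled index array defeats the prefetcher, so most
   loads stall on DRAM and measured IPC collapses even though the
   arithmetic is identical. */
long sum_sequential(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];          /* prefetch-friendly streaming access */
    return s;
}

long sum_shuffled(const int *a, const size_t *idx, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[idx[i]];     /* each load likely misses cache */
    return s;
}
```

Both functions return the same sum when `idx` is a permutation of 0..n-1; the difference shows up purely in cache misses, which is why adding ALUs would not help either one.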
The larger core count and small L2 + large L3 is probably necessary for good performance across a wide spectrum of applications, from mobile to server. It isn’t going to be the best at both though. It will be interesting to see what Apple does with making laptop and desktop chips. They don’t need to support server applications though, since they don’t make servers. They may end up with something more like Zen with core clusters for something like the Mac Pro though.
As for SMT, it probably isn't going away unless they decide to include a bunch of tiny, stripped-down, low-power cores instead. A lot of server applications have no use for all of the FP units taking up huge amounts of die space. Such applications will often run just as well on a tiny low-power core, since they are generally not very cacheable either; they just spend most of their time waiting on memory. Such applications are throughput oriented, and we have had architectures specifically designed to run them with a lot of low-power cores or a lot of hardware threads. I don't see why AMD would remove SMT, since it can be shut off if you don't want it. There is also reason to eventually support even higher thread counts for such throughput applications.
For the FP improvements, I am thinking that they will add at least one more 256-bit FMA unit. I don't know quite how the current FP units are architected. I have seen some diagrams that show 2 FMA and 2 FADD units, with one of the FADD units sharing its input ports with the 2 FMA units. Can it actually do 2 FMAs and one FADD per clock? If anyone has a link to more detailed info, it would be appreciated. The FMA units only need 2 operands when doing a multiply, but they need 3 operands for an FMA op. Doubling up the FMA units wouldn't really fit with the 50% number; going up to 3 units would. I am not sure how they would arrange the ports, but they probably would not need to increase them that significantly. I am kind of hoping that they support AVX-512 instructions across 2 clocks, as they did with 256-bit instructions on 128-bit units in Zen 1. Some of the AVX-512 instructions may be needed to compete with Intel independently of the vector width. I don't think there was really that much need to increase the vector width; there shouldn't be much difference between 2x256 vs. 1x512, and three units is more flexible.
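For reference, the three-operand nature of FMA shows up in the classic dot-product loop. This is a minimal sketch; whether the compiler actually contracts the multiply-add into FMA instructions depends on target and flags (e.g. -mfma, -ffp-contract), not on the source code:

```c
#include <stddef.h>

/* A fused multiply-add computes d = a*b + c in one rounding step, so each
   op reads THREE source operands, while a plain multiply reads two.  That
   extra read port per unit is why the port/operand wiring, not just the
   raw unit count, constrains how many FMA pipes a core can feed per cycle.
   A dot product issues one multiply-add per element, making it the textbook
   FMA consumer. */
double dot(const double *x, const double *y, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];   /* candidate for contraction into one FMA */
    return acc;
}
```

Note also that the result is the same whether the hardware runs this as 2x256-bit or 1x512-bit operations per cycle; the peak multiply-add throughput is identical, which is the "not much difference" point above.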