Well, Intel clearly saw enough benefit to increase the L2 cache for the SPR variant of Golden Cove. No question that it's additional layout effort, but maybe it's worth that effort?
You see the additional years it takes them to release the server parts? Well, that's just with AVX-512 and the extra L2 "bolted" on top. You can clearly see from the Skylake-SP die shot that it doesn't even change the layout of the 256 KB portion - they just add the 1 MB on that side.
Also, their server division has grown into a substantial portion of revenue. Sure, their laptop division is large, but desktop is much smaller, and if you're talking about the enthusiast K market, they'd need to design a core pretty much just for them. We've argued about whether they need a third core to split client into two. Perhaps they will, if they get their chips good enough to get back into mobile again.
We have gone from 1-wide decode with the 486 and earlier, 2-wide "superscalar" decode with the Pentium, 3-wide with Core, 4-wide with Haswell, 5 with Skylake, and now 6 with Golden Cove.
Intel chips have been 3-wide since the Pentium Pro/II. The first 4-wide Intel chip was the Core 2. Haswell widened some things but didn't change anything big, hence the relatively small improvement.
Skylake claimed 5-issue, but I think that's with fusion. The Golden Cove slide says they went from 4 to 6, and Agner Fog says that, despite what the Intel manual claims, he couldn't get above 4.
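For anyone wondering what "with fusion" means there: Intel cores have been able to fuse a compare with the conditional branch that follows it into a single uop (macro-op fusion) since Core 2, so a tight loop costs one slot less than its instruction count suggests. A rough sketch below - the asm in the comment is only what a compiler typically emits, not a guarantee:

```c
#include <stddef.h>

/* Rough illustration of macro-op fusion (my sketch, not from the thread).
 * The loop below typically ends up as a handful of instructions ending in
 *     cmp  rcx, rsi
 *     jb   .loop
 * and Intel decoders can fuse that cmp + jb pair into a single uop,
 * which is the sort of thing a "5-issue with fusion" figure is counting. */
size_t count_zeros(const long *a, size_t n)
{
    size_t zeros = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] == 0)      /* the a[i] == 0 test can also fuse (test/cmp + jcc) */
            zeros++;
    return zeros;
}
```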
1. Am I correct in writing that the original reason for HT or SMT is to utilize the unused resources in the CPU? With GC able, in the best case, to execute 6 instructions at a time, unless the code has a lot of parallelism there will be plenty of resources left over to run more than one thread.
2. ....would you expect the additional performance due to HT to increase with the width of the CPU?
Yes, but some of the extra gains will be offset by the other parts that improve single-thread ILP, such as better branch prediction and larger OoOE resources.
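A rough picture for question 1 (my sketch, not anything from Intel): code like this leaves most of a wide core idle, and that idle hardware is exactly what a second SMT thread gets to use.

```c
#include <stdint.h>

/* Minimal sketch: every iteration depends on the previous one, so even a
 * 6-wide core spends most of its time waiting on the previous result here,
 * and most issue slots and ports go unused.  That slack is what SMT exploits
 * by feeding a second thread's instructions into the same core. */
uint64_t dependent_chain(uint64_t x, long n)
{
    for (long i = 0; i < n; i++)
        x = (x >> 1) ^ (x * 7);   /* each step needs the previous x */
    return x;
}
```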
3. Do we know on average how many instructions per cycle a CPU is executing? Obviously the bounds for Golden Cove are 1 and 6, but when running Cinebench R23 ST on average how many instructions do you think are being executed per cycle? 3? 4? 4.5?
You can go a lot under 1. Transactional benchmarks benefit a lot from SMT for the same reason: the core spends most of its time stalled, so there's plenty of slack for a second thread.
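If anyone wants to see IPC well below 1 for themselves, here's a minimal pointer-chase sketch (the sizes and counts are arbitrary choices of mine, not from any benchmark). On Linux, `perf stat ./a.out` will print instructions, cycles, and insn-per-cycle for it - and wrapping Cinebench the same way is the easiest way to answer question 3 empirically.

```c
#include <stdio.h>
#include <stdlib.h>

/* Minimal sketch: a random pointer chase whose working set (~128 MiB)
 * blows out every cache level.  Each load depends on the previous one
 * and misses to DRAM, so the core idles for hundreds of cycles per
 * retired instruction and measured IPC ends up far below 1. */
#define N (1u << 24)                      /* 16M entries * 8 B = 128 MiB */

int main(void)
{
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;

    /* Build one big random cycle (Sattolo-style shuffle). */
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;    /* j < i keeps it a single cycle */
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    size_t p = 0;
    for (long s = 0; s < 100000000L; s++) /* 100M dependent loads */
        p = next[p];

    printf("%zu\n", p);                   /* keep the chase from being optimized away */
    free(next);
    return 0;
}
```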
4. Does Amdahl's law apply to the width of a CPU in the same way it applies to multicore CPUs? Are we reaching a point of very small payback as we increase the width beyond 6-wide decode?
It's different. Wider issue works out because there are multiple independent instruction streams within a thread. Out-of-order execution is what allowed superscalar to be effective, since it speculatively lets a second stream go before the first one is done. When that's easy to do, it will scale pretty much indefinitely. But some code is fundamentally limited: rather than hitting an Amdahl-style limit, you simply run into more and more scenarios where it won't scale, because the code can't be broken down to take advantage of the increased width. But as long as the other parts also get wider, smarter, and better, performance will keep increasing.
Code sizes continue to grow as well.
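A concrete (and admittedly contrived) example of the "can't break the code down" case versus the "scales with width" case - both functions sum the same array; the only difference is how many independent dependency chains the loop exposes. (Compilers won't reassociate the floating-point version for you without something like -ffast-math, so the contrast is real on default builds.)

```c
#include <stddef.h>

/* sum_serial: one loop-carried dependency chain.  Its speed is set by
 * FP-add latency, so a wider decoder/back end doesn't help it at all. */
double sum_serial(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];                        /* each add waits for the last */
    return s;
}

/* sum_split: four independent accumulators, i.e. four chains the core can
 * run side by side - this is the kind of code that keeps rewarding width. */
double sum_split(const double *a, size_t n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)                    /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```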
5. Gracemont is as wide as Golden Cove on the front end and wider on the back end, yet has much lower throughput. Is this primarily because the out-of-order intelligence isn't as sophisticated as GC's? If not, then why?
Deeper BTBs, faster execution units, a uop cache on top of the traditional decode pipeline (although the lower number of stages on Gracemont makes up for that somewhat), better load/store capabilities, and larger buffers both for OoO and the execution units.
So a lot of them are details that aren't, or can't be, shown in a PowerPoint slide. Gracemont may have more dedicated ports, but they are simpler and dedicated to one task each. When it comes to instruction latency, Golden Cove likely has lower latency and higher throughput - like how the Pentium 4 had double-clocked ALUs, but only for simple instructions.