I found a really great talk about the Lion Cove core changes and had a go at transcribing it, well kind of. Part transcription, part notes. A couple spots with (?) that I couldn't follow. Let me know if you can figure those parts out. Good info here that I didn't see published elsewhere.
Hyperthreading
HT can add 30% MT performance for only a 20% increase in power and a 10% increase in die area.
Compared to a core that includes HT, the same core without HT can use 15% less power for the same ST performance and be 10% smaller in area.
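To put those numbers together, here's a quick back-of-the-envelope calculation (my own sketch, not from the talk) of what the quoted figures imply for perf/watt and perf/area:

```python
# Back-of-the-envelope math for the HT tradeoff figures quoted above.
# All values are relative to the same core without HT (baseline = 1.0).

ht_mt_perf = 1.30   # HT adds ~30% MT performance...
ht_power   = 1.20   # ...for ~20% more power...
ht_area    = 1.10   # ...and ~10% more die area.

mt_perf_per_watt = ht_mt_perf / ht_power   # MT perf/watt gain from HT
mt_perf_per_area = ht_mt_perf / ht_area    # MT perf/area gain from HT

# Removing HT instead: same ST performance at 15% less power.
st_perf_per_watt_no_ht = 1.0 / 0.85

print(f"HT MT perf/watt: {mt_perf_per_watt:.3f}x")           # ~1.083x
print(f"HT MT perf/area: {mt_perf_per_area:.3f}x")           # ~1.182x
print(f"No-HT ST perf/watt: {st_perf_per_watt_no_ht:.3f}x")  # ~1.176x
```

So HT buys MT throughput at only a modest perf/watt gain, while dropping it buys a similar magnitude of ST efficiency — which is the talk's argument for letting E cores carry MT instead.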
20 years ago, when software threads often outnumbered core count and prior to the advent of E cores or processors with high core count, HT was a good way to increase MT performance. But now with E cores and P cores, each optimized by area and power to excel at ST and MT performance, respectively, HT isn’t the best way to increase overall performance on the desktop. In addition, in most desktop applications there are sufficient P and E cores to handle all of the threads created by the software.
HT is still useful in server situations where the software thread count is extremely high and all available compute cores can be utilized. But for Lunar Lake the idea was to remove any transistors that don't increase the goodness of the CPU, and MT is better handled by E cores than by HT.
Smarter Thermal Management and 16.67MHz Bins
While die shrinks have allowed more and more transistors to be packed into smaller areas, along with this advancement comes an increase in thermal density that must be dealt with. Historically, thermal issues in the core were dealt with by throttling, i.e. reducing frequency. Since these "guard band" settings were static and were determined in the lab pre-launch, they had to be set conservatively to protect the core under the worst environmental conditions and heaviest compute loads. This invariably leaves some performance on the table under most "normal" operating conditions.
Intel is now using a novel approach to thermal management: a network-based self-tuning controller that adapts to the real-time operating conditions of the actual workload being run and takes into account the environmental conditions and the thermal solution being used. The controller is thereby able to exploit all of the available thermal headroom, with tighter frequency control to maximize performance. Efficiency is also increased, as the frequency bins have been reduced from 100MHz steps to 16.67MHz steps, again allowing maximum performance and efficiency from the core.
For example, if conditions permit operation at 3.05GHz but not 3.1GHz, the old system would have to hold frequency at 3GHz, while the new system can operate at close to 3.05GHz. The overall performance benefit is approximately 2%.
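As a toy illustration of the binning math (mine, not from the talk; exact fractions are used because 16.67MHz is really 100/6 MHz and float rounding would spoil the result):

```python
from fractions import Fraction

def highest_bin(f_cap_mhz, bin_mhz):
    """Largest multiple of bin_mhz that does not exceed the cap."""
    return (f_cap_mhz // bin_mhz) * bin_mhz

f_cap = Fraction(3050)  # conditions permit 3.05 GHz but not 3.1 GHz

old = highest_bin(f_cap, Fraction(100))     # 100 MHz bins   -> 3000 MHz
new = highest_bin(f_cap, Fraction(100, 6))  # 16.67 MHz bins -> 3050 MHz

print(f"old: {float(old)} MHz, new: {float(new)} MHz, "
      f"gain: {float(new / old - 1):.2%}")  # ~1.67% in this example
```

The gain obviously depends on where the cap falls relative to the old 100MHz grid, which is why the talk quotes ~2% as an overall average rather than a fixed number.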
Front End Improvements
The front end of a CPU is responsible for fetching x86 instructions and decoding them into micro-operations. In order to adequately supply the Out-Of-Order part of the core, the front end must be able to accurately determine the correct code blocks from which to fetch instructions. Lion Cove fundamentally changes the branch prediction scheme, significantly widening the prediction block, up to 8x the previous generation, without sacrificing prediction accuracy. This has two important benefits. First, it allows the BPU (Branch Prediction Unit) to run ahead and prefetch code lines into the instruction cache, alleviating possible instruction cache misses. In this context, instruction cache request bandwidth towards the L2 was increased 3x in Lion Cove to capitalize on the BPU running ahead.
Second, wider prediction blocks allow an increase in instruction fetch bandwidth, and indeed fetch bandwidth was doubled from 64 bytes per cycle to 128 bytes per cycle while decode bandwidth was increased from 6 to 8 instructions per cycle. Decoded instructions are steered towards the micro-op queue and are also built into the micro-op cache. Since code lines are often reused, the micro-op cache allows for efficient, low-latency, high-bandwidth supply of previously decoded micro-ops to the OoO engine without having to power up the fetch and decode pipeline.
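A rough sanity check (my own numbers for average instruction length, which the talk doesn't give) on why fetch had to double to feed the wider decoder:

```python
# With 8-wide decode, fetch must deliver enough bytes per cycle to cover
# 8 x86 instructions. Assuming an average instruction length of ~4 bytes
# (a common rule of thumb for x86 code, not a figure from the talk):

fetch_bytes_per_cycle   = 128   # Lion Cove (was 64 in Redwood Cove)
decode_instrs_per_cycle = 8     # Lion Cove (was 6)
avg_instr_bytes         = 4.0   # assumed average x86 instruction length

instrs_fetchable = fetch_bytes_per_cycle / avg_instr_bytes  # 32 per cycle
print(instrs_fetchable, instrs_fetchable >= decode_instrs_per_cycle)
```

The headroom matters because fetch lines are rarely fully useful: taken branches and misaligned blocks waste bytes, so sustained fetch needs to comfortably exceed decode width.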
The micro-op cache grew from 4,000 micro-ops in Redwood Cove to 5,250 in Lion Cove, and its read bandwidth was increased to supply 12 micro-ops per cycle versus 8 for Redwood Cove. Finally, the micro-op queue grew from 144 to 192 entries, facilitating the service of longer or larger code loops in a power-efficient manner. The OoO engine is responsible for scheduling micro-ops for execution in a manner which maximizes parallelism, thus increasing IPC.
Prior generation P cores employed a monolithic scheduling scheme where a single scheduler was tasked with determining the data readiness of all micro-op types and scheduling and dispatching them across all execution ports. This scheme was exceedingly hard to scale and incurred significant hardware overhead.
Lion Cove solves this by splitting the OoO engine into two domains: integer, which also holds the address generation units for memory operations, and vector. These two domains now have independent renaming structures, catering to optimized bandwidth, and independent schedulers, catering to optimized port utilization (?). This allows future expansion of each of these domains independently of the other and provides opportunities on workloads that exercise only one of these domains.
Lion Cove increases the allocation/rename bandwidth from 6 to 8 micro-ops per cycle, and the OoO depth, or instruction window, was increased from 512 to 576 micro-ops. In addition, the physical register files were enlarged appropriately versus the prior generation. Lion Cove retires 12 micro-ops per cycle versus 8 previously.
Execution
Lion Cove increases the total number of execution ports from 12 to 18. On the integer side, 6 ALUs are complemented by 3 shift units and 3 64-bit multipliers operating at 3 cycles latency and 1 cycle throughput; 3 branches can be resolved in parallel per cycle. On the vector side, Lion Cove has four 256-bit ALUs plus two 256-bit FMAs operating at 4 cycles latency, and two 256-bit floating point dividers with significantly improved latency and throughput for both single and double precision operations versus the prior generation.
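A small sketch (my own, not from the talk) of what "3 cycles latency, 1 cycle throughput" means in practice for those 64-bit multipliers: a dependent chain pays full latency per multiply, while independent multiplies pipeline across the three units.

```python
import math

MUL_LATENCY = 3   # cycles before a multiply's result is ready
MUL_UNITS   = 3   # three 64-bit multipliers, each accepting 1 op/cycle

def dependent_chain_cycles(n):
    """n multiplies, each consuming the previous result: no overlap."""
    return n * MUL_LATENCY

def independent_cycles(n):
    """n independent multiplies: 3 can start per cycle, then the
    pipeline drains for the remaining latency."""
    return math.ceil(n / MUL_UNITS) + (MUL_LATENCY - 1)

print(dependent_chain_cycles(12))  # 36 cycles
print(independent_cycles(12))      # 6 cycles
```

This is why compilers and OoO hardware work so hard to expose independent work: the same 12 multiplies cost 6x fewer cycles when they don't depend on each other.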
Crypto acceleration hardware for AES, SHA, and SM3/SM4 resides in the vector stack.
Memory Subsystem
The memory subsystem is a key part of a performant microarchitecture, and at the heart of the core's memory subsystem are the data caches. Caches are all about striking the perfect balance between bandwidth, latency, and capacity given a certain area and power budget. Lion Cove significantly re-architects the core's memory subsystem to allow for sustained high bandwidth with low average latency, while still keeping built-in scalability and flexibility to increase cache capacity.
The first-level data cache was completely redesigned to allow full operation at 4 cycles latency vs 5 cycles in the previous generation. Lion Cove also introduces a new three-level core cache hierarchy by inserting an intermediate 192KB cache between the 1st and 2nd level caches.
This has two key benefits. First and foremost, it decreases the average load-to-use latency seen by the core, which increases IPC. Second, it allows the L2 cache capacity to grow, keeping a larger portion of the data set closer inside the core without paying the IPC penalty of the added L2 cache latency; indeed, the L2 grows to 2.5MB on Lunar Lake and 3MB on Arrow Lake. Along with several other L2 controller optimizations, as well as an increase in L1 fill buffers to 24 and L2 miss queue entries (?) to 80, Lion Cove shows a significant improvement in its capacity to consume external bandwidth, and this, as you know, is key to running performant AI workloads.
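To see why the extra level lowers average latency, here's a generic average-memory-access-time style sketch. Only the 4-cycle L1 latency and the existence of the 192KB mid-level cache come from the talk; the hit rates and the other latencies below are placeholder assumptions of mine:

```python
def avg_latency(levels):
    """levels: (hit_rate, latency_cycles) pairs, last level with hit_rate 1.0.
    Returns expected cycles per load, assuming each load is serviced by
    the first level it hits."""
    expected, p_reach = 0.0, 1.0
    for hit_rate, latency in levels:
        expected += p_reach * hit_rate * latency
        p_reach *= (1.0 - hit_rate)
    return expected

# Two-level hierarchy (placeholder hit rates and L2 latency):
two_level = avg_latency([(0.90, 5), (1.00, 16)])

# Three-level: faster 4-cycle L1 plus a mid-level cache in between:
three_level = avg_latency([(0.90, 4), (0.60, 9), (1.00, 16)])

print(f"{two_level:.2f} vs {three_level:.2f} cycles per load")
```

With these made-up rates, the combination of the faster L1 and the intermediate level cuts the average load latency by over a cycle; the same structure also shows why a bigger, slower L2 becomes tolerable once the mid-level cache absorbs most L1 misses.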
Among other memory subsystem enhancements, the first-level DTLB (Data Translation Lookaside Buffer) was increased to support coverage for 128 pages vs 96 pages previously, and in order to improve load execution in the shadow of older stores, Lion Cove adds a 3rd store address generation unit. It employs a new fine-grain memory disambiguation algorithm to safely (?) conflicts (audio dropout) and enhances the store-to-load forwarding scheme to allow a young load to collect and stitch data from any number of older pending resolved stores as well as from the data cache.
Lion Cove drives a significant double-digit IPC improvement over a wide spectrum of workloads. Having optimized for lower TDPs (Thermal Design Power) on Lunar Lake, Lion Cove delivers more than an 18% improvement in performance at power (PnP) at that low TDP.