Discussion General CPU µArch Research Thread

soresu · Feb 5, 2025

Hi all, I'm starting some new threads to post interesting research on µArchs, this one is specific to CPUs.

soresu · Feb 5, 2025

This guys entire µArch publication resume is pretty interesting all things considered:

Microarchitecture | Ipoom Jeong

A highly-customizable Hugo academic resume theme powered by Wowchemy website builder.

ipoom-jeong.com

soresu · Feb 5, 2025

This sounds promising, claimed 50% energy savings over traditional OoO instruction queue designs with an 8% performance increase:

FIFOrder MicroArchitecture: Ready-Aware Instruction Scheduling for OoO Processors

https://ieeexplore.ieee.org/document/8715034

gai · Feb 9, 2025

soresu said:
This sounds promising, claimed 50% energy savings over traditional OoO instruction queue designs with an 8% performance increase:

https://ieeexplore.ieee.org/document/8715034

The baseline design is somehow spending a rather large 45% of its total dynamic energy on the issue queue in the first place. One is required to infer why claimed energy savings are even more than this, because the authors do not provide any energy figures other than for the issue queue.

You don't gain performance by making your scheduler less intelligent. The best you can hope for is to stay within some noise margin of the same performance level. The 8% performance increase is evidently because the scheduler size has increased from one 56-entry, 4-wide issue queue to the combination of one 56-entry, 1-wide issue queue and three 32-entry, 3-wide FIFO queues. That's a total of 152 entries, or about 2.7x as large of a scheduling window as the baseline design. The authors do not call any attention to this discrepancy, because it would completely invalidate their conclusions, but you can see it in Table 1.

The basic concept is perfectly reasonable on its face. The CAMs are the most energy-intensive part of an instruction scheduler. So, start with a distributed issue queue design, then find a way to reduce the number of CAMs without losing too much performance. As is often the case with academic work, the results in the paper provide no guidance about how the chosen categories--or the concept generally--behave in a practical design.

soresu · Mar 4, 2025

This one is not strictly about CPU µArchs at the hw level, but offers interesting promise for kickstarting WoA:

Biotite: A High-Performance Static Binary Translator using Source-Level Information

https://dl.acm.org/doi/10.1145/3708493.3712693

soresu · Mar 5, 2025

Another software/hardware crossover paper:

Transcending Hardware Limits with Software Out-of-Order Processing

https://ieeexplore.ieee.org/document/7862125

soresu · Mar 5, 2025

ISA optimised for OoO perf without the associated area problem?

STRAIGHT: Hazardless Processor Architecture Without Register Renaming

https://ieeexplore.ieee.org/document/8574536

Abstract:
The single-thread performance of a processor improves the capability of the entire system by reducing the critical path latency of programs. Typically, conventional superscalar processors improve this performance by introducing out-of-order (OoO) execution with register renaming. However, it is also known to increase the complexity and affect the power efficiency. This paper realizes a novel computer architecture called "STRAIGHT" to resolve this dilemma. The key feature is a unique instruction format in which the source operand is given based on the distance from the producer instruction. By leveraging this format, register renaming is completely removed from the pipeline. This paper presents the practical Instruction Set Architecture (ISA) design, the novel efficient OoO microarchitecture, and the compilation algorithm for the STRAIGHT machine code. Because the ISA has sequential execution semantics, as in general CPUs, and is provided with a compiler, programming for the architecture is as easy as that of conventional CPUs. A compiler, an assembler, a linker, and a cycle-accurate simulator are developed to measure the performance. Moreover, an RTL description of STRAIGHT is developed to estimate the power reduction. The evaluation using standard benchmarks shows that the performance of STRAIGHT is 18.8% better than the conventional superscalar processor of the same issue-width and instruction window size. This improvement is achieved by STRAIGHT's rapid miss-recovery. Compilation technology for resolving the possible overhead of the ISA is also revealed. The RTL power analysis shows that the architecture reduces the power consumption by removing the power for renaming. The revealed performance and efficiencies support that STRAIGHT is a novel viable alternative for designing general purpose OoO processors.

Plus a related paper covering general purpose code generation for the above ISA:

A Sound and Complete Algorithm for Code Generation in Distance-Based ISA

A Sound and Complete Algorithm for Code Generation in Distance-Based ISA | Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction

dl.acm.org

soresu · Mar 5, 2025

Another register rename free ISA for OoO execution:

Clockhands: Rename-free Instruction Set Architecture for Out-of-order Processors

https://dl.acm.org/doi/10.1145/3613424.3614272

Abstract
Out-of-order superscalar processors are currently the only architecture that speeds up irregular programs, but they suffer from poor power efficiency. To tackle this issue, we focused on how to specify register operands. Specifying operands by register names, as conventional RISC does, requires register renaming, resulting in poor power efficiency and preventing an increase in the front-end width. In contrast, a recently proposed architecture called STRAIGHT specifies operands by inter-instruction distance, thereby eliminating register renaming. However, STRAIGHT has strong constraints on instruction placement, which generally results in a large increase in the number of instructions.
We propose Clockhands, a novel instruction set architecture that has multiple register groups and specifies a value as “the value written in this register group k times before.” Clockhands does not require register renaming as in STRAIGHT. In contrast, Clockhands has much looser constraints on instruction placement than STRAIGHT, allowing programs to be written with almost the same number of instructions as Conventional RISC. We implemented a cycle-accurate simulator, FPGA implementation, and first-step compiler for Clockhands and evaluated benchmarks including SPEC CPU. On a machine with an eight-fetch width, the evaluation results showed that Clockhands consumes 7.4% less energy than RISC while having performance comparable to RISC. This energy reduction increases significantly to 24.4% when simulating a futuristic up-scaled processor with a 16-fetch width, which shows that Clockhands enables a wider front-end.

igor_kavinski · Mar 5, 2025

soresu said:
Another register rename free ISA for OoO execution

Can you give us a quick summary of their central idea? How are they bypassing register renaming?

soresu · Mar 5, 2025

igor_kavinski said:
Can you give us a quick summary of their central idea? How are they bypassing register renaming?

I'm not remotely qualified to answer that kind of question sadly 😅

I can edit the post to add the abstract (summary) of the paper if you like.

DavidC1 · Mar 5, 2025

Novel approaches typically come with the side effect of more largely varying performance across applications over established ones.

Like the STRAIGHT concept that wants to replace OoOE. Likely the final result is going to be similar to Sandy Bridge's implementation of the uop cache, where it's a complement to the decoder over replacing it entirely as did with the Trace Cache. So both systems will have to exist in some form.

gai · Mar 6, 2025

igor_kavinski said:
Can you give us a quick summary of their central idea? How are they bypassing register renaming?

The 2023 paper (Clockhands) is an iterative improvement on the 2018 paper (STRAIGHT). In both cases, the ISA is modified to replace architectural register identifiers with distance-based identifiers. This means that the source operand directly refers to a source instruction, rather than relying on a register name. The goal of both architectures is to simplify the hardware cost of register renaming itself, while leaving the remainder of the out-of-order machinery untouched. Register renaming is hard to scale up to arbitrarily large widths: 4 to 8 is not so hard, but 16+ renames per cycle would be very challenging at high frequency.

More details follow.

In the 2018 paper, source operands are encoded with 10-bit distance values, and implementations may choose a smaller maximum window. Logically, each newest instruction can look backwards at an up to 1024-entry circular FIFO of producer ops. Program code would have to be compiled for a hypothetical target processor. Their modified ABI basically requires producer instructions for function arguments to directly precede the call instruction in argument order and for return values from a function to also directly precede the return instruction.

The compiler inserts some additional instructions (moves or NOPs) for inconvenient cases, like: (1) constants reused in loops, (2) variables that survive for a long time without stack spill and fill, and (3) different numbers of instructions on convergence paths following branches. The amount of instruction overhead depends on the quality of the compiler, but seems to be in the >30% range even with a good compiler. Not great.

In the 2023 paper, distances between producer and consumer are broken up into 4 groups. These groups are statically allocated by the compiler for: (1) the stack pointer, function arguments, and a zero register; (2) local short-lived variables; (3) local long-lived variables; and (4) local constants. Instructions use 2 bits to decide which group's FIFO to write, 2 bits per source operand to determine which FIFO to read, and 4 bits per source operand for distance. So, compared to STRAIGHT, we've dropped the maximum addressable space from 1024 architectural registers to 4x16=64 architectural registers and the number of ISA encoding bits for 2 source / 1 dest drops from 20 to 14. The number of overhead instructions for moves is reduced to less than 10% of the 2018 paper, so we should feel comfortable saying that the instruction count overhead decreases to around 5% or less, though it's still not free.

So what do we get for the trouble? Well, we get a circuit that looks similar to a traditional register renamer, since it performs a very similar function. The asymptotic scaling is better in terms of rename width, but it's likely that many more physical registers will be wasted, so the problem is really getting pushed to another part of the processor. And we still have to pay those extra few percent of move instructions. The renamer is probably under 10% of the power in an OoO core, so the best-case benefit is rather small.

	OoO register renamer	Clockhands register renamer
Source operands	Lookup to rename map table (~16 or ~32 entries) with override paths for intra-group dependencies. Override paths are difficult to scale wider due to priority selection structure.	Lookup to register pointer (4 entries) and perform a small subtraction in roughly constant time.
Destination operands	Lookup to free list in roughly constant time.	Lookup to register pointer (4 entries) and perform a small addition that slowly (logarithmically) grows with rename group size. This scaling is cheaper than the source operand lookup scaling in normal OoO, because addition is commutative (no priority).
Physical register ID	No particular efficiency restrictions on which physical registers may be provided by the free list.	Each of the 4 groups has a static partition of 1/4th of the total physical register count. More likely to stall at equal number of physical registers.
Recovery checkpoint cost	Higher	Lower

Cardyak · Mar 6, 2025

Nice thread.

For a long time I've firmly believed the most promising improvement that can be made to CPU performance is Value Prediction - Here's a paper from 1996, it's clearly old by this point but it still contains some very interesting studies: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=b7c9efd81382765b3cc82ec756e86ccbb5f39d1e

The paper shows accuracy across most data instructions of around 60%, the most interesting idea is that you can potentially constrain the Value Predictor and only predict a smaller subset of instructions that are more consistent. As opposed to being aggressive and trying to predict every data instruction and experiencing a lower accuracy with a smaller speed improvement. (Essentially, you'll probably get a bigger IPC increase predicting the easiest most consistent instructions, let's say 25% of the total instruction steam with 99% accuracy, as opposed to being aggressive and predicting 100% of the instruction stream, but only reaching accuracies of ~60%). You're better off having smaller coverage but very high accuracy, as opposed to having large coverage but sacrificing accuracy to get there. The problem with this is you need to have a process to detect these "easy to predict" instructions in the first place. Other papers point to ideas such as using a "Confidence Predictor" to track the most consistent instructions, and cache them as future Value Prediction candidates.

There's rumours that Apple has already started to dabble in this in a very minimal fashion (There are theories of their later cores featuring a load value predictor that is utilized sparingly) But unfortunately there's no concrete proof of this.

Other than Value Prediction the other big IPC gain left on the table is probably Out of Order Retirement. I'll see if I can do some digging on this and unearth some papers on the subject. But as you can imagine it requires an extensive validation process to ensure program integrity is kept intact.

Nothingness · Mar 6, 2025

I agree value pred is a very interesting thing to explore at this point.

I also agree that accuracy is paramount, even more than for data prefetchers. The cost of being wrong would result in a lots of replays or checkpoint restores, which would be terrible for performance and efficiency.

I'm far from knowing a lot about VP but Arthur Perais has been doing some work on that subject for the last 10 years.

soresu · Mar 6, 2025

Nothingness said:
I agree value pred is a very interesting thing to explore at this point.

I also agree that accuracy is paramount, even more than for data prefetchers. The cost of being wrong would result in a lots of replays or checkpoint restores, which would be terrible for performance and efficiency.

I'm far from knowing a lot about VP but Arthur Perais has been doing some work on that subject for the last 10 years.

Look what I found from that author....

Toward Practical 128-Bit General Purpose Microarchitectures

https://ieeexplore.ieee.org/abstract/document/10158015

Abstract:
Intel introduced 5-level paging mode to support 57-bit virtual address space in 2017. This, coupled to paradigms where backup storage can be accessed through load and store instructions (e.g., non volatile memories), lets us envision a future in which a 64-bit address space has become insufficient. In that event, the straightforward solution would be to adopt a flat 128-bit address space. In this early stage letter, we conduct high-level experiments that lead us to suggest a possible general-purpose processor micro-architecture providing 128-bit support with limited hardware cost.

PDF paper link.

soresu · Mar 6, 2025

Related to the subject above:

128-bit addresses for the masses (of memory and devices).

128-bit addresses for the masses (of memory and devices).

The ever growing storage and memory needs in computer infrastructures makes 128-bit addresses a possible long-term solution to access vast swaths of data uniformly. In this abstract, we give our thoughts regarding what this would entail from a hardware/software perspective.

cea.hal.science

The ever growing storage and memory needs in computer infrastructures makes 128-bit addresses a possible long-term solution to access vast swaths of data uniformly. In this abstract, we give our thoughts regarding what this would entail from a hardware/software perspective.

PDF paper link.

igor_kavinski · Mar 6, 2025

128-bit??? We don't have exabytes of RAM to worry about currently, do we? Are they talking about 128-bit CPUs for supercomputers?

Nothingness · Mar 6, 2025

igor_kavinski said:
128-bit??? We don't have exabytes of RAM to worry about currently, do we? Are they talking about 128-bit CPUs for supercomputers?

This is not about RAM size, but storage size. Also be careful not to confuse virtual space with physical space.

igor_kavinski · Mar 6, 2025

Nothingness said:
This is not about RAM size, but storage size.

So we can have base registers 128-bits in size? I assume it will increase the number crunching prowess of CPUs but transistor cost will be pretty high. When do you see this happening? Next 10 years in consumer CPUs?

Nothingness · Mar 6, 2025

igor_kavinski said:
So we can have base registers 128-bits in size? I assume it will increase the number crunching prowess of CPUs but transistor cost will be pretty high. When do you see this happening? Next 10 years in consumer CPUs?

The article above proposes an approach to limit the cost (warning: I did not read most of it).

soresu · Mar 6, 2025

igor_kavinski said:
So we can have base registers 128-bits in size? I assume it will increase the number crunching prowess of CPUs but transistor cost will be pretty high. When do you see this happening? Next 10 years in consumer CPUs?

More like 10-20 years going by the abstract on one paper.

Depends if the RAM industry get their act together and start scaling again.

The transition to 3D multi layer RAM is long overdue.

Tuna-Fish · Mar 6, 2025

No-one has figured out how to build multiple layers of DRAM without requiring exposures per layer, without that it's not worth the cost.

igor_kavinski · Mar 6, 2025

Tuna-Fish said:
No-one has figured out how to build multiple layers of DRAM without requiring exposures per layer, without that it's not worth the cost.

No one is forcing them to keep them as sticks. They could theoretically make it a cube of RAM made up of multiple layers of PCB and DRAM chips. Put small spaces in between and use a fan to keep everything running cool. The larger the RAM size, the larger the cube. What would be the drawbacks to that approach?

Tuna-Fish · Mar 6, 2025

Why would you want to do this?

The reason why 3d NAND is such a good idea is cost. If it cost 2x to have two layers of flash, no-one would bother. But the way it's built, hundreds of layers of alternating materials are deposited (very cheaply) on top of each other, and then in the end a single step of litho is used to form the mask that is then used to etch through the entire stack in one step. This way, they can make hundreds (probably soon thousands) of layers of flash on a single chip, while minimizing the use of the expensive machines.

You can always stack DRAM vertically by just making individual DRAM dies, stacking them on top of each other and connecting them with TSVs. That's how HBM works. But, this buys you nothing in terms of cost, because you are just doing all the same steps repeatedly, and therefore does not result in higher typical memory capacities.

igor_kavinski · Mar 6, 2025

Tuna-Fish said:
Why would you want to do this?

To me, the current RAM capacity limits seem to be due to the form factors and density of the chips.

Current DIMM/SODIMM/CAMM etc. form factors can accommodate only a limited number of chips so the only solution is to increase chip density which is also growing at a snail's pace. Throw these form factors out the window and just start stacking DRAM chips in a sort of cubic fashion to expand capacity. It will resemble a cube physically so it could be called CubeRAM. As for the interconnect between the chips, I'm sure they could figure something out that is cheap and less costly than HBM-style TSVs.

Discussion General CPU µArch Research Thread

Diamond Member

Diamond Member

Diamond Member

Junior Member

Diamond Member

Biotite: A High-Performance Static Binary Translator using Source-Level Information​

Diamond Member

Transcending Hardware Limits with Software Out-of-Order Processing​

Diamond Member

Abstract:​

A Sound and Complete Algorithm for Code Generation in Distance-Based ISA​

Diamond Member

Clockhands: Rename-free Instruction Set Architecture for Out-of-order Processors​

Abstract​

Lifer

Diamond Member

Golden Member

Junior Member

Member

Diamond Member

Diamond Member

Toward Practical 128-Bit General Purpose Microarchitectures​

Abstract:​

Diamond Member

128-bit addresses for the masses (of memory and devices).​

Lifer

Diamond Member

Lifer

Diamond Member

Diamond Member

Golden Member

Lifer

Golden Member

Lifer

Biotite: A High-Performance Static Binary Translator using Source-Level Information

Transcending Hardware Limits with Software Out-of-Order Processing

Abstract:

A Sound and Complete Algorithm for Code Generation in Distance-Based ISA

Clockhands: Rename-free Instruction Set Architecture for Out-of-order Processors

Abstract

Toward Practical 128-Bit General Purpose Microarchitectures

Abstract:

128-bit addresses for the masses (of memory and devices).