Hi all, I'm starting some new threads to post interesting research on µArchs; this one is specific to CPUs.
FIFOrder MicroArchitecture: Ready-Aware Instruction Scheduling for OoO Processors
> This sounds promising, claimed 50% energy savings over traditional OoO instruction queue designs with an 8% performance increase.

The baseline design is somehow spending a rather large 45% of its total dynamic energy on the issue queue in the first place, so one is left to infer how the claimed savings can exceed that figure; the authors do not provide any energy figures other than for the issue queue.
Biotite: A High-Performance Static Binary Translator using Source-Level Information
Transcending Hardware Limits with Software Out-of-Order Processing
STRAIGHT: Hazardless Processor Architecture Without Register Renaming
Abstract:
The single-thread performance of a processor improves the capability of the entire system by reducing the critical path latency of programs. Typically, conventional superscalar processors improve this performance by introducing out-of-order (OoO) execution with register renaming. However, it is also known to increase the complexity and affect the power efficiency. This paper realizes a novel computer architecture called "STRAIGHT" to resolve this dilemma. The key feature is a unique instruction format in which the source operand is given based on the distance from the producer instruction. By leveraging this format, register renaming is completely removed from the pipeline. This paper presents the practical Instruction Set Architecture (ISA) design, the novel efficient OoO microarchitecture, and the compilation algorithm for the STRAIGHT machine code. Because the ISA has sequential execution semantics, as in general CPUs, and is provided with a compiler, programming for the architecture is as easy as that of conventional CPUs. A compiler, an assembler, a linker, and a cycle-accurate simulator are developed to measure the performance. Moreover, an RTL description of STRAIGHT is developed to estimate the power reduction. The evaluation using standard benchmarks shows that the performance of STRAIGHT is 18.8% better than the conventional superscalar processor of the same issue-width and instruction window size. This improvement is achieved by STRAIGHT's rapid miss-recovery. Compilation technology for resolving the possible overhead of the ISA is also revealed. The RTL power analysis shows that the architecture reduces the power consumption by removing the power for renaming. The revealed performance and efficiencies support that STRAIGHT is a novel viable alternative for designing general purpose OoO processors.
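To make the distance-based operand format concrete, here is a minimal sketch of how a decoder could resolve such operands when every instruction's destination is allocated in program order from a ring-buffer register file. This is my own illustration of the idea, not the paper's RTL; the names and sizes are assumptions.

```c
/* Sketch of distance-based operand resolution in the spirit of STRAIGHT:
 * every instruction implicitly writes a fresh register allocated in program
 * order, and a source operand names its producer by how many instructions
 * back it was. All names and sizes are assumptions for illustration. */
#include <stdint.h>
#include <stdio.h>

#define NUM_PHYS_REGS 128u          /* assumed ring-buffer register file size */

typedef struct {
    uint32_t alloc_ptr;             /* physical register for the *next* result */
} decode_state_t;

/* Source operand "distance d": the value produced d instructions earlier
 * (assumes d < NUM_PHYS_REGS). No rename-map lookup; just a subtraction
 * modulo the ring size. */
static uint32_t resolve_source(const decode_state_t *s, uint32_t d)
{
    return (s->alloc_ptr + NUM_PHYS_REGS - d) % NUM_PHYS_REGS;
}

/* Destination: implicitly the next ring slot; advance the pointer. */
static uint32_t allocate_dest(decode_state_t *s)
{
    uint32_t dest = s->alloc_ptr;
    s->alloc_ptr = (s->alloc_ptr + 1) % NUM_PHYS_REGS;
    return dest;
}

int main(void)
{
    decode_state_t s = { .alloc_ptr = 0 };
    uint32_t r1 = allocate_dest(&s);         /* producer A -> preg 0 */
    uint32_t r2 = allocate_dest(&s);         /* producer B -> preg 1 */
    /* An "add" consuming the results of the two previous instructions: */
    uint32_t src_a = resolve_source(&s, 2);  /* 2 instructions back -> preg 0 */
    uint32_t src_b = resolve_source(&s, 1);  /* 1 instruction back  -> preg 1 */
    uint32_t dst   = allocate_dest(&s);      /* add result -> preg 2 */
    (void)r1; (void)r2;
    printf("add p%u <- p%u, p%u\n", dst, src_a, src_b);
    return 0;
}
```

In a scheme like this, recovering from a branch misprediction largely reduces to restoring the allocation pointer, which would fit the rapid miss-recovery the abstract mentions.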
A Sound and Complete Algorithm for Code Generation in Distance-Based ISA
Clockhands: Rename-free Instruction Set Architecture for Out-of-order Processors
Abstract:
Out-of-order superscalar processors are currently the only architecture that speeds up irregular programs, but they suffer from poor power efficiency. To tackle this issue, we focused on how to specify register operands. Specifying operands by register names, as conventional RISC does, requires register renaming, resulting in poor power efficiency and preventing an increase in the front-end width. In contrast, a recently proposed architecture called STRAIGHT specifies operands by inter-instruction distance, thereby eliminating register renaming. However, STRAIGHT has strong constraints on instruction placement, which generally results in a large increase in the number of instructions.
We propose Clockhands, a novel instruction set architecture that has multiple register groups and specifies a value as “the value written in this register group k times before.” Clockhands does not require register renaming as in STRAIGHT. In contrast, Clockhands has much looser constraints on instruction placement than STRAIGHT, allowing programs to be written with almost the same number of instructions as Conventional RISC. We implemented a cycle-accurate simulator, FPGA implementation, and first-step compiler for Clockhands and evaluated benchmarks including SPEC CPU. On a machine with an eight-fetch width, the evaluation results showed that Clockhands consumes 7.4% less energy than RISC while having performance comparable to RISC. This energy reduction increases significantly to 24.4% when simulating a futuristic up-scaled processor with a 16-fetch width, which shows that Clockhands enables a wider front-end.
> Another register rename free ISA for OoO execution

Can you give us a quick summary of their central idea? How are they bypassing register renaming?
> Can you give us a quick summary of their central idea? How are they bypassing register renaming?

I'm not remotely qualified to answer that kind of question sadly 😅
> Can you give us a quick summary of their central idea? How are they bypassing register renaming?

The 2023 paper (Clockhands) is an iterative improvement on the 2018 paper (STRAIGHT). In both cases, the ISA replaces architectural register identifiers with distance-based identifiers, so a source operand directly names its producer instruction rather than a register. The goal of both architectures is to eliminate the hardware cost of register renaming itself while leaving the rest of the out-of-order machinery untouched. Register renaming is hard to scale to arbitrarily large widths: 4 to 8 renames per cycle is not so hard, but 16+ per cycle would be very challenging at high frequency. The comparison looks roughly like this:
| | OoO register renamer | Clockhands register renamer |
| --- | --- | --- |
| Source operands | Lookup in the rename map table (~16 or ~32 entries), with override paths for intra-group dependencies; the override paths are difficult to scale wider because of their priority-selection structure. | Lookup in a register pointer (4 entries) plus a small subtraction, in roughly constant time. |
| Destination operands | Lookup in the free list, in roughly constant time. | Lookup in a register pointer (4 entries) plus a small addition whose cost grows slowly (logarithmically) with rename group size. This scales more cheaply than the source-operand lookup in a normal OoO design, because addition is commutative (no priority selection). |
| Physical register ID | No particular restrictions on which physical registers the free list may hand out. | Each of the 4 groups owns a static partition of 1/4 of the physical registers, so it is more likely to stall for an equal number of physical registers. |
| Recovery checkpoint cost | Higher | Lower |
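Putting the table's arithmetic into toy code: a minimal sketch, under assumed sizes and naming, of how a Clockhands-style front end might resolve "the value written in group g, k writes before" using only small per-group write pointers. This is my own illustration of the comparison above, not the authors' implementation.

```c
/* Toy model of the table above: 4 register groups, each statically owning a
 * quarter of the physical register file and tracked by one small write
 * pointer. All sizes and names are assumptions for illustration. */
#include <stdint.h>
#include <stdio.h>

#define NUM_GROUPS       4u
#define PREGS_PER_GROUP 32u         /* assumed static partition per group */

typedef struct {
    uint32_t write_ptr[NUM_GROUPS]; /* slot of the most recent write per group */
} clockhands_state_t;

/* Source operand "group g, k writes before" (k = 0 meaning the most recent
 * write, a convention chosen for this sketch): a 4-entry pointer lookup plus
 * a small subtraction, instead of a rename-map-table lookup. */
static uint32_t resolve_source(const clockhands_state_t *s, uint32_t g, uint32_t k)
{
    uint32_t slot = (s->write_ptr[g] + PREGS_PER_GROUP - k) % PREGS_PER_GROUP;
    return g * PREGS_PER_GROUP + slot;   /* flat physical register id */
}

/* Destination "write to group g": advance that group's pointer (a small
 * addition); no shared free list, but a group can run out of registers
 * sooner than a unified pool would, as noted in the table. */
static uint32_t allocate_dest(clockhands_state_t *s, uint32_t g)
{
    s->write_ptr[g] = (s->write_ptr[g] + 1) % PREGS_PER_GROUP;
    return g * PREGS_PER_GROUP + s->write_ptr[g];
}

int main(void)
{
    clockhands_state_t s = { .write_ptr = {0, 0, 0, 0} };
    uint32_t a = allocate_dest(&s, 0);   /* write to group 0 */
    uint32_t b = allocate_dest(&s, 1);   /* write to group 1 */
    uint32_t c = allocate_dest(&s, 0);   /* another write to group 0 */
    (void)a; (void)b;
    /* consume "group 0, 1 write before" and "group 1, 0 writes before" */
    printf("sources: p%u p%u, latest group-0 write: p%u\n",
           resolve_source(&s, 0, 1), resolve_source(&s, 1, 0), c);
    return 0;
}
```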
> I agree value pred is a very interesting thing to explore at this point.
> I also agree that accuracy is paramount, even more than for data prefetchers. The cost of being wrong would result in lots of replays or checkpoint restores, which would be terrible for performance and efficiency.
> I'm far from knowing a lot about VP, but Arthur Perais has been doing some work on that subject for the last 10 years.

Look what I found from that author....
Toward Practical 128-Bit General Purpose Microarchitectures
Abstract:
Intel introduced 5-level paging mode to support 57-bit virtual address space in 2017. This, coupled to paradigms where backup storage can be accessed through load and store instructions (e.g., non volatile memories), lets us envision a future in which a 64-bit address space has become insufficient. In that event, the straightforward solution would be to adopt a flat 128-bit address space. In this early stage letter, we conduct high-level experiments that lead us to suggest a possible general-purpose processor micro-architecture providing 128-bit support with limited hardware cost.
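For context on the 57-bit figure: with 4 KiB pages, each x86 paging level translates 9 address bits, so 4-level paging covers 48-bit and 5-level paging covers 57-bit virtual addresses. A quick back-of-the-envelope check (plain C, nothing specific to the paper):

```c
/* Back-of-the-envelope check of x86 virtual-address widths:
 * 12 bits of page offset (4 KiB pages) + 9 bits per paging level. */
#include <stdio.h>

int main(void)
{
    const int page_offset_bits = 12;   /* 4 KiB pages */
    const int bits_per_level   = 9;    /* 512-entry page tables */

    for (int levels = 4; levels <= 5; ++levels) {
        int va_bits = page_offset_bits + levels * bits_per_level;
        printf("%d-level paging -> %d-bit virtual addresses\n", levels, va_bits);
    }
    return 0;   /* prints 48 and 57, far short of a flat 128-bit space */
}
```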
128-bit addresses for the masses (of memory and devices).
The ever growing storage and memory needs in computer infrastructures makes 128-bit addresses a possible long-term solution to access vast swaths of data uniformly. In this abstract, we give our thoughts regarding what this would entail from a hardware/software perspective.
> 128-bit??? We don't have exabytes of RAM to worry about currently, do we? Are they talking about 128-bit CPUs for supercomputers?

This is not about RAM size, but storage size. Also be careful not to confuse virtual address space with physical address space.
> This is not about RAM size, but storage size.

So we can have base registers 128 bits in size? I assume it will increase the number-crunching prowess of CPUs, but the transistor cost will be pretty high. When do you see this happening? Next 10 years in consumer CPUs?
> So we can have base registers 128 bits in size? I assume it will increase the number-crunching prowess of CPUs, but the transistor cost will be pretty high. When do you see this happening? Next 10 years in consumer CPUs?

The article above proposes an approach to limit the cost (warning: I did not read most of it).
> So we can have base registers 128 bits in size? I assume it will increase the number-crunching prowess of CPUs, but the transistor cost will be pretty high. When do you see this happening? Next 10 years in consumer CPUs?

More like 10-20 years, going by the abstract of one paper.
> No-one has figured out how to build multiple layers of DRAM without requiring exposures per layer; without that, it's not worth the cost.

No one is forcing them to keep them as sticks. They could theoretically make a cube of RAM out of multiple layers of PCB and DRAM chips, put small spaces in between, and use a fan to keep everything running cool. The larger the RAM size, the larger the cube. What would be the drawbacks to that approach?
> Why would you want to do this?

To me, the current RAM capacity limits seem to be due to the form factors and density of the chips.