> That's an interesting point.

Weren't 32-bit ARM and 64-bit ARM more dissimilar than x86 is to x64?
ARM was able to reduce the decoder size by 4x in the Cortex-X2/X3 by ditching AArch32.
IIRC, out-of-order execution is why CPUs need a branch predictor?
Why can't AMD/Intel design a 10-wide decoder instead of these 2x4 and 3x3 ones?
> x86 instructions can be anywhere from one to 15 bytes long. The possible starting points for the nth instruction cover a huge space that grows massively as n grows. This is the one big advantage ARM has over x86; ARM instructions are fixed size.

How about they design a new layer that converts the variable instruction sizes to fixed sizes, and create a large cache to index those translations? It would cost more transistors, but this seems to be the only way x86 can get rid of its "variable size instruction" headache and baggage. Future compilations of applications could then just generate the fixed-length instructions and bypass the translation layer. In time, only unsupported legacy applications would need to depend on the translation layer.
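The serial nature of finding x86 instruction boundaries can be illustrated with a toy sketch. The length-prefixed encoding below is invented for illustration (real x86 length decoding has to examine prefixes, opcode, and ModRM/SIB bytes); the fixed-width case stands in for ARM's 4-byte instructions:

```python
# Toy sketch (not a real x86 decoder): why variable-length decode is hard.
# Hypothetical ISA where the first byte of each instruction encodes its length.

def nth_start_variable(code: bytes, n: int) -> int:
    """Byte offset of the n-th instruction (0-indexed) in a
    variable-length stream: inherently sequential to compute."""
    offset = 0
    for _ in range(n):
        length = code[offset]  # invented rule: first byte = total length
        offset += length
    return offset

def nth_start_fixed(n: int, width: int = 4) -> int:
    """Fixed-width (ARM-style): the n-th start is pure arithmetic,
    so many decode slots can attack the stream in parallel."""
    return n * width

# Stream of three fake instructions of lengths 2, 5, and 1.
stream = bytes([2, 0xAA, 5, 1, 2, 3, 4, 1])
print(nth_start_variable(stream, 2))  # 7: had to walk instructions 0 and 1
print(nth_start_fixed(2))             # 8: no memory access needed at all
```

This is why a wide x86 decoder needs speculative length-finding logic at many byte offsets, while a fixed-width decoder just fans out.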
> How about they design a new layer that converts the variable instruction sizes to fixed sizes and create a large cache to index those translations? […]

You basically described the uop cache 😀
> You basically described the uop cache 😀

But but but, the fixed-length ISA is what's missing from that cache's function.
> But but but, the fixed-length ISA is what's missing from that cache's function.

I think you're misunderstanding what a micro-op cache is. Whether an instruction is variable length or fixed length makes no difference once it has been translated to micro-ops. Every x86 CPU since the original 8086 has used micro-ops.
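The point being made about the uop cache can be sketched as a memo table keyed by fetch address. This is a toy model with an invented decode function and made-up micro-op strings, not how real hardware indexes its uop cache, but it shows why the original encoding's length stops mattering after the first decode:

```python
# Toy model of a uop cache: once an instruction has been decoded into
# micro-ops, later fetches hit by address and never touch the decoder.

class UopCache:
    def __init__(self, decode_fn):
        self.decode_fn = decode_fn   # the expensive variable-length decoder
        self.cache = {}              # fetch address -> list of micro-ops
        self.hits = 0
        self.misses = 0

    def fetch(self, address: int):
        if address in self.cache:
            self.hits += 1           # hot loop: the x86 decoder stays idle
        else:
            self.misses += 1
            self.cache[address] = self.decode_fn(address)
        return self.cache[address]

# Invented decoder: maps an address to made-up micro-ops.
uc = UopCache(lambda addr: [f"uop_{addr}_0", f"uop_{addr}_1"])
for _ in range(100):                  # a 3-instruction loop body, 100 iterations
    for addr in (0x10, 0x15, 0x1B):  # irregular addresses: variable length
        uc.fetch(addr)
print(uc.misses, uc.hits)            # 3 misses, 297 hits
```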
> Create a new fixed-length ISA for x86. Translate variable-length instructions to fixed-length instructions and cache those translations for future reference to reduce latency. If, however, software has been recompiled to use the fixed-length ISA, it bypasses the translation overhead. I understand that this is probably a gargantuan task to accomplish, but it's the only one I can think of. When x86's main competitors (ARM and RISC-V) are making so much progress mainly due to fixed-length instructions, x86 has no choice but to fall in line, with backward compatibility in place, otherwise it will be left hopelessly behind and never be able to catch up to those simpler designs in power efficiency.

If they're "translating" variable-length instructions to fixed length at a hardware level automatically, that's essentially the same as supporting two ISAs. It would increase decoding overhead, not decrease it. Not to mention the nightmare it would be to implement.
> Microsoft would have to be on board.

Or AMD could implement it at the chipset driver level, without Microsoft's blessing.
> Or AMD could implement it at the chipset driver level, without Microsoft's blessing.

You mean because AMD has so much experience doing that well and smoothly, like right now?
> Or AMD could implement it at the chipset driver level, without Microsoft's blessing.

How would the chipset driver know whether a specific program is the old x86-64 or the new fixed-length ISA? At some level, the OS would have to be involved. And while I don't know the terms of their cross-licensing agreement, I doubt it allows them to create x86 emulators.
> You mean because AMD has so much experience doing that well and smoothly, like right now?

Hey, if at first you don't succeed, and if you fail multiple times, might as well make a career out of it!
> How would the chipset driver know whether a specific program is the old x86-64 or the new fixed-length ISA?

Can't the driver read the first few instructions of the executable file to identify it? Assuming mixing of new and old instructions in the same EXE isn't allowed.
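For what it's worth, a loader would not need to read instructions at all: PE executables already carry a Machine field in their header for exactly this purpose. The offsets in this sketch follow the real PE/COFF layout, but the machine ID 0x1234 for a "new fixed-length ISA" is purely hypothetical, invented for this example:

```python
# Sketch: identifying a Windows binary's ISA from its PE header instead of
# its first instructions. Real offsets from the PE/COFF format; the 0x1234
# machine ID for a hypothetical fixed-length x86 successor is invented.
import struct

MACHINE_NAMES = {
    0x8664: "x86-64",
    0xAA64: "ARM64",
    0x1234: "hypothetical fixed-length ISA",  # made up for this sketch
}

def pe_machine(data: bytes) -> str:
    assert data[:2] == b"MZ", "not a DOS/PE executable"
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)       # PE header offset
    assert data[e_lfanew:e_lfanew + 4] == b"PE\0\0"
    (machine,) = struct.unpack_from("<H", data, e_lfanew + 4)  # COFF Machine
    return MACHINE_NAMES.get(machine, hex(machine))

# Build a minimal fake header just to exercise the parser.
hdr = bytearray(0x48)
hdr[:2] = b"MZ"
struct.pack_into("<I", hdr, 0x3C, 0x40)    # e_lfanew -> 0x40
hdr[0x40:0x44] = b"PE\0\0"
struct.pack_into("<H", hdr, 0x44, 0x8664)  # Machine = x86-64
print(pe_machine(bytes(hdr)))              # x86-64
```

The catch, as the next reply points out, is that chipset drivers don't sit in the executable-loading path; something like this would have to live in the OS loader.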
> You're probably right. I'm going from memory.

Pretty sure it was actually the Nx586.
> Can't the driver read the first few instructions of the executable file to identify it? Assuming mixing of new and old instructions in the same EXE isn't allowed.

Can they? I'm not super familiar with chipset drivers, but that sounds pretty far-fetched. I don't think chipset drivers "know" when a new program is run. Even if they do, that would add a delay each time a program is started. Not to mention the whole translation aspect.
> Even if they do, that would add a delay each time a program is started. Not to mention the whole translation aspect.

I think Rosetta was the same at first program execution. Subsequent executions were faster.
> I think Rosetta was the same at first program execution. Subsequent executions were faster.

The OS can run a translation layer, yes, but not the chipset driver.
About the Rosetta Translation Environment | Apple Developer Documentation (developer.apple.com): how Rosetta translates executables, and what Rosetta can't translate.
I don't see why x86 can't do the same while moving to a better ISA without x86 limitations.
> I think Rosetta was the same at first program execution. Subsequent executions were faster.

Yes, Rosetta 2 is a mix of static recompilation and dynamic translation. It's the FX!32 of the 21st century.
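The first-run-slow, later-runs-fast behaviour described here is just translation caching. A minimal sketch, assuming an in-memory cache keyed by a hash of the binary; Rosetta 2 actually persists its ahead-of-time translations on disk, and its translate step is vastly more involved than the stand-in below:

```python
# Sketch of Rosetta-style translation caching: pay the translation cost on
# the first run, reuse the result on every later run of the same binary.
import hashlib

translation_cache = {}   # binary hash -> "translated" native code (stand-in)

def run(binary: bytes) -> str:
    key = hashlib.sha256(binary).hexdigest()
    if key not in translation_cache:
        # First execution: the slow ahead-of-time translation happens here.
        translation_cache[key] = f"native-code-for-{key[:8]}"
    # Later executions: straight to the cached translation.
    return translation_cache[key]

app = b"\x55\x48\x89\xe5"          # a few x86-64 prologue bytes as a stand-in
first = run(app)                   # slow path: translation
second = run(app)                  # fast path: cache hit
print(first == second)             # True: one translation serves every run
```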
> I don't see why x86 can't do the same while moving to a better ISA without x86 limitations.

Rosetta 2 only takes care of user code, which is easier to run quickly than having to emulate system code (OS/kernel drivers).
> Rosetta 2 only takes care of user code, which is easier to run quickly than having to emulate system code (OS/kernel drivers).

Yeah, so translate only user code from variable-length instructions to fixed-length ones.
> The OS can run a translation layer, yes, but not the chipset driver.

Don't the drivers run in kernel mode? If so, then they can do what the OS can do, unless the OS is limiting the drivers, even ones from the CPU manufacturer that should be rock solid by design.
> Don't the drivers run in kernel mode? If so, then they can do what the OS can do […]

That's not how that works. Windows provides an API; drivers can only implement functionality that the API allows.
AMD will provide an update on their long-term open-source firmware strategy at the Open-Source Firmware Conference in September, focusing on their OpenSIL project, which is expected to eventually replace AGESA on future Ryzen and EPYC platforms. They aim for OpenSIL to be ready for production by 2026, spanning both client and server platforms.
I'll make a bet on Medusa Ridge performance: 60% faster than Zen 5 at the same clock on SPECint rate-1, with doubled L2 and L3 sizes.
Same dual 4-wide decode and the same 6-ALU/6-FPU layout, on TSMC 2nm with a TSMC 3nm IOD. The IPC gains would come from extracting more IPC out of the already-wide ALUs and better IPC at the FPU. I bet AMD widens less this time and digs for more IPC instead; the Zen 5 diagram already shows how wide it went and how huge the IPC uplift on Zen 5 must have been.
Launch in Q1 2026.