Question Zen 6 Speculation Thread

FlameTail · Aug 14, 2024

That's an interesting point.

ARM was able to reduce the decoder size by 4x in the Cortex X2/X3, by ditching AArch32.

gdansk · Aug 14, 2024

FlameTail said:
That's an interesting point.

ARM was able to reduce the decoder size by 4x in the Cortex X2/X3, by ditching AArch32.

Weren't 32 bit ARM and 64 bit ARM more dissimilar than x86 is to x64?

And because Windows users have a bunch of 32-bit binaries with no source, neither AMD nor Intel feels free to remove 32-bit mode.
Unless Microsoft writes another WoW64 JIT converting x86 to x64? But that seems unlikely.

Tuna-Fish · Aug 15, 2024

soresu said:
IIRC out of order is why CPUs need a branch predictor?

Nope, there are plenty of in-order CPUs that have branch predictors.

The reason you need a predictor is that modern CPUs are deeply pipelined. That is, it takes something like 10-20 cycles to completely execute an instruction. Because an instruction can still depend on a value created by the previous instruction and execute on the very next cycle, most arithmetic and similar work never sees this latency, and it looks as if instructions take one cycle to complete. But this doesn't work for branches; for them you hit the full latency. So either you twiddle your thumbs for 10-20 clocks every time there is a branch, or you guess. Early predictors were really dumb (you can get a net win with something really stupid, like the classic "all backwards branches taken, all forwards branches not taken"), but since improving prediction quality is both a performance and a power optimization, we have gotten really far from that.
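To put rough numbers on the "guess or wait" trade-off, here is a minimal sketch of that classic static rule. The pipeline depth, addresses, and branch trace are made-up illustrative values, not measurements from any real CPU.

```python
# Toy model: branch cost in a deep pipeline, with and without a static predictor.
# PIPELINE_DEPTH and the trace below are assumptions chosen for illustration.

PIPELINE_DEPTH = 15  # assumed cycles to resolve a branch (somewhere in the 10-20 range)


def predict_btfn(branch_pc: int, target_pc: int) -> bool:
    """Static rule: backward branches predicted taken, forward branches not taken."""
    return target_pc < branch_pc


def stall_cycles_no_prediction(trace) -> int:
    """Without guessing, every branch waits for full resolution."""
    return len(trace) * PIPELINE_DEPTH


def stall_cycles_btfn(trace) -> int:
    """With the static rule, only mispredicted branches pay the full penalty."""
    mispredicts = sum(
        1 for pc, target, taken in trace if predict_btfn(pc, target) != taken
    )
    return mispredicts * PIPELINE_DEPTH


def make_loop_trace(iterations: int):
    """Each iteration has a forward 'if' branch that is never taken and a
    backward loop branch that is taken until the final iteration."""
    trace = []
    for i in range(iterations):
        trace.append((0x110, 0x118, False))                # forward, not taken
        trace.append((0x120, 0x100, i != iterations - 1))  # backward, taken until exit
    return trace


trace = make_loop_trace(100)
print(stall_cycles_no_prediction(trace))  # 3000: 200 branches, 15 cycles each
print(stall_cycles_btfn(trace))           # 15: only the loop exit is mispredicted
```

Real code is much less regular than this loop, which is why predictors moved on from fixed rules to ones that learn from branch history.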

poke01 said:
why can’t AMD/Intel design a 10-wide decoder instead of these 2x4 and 3x3 ones?

x86 instructions can be anywhere from one to 15 bytes long. The possible starting point for the nth instruction covers a huge space that grows massively as n grows. This is the one big advantage ARM has over x86; ARM instructions are fixed size.
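A small sketch of why this hurts wide decode. The 1-to-15-byte bounds come from the text above; `toy_length` is a hypothetical encoding invented for the example (the low nibble of the first byte gives the length) and has nothing to do with real x86 encoding.

```python
MIN_LEN, MAX_LEN = 1, 15  # x86 instruction length bounds in bytes


def possible_starts_variable(n: int) -> range:
    """Byte offsets where the n-th instruction (0-based) could begin when every
    earlier instruction may be anywhere from MIN_LEN to MAX_LEN bytes long."""
    return range(n * MIN_LEN, n * MAX_LEN + 1)


def start_fixed(n: int, width: int = 4) -> int:
    """With fixed-size instructions the n-th start offset is known up front."""
    return n * width


def toy_length(code: bytes, offset: int) -> int:
    """Hypothetical encoding: low nibble of the first byte is the length (min 1)."""
    return max(1, code[offset] & 0x0F)


def find_boundaries(code: bytes) -> list:
    """Serial scan: each start depends on the length of the instruction before it."""
    starts, offset = [], 0
    while offset < len(code):
        starts.append(offset)
        offset += toy_length(code, offset)
    return starts


for n in (1, 3, 9):
    # candidate start offsets for variable length vs. the single fixed-width offset
    print(n, len(possible_starts_variable(n)), start_fixed(n))
# 1 15 4
# 3 43 12
# 9 127 36

print(find_boundaries(bytes([0x03, 0x00, 0x00, 0x01, 0x02, 0x00])))  # [0, 3, 4]
```

The loop in `find_boundaries` is the part that resists parallelism: the start of instruction n is unknown until instruction n-1 has been sized, whereas a fixed-width decoder can hand slot n to its own decoder immediately.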

igor_kavinski · Aug 15, 2024

Tuna-Fish said:
x86 instructions can be anywhere from one to 15 bytes long. The possible starting point for the nth instruction covers a huge space that grows massively as n grows. This is the one big advantage ARM has over x86; ARM instructions are fixed size.

How about they design a new layer that converts the variable instruction sizes to fixed sizes and create a large cache to index those translations? It will cost more transistors but this seems to be the only way x86 can get rid of its "variable size instruction" headache and baggage. Future compilations of applications can then just generate the fixed length instructions to bypass the translation layer. In time, only unsupported legacy applications will need to depend on the translation layer.

Nothingness · Aug 15, 2024

igor_kavinski said:
How about they design a new layer that converts the variable instruction sizes to fixed sizes and create a large cache to index those translations? It will cost more transistors but this seems to be the only way x86 can get rid of its "variable size instruction" headache and baggage. Future compilations of applications can then just generate the fixed length instructions to bypass the translation layer. In time, only unsupported legacy applications will need to depend on the translation layer.

You basically described the uop cache 😀

Another possible trick is to store instruction boundary markers in the icache (and possibly in the L2 cache). The problem is that you can't know whether an icache line has data in the middle, which would make the stored boundaries wrong. This can get very messy.

But neither of these features removes the need to find the instruction boundaries in the first place, which is quite expensive given how irregular x86 encoding is.
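To illustrate the "data in the middle" problem with boundary markers, here is a toy sketch. The length encoding (low nibble of the first byte) and the cache line contents are invented for the example; real x86 length decoding is far more involved.

```python
def toy_length(code: bytes, offset: int) -> int:
    """Hypothetical encoding: low nibble of the first byte is the length (min 1)."""
    return max(1, code[offset] & 0x0F)


def mark_boundaries(line: bytes, entry: int = 0) -> list:
    """Offsets that a linear sweep starting at `entry` marks as instruction starts."""
    starts, offset = [], entry
    while offset < len(line):
        starts.append(offset)
        offset += toy_length(line, offset)
    return starts


# One made-up cache line: code, then two bytes of embedded data that the program
# jumps over at run time, then code again starting at offset 4.
line = bytes([
    0x02, 0x00,  # instruction, length 2
    0x03, 0xFF,  # embedded data; 0x03 happens to look like a 3-byte instruction
    0x02, 0x00,  # real instruction at offset 4, length 2
    0x01, 0x01,  # two 1-byte instructions
])

print(mark_boundaries(line))           # [0, 2, 5, 6, 7]: marks inside the data, misses offset 4
print(mark_boundaries(line, entry=4))  # [4, 6, 7]: what execution actually sees
```

The sweep from the start of the line marks offsets 2 and 5 and misses the real instruction at offset 4, so markers stored alongside the line would be wrong for the path the program actually takes.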

igor_kavinski · Aug 15, 2024

Nothingness said:
You basically described the uop cache 😀

But but but, the fixed length ISA is what's missing from that cache's function.

Create a new fixed-length ISA for x86. Translate variable-length instructions to fixed-length instructions and cache those translations for future reference to reduce latency. If, however, software has been recompiled to use the fixed-length ISA, it bypasses the translation overhead. I understand that this is probably a gargantuan task to accomplish, but it's the only one I can think of. When x86's main competitors (ARM and RISC-V) are making so much progress mainly due to fixed-length instructions, x86 has no choice but to fall in line while keeping backward compatibility in place; otherwise it will be left hopelessly behind and never be able to catch up to those simpler designs in power efficiency.

GTracing · Aug 15, 2024

igor_kavinski said:
But but but, the fixed length ISA is what's missing from that cache's function.

I think you're misunderstanding what a micro op cache is. Whether an instruction is variable length or fixed length makes no difference once it's translated to micro ops. Every x86 CPU since the original 8086 has used micro ops.
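A toy sketch of the idea, not a model of any real core: decoded micro-ops are cached by fetch address, so a hit skips the decoder no matter how the original instruction was encoded. The class name, the instruction strings, and the micro-op split are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFrontEnd:
    """Toy front end: micro-ops are cached by fetch address after the first decode."""
    program: dict  # address -> instruction text (stand-in for raw instruction bytes)
    uop_cache: dict = field(default_factory=dict)
    decode_count: int = 0

    def decode(self, instr: str) -> tuple:
        """Pretend decoder: split an instruction into simple micro-ops."""
        self.decode_count += 1
        if instr.startswith("add [mem]"):
            return ("load tmp, [mem]", "add tmp, reg", "store [mem], tmp")
        return (instr,)

    def fetch(self, addr: int) -> tuple:
        """Return micro-ops for `addr`, decoding only on a cache miss."""
        if addr not in self.uop_cache:
            self.uop_cache[addr] = self.decode(self.program[addr])
        return self.uop_cache[addr]


fe = ToyFrontEnd(program={0x100: "add [mem], reg", 0x103: "jnz 0x100"})
for _ in range(1000):      # a two-instruction loop executed 1000 times
    fe.fetch(0x100)
    fe.fetch(0x103)

print(fe.decode_count)     # 2: each instruction was decoded once, then served from the cache
print(fe.fetch(0x100))     # the three micro-ops produced for the memory add
```

What comes out of the cache is already in the core's internal format, which is why the length of the original encoding stops mattering at that point.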

igor_kavinski said:
Create a new fixed-length ISA for x86. Translate variable-length instructions to fixed-length instructions and cache those translations for future reference to reduce latency. If, however, software has been recompiled to use the fixed-length ISA, it bypasses the translation overhead. I understand that this is probably a gargantuan task to accomplish, but it's the only one I can think of. When x86's main competitors (ARM and RISC-V) are making so much progress mainly due to fixed-length instructions, x86 has no choice but to fall in line while keeping backward compatibility in place; otherwise it will be left hopelessly behind and never be able to catch up to those simpler designs in power efficiency.

If they're "translating" variable length instructions to fixed length at a hardware level automatically, that's essentially the same as supporting two ISAs. It would increase decoding overhead, not decrease. Not to mention the nightmare it would be to implement.

If they do it at a software level, then it's essentially the same as Rosetta 2 or Prism. Microsoft would have to be on board. And Intel would likely sue AMD (or vice versa). Even if they do come to an agreement to emulate newer x86 extensions, a fixed length x86 would still be a new ISA. At that point you might as well redesign it from the ground up.

igor_kavinski · Aug 15, 2024

GTracing said:
Microsoft would have to be on board.

Or AMD could implement it at the chipset driver level, without Microsoft's blessing.

BorisTheBlade82 · Aug 15, 2024

@GTracing
IIRC, P6 (Pentium Pro) was the first x86 CPU to use micro ops.

moinmoin · Aug 15, 2024

igor_kavinski said:
Or AMD could implement it at the chipset driver level, without Microsoft's blessing.

You mean because AMD has so much experience doing that well and smoothly like right now?

GTracing · Aug 15, 2024

igor_kavinski said:
Or AMD could implement it at the chipset driver level, without Microsoft's blessing.

How would the chipset driver know if a specific program is the old x86-64 or the new fixed-length ISA? At some level, the OS would have to be involved. And while I don't know the terms of their cross-licensing agreement, I doubt it allows them to create x86 emulators.

igor_kavinski · Aug 15, 2024

moinmoin said:
You mean because AMD has so much experience doing that well and smoothly like right now?

Hey, if at first you don't succeed and if you fail multiple times, might as well make a career out of it!

Served Raja Koduri pretty well.

igor_kavinski · Aug 15, 2024

GTracing said:
How would the chipset driver know if a specific program is the old x86-64 or the new fixed length ISA?

Can't the driver read the first few instructions of the executable file to identify? Assuming mixing of new and old instructions in the same EXE isn't allowed.
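For context, Windows executables already carry an architecture tag in their header, so identification would not need to inspect instructions at all. A minimal sketch that reads the `Machine` field from a PE file's COFF header follows; the three machine values are the standard PE constants, while the tiny header built at the end is a fabricated stand-in for a real file. A new fixed-length ISA would presumably need its own value, which is purely hypothetical here.

```python
import struct

# Standard PE/COFF machine constants (IMAGE_FILE_MACHINE_*).
MACHINE_NAMES = {
    0x014C: "x86 (32-bit)",
    0x8664: "x86-64",
    0xAA64: "ARM64",
}


def pe_machine(data: bytes) -> str:
    """Return the architecture named in a PE file's COFF header."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE file (missing MZ signature)")
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)  # e_lfanew field
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINE_NAMES.get(machine, f"unknown (0x{machine:04X})")


def fake_pe_header(machine: int) -> bytes:
    """Build a minimal fake header for the demo; not a loadable executable."""
    buf = bytearray(0x48)
    buf[0:2] = b"MZ"
    struct.pack_into("<I", buf, 0x3C, 0x40)  # PE header starts at 0x40
    buf[0x40:0x44] = b"PE\0\0"
    struct.pack_into("<H", buf, 0x44, machine)
    return bytes(buf)


print(pe_machine(fake_pe_header(0x8664)))  # x86-64
print(pe_machine(fake_pe_header(0x014C)))  # x86 (32-bit)
```

On a real file you would pass in the first few hundred bytes read from disk. Note that it is the OS loader that reads this field when a process is created.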

GTracing · Aug 15, 2024

BorisTheBlade82 said:
@GTracing
IIRC, P6 (Pentium Pro) was the first x86 CPU to use micro ops.

You're probably right. I'm going from memory.

gdansk · Aug 15, 2024

GTracing said:
You're probably right. I'm going from memory.

Pretty sure it was actually the Nx586

GTracing · Aug 15, 2024

igor_kavinski said:
Can't the driver read the first few instructions of the executable file to identify? Assuming mixing of new and old instructions in the same EXE isn't allowed.

Can they? I'm not super familiar with chipset drivers, but that sounds pretty far-fetched. I don't think the chipset drivers "know" when a new program is run. Even if they do, that would add a delay each time a program is started. Not to mention the whole translation aspect.

igor_kavinski · Aug 15, 2024

GTracing said:
Even if they do, that would add a delay each time a program is started. Not to mention the whole translation aspect.

I think Rosetta was the same at first program execution. Subsequent executions were faster.

About the Rosetta translation environment | Apple Developer Documentation

Learn how Rosetta translates executables, and understand what Rosetta can’t translate.

developer.apple.com

I don't see why x86 can't do the same while moving to a better ISA without x86 limitations.

gdansk · Aug 15, 2024

Look up WoW64. Microsoft can handle running binaries with different instruction sets in that layer.

That said there is almost no point to remove the x86 encoding since x64 simply builds on it.

And there is no point adding a fixed size "version of x86." That is not happening. It would be done by switching to ARM. No point to add another RISC variant that complicates the front end. Or if it's the only instruction set supported then it requires writing another JIT and another layer for running x64 code in the instruction set (like they have already done for ARM, so just use that).

GTracing · Aug 15, 2024

igor_kavinski said:
I think Rosetta was the same at first program execution. Subsequent executions were faster.

About the Rosetta translation environment | Apple Developer Documentation

Learn how Rosetta translates executables, and understand what Rosetta can’t translate.

developer.apple.com

I don't see why x86 can't do the same while moving to a better ISA without x86 limitations.

The OS can run a translation layer, yes, but not the chipset driver.

Nothingness · Aug 15, 2024

igor_kavinski said:
I think Rosetta was the same at first program execution. Subsequent executions were faster.

Yes, Rosetta2 is a mix of static recompilation and dynamic translation. That's the FX!64 of the 21st century.

I don't see why x86 can't do the same while moving to a better ISA without x86 limitations.

Rosetta2 only takes care of user code which is easier to run quickly than having to emulate system code (OS/kernel drivers).

igor_kavinski · Aug 15, 2024

Nothingness said:
Rosetta2 only takes care of user code which is easier to run quickly than having to emulate system code (OS/kernel drivers).

Yeah, so translate only user code from variable length instructions to fixed length ones.

igor_kavinski · Aug 15, 2024

GTracing said:
The OS can run a translation layer, yes, but not the chipset driver.

Don't the drivers run in kernel mode? If so, then they can do what the OS can do, unless the OS is limiting the drivers, even ones from the CPU manufacturer that should be rock solid by design.

GTracing · Aug 15, 2024

igor_kavinski said:
Don't the drivers run in Kernel mode? If so, then they can do what the OS can do, unless the OS is limiting the drivers, even ones from the CPU manufacturer that should be rock solid by design.

That's not how that works. Windows provides an API. Drivers can only implement functionality that the API allows.

API reference docs for Windows Driver Kit (WDK) - Windows drivers

Windows Driver Kit (WDK) 10 is integrated with Microsoft Visual Studio and Debugging Tools for Windows. This integrated environment gives you the tools you need to develop, build, package, deploy, test, and debug Windows drivers. WDK includes templates for several technologies and driver models...

learn.microsoft.com

Gideon · Aug 16, 2024

Interesting!

AMD To Provide Update On Long-Term Strategy For Open-Source Firmware - Phoronix

www.phoronix.com

https://www.reddit.com/r/framework/comments/1eryog9/amd_to_provide_update_on_longterm_strategy_for

AMD will provide an update on their long-term open-source firmware strategy at the Open-Source Firmware Conference in September, focusing on their OpenSIL project, which is expected to eventually replace AGESA on future Ryzen and EPYC platforms. They aim for OpenSIL to be ready for production by 2026, spanning both client and server platforms.

If I'm reading this correctly, Zen 6 will have open-source firmware instead of (or, more likely, as an alternative to) AGESA.

static shock · Aug 16, 2024

static shock said:
I will bet the Medusa Ridge performance: 60% faster at same clock than Zen5 on SPECint rate-1. Doubled L2 and L3 sizes.
Same dual 4-wide decode plus the same 6 FPU, 6 ALU, on 2nm TSMC. IoD is 3nm TSMC. The IPC gains come from digging more IPC out of this wide ALU setup and better IPC out of the FPU. I bet on AMD widening less this time and digging for more IPC. The Zen5 diagram already shows how wide it is and how huge the IPC uplift must have been on Zen5.

Launch at Q1/26

I will cram this IPC gain to just 60% more IPC over Zen4.
