Question Zen 6 Speculation Thread

inf64 · Sep 10, 2024

APU_Fusion said:
I predict +40% ipc over zen 5 give or take 40% to 60% either way

Your branch predictor needs a revamp .

On a serious note, 10-15% as per that slide is basically what we should be getting, which is fine. If we can get 5% higher clocks with say 12% IPC, it would be a very solid 18% overall uplift.

static shock · Sep 10, 2024

30% ipc uplift vs Zen5 i bet.

Then +30% in 2026 q3 for the real Zen6(beefier than ever with 9ALU+9FPU core at 18 or 14A Intel or 16A TSMC).

inquiss · Sep 10, 2024

MadRat said:
AMD could support DDR6 immediately. AMD burned in support for external memory controllers. An external memory controller can literally be whatever they define. That is the beauty of supporting an external memory controller. What they aim for is profit. Can they profit from the product? Probably not atm.

DDR6 is going to be server only for a long time maybe ever. Super expensive.

NostaSeronx · Wednesday at 12:19 AM

Tuna-Fish said:
AMD hit the same issues on their 65nm node, which is what caused both the delay and the anemic performance.

AMD shared the same dimensions across 90-nm->65-nm->45-nm.
90nm=40nm Lg // Intel's 90nm was 45-nm/50-nm thus should suffer less quantum tunneling than AMD.*
65nm=40nm Lg (in practice the AMD node for 65nm actually used larger Lgate like 45+ nm)
45nm=40nm Lg
32nm=35nm Lg <-- first real gate shrink. (added high-k which is a known anti-QT tech; same stuff in this is usable w/o significant QT all the way down to 7nm FDSOI(57CPP/40MX))

So, expecting worse quantum tunneling going thru the gate is insane. The actual issue was Xdep; iGISL, iGIDL, iPunch. Which is why everything from 2003-2006 was hyper focused on ETSOI between AMD/IBM till it wasn't. As the larger device shrunk, as the gates stayed the same w/ mobility improvements, they leaked more into the insulator/body not thru the gate-oxide.

*the issue between AMD and Intel's nodes was AMD's PDSOI fixed Xdep for a while. Where Intel suffered most of their working current into leakage of the body.
Terahertz FDSOI vs Traditional PDSOI: "This allows the depleted substrate transistor to have 100 times less leakage than traditional silicon-on-insulator schemes." => Xdep is more significant than quantum tunneling for >20-nm and larger gates. In this case, 7nm FDSOI has larger oxide which allows it to get the same performance/power metrics of N3B. -CEA-Leti CTO.

Tuna-Fish · Wednesday at 2:18 AM

NostaSeronx said:
AMD shared the same dimensions across 90-nm->65-nm->45-nm.

Yes, but that wasn't by choice. Their 65nm node was supposed to have a thinner gate, but that failed, so they bumped it up.

NostaSeronx · Wednesday at 3:06 AM

Tuna-Fish said:
Yes, but that wasn't by choice. Their 65nm node was supposed to have a thinner gate, but that failed, so they bumped it up.

I can't find any PDSOI specific sub-40nm gate length in time for 65nm, but since 2002-2003 there is some for FDSOI.

"This is the first demonstration of a metal gate FDSOI with performance suitable for the 65nm technology node."

The thinner gates were only ever going to come with FDSOI. 25nm and 20nm(late 2003) standard fdsoi were both set for the 45nm node. Only case for smaller than 40nm was with FDSOI, so PDSOI was never going to go below 40nm till 2007~2008 DSL scaling to below 40-nm. Which was set for the 32-nm technology node in its respective paper.

Same applies with Intel's TeraHertz going for 15-nm gate length by 2010. Where the bulk side was 65nm=35nm, 45nm=35nm, 32nm=30nm, 22nm=30/34nm.

More recent stuff allows for gate oxide to get bigger (lowers QT) when Si thickness shrinks below 5-nm. Which is a reversal of "But at advanced nodes, that isn’t possible because gate oxides shrink with the rest of the features." so FinFET/GAA are actually hitting the issue harder than ever.

AMD/GlobalFoundries => UDNA is the FDSOI GPU architecture
The Low-power core that hasn't been seen yet is the FDSOI CPU architecture
Where this is still in play "Products are to be exclusively taped out to and manufactured by FoundryCo." from 7th and amended 7th.

New Geode 52 (officially started 2 years after 50th) = Zen6LP+UDNA, something something ultra-low-power business unit. "server-client-Ultra lower power platforms" 17W/25W for server and 4.5W/6W for client.

12LP+ AI Platform was delayed from 2H2020 to 1Q2025.
12FDX should launch at the same time 1Q2025.

Anhiel · Wednesday at 7:00 AM

inquiss said:
DDR6 is going to be server only for a long time maybe ever. Super expensive.

MadRat said:
AMD could support DDR6 immediately. AMD burned in support for external memory controllers. An external memory controller can literally be whatever they define. That is the beauty of supporting an external memory controller. What they aim for is profit. Can they profit from the product? Probably not atm.

Yes, and I do expect the SOC tile for Zen6 to be updated but DDR6 for consumers is just too early for desktop.
As a consumer I certainly would welcome it. And since it does seem as Apple and Qualcomm are pushing for LPDDR6 in 2025 it does look like there's not much of a choice for marketing's sake at least for mobiles. More than anything I wish for memory expansion to at least 64 GB.
In any case, DDR6 doesn't bring much on the table other than speed bump. DDR7 will bring the more important safety features. So my bet is on DDR7 and PCIe7 being where we'll see a long term pause again. PCIe8 will likely transition toward optical.

Back on topic, in 2018 I was expecting there would be a new architecture family after Zen5, ending the Zen family.
But then Zen5 troubles apparently split and moved improvements to Zen6 on N2. I was expecting backside power delivery with N2. Now N2 and Zen6 troubles apparently split and moved improvements to Zen7 on A16. And since Zen6 is gonna be on AM5 again and likely being chocked to death by its limitation; I see little reason to believe we can expect much raw performance improvement let alone a lot of IPC gains (as already posted before ~12%).
Regardless, as far back as Zen3 it was clear to me AMD needs to improve Infinity Fabric bandwidth by 2x with each subsequent generation or the cores will be starved. And with Zen6 server CCD having 32 cores it's a must. It would be nice if that applies to consumer parts. But alas I hear AMD wants to have tons of different designs instead which likely means consumers might get a more mediocre version. Some people are already complaining about Zen5 being server centric. That's a bad incentive.

Expected improvements are ofc fixing Zen5, especially the fixed queue size schedule. A dynamic and elastic version might bring much greater throughput.
Improving the bad inter-CCD latency probably would be a greater sure thing for solid gains.

The other question people don't care as much as the industry now is AI or rather "NPU" improvement. This is mobile only ofc.
This is where I expect to see the biggest leaps to come in the next decade (100x). Hardware wise things it won't be as great but it looks reasonable to expect at least 2x or even 3-4x (as Intel claims).
Currently, the NPUs are seemingly limited to 5W. I don't expect that budget to change so any change would have to come from process node improvement. for N3->N2 ~25-30% or 10-15% speed: so NPU "IPC" would need to be 1.5x to 3x.
I see no reason for further improvements at this point because the AI/LLM will be memory limited...unless we do get support for more memory.

static shock · Wednesday at 7:07 AM

FDSOI/SOI isn't anymore used!

MS_AT · Wednesday at 7:23 AM

Anhiel said:
Expected improvements are ofc fixing Zen5, especially the fixed queue size schedule. A dynamic and elastic version might bring much greater throughput.
Improving the bad inter-CCD latency probably would be a greater sure thing for solid gains.

Which queue size do you mean?

And do you know any workload that is affected by this inter-CCD latency issue other than the synthetic benchmark that is measuring it?

Aapje · Wednesday at 7:44 AM

Anhiel said:
DDR7 will bring the more important safety features.

Do you have any sources on this? I can't find anything on DDR7.

NostaSeronx · Wednesday at 8:08 AM

static shock said:
FDSOI/SOI isn't anymore used!

It should pick up in use post:
1. Unlimited Bi-directional Back-Bias in FD-SOI Technology With New Dual Isolation Integration - June 2024
2. 3D Sequential Integration with Si CMOS Stacked on 28nm Industrial FDSOI with Cu-ULK iBEOL Featuring RO and HDR Pixel - December 2023
3. 7nm FDSOI build up in 2026 and production start in 2027 - July 2023

Entry/Essential/Value accounts for ~40% of shipments and ~35% of revenue during Duron (2000) to A-series (2019). Mendocino(Zen2c) and Sonoma(Zen5c) are mainstream products not value.

"Products are to be exclusively taped out to and manufactured by FoundryCo." <-- this is still in play for products. Only the FDSOI roadmap has additional nodes. 12FDX 2025 -> 7nm FDSOI 2027. With 5G/6G AMD+Partner being FDSOI as well. With Xilinx/NASA deeply invested in Malta-FDSOI.
"We rely on Taiwan Semiconductor Manufacturing Company Limited (TSMC) for the production of all wafers for microprocessor and GPU products at 7 nanometer (nm) or smaller nodes, and we rely primarily on GLOBALFOUNDRIES Inc. (GF) for wafers for microprocessor and GPU products manufactured at process nodes larger than 7 nm." April~May 2024; AMD is required to do new tapeouts at GF.

The value x86-64 core has always been placed under the latest family with AMD. So, it would share the Zen6-name.

Anhiel · Wednesday at 8:15 AM

MS_AT said:
Which queue size do you mean?

And do you know any workload that is affected by this inter-CCD latency issue other than the synthetic benchmark that is measuring it?

I'm referring to the decoder being clustered into two sets with each 4-wide. So making them combined and dynamic would result in more bandwidth for 1T which mostly benefits games only. It won't matter for 2T.

The most use cases would be compilation workloads.

Aapje said:
Do you have any sources on this? I can't find anything on DDR7.

This was information from Sansum? anyhow an article interview with one of the developers talking about features and considerations many years ago. I do have a link somewhere but it seems I can't find it again rn.

Wolverine2349 · Wednesday at 10:56 AM

MS_AT said:
Which queue size do you mean?

And do you know any workload that is affected by this inter-CCD latency issue other than the synthetic benchmark that is measuring it?

Certain games big time 1% and 0./1% lows are tanked badly and stutter when crossing CCDs.

8 cores just enough for now for almost all games. But a CPU with more than 8 strong cores on a single die would be nice. 7800X3D 8 core chips are really the best compromise for now until a 10-12 Big core CPU with all the cores on a single node comes out if ever.

Or accept Big.Little. Maybe Arrow Lake with the Skymont cores being so much improved and having Raptor Cove IPC just clocked lower at 4.6GHz and clusters of 4 of them for very fast latency between them and still so much faster latency on single ring with 12 nodes than crossing the IF for more than 8 cores on AMD and Zen 6 apparently not gonna fix it much or its unknown????

Anhiel · Wednesday at 12:10 PM

Aapje said:
Do you have any sources on this? I can't find anything on DDR7.

It's not the original article as it has almost nothing on DDR7 but it does include the feature term used:

Schneller Arbeitsspeicher: Samsung spricht über DDR6, DDR6+ und GDDR7

Im Rahmen des Tech Day 2021 hat Samsung über die neuen DRAM-Speichergenerationen DDR6, DDR6+ und GDDR7 gesprochen.

www.computerbase.de

„real-time error protection feature“

Aapje · Wednesday at 1:42 PM

That's a description of GDDR7, but AFAIK they didn't actually improve error detection in GDDR7 over 6. So is the article even accurate?

And they already put error protection in DDR5. I think that it's all very thin evidence to conclude that DDR7 is going to especially awesome.

I'm personally more excited about the possibility of CAMM coming to the desktop in an affordable way, since DIMMs seem like an obsolete format, causing way too many signaling issues.

MS_AT · Wednesday at 1:45 PM

Anhiel said:
I'm referring to the decoder being clustered into two sets with each 4-wide. So making them combined and dynamic would result in more bandwidth for 1T which mostly benefits games only. It won't matter for 2T.

But aren't games latency and not throughput bound? I mean more decoder throughput won't help you there. For example take a look at https://chipsandcheese.com/2023/09/06/hot-chips-2023-characterizing-gaming-workloads-on-zen-4/ uop cache is already serving over 75% needs for instructions. Problem is when you miss in uop and in L1i and L2 and L3 and you will miss due to relatively large number of branches.

Anhiel said:
The most use cases would be compilation workloads.

For InterCCD latency? At least for C/C++ I find it hard to believe knowing how the build system works where every TU is compiled as separate process and there is little need to sync anything except for messaging: I am done. While linking might be affected most people are, unfortunately still using single threaded linkers so the latency should not apply. Not sure about other languages though.

Wolverine2349 said:
8 cores just enough for now for almost all games. But a CPU with more than 8 strong cores on a single die would be nice. 7800X3D 8 core chips are really the best compromise for now until a 10-12 Big core CPU with all the cores on a single node comes out if ever.

I was mostly concerned about a situation where either Strix Point is shown to perform worse than Hawk Point having all things equal [hard to do, I know] or where 9950x is shown to have noticeably worse performance than 7950x [assuming core parking is disabled to make comparison equal to 7950x] so we could attribute that to worse CCD to CCD latency.

igor_kavinski · Wednesday at 1:55 PM

Aapje said:
I'm personally more excited about the possibility of CAMM coming to the desktop in an affordable way, since DIMMs seem like an obsolete format, causing way too many signaling issues.

Fixed by CUDIMMs: https://www.corsair.com/us/en/explo...m-xKFXqczx7UOBCR4vxdpNvL-D9rHnAqiikvhwGV2g5pu

igor_kavinski · Wednesday at 2:06 PM

MS_AT said:
While linking might be affected most people are, unfortunately still using single threaded linkers so the latency should not apply. Not sure about other languages though.

She's created her own C++ build system that doesn't need scripts: https://rachelbythebay.com/w/2024/09/03/ops/

To give some idea of the impact, touching a file that gets included into a whole bunch of stuff forces everything above it in the graph to get rebuilt. This used to take about 77 seconds on the same 2011-built8-way workstation box I've had all along. With just the early work on parallelization, it became something more like 21 seconds.

She did a Kickstarter for like $50,000 (I think) to opensource it. She got hardly $5000 and decided against opensourcing coz people are just too cheap.

(Not trying to prove some point. This is just FYI, in case you haven't heard about her).

MS_AT · Wednesday at 2:27 PM

igor_kavinski said:
She's created her own C++ build system that doesn't need scripts: https://rachelbythebay.com/w/2024/09/03/ops/

Haven't heard thanks for sharing but tbh she was a bit doomed to fall, we have too many too small tools in the ecosystem. I mean even projects like MOLD that are basically drop-in replacements into existing build systems are facing similar hurdles. https://github.com/rui314/mold but well for some reason people are using MSVC as a compiler on Windows including games where Clang is easily accessible from Visual Studio so well, enough with the off top

Doug S · Wednesday at 3:33 PM

Anhiel said:
And since it does seem as Apple and Qualcomm are pushing for LPDDR6 in 2025 it does look like there's not much of a choice for marketing's sake at least for mobiles. More than anything I wish for memory expansion to at least 64 GB.
In any case, DDR6 doesn't bring much on the table other than speed bump.

Based on what evidence? Apple only now started using LPDDR5X, and not even a particularly fast version of it. I think the idea they are "pushing for LPDDR6" in 2025 is folly. The only reason they might (assuming they could get sufficient guaranteed quantity at a not-exorbitant price) is if they considered memory tagging or ECC important.

Has the DDR6 standard even been finalized? I think it is too soon to claim it "doesn't bring much ... other than a speed bump" until it is finalized. The only thing we seem to know about it is that it is supposed to go from 2 channels to 4. I think that's a big hint though, given that ECC is absolutely vital and supporting ECC across four 16 bit channels in the traditional way would be problematic. So I wouldn't be shocked to see it go to 24 bit wide channels handled like LPDDR6 which would imply 96 bit wide DIMMs. That would also make it easier for it to double throughput per DIMM without having to actually double the per bit data rate.

Anhiel · Wednesday at 3:50 PM

Aapje said:
That's a description of GDDR7, but AFAIK they didn't actually improve error detection in GDDR7 over 6. So is the article even accurate?

And they already put error protection in DDR5. I think that it's all very thin evidence to conclude that DDR7 is going to especially awesome.

I'm personally more excited about the possibility of CAMM coming to the desktop in an affordable way, since DIMMs seem like an obsolete format, causing way too many signaling issues.

As I said that's not even close to the article I'm referring to which is older, too.
It's different from DDR5 error correction in that it's supposedly equal to real ECC but for consumers and being standard for all.
If it weren't significant I would't rememberit in the first place.

Doug S said:
Based on what evidence? Apple only now started using LPDDR5X, and not even a particularly fast version of it. I think the idea they are "pushing for LPDDR6" in 2025 is folly. The only reason they might (assuming they could get sufficient guaranteed quantity at a not-exorbitant price) is if they considered memory tagging or ECC important.

Has the DDR6 standard even been finalized? I think it is too soon to claim it "doesn't bring much ... other than a speed bump" until it is finalized. The only thing we seem to know about it is that it is supposed to go from 2 channels to 4. I think that's a big hint though, given that ECC is absolutely vital and supporting ECC across four 16 bit channels in the traditional way would be problematic. So I wouldn't be shocked to see it go to 24 bit wide channels handled like LPDDR6 which would imply 96 bit wide DIMMs. That would also make it easier for it to double throughput per DIMM without having to actually double the per bit data rate.

like here?

LPDDR6 RAM, expected in 2025, could make a 2024 debut on the Snapdragon 8 Gen 4

LPDDR5 is yesterday's papers. So is the boosted LPDDR5X, so is the turbocharged LPDDR5T that delivers approximately 12-13% faster reading speeds than the previous one, the popular LPDDR5X. Let’s talk about LPDDR6!

www.phonearena.com

All the information about DDR6+/GDDR7 were from Samsung's 2021 Tech Day

Samsung Tech Day 2021: DDR6-17000, GDDR6+, GDDR7 and HBM3 Roadmap

Unveiling Samsung's roadmap for future memory technologies.

www.tomshardware.com

As I understand it the 4 channels are only for dividing the DIMMs into 2 sections not 4 channels as in for 4 DIMM.
It's all about speed.

Doug S · Wednesday at 5:02 PM

Anhiel said:
As I said that's not even close to the article I'm referring to which is older, too.
It's different from DDR5 error correction in that it's supposedly equal to real ECC but for consumers and being standard for all.
If it weren't significant I would't rememberit in the first place.

like here?

LPDDR6 RAM, expected in 2025, could make a 2024 debut on the Snapdragon 8 Gen 4

LPDDR5 is yesterday's papers. So is the boosted LPDDR5X, so is the turbocharged LPDDR5T that delivers approximately 12-13% faster reading speeds than the previous one, the popular LPDDR5X. Let’s talk about LPDDR6!

www.phonearena.com

All the information about DDR6+/GDDR7 were from Samsung's 2021 Tech Day

Samsung Tech Day 2021: DDR6-17000, GDDR6+, GDDR7 and HBM3 Roadmap

Unveiling Samsung's roadmap for future memory technologies.

www.tomshardware.com

As I understand it the 4 channels are only for dividing the DIMMs into 2 sections not 4 channels as in for 4 DIMM.
It's all about speed.

That article is talking about Snapdragon, not iPhone. Android OEMs have always pushed the newest memory standards, which is easier to do on models that aren't selling 10 or 15 million phones on launch weekend and 200+ million a year.

The article you linked about Samsung is content free, it says nothing about DDR6 other than an expected launch speed. AI could have written that.

Four channels for DDR6 means exactly the same thing two channels does for DDR5, and carries the same implications for ECC. They already had to go to 80 bit wide DIMMs for ECC in DDR5 because of that change, I don't see how they can do the same for DDR6. Hence why I believe DDR6 DIMMs will be 96 bits wide for everyone, with the ECC handled similarly to how LPDDR6 does (except maybe DDR6 will support BL=18 or even =20 to spread the bits around for chipkill)

MadRat · Wednesday at 9:42 PM

You could conceivably use dynamically configured buses if they expand the number of channels per module. Seems natural to configure 8 channels that can swap between 1,2,3, or 4 channels being synchronized per memory read/write. Perhaps there's an invisible 5th channel to handle ECC. But those small bursts that only require 1 channel would kill performance if you cannot dynamically control the number of channels used. A big cache will be critical to prevent unnecessary overlapped memory requests. Channels would be dynamically swapping between i/o flow directions. Communication from the controller to the core should be synchronized to a high sustainable rate. Keeping the lanes full is more important than simply widening the number of lanes for no measurable performance gain. I would love to see wide buses, but if you gain responsiveness with a narrower bus its hard to justify that wider bus.

branch_suggestion · Wednesday at 11:35 PM

igor_kavinski said:
Fixed by CUDIMMs: https://www.corsair.com/us/en/explo...m-xKFXqczx7UOBCR4vxdpNvL-D9rHnAqiikvhwGV2g5pu

Not enough, DIMMs need to be obsoleted for actual scaling in the future.

del42sa · Saturday at 4:03 AM

NostaSeronx said:
AMD/GlobalFoundries => UDNA is the FDSOI GPU architecture
The Low-power core that hasn't been seen yet is the FDSOI CPU architecture
Where this is still in play "Products are to be exclusively taped out to and manufactured by FoundryCo." from 7th and amended 7th.

New Geode 52 (officially started 2 years after 50th) = Zen6LP+UDNA, something something ultra-low-power business unit. "server-client-Ultra lower power platforms" 17W/25W for server and 4.5W/6W for client.

12LP+ AI Platform was delayed from 2H2020 to 1Q2025.
12FDX should launch at the same time 1Q2025.

oh no, this FD-SOI BS again

Question Zen 6 Speculation Thread

Diamond Member

Member

Member

Diamond Member

Golden Member

Diamond Member

Member

Member

Senior member

Golden Member

Diamond Member

Member

Senior member

Member

Golden Member

Senior member

Lifer

Lifer

Senior member

Platinum Member

Member

Platinum Member

Lifer

Senior member

Member