Question Zen 6 Speculation Thread

dr1337 · Jun 5, 2024

gdansk said:
With 384 threads in a single socket now it seems really unnecessary.
If Zen 6 increases core counts it is a good time to remove SMT.

Or maybe AMD wants to dream even bigger.

gdansk · Jun 5, 2024

dr1337 said:
Or maybe AMD wants to dream even bigger.

Well I expect the thread counts to become even more absurd. I just think of the poor kernels...

marees · Jun 5, 2024

SarahKerrigan said:
Not sure why the hate for multithreading.

For small cores it gets forward movement when one context is stalled for Free*. Nice for real-time, too.

For large cores it gets you high utilization of functional units for Free*.

* Terms and conditions apply, but the costs are pretty small. Marvell called it a 5% area impact for TX3.

Intel gave a reason of power constrained and/or area constrained

I can see consoles going this way, as they are always power constrained

Intel wants to compete with low power ARM APUs. So it makes sense for them. Not sure if it makes sense for AMD

gdansk · Jun 5, 2024

SarahKerrigan said:
Not sure why the hate for multithreading.

For small cores it gets forward movement when one context is stalled for Free*. Nice for real-time, too.

For large cores it gets you high utilization of functional units for Free*.

* Terms and conditions apply, but the costs are pretty small. Marvell called it a 5% area impact for TX3.

For the core farms, 5% less area means they can fit more cores. And what percent of those customers are using SMT? I don't know but it seems like a security risk with all these bleeds and breaks.

I think they could discard it for core farm SKUs and it would be a net benefit for most customers.

SarahKerrigan · Jun 5, 2024

marees said:
Intel gave a reason of power constrained and/or area constrained

I can see consoles going this way, as they are always power constrained

Intel wants to compete with low power ARM APUs. So it makes sense for them. Not sure if it makes sense for AMD

I saw what Intel said. I thought the way it was put was kind of strange, but the gist seems to have been "we care about ST for these cores, because for MT we have a flock of smaller cores." Works for me - if you have small cores around for throughput loads, the SMT value calculation does change, not least because scheduling gets annoying; with ADL they did the whole "1 Atom core = 1 SMT thread" equivalence, which probably simplified scheduling but is a bit troublesome if it's a performance point that designs are supposed to target.

Tuna-Fish · Jun 5, 2024

SarahKerrigan said:
Not sure why the hate for multithreading.

For small cores it gets forward movement when one context is stalled for Free*. Nice for real-time, too.

For large cores it gets you high utilization of functional units for Free*.

* Terms and conditions apply, but the costs are pretty small. Marvell called it a 5% area impact for TX3.

It reduces the performance available per thread.

If you only consider the CPU, this still seems like a huge win for throughput loads. But an active process requires pretty much the same amount of RAM regardless of how fast it is. Back when DRAM was comparatively cheap and CPUs were expensive, this wasn't a big deal. But today, if you spec out a normal server for a throughput load, >50% of the cost of that server is going to be DRAM. And supporting twice the threads means you have to double the most expensive component on your bom, for >25% extra cost, and you are not getting even 20% extra speed. It just makes more sense to get more cores, to more efficiently utilize the most expensive part of the server.

(None of this applies for workloads that don't use much ram per thread. Congrats, they are super cheap to host now.)

(edit: ) I don't see AMD outright removing SMT2 any time soon, it does help on some workloads. But it's much less important, and clients are increasingly turning it off.

SarahKerrigan · Jun 5, 2024

gdansk said:
For the core farms, 5% less area means they can fit more cores. And what percent of those customers are using SMT? I don't know but it seems like a security risk with all these bleeds and breaks.

I think they could discard it for core farm SKUs and it would be a net benefit for most customers.

As far as I know, hyperscalers typically do run with SMT. To the best of my knowledge, one "vCPU" on AWS is an SMT thread with x86 parts, for instance.

If we're throwing out the performance baby with the side-channel bathwater, we need to have a talk about branch prediction, OoO, shared caches across multiple cores, etc.

SarahKerrigan · Jun 5, 2024

Tuna-Fish said:
It reduces the performance available per thread.

If you only consider the CPU, this still seems like a huge win for throughput loads. But an active process requires pretty much the same amount of RAM regardless of how fast it is. Back when DRAM was comparatively cheap and CPUs were expensive, this wasn't a big deal. But today, if you spec out a normal server for a throughput load, >50% of the cost of that server is going to be DRAM. And supporting twice the threads means you have to double the most expensive component on your bom, for >25% extra cost, and you are not getting even 20% extra speed. It just makes more sense to get more cores, to more efficiently utilize the most expensive part of the server.

(None of this applies for workloads that don't use much ram per thread. Congrats, they are super cheap to host now.)

(edit: ) I don't see AMD outright removing SMT2 any time soon, it does help on some workloads. But it's much less important, and clients are increasingly turning it off.

The same argument would apply against aggressive manycore designs, though. If it's really about maximizing the RAM-to-thread ratio - and I have not seen that as a central element of capacity planning, either for conventional enterprise stuff or EDA flows - everyone should be building hulking monster-cores like Z, not tossing hundreds of cores on a device.

Nothingness · Jun 5, 2024

SarahKerrigan said:
As far as I know, hyperscalers typically do run with SMT. To the best of my knowledge, one "vCPU" on AWS is an SMT thread with x86 parts, for instance.

If we're throwing out the performance baby with the side-channel bathwater, we need to have a talk about branch prediction, OoO, shared caches across multiple cores, etc.

Yeah I can confirm that lower cost AWS instances have SMT enabled . In my case, security is not a concern since these instances are secured (read, no one external to my company can run jobs on them), but you get horrible performance. Definitely a showstopper for my needs.

StefanR5R · Jun 5, 2024

Nothingness said:
I can confirm that lower cost AWS instances have SMT enabled . [...] you get horrible performance. Definitely a showstopper for my needs.

Purely WRT your application performance, in this case it's still better for you to rent VMs with twice the vCPUs with SMT enabled, compared to rent SMT-disabled VMs. Pricing of either variant is another (but not entirely technical) question. Maybe the latter can be had at a better price (I haven't checked) because of lower potential power use.

Joe NYC · Jun 6, 2024

adroc_thurston said:
That's the trajectory but crossover timeline seems off.
Maybe a comboPHY for DT/luggable with two platform at once?

A platform supporting 2x LPDDR5/6 LPCAMM for Strix Halo - as a parallel platform to AM5 - would be interesting.

adroc_thurston · Jun 6, 2024

Joe NYC said:
A platform supporting 2x LPDDR5/6 LPCAMM for Strix Halo - as a parallel platform to AM5 - would be interesting.

That's too big and too niche for DT. forget about that.
DT has discrete graphics, proper one.

Glo. · Jun 6, 2024

Joe NYC said:
A platform supporting 2x LPDDR5/6 LPCAMM for Strix Halo - as a parallel platform to AM5 - would be interesting.

The only platform that it will appear on desktop is in Mac Mini/Mac Studio competition.

Doug S · Jun 6, 2024

Glo. said:
The only platform that it will appear on desktop is in Mac Mini/Mac Studio competition.

I think within a few years desktops will be 100% LPCAMM. If you want traditional DIMMs you'll be on server platforms, or on workstation level desktops that use server CPUs like Intel's Xeon workstations or server CPUs in desktop clothing like Threadripper.

A single LPCAMM that's 192 bits wide (i.e. the high end, there will be narrower ones) is the equivalent of six DDR channels bandwidth wise. Outside of those server class CPUs mentioned above, how many have six channels? Zero, AFAIK. The single LPCAMM will also take up less board space than 6 (let alone 12) DIMM slots. I wouldn't be surprised to see boards in SFF setups install the CAMM on the bottom to further reduce footprint in a way DIMMs cannot.

Almost no one wants a tower case PC. Those have been the standard for years because of the need to fit 3.5" and 5.25" drives, but those are gone. So you can have a Mac Mini type form factor for the PC for the average person, and a Mac Studio like form factor (expanded to a cube) for those who want a discrete GPU. Tower style that aren't the server/workstation type platform PCs still using DIMMs are going to be a niche by the end of the decade if they exist at all.

I wouldn't look for much in the way of options for two LPCAMMs like JoeNYC wants. That requires 384 bits of memory controller - just look at the M3 Pro die shots and double the area devoted to memory controllers - plus maybe more since LPDDR6 is more complex than LPDDR5. Those don't shrink as much as logic does, either, so don't expect much help from N2. That's a lot of area, and that's equivalent to 12 DDR channels. i.e. Threadripper territory.

Joe NYC · Jun 6, 2024

adroc_thurston said:
That's too big and too niche for DT. forget about that.
DT has discrete graphics, proper one.

The memory controller could support 2 of the big channels, there could be pins for 2 channels but a lower end platform would only support 1 channel

OTOH, the same platform, supporting both channels could be a revival of HEDT platform.

But the point would be to move to LPDDR memories for desktops / APUs / NUCs / corporate desktops

Markfw · Jun 6, 2024

Nothingness said:
Yeah I can confirm that lower cost AWS instances have SMT enabled . In my case, security is not a concern since these instances are secured (read, no one external to my company can run jobs on them), but you get horrible performance. Definitely a showstopper for my needs.

What CPUs are in those boxes ?

adroc_thurston · Jun 6, 2024

Joe NYC said:
The memory controller could support 2 of the big channels, there could be pins for 2 channels but a lower end platform would only support 1 channel

Lotta engineering for not much gained.

Joe NYC said:
But the point would be to move to LPDDR memories for desktops / APUs / NUCs / corporate desktops

Can do that with bog standard 128b (then 192b-ish) for L6.

Glo. · Jun 6, 2024

Doug S said:
I think within a few years desktops will be 100% LPCAMM. If you want traditional DIMMs you'll be on server platforms, or on workstation level desktops that use server CPUs like Intel's Xeon workstations or server CPUs in desktop clothing like Threadripper.

A single LPCAMM that's 192 bits wide (i.e. the high end, there will be narrower ones) is the equivalent of six DDR channels bandwidth wise. Outside of those server class CPUs mentioned above, how many have six channels? Zero, AFAIK. The single LPCAMM will also take up less board space than 6 (let alone 12) DIMM slots. I wouldn't be surprised to see boards in SFF setups install the CAMM on the bottom to further reduce footprint in a way DIMMs cannot.

Almost no one wants a tower case PC. Those have been the standard for years because of the need to fit 3.5" and 5.25" drives, but those are gone. So you can have a Mac Mini type form factor for the PC for the average person, and a Mac Studio like form factor (expanded to a cube) for those who want a discrete GPU. Tower style that aren't the server/workstation type platform PCs still using DIMMs are going to be a niche by the end of the decade if they exist at all.

I wouldn't look for much in the way of options for two LPCAMMs like JoeNYC wants. That requires 384 bits of memory controller - just look at the M3 Pro die shots and double the area devoted to memory controllers - plus maybe more since LPDDR6 is more complex than LPDDR5. Those don't shrink as much as logic does, either, so don't expect much help from N2. That's a lot of area, and that's equivalent to 12 DDR channels. i.e. Threadripper territory.

I meant not about LPCAMM, but Strix Halo appearing on desktop as Mac Mini/Mac Studio competition .

In essence, I agree. LPCAMM is the future, because it will simplify manifacturing for desktops and mobile platforms, and allows the use of mobile chips straight up on desktops.

With the way it goes, the direction of AI everywhere - memory bandwidth and NPUs will be a requirement, and the need for competition with Apple/Microsoft will be stronger than anything else, which will drive the adoption of LPCAMM everywhere.

DIY will remain on desktop but as a highest end of highest end.

Everything below - APUs with PCIe expansion, or stuff like Mac Studio, large APUs without internal expansion, on desktops. Thats the direction of market, since desktop in general is dying.

Joe NYC · Jun 6, 2024

Doug S said:
I think within a few years desktops will be 100% LPCAMM. If you want traditional DIMMs you'll be on server platforms, or on workstation level desktops that use server CPUs like Intel's Xeon workstations or server CPUs in desktop clothing like Threadripper.

A single LPCAMM that's 192 bits wide (i.e. the high end, there will be narrower ones) is the equivalent of six DDR channels bandwidth wise. Outside of those server class CPUs mentioned above, how many have six channels? Zero, AFAIK. The single LPCAMM will also take up less board space than 6 (let alone 12) DIMM slots. I wouldn't be surprised to see boards in SFF setups install the CAMM on the bottom to further reduce footprint in a way DIMMs cannot.

That's very interesting, that there will be 192 bit wide version. I don't know who has a device on their road map for this width at this time.

Strix Halo will be 256 bits wide, supporting LPDDR5.

Doug S said:
Almost no one wants a tower case PC. Those have been the standard for years because of the need to fit 3.5" and 5.25" drives, but those are gone. So you can have a Mac Mini type form factor for the PC for the average person, and a Mac Studio like form factor (expanded to a cube) for those who want a discrete GPU. Tower style that aren't the server/workstation type platform PCs still using DIMMs are going to be a niche by the end of the decade if they exist at all.

I wouldn't look for much in the way of options for two LPCAMMs like JoeNYC wants. That requires 384 bits of memory controller - just look at the M3 Pro die shots and double the area devoted to memory controllers - plus maybe more since LPDDR6 is more complex than LPDDR5. Those don't shrink as much as logic does, either, so don't expect much help from N2. That's a lot of area, and that's equivalent to 12 DDR channels. i.e. Threadripper territory.

I see now, after a bit of ~~Googling~~, ~~Binging~~, Co-Piloting, I see that DDR6 is going to be 192 bits wide instead of 128 wide. That would seem sufficient / workable for Strix Halo like CPU.

adroc_thurston · Jun 6, 2024

Joe NYC said:
I see that DDR6 is going to be 192 bits wide instead of 128 wide

LPDDR6 is 24b subchannels for 192b total in a standard quadchannel setup.
DDR6 is the opposite and shrinks the subchannel to 16b from 32b of DDR5 and there you go, still 128b.

Joe NYC · Jun 6, 2024

adroc_thurston said:
LPDDR6 is 24b subchannels for 192b total in a standard quadchannel setup.
DDR6 is the opposite and shrinks the subchannel to 16b from 32b of DDR5 and there you go, still 128b.

That's going to be some nice bandwidth uplift. Looks like LPDDR6 is going to be 2x LPDDR5.

Comparing 256 bits of Strix Halo vsl 192 bit LPCAMM for LPDDR6, it will be:
2 x 256 / 192 = 2 x 4/3 = 1.5

Seems like a bright future for company that can make a competent APU, and not so bright for company selling mobile dGPUs.

Tuna-Fish · Jun 6, 2024

Joe NYC said:
That's very interesting, that there will be 192 bit wide version. I don't know who has a device on their road map for this width at this time.

Strix Halo will be 256 bits wide, supporting LPDDR5.

Strix Halo will be old news by the time LPDDR6 will be out in volume. It's a Zen5 product, while even Zen6 is probably too early for LP6.

Joe NYC · Jun 6, 2024

Tuna-Fish said:
Strix Halo will be old news by the time LPDDR6 will be out in volume. It's a Zen5 product, while even Zen6 is probably too early for LP6.

Strix Halo is a proof of concept, something to build up on in future versions.

It's not really completely breaking new ground since Apple is already doing it. Building good APUs, replacing GDDR with LPDDR, unified memory

inquiss · Jun 6, 2024

Joe NYC said:
Strix Halo is a proof of concept, something to build up on in future versions.

It's not really completely breaking new ground since Apple is already doing it. Building good APUs, replacing GDDR with LPDDR, unified memory

Let's hope it sells well so they find the rest of the roadmap...

Doug S · Jun 6, 2024

adroc_thurston said:
LPDDR6 is 24b subchannels for 192b total in a standard quadchannel setup.
DDR6 is the opposite and shrinks the subchannel to 16b from 32b of DDR5 and there you go, still 128b.

There's no way DDR6 will have 16 bit subchannels. ECC, at least as an option, is a hard requirement for DDR6. You'd take a 50% bit penalty to do ECC in 16 bit chunks. DDR4's 64 bit channels allowed 72 bits to handle ECC. With DDR5 ECC DIMMs went 80 bits wide to handle handle ECC for two channels. I don't see them doubling down on that. Since non-ECC DDR6 DIMMs will be a lot rarer assuming the everything below "high end workstation" using CAMMs, they might end up with non-ECC DIMMs being really high priced due to how little demand there is so I'm not convinced ECC will be optional with DDR6. Maybe it'll be something you can disable in the BIOS, but I'd say there are better than even odds that every DDR6 DIMM is an ECC DIMM.

Now I suppose they could do something funky and fiddle with the burst length to get enough extra bits to do ECC rather than doing a wider data path as with DDR5 (which technically is what they did with LPDDR6) but it would make a lot more sense to make DDR6 DIMMs 96 bits wide in 24 bit wide channels following the LPDDR6 plan.

Question Zen 6 Speculation Thread

Senior member

Platinum Member

Senior member

Platinum Member

Senior member

Golden Member

Senior member

Senior member

Diamond Member

Elite Member

Platinum Member

Diamond Member

Diamond Member

Platinum Member

Platinum Member

Moderator Emeritus, Elite Member

Diamond Member

Diamond Member

Platinum Member

Diamond Member

Platinum Member

Golden Member

Platinum Member

Member

Platinum Member