Discussion Zen 5 Architecture & Technical discussion

gdansk · Aug 8, 2024

JustViewing said:
The latency has increased for SIMD instructions from 1 to 2 cycles. Because of this SSE instructions seems to suffer. So any workload using this instructions might see a slight regression from Zen4.View attachment 104756

Wouldn't that impact most x86 workloads...? Programs basically don't use x87 nor AVX512 as I understand.

naukkis · Aug 8, 2024

gdansk said:
Wouldn't that impact most x86 workloads...? Programs basically don't use x87 nor AVX512 as I understand.

X86-64 basic fp instruction set is SSE2, x87 could be used from x64 but ain't recommended and also not normally used at all. AVX/AVX2 has some support but as it's not supported on all cpu's even sold today support is quite minimally. AVX512 ain't supported pretty much on anything. AMD probably didn't know SIMD workload distribution when they started Zen5 design - Intel did back up AVX512 then pretty strongly. But even with AVX512 main desktop performance priority is on 128 bit SIMD - giving up 128 bit performance for wider vectors is just wrong bet from AMD. Intel goes to opposite direction - their E-core straight doubled 128 fp resources and Lion cove increased 256 bit fp units. Zen5 seems to face quite tough competition from Intel.

JustViewing · Aug 8, 2024

gdansk said:
Wouldn't that impact most x86 workloads...? Programs basically don't use x87 nor AVX512 as I understand.

I guess it will impact, as most executable are generic ones which are compiled for lowest common denominator. Win64 baseline is SSE2. So most generic application may not have significant improvement. It could change in Zen 6.

MS_AT · Aug 8, 2024

gdansk said:
Wouldn't that impact most x86 workloads...? Programs basically don't use x87 nor AVX512 as I understand.

That depends. SSE is the default [for scalars too] processing mode for floating point values for x64 architecture. The thing 1 cycle latency instructions got worse, and those are usually shuffles and you don't need to shuffle scalar values within the register. Add and multiply were already 3 cycles each and those were not affected. Actually SIMD int might be affected as vector int add was probably one cycle.

naukkis said:
X86-64 basic fp instruction set is SSE2, x87 could be used from x64 but ain't recommended and also not normally used at all. AVX/AVX2 has some support but as it's not supported on all cpu's even sold today support is quite minimally. AVX512 ain't supported pretty much on anything. AMD probably didn't know SIMD workload distribution when they started Zen5 design - Intel did back up AVX512 then pretty strongly. But even with AVX512 main desktop performance priority is on 128 bit SIMD - giving up 128 bit performance for wider vectors is just wrong bet from AMD. Intel goes to opposite direction - their E-core straight doubled 128 fp resources and Lion cove increased 256 bit fp units. Zen5 seems to face quite tough competition from Intel.

It wasn't hard for Skymont to double 128b execution units, they had so few of them before It would be much more impressive if Lion Cove doubled number of 256 pipes, but they are probably facing the same limitations AMD is. Lion Cove will match Zen5 with AVX2 capabilities as it is actually playing catch-up to Zen4.

poke01 · Aug 8, 2024

The 9700x is $609AUD. Less than the 7700X which was $670AUD. But you can get the 7700X for $489 now. I made the right call and got the 7700X for $395AUD on Prime day.

Mahboi · Aug 9, 2024

Zen5's AVX512 Teardown + More...

The 40% IPC improvement in SpecInt (an early leak) is consistent with my tests showing 30-35% improvement in raw scalar integer that isn't memory-bound.

We get essentially 10% general improvement in INT, if not less.
How the heck can Zen 5 be somehow entirely memory-bound on scalar????

MS_AT · Aug 9, 2024

Mahboi said:
Zen5's AVX512 Teardown + More...

We get essentially 10% general improvement in INT, if not less.
How the heck can Zen 5 be somehow entirely memory-bound on scalar????

Latency, what good are all those execution resources if you are waiting either for data or code to run. Games must be notorious for this, seeing how many of them are helped by x3d cache. Since Zen5 and Zen4 share the same connection characteristics from Core to L3 and from CCD to IOD afaik you won't see much improvement between Zen4 and Zen5 when that happens.
I guess using synthetic benchmarks running completely from L1 cache, you would see noticeable improvements in int scalar execution between Zen5 vs Zen4. Therefore the uncore changes rumored for Zen6 might be more meaningful than IPC gain of the core, if current potential is not fully tapped. But to know that, we would need someone to hook a profiler and see where the problem lies. Maybe C&C will do that.

soresu · Aug 9, 2024

naukkis said:
But even with AVX512 main desktop performance priority is on 128 bit SIMD - giving up 128 bit performance for wider vectors is just wrong bet from AMD

Are you implying that AMD's 128 bit vector perf has actually regressed?

AFAIK unless I have read things completely wrong the larger units should just subdivide for smaller vectors allowing 4x 512 to become 8x 256, or 16x 128.

gdansk · Aug 9, 2024

soresu said:
AFAIK unless I have read things completely wrong the larger units should just subdivide for smaller vectors allowing 4x 512 to become 8x 256, or 16x 128.

No, FPU can only do 4 operations per cycle (of any size). And up to 2 loads (of any length). The stores can be split, however. 1 x 512 bit or 2 x 128/256 bit.

soresu · Aug 9, 2024

gdansk said:
No, FPU can only do 4 operations per cycle (of any size)

Ken g6 · Aug 9, 2024

gdansk said:
Wouldn't that impact most x86 workloads...? Programs basically don't use x87 nor AVX512 as I understand.

SSE replacing x87 basically started in 2000 with the Pentium 4. But with SSE came the option of SIMD, doing 2-4 tasks at a time. So any floating-point code requiring performance between 2000 and the early 2010s should have used SSE with SIMD. When AVX came along in the early 2010s it was a drop-in upgrade for most SIMD code.

By this logic, most performant floating-point code updated in the past decade should be using AVX and thus should not be affected. Also realize that many applications don't use floating-point at all.

Of course there are always exceptions. A program I worked on used SSE and 80-bit x87 in a weird way that wouldn't translate well to AVX because there weren't enough x87 registers. Fortunately it's obsolete now, but it would have required a good deal of work to use AVX.

inquiss · Aug 9, 2024

soresu said:
Are you implying that AMD's 128 bit vector perf has actually regressed?

AFAIK unless I have read things completely wrong the larger units should just subdivide for smaller vectors allowing 4x 512 to become 8x 256, or 16x 128.

Yes there is extra latency there now

soresu · Aug 9, 2024

Ken g6 said:
But with SSE came the option of SIMD, doing 2-4 tasks at a time

MMX entered the chat.....

soresu · Aug 9, 2024

Also 3DNow! instruction set the year before SSE.

Edit: Oh interesting, 3DNow! actually started offering FP32 add/subtract/multiply operations before Intel had them with SSE.

Thunder 57 · Aug 9, 2024

Ken g6 said:
SSE replacing x87 basically started in 2000 with the Pentium 4. But with SSE came the option of SIMD, doing 2-4 tasks at a time. So any floating-point code requiring performance between 2000 and the early 2010s should have used SSE with SIMD. When AVX came along in the early 2010s it was a drop-in upgrade for most SIMD code.

By this logic, most performant floating-point code updated in the past decade should be using AVX and thus should not be affected. Also realize that many applications don't use floating-point at all.

Of course there are always exceptions. A program I worked on used SSE and 80-bit x87 in a weird way that wouldn't translate well to AVX because there weren't enough x87 registers. Fortunately it's obsolete now, but it would have required a good deal of work to use AVX.

I think you meant SSE2 with the P4. SSE was on the P3 before that. When AVX came along in the early 2010's Intel used it as market segmentation. The lower end chips and didn't include it. If not for that, maybe AVX(2) would be more prevelant today. But that was par for the course for a long time with Intel.

soresu said:
Also 3DNow! instruction set the year before SSE.

Edit: Oh interesting, 3DNow! actually started offering FP32 add/subtract/multiply operations before Intel had them with SSE.

3DNow! was implemeted to make up for AMD's less than stellar x87 performance at the time. If AMD had more market share maybe it would've made more of a difference.

MS_AT · Aug 9, 2024

gdansk said:
No, FPU can only do 4 operations per cycle (of any size). And up to 2 loads (of any length). The stores can be split, however. 1 x 512 bit or 2 x 128/256 bit.

There are further limits by the operation type [add, mul, complex permute, simple permute] which might be or might not be important

inquiss said:
Yes there is extra latency there now

only for subset of instructions the most basic one already had more than 1 cycle latency so are not affected

Thunder 57 said:
I think you meant SSE2 with the P4. SSE was on the P3 before that. When AVX came along in the early 2010's Intel used it as market segmentation. The lower end chips and didn't include it. If not for that, maybe AVX(2) would be more prevelant today. But that was par for the course for a long time with Intel.

If not for this stupid segmentation policy, AVX512 could be more popular now, I mean Intel could fit somewhat limited implementation into Tiger Lake, and I remember part of the hardware for this limited implementation was already present on Skylake non X but fused off. But I would need to dig for source, since I might be remembering wrongly

Ken g6 · Aug 9, 2024

Thunder 57 said:
I think you meant SSE2 with the P4. SSE was on the P3 before that.

I know there was SSE on P3, but I don't think they really pushed moving off the x87 FPU for non-SIMD work until the P4.

Thunder 57 said:
When AVX came along in the early 2010's Intel used it as market segmentation. The lower end chips and didn't include it. If not for that, maybe AVX(2) would be more prevelant today. But that was par for the course for a long time with Intel.

Oh, yeah, I forgot about the broken Celery. (Celerons.)

soresu · Aug 9, 2024

Thunder 57 said:
3DNow! was implemeted to make up for AMD's less than stellar x87 performance at the time. If AMD had more market share maybe it would've made more of a difference.

Story of AMD's life really.

Exact same thing happened with SSE5 and AVX.

Mahboi · Aug 10, 2024

MS_AT said:
Latency, what good are all those execution resources if you are waiting either for data or code to run. Games must be notorious for this, seeing how many of them are helped by x3d cache. Since Zen5 and Zen4 share the same connection characteristics from Core to L3 and from CCD to IOD afaik you won't see much improvement between Zen4 and Zen5 when that happens.
I guess using synthetic benchmarks running completely from L1 cache, you would see noticeable improvements in int scalar execution between Zen5 vs Zen4. Therefore the uncore changes rumored for Zen6 might be more meaningful than IPC gain of the core, if current potential is not fully tapped. But to know that, we would need someone to hook a profiler and see where the problem lies. Maybe C&C will do that.

Fascinating...
One thing I read somewhere is that Apple's performance success comes also from a fairly fat L2 rather than the more server-typical L1/2/3 AMD uses.
Now granted they also do it in GPUs while Nvidia stays with only L1/L2, maybe it's just a kink AMD will keep. But could it be that with Zen 6, we start seeing the latency bottleneck cured with a smaller or non-existent L3 on client, while a really fat L2 replacing it? Server apps very clearly gain a lot from Zen 5, the problem seems to be more that what we have here is a fully primed server chip that isn't really any kind of improvement in client.

naukkis · Aug 10, 2024

Thunder 57 said:
3DNow! was implemeted to make up for AMD's less than stellar x87 performance at the time. If AMD had more market share maybe it would've made more of a difference.

3dNow! was basically meant only for 3d-gaming. It fp calculations weren't IEEE-compatible, so for most programs they lack accuracy so being useless for general usage.

Mahboi · Aug 10, 2024

soresu said:
Are you implying that AMD's 128 bit vector perf has actually regressed?

AFAIK unless I have read things completely wrong the larger units should just subdivide for smaller vectors allowing 4x 512 to become 8x 256, or 16x 128.

http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardown/ Ever so slightly, a percent or two.

naukkis · Aug 10, 2024

soresu said:
Are you implying that AMD's 128 bit vector perf has actually regressed?

AFAIK unless I have read things completely wrong the larger units should just subdivide for smaller vectors allowing 4x 512 to become 8x 256, or 16x 128.

Not the point. AMD does have only 2-load ports to fp register where everybody else has more. Even Intel E-cores will have 3 load ports, so probably being able to achieve better IPC for some scalar & 128 bit workloads. AMD has biggest FP-unit of all in their cpu's - and is nearly in situation that it will have worst IPC for desktop and mobile workloads.

gdansk · Aug 10, 2024

naukkis said:
AMD has biggest FP-unit of all in their cpu's

Biggest how? Area I doubt it's even half the size of GC's. Transistor count? Doubt that it'll approach any N3 core.
I think people keep overestimating the area needed to do the changes they did. Plus in mobile the FPU isn't wider, it's still 256-bit.
And yet it has the 1 cycle penalty (i.e. it's not a consequence of being 512 bit but of some other design hazard).

naukkis · Aug 10, 2024

gdansk said:
Biggest how? Area I doubt it's even half the size of GC's. Transistor count? Doubt that it'll approach any N3 core.
I think people keep overestimating the area needed to do the changes they did. Plus in mobile the FPU isn't wider, it's still 256-bit.
And yet it has the 1 cycle penalty (i.e. it's not a consequence of being 512 bit but of some other design hazard).

Theoretically most powerful - but in real usage cases might actually be worst performer. That's a pretty imbalanced situation.

soresu · Aug 10, 2024

naukkis said:
Theoretically most powerful - but in real usage cases might actually be worst performer. That's a pretty imbalanced situation.

And yet a rather obvious low hanging fruit for future cores to improve upon.

Discussion Zen 5 Architecture & Technical discussion

Diamond Member

Golden Member

Senior member

Senior member

Diamond Member

Golden Member

Senior member

Diamond Member

Diamond Member

Diamond Member

Programming Moderator, Elite Member

Senior member

Diamond Member

Diamond Member

Diamond Member

Senior member

Programming Moderator, Elite Member

Diamond Member

Golden Member

Golden Member

Golden Member

Golden Member

Diamond Member

Golden Member

Diamond Member