Discussion Zen 5 Speculation (EPYC Turin and Strix Point/Granite Ridge - Ryzen 9000)


gdansk

Diamond Member
Feb 8, 2011
3,276
5,186
136
Not really, just a different approach.

AMD gets 26% more performance at 27% more power.

Intel gets 18% more performance at 3% more power.

Overall it's just different. I would even argue that Intel's approach is better, because it's basically free performance at the same power. Which makes it even more unbelievable that they removed it for new consumer CPUs.
It's weird, because other tests (e.g. Phoronix's EPYC tests) show that enabling SMT has little to no impact on power draw but generally increases performance.

So I wonder what can be extrapolated from either set of tests.
 

naukkis

Senior member
Jun 5, 2002
962
829
136
Skylake has very fat load/store units: two load units at 32B/cycle for client and two at 64B/cycle for server, double what Haswell had. That fits 2x512-bit units for AVX-512 perfectly.

But it's still limited to 2x128-bit loads for SSE2 code, like most x86 designs. Only Skymont will be able to do three 128-bit loads. Those 256-bit and 512-bit load paths won't help at all with SSE2.

It goes back to @naukkis mentioning Cinebench being a heavy FP workload, when it's not.
A hardware-limited workload is either compute-bound or load/store-bound. SMT won't give you more load/store capability (except on AMD's Dozers), but compute-bound code will benefit from SMT, especially FP code, because long instruction latencies usually make full utilization of the execution hardware very hard to achieve.
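A rough illustration of that latency point (my own sketch, not from the post above): a single dependency chain of FP multiply-adds is bound by instruction latency and leaves execution slots idle, which is exactly the slack that either more independent chains or an SMT sibling thread can fill.

```c
#include <stddef.h>

// Hypothetical sketch: why one FP thread often can't saturate the FP units.
// With a multi-cycle add/FMA latency, a single accumulator finishes roughly
// one operation per latency period, leaving issue slots idle; that is slack
// a second SMT thread could use. Eight independent accumulators hide the
// latency within one thread instead.
float dot_one_chain(const float *a, const float *b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];              // each iteration waits on the previous one
    return acc;
}

float dot_eight_chains(const float *a, const float *b, size_t n) {
    float acc[8] = {0};
    for (size_t i = 0; i + 8 <= n; i += 8)
        for (int k = 0; k < 8; ++k)
            acc[k] += a[i + k] * b[i + k];   // independent chains overlap in flight
    float sum = 0.0f;
    for (int k = 0; k < 8; ++k)
        sum += acc[k];
    return sum;
}
```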

The same workloads that scale well with SMT are also the ones that don't gain a lot from AVX-512.

That means nothing. Longer vectors are only useful for the small subset of workloads with enough parallelism to fill those long vectors. For that same Cinebench, Apple's 4x128-bit execution units and load ports offer about double the PPC per thread of its x86 rivals, way more PPC for one thread than x86 can get from both SMT threads combined.
 

Saylick

Diamond Member
Sep 10, 2012
3,644
8,222
136
But it's still limited to 2x128-bit loads for SSE2 code, like most x86 designs. Only Skymont will be able to do three 128-bit loads. Those 256-bit and 512-bit load paths won't help at all with SSE2.
Right, the program I run doesn't take advantage of the wider vectors afforded by AVX2, and definitely not AVX-512, which is why I think I'm L/S bound.

I'm really interested in Zen 5 since it appears to handle a mix of 4 loads/2 stores per cycle, which makes it readily evident to me that it can truly handle two threads at full throughput. Again, more evidence that Zen 5 was designed for SMT from the get-go. My company sources our servers through Dell, so it shouldn't be long before Turin is available for purchase. I would love to get my hands on a dual-socket R7725 blade with dual EPYC 9655s. I'm in charge of one of our internal analysis committees, so I have a little bit of sway over our server hardware choices.

 

MS_AT

Senior member
Jul 15, 2024
365
798
96
Right, the program I run doesn't take advantage of the wider vectors afforded by AVX2, and definitely not AVX-512, which is why I think I'm L/S bound.

I'm really interested in Zen 5 since it appears to handle a mix of 4 loads/2 stores per cycle, which makes it readily evident to me that it can truly handle two threads at full throughput. Again, more evidence that Zen 5 was designed for SMT from the get-go. My company sources our servers through Dell, so it shouldn't be long before Turin is available for purchase. I would love to get my hands on a dual-socket R7725 blade with dual EPYC 9655s. I'm in charge of one of our internal analysis committees, so I have a little bit of sway over our server hardware choices.

View attachment 109739
Only 2 pipes are able to do SIMD loads/stores. In general you can do at most 4 memory ops per cycle, of which at most 2 can be stores. You can do at most 2 SIMD loads per cycle, or 2 SIMD stores (only 1 for AVX-512).

Also, if your code is really pure SSE2 (the pure part is important here), then it doesn't use the FMA units at all. If it is in your power, I would really consider recompiling it for something newer. You can use 128-bit vectors with AVX2/AVX-512; the compiler will actually emit FMA for them, and it will have 16/32 (AVX-512) architectural registers to play with instead of 8.
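A minimal sketch of the FMA part of that suggestion (my own example with a hypothetical function name, not the software being discussed): the same 128-bit-wide kernel, compiled for a target that allows FMA, lets the multiply and add fuse into one operation.

```c
#include <immintrin.h>
#include <stddef.h>

// 128-bit AXPY-style kernel: y[i] += a * x[i].
// Built with only -msse2 the multiply and add stay separate instructions;
// built with -mfma (e.g. -march=x86-64-v3) the same 128-bit width can use a
// single fused multiply-add per step, exercising the FMA units.
void axpy128(float *y, const float *x, float a, size_t n) {
    __m128 va = _mm_set1_ps(a);
    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
#ifdef __FMA__
        vy = _mm_fmadd_ps(va, vx, vy);            // one fused op on the FMA pipe
#else
        vy = _mm_add_ps(_mm_mul_ps(va, vx), vy);  // SSE path: separate mul + add
#endif
        _mm_storeu_ps(y + i, vy);
    }
}
```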
 

Saylick

Diamond Member
Sep 10, 2012
3,644
8,222
136
Only 2 pipes are able to do SIMD loads/stores. In general you can do at most 4 memory ops per cycle, of which at most 2 can be stores. You can do at most 2 SIMD loads per cycle, or 2 SIMD stores (only 1 for AVX-512).

Also, if your code is really pure SSE2 (the pure part is important here), then it doesn't use the FMA units at all. If it is in your power, I would really consider recompiling it for something newer. You can use 128-bit vectors with AVX2/AVX-512; the compiler will actually emit FMA for them, and it will have 16/32 (AVX-512) architectural registers to play with instead of 8.
Unfortunately, it’s old commercial software. We don’t have control over how it’s compiled.
 
Jul 27, 2020
20,898
14,487
146
Unfortunately, it’s old commercial software. We don’t have control over how it’s compiled.
My approach would be to convince management to let me get a bunch of different machines (could be anything, laptops or desktops, as long as they are different brands or generations of CPUs) and then benchmark the software to see which particular CPU/RAM combination delivers the best performance. Right now, I would ask them to get me one 9700X and one 265K and then get to work benchmarking them, rather than trying to extrapolate the results of OTHER wildly different benchmarks to estimate how our software would perform. The test machines could always be repurposed or given to hard-working employees.
 
Reactions: Tlh97
Jul 27, 2020
20,898
14,487
146
That is the most distressing part for me.
Unless you are an Intel shareholder, there's nothing to be distressed about. Intel dug their own hole and they are the ones who have to dig themselves out of it. Some diehard fans will still buy Intel out of loyalty or disdain for AMD, but generally, well-informed people will buy whatever gives them the best bang for the buck. Intel is still miraculously not as badly off as they could have been. Can you imagine what would've happened if Jim Keller hadn't hammered the idea of buying TSMC wafer capacity into their thick skulls? They would still be churning out 400W consumer CPUs with massive performance and cooling issues.
 

Saylick

Diamond Member
Sep 10, 2012
3,644
8,222
136
My approach would be to convince management to let me get a bunch of different machines (could be anything, laptops or desktops, as long as they are different brands or generations of CPUs) and then benchmark the software to see which particular CPU/RAM combination delivers the best performance. Right now, I would ask them to get me one 9700X and one 265K and then get to work benchmarking them, rather than trying to extrapolate the results of OTHER wildly different benchmarks to estimate how our software would perform. The test machines could always be repurposed or given to hard-working employees.
Lol, I wish I could do this. Best case is I convince them to let me rent cloud compute instances to evaluate various options.
 

OneEng2

Senior member
Sep 19, 2022
259
356
106
Unless you are an Intel shareholder, there's nothing to be distressed about. Intel dug their own hole and they are the ones who have to dig themselves out of it. Some diehard fans will still buy Intel out of loyalty or disdain for AMD, but generally, well-informed people will buy whatever gives them the best bang for the buck. Intel is still miraculously not as badly off as they could have been. Can you imagine what would've happened if Jim Keller hadn't hammered the idea of buying TSMC wafer capacity into their thick skulls? They would still be churning out 400W consumer CPUs with massive performance and cooling issues.
Not a shareholder, just an American ex-Navy nuke submariner. I very much believe that the US cannot be secure without US-based chip IP companies leading the market. I would very much like Intel to start acting like a leading tech company.

I do hear you though. Intel seems dedicated to finding ways to muck up their business. If they didn't have TSMC wafer capacity, they would have had a completely different size of problem than they have now. Still, I wonder how well the money is working out for them on Arrow Lake and Lunar Lake.

Their decision to punt on AI is also baffling. Their late (really, really late) arrival into integrated GPUs is also puzzling.
 

Hans Gruber

Platinum Member
Dec 23, 2006
2,369
1,259
136
Not really, just a different approach.

AMD gets 26% more performance at 27% more power.

Intel gets 18% more performance at 3% more power.

Overall it's just different. I would even argue that Intel's approach is better, because it's basically free performance at the same power. Which makes it even more unbelievable that they removed it for new consumer CPUs.
Intel is using superior silicon for Arrow Lake compared to their own Intel fabs. That is where Intel's power efficiency comes from.
 

Rheingold

Member
Aug 17, 2022
69
203
76
because it's basically free performance at the same power. Which makes it even more unbelievable that they removed it for new consumer CPUs.
The aspect of performance per power is only one of the reasons for implementing SMT. It's nice to get it without increasing power usage much, but even at a linear power increase it would be desirable, because that's still better than the less-than-linear performance-per-power scaling you get at high frequencies. The other argument for SMT is more performance per area, because the additional transistors require much less area (around 5%) than the performance they add (15-30%). The source is this document, and the money quotes are:
This implementation of Hyper-Threading Technology added less than 5% to the relative chip size and maximum power requirements, but can provide performance benefits much greater than that.
Measured performance on the Intel Xeon processor MP with Hyper-Threading Technology shows performance gains of up to 30% on common server application benchmarks for this technology.
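(A quick back-of-the-envelope check of that trade-off, using the numbers above rather than anything Intel published: multithreaded throughput per unit area goes up by roughly 1.15/1.05 ≈ 1.10 at the low end and 1.30/1.05 ≈ 1.24 at the high end, i.e. about 10-24% more performance per mm² for the SMT-enabled core.)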

So...
The main takeaway for me is that Intel removing HT still doesn't make sense, because the performance uplift should still outweigh the die area it costs.
Why do I say this? Because Intel says "Optimized for PPA" on their slide for the Arrow Lake P-cores:



This. Doesn't. Make. Sense.

And sorry for this now basically being a post for the Intel thread, but we are also discussing AMD's more effective implementation of the same technology.
 

Saylick

Diamond Member
Sep 10, 2012
3,644
8,222
136
The aspect of performance per power is only one of the reasons for implementing SMT. It's nice to get it without increasing power usage much, but even at a linear power increase it would be desirable, because that's still better than the less-than-linear performance-per-power scaling you get at high frequencies. The other argument for SMT is more performance per area, because the additional transistors require much less area (around 5%) than the performance they add (15-30%). The source is this document, and the money quotes are:



So...

Why do I say this? Because Intel says "Optimized for PPA" on their slide for the Arrow Lake P-cores:

View attachment 109751

This. Doesn't. Make. Sense.

And sorry for this now basically being a post for the Intel thread, but we are also discussing AMD's more effective implementation of the same technology.
Maybe the simplest answer is correct: removing SMT simplifies validation of the design, and makes it easier to schedule threads between the P cores and the E cores. That’s it. Anything beyond that is just marketing so that the consumer doesn’t feel like they got a downgrade.
 

Thibsie

Senior member
Apr 25, 2017
913
1,019
136
My approach would be to convince management to let me get a bunch of different machines (could be anything, laptops or desktops, as long as they are different brands or generations of CPUs) and then benchmark the software to see which particular CPU/RAM combination delivers the best performance. Right now, I would ask them to get me one 9700X and one 265K and then get to work benchmarking them, rather than trying to extrapolate the results of OTHER wildly different benchmarks to estimate how our software would perform. The test machines could always be repurposed or given to hard-working employees.
Can't. Smart. Smart not possible. Please be dumb.
😁
 

Thibsie

Senior member
Apr 25, 2017
913
1,019
136
Unless you are an Intel shareholder, there's nothing to be distressed about. Intel dug their own hole and they are the ones who have to dig themselves out of it. Some diehard fans will still buy Intel out of loyalty or disdain for AMD, but generally, well-informed people will buy whatever gives them the best bang for the buck. Intel is still miraculously not as badly off as they could have been. Can you imagine what would've happened if Jim Keller hadn't hammered the idea of buying TSMC wafer capacity into their thick skulls? They would still be churning out 400W consumer CPUs with massive performance and cooling issues.
How many people are well informed? Not many. Most others will just follow.
 

naukkis

Senior member
Jun 5, 2002
962
829
136
Only 2 pipes are able to do SIMD loads/stores. In general you can do at most 4 mem ops of which at most 2 can be stores. You can do at most 2 SIMD loads per cycle or 2 SIMD stores (only 1 for AVX512).

Also if your code is really pure (pure part is important here) SSE2 only then it doesn't use FMA units at all. If it is in your power I would really reconsider recompiling it to something newer. You can use 128b vectors with AVX2/512, compiler will actually emit FMA for these and it will have 16/32(AVX512) architectural registers to play with instead of 8.

You suppose that his code uses FMA? SSE2 in x64 mode also has 16 registers; the 8-register limit applies only in 32-bit mode.

Intel actually offers great performance-monitoring tools, so developers don't need to guess where their code is bottlenecked.
 
Reactions: lightmanek

MS_AT

Senior member
Jul 15, 2024
365
798
96
You suppose that his code uses FMA? SSE2 in x64 mode also has 16 registers; the 8-register limit applies only in 32-bit mode.

Intel actually offers great performance-monitoring tools, so developers don't need to guess where their code is bottlenecked.
He mentioned GEMM, and GEMM will use FMA if available, so it could be using it. Since SSE2 predates x64, it's hard to say if his software was compiled for x86 or x64.

But yes, without profiling we are mostly throwing around random guesses.
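To illustrate why GEMM is such a natural FMA workload (a minimal sketch of my own, not the program in question): the inner kernel is nothing but multiply-accumulates, which a compiler targeting FMA can map one-for-one onto fused instructions.

```c
// Naive GEMM inner kernel (C += A*B, row-major). Every update is a
// multiply-add, so with -mfma the compiler can emit a fused multiply-add per
// step; restricted to plain SSE2 it has to issue separate mul and add
// instructions.
void gemm_naive(int M, int N, int K,
                const float *A, const float *B, float *C) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float acc = C[i * N + j];
            for (int k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];  // fused as FMA when available
            C[i * N + j] = acc;
        }
}
```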
 
Mar 11, 2004
23,341
5,772
146
Didn't Intel remove HT because of the recurring security vulnerabilities, which have proven near impossible to prevent because of the inherent issues that arise from sharing a core's resources between multiple threads? That alone is worth it. For consumers, most of the SMT use cases are things like video encoding and similar, where it's not time-critical (or else you'd be paying for more cores anyway), or if it is time-critical you'll be using hardware video encoding (either integrated into the chip, in the GPU, or as a dedicated separate system if you need better quality or want to do something else at the same time, like archiving, if you're a professional streamer or something). Or it's something a GPU or other accelerator would likely be better suited for anyway.

I know there was also some discussion about software licensing based on core counts, which could also play a role, but that's more in the server space. The security aspect is even more critical there, though, and companies were already moving to developing their own ARM designs to get around such licensing restrictions and security issues, but probably mostly for IP protectionism.

Main reason AMD hasn't yet is that it's easy performance for them to tout and still no one really cares about security. They can flip a switch and turn it off, then probably fuse it off, then probably not even have it in there, like Intel has basically done, once there is a need for that.

This is one of the few things I am 100% onboard with Intel doing, and I wish AMD would do it as well (I believe simply turning it off does not close that attack vector, which is why Intel pretty rapidly went from disabling HT to removing it from the cores).

Random babbling that's not super pertinent/just barely tangential to one aspect of this discussion

This is also an area where I think Microsoft dropped the ball, from two perspectives. They should have been requesting hybrid ARM/x86 chips (5 years ago), where the ARM cores could serve as the dense high-core-count processing block as well as handle efficient idle/standby/sleep. This would have sped up development and support of Windows on ARM, and it also would not have been the either-it-works-or-it's-broken situation it's kinda been (and still is), allowing slower, more measured development. It would have brought tangible improvements that would have mitigated the x86 complaints (battery life especially), and it could also have given the big.LITTLE/P-vs-E core issue a more unified direction.

The other is building a CUDA alternative, where x86, ARM (maybe even RISC?), and GPU compute could then duke it out for supremacy in parallel processing. I have a hunch that's where we're ultimately headed: something kinda like Cell but on steroids, where that is basically a single core of the modern processor. The main core is a relatively large core, and surrounding it, probably cascading out from it, there are several smaller, more purpose-built cores, which will depend on the use case. Maybe mobile implements some RISC cores for low-level background stuff, then there are efficient cores for low-power actual use around a large core that wakes up when performance is needed, which then hands work off to dedicated processing blocks, probably including GPU-based blocks or similar for AI and graphics needs. Then once the task is set up properly (say, recording a video, where it needs to get the pipeline going), it can go back to the efficient cores while the dedicated blocks do their thing. Arguably I think we're largely already there in mobile designs.

In servers, I could see CPU blocks at the forefront, or split directly off from GPU compute blocks but integrated as a base block (vs. chiplets of each, where chiplets would just add more blocks). I think this would also make sense for gaming devices. I've said AMD should have already built their graphics cards like this: gaming-focused CPU cores sharing cache blocks with the GPU, all integrated, sharing the full memory stack including storage, which would enable overall bandwidth and throughput beyond what all these abstracted physical connectors allow, with lower latency to boot. Then have external connectors where you hook it up to a general computing device that handles, say, logging into whatever service you're using and holds your account, and can handle everything else, while the gaming card/device stays entirely focused on gaming.

I'd say most other consumer stuff should be going to small efficient cores paired with dedicated processing (image, graphics, video, etc.), either ARM or C/E cores or whatever low-power-but-good-enough cores there are. Maximize battery life, and for tasks where lots of threads are needed, have people pony up for larger MCM/chiplet designs of those. Even laptops should move to eGPUs, where a mobile GPU is integrated into a dock that handles I/O (and which could be built into the power cord). Split power and data, such that for power you can go wireless/magnetic and for data you can use fiber optics. That lets you make fully sealed devices (water/dust proof; don't read that as impossible to work on, though: by splitting off most of the ports, a failing port can be more easily remedied, and if you need to change what I/O ports you have, you can do it much more easily without compromising the design the way modular stuff tends to; see how swapping out a dock compares to swapping modules on a Framework for the design of the laptop).
 
Reactions: igor_kavinski
Jul 27, 2020
20,898
14,487
146
Main reason AMD hasn't yet is that it's easy performance for them to tout and still no one really cares about security.
I would disagree here. I think the main reasons are "lots of design decisions made around improving SMT performance", making AMD CPUs great scalable multicore processors, and not wanting to spam their CPUs with smaller cores (for reasons I'm not entirely sure of, but I think they have Mont-type cores under development and they are not yet ready for prime time).

We have seen many examples where AMD has shown Intel how to do things right. SMT is just one of them: AMD took a security-first approach to implementing secure boundaries between the physical and virtual threads. Another is AMD's mitigations against Spectre/Meltdown: Phoronix showed that turning these mitigations off actually makes AMD CPUs run a bit slower, whereas doing the same on Intel CPUs makes them run faster. So Intel's mitigations are doing additional checks or preventing some optimized pathways from working, whereas AMD's mitigations were built into the core design itself; they figured out optimizations to steal the performance back from those mitigations, and turning the mitigations off also turns off those mitigation-"mitigating" optimizations.

I think if AMD gives up SMT without ever exploring SMT4, it will be when they can have a whole frickin' sea of cores in an area smaller than Intel Monts.
 
Reactions: Tlh97 and Abwx