Discussion Zen 5 Architecture & Technical discussion

Jul 27, 2020
19,613
13,477
146
I'd love to see a re-review of the findings on inter-core and inter-CCX latencies compared to Zen 4 once new AGESA/BIOS versions are out and the chipset drivers and Windows are properly updated.
Clock is ticking. AMD needs to get the fixed AGESA out the door BEFORE Arrow Lake launch.
 
Reactions: Hotrod2go

MS_AT

Senior member
Jul 15, 2024
207
497
96
Clock is ticking. AMD needs to get the fixed AGESA out the door BEFORE Arrow Lake launch.
And what needs fixing? CCD-to-CCD latency? Was anyone able to correlate poor performance in this synthetic test with an impact on any particular workload? Or is there something else that needs fixing? Chips&Cheese published an article on this a long time ago, https://chipsandcheese.com/2021/06/...to-core-latency-and-the-role-that-locks-play/, let me quote the conclusion:
Typically games have around 20-30 per 10000 instructions suffering a L3 cache miss, which means that games are much more bound by memory latency than lock latency. If you picked an instruction at random, it’s 20-30 times more likely to miss L3 than require a core to core transfer. The situation is more skewed for very parallel productivity workloads like Cinebench, where L3 misses happen about 80x as often as core to core transfers. So in conclusion, a core to core latency test using locks isn’t very indicative of how a CPU will perform with real world usage either of games or productivity workloads. Core to core latency is merely one part of a CPU’s overall performance, and plays a small role compared to other factors like the performance of a CPU’s cache and memory hierarchy.
That is not to say such workloads do not exist. It may also be that the C&C test approach was flawed or their game selection insufficient. Either way, this synthetic test got a lot of attention because of the huge regression, but I am unaware of anyone who has successfully linked it to a real-world performance regression.
Alpha has that additional latency when instructions have to cross the register file. The scheduler, in Alpha's case with help from the programmer, tries to keep dependent instructions on the same side of the register file. But in situations such as when three adds have to be scheduled in one clock on a 2+2 ALU configuration, one instruction has to take the wrong side, and instructions dependent on its result see a one-cycle latency penalty. When the scheduler can isolate dependent instructions to their own sides, full throughput can be maintained without any penalties.
In Zen you also suffer a 1c latency penalty when you move data from the scalar/int register file to the FP/SIMD register file, if I understand the documentation correctly. But this penalty is different from the penalty suffered exclusively by otherwise 1c latency SIMD ops.
 
Jul 27, 2020
19,613
13,477
146
And what needs fixing?
Anything that AMD hasn't been able to do so far. I don't know what their TODO list for Zen 5 related AGESA improvements looks like but they better get to it fast. Also, if they release the X870E/X870 chipsets soon, those may also help performance a bit when used with Zen 5 optimized EXPO kits. And of course, if they are able to release the new X3D SKUs, that would further improve their standing in benchmarks.
 

MS_AT

Senior member
Jul 15, 2024
207
497
96
Agree, 6400MT/s 1:1 must be the new "sweet spot"; anything less robs Zen 5 of its efficiency gains. It would also help in the market fight with Intel's upcoming Arrow Lake.
Fabric clock sweet spot is 2000MHz, CCD to IOD link is 32B/c -> max bandwidth 64GB/s. If you are lucky you get 70.4GB/s with 2200MHz.
For DDR: 6400MT/s with a 128b (16B) bus is 102.4GB/s. With the current sweet spot of 6000MT/s you are at 96GB/s.
In other words, the new sweet spot will be meaningless to 1 CCD SKUs. Only 2 CCD SKUs will be able to benefit provided you can engage both CCD dies. When it comes to bandwidth, at least.
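The arithmetic behind those figures can be reproduced with a short sketch. It takes the post's numbers at face value (32 bytes per fabric clock on the CCD-to-IOD link, a 16-byte-wide DDR5 bus) and computes theoretical peaks only; the function names are made up for illustration.

```python
def link_bandwidth_gbs(fclk_mhz, bytes_per_clock=32):
    """Peak CCD-to-IOD link bandwidth in GB/s (32 B/clock is the post's figure)."""
    return fclk_mhz * bytes_per_clock / 1000


def dram_bandwidth_gbs(mt_per_s, bus_bytes=16):
    """Peak DRAM bandwidth in GB/s for a 128-bit (16-byte) bus."""
    return mt_per_s * bus_bytes / 1000


for fclk in (2000, 2200):
    print(f"FCLK {fclk} MHz: {link_bandwidth_gbs(fclk):.1f} GB/s per CCD link")

for speed in (6000, 6400):
    print(f"DDR5-{speed}: {dram_bandwidth_gbs(speed):.1f} GB/s")
```

Both DDR5 speeds land above what a single link can carry at either fabric clock, which is the point the post is making about 1 CCD SKUs.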
 
Jul 27, 2020
19,613
13,477
146
In other words, the new sweet spot will be meaningless to 1 CCD SKUs.
AVX-512 workloads using 512-bit registers will see benefit. They are the ones most in need of any extra bandwidth compared to the same instructions executing on Zen 4.
 

MS_AT

Senior member
Jul 15, 2024
207
497
96
AVX-512 workloads using 512-bit registers will see benefit. They are the ones most in need of any extra bandwidth compared to the same instructions executing on Zen 4.
Ok, maybe I wasn't clear enough. The CCD to IOD interface limits you to 64GB/s, while 6000MT/s DDR5 setups provide a theoretical 96GB/s. Since CCD to IOD bandwidth is the limiting factor here, it doesn't matter how fast your DRAM is if you saturate the CCD to IOD link first [probably better to have DRAM bandwidth a bit higher to cover various controller-related overheads].

AVX512 would love to use the bandwidth but it won't be able to.
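As a rough illustration of the bottleneck argument, the theoretical ceiling is the smaller of the aggregate fabric link bandwidth and the DRAM bandwidth. This is a simplified model under the post's assumptions (32 B/clock per CCD link, 16-byte DDR5 bus, perfect scaling across CCDs, no controller overhead); `bandwidth_ceiling_gbs` is a hypothetical helper, not a measurement tool.

```python
def bandwidth_ceiling_gbs(ccd_count, fclk_mhz, dram_mt_s,
                          link_bytes_per_clock=32, bus_bytes=16):
    """Theoretical bandwidth ceiling in GB/s: min(fabric links, DRAM).

    Assumes each CCD has its own link to the IOD and that the links
    add up perfectly, which is an idealisation.
    """
    fabric = ccd_count * fclk_mhz * link_bytes_per_clock / 1000
    dram = dram_mt_s * bus_bytes / 1000
    return min(fabric, dram)


for ccds in (1, 2):
    for dram in (6000, 6400):
        ceiling = bandwidth_ceiling_gbs(ccds, 2000, dram)
        print(f"{ccds} CCD, FCLK 2000, DDR5-{dram}: {ceiling:.1f} GB/s")
```

In this model a single CCD stays at the link limit for both memory speeds, while two CCDs become DRAM-limited and so scale with the faster kit.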
 
Jul 27, 2020
19,613
13,477
146
AVX512 would love to use the bandwidth but it won't be able to.

If one is able to get DDR5-6400 CL30 or lower working at 2133 IF with a high end DDR5-8200 RAM kit, AVX-512 WILL see gains. May not be as much as one would like but going for 6400 MT/s won't be a total waste over stock DDR5-6000 CL30. In AIDA64, that's 13% copy, 12.6% read and 14.2% write memory bandwidth gains. Not insignificant by any means.
 

MS_AT

Senior member
Jul 15, 2024
207
497
96

If one is able to get DDR5-6400 CL30 or lower working at 2133 IF with a high end DDR5-8200 RAM kit, AVX-512 WILL see gains. May not be as much as one would like but going for 6400 MT/s won't be a total waste over stock DDR5-6000 CL30. In AIDA64, that's 13% copy, 12.6% read and 14.2% write memory bandwidth gains. Not insignificant by any means.
Let me quote myself:
In other words, the new sweet spot will be meaningless to 1 CCD SKUs. Only 2 CCD SKUs will be able to benefit provided you can engage both CCD dies. When it comes to bandwidth, at least.
Thanks for finding measurements that confirm this hypothesis. It's a 7950X, so a 2 CCD SKU, tested with AIDA, which uses both CCDs.
 
Jul 27, 2020
19,613
13,477
146
Thanks for finding measurements that confirm this hypothesis. It's a 7950X, so a 2 CCD SKU, tested with AIDA, which uses both CCDs.

DDR5-6400 with tuned memory timings and IF @ 2200 on a 9700X yielding above 20% improvement in V-Ray and AI Bench, both of which presumably use AVX-512.
 

MS_AT

Senior member
Jul 15, 2024
207
497
96

DDR5-6400 with tuned memory timings and IF @ 2200 on a 9700X yielding above 20% improvement in V-Ray and AI Bench, both of which presumably use AVX-512.
Let me quote myself once again:
In other words, the new sweet spot will be meaningless to 1 CCD SKUs. Only 2 CCD SKUs will be able to benefit provided you can engage both CCD dies. When it comes to bandwidth, at least.
I have never said timings are not important. In fact, they are more important than pure bandwidth for single CCD SKUs due to the IF limitation. Second, if you read through his article with some attention you will see that:
Despite the Ryzen 7 9700X having only 8 cores, the performance is restricted by its maximum power to 65W. By enabling PBO, we can easily double the power budget in all-core workloads. Combined that with enabling higher memory speeds and it translates into significant performance gains across the board. The Geomean performance improvement is +4.04%, and we get a maximum improvement of +18.07% in the AI Benchmark.
So this 20% AI Benchmark gain is not thanks to memory alone. The results shown are 9700X PBO + memory EXPO vs. 9700X stock with RAM at 4800MHz.

I get that you really want to prove your point, Igor, but this is turning into pure spam of hastily thrown links that are supposed to validate what you say. I will drop the subject now in order not to derail the thread further.
 

Hotrod2go

Senior member
Nov 17, 2021
349
233
86
Let me quote myself once again:

I have never said timings are not important. In fact, they are more important than pure bandwidth for single CCD SKUs due to the IF limitation. Second, if you read through his article with some attention you will see that:

So this 20% AI Benchmark gain is not thanks to memory alone. The results shown are 9700X PBO + memory EXPO vs. 9700X stock with RAM at 4800MHz.

I get that you really want to prove your point, Igor, but this is turning into pure spam of hastily thrown links that are supposed to validate what you say. I will drop the subject now in order not to derail the thread further.

65W TDP... pffft! My ASRock X670E board runs my 9700X with auto power limits for PBO well beyond that limit, 105W for example when running Mem Test Pro, and that is with a negative Vcore offset as well. My PBO has no tuning. All of that is on AGESA 1.2.0.0a, which is the only current AGESA ASRock has implemented for my board atm. MSI & Asus have released an updated AGESA at the time of writing that enables the 105W TDP for Zen 5. ASRock has yet to catch up to that, but ASRock is already pushing the power limits of Zen 5 as it is. If I engage motherboard limits, the power limits are pushed even further, and so is performance, but that is beyond the thermal solution I use atm.

Fabric clock sweet spot is 2000MHz, CCD to IOD link is 32B/c -> max bandwidth 64GB/s. If you are lucky you get 70.4GB/s with 2200MHz.
For DDR: 6400MT/s with a 128b (16B) bus is 102.4GB/s. With the current sweet spot of 6000MT/s you are at 96GB/s.
In other words, the new sweet spot will be meaningless to 1 CCD SKUs. Only 2 CCD SKUs will be able to benefit provided you can engage both CCD dies. When it comes to bandwidth, at least.
If 6000 MT/s is the so-called "sweet spot", then why do tech tubers like Tech YES City test the 9700X, 7700X & 7800X3D with 6200 MT/s? Are they cheating or something?
FCLK at 2000 is playing it safe for poor quality silicon, since the silicon lottery is still a thing; this is why AMD recommends it. The higher the FCLK can go while staying stable, the more memory bandwidth increases and latency drops. Gaming benchmarks have shown fps increases with faster memory up to a given point, and that is if the chip can do 6600MT/s 1:1, even on Zen 4. Your technical theories are wrong in real world applications like gaming.
Using a chip with 2 CCDs brings in the dreaded CCD-to-CCD latency problem, and running in 2:1 is even worse, as the memory controller is only operating at half speed. 1:1 is where it's at for pure unadulterated performance.
 