I understand the enthusiasm, but at this pace Skymont will turn out to be better than the yet-to-be-announced Apple M5. So let's pay some attention to what Intel actually claimed,
and under which conditions.
So first, the claimed 2% IPC parity comes with a +/- 10% margin of error.
Next, Intel used GCC 12.1 at -O2 optimization for the comparison...
Lion Cove will be faster, especially in gaming, as it simply has more cache available per core than a Skymont core. Is it 6x the L2 cache per core? [3MB vs 0.5MB] Not to mention the L1 and L0 caches. I mean, this is simply the conclusion you can draw from seeing that the 7800X3D outperforms faster...
Until the current console generation dies, a 12 + 0 die is a gimmick for gaming, really. You would be better off spending the additional die space on bigger caches so they could catch up to X3D. By the time more than 8 cores are a must for gaming, your 12 + 0 Arrow Lake will be obsolete. Unless of course...
It's different in what it stores, and it has higher throughput, but the overall mechanism for getting it filled looks the same judging from the Software Optimization Guides, so if either one misses, the core will need to wait, unless I have misunderstood something.
I think the problem is how long you have to wait when it misses. Games reach around 75% uop-cache utilization on Zen 4 according to the C&C analysis, and if you are unlucky, it seems the core will sit there doing nothing for ~40 cycles while it tries to fetch instructions to decode from L3, even worse if from...
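To get a feel for the argument above, here is a back-of-envelope sketch. The 75% op-cache utilization and the ~40-cycle L3 fetch penalty come from the post; the fraction of fetches that actually go all the way to L3 is a purely hypothetical number plugged in for illustration.

```python
# Hedged back-of-envelope: average front-end stall cost per uop.
op_cache_hit = 0.75        # ~75% of uops served from the op cache (from the C&C analysis cited)
l3_fetch_penalty = 40      # cycles for an unlucky fetch from L3 (from the post)
miss_to_l3_fraction = 0.05 # hypothetical share of op-cache misses that go to L3

# Only op-cache misses pay the fetch path, and only some of those reach L3.
expected_stall_per_uop = (1 - op_cache_hit) * miss_to_l3_fraction * l3_fetch_penalty
print(f"{expected_stall_per_uop:.2f} stall cycles per uop on average")
```

Even with a small assumed L3 fraction, the penalty per uop is non-trivial, which is why making misses cheaper can matter more than widening the decoders.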
Have you observed thread utilization? OpenVINO might limit itself to physical cores, since HT won't give you much benefit in backend-bound code. What you might see is noticeable performance scaling with DDR MT/s if the benchmark is using LLMs underneath.
Not at all, I'm afraid. I mean, according to AMD a single thread gets a 4-wide decoder, and the design is more latency-bound than throughput-bound anyway. So while I guess the compiler guys can do wonders, I wouldn't keep my hopes high.
Well, servers are also used for compiling ;) The problem with compilation...
Hadn't heard, thanks for sharing, but tbh it was a bit doomed to fail; we have too many too-small tools in the ecosystem. I mean, even projects like mold, which are basically drop-in replacements for existing build systems, face similar hurdles: https://github.com/rui314/mold but well, for...
It can be there, but the score is not very meaningful if we don't know the power envelope needed to achieve it. OnePlus was known for using special power modes to boost Geekbench performance, and the test platform itself doesn't need to be in a sealed phone form factor, so it might have...
Do you know which method Intel used to produce their latency numbers? As I recall, Cheese from Chips and Cheese suggested they are comparing apples to oranges, since they used the C&C number for Meteor Lake but did not say how they obtained the Lunar Lake numbers.
But aren't games latency-bound rather than throughput-bound? I mean, more decoder throughput won't help you there. For example, take a look at https://chipsandcheese.com/2023/09/06/hot-chips-2023-characterizing-gaming-workloads-on-zen-4/ — the uop cache is already serving over 75% of instruction needs. The problem...
Which queue size do you mean?
And do you know any workload that is affected by this inter-CCD latency issue other than the synthetic benchmark that is measuring it?
I am getting lost in the hazards. Do you mean the scheduler hazard that adds 1 cycle of latency to ops that are supposed to take only 1 cycle, or is there some otherwise undocumented FADD-specific scheduler hazard that prevents 2-cycle FADD latency from being achieved every time? Because according to all the official info on...
If we are to believe the measurements done by Alex Yee and David Huang, only Granite Ridge is able to do 2c, and only under specific circumstances: http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardown/#fadd_latency
Rumors are pointing in that direction. Server people won't complain this way, as they will get what they are used to. Marketing has already prepared an easy explanation, maximizing throughput rather than energy saving, to justify the difference, and Client gets a simpler scheduling model.
While I can only guess, based on their own press materials they wanted to save transistor budget. After all, you need to duplicate some structures to support SMT, and seeing that professional software still has issues with Raptor Lake due to scheduling problems, it might be for the better for them to...
The number of 512b entries is noticeably smaller. For 128/256b, the number is either exactly the same as Granite Ridge or very close, depending on the source.
For what it is worth, the desktop parts have twice the L3 cache [32MB vs 16MB], but L3 is a victim cache, so I'm not sure how well R24 can make use of it.
Let me quote myself once again:
I have never said timings are not important. In fact, they are more important than pure bandwidth for single-CCD SKUs due to the IF limitation. Second, if you read through his article with some attention, you will see that:
So this 20% AI benchmark is not thanks...
Ok, maybe I wasn't clear enough. The CCD-to-IOD interface limits you to 64GB/s, while a 6000MT/s DDR5 setup provides a theoretical 96GB/s. Since CCD-to-IOD bandwidth is the limiting factor here, it doesn't matter how fast your DRAM is if you saturate the CCD-to-IOD link first [probably better to have a...
The fabric clock sweet spot is 2000MHz, and the CCD-to-IOD link is 32B/cycle -> max bandwidth 64GB/s. If you are lucky, you get 70.4GB/s at 2200MHz.
For DDR: 6400MT/s on a 128-bit (16B) bus is 102.4GB/s. At the current sweet spot of 6000MT/s you are at 96GB/s.
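The arithmetic above can be sanity-checked in a few lines. The 32B/cycle link width, FCLK values, and DDR5 bus width are taken from the posts; the helper function names are just for illustration.

```python
# CCD->IOD link bandwidth: bytes per fabric clock cycle times FCLK.
def ccd_iod_bw(fclk_mhz, bytes_per_cycle=32):
    return fclk_mhz * 1e6 * bytes_per_cycle / 1e9  # GB/s

# DDR5 bandwidth: transfers per second times bus width in bytes.
def ddr5_bw(mts, bus_bits=128):
    return mts * 1e6 * (bus_bits // 8) / 1e9  # GB/s

print(f"{ccd_iod_bw(2000):.1f} GB/s")  # 64.0 GB/s at the 2000MHz sweet spot
print(f"{ccd_iod_bw(2200):.1f} GB/s")  # 70.4 GB/s if you are lucky with FCLK
print(f"{ddr5_bw(6000):.1f} GB/s")     # 96.0 GB/s at 6000MT/s
print(f"{ddr5_bw(6400):.1f} GB/s")     # 102.4 GB/s at 6400MT/s
```

The point is visible in the numbers: DRAM-side bandwidth outruns the CCD-to-IOD link by a wide margin, so the link saturates first.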
In other words, the new sweet spot will be meaningless to...
And what needs fixing? CCD-to-CCD latency? Has anyone been able to correlate poor performance in this synthetic test with an impact on any particular workload? Or is there something else that needs fixing? Chips and Cheese did an article a long time ago...
R24 is memory-subsystem dependent, which makes it different from R23. I suggest checking out the Chips and Cheese review of the benchmark. AT uses stock memory, while other outlets use 6000MT/s+ RAM. In the case of Lunar Lake vs Strix, Lunar enjoys both a bandwidth and a latency advantage. Since there...
We would need an AMD engineer to answer that question. To the best of my knowledge, there is one register file per domain. We don't know if it is internally divided. What we do know is that the latency penalty applies only to a subset of instructions, and only under specific conditions, while what...
Wasn't this assuming some OS-side patches, or something like that? I remember that in the footnotes they somehow alluded to tuned software for the highest performance improvement they quoted.
Ah, that would be unfortunate, as it would suggest a cut-down die, to be honest, or something that did not pass tests to...
No. You should read it as: "We are claiming some instructions have a best-case latency of 1 cycle, but your standard latency test might measure 2 cycles. To measure 1 cycle, do this." That is the only purpose of the testing procedure they propose. It's not meant for performance comparisons between...
The Zen 5 Software Optimization Guide also contains an Excel file listing instruction latencies. In there, there is a Notes sheet containing the following note:
This is about the 1-cycle latency regression for single-cycle SIMD ops that people were discussing here previously.
I was looking at the AT review of Skylake before posting ;) I haven't calculated the percentages, but the bars looked within a 5% range. Anyway, my point was that 5% gaming improvements weren't something unheard of before, even across two generations. The whole media coverage of Zen 5 sounds like it is an...
The issue is that IPC might be misleading and not exactly what people are looking for. For example, in the Zen 4 case of AVX-512 vs AVX2, you would get the same throughput, but the AVX-512 code would retire only half the instructions the AVX2 loop would, and maybe shave off a few cycles saved on the...
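A quick hypothetical illustrates why IPC alone misleads here. The instruction and cycle counts below are made up for the sketch; only the 8-vs-16 floats-per-op ratio between 256-bit AVX2 and 512-bit AVX-512 is real.

```python
# Hypothetical loop processing 1M floats, compiled for AVX2 vs AVX-512.
elements = 1_000_000

avx2_instructions   = elements // 8    # 8 floats per 256-bit vector op
avx512_instructions = elements // 16   # 16 floats per 512-bit vector op

cycles = 70_000  # assume both loops take roughly the same number of cycles

ipc_avx2   = avx2_instructions / cycles
ipc_avx512 = avx512_instructions / cycles

print(f"AVX2    IPC: {ipc_avx2:.2f}")
print(f"AVX-512 IPC: {ipc_avx512:.2f}")
print(f"elements/cycle (both): {elements / cycles:.1f}")
```

Same work done in the same time, yet the AVX-512 version shows half the IPC, so comparing IPC across ISA levels says little about actual throughput.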
The packaging should also bring latency benefits, I guess. I mean, the memory is on package and doesn't need to go through the motherboard PCB; it's physically much closer to the CPU than would otherwise be possible.