> Not sure why AVX512 should be special in this regard.

It's possible if the FP units crunch through data so fast that they waste much of their time waiting for data from RAM because the L3 cache is too small to hold enough data to keep them fed and busy. If this is true, the X3D parts will be something special in AVX-512 workloads.
> It's possible if the FP units crunch through data so fast that they waste much of their time waiting for data from RAM because the L3 cache is too small to hold enough data to keep them fed and busy. If this is true, the X3D parts will be something special in AVX-512 workloads.

Yes, but this assumes the amount of data processed per unit of time is big. Which is not the case for many MT workloads, and not the case for many AVX512 workloads either.
But clearly Zen5, especially when running AVX-512, is a memory bandwidth hog.
> I'm doubtful any consumer Zen 6 implementation will enjoy full speed AVX 512.

Everybody keeps denying the change in AVX-512 applications. First it was a PURE AVX-512 benchmark showing 98% faster than Zen 4. Now we have at least one application that is actually used (in the application areas) where it is 30-40% faster. See the benchmarks in the DC forum here. These are not imaginary gains. In scientific and other areas, it should see these huge gains. That being the case, I doubt that it will die at Zen 5.
> DDR5 is roughly twice as fast as DDR4. Are the new cores on DDR5 twice as fast as the latest cores on DDR4? Do they require twice the memory bandwidth?

Yes, at least when running in AVX-512 mode - Zen5 is twice the throughput of Zen4, let alone Zen3 which only had AVX2!
> Depends on what kind of workload and data you're using AVX512 for. If the amount of data being processed is limited, it'll be kept in cache, which is fast and has very high memory bandwidth. Same as for other operations BTW. Not sure why AVX512 should be special in this regard.

It's special because it's chungus. AVX is all about loading huge amounts of data and processing it quickly. Then the more cores you add into the mix, the more likely it is that applications that weren't memory bandwidth dependent before will become memory bandwidth dependent.
> Source for this claim?

My source is the sporadic lackluster benchmark results.
Also, not all MT workloads use pure AVX512 anyway. In fact, probably very few do.
> Because for ideal MT workloads, you can double performance by doubling the number of cores.

**ideal MT workloads that aren't memory bandwidth bound, of which there will be fewer and fewer as you add more cores
> Yes, at least when running in AVX-512 mode - Zen5 is twice the throughput of Zen4, let alone Zen3 which only had AVX2!

That's a cherry-picked special case. What's the average perf increase when comparing the fastest DDR4 Zen core vs the fastest DDR5 Zen core?
> It's special because it's chungus. AVX is all about loading huge amounts of data and processing them quick. Then the more cores you add into the mix, the more likely it is applications that weren't memory bandwidth dependent before will become memory bandwidth dependent

If the active dataset that is being processed is limited, it'll fit in cache, which has very high memory bandwidth, so it won't be a problem. As igor_kavinski mentioned, X3D parts will help too, for those workloads where the active dataset is bigger.
Why do you think even Zen4 Epyc has 12 memory channels per socket?
> Right, but I'd argue almost all AVX512 workloads use MT. Once we see more apps compiled with AVX512, we'll see more lackluster gains from 9950x.

That does not mean all MT workloads are pure AVX512 though. Far from it.
> **ideal MT workloads that aren't memory bandwidth bound, of which there will be fewer and fewer as you add more cores

I don't think 16 cores is a lot for a Zen5 CPU. We've been on 16 cores since Zen2 on DT. Intel DT CPUs are already on 24 cores, while also using DDR5.
I get it - there are plenty of cases where more cores are better and not much memory bandwidth is needed (encryption, encoding, CPU-based photorendering, hypervisor spam).
But for each of those there are plenty of cases that are membw bound too, so the already niche idea of "moar cores" becomes increasingly so when combined with the ridiculously high AVX512 throughput the 9950x has.
Anyway, my main point is it would just generally be "imbalanced". 16 P-cores is already a lot for a non-HEDT part with only 2 channels of memory. Then there is also the power budget to worry about.
> Regarding power budget I agree though. But then the better solution is to use P + E cores. E.g. 8P cores for max ST perf, then X amount of E cores for max MT perf. Where the E cores are designed for lower power consumption, and optimal perf/watt.

I think we both agree on this one. I'm only arguing against 24-32 P-cores on dual-channel consumer desktop. IMO 16 is even too many (and I say this as an owner of a 5950X, 5900X, and 7840HS).
Regarding power budget I agree though. But then the better solution is to use P + E cores. E.g. 8P cores for max ST perf, then X amount of E cores for max MT perf. Where the E cores are designed for lower power consumption, and optimal perf/watt.
> If the active dataset that is being processed is limited, it'll fit in cache which has very high memory bandwidth, so it won't be a problem. As igor_kavinski mentioned, X3D parts will help too, for those workloads where the active dataset is bigger.

I partly agree with you. But for sure some computational programs such as y-cruncher and Prime95 are limited by main memory BW.
...
And even for pure AVX512 workloads, memory bandwidth does not have to be a problem anyway if the active dataset is limited, as mentioned above.
AIDA64 measures Zen5's memory bandwidth to be about 60 GB/s. To the untrained eye, this may seem like a lot. But when you break it down, it becomes clear that it is suffocatingly insufficient for Zen5's computational power.
- 60 GB/s divided across 16 cores becomes 3.75 GB/s per core.
- 3.75 GB/s divided by ~5 GHz CPU clock becomes 0.75 bytes/cycle.
- 0.75 bytes/cycle divided by a 64-byte (512-bit) load becomes 0.0117 loads/cycle.
- 1/0.0117 loads/cycle ≈ 85.3 cycles per load.
- Zen5's 4 x 512-bit execution width means 4 x 85.3 = ~340 instructions/load.
In plain English:
A loop that streams data from memory must do at least 340 AVX512 instructions for every 512-bit load from memory to not bottleneck on memory bandwidth.
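The arithmetic above can be checked in a few lines (a back-of-envelope sketch; the 60 GB/s, 16-core, and 5 GHz figures are the thread's estimates, not measurements of any particular workload):

```python
# Back-of-envelope version of the calculation above.
mem_bw = 60e9            # AIDA64-measured memory bandwidth, bytes/s (thread's figure)
cores = 16               # 9950X core count
clock = 5e9              # ~5 GHz per core
load_bytes = 512 // 8    # one 512-bit load = 64 bytes
simd_ops_per_cycle = 4   # Zen5: up to 4 x 512-bit ops per cycle

bytes_per_cycle = mem_bw / cores / clock            # 0.75 B/cycle per core
cycles_per_load = load_bytes / bytes_per_cycle      # ~85.3 cycles per 64-byte load
ops_per_load = simd_ops_per_cycle * cycles_per_load # ~341, i.e. the "~340" above

print(bytes_per_cycle, round(cycles_per_load, 1), round(ops_per_load))
```

The exact product is 341.3, which the post rounds down to ~340; the conclusion is the same either way.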
> Right, but I'd argue almost all AVX512 workloads use MT. Once we see more apps compiled with AVX512, we'll see more lackluster gains from 9950x.

If it's AVX-512 that has to be used for the CPU cores to use up the available DDR5 bandwidth, then we are cool, because there are pretty much no notable apps on desktop that use it. And I have serious doubts we are going to see more of them compiled anytime soon, especially since Intel does not even support it on desktop anymore.
> About why Zen5 SIMD is memory bottlenecked I will let myself cite Y-Cruncher author:

In a way it's even worse than that if you consider you can do 2x FMA + 2x FADD per cycle, which amounts to 6 512-bit ops per cycle (counting each FMA as two).
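Plugging that 6-ops-per-cycle figure into the same 60 GB/s back-of-envelope numbers from earlier in the thread makes the ratio steeper still (same assumptions as before):

```python
# 2 x FMA + 2 x FADD per cycle, with each FMA counted as two FP ops.
ops_per_cycle = 2 * 2 + 2 * 1                  # = 6 512-bit ops/cycle
cycles_per_load = 64 / (60e9 / 16 / 5e9)       # ~85.3 cycles per 64-byte DRAM load
print(round(ops_per_cycle * cycles_per_load))  # 512 ops per memory load
```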
> AVX-512 has been at this point available on desktop cpus since what, 2017 and Skylake-X? Yet no adoption of it happened.

Given how much Intel supports AVX-512 in the consumer market, how could it have been different? Intel sucks at supporting their own extensions. It took them 10 years to get AVX2 on all of their CPUs.
Beginning of 2027 is the rumour
> AVX-512 has been at this point available on desktop cpus since what, 2017 and Skylake-X? Yet no adoption of it happened.

It's been available on consumer desktop since Rocket Lake, so 2021 - and for one generation only, as Alder Lake and its derivatives don't have it due to the E-cores. Then Zen4 appeared and maintained availability, but from a different vendor. Skylake-X provided support only on HEDT machines, and its implementation was troublesome due to AVX offsets, the AVX512 "cold start" and other things, so if you were to only sprinkle a few AVX512 instructions into otherwise scalar code, you would hurt performance. On laptops you also had Tiger Lake and probably Ice Lake, but Ice Lake was short-lived on laptops. And since Rocket Lake was a real failure, not many people bought it, so to be honest we should count consumer availability as starting with Zen4.
> It's special because it's chungus. AVX is all about loading huge amounts of data and processing them quick. Then the more cores you add into the mix, the more likely it is applications that weren't memory bandwidth dependent before will become memory bandwidth dependent

Actually that's only one side of the coin. The other is saving power at the front-end. If you can find a few independent operations of the same type, you can execute them all with one instruction. To give a contrived example: with a 512b register you can do 16 32b float additions. You have 2 instructions to load the operands, one to do the arithmetic operation (the addition) and one to store the result. That's 4 instructions to decode. In pure scalar code you would need to decode 64 instructions in an unrolled loop to do the same work. [This contrived example assumes the uop cache does not exist, but even if the ops were served from the uop cache, each load takes at least 4 cycles to complete, and so on.] That's why AVX512 gave Zen4 a noticeable benefit and lower power draw than AVX2 or below. Also remember that for floating point arithmetic operations it does not matter whether you add only scalar values or the full width of 16 values in the case of 32b floats - the latency is the same.
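The 4-versus-64 decode count in that contrived example is easy to verify (a sketch of the counting only; a real compiler would of course also unroll and schedule differently):

```python
# Contrived example from the post above: add 16 pairs of float32 values.
lanes = 512 // 32               # 16 float32 lanes in one 512-bit register
per_element_insns = 2 + 1 + 1   # per scalar element: 2 loads + 1 add + 1 store
scalar_insns = lanes * per_element_insns  # 64 instructions to decode
vector_insns = 2 + 1 + 1        # 2 vector loads + 1 vector add + 1 vector store
print(scalar_insns, vector_insns)  # 64 4
```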
> This I think you'll have to clarify.

I am not referring to watts (energy / second) but total calories (total energy for the task).
A P-core consumes less power than an E-core? And you are talking about ARL DT?
> About why Zen5 SIMD is memory bottlenecked I will let myself cite Y-Cruncher author:

The y-cruncher example is essentially a worst-case scenario. It's also next to impossible to achieve. In reality MOST, but not all, AVX-512 workloads that are not purely synthetic will not be constantly streaming the maximum amount of data. They will digest chunks, manipulate them, test the results, then store the results of the manipulation or the findings of the test, then either wait on the non-AVX-512 portion of the code to do things, or move on to the next chunk of data.
> I am not a project management expert, but if the information that ZEN 5 has been developed by several completely different teams and faced reworks and delays is correct, it is a miracle that it even powers on.

Ordered an 8600G yesterday. I await its arrival with much delight for my usage case in one of my rigs.
The inter-CCD latency problem may be strongly related to some hardware flaw or peculiarity, which would make fixing it very difficult or impossible.
I wonder how many engineers will need to work on this "flopper" even after its release to make it less bad. It may have flopped many times already during development.
As an owner of a 5700G, I wonder why these CPUs are not much more popular in PCs. How are the 8x00G CPUs selling now?
> It's possible if the FP units crunch through data so fast that they waste much of their time waiting for data from RAM because the L3 cache is too small to hold enough data to keep them fed and busy. If this is true, the X3D parts will be something special in AVX-512 workloads.

Most likely because L3 bandwidth hasn't increased compared to Zen4. In Zen6 they may increase it to 64 B/cycle.