> ^^ Although a warning: I am unable to tell how approachable they are if you don't already have some background in software.

Aw come on, I have zero formal education in electronics and just learned from Discord servers and from listening in here.
> I think C&C showed that they decayed to 3-wide decode [so using one cluster] in the absence of a sufficient number of branches, so in a sense that's what Zen 5 is doing. But Skymont is supposed to be in a league of its own when it comes to clustered decode. It's probably the most interesting piece of the Arrow Lake puzzle.

Zen 5's bonus output for the clusters is zero in ST, contrary to Tremont.
> Yeah, but this is the first confusing generation of Zen.
> Zen and Zen+ I don't know much about. BUT:
> Zen 2 -> Zen 3 (Wow!)
> Zen 3 -> Zen 4 (Impressive!)
> Zen 4 -> Zen 5 (WTF???)

Compare Zen 4 single-CCD to Zen 5 single-CCD chips in AIDA64. I notice faster level 1, 2 & 3 cache speeds in Zen 5, so bandwidth at the cache level is significantly upped. Comparing the 7600X to the 9700X, it is especially noticeable at level 3, even though both chips have the same L3 size.
> Yup, you are correct. Scalar workloads are the most important. And they're the hardest to improve, since that's the definition of a general-purpose CPU: you have to improve everything without losing performance on what's already out there.

Scalar can be general purpose or SSE/AVX. The main limitation for reaching higher IPC in the general-purpose ALUs is the 16 general-purpose registers. Because of this limitation, there is a constant need to access the L1. It is easy to see this limitation when you are directly programming in assembly. More read ports would help, but that could also increase power usage, limiting the clock frequency.
> Will APX help alleviate this with more registers? I haven't heard much of anything about it since it was announced.

APX would open up the architecture for more IPC, but future CPUs have to be designed to take advantage of it. In my opinion, only after the introduction of APX could we see a massive increase in general-purpose IPC.
> Scalar can be general purpose or SSE/AVX. The main limitation for reaching higher IPC in the general-purpose ALUs is the 16 general-purpose registers. Because of this limitation, there is a constant need to access the L1. It is easy to see this limitation when you are directly programming in assembly. More read ports would help, but that could also increase power usage, limiting the clock frequency.

So ARM has an advantage because it already has 32 GPRs?
> Because of AMD's split FPU design, in the past it was advantageous to utilize both the general-purpose and SSE/AVX units for maximum scalar throughput (it could still be true compared with Intel's big core). It would hurt Intel's performance, so I don't think any compilers emit code like that. However, you could write a benchmark in assembly that hurts Intel by using both sets of execution units.

Actually, Lion Cove splits the FPU in the same fashion as AMD.
> So ARM has an advantage because it already has 32 GPRs?

Sort of; that, plus fixed-length instructions, allows ARM higher IPC.
> Scalar can be general purpose or SSE/AVX. The main limitation for reaching higher IPC in the general-purpose ALUs is the 16 general-purpose registers. Because of this limitation, there is a constant need to access the L1. It is easy to see this limitation when you are directly programming in assembly. More read ports would help, but that could also increase power usage, limiting the clock frequency.

You know register renaming exists and allows one to improve parallelism with a limited number of registers, right?
> You know register renaming exists and allows one to improve parallelism with a limited number of registers, right?

This is from my experience programming in assembly language. Register renaming does not help if you need to keep values in registers: a lack of registers makes it difficult to keep dependencies in registers, so you constantly need to store and reload values from L1. That is why I said having more read/write ports would help a bit, with a possible increase in power. For AVX, 16 AVX registers are not much of a problem (most of the time). However, for me, the 16 general-purpose registers are not enough. With more registers, more of the ALUs could be utilized at once.
If you state that the main IPC blocker is the number of architectural registers, do you have proof? "Just look at the assembly" is not one; 95%+ of functions are not executed enough to matter. Do you know of a study that analyzes hot functions in common software?
Also, I do not see any connection between the lack of addressable registers and the number of PRF read ports. Are you suddenly talking about renamed registers?
Lastly, Zen2 and Zen4 can actually rename memory locations into the PRF. This should reduce the cost of stack spills and fills (they still use load/store resources, though). I am not sure if Zen5 still knows this trick.
> You know register renaming exists and allows one to improve parallelism with a limited number of registers, right?

You confuse register renaming with the architectural state that the compiler has to preserve. The compiler doesn't know you have a 220-entry register file for GPRs; it knows it has 16 of them, and it will spill if it finds out you overuse them.
> Lastly, Zen2 and Zen4 can actually rename memory locations into the PRF. This should reduce the cost of stack spills and fills (they still use load/store resources, though). I am not sure if Zen5 still knows this trick.

Do you mean that facilities that normally spill (push and pop for GPRs) will write to the PRF and not directly to the cache? Do you know if this is documented somewhere?
> Scalar can be general purpose or SSE/AVX. The main limitation for reaching higher IPC in the general-purpose ALUs is the 16 general-purpose registers. Because of this limitation, there is a constant need to access the L1. It is easy to see this limitation when you are directly programming in assembly.

Is there scalar int that can be handled via the SIMD unit? I recall only the FP part.
> Is there scalar int that can be handled via the SIMD unit? I recall only the FP part.

You can load a single scalar integer value, do a SIMD integer operation on it, and then save the result back as a single scalar integer. Yes, it is overkill, but it can be done.
> You confuse register renaming with the architectural state that the compiler has to preserve. The compiler doesn't know you have a 220-entry register file for GPRs; it knows it has 16 of them, and it will spill if it finds out you overuse them.

> Do you mean that facilities that normally spill (push and pop for GPRs) will write to the PRF and not directly to the cache? Do you know if this is documented somewhere?

Agner Fog at least has documented that in his "The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers" (https://agner.org/optimize/):

24.18 Mirroring memory operands
The Zen 4 can mirror memory operands inside the CPU so that memory operands have no latency at all in some cases. See page 236 for a detailed description of this feature in Zen 2. This feature was introduced in Zen 2, but absent in Zen 3. Now it has returned in Zen 4 with some improvements. It now works with addressing modes that have a pointer and a 32-bit offset, where Zen 2 allowed only 8-bit offsets. Some cases of false dependence have also been removed, but not all.
> You can load a single scalar integer value, do a SIMD integer operation on it, and then save the result back as a single scalar integer. Yes, it is overkill, but it can be done.

Ah, true. I was wondering if there is a special instruction for that, but you don't need special instructions for everything.
> Agner Fog at least has documented that in his "The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers": https://agner.org/optimize/

Ah yes, I forgot about this one, thanks for sharing. That is the last piece of non-AMD coverage of Zen 5 we are still missing: Agner's update covering it.
> I understand it such that those instructions will both write to the PRF and to memory, so the number of store pipelines still matters.

It seems to be the case, and it has some limitations that make me wonder whether compilers are able to use it fully, but for someone writing assembly it seems like a useful tool. I wish it worked with SIMD as well.
> I notice faster level 1, 2 & 3 cache speeds in Zen 5, so bandwidth at the cache level is significantly upped. Comparing the 7600X to the 9700X, it is especially noticeable at level 3, even though both chips have the same L3 size. Level 1 size improvement in Zen 5 is most welcome as well, so specifically from a gaming point of view, even without X3D there is more overall total cache in Zen 5.

These caches are indeed faster, and L1d has become bigger. (And it doesn't stop at these caches: the BTB is bigger, the ITLB is bigger, the µop cache is dual-ported, the ROB is larger...) *However*, bigger L1 and faster L1/L2/L3 mostly benefit workloads that hit L1/L2/L3 a lot, whereas the returns diminish in workloads that have a sizable number of cache misses to begin with. This goes without saying, but I am mentioning it because you brought up games: many video games empirically benefit from an increase in L3 size and from a decrease in main-memory latency, so they are more akin to the latter type of workload.
> Excuse me, lolwut?
> 4.5 boost EPYC?
> Even on 3E that's insane. Actually it's N4P, since at that frequency it definitely won't be Zen 5c but full Zen 5.
> Hoooooly.

Yeah, Turin looks like it's going to be screaming fast, right about the time that Intel is losing its cash flow and ability to discount, and right as AMD ramps its sales team to supply commercial (non-hyperscale) customers. I wonder how many of the laid-off Intel sales people will get picked up, with the right connections.
> Excuse me, lolwut?
> 4.5 boost EPYC?
> Even on 3E that's insane. Actually it's N4P, since at that frequency it definitely won't be Zen 5c but full Zen 5.
> Hoooooly.

It's N4P; the 9655 has Zen 5 classic cores.
> Compare Zen 4 single-CCD to Zen 5 single-CCD chips in AIDA64. I notice faster level 1, 2 & 3 cache speeds in Zen 5, so bandwidth at the cache level is significantly upped. Comparing the 7600X to the 9700X, it is especially noticeable at level 3, even though both chips have the same L3 size. Level 1 size improvement in Zen 5 is most welcome as well, so specifically from a gaming point of view, even without X3D there is more overall total cache in Zen 5.

You can load from L1 at 128 B/cycle with AVX-512, 64 B/cycle with AVX2, and 32 B/cycle with pure scalar code. For Zen 4 it was, respectively, 64 B/c, 64 B/c, and 24 B/c.