But that's not really the case. Zen 5 paid little attention to balance: only AVX-512 had its datapath widened to a full 512 bits (up to double the compute throughput in the ideal case), and scalar integer got expanded (up to ~35% more compute in the ideal case). Everything else is essentially untouched, and the scalar integer improvement seems hard to actually exploit; that's the imbalance.
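For concreteness, the 512-bit point shows up in loops like the following. Zen 4 double-pumps 512-bit uops through 256-bit datapaths, whereas Zen 5's full-width units handle them in one pass, so a compute-bound AVX-512 kernel can approach 2x. A minimal sketch (my own illustration, assuming AVX-512 hardware and compilation with -mavx512f):

```c
#include <immintrin.h>
#include <stddef.h>

/* y[i] += a * x[i], 16 floats per 512-bit FMA. Zen 4 splits each
   512-bit uop over two 256-bit passes; Zen 5 can retire it in one,
   approaching 2x when the loop isn't memory-bound. */
void saxpy512(float *y, const float *x, float a, size_t n) {
    __m512 va = _mm512_set1_ps(a);
    for (size_t i = 0; i + 16 <= n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);
        __m512 vy = _mm512_loadu_ps(y + i);
        _mm512_storeu_ps(y + i, _mm512_fmadd_ps(va, vx, vy));
    }
    /* tail (n % 16 elements) omitted for brevity */
}
```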
The front-end was seriously altered. Much larger BTB, double decoders, for instance.
IIRC instruction schedulers were changed too.
The uop cache was changed.
My take on this is that so many things were changed that balancing structures was not properly done and we'll have to wait for Zen6 to see if these changes bring significant benefits for integer code.
Another way to look at it is that despite all the changes in the frontend the performance didn't degrade and is still "balanced" (i.e. close enough to previous gen) except for the positive 512bit and INT outliers that appear even without memory bandwidth bottlenecks having been alleviated yet. Zen 5's design looks like one requiring substantial uncore changes, but with none of them applied before Zen 6.
I agree the uncore needs work, but I also think the core itself needs more balancing. Though I'd say that given all the changes that were made to the core, AMD already did a good job, but not to the level I expected (hoped...). That's why I will wait for Zen6 before replacing my aging desktop.
I'm very curious to see how much improvement the uncore will bring, we're still effectively stuck on Zen 2 era technology here. It is painfully bad for AMD.
I don't think we can expect much of any changes for the core itself for Zen 6.
I don't believe AMD designed dual decode just for SMT. I think it was a last-minute decision to disable it. Zen 6 should be the Zen 5 they planned, similar to how Ryzen 2000 was the bug-fixed Zen 1, and how Piledriver fixed many bugs of Bulldozer.
Theoretically. But I doubt this would happen in practice. The AMD decoder pipelines decode MOPs, which peak at 2 uops each, but that peak case is 1 arithmetic operation + 1 memory operation (read/write). My guess is that most of the time the AMD decoder pipelines run at 1 uop per MOP.
Similarly for Intel: the first (complex) decoder pipeline peaks at 4 uops, but I don't think that happens most of the time, and the remaining decoder pipelines run at 1 uop.
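To make the MOP-vs-uop distinction concrete, a toy example (mine, using GCC/Clang inline asm on x86-64): a register-register add is 1 MOP carrying 1 uop, while a load-plus-add form is 1 MOP carrying 2 uops.

```c
#include <stdint.h>

/* Toy illustration of MOP vs uop counts; not a benchmark. */
uint64_t demo(uint64_t a, const uint64_t *p) {
    /* 1 MOP = 1 uop: register-register ALU op. */
    __asm__ volatile("add %1, %0" : "+r"(a) : "r"(*p));
    /* 1 MOP = 2 uops: the load and the ALU op travel as one macro-op. */
    __asm__ volatile("add (%1), %0" : "+r"(a) : "r"(p), "m"(*p));
    return a;
}
```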
I suspect that AMD's symmetric decoders are intended to simplify the control logic and the algorithms embedded in it, which then don't have to decide which pipeline should decode a given type of instruction.
The decoder in Lion Cove probably has 8 simple decode pipelines (1 uop each). But now the MSROM can emit 4 uops instead of 2, and probably more complex instructions are decoded with microcode.
The Lion Cove decoder is very expensive compared to Zen 5's. Meanwhile ARM/Apple can decode 10 instructions from a single thread per cycle. What AMD designed for Zen 5 suits x86, making it power-efficient.
It would suit x64 even better if the two clusters could co-operate to execute a single thread à la Tremont/Skymont, especially if doing that didn't cost much area. It doesn't seem like it should, but there must be some reason AMD didn't do it.
Yes: a single decoder with 8 pipes (Lion Cove) is more complex than 8 pipes grouped into 2 separate decoders (Zen 5). The control logic and the algorithms embedded in it are much more complicated (Lion Cove).
But I also meant that asymmetric decode pipelines are more complicated to control than symmetric ones.
A decoder with symmetric (AMD-style) pipes seems to spend more transistors on the pipes themselves but needs less complex control logic.
A decoder with asymmetric pipes uses fewer transistors on the pipes, but makes up for it with more complex control logic.
Of course, that's a decoder-to-decoder comparison with the same number of pipes.
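As a toy model of that trade-off (entirely my own sketch, not anyone's actual hardware; GCC/Clang builtins assumed): with symmetric pipes, steering is just "pick any free slot", while asymmetric pipes must track which slot can handle a complex instruction and stall when it's busy.

```c
/* Hypothetical steering model; real decoders steer a group of
   instructions per cycle, not one at a time. */
typedef struct { int is_complex; } insn_t;

/* Symmetric pipes (Zen 5-style clusters): any free pipe will do. */
int steer_symmetric(unsigned free_pipes) {
    return free_pipes ? __builtin_ctz(free_pipes) : -1;
}

/* Asymmetric pipes (classic Intel big/little decoders): a complex
   instruction fits only pipe 0, so steering must check capability
   and may have to wait. */
int steer_asymmetric(const insn_t *i, unsigned free_pipes) {
    if (i->is_complex)
        return (free_pipes & 1u) ? 0 : -1;  /* stall until pipe 0 frees */
    return free_pipes ? __builtin_ctz(free_pipes) : -1;
}
```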
Mike Clark claims that this clustered decoding gave them the ability to widen the core, which points to the complexity problem of wider decode.
Certainly not in any practical application code. However, I was able to get ~10 uops/clock on my Zen 3 with assembler code. So it might be possible to get ~14+ uops/clock on Zen 5.
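For anyone who wants to reproduce that kind of number, a rough sketch of such a microbenchmark (my own code, x86-64 with GCC/Clang; rdtsc counts reference cycles rather than core clocks, so pin the frequency or use performance counters for serious measurements):

```c
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define ITERS 100000000ULL

int main(void) {
    uint64_t a = 1, b = 2, c = 3, d = 4, e = 5, f = 6;
    uint64_t t0 = __rdtsc();
    for (uint64_t i = 0; i < ITERS; i++) {
        /* Six independent single-uop adds per iteration, so the limit
           is issue width, not a dependency chain. The loop itself adds
           a couple of ops of overhead per iteration. */
        __asm__ volatile(
            "add $1, %0\n\tadd $1, %1\n\tadd $1, %2\n\t"
            "add $1, %3\n\tadd $1, %4\n\tadd $1, %5"
            : "+r"(a), "+r"(b), "+r"(c), "+r"(d), "+r"(e), "+r"(f));
    }
    uint64_t cycles = __rdtsc() - t0;
    printf("approx %.2f adds/cycle\n", (6.0 * ITERS) / (double)cycles);
    return (int)(a + b + c + d + e + f);  /* keep the results live */
}
```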
rigaya's Ryzen 9 9950X benchmarks contain tests of the 9950X vs the 7950X using many presets, with interesting results: x264 and AV1 encoding seem able to choke Zen 5, even making it slower than Zen 4.
Only the fastest presets of SVT-AV1 show a regression in this blog post (with 1080p video input). This same regression also shows in AnandTech's tests (with 1080p and 4K input).
Edit: AnandTech also shows results from 6- and 8-core CPUs, and those tell us something about the nature of these workloads. Alas, neither rigaya's blog nor AnandTech reports standard deviations for these tests.
I don't know how AV1 behaves. But if it uses MT, we might again be seeing memory bottlenecks, since the fastest presets do less compute per byte of data.
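One way to probe that hypothesis on any machine is a deliberately low compute-intensity kernel: once shared memory bandwidth saturates, extra threads stop helping. A hedged sketch (my own, OpenMP; compile with -fopenmp and time it at increasing OMP_NUM_THREADS):

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define N (1u << 26)  /* 64 Mi elements: far beyond any cache */

int main(void) {
    uint64_t *a = calloc(N, sizeof *a);
    uint64_t *b = calloc(N, sizeof *b);
    if (!a || !b) return 1;
    uint64_t sum = 0;
    /* ~1 add per 16 bytes streamed: bandwidth-bound, not compute-bound.
       Scaling flattens once the memory controller saturates. */
    #pragma omp parallel for reduction(+:sum)
    for (uint32_t i = 0; i < N; i++)
        sum += a[i] + b[i];
    printf("%llu\n", (unsigned long long)sum);
    free(a); free(b);
    return 0;
}
```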
Makes sense: while you want to optimize the tech for specific target audiences, you still want to reuse as many blocks as possible. With RDNA deviating significantly from GCN, on which CDNA continued to build, that was no longer possible on the GPU side. It sounds like RDNA 5 and CDNA 4 will reset to a more common base.
For Zen I don't expect server/desktop/mobile to ever deviate to the degree RDNA and CDNA did.
Yeah, it sounds like they did all the work to make it possible to use both pipes in a single thread (why else mention that they can work out of order and can fetch ahead), and then at the end: "BTW, a single thread can only use one pipe." I am not an expert by any means, but this sounds like they did not manage to utilize both decode pipes in a single thread even though they tried. I would then predict this being fixed in Zen 6 (disclaimer: do not believe random predictions on the internet).
A processor employs a plurality of fetch and decode pipelines by dividing an instruction stream into instruction blocks with identified boundaries. The processor includes a branch predictor ...
8. The method of claim 1, further comprising: generating a plurality of decoded instructions at a plurality of fetch and decode pipelines based on a plurality of branch predictions including the first branch prediction; and reordering the plurality of decoded instructions after the plurality of decoded instructions are generated, the reordering based on a program sequence identified at the processor.
9. The method of claim 1, further comprising: identifying a starting point for the first fetch stream based on an instruction map indicating endpoints for one or more variable-length instructions.
10. The method of claim 1, further comprising: selecting a second fetch and decode pipeline of the processor based on a second branch prediction; and fetching and decoding instructions of a second fetch stream associated with the first branch prediction at the selected second fetch and decode pipeline.
11. The method of claim 10, wherein fetching and decoding the instructions of the second fetch stream comprises fetching and decoding instructions of the second fetch stream concurrently with fetching and decoding instructions of the first fetch stream at the first fetch and decode pipeline.
12. A method comprising: identifying an end of a fetch stream based on an instruction map indicating endpoints for one or more variable-length instructions; selecting a first fetch and decode pipeline of the processor; and fetching and decoding instructions of the first fetch stream at the selected first fetch and decode pipeline.
13. The method of claim 12, further comprising: identifying a second fetch window based on an end of the first fetch window and based on the instruction map; selecting a second fetch and decode pipeline of the processor; and fetching and decoding instructions of the second fetch window at the selected second fetch and decode pipeline.
14. A processor comprising: a branch predictor to generate a first branch prediction; a first fetch and decode pipeline; a control module to select the first fetch and decode pipeline based on the first branch prediction and based on instruction flow criteria; and wherein the selected first fetch and decode pipeline is to fetch and decode instructions of a first fetch stream associated with the first branch prediction.
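Claims 9 and 12 turn on that "instruction map". My reading (an interpretation, not the patent's stated implementation) is a per-byte boundary bit-vector over a fetch window, so a second pipe can start decoding at a known instruction start instead of serially re-deriving x86 instruction lengths:

```c
#include <stdint.h>

/* Toy "instruction map": one bit per byte of a 64-byte fetch window,
   set wherever an instruction ends. */
typedef struct { uint64_t end_bits; } insn_map_t;

/* First instruction start at or after byte 'pos' according to the
   map; returns -1 when no boundary is recorded there. */
int next_start(const insn_map_t *m, int pos) {
    if (pos < 0 || pos >= 64) return -1;
    uint64_t later_ends = m->end_bits >> pos;
    if (!later_ends) return -1;
    return pos + __builtin_ctzll(later_ends) + 1; /* byte after an end */
}
```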
From the SOG:
Up to two of these blocks can be independently fetched every cycle to feed the decode unit's two decode pipes. Instruction bytes from different basic blocks can be fetched and sent out-of-order to the 2 decode pipes, enabling instruction fetch-ahead which can hide latencies for TLB misses, Icache misses, and instruction decode.
I believe the actual way they want it to work is this, as seen in the patent
employing a plurality of fetch and decode pipelines by dividing an instruction stream into instruction blocks (sometimes referred to as fetch streams) with identified boundaries. ... predicted branches provide known addresses of end and start blocks of sequentially ordered instructions (referred to herein as sequential fetch streams). Using these known boundaries, the processor provides different sequential fetch streams to different ones of the plurality of fetch and decode states, which concurrently process (fetch and decode) the instructions of the different fetch streams, thereby improving overall instruction throughput at the processor.
The processor employs the memory map to divide an instruction stream into fetch streams, and selects one of the plurality of fetch and decode pipelines to process each fetch stream.
My summary of their design goal, from this patent:
Build the entire instruction stream across two basic blocks, divide it into different fetch streams, assign each stream to one of the fetch-and-decode pipelines, and reorder the decoded instructions later.
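Put as a sketch (purely my illustration of the patent's intent, not AMD's implementation): the branch predictor supplies stream boundaries, streams are steered to alternating decode pipes, and a program-order sequence tag lets the decoded output be merged back together afterwards.

```c
#include <stdio.h>
#include <stdint.h>

/* One fetch stream: a run of sequential instructions whose start/end
   the branch predictor identified (it ends at a predicted-taken
   branch). Addresses below are made up. */
typedef struct { uint32_t seq; uint64_t start, end; } stream_t;

enum { PIPES = 2 };

int main(void) {
    stream_t s[] = {
        {0, 0x1000, 0x1010}, {1, 0x2000, 0x2020},
        {2, 0x3000, 0x3008}, {3, 0x1000, 0x1010},
    };
    for (int i = 0; i < 4; i++)
        printf("stream %u [%llx..%llx] -> pipe %u\n",
               s[i].seq,
               (unsigned long long)s[i].start,
               (unsigned long long)s[i].end,
               s[i].seq % PIPES);
    /* Because each stream carries its seq tag, decoded MOPs can be
       merged back into program order before the op queue, regardless
       of which pipe finished first. */
    return 0;
}
```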
Not sure where they failed; apparently they could do neither.
Is it because they cannot divide the instruction stream into fetch streams and assign them to different decode clusters due to the variable-length instructions?
At the very least they could reorder the decoded instructions after the entire basic block has been decoded; does that bring no benefit, so it isn't applied to ST either?
#1 is from the patent application.
#2 is what the C&C article was talking about.
The BP can already do 2 taken branches/cycle, which is great (in the "daydream" category according to the C&C article from last year), but they cannot leverage it because the decoded instructions can't be merged before going to the op queue.
Finally, there’s the daydream category. .... Maybe Zen 5 can fetch across basic blocks in the common case instead of using a loop buffer or micro-BTB as Intel and Arm did. That likely requires a dual-ported instruction cache or micro-op cache alongside a large BTB capable of delivering two branch targets per cycle. Zen 5 would also need circuitry to merge two fetch blocks into a buffer that downstream stages can consume. I think implementing such a strategy makes little sense. It’d only help in high IPC code bound by frontend throughput. Frontend latency due to instruction cache misses is a bigger issue.