This also shows up in the iGPU. Adding an additional WGP and increasing the clocks makes sense if you are anticipating lots of bandwidth. It's barely higher than Hawk, and iGPU performance, especially for non-compute-bound tasks, is already showing to be only modestly better.

Thinking about the Zen5c CCX on Strix Point, that's the least amount of L3 cache per thread since Lucienne. Yes, the L2 is twice as large and still exclusive, so the apparent L2/L3 cache is 33% larger, but Zen5c, even restricted to ~3.6 GHz, is much higher performance than Zen2 at up to 4-4.5 GHz. This becomes an even bigger issue with AVX-512 code, as that can quickly become memory-throughput bound.
All of that is to say that the Zen5c CCX is very memory bandwidth starved and will be hitting the memory controller and snooping the other CCX a lot. This seems like a very hamstrung design unless the memory controller has gobs of bandwidth available at low latency. A MALL cache would have made this make sense. The current design just exacerbates existing problems with limited bandwidth.
This leads me to believe that AMD changed course hard after they had frozen critical aspects of Strix Point's design, like the rumored NPU expansion that removed the MALL. They wanted to hit a marketing goal of having a processor that looked like Alder Lake/Raptor Lake, but things went sideways along the way and they didn't course correct well.
I'm amazed that this comment got jumped on (-7 downvotes?).
I made that comment in the context of a few people discussing how some of Zen5's architectural changes seemed questionable (which could also be said of Bulldozer).
There are very few people in the media/online who can conduct proper SPEC tests and David is one of those people.
Edit, if I'm interpreting this right, the NPU doesn't take up that much die area. About as much as 4 Zen5c or 8 RDNA 3.5 CUs.
Not really, notebooktech is just measuring total system power, not the package power during ST workloads.

But this also shows that Zen5 needs more juice compared to Zen4. The advantage at low wattages is pretty meh for 50% more threads. It only starts getting decent at power levels above Zen4's sweet spot. I'm 80% sure we will see that every Zen5 desktop SKU is slower than its predecessor at low wattages. Similar to Igor's leaks, where the Zen5 ES was beaten by the 7950X pretty much everywhere below 100W. Zen5 needs juice to run properly.
Not sure, I'm no expert and I don't work in the industry. I just like reading about CPU architecture.

So ideally, there should be 4 decoders, to handle the case of two branching pathways and the instruction to be executed after the branch is entered.
How small does it have to be to be acceptable? If it's too small, the NPU wouldn't be useful for anything, and then it would really be a waste. I understand being skeptical of Microsoft's 40 TOPS mandate, but 10% of the die area is not that bad.

What? That is TOO MUCH IMO.
What a waste.
So how would the 2nd decoder be used? It would decode the instruction immediately after the one being decoded by the first decoder?
As I understand from the Tremont discussion, the second decoder should decode the second path of a branch.
I don't think that is how it works; it doesn't seem to decode both ways of a branch, which is what this would amount to. That approach would likely harm power efficiency.
Ian said that Zen 5c would see a much smaller drop in frequencies compared to what Z4c had relative to the classic cores.
Interesting.
[...] AMD said this during the post-tech-day briefing in the context of Strix Point's N4P Zen 5 core and N4P Zen 5c core sizes, the latter being just 25% smaller than the former. (Whereas the difference was 35% between Phoenix 2's Zen 4 and Zen 4c cores.) These lowered space savings are because AMD had to work within stricter constraints WRT voltages and clocks in order to make the compact cores work as a support to the classic cores the way AMD wanted them. [...]
If I am looking at this right, they did manage to compact the core logic by quite some degree, but, as is to be expected, L2$ and L3$ diminish the overall area savings a lot. (The 8 MB L3$ of the Zen 5c CCX looks even less dense than the 16 MB L3$ of the Zen 5 CCX, but this is surely because of twice the number of cores to connect in the former.) The f_max which Strix Point's compact cores are designed for is probably OK for the limited TDPs this part is targeted at, and higher TDPs will really need to be covered by Fire Range...
Didn't see that, nice catch. Sounds like AMD wanted the dense cores to be more efficient at higher frequencies due to the fact that they are in their mainstream mobile lineup now.

No. AMD said this during the post-tech-day briefing in the context of Strix Point's N4P Zen 5 core and N4P Zen 5c core sizes, the latter being just 25% smaller than the former. (Whereas the difference was 35% between Phoenix 2's Zen 4 and Zen 4c cores.) These lowered space savings are because AMD had to work within stricter constraints WRT voltages and clocks in order to make the compact cores work as a support to the classic cores the way AMD wanted them.
(Source: Computerbase's article on this briefing. Also take Tom's Hardware's interview with Mike Clark in which the Zen 5c design targets are discussed in the context of heterogeneous CPUs, read: Strix Point. PS, I haven't watched the Cutress X Cozma video.)
On taken branches (technically speaking, when the predictor predicts that a branch is going to be taken), the code jumps to a known address. That is where the second decoder cluster can immediately start working in parallel with the first decoder cluster.
In the future, they may add some predecode that would add instruction-start marking into the L1 cache or into fetch or something, which could enhance this capability to be used more often. We don't know yet.
That's a workable approach for a CPU that doesn't have a MOP cache, like Intel's E-cores. But with a quite large MOP cache, branches should mainly hit the op cache once they have previously been executed and decoded. I can't see much decoding improvement from having such branches target the second decoder on CPUs with a MOP cache.
It's more about fetch than decode (but decode is a nice bonus.)
It means that you can potentially deliver uops from a second taken branch immediately without waiting to redirect fetch. That is useful. Not groundbreaking necessarily, but useful.
I have seen multiple people ask why not predict both branches but I think even the simplest loop that executes say a million times shows why that is a bad idea.
Yes, if they are taking both branches it has to be with very specific prediction history. In the general case, pursuing both branches is a bad idea.

We don't know how AMD has implemented their dual decoding. They might very well decode both branches in cases where they don't accurately know which branch to take. In that case, max decode ability is still 4 for a single thread, and the performance increase comes from situations when prediction accuracy is low and they can recover from a misprediction faster.
I thought that fetching uops has to happen in program order. The second taken branch can fetch instruction data and decode it, but uop delivery has to happen in program order, so decoded instructions from the two decoders are combined in program order before being fed onward.
if (someLikelyCondition)
    doStuff;
doMoreStuff;         // as soon as this is dispatched, uops from doEvenMoreStuff can start to be dispatched, as they've been fetched/decoded in parallel
if (someOtherLikelyCondition)
    doEvenMoreStuff; // no bubble for fetch/decode - fetch/decode started at the same time as doStuff did
If this were out-of-the-box performance, or at the very least something that's achievable by a simple action like enabling PBO (one that would not in turn decrease single-core clocks and performance, as was the case with the 7000 series), not requiring meddling around with curves, voltages and whatnot, then I guess it would be something to consider after all. How much do you reckon one could sell a 7950X for?
Getting this result on an air cooler is impressive. 13th/14th gen needs an AIO.
So this means the HX 370 is more than double the performance of the 155H at 15W, and probably almost double the 185H too. Good luck to Lunar Lake catching up with its advertised 50% improvement.

Power/perf scaling according to NBC.
AMD Zen 5 Strix Point CPU analysis - Ryzen AI 9 HX 370 versus Intel Core Ultra, Apple M3 and Qualcomm Snapdragon X Elite
Notebookcheck's analysis of the new Zen 5 processor AMD Ryzen AI 9 HX 370 compared with the Intel Core Ultra, Apple M3 Pro and Qualcomm Snapdragon X Elite. www.notebookcheck.com
to save area