> As for why Zen 5 SIMD is memory-bottlenecked, I'll quote the y-cruncher author:
The biggest bottleneck is L3 bandwidth. 32B/clk is simply not enough for full AVX-512 throughput. The 32MB capacity is good enough for 8 cores running non-streaming AVX-512 workloads.
> I am not sure if this has been mentioned yet in this thread or somewhere else, but it looks like there is a weird account bug in Windows 11 which can decrease gaming performance with Zen 4 and 5, at least so far. Apparently when using the built-in administrator account, performance is slightly better. They mention that it may be possible this is also present in other CPUs, perhaps Zen 3 / Intel etc., but don't know yet.
> I am wondering, does anyone know more about this, and is this just for Windows 11 or is it present in 10 as well?
Super admin privilege is required to allocate extra memory.
Hardware Unboxed mentioned, I think in a Twitter post, that some of their Discord members were reporting the same thing in Windows 10 as well.
This is probably not a Windows bug but a "feature/bug" of AMD's CPU testing, where they always enabled this mode.
> There should be a core latency patch by the end of August, likely released with the new X870/B860 chipsets. Typical AMD, releasing software after the hardware launch. LOL.
Do you have a source for this?
> MLID didn't support his claim at all, so why is anyone defending it?
The question is, should he really?
Who, MLID? No, he's a leaker. He just says stuff and asks you to believe it (or not). Not supporting his leaks is his job, and it's on everyone else to be skeptical of his claims rather than claim that they're somehow backed by veracity (or anything else).
Their childish whining is really grating.
> Edit: Now what I am wondering is whether this wasn't supposed to be a Genoa feature to begin with, or if it couldn't be made to work with Genoa's CCDs or with Genoa's IOD, or…?
I wonder whether it's technically different from the L3 allocation that assigns (MMIO?) memory address space to the L3 partition, and which is available to users since... (don't remember which Zen gen)?
> Their childish whining is really grating.
They forgot to add "2% faster than a.. umm... 7800X3D"
Although I do think that we should summarise what we know about Zen 5 at this point. So much info has been flying around, we should recollect what we know.
> > Super admin privilege is required to allocate extra memory.
> Not "extra memory" but more efficient paging. Normally, all memory on x86 Windows is accessed through 4kB pages. If you want to access 4GB of space, you need to set up a million PTEs, which is a problem because the TLB can only cache a few thousand. The hardware also supports 2MB pages, which are a lot more reasonable, and on Zen they use the same TLB entries, so the TLB can cover 6GB on Zen 4.
> IIRC large page support was previously available in Windows, but a horrible bug was found in it (because very few people actually used it), at which point it was moved to admin only. Linux supports not only normal large pages but also transparent huge pages, meaning that if it's turned on, software that was not designed for large pages can make use of them.
Does allocating huge pages using rebar require the same PVL?
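For anyone who wants to check the paging arithmetic in the quoted post, here is a minimal sketch. The 3072-entry TLB size is an assumption picked so that 2MB pages give the quoted 6GB of coverage; it is not taken from any AMD documentation cited in this thread, and the helper names are made up for illustration.

```python
# Back-of-the-envelope TLB reach, using the figures from the post above.
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(region_bytes: int, page_bytes: int) -> int:
    """Number of page-table entries needed to map a region (rounded up)."""
    return -(-region_bytes // page_bytes)


def tlb_reach(entries: int, page_bytes: int) -> int:
    """Bytes of address space that a TLB with `entries` slots can cover."""
    return entries * page_bytes


# Assumption: 3072 entries, chosen to match the 6GB figure quoted above.
TLB_ENTRIES = 3072

if __name__ == "__main__":
    print(pages_needed(4 * GIB, 4 * KIB))          # PTEs for 4GB with 4kB pages
    print(pages_needed(4 * GIB, 2 * MIB))          # PTEs for 4GB with 2MB pages
    print(tlb_reach(TLB_ENTRIES, 4 * KIB) / MIB)   # coverage in MiB, 4kB pages
    print(tlb_reach(TLB_ENTRIES, 2 * MIB) / GIB)   # coverage in GiB, 2MB pages
```

With 4kB pages the same number of TLB entries covers only a few MiB, which is why workloads with a large working set spend so much time on page walks unless large pages are available.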
> The problem with high core count parts on mainstream platforms is that they raise the power delivery requirements for motherboard manufacturers, because the CPU needs relatively high all-core boost clocks to make sense to begin with. That in turn means the general public would need to pay more for motherboards to essentially subsidise this small portion of the desktop market, which is a relatively small market on its own. That is why I think AMD is reluctant to increase core counts on desktop: not only would such an SKU serve a relatively small niche of workloads that scale to high core counts AND don't need memory bandwidth, it would also require everyone else to pay for it.
OK, that's a valid point. But couldn't AMD support the higher core count (supposedly 24C/48T or 28C/56T) CPU only on the X670E/X870E mobos? Those are expensive to begin with, so they should have the necessary power delivery components already in place. Not everyone is paying $400 for a mobo, but those who do should get something in return for their dollars, such as support for higher core counts.
IMO any board capable of running 16C should be able to run 24C… just at lower clocks.
They already do get something for their dollars.
> I wonder whether it's technically different from the L3 allocation that assigns (MMIO?) memory address space to the L3 partition, and which is available to users since... (don't remember which Zen gen)?
Hmm, do you have a pointer to a description of this functionality? Was this one for CPU-initiated accesses perhaps? SDCI is for device-initiated accesses, the device being the writer.
Hmm interesting. So possibly just better performance in admin mode due to more memory access?
> The y-cruncher example is essentially a worst-case scenario. It's also next to impossible to achieve. In reality, MOST, but not all, AVX-512 workloads that are not purely synthetic will not be streaming the maximum amount of data continuously. They will digest chunks, manipulate them, test the results, then store the results of the manipulation or the findings of the test, then either wait on the non-AVX-512 portion of the code to do things, or move on to the next chunk of data.
I think you misunderstood. The quote was saying that this "digest chunk, manipulate it, test the results" part has to be extremely large between loads to avoid the memory bandwidth bottleneck if you want to load all cores, which makes it impractical. Once again, I underline all cores. Sure, you can find workloads that are single-threaded by nature or that for one reason or another won't hit the memory bottleneck. But that doesn't mean the memory bottleneck doesn't exist. If the memory BW were sufficient, then we could start to talk about the backend bottleneck etc. The thing is that the core has a much more capable backend than the memory BW available to it.
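The "how much work per loaded byte" argument can be put into a simple roofline-style calculation. This is only a sketch: the peak compute and bandwidth numbers below are hypothetical placeholders for illustration, not measured Zen 5 figures, and the function names are made up.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float,
                      flops_per_byte: float) -> float:
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


def break_even_intensity(peak_gflops: float, bandwidth_gbs: float) -> float:
    """FLOPs needed per byte loaded before the cores stop waiting on memory."""
    return peak_gflops / bandwidth_gbs


# Hypothetical all-core numbers, chosen only to make the arithmetic easy.
PEAK_GFLOPS = 1024.0    # assumed all-core AVX-512 peak
BANDWIDTH_GBS = 64.0    # assumed sustained memory bandwidth

if __name__ == "__main__":
    # Work per byte required to keep every core busy.
    print(break_even_intensity(PEAK_GFLOPS, BANDWIDTH_GBS))
    # A streaming kernel doing little work per byte is bandwidth-bound.
    print(attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, 0.25))
    # A kernel doing a lot of work per byte reaches the compute peak.
    print(attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, 32.0))
```

The wider the gap between peak compute and bandwidth, the more work each chunk has to carry between loads, which is the point being argued about all-core loads above.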
> 32MB of L3 for 8 cores is plenty for most tasks, and represents as much or more L3 per core than any Intel AVX-512-enabled product ever produced. The X3D parts will have 3x that amount. Yes, main memory bandwidth is limiting in synthetic or academic scenarios, but it isn't the end of the story.
You forget that L3 is a victim cache on both Skylake-X and Zen architectures. That means you cannot prefetch into it. If your algorithm doesn't reuse the memory locations that were evicted from L2 to L3, then the importance of L3 is reduced. Once again, it depends on the algorithm in question.
> Biggest bottleneck is L3 bandwidth. Simply 32B/clk is not enough for full AVX512 throughput. The capacity of 32MB is good enough for 8 cores AVX512 non-streaming workloads.
Because of the above, with L3 being a victim cache of L2, the biggest bottleneck depends on the algorithm. That's why 32MB might be good enough or might be too little. Statements like this are a bit too general and lose the nuance of the problem. For streaming workloads the GMI link bandwidth is the problem, as it's lower than the L2-to-L1 bandwidth. Once you equalize them, the L2 becomes the bottleneck, and so on. If you have a non-streaming workload, then the size of your working set and how you access the data will determine whether 32MB is good enough.
I see HUB is milking the Zen 5 release to the fullest. If I am not mistaken, their own video was the source of the "admin rights give Zen CPUs a boost" story, and now they will do another video to discard it as something Zen-specific [which is funny, as I saw other reviewers doing similar tests and showing that Intel was largely unaffected, but this is beside the point]. At the same time, they had the reviewer's guide on hand, which claims the gaming uplift is <= 5%. It would be more useful if they asked AMD why the reviewer's guide doesn't agree with the promotional material and then did a video about that. I mean, they know that gaming performance won't improve no matter what weird trick they try next; it only paints the release in an even worse light for more clicks... And except for one outlet, I still haven't seen anyone benchmark whether core parking does anything for performance. But maybe, as someone already suggested, HUB will follow up with a video, "Zen 5 gaming performance doesn't improve under a full moon..."
> OK, that's a valid point. But couldn't AMD support the higher core count (supposedly 24C/48T or 28C/56T) CPU on only the X670E/X870E mobos? Those are expensive to begin with so they should have the necessary power delivery components already in place. Not everyone is paying $400 for a mobo but those that do, they should get something in return for their dollars, such as support for higher core counts.
I can already see the outcry on social media about artificial segmentation to milk customers.
> Will just running games in admin mode (clicking "Run as administrator") help, or do you need to run them from that hidden super admin account?
It's sufficient.
> I think you have it backwards. The high-end Intel chips will be made with 20A silicon. I think the low-end Intel offerings are any CPUs below what we know as i3 CPUs. Those could be made with TSMC silicon.
> Based on how silicon has been measured historically, Intel 20A silicon is essentially 5nm. I didn't say it. That is what has been published all over the web for several years. Intel has said for many years that their silicon offers much more density than TSMC silicon. Intel is halving their process node from 10nm to 5nm. That is a huge jump in size compared to TSMC going from 7nm to 5nm to 3nm.
> There is no comparing the 14th generation to Arrow Lake 20A. The performance uplift and power efficiency gains may put it ahead of what N4P has done for Zen 5. That is what a lot of people have ignored. Many assume that Arrow Lake is Raptor Lake's next act. The reality is totally new silicon with a different architecture scheme.
> People who do not take sides in the AMD vs Intel battle have been waiting for Arrow Lake because of the new silicon node. Like me, they want to see what it can do. The upcoming Arrow Lake CPUs are said to range from 65W TDP to 150W TDP for the highest-end CPUs. I have heard the non-K series CPUs will be 65W up to at least Intel 7 series.
> I said before Zen 5 was released that a Zen 5+ with N3P would be necessary because of 20A and 18A further down the road from Intel.
I think it's best to just always use the correct node names to avoid confusion.