2990WX review thread - It's live!


StefanR5R

Elite Member
Dec 10, 2016
5,690
8,263
136
About AnandTech's review:
Ian Cutress said:
you will notice that our 2990WX sample never goes near the 250W rated TDP of the processor, actually barely hitting 180W at times. We are unsure why this is.
I have a possible explanation:
Ian Cutress said:
Most of the testing data is with the Liqtech 240 liquid cooler, rated at 500W.
Enermax's rating of their AIOs is utter nonsense. If you try to push 500 W of thermal load through a slim 360, slim 280, let alone slim 240 mm radiator, it's not only going to be very noisy, the loop will also become very hot. So hot that AMD's boost will put a lid on the clocks, voltage, and power.

Unless Ian tested outdoors in Antarctica.

Edit: a slim 240 radiator is inadequate even for the 2990WX's stock TDP of 250 W, IMO.
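For a rough sanity check, here's a back-of-envelope sketch; the ~0.05 K/W effective liquid-to-air thermal resistance for a slim 240 mm radiator at moderate fan speed is my assumption, not a measured figure:

```python
# Back-of-envelope coolant temperature estimate for an AIO loop.
# Assumptions (mine, not measured): a slim 240 mm radiator at moderate
# fan speed has an effective liquid-to-air thermal resistance of
# roughly 0.05 K/W; ambient air is 25 C.

AMBIENT_C = 25.0
RADIATOR_K_PER_W = 0.05  # assumed effective thermal resistance

for load_w in (180, 250, 500):
    coolant_c = AMBIENT_C + load_w * RADIATOR_K_PER_W
    print(f"{load_w:>3} W load -> ~{coolant_c:.0f} C coolant")

# ~500 W pushes the loop toward ~50 C coolant before you even add the
# CPU's own thermal resistance on top, which is exactly the regime
# where AMD's boost algorithm starts pulling clocks and voltage back.
```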
 
Last edited:

24601

Golden Member
Jun 10, 2007
1,683
39
86
What could they do next gen to alleviate the memory issues with the 2990WX?
Aside from different topologies, would adding a large L4 help much?

Nothing is going to help traversing 2 layers of glorified PCI-E 3.0 x16 links for your CPU to do the groundbreaking, revolutionary task of conversing with its own main memory.
 

Schmide

Diamond Member
Mar 7, 2002
5,590
724
126
Ok, assuming AT properly tested the connections and not the whole uncore...
Now, IF is worse than Intel's mesh... the 16-core consumes slightly more power than Intel's 18-core at max... but the real story is MINIMUM connection power consumption... it is considerably higher on AMD's IF than on Intel's mesh... and nearly an order of magnitude worse when comparing the Intel ring bus to Pinnacle Ridge IF... (baseline).

You're comparing power consumption between two separate entities over different distances. AMD will never be able to compete on that basis. It's a total tradeoff.

Power consumption looks reasonably equivalent between mesh and IF when using AMD 2 dice Vs 18 monolithic intel die...but connecting 4 dice is a massive amount of power for either Epyc or 2990wx...using that much percentage of power is not ideal.

It's really not that bad considering what AMD was planning when they designed it: over-provisioning. And sure, running point to point is going to be less of a burden. That's the nature of pathways.

I'm saying that using IF to connect multiple small 4-core CCXs for desktop is a bad situation long term, and even worse when using multiple IF links to connect 4 of those dice together for Threadripper.
They would be better off moving to a new, more efficient topology for a larger CCX, better connections between CCXs, and finally some kind of active interposer with an advanced topology for connecting dice.
Most people seem to think a large ring-bus CCX + active interposer + butter donut is a better long-term solution.

Why are you trying to make this into Intel? It's not.

The CCX is not getting any bigger. When you work out the combinatorics, the figures come back to 4 cores, or 4C2 = 6 links.
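For reference, that 4C2 figure is just the point-to-point link count; a quick sketch:

```python
from math import comb

# Links needed to fully connect n nodes point-to-point: n choose 2.
for n in (2, 4, 8):
    print(f"{n} nodes -> {comb(n, 2)} links for full connectivity")

# 4 dice fully connected (as on EPYC/2990WX) takes 4C2 = 6 IF links;
# doubling the node count roughly quadruples the wiring, which is why
# larger CCX/die counts push toward rings, meshes, or fancier
# topologies (e.g. the butter donut mentioned above).
```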

Regardless, all this should be in the Speculation thread.
 
Last edited:

Shivansps

Diamond Member
Sep 11, 2013
3,873
1,527
136
Does anyone know why Linux already supports NUMA nodes without local memory? This has to be the reason why it performs so badly on Windows compared to Linux.

Just guessing here, but maybe they decided to improve performance for EPYCs without all 8 memory channels populated?
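As a side note, on Linux you can see these memory-less nodes directly in sysfs; a minimal sketch (Linux-only; the paths are the standard sysfs layout):

```python
import glob
import re

# Enumerate NUMA nodes via sysfs and flag nodes with no local memory
# (the 2990WX exposes two such nodes: dies with cores but no DRAM).
for path in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    node = path.rsplit("node", 1)[-1]
    with open(f"{path}/meminfo") as f:
        m = re.search(r"MemTotal:\s+(\d+) kB", f.read())
    total_kb = int(m.group(1)) if m else 0
    tag = "memory-less" if total_kb == 0 else f"{total_kb // 1024} MB local"
    print(f"node {node}: {tag}")
```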
 


Timorous

Golden Member
Oct 27, 2008
1,727
3,152
136
Hardware Unboxed explained it in a way I finally understood:
https://youtu.be/QI9sMfWmCsk?t=3m49s

And how that impacts memory bandwidth:
https://youtu.be/QI9sMfWmCsk?t=17m3s

I am not so sure about that considering how much better it does in Linux. It makes me think the issue is with Windows rather than the CPU memory configuration. I am sure you could find some very specific workloads that do not scale because they can saturate the memory bandwidth when using 16 or fewer cores but I am not sure how prevalent they are.
 

BigDaveX

Senior member
Jun 12, 2014
440
216
116
Don't see why. I did read a lot of posts about how TR2 will suck at gaming, but... see the first post in this thread from mark:

Gaming-wise, the chip is still a beast. I plan on jumping on the 2920X; it should be near the same level of performance.
Tech Report's review shows that while frame rates on the 2990WX might not be completely catastrophic (though they're certainly on the low side), there are some pretty nasty frame latency spikes.

All that aside, seeing the Linux benchmarks makes me wonder whether Windows 10's NUMA optimizations might still not be entirely up to snuff. IIRC, Linux has had full NUMA support baked into the kernel since the time of the original Opteron's release 15 years ago, whereas Windows didn't add it until Vista, and even then it barely worked from what I remember.
 

french toast

Senior member
Feb 22, 2017
988
825
136
It will just increase latency, because now they need to look up the L4 before RAM.
Only on large data sets, though, surely? If you had a large chunk of L4 with good bandwidth, even 256 MB, that would cut a lot of traffic to main memory and improve performance of some workloads by decreasing latency.
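The tradeoff being debated here is just average memory access time; a sketch with assumed, purely illustrative latencies (not measurements of any real part):

```python
# Average memory access time with and without an L4, using assumed
# illustrative numbers: the L4 lookup costs ~30 ns and is paid on
# every access; DRAM behind IF costs ~110 ns on an L4 miss.

DRAM_NS = 110.0   # assumed DRAM latency over IF
L4_NS = 30.0      # assumed L4 hit latency / lookup cost

def amat(l4_hit_rate: float) -> float:
    """Effective latency: pay the L4 lookup always, DRAM on a miss."""
    return L4_NS + (1.0 - l4_hit_rate) * DRAM_NS

for hit in (0.0, 0.3, 0.6, 0.9):
    print(f"L4 hit rate {hit:.0%}: ~{amat(hit):.0f} ns "
          f"vs {DRAM_NS:.0f} ns without L4")

# With these numbers, below a ~27% hit rate the L4 is a net loss;
# above it, a win. That is the "only on large data sets" argument.
```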
 

LightningZ71

Golden Member
Mar 10, 2017
1,661
1,945
136
Where an L4 cache would help is on each die. It would need to be at least 8x (preferably 16x) as large as the combined L3 cache size on each die (to account for the 4-die configuration) and would definitely add another few cycles of latency for memory access. So, that's 128 MB per die minimum and preferably 256 MB. That's a LOT of die space. However, for a 4-die TR2, that's an L4 cache of 512 MB to 1 GB in size. To make that work effectively, you'd need to increase inter-die bandwidth significantly. With the advent of PCI-E 4.0 and 5.0, though, that base clocking approach can help solve this (IF borrows a lot from it, it seems).

So, in the long run, I think that we'll eventually see an L4 cache on the Zen die in future iterations. I also expect that, eventually, there will be a split in the lines, with Epyc and the TR-WX series getting dies that are custom made for them and may be able to filter down to the HEDT market in another performance tier.
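Spelling out that sizing arithmetic (Zeppelin's 2 x 8 MB of L3 per die is the real figure; the 8x/16x multipliers are my rule of thumb above):

```python
# L4 sizing from the rule of thumb: 8-16x the per-die L3.
L3_PER_DIE_MB = 2 * 8   # Zeppelin: two CCXs, 8 MB of L3 each
DIES = 4                # 2990WX / EPYC package

for multiple in (8, 16):
    per_die = L3_PER_DIE_MB * multiple
    print(f"{multiple}x L3 -> {per_die} MB L4 per die, "
          f"{per_die * DIES} MB across {DIES} dies")

# 8x  -> 128 MB/die,  512 MB/package
# 16x -> 256 MB/die, 1024 MB/package
```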
 
Reactions: french toast

french toast

Senior member
Feb 22, 2017
988
825
136
Where an L4 cache would help is on each die. It would need to be at least 8x (preferably 16x) as large as the combined L3 cache size on each die (to account for the 4-die configuration) and would definitely add another few cycles of latency for memory access. So, that's 128 MB per die minimum and preferably 256 MB. That's a LOT of die space. However, for a 4-die TR2, that's an L4 cache of 512 MB to 1 GB in size. To make that work effectively, you'd need to increase inter-die bandwidth significantly. With the advent of PCI-E 4.0 and 5.0, though, that base clocking approach can help solve this (IF borrows a lot from it, it seems).

So, in the long run, I think that we'll eventually see an L4 cache on the Zen die in future iterations. I also expect that, eventually, there will be a split in the lines, with Epyc and the TR-WX series getting dies that are custom made for them and may be able to filter down to the HEDT market in another performance tier.
Even if it was on-die, that would presumably only add latency in certain workloads; most of the time it would lower latency, no?
There is going to be a hard limit on core counts in a few years; then bigger caches will be here to stay.
 


wahdangun

Golden Member
Feb 3, 2011
1,007
148
106
Only on large data sets, though, surely? If you had a large chunk of L4 with good bandwidth, even 256 MB, that would cut a lot of traffic to main memory and improve performance of some workloads by decreasing latency.
16 GB of HBM2 as L4, squatting between all those dies, would likely help a LOT. But the cost won't cut it yet. This year, anyway, it's a unicorn.


TR already has plenty of cache, and it searches each L3 cache before going to RAM, so adding another level of cache isn't worth it; it would just increase the cost and latency without a meaningful increase in performance. What AMD needs to do is increase the IF frequency, but with the current power consumption, I guess AMD really needs that 7nm.
 

StefanR5R

Elite Member
Dec 10, 2016
5,690
8,263
136
Does anyone know why Linux already supports NUMA nodes without local memory?
I am not sure, but it occurs to me that this case should be handled by any NUMA implementation from day one, because it is just a variation of the case where the memory of one node is fully allocated while another node still has free memory.
 

LightningZ71

Golden Member
Mar 10, 2017
1,661
1,945
136
TR already has plenty of cache, and it searches each L3 cache before going to RAM, so adding another level of cache isn't worth it; it would just increase the cost and latency without a meaningful increase in performance. What AMD needs to do is increase the IF frequency, but with the current power consumption, I guess AMD really needs that 7nm.

TR has plenty of cache for the total package, but we're looking at very specific use cases here. A large L4 cache on each die would enable each die to keep a local copy of the contents of each remote L3 cache. This would alleviate the strain on the IF links between the various dies and reduce the power demand on them as a result. It would also alleviate the issues that having distant RAM on TR-WX packages can cause. Yes, there would be invalidation and copy traffic, but in most use cases that would be a less-than-frequent event under a properly NUMA-aware OS scheduler.

Now, will it be expensive? You betcha! Will it be worth the cost at 12nm? Absolutely not. At 7nm? Maybe on 7nm+ for a hypothetical die that's targeted at EPYC/TR/HEDT AM4. I think that as AMD gets better market penetration and volume on their products, they will also have enough revenue to actively develop three different dies: a power- and size-optimized Ryzen Mobile (current Raven Ridge), a clock-optimized Ryzen Desktop (current Zeppelin) for AM4/TR-X, and a balanced Ryzen HCC for EPYC/TR-WX/HEDT AM4. Mobile and Desktop can stay at GloFo and Server can live at TSMC, each on a process node that fits its use case. With a separate HCC die, they can afford to make tweaks that address these shortcomings, such as adding extra CCXs and an L4 cache.
 
Reactions: french toast

wahdangun

Golden Member
Feb 3, 2011
1,007
148
106
TR has plenty of cache for the total package, but we're looking at very specific use cases here. A large L4 cache on each die would enable each die to keep a local copy of the contents of each remote L3 cache. This would alleviate the strain on the IF links between the various dies and reduce the power demand on them as a result. It would also alleviate the issues that having distant RAM on TR-WX packages can cause. Yes, there would be invalidation and copy traffic, but in most use cases that would be a less-than-frequent event under a properly NUMA-aware OS scheduler.

Now, will it be expensive? You betcha! Will it be worth the cost at 12nm? Absolutely not. At 7nm? Maybe on 7nm+ for a hypothetical die that's targeted at EPYC/TR/HEDT AM4. I think that as AMD gets better market penetration and volume on their products, they will also have enough revenue to actively develop three different dies: a power- and size-optimized Ryzen Mobile (current Raven Ridge), a clock-optimized Ryzen Desktop (current Zeppelin) for AM4/TR-X, and a balanced Ryzen HCC for EPYC/TR-WX/HEDT AM4. Mobile and Desktop can stay at GloFo and Server can live at TSMC, each on a process node that fits its use case. With a separate HCC die, they can afford to make tweaks that address these shortcomings, such as adding extra CCXs and an L4 cache.


But an L4 cache is always designed around bandwidth constraints, not latency; that's why Broadwell and the Xbox One used one, and once faster memory options became available (DDR4 or GDDR5, in this case) they abandoned it. And the problem with TR isn't bandwidth but latency; when DDR5 comes out it will bring much higher bandwidth, making an added L4 cache kinda useless.
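To put numbers on the bandwidth side of that argument (headline figures: dual-channel DDR3-1600 on Broadwell, quad-channel DDR4-2933 on TR2, and the commonly cited ~50 GB/s per direction for Crystal Well's eDRAM):

```python
# Peak DRAM bandwidth: channels * 8 bytes/transfer * MT/s.
def ddr_gb_s(channels: int, mt_s: int) -> float:
    return channels * 8 * mt_s * 1e6 / 1e9

broadwell_ddr3 = ddr_gb_s(2, 1600)   # dual-channel DDR3-1600
tr2_ddr4 = ddr_gb_s(4, 2933)         # quad-channel DDR4-2933
edram_gb_s = 50.0                    # Crystal Well eDRAM, per direction

print(f"Broadwell DDR3: {broadwell_ddr3:.1f} GB/s vs eDRAM {edram_gb_s:.0f} GB/s")
print(f"TR2 DDR4:       {tr2_ddr4:.1f} GB/s")

# The eDRAM roughly doubled Broadwell's memory bandwidth; quad-channel
# DDR4 already gives TR2 ~94 GB/s, so an L4 buys far less headroom there.
```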
 

richaron

Golden Member
Mar 27, 2012
1,357
329
136
Does anyone know why Linux already supports NUMA nodes without local memory? This has to be the reason why it performs so badly on Windows compared to Linux.

Just guessing here, but maybe they decided to improve performance for EPYCs without all 8 memory channels populated?
I don't think Linux is aware of the specific NUMA architecture at all.

It's been my understanding (at least over the past decade) that Linux has been better with multithreading in general (scheduling?). Even slightly superior scheduling (SMT-aware, for example) could cause a huge increase in performance on these large-core-count, funky-arrangement systems.

So I think it's entirely likely the Linux results we are seeing are due to fundamental advantages in pre-existing code. And real specific NUMA code could give even greater advantages on top of this.
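To illustrate what "NUMA code" can buy on top of generic scheduling, here is a minimal Linux-only sketch that pins the current process to node 0's CPUs by hand, the manual version of what a NUMA-aware scheduler does automatically (os.sched_setaffinity and the sysfs cpulist file are real interfaces; the rest is illustrative):

```python
import os

def cpus_of_node(node: int) -> set[int]:
    """Parse the kernel's cpulist for one NUMA node, e.g. '0-7,16-23'."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus: set[int] = set()
        for part in f.read().strip().split(","):
            if "-" in part:
                lo, hi = map(int, part.split("-"))
                cpus.update(range(lo, hi + 1))
            else:
                cpus.add(int(part))
        return cpus

# Restrict this process (pid 0 = self) to node 0's cores, so all its
# threads stay next to that node's resources instead of bouncing
# across dies.
os.sched_setaffinity(0, cpus_of_node(0))
print("now restricted to:", sorted(os.sched_getaffinity(0)))
```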
 

Shivansps

Diamond Member
Sep 11, 2013
3,873
1,527
136
I am not sure, but it occurs to me that this case should be handled by any NUMA implementation from day one, because it is just a variation of the case where the memory of one node is fully allocated while another node still has free memory.

If Windows sees a node without available memory, that may wreak havoc with Windows' virtual memory management, which is far different from how Linux manages virtual memory.

But I'm not sure; I've never looked into NUMA before.
 

french toast

Senior member
Feb 22, 2017
988
825
136
But an L4 cache is always designed around bandwidth constraints, not latency; that's why Broadwell and the Xbox One used one, and once faster memory options became available (DDR4 or GDDR5, in this case) they abandoned it. And the problem with TR isn't bandwidth but latency; when DDR5 comes out it will bring much higher bandwidth, making an added L4 cache kinda useless.
Not true.
https://www.anandtech.com/show/9320/intel-broadwell-review-i7-5775c-i5-5675c/10
You can clearly see the impact an L4 cache has on the CPU here.
 