2990WX review thread Its live !

french toast · Aug 14, 2018

What could they do next gen to alleviate the memory issues with 2990wx?
Aside from different topologies, would adding a large L4 help much?

StefanR5R · Aug 14, 2018

About AnandTech's review:

Ian Cutress said:
you will notice that our 2990WX sample never goes near the 250W rated TDP of the processor, actually barely hitting 180W at times. We are unsure why this is.

I have a possible explanation:

Ian Cutress said:
Most of the testing data is with the Liqtech 240 liquid cooler, rated at 500W.

Enermax's rating of their AIOs is utter nonsense. If you try to push 500 W thermal load through a slim 360, slim 280, let alone slim 240 mm radiator, it's not only going to be very noisy, the loop will also become very hot. So hot, that AMD's boost will put a lid on the clocks, voltage, and power.

Unless Ian tested outdoors in Antarctica.

Edit,
a slim 240 radiator is inadequate even for 2990WX's stock TDP of 250 W, IMO.

24601 · Aug 14, 2018

french toast said:
What could they do next gen to alleviate the memory issues with 2990wx?
Aside from different topologies, would adding a large L4 help much?

Nothing is going to help traversing 2 layers of glorified PCI-E 3.0 x16 links for your CPU to do the groundbreaking revolutionary task of conversing with it's own main memory.

Schmide · Aug 14, 2018

french toast said:
Ok, assuming AT did proper testing of the connections and not using whole uncore...
Now IF is worse than intel mesh...16 core consume slighly more power than intel 18 core at max...but the real story is MINIMUM connection power consumption...it is considerably more on AMD IF than on intel mesh..and nearly an order of magnitude worse when comparing intel ring bus to pinnacle ridge IF...(baseline).

You're discriminating power consumption between two separate entities over different distances. AMD will never be able to compete. It's a total tradeoff.

french toast said:
Power consumption looks reasonably equivalent between mesh and IF when using AMD 2 dice Vs 18 monolithic intel die...but connecting 4 dice is a massive amount of power for either Epyc or 2990wx...using that much percentage of power is not ideal.

It's really not that bad considering what AMD was planing when they designed it. Over provisioning. And sure running point to point is going to be less of a burden. That's the nature of pathways.

french toast said:
I'm saying using IF to connect multiple small 4 core CCX for desktop is a bad situation long term, even worse when using multiple IF links to connect 4 of those dice together for threadripper.
They would be better off moving to a new more efficient topology for a larger CCX, better connections between CCX, finally some kind of active interposer with advanced topology for connecting dice.
Most people seem to think ..large ring bus CCX + active interposer+butter donut is a better long term solution.

Why are you trying to make this into intel? It's not.

CCX is not getting any bigger. When you factor it, the factorial figures return 4 or 6 or 4c2.

Regardless all this should be in the Speculation thread

Shivansps · Aug 14, 2018

Anyone knows why Linux already supports this thing of having NUMA nodes whiout local memory? This has to be the reason of why it performs so bad on Windows compared to Linux.

Just guessing here, maybe they decided to improve performance for EPYCS whiout all 8 memory channels populated?

wahdangun · Aug 14, 2018

french toast said:
What could they do next gen to alleviate the memory issues with 2990wx?
Aside from different topologies, would adding a large L4 help much?

It will just increase latency, because now they need to look up l4 before ram.

scannall · Aug 14, 2018

wahdangun said:
It will just increase latency, because now they need to look up l4 before ram.

16 gig of HBM 2 as L4, squatting between all those dies would likely help a LOT. But the cost won't cut it yet. This year anyway it's a Unicorn.

Batmeat · Aug 14, 2018

bauerbrazil2014 said:
OMG, people are complaining about gaming performance.

Don't see why. I did read a lot of posts about how TR2 will suck at gaming,but.....first post on this thread from mark

Posted by Dookey https://www.hardocp.com/article/2018/08/13/amd_ryzen_threadripper_2990wx_2950x_cpu_review

Gaming wise, the chip is still a beast. I plan on jumping on the 2920x Should be near the same level of performance.

Timorous · Aug 15, 2018

CuriousMike said:
Hardware unboxed explained it in a way I finally understood -
https://youtu.be/QI9sMfWmCsk?t=3m49s

And how that impact memory bandwidth
https://youtu.be/QI9sMfWmCsk?t=17m3s

I am not so sure about that considering how much better it does in Linux. It makes me think the issue is with Windows rather than the CPU memory configuration. I am sure you could find some very specific workloads that do not scale because they can saturate the memory bandwidth when using 16 or fewer cores but I am not sure how prevalent they are.

BigDaveX · Aug 15, 2018

Batmeat said:
Don't see why. I did read a lot of posts about how TR2 will suck at gaming,but.....first post on this thread from mark

Gaming wise, the chip is still a beast. I plan on jumping on the 2920x Should be near the same level of performance.

Tech Report's review shows that while frame rates on the 2990WX might not be completely catastrophic (though they're certainly on the low side), there are some pretty nasty frame latency spikes.

All that aside, seeing the Linux benchmarks makes me wonder whether Windows 10's NUMA optimizations might still not be entirely up to snuff. IIRC, Linux has had full NUMA support baked into the kernel since the time of the original Opteron's release 15 years ago, whereas Windows didn't add it until Vista, and even then it barely worked from what I remember.

french toast · Aug 15, 2018

wahdangun said:
It will just increase latency, because now they need to look up l4 before ram.

Only on large data sets though surely? If you had a large chunk of L4 with good bandwidth, even 256mb...that would cut alot of traffic to main memory and improve performance of some workloads by decreasing latency.

LightningZ71 · Aug 15, 2018

Where an L4 cache would help is on each die. It would need to be at least 8x (preferrably 16x) as large as the combined L3 cache size on each die (to account for the 4 die configuration) and would definitely add another few cycles of latency for memory access. So, that's 128MB per die minimum and preferrably 256MB. That's a LOT of die space. However, for a 4 die TR2, that's an L4 cache of 512MB to 1GB in size. To make that work effectively, you'd need to increase inter-die bandwidth significantly. With the advent of PCI-E 4.0 and 5.0 though, that base clocking approach can help solve this (IF borrows a lot from it it seems).

so, in the long run, I think that we'll eventually see an L4 cache on the Zen die in future iterations. I also see that, eventually, there will be a split in the lines with Epyc and TR-WX series getting die that are custom made for them and may be able to filter down to the HEDT market in another performance tier.

LTC8K6 · Aug 15, 2018

I would guess that tweaks to software will help out in the future.

french toast · Aug 15, 2018

LightningZ71 said:
Where an L4 cache would help is on each die. It would need to be at least 8x (preferrably 16x) as large as the combined L3 cache size on each die (to account for the 4 die configuration) and would definitely add another few cycles of latency for memory access. So, that's 128MB per die minimum and preferrably 256MB. That's a LOT of die space. However, for a 4 die TR2, that's an L4 cache of 512MB to 1GB in size. To make that work effectively, you'd need to increase inter-die bandwidth significantly. With the advent of PCI-E 4.0 and 5.0 though, that base clocking approach can help solve this (IF borrows a lot from it it seems).

so, in the long run, I think that we'll eventually see an L4 cache on the Zen die in future iterations. I also see that, eventually, there will be a split in the lines with Epyc and TR-WX series getting die that are custom made for them and may be able to filter down to the HEDT market in another performance tier.

Even if it was on die, that would presumably only add latency on certain workloads, most of the time it would lower latency no?.
There is going to be a hard limit of core amounts in a few years, then bigger caches will be here to stay.

Karnak · Aug 15, 2018

https://translate.google.com/transl...wx-laeuft-mit-radeons-besser-1808-136016.html

Interesting, looks like Nvidia's driver doesn't work properly.

BigDaveX · Aug 15, 2018

Karnak said:
https://translate.google.com/translate?hl=en&sl=auto&tl=en&u=https://www.golem.de/news/32-kern-cpu-threadripper-2990wx-laeuft-mit-radeons-besser-1808-136016.html

Interesting, looks like Nvidia's driver doesn't work properly.

That actually would explain a fair bit. nVidia's driver supposedly makes pretty heavy use of multi-threading, so if it's making bad choices which cores/dies to allocate threads to, performance is going to suffer.

EXCellR8 · Aug 15, 2018

Karnak said:
Interesting, looks like Nvidia's driver doesn't work properly.

Most likely an easy fix... the Radeon driver team most likely has much more access to TR CPU samples

wahdangun · Aug 15, 2018

french toast said:
Only on large data sets though surely? If you had a large chunk of L4 with good bandwidth, even 256mb...that would cut alot of traffic to main memory and improve performance of some workloads by decreasing latency.

scannall said:
16 gig of HBM 2 as L4, squatting between all those dies would likely help a LOT. But the cost won't cut it yet. This year anyway it's a Unicorn.

TR already have plenty cache, and it's search in each l3 cache before going to ram so it's not worth it if they added another level of cache, it will just increasing the cost and latency but will not have a meaningful increase in performance. What AMD need to do is increasing the IF frequency but with the current power consumption I guess AMD really need that 7nm.

StefanR5R · Aug 15, 2018

Shivansps said:
Anyone knows why Linux already supports this thing of having NUMA nodes whiout local memory?

I am not sure, but it occurs to me that this case should be handled by any NUMA implementation from day one, because it is just a variation of the case that the memory of a node is fully allocated, while another node still has free memory.

LightningZ71 · Aug 15, 2018

wahdangun said:
TR already have plenty cache, and it's search in each l3 cache before going to ram so it's not worth it if they added another level of cache, it will just increasing the cost and latency but will not have a meaningful increase in performance. What AMD need to do is increasing the IF frequency but with the current power consumption I guess AMD really need that 7nm.

TR has plenty of cache for the total package, but, we're looking at very specific use cases here. A large L4 cache for each die would enable each die to keep a local copy of the contents of each remote L3 cache local to each die. This would alleviate the strain on the IF links between the various Die and reduce the power demand on it as a result. It would also alleviate the issues that having distant RAM on TR-WX packages can cause. Yes, there would be invalidation and copy traffic, but, in most use cases, that would be a "less than frequent" case for a properly NUMA aware scheduler OS. Now, will it be expensive? You betcha! Will it be worth the cost at 12nm? Absolutely not. At 7nm? Maybe on 7nm+ for a hypothetical die that's targeted at EPYC/TR/HEDT AM4. I think that as AMD gets better market penetration and volume on their products, they will also have enough revenue to have three different actively developed die: Power and Size optimized Ryzen Mobile (current Raven Ridge), Clock optimized Ryzen Desktop (current Zepplin) for AM4/TR-X, and balanced Ryzen HCC for EPYC/TR-WX/HEDT AM4. Mobile and Desktop can stay at GloFo and Server can live at TSMC, each on a process node that fits their use case. With a separate HCC die, they can afford to make tweaks to it that address these shortcomings, such as adding extra CCXs and an L4 cache.

wahdangun · Aug 15, 2018

LightningZ71 said:
TR has plenty of cache for the total package, but, we're looking at very specific use cases here. A large L4 cache for each die would enable each die to keep a local copy of the contents of each remote L3 cache local to each die. This would alleviate the strain on the IF links between the various Die and reduce the power demand on it as a result. It would also alleviate the issues that having distant RAM on TR-WX packages can cause. Yes, there would be invalidation and copy traffic, but, in most use cases, that would be a "less than frequent" case for a properly NUMA aware scheduler OS. Now, will it be expensive? You betcha! Will it be worth the cost at 12nm? Absolutely not. At 7nm? Maybe on 7nm+ for a hypothetical die that's targeted at EPYC/TR/HEDT AM4. I think that as AMD gets better market penetration and volume on their products, they will also have enough revenue to have three different actively developed die: Power and Size optimized Ryzen Mobile (current Raven Ridge), Clock optimized Ryzen Desktop (current Zepplin) for AM4/TR-X, and balanced Ryzen HCC for EPYC/TR-WX/HEDT AM4. Mobile and Desktop can stay at GloFo and Server can live at TSMC, each on a process node that fits their use case. With a separate HCC die, they can afford to make tweaks to it that address these shortcomings, such as adding extra CCXs and an L4 cache.

But the design of L4 cache is always because constraints in bandwidth not latency, just like why broadwell and Xbox one use it, and when there are available faster memory option they abandoned it ( in this case ddr4 or gddr5). And the problem with TR isn't about bandwidth but latency, and when ddr5 come out it wil bring much higher bandwidth, and making adding l4 cache is kinda useless.

Kenmitch · Aug 15, 2018

Karnak said:
https://translate.google.com/translate?hl=en&sl=auto&tl=en&u=https://www.golem.de/news/32-kern-cpu-threadripper-2990wx-laeuft-mit-radeons-besser-1808-136016.html

Interesting, looks like Nvidia's driver doesn't work properly.

Hmm....Anybody do similar testing in Linux?

richaron · Aug 15, 2018

Shivansps said:
Anyone knows why Linux already supports this thing of having NUMA nodes whiout local memory? This has to be the reason of why it performs so bad on Windows compared to Linux.

Just guessing here, maybe they decided to improve performance for EPYCS whiout all 8 memory channels populated?

I don't think Linux is aware of the specific NUMA architecture at all.

It's been my understanding (at least over the past decade) that Linux has been better with multithreading in general (scheduling?). Even slightly superior (SMT aware for example) scheduling could cause a huge increase in performance in these large core count and funky arrangement systems.

So I think it's entirely likely the Linux results we are seeing are due to fundamental advantages in pre-existing code. And real specific NUMA code could give even greater advantages on top of this.

Shivansps · Aug 15, 2018

StefanR5R said:
I am not sure, but it occurs to me that this case should be handled by any NUMA implementation from day one, because it is just a variation of the case that the memory of a node is fully allocated, while another node still has free memory.

if Windows sees a node without avalible memory that may wreak havoc with Windows virtual memory management that is far different to how Linux manages virtual memory.

But im not sure, i never looked into NUMA before.

french toast · Aug 15, 2018

wahdangun said:
But the design of L4 cache is always because constraints in bandwidth not latency, just like why broadwell and Xbox one use it, and when there are available faster memory option they abandoned it ( in this case ddr4 or gddr5). And the problem with TR isn't about bandwidth but latency, and when ddr5 come out it wil bring much higher bandwidth, and making adding l4 cache is kinda useless.

Not true.
https://www.anandtech.com/show/9320/intel-broadwell-review-i7-5775c-i5-5675c/10
You can clearly see the impact of having an L4 cache has to the cpu here.

2990WX review thread Its live !

Senior member

Elite Member

Golden Member

Diamond Member

Diamond Member

Golden Member

Golden Member

Senior member

Golden Member

Senior member

Senior member

Platinum Member

Lifer

Senior member

Senior member

Senior member

Diamond Member

Golden Member

Elite Member

Platinum Member

Golden Member

Diamond Member

Golden Member

Diamond Member

Senior member