ub4ty
Senior member
- Jun 21, 2017
- 749
- 898
- 96
This is a terribly annoying bug that does not seem to affect general day-to-day use if you run Windows. It does, however, cause issues if you compile large codebases or perform frequent compilations (which is exactly the sort of use case you might imagine a Threadripper being bought for). It is documented in a number of places, but the most authoritative source is probably AMD's own forum.
https://community.amd.com/thread/215773
I don't want to make a big deal of it, but there doesn't seem to be much general awareness of this issue outside of people using a *nix variant. It has been reproduced on Windows using the "Windows Subsystem for Linux" and also in a VM, so it's certainly not a Linux-specific issue as some seem to believe (it was first really identified on a BSD variant anyway). It can affect random processes on a rare but day-to-day basis, but is most easily reproduced either with the programs built especially to trigger the fault, or just by building large codebases (a real-life use case that bit me in early May).
So I raise it here as a legitimate complaint: I'd *really* love to buy one of these setups, but as I can't even get a Ryzen stable (with AMD's help) under my specific stability testing, I'm loath to even contemplate putting a couple of extra grand down on what would otherwise be a very compelling purchase.
We (the people this bug affects on the AMD forum) are all hoping AMD gets it sorted, and would dearly love anyone who buys one of these things to run some tests on it to demonstrate the issues really are fixed. Until now, people have been disabling SMT, turning up voltages (at AMD's request), and playing with kernel options to disable features like address space layout randomization, all with mixed success (usually it just makes the bug harder to hit, and after 24-odd hours of testing it typically bites about 30 seconds after the tester posts an "it worked, my problems are solved" message to the forum). The only thing that seems to consistently make it go away is disabling the OpCache, but very few BIOSes expose that option at the moment (mine doesn't).
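For anyone wanting to try the ASLR workaround mentioned above, this is roughly what it looks like on Linux. This is a hedged sketch, not an endorsed fix: both commands need root or an existing binary to wrap, `my_build_command` is a placeholder, and as noted, disabling ASLR only made the fault harder to hit, not gone.

```shell
# Disable address space layout randomization system-wide (root required;
# reverts on reboot). 0 = off, 2 = full randomization (the usual default).
sysctl kernel.randomize_va_space=0

# Or disable it for a single process tree only, leaving the global
# setting alone. my_build_command is a placeholder for your build.
setarch "$(uname -m)" -R my_build_command
```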
Anyway, I'll go back to lurking now and wait until this gets positively sorted before dropping > $2200 on another chip & motherboard. The Asus Prime X370 & 1800X sitting idle in the corner are punishment enough.
> Hit the segfault issue after 1hr15min of compilation using some test script that's floating around that uses livedisk and ramdisk for some crazy set of compilations... So, obviously there's a highly technical issue with a long series of compilations?
> Downloaded Phoronix's test suite
Ran three consecutive, back-to-back Linux kernel compilations with -j16. Zero issues.
My question here is: what in the world is this use case that everyone's talking about that's causing segfaults? What is being compiled, and how many times? Is the use case a kernel compilation done over and over without stopping? Why the creation of a ramdisk? Is memory usage ballooning over time?
If this is what I think it is, then it likely only matters for someone in an enterprise using a Ryzen platform as a build machine in some specific manner? If so, although I recognize this is a bug, it is unlikely to be faced by a general developer running Linux, correct? Further, I hear that disabling the OpCache nullifies the bug (albeit with a ~5% performance penalty). Sounds like a good workaround until they get this sorted, which it appears they are narrowing down:
https://community.amd.com/thread/215773?start=555&tstart=0
> On Ryzen there is some sort of interaction between code running at the top of user memory address space and interrupts that can cause FreeBSD to either hang or silently reset.