Intel processors crashing Unreal engine games (and others)

igor_kavinski · Jul 16, 2024

Oh no, "Intel 7 Ultra" transistors getting squished

DAPUNISHER · Jul 16, 2024

igor_kavinski said:
Oh no, "Intel 7 Ultra" transistors getting squished

We do have a member here that reported their 1700 CPU is so warped it no longer posts. The degradation and crashes are starting to look like a sandwich made from layers of fail. Maybe the socket latch is the rancid meat.

IEC · Jul 16, 2024

12th gen has been out longer in the same socket, so I doubt it.

Hail The Brain Slug · Jul 16, 2024

I know a couple of people who have done multiple RMAs and have used a contact frame for the entire life of all of them. It could be orthogonally related or cause tangential issues, but it doesn't seem to me to be causal.

gdansk · Jul 16, 2024

I suspect the Supermicro W680 boards, personally, for the "data center" game servers being unreliable.
That's because I have one such board, which I have run with two different 12600Ks at DDR5-3600 ECC. I've had issues to the point that I will submit an RMA for the board if it shuts off randomly again. I'm pretty sure the CPU isn't the problem because, like I said, I tried two different 12600Ks. Neither showed issues in a cheap ASRock Z690 DDR4 board, but both shut down randomly in the Supermicro W680 board.

Just my experience.

And I don't think it is related to the client crashes - where it does seem to be CPU issues.

DAPUNISHER · Jul 16, 2024

Hail The Brain Slug said:
I know a couple of people who have done multiple RMAs and have used a contact frame for the entire life of all of them. It could be orthogonally related or cause tangential issues, but it doesn't seem to me to be causal.

That is a reasonable hypothesis. It is just part of the crap sandwich.

Saylick · Jul 16, 2024

DAPUNISHER said:
We do have a member here that reported their 1700 CPU is so warped it no longer posts. The degradation and crashes are starting to look like a sandwich made from layers of fail. Maybe the socket latch is the rancid meat.

Hail The Brain Slug said:
I know a couple of people who have done multiple RMAs and have used a contact frame for the entire life of all of them. It could be orthogonally related or cause tangential issues, but it doesn't seem to me to be causal.

If true, this adds yet ANOTHER complication to the situation... Jesus, no wonder Intel is struggling to get to the bottom of this.

IEC · Jul 16, 2024

gdansk said:
I suspect the Supermicro W680 boards, personally, for the "data center" game servers being unreliable.
That's because I have one such board, which I have run with two different 12600Ks at DDR5-3600 ECC. I've had issues to the point that I will submit an RMA for the board if it shuts off randomly again. I'm pretty sure the CPU isn't the problem because, like I said, I tried two different 12600Ks. Neither showed issues in a cheap ASRock Z690 DDR4 board, but both shut down randomly in the Supermicro W680 board.

Just my experience.

And I don't think it is related to the client crashes - where it does seem to be CPU issues.

Half the W680 boards were from ASUS, so the failures are not unique to Supermicro boards.

igor_kavinski · Jul 16, 2024

gdansk said:
I'm pretty sure the CPU isn't the problem because, like I said, I tried two different 12600Ks. Neither showed issues in a cheap ASRock Z690 DDR4 board, but both shut down randomly in the Supermicro W680 board.

What about the OS?

H433x0n · Jul 16, 2024

DAPUNISHER said:
The doc speculating on why the socket latch could be part of the problem -

" I'm talking about torsional twist exacerbating the issue - i.e finding the root cause, not the symptom."

https://twitter.com/x/status/1812809082148892806

How does that explain laptops with the issues?

On a humorous note/pure satire: The doc is obviously guerrilla marketing for Roman and going to get a nice kickback on all of the contact frames he is about to sell.

Roughly 1 out of every 150 CPUs is dead within 100 hours or so. The rate for "marginal" CPUs is even higher than this. The laptops having issues could just be the normal failure rate.

I wouldn't rule out that the ILM is a contributing factor. It's possible that some of the specs changed with Z790, or that they didn't affect the 12th gen chips because those are more robust. No matter what, the ILM is a weak point that should be fixed going forward.
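
To make the "normal failure rate" point concrete, here is a minimal sketch of the binomial arithmetic behind it. The only number taken from the post is the 1-in-150 early-failure rate; the sample size and failure count are hypothetical placeholders, not data from any report.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Probability of k or more failures among n units, each failing
    independently with probability p (binomial model)."""
    below_k = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below_k


BASELINE_RATE = 1 / 150   # early-failure rate quoted in the post
n_units = 300             # hypothetical: laptops in some sample
observed_failures = 5     # hypothetical: failures seen in that sample

expected = n_units * BASELINE_RATE
chance = prob_at_least(observed_failures, n_units, BASELINE_RATE)

print(f"expected failures at baseline: {expected:.1f}")
print(f"P(>= {observed_failures} failures by chance): {chance:.3f}")
```

If the printed probability is large, the observed laptop failures are consistent with the baseline; if it is tiny, something beyond the normal failure rate is likely at work. The independence assumption is a simplification.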

igor_kavinski · Jul 16, 2024

gdansk said:
And neither showed issues in a cheap ASRock Z690 DDR4 board

I actually faced thermal issues with my ASRock mobo until I clamped the heatsink down hard on the CPU socket. Did you screw it down until it felt tight enough, or did you go all the way and turn the screws until your screwdriver couldn't budge them anymore and slipped off?

FangBLade · Jul 16, 2024

In the past few months, there have been nothing but problems with Intel. I remember about three months ago when a streamer was considering buying an Intel setup for game streaming, but in the end he chose the 7800X3D because of the issues. I remember when AMD products had a reputation for being unstable, and today it's Intel, lol. How things change. The Wccftech crew is on life support; I see they are closely monitoring this forum too.

biostud · Jul 16, 2024

If you skip to around 22:00, Actually Hardcore Overclocking speculates about whether VID requests are ruining the ring bus.

lantis3 · Jul 16, 2024

Intel doomed.

Mopetar · Jul 16, 2024

blckgrffn said:
LMAO - I actually chuckled because 1) it called out ASUS on some "feature" only their boards seem to have, probably something else with a custom name, and 2) setting it does what? And it's not safe? What?

Well if it's a "fail safe" and it's not "safe" what does that leave us with? Perhaps we could approach this mathematically.

fail safe - safe = ?

Can anyone help me out with this? I'm still trying to wrap my head around AMD's last numbering scheme revamp so I'll have to foist this problem off onto other forum members.

maddie · Jul 16, 2024

Did Intel do any process/chemistry changes for the Raptor Lake Intel 7 process post Alder Lake? This was something the Fab folks always did, back in the day. Maybe they screwed up for the win.

gorobei · Jul 17, 2024

Wendell on the PCWorld podcast on AMD Zen 5, Arm, and Intel (video cued to the Intel part). He has another video on the Intel issue coming up on the L1T YouTube channel.

He goes over how he started investigating and some of the hashing out with game devs and server operators.
Some fun tidbits: only 50% of the CPUs in the data he had confidence in were affected, and the errors in game could be as minor as a glitched texture. There will be some real shifts in server purchasing decisions coming.

If you don't watch, here is the best line from the show.
Re: Pat Gelsinger talking about AMD being in the rear view mirror, "are you sure it doesn't mean (Intel) isn't driving in the wrong direction?"

poke01 · Jul 17, 2024

gorobei said:
Pat Gelsinger talking about AMD being in the rear view mirror, "are you sure it doesn't mean (Intel) isn't driving in the wrong direction?"

Intel's been driving in the wrong direction since they refused the iPhone.

KompuKare · Jul 17, 2024

poke01 said:
Intel's been driving in the wrong direction since they refused the iPhone.

To be fair, everyone is now margin-obsessed. The other x86 vendor seems quite blasé about ignoring huge markets because margins would be lower, and seems to have no strategy to offer a budget line using older nodes, Samsung, etc.

The irony with Intel and the iPhone is that Intel does not know their own history or views it through rose-tinted glasses:
Why did Intel win against big iron, big RISC servers, etc.?
Until they were hugely dominant and had the huge profits to invest in their fabs, it was never about technical advantages*

No, Intel's victory against RISC workstations and servers was purely due to economies of scale.
Those RISC workstation vendors had far larger margins but tiny volumes.
Intel had "huge volumes, low[er] margins".

So Intel turning down mobile due to it being "huge volumes, low margins" was really ironic.

* the 8086/8088 was a CPU with a very poor ISA and 286 and 386 largely did not fix this - poor x86 programmers in the 80s/90s.

moinmoin · Jul 17, 2024

gorobei said:
Buildzoid chimes in.

The AC loadline setting isn't being enforced (CEP) and vdroop isn't reported correctly, resulting in undervolting so low that it makes the CPU unstable. While not the cause, it does contribute.

He is speculating that the ring is killing itself when the CPU tries to crank up voltage for the 6 GHz boost on the 2 P-cores. The i9s are failing sooner because they are running higher voltages, but the i7s may just be damaging themselves more slowly and are still degrading.

The Warframe dev's pie chart shows the main determinant by CPU model is voltage, with the only outlier being the KS models, which have a different customer target.

Yes, this seems to be the likeliest cause in my view. Buildzoid themselves summarized it as follows:

https://twitter.com/x/status/1812309225701261454

"12/13/14th gen all use Vcore for powering the P/E cores and the ring. A single P-core at 6GHz only pulls around 60W even at 1.5V. So a low power limit won't really do anything to protect the ring from the P-core boost."

"If my theory is correct. intel would basically have to shave like 500-300MHz of boost from the top end chips to get the chips back down to safe operating voltages."
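
A rough sanity check on the first quote, as a minimal sketch rather than anything Buildzoid published: the 60 W and 1.5 V figures come from the quote, while the 125 W and 253 W limits are my assumption based on Intel's published PL1/PL2 values for the 13900K/14900K (boards often set them differently). The helper names are made up for illustration.

```python
def core_current_amps(power_w: float, vcore_v: float) -> float:
    """Current drawn at a given power and voltage (I = P / V)."""
    return power_w / vcore_v


def limit_engages(package_power_w: float, limit_w: float) -> bool:
    """A power limit only throttles once package power exceeds it."""
    return package_power_w > limit_w


SINGLE_CORE_BOOST_W = 60.0   # from the quote: one P-core at 6 GHz
VCORE_V = 1.5                # from the quote

# Assumed limits (not from the quote): Intel's published PL1/PL2
# for the 13900K/14900K. Motherboard defaults may differ.
POWER_LIMITS_W = {"PL1": 125.0, "PL2": 253.0}

amps = core_current_amps(SINGLE_CORE_BOOST_W, VCORE_V)
print(f"single-core boost draws about {amps:.0f} A at {VCORE_V} V")

for name, limit in POWER_LIMITS_W.items():
    engaged = limit_engages(SINGLE_CORE_BOOST_W, limit)
    print(f"{name} ({limit:.0f} W) engages: {engaged}")
```

This ignores the rest of the package (E-cores, ring, uncore), but it shows the shape of the argument: a light single-core boost load sits far below either power limit, so the limit never steps in while Vcore, which the ring shares, is at its highest.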

CakeMonster · Jul 17, 2024

I guess that could explain how it's gradually worse the higher up the product stack you go, considering boost is a major selling point of the top SKUs.

igor_kavinski · Jul 17, 2024

Seems Intel's internal validation/binning routines didn't stress the CPUs with strenuous memory bound workloads and instead just focused on whether the cores themselves could hit higher speeds without errors.

Timur Born · Jul 17, 2024

Maybe CPU bins affect which CPUs are more or less affected by degrading and/or instabilities?! I always called my own 13900K "good enough" and in the full range of 13/14th gen it is rather average, right in the middle of VID spread published by Igor's Lab. Maybe that means my CPU doesn't ask for too much (degrading) nor too little voltage (instability) at stock settings and thus stays healthy and stable?!

I have been using my 13900K for over 1.5 years now and have sent it through various stress tests within mostly sane limits (though I know that it hit over 415 A at least a few times).

DAPUNISHER · Jul 17, 2024

Timur Born said:
Maybe CPU bins affect which CPUs are more or less affected by degrading and/or instabilities?! I always called my own 13900K "good enough" and in the full range of 13/14th gen it is rather average, right in the middle of VID spread published by Igor's Lab. Maybe that means my CPU doesn't ask for too much (degrading) nor too little voltage (instability) at stock settings and thus stays healthy and stable?!

I have been using my 13900K for over 1.5 years now and have sent it through various stress tests within mostly sane limits (though I know that it hit over 415 A at least a few times).

Which of the UE games known to crash do you put a fair number of hours into?

In every thread on the degrading CPUs I visit, there are anecdotes from owners with variations of "mine is fine". Good for you. I hope it stays that way, because as of now we have no idea whether that means yours is solid or a time bomb with a longer and/or slower-burning fuse.

coercitiv · Jul 17, 2024

Timur Born said:
I have been using my 13900K for over 1.5 years now and have sent it through various stress tests within mostly sane limits (though I know that it hit over 415 A at least a few times).

We won't know until we understand the nature of failures. We don't know if it's voltage, current, temps, mechanical, factory of origin or a combination of all these factors. Depending on the root cause and possible secondary causes that may accelerate the degradation, a CPU being stable for more than 1 year is no indication that everything will remain fine in the future. (this applies to my 12700K too)

During Nvidia's bumpgate I had 2 laptops, a personal machine and one at work, both Dell XPS from the same lineup. Personal machine was the 13" model, and the GPU would die every 6 months like clockwork. The 15" model at work had better cooling on the GPU and also saw less thermal stress over the years. It managed to hang on until warranty expired, so I guess 5+ years, then died one summer day during a reboot. Root cause was the same, degradation happened at a different pace.

Edit: I ate too many words on the first pass
