Intel processors crashing Unreal engine games (and others)


Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,758
14,785
136
That is the crux of why, months after this came to light, Intel has not been "able" to provide a fix.

Never mind that any such "fix" would lose so much performance that pretty much 13x and 14x CPU reviews would be invalidated.

What I keep saying though: there is no way that a company of Intel's size did not 100% know this was the case, and trying to throw the motherboard makers under the bus is just laughable.
But on the flip side, Intel had enough power to require mobo makers to use "x" max defaults, but did not do it, since they wanted to win reviews as the top chip. At a reasonable power setting (same wattage as AMD) it loses to the Zen 4 CPUs, and they could not have that.

So here we are.

Edit: I am sure there are other situations where they are also unstable. Let's see a 14900K do 100% load with even a crippled AVX-512 for days. For one, they don't have it, but even taking that out of the equation, could they even do it?
 

In2Photos

Golden Member
Mar 21, 2007
1,688
1,699
136
I get that most people don't delve into the BIOS, but then they shouldn't be buying K, KS and Z mobos.
Last time I checked those were all consumer SKUs, so why shouldn't any consumer be able to buy them? Plus there are system integrators that sell high-end PCs for many uses: gaming, photo and video editing, CAD, etc. People want the best PC for that type of work and many SIs only sell Intel. Everyone uses a PC, not everyone builds a PC.
 

Joe NYC

Platinum Member
Jun 26, 2021
2,333
2,947
106
The problem is that the issue only happens under heavy overclock with all safeties removed with settings that are way beyond anything that intel allows.

Apparently, just switching memory profile from default (JEDEC) to XMP turns all overclocking to the max on some motherboards.

I don't know if this is true, but I came across some posts saying this.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,758
14,785
136
@TheELF, so you have a 14900K? Could you reply with your before and after settings, changing ONLY XMP, for the settings you refer to? I don't believe this is true, but would love to hear those results.
 

H433x0n

Golden Member
Mar 15, 2023
1,073
1,281
96
Apparently, just switching memory profile from default (JEDEC) to XMP turns all overclocking to the max on some motherboards.

I don't know if this is true, but I came across some posts saying this.
I'm certain this was the case when I first built my RPL system using an Asus Z790-E Strix board. Enabling XMP meant it had the Asus enhanced profile that came with 4096W PL1/PL2 and unlimited IccMax (511A). This also meant the Intel current excursion protection settings (the safety features designed to prevent transient current spikes) were all turned off. If you didn't toggle XMP on and kept the JEDEC memory profile, it would default to 125W/253W and 307A IccMax. I'm unsure if CEP was enabled or not, since I didn't know that was a thing back then.

Nowadays I would say it's probably fixed on Asus boards (unless your chip is already trashed). The default when enabling XMP is the Intel performance profile that runs 253W PL2, 307A IccMax with CEP enabled. You have to go out of your way to run the cancerous Asus OC profile. The Intel profile will still score 38.5K in CB R23 in this configuration. It's the way it should've always been configured. It runs higher voltage for sure (AC/DC LL are both 1.0 mΩ now) but temperatures haven't increased; the extra voltage seems to only be present in low load situations.
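
For anyone who wants to sanity-check their own BIOS readout against the numbers in this post, here is a minimal sketch. The thresholds are only the ones quoted above for the Intel performance profile on this board (253W PL2, 307A IccMax, CEP enabled), not figures taken from an Intel datasheet. The `BiosSettings` class and `check_against_intel_profile` function are hypothetical names made up for illustration, and the values have to be typed in by hand from the BIOS screen.

```python
from dataclasses import dataclass

# Thresholds as quoted in the post for the Intel performance profile.
# Assumption: they apply to your chip and board as well.
PL2_LIMIT_W = 253
ICCMAX_LIMIT_A = 307


@dataclass
class BiosSettings:
    """Hypothetical container for values read off the BIOS screen by hand."""
    pl2_watts: float
    iccmax_amps: float
    cep_enabled: bool


def check_against_intel_profile(s: BiosSettings) -> list[str]:
    """Return human-readable warnings; an empty list means nothing was flagged."""
    warnings = []
    if s.pl2_watts > PL2_LIMIT_W:
        warnings.append(f"PL2 {s.pl2_watts:g}W exceeds {PL2_LIMIT_W}W")
    if s.iccmax_amps > ICCMAX_LIMIT_A:
        warnings.append(f"IccMax {s.iccmax_amps:g}A exceeds {ICCMAX_LIMIT_A}A")
    if not s.cep_enabled:
        warnings.append("CEP is disabled")
    return warnings


# The old Asus enhanced profile described above: 4096W, 511A, CEP off.
asus_oc = BiosSettings(pl2_watts=4096, iccmax_amps=511, cep_enabled=False)
# The current default with XMP enabled, as described above.
intel_perf = BiosSettings(pl2_watts=253, iccmax_amps=307, cep_enabled=True)

for name, settings in [("Asus OC", asus_oc), ("Intel perf", intel_perf)]:
    print(name, check_against_intel_profile(settings) or "OK")
```

The old Asus profile trips all three checks, while the current default passes.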
 

H433x0n

Golden Member
Mar 15, 2023
1,073
1,281
96
@TheELF, so you have a 14900K? Could you reply with your before and after settings, changing ONLY XMP, for the settings you refer to? I don't believe this is true, but would love to hear those results.
You would have to check this on a BIOS before this mess started, probably late 2023.
 

Bencher

Member
Apr 21, 2022
54
10
51
That is the crux of why, months after this came to light, Intel has not been "able" to provide a fix.

Never mind that any such "fix" would lose so much performance that pretty much 13x and 14x CPU reviews would be invalidated.

What I keep saying though: there is no way that a company of Intel's size did not 100% know this was the case, and trying to throw the motherboard makers under the bus is just laughable.
My 12700F would run as high as 300 watts at times, and I got rid of it before I could find out, so yes, that one is only a maybe. BUT the point of my post is that the 13xx and 14xx are definitely WAY overclocked from the factory, and that's the center of the problem. If there were ANY motherboard that had reasonable BIOS defaults, it would not sell, as it would be so far behind in performance compared to the other manufacturers' boards.
Oh, I very much agree, the out-of-the-box settings of the 13700K/13900K and 14700K/14900K are bad. They are literally barely usable out of the box. I had both of the i9s, and I had to tune them down. Which I don't mind, since I'd do that regardless with any CPU.

What I disagree with is whether or not Intel gained anything out of it. Besides the obvious crashing issues that it has caused, all the reviews are focused on "300 watts / this will catch fire / burn your house down" etc. Toned down to more reasonable wattages, they would still be winning in both ST and MT performance and efficiency in most segments, except the high end where the 7950X is king.
 

Bencher

Member
Apr 21, 2022
54
10
51
Last time I checked those were all consumer skus so why shouldn't any consumer be able to buy them? Plus there are system integrators that sell high end PCs for many uses, gaming, photo and video editing, CAD, etc. People want the best PC for that type of work and many SIs only sell Intel. Everyone uses a PC, not everyone builds a PC.
System integrators use their own bios settings based on the cooler / case / psu used etc. They are irrelevant to the discussion.

Of course every consumer can buy them but not every consumer can use them. Like you can buy a car, if you don't have a license though you can't drive it sort of thing.
Apparently, just switching memory profile from default (JEDEC) to XMP turns all overclocking to the max on some motherboards.

I don't know if this is true, but I came across some posts saying this.
From my experience with a lot of different mobos, it is not the case (I've tested 4 different mobos). The first time you boot into the BIOS on any motherboard, you are stuck on a full screen asking you to select the cooler you are using, and therefore your power limits. There are 3 options, something like: air cooler, setting power limits to 125 or thereabouts; tower cooler, making it 250; and water, making them unlimited. Most reviewers that are pushing 300+ watts into the CPU are using the last option. It's not the XMP option that removes all limits, it's the watercooler option.
 

coercitiv

Diamond Member
Jan 24, 2014
6,403
12,863
136
Apparently, just switching memory profile from default (JEDEC) to XMP turns all overclocking to the max on some motherboards.

I don't know if this is true, but I came across some posts saying this.
We are way past this, and this needs to be crystal clear for everyone in the thread: there are power and boost management features that came disabled by default, features that are intended to ensure system stability.

Let's go through them again as Intel listed them:
  • CEP (Current Excursion Protection) - think of this as clock stretching, the CPU can temporarily drop clocks when power delivery has trouble keeping up during some nasty transient. Most mobo makers disable this by default because it can lower system performance when the CPU is aggressively undervolted (and they will undervolt it using another setting below, to maximize performance)
  • TVB and eTVB (enhanced Thermal Velocity Boost) - these features are meant to allow pushing clocks even further than Turbo Boost, however they are also responsible for limiting max clocks when the CPU passes a certain temperature threshold. In practice disabling them does not mean the CPU will no longer boost to max clocks, but rather that the safety thermal clock ratio clipping is disabled, allowing the CPU to boost to max clocks even when temps are very high. This is why mobo makers will disable them by default, removing a protection layer to maximize performance.
  • TVB Voltage Optimizations - the purpose of this one is to lower Vcore when the CPU temps are lower than max. During light loads this feature alone can lower Vcore by 50mV or more, it has a big effect on CPU efficiency especially when the CPU is properly cooled. Mobo makers want this enabled... right? Wrong! It may limit OC potential out of the box, so it gets disabled.
  • AC Load Line - a parameter that describes how much voltage compensation should be applied depending on load. There is no default value here, only a worst case scenario value. Mobo makers are supposed to test their board models and establish proper values using specialized equipment. The catch here is one can aggressively configure AC Load Line and effectively undervolt the CPU under heavy workloads. Undervolting maximizes performance, and... can bring the CPU to the brink of the stability threshold. Combine this with disabled CEP and TVB and at least some CPUs will become unstable. IMHO this setting was the main culprit for some Intel CPUs not being able to run Cinebench with "stock" settings.
Notice I didn't even mention power and current limits. Those are the second layer of the issue.

We seem to have a third layer too though. Intel has been investigating this for months, and they have yet to come up with a final solution. For me this means that the new enforced defaults are not enough to fix stability problems for everyone.
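
To put rough numbers on the AC Load Line point, here is a back-of-the-envelope sketch. It uses a very simplified steady-state model that is my assumption, not Intel's specification: the CPU raises its voltage request by current × AC Load Line to pre-compensate for droop, and the board's real load line then drops current × its own resistance. The function name `die_voltage` and every number below (1.30 V base request, 200 A load, 1.1 mΩ board load line, 0.55 mΩ aggressive setting) are illustrative only.

```python
def die_voltage(base_vid_v, current_a, ac_ll_mohm, board_ll_mohm):
    """Estimate the voltage at the die under a simplified steady-state model.

    Assumption: the CPU adds current * AC_LL to its voltage request, and
    the board's actual load line then drops current * R_board.
    """
    requested = base_vid_v + current_a * ac_ll_mohm / 1000.0
    return requested - current_a * board_ll_mohm / 1000.0


# Illustrative values only, not measurements from any real board.
BASE_VID_V = 1.30      # voltage the CPU wants at this clock
LOAD_A = 200.0         # heavy all-core load
BOARD_LL_MOHM = 1.1    # what the board's VRM actually behaves like

# AC Load Line matches the board: the compensation cancels the droop.
matched = die_voltage(BASE_VID_V, LOAD_A, 1.1, BOARD_LL_MOHM)

# AC Load Line set to half the real value: the CPU under-compensates.
aggressive = die_voltage(BASE_VID_V, LOAD_A, 0.55, BOARD_LL_MOHM)

undervolt_mv = (matched - aggressive) * 1000.0
print(f"matched:    {matched:.3f} V")
print(f"aggressive: {aggressive:.3f} V")
print(f"effective undervolt under load: {undervolt_mv:.0f} mV")
```

Under this model the shortfall scales with current, which is why the undervolt only bites under heavy workloads.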
 

DAPUNISHER

Super Moderator CPU Forum Mod and Elite Member
Super Moderator
Aug 22, 2001
28,839
21,632
146

Literally, problem is still there.



I assume future UE5 game would have same fate on Raptor CPU.
Their conclusion from the article they published on June 24th -

How to Stabilize the Intel Core i9-14900K/13900K

  • Adhere to Intel’s “Default” power profile even if it may not work for all users.
  • Download the eTVB firmware fix from your motherboard vendor’s official website.
  • Cap your framerates at 60 FPS to reduce the load on the CPU. Preposterous, I know, but it works.
  • Reduce your CPU boost clock (P-core/E-core ratio) by 100-200 MHz. Or increase the core voltage by 5-15%.
  • Buy an AMD Ryzen CPU: A more reliable fix, but unfortunately isn’t free.
That's rough sledding. The last solution is right out of the Youtube comments section. 🤣

They have already RMA'd 4 CPUs, with one of the replacements also failing within weeks and Intel issuing a refund. This gets more FUBAR all the time. However, I am happy to read that Hardware Times experienced no RMA shenanigans, and that a refund was issued automatically after multiple failures. That's the level of Intel customer support I have experienced going back to the 90s. Hopefully this policy has been instituted worldwide, because as I posted earlier in this thread, some of the users in Asia, mostly India, were having a bad time with the RMA process.
 

In2Photos

Golden Member
Mar 21, 2007
1,688
1,699
136
System integrators use their own bios settings based on the cooler / case / psu used etc. They are irrelevant to the discussion.

Of course every consumer can buy them but not every consumer can use them. Like you can buy a car, if you don't have a license though you can't drive it sort of thing.
That's your comparison? Buying and using a computer versus a car? So you want people to have a license to buy a high end PC? 🙄 Just because you have a license doesn't mean you are good at operating the vehicle or that you know how to adjust it to keep things running smoothly.

And system integrators, despite applying their own BIOS settings still had issues with crashing. I saw several of them posting on social media that they were working with Intel on a fix. So it definitely is relevant to the discussion.
 
Jul 27, 2020
18,013
11,737
116

I think this is a pretty eye-opening thread!

Some choice quotes:

Update: p-cores 4 & 5 which cause the crashes are actually the 'preferred' cores with the 6.2GHz limit.

I tried controlling this with the frequency limit per active cores (p-cores).

Act.cores: 1/2/3/4/5/6/7/8
Max Freq: 6.1/6.1/5.9/5.9/5.9/5.9/5.9/5.9

Re-running the compile on cores 4 & 5 still results in a crash.

Second attempt:
Act.cores: 1/2/3/4/5/6/7/8
Max Freq: 6.0/6.0/5.9/5.9/5.9/5.9/5.9/5.9

Failed as well.
Now testing: 5.9/5.9/5.9/5.9/5.9/5.9/5.9/5.9
This passes the compile gcc test on all cores.

Okay have replicated the failure in Windows. I installed Gentoo in WSL2, then did an "emerge -1 gcc" to rebuild the compiler with itself. This runs to completion fine on "default" settings.

However if I open the task manager and select vmmemWSL (the VM running WSL) and set the affinity to CPUs 8 & 9 only I get a more or less immediate crash of the compiler.

The results match those in Linux: p-cores 0-3 and 6-7 seem to work fine when vmmemWSL has its affinity set to the pair of hyper-threads for that core. P-core 4 causes the compile to fail almost instantly, and P-core 5 fails after a few minutes.
If I run a long GCC compile task it will eventually fail.

If I bootstrap GCC with 3 worker threads and set the affinity to vCPU 8 & 9 (both in the same 'preferred' p-core), it still fails more or less instantly, and the baseline settings appear to have made no improvement to this at all.
After a lot of testing, I have found the motherboard power limits don't solve the issue, neither does the Intel Baseline Profile in the latest BIOS. I was able to get my 14900KS stable at 5.8GHz on stock voltages (ASUS LLC 4) which is what I am using now. I could get 5.9GHz stable on LLC 6 but not sure it's good to run it like that all the time. I was not able to get it stable at 6GHz and above (basically could not get the preferred cores to boost any faster than the other cores without stability issues). I was also able to get it stable all the way up to 6.2GHz with hyper-threading disabled and LLC3.
I tried under-volting as suggested by Intel, and it did not improve stability, in fact as expected, with lower voltages I had to limit the frequency more to keep it stable.

I also tried over-volting. This was interesting because I did get it stable with 6.2GHz enabled at +250mV offset. However it was overall the same speed or slower as the single p-core being used was hitting 100°C and thermally throttling - but not crashing.

So to get the chip stable during it brief spike to 6.2GHz, we end up thermally throttling after a few seconds to 5.8GHz, resulting in overall lower performance than just limiting clock speed to 5.9GHz.

So here is where I am at:

- limiting frequency to 5.9GHz (@ Asus LLC5) gives best all core performance

- disabling hyper-threading (@ Asus LLC3) gives best single threaded performance.

Overvolting for a higher clock speed, or under-volting to reduce temperature do not result in better performance than the above.
I still think the real reason for this problem is that hyper-threading creates a hot-spot somewhere in the address arithmetic part of the core, and this was missed in the design of the chip. Had a thermal sensor been placed there, the chip could throttle back the core ratio to remain stable automatically, or perhaps the transistors needed to be bigger for higher current - not sure that would solve the heat problem. Ultimately an extra pipeline stage might be needed, and this would be a problem, because it would slow things down when only one hyper-thread is in use too. I wonder if this has something to do with why Intel is getting rid of hyper-threading in 15th gen?
Really glad I found your series of posts here – this is very similar to what I'm seeing with my i9 14900K. I don't have any interest in gaming or overclocking, so this is a stock build without any attempt to push the processor beyond what the BIOS is doing by default. I'm running Windows 11 and use the PC exclusively for C++ software development. For the first few months the processor was stable, but I'm now getting multiple random clang compiler crashes that go away after retrying.

I'm now considering buying a new system. My work cannot withstand the downtime of taking out the CPU and doing an RMA exchange.

In my case, compiling a codebase like chromium from scratch with a pristine known-good git checkout has a 100% chance of a clang ICE. The stack traces from these crashes never make any sense either – that's made it hard to narrow down. The clang crash report might show an invalid syntax encountered while parsing some C++ AST but succeed when retrying. It's stochastic in nature – never the same error twice, or with the same file. I also see crashes in Python scripts that run as part of the build as well. I tried compiling the same project in Ubuntu and saw the same results.
It appears that even at 5.8GHz the CPU has become unstable again. Not sure whether this is due to 'degradation' (electromigration results from a combination of high heat and high current), or whether warmer weather here has just reduced cooling efficiency.
I have now tested my 13900ks using the same set of CPU tests above. The 13900ks passes all tests at default ASUS settings, with ASUS performance optimisations enabled, and unlimited power and current settings. The CPU has previously been run with unlimited power and current for over a year and shows no signs of the issues I have had with the 14900ks.

Verdict: 13900KS possibly better. 14900KS has a high chance of being a fail.
 
Reactions: Joe NYC and Ranulf

KompuKare

Golden Member
Jul 28, 2009
1,076
1,126
136
Verdict: 13900KS possibly better. 14900KS has a high chance of being a fail.
Interesting that they are using the ECC enabled W680 board and not a consumer board.

Not that ECC helps if the instability is inside the CPU!

Also the HT comments are interesting considering rumours are Intel are abandoning SMT - at least outside of servers.
 
Jul 27, 2020
18,013
11,737
116
Some key points from that thread:

6.2 GHz too much even for the preferred cores.

HT off may possibly make the errors go away and let these preferred cores hit 6.2 GHz.

A user with stock 14900K with no overclock started having stability issues after only a few months of mainly C++ development work.

On 16th May, the user determined that 5.8 GHz all-core was stable for him. On 3rd June, he reported that 5.8 GHz all-core was no longer stable. So the CPU degraded in under three weeks.
 

IEC

Elite Member
Super Moderator
Jun 10, 2004
14,362
5,032
136
Apparently, just switching memory profile from default (JEDEC) to XMP turns all overclocking to the max on some motherboards.

I don't know if this is true, but I came across some posts saying this.
This was certainly true for my MSI Z690 ACE. First boot I enabled XMP and it immediately set some of the power limits to max (4096W - wowza) and stupid amperage limits. That was on a 12600K. Just enabling XMP did this.

Obviously, I saw this and immediately tuned PL1 and PL2 to a more sane 125W. How many users would have noticed?
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,758
14,785
136
This was certainly true for my MSI Z690 ACE. First boot I enabled XMP and it immediately set some of the power limits to max (4096W - wowza) and stupid amperage limits. That was on a 12600K. Just enabling XMP did this.

Obviously, I saw this and immediately tuned PL1 and PL2 to a more sane 125W. How many users would have noticed?
When I first questioned a user on this, it was because I could not believe that Intel would do this or allow something this crazy for just wanting to use faster memory.

I still maintain that Intel and its board partners are responsible for the end result, which came about because Intel wanted to win benchmarks.
 

DrMrLordX

Lifer
Apr 27, 2000
21,813
11,168
136
When I first questioned a user on this, it was because I could not believe that Intel would do this or allow something this crazy for just wanting to use faster memory.

I still maintain that Intel and its board partners are responsible for the end result, which came about because Intel wanted to win benchmarks.
Keep in mind this was also the problem for a lot of X3D owners on AM5 when the vSoC problem still existed (only for AM5 it was a problem with EXPO settings raising vSoC to stupid levels, among other things). AMD stepped in and pushed AGESA fixes, bugged board OEMs, etc. to stop that madness.
 

Markfw

Moderator Emeritus, Elite Member
May 16, 2002
25,758
14,785
136
Keep in mind this was also the problem for a lot of X3D owners on AM5 when the vSoC problem still existed (only for AM5 it was a problem with EXPO settings raising vSoC to stupid levels, among other things). AMD stepped in and pushed AGESA fixes, bugged board OEMs, etc. to stop that madness.
well, I have a 7950x with EXPO enabled and a 4090 , and the CPU and GPU run 100% load at all times. Never flashed since I set it up, never had an issue, and it certainly never hit this amount of users affected and did not degrade CPUs that I heard.
 

DrMrLordX

Lifer
Apr 27, 2000
21,813
11,168
136
well, I have a 7950x with EXPO enabled and a 4090 , and the CPU and GPU run 100% load at all times. Never flashed since I set it up, never had an issue, and it certainly never hit this amount of users affected and did not degrade CPUs that I heard.
That's not an X3D chip.
 

H433x0n

Golden Member
Mar 15, 2023
1,073
1,281
96
We are way past this, and this needs to be crystal clear for everyone in the thread: there are power and boost management features that came disabled by default, features that are intended to ensure system stability.

Let's go through them again as Intel listed them:
  • CEP (Current Excursion Protection) - think of this as clock stretching, the CPU can temporarily drop clocks when power delivery has trouble keeping up during some nasty transient. Most mobo makers disable this by default because it can lower system performance when the CPU is aggressively undervolted (and they will undervolt it using another setting below, to maximize performance)
  • TVB and eTVB (enhanced Thermal Velocity Boost) - these features are meant to allow pushing clocks even further than Turbo Boost, however they are also responsible for limiting max clocks when the CPU passes a certain temperature threshold. In practice disabling them does not mean the CPU will no longer boost to max clocks, but rather that the safety thermal clock ratio clipping is disabled, allowing the CPU to boost to max clocks even when temps are very high. This is why mobo makers will disable them by default, removing a protection layer to maximize performance.
  • TVB Voltage Optimizations - the purpose of this one is to lower Vcore when the CPU temps are lower than max. During light loads this feature alone can lower Vcore by 50mV or more, it has a big effect on CPU efficiency especially when the CPU is properly cooled. Mobo makers want this enabled... right? Wrong! It may limit OC potential out of the box, so it gets disabled.
  • AC Load Line - a parameter that describes how much voltage compensation should be applied depending on load. There is no default value here, only a worst case scenario value. Mobo makers are supposed to test their board models and establish proper values using specialized equipment. The catch here is one can aggressively configure AC Load Line and effectively undervolt the CPU under heavy workloads. Undervolting maximizes performance, and... can bring the CPU to the brink of the stability threshold. Combine this with disabled CEP and TVB and at least some CPUs will become unstable. IMHO this setting was the main culprit for some Intel CPUs not being able to run Cinebench with "stock" settings.
Notice I didn't even mention power and current limits. Those are the second layer of the issue.

We seem to have a third layer too though. Intel has been investigating this for months, and they have yet to come up with a final solution. For me this means that the new enforced defaults are not enough to fix stability problems for everyone.
This is probably the best summary of the issue. All of the safety features you listed were disabled on Asus boards, and AC/DC Load Line was set to 0.55/0.8 mΩ.

The issue is obvious in hindsight. I’m not sure what Intel is supposed to be investigating since it’s super clear what the problem was. The motherboards literally were running chips undervolted with all safeguards removed while Intel looked the other way. I’m guessing they’re either A) trying to run out the clock until Arrow Lake or B) working with motherboard vendors to help them develop a profile that has these features enabled with minimal performance loss. For Asus the problem is solved since the latest BIOS at the end of May. The Intel performance profile has all of these safety features and still retains the performance from before.

This issue will probably remain for years since people are generally slow to update their BIOS, especially if it means they’ll lose some performance depending on the silicon quality of their chip.
 
Reactions: igor_kavinski

Bencher

Member
Apr 21, 2022
54
10
51
Some key points from that thread:

6.2 GHz too much even for the preferred cores.

HT off may possibly make the errors go away and let these preferred cores hit 6.2 GHz.

A user with stock 14900K with no overclock started having stability issues after only a few months of mainly C++ development work.

On 16th May, the user determined that 5.8 GHz all-core was stable for him. On 3rd June, he reported that 5.8 GHz all-core was no longer stable. So the CPU degraded in under three weeks.
Depending on the voltage, not surprising. I've degraded in 5 minutes, lol

It's amazing how much the process has matured from ADL to 13th/14th gen, but man, Intel pushed them too close to the sun.

well, I have a 7950x with EXPO enabled and a 4090 , and the CPU and GPU run 100% load at all times. Never flashed since I set it up, never had an issue, and it certainly never hit this amount of users affected and did not degrade CPUs that I heard.
Personal anecdote of 1. I also had both a 13900k and a 14900k and never had any issues. Doesn't mean much, does it?
 