Question DEGRADING Raptor lake CPUs

Kocicak

Golden Member
Jan 17, 2019
1,067
1,124
136
I noticed some reports about degrading i9 13900K and KF processors.

I experienced this problem myself when I ran mine at 6 GHz under a light load (3 threads of Cinebench), at an acceptable temperature and non-extreme voltage. After only a few minutes it crashed, and afterwards it could not run even at stock settings without bumping the voltage a bit.

I was thinking about the cause of this, and I believe the problem is that people do not appreciate how high these frequencies are, and that the real comfortable frequency limit of these CPUs is probably around 5500 or 5600 MHz. These CPUs are made on the same process (possibly improved somehow) as the Alder Lake CPUs. See the frequencies the 12900KS runs at. The frequency improvement from the new process tweak may not be as high as some people presume.

Those 13900K CPUs are probably highly binned to find the ones that contain some cores which can reliably run at 5800 MHz. Some 13900Ks probably have little or no OC reserve left, and pushing them will cause them to degrade or break.

The conclusion for me is that the best thing you can do for your 13900K or 13900KF is to disable the 5800 MHz peak, which will allow you to offset the voltage lower, and then set the all-core maximum frequency to some comfortable level; I guess the maximum could be 5600 MHz. With lowered voltage this frequency should be gentler on the processor than running it at the original 5500 MHz at higher voltage. You can also run it at lower frequencies, allowing for an even larger voltage drop, but then the CPU slowly loses its point (unless you want a high-efficiency CPU intended for heavy multithreaded loads).

Running it with a power limit suited to your cooling solution, to keep the CPU at a sensible temperature, will surely help too.
 

DAPUNISHER

Super Moderator CPU Forum Mod and Elite Member
Super Moderator
Aug 22, 2001
29,484
24,222
146
In hindsight it's obvious the stability problems existed back then too, what I'm arguing here is one person can arrive at the correct conclusion based on the wrong evidence, and that person is still considered to be wrong.
I surmised as much. It's your prerogative to set that standard. I do not share it though. It doesn't matter why the broken clock is correct twice a day, it is still correct.

Here's how I see it. The bar isn't nearly that high to clear.

Thread title plus bold sentences are a hypothesis + Mark understood and agreed with the hypothesis in a post that many have almost exactly repeated the last few months = bar cleared.
 

Kocicak

Golden Member
Jan 17, 2019
1,067
1,124
136
Well, this would suggest that under your conditions (voltage, frequency, temperature and use intensity) the CPU degrades by about 50 mV per year. This seems like a lot. It is even worse when you realise that you are actively trying to run it at as low a voltage as possible! What would the rate of degradation be if you did not undervolt it?

Perhaps Intel 10nm process is not that great even after they "fixed it".

It is surely usable under some conditions, but pushing CPUs made on this process too hard does not seem to be a good idea.

If somebody had a lab and could measure the rate of degradation of these CPUs as a function of how hard they are run (voltage, frequency, temperature), that would be very good. It would be quite an expensive project though: you need more than one CPU subjected to the same conditions to account for sample variation. 3 CPUs in each of 3 different scenarios, that is just 9 systems? Seems doable. It would not be too time consuming, because the rate seems to be pretty high. (And 5 CPUs in each of 4 different scenarios makes just 20 systems, not that extreme either.)
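
The size of such an experiment is easy to sketch. The snippet below is only an illustration: the scenario names and the voltage, frequency and temperature values are made-up placeholders, not settings anyone in this thread has tested, and `build_test_matrix` is a hypothetical helper.

```python
from itertools import product

# Hypothetical stress scenarios: (voltage in V, frequency in MHz, temperature cap in °C).
# These values are placeholders for illustration only.
SCENARIOS = {
    "mild": (1.20, 5000, 70),
    "stock": (1.35, 5500, 90),
    "pushed": (1.45, 5800, 100),
}

def build_test_matrix(scenarios, cpus_per_scenario):
    """Return one (scenario_name, sample_index) entry per physical test system."""
    return [
        (name, sample)
        for name, sample in product(scenarios, range(1, cpus_per_scenario + 1))
    ]

matrix = build_test_matrix(SCENARIOS, cpus_per_scenario=3)
print(len(matrix))  # 3 scenarios x 3 CPUs each = 9 systems
```

Running several samples per scenario is what lets sample-to-sample variation be separated from the effect of the settings themselves.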

They would get a lot of views on youtube, for sure.

This could determine the safe conditions for these CPUs, so that consumers wishing to overclock them would know what they can safely afford to do, and other consumers wanting to ensure a long life for their CPUs would know how to limit the voltage, etc.

We could have known the real safe frequencies for these CPUs if Youtubers were not lazy, did not just drive around in their Ferraris and instead invested some money where it really matters.

Voltage is a necessary prerequisite for current, so I have no idea what the debate is about. With low enough voltage the CPU does not get damaged by whatever load you throw at it.

I have no idea what your agenda is, but degradation is a physical process happening in every semiconductor; what matters is how quickly and under what conditions.

And also - not every chip is manufactured the same and perfectly.

Remember how AVX-512 workloads stress CPUs hard, and now AVX-512 is disabled in the new Intel consumer CPUs? Can you guess why? I believe that the CPUs simply would not last long running that workload. That is the simplest and most logical explanation. It all goes back to degradation.

The consumer Intel CPUs are made to run at such speeds and voltages, that they are UNABLE to run certain workloads without killing themselves.

I believe that in certain conditions something melts or fails in other ways in the chip, I am not sure this is technically a degradation or failure.
In the meantime, on another forum, I tried to compute the effects of different current density and temperature on shortening the time before failure using Black's electromigration equation, and the effects were dramatic. Provided that I made no mistake.

A CPU running at 5 GHz and 1.2 V, with the resulting low current densities and temperature, can last years; a CPU running at 5.7 GHz, 1.4 V and 100°C will have a dramatically shorter life.
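
For a rough feel of the gap between those two operating points, the textbook CMOS approximations can be applied: average switching current scales with V·f and dynamic power with V²·f. This is a sketch under that assumption only. It ignores leakage, which grows with voltage and temperature, and it says nothing about local current density in any particular interconnect. The function names are hypothetical.

```python
def relative_switching_current(v0, f0, v1, f1):
    """Ratio of average switching current, assuming I ~ C * V * f with C constant."""
    return (v1 * f1) / (v0 * f0)

def relative_dynamic_power(v0, f0, v1, f1):
    """Ratio of dynamic power, assuming P ~ C * V^2 * f with C constant."""
    return (v1 ** 2 * f1) / (v0 ** 2 * f0)

# Operating points from the post: 5.0 GHz at 1.2 V versus 5.7 GHz at 1.4 V.
current_ratio = relative_switching_current(1.2, 5.0, 1.4, 5.7)
power_ratio = relative_dynamic_power(1.2, 5.0, 1.4, 5.7)
print(f"current x{current_ratio:.2f}, dynamic power x{power_ratio:.2f}")
```

Under these assumptions the harder operating point draws about 1.33 times the switching current and about 1.55 times the dynamic power, before any temperature effect is counted.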
 

gdansk

Platinum Member
Feb 8, 2011
2,843
4,231
136
We could have known the real safe frequencies for these CPUs if Youtubers were not lazy, did not just drive around in their Ferraris and instead invested a bit of money where it really matters.
How? It took over a year for game developers who have access to telemetry and bug reports to notice the 13900K was having problems. Which reviewer could have replicated that amount of monkeys at keyboards? Especially when they all want to get reviews out in time for launch embargo.
 
Reactions: lightmanek

Hulk

Diamond Member
Oct 9, 1999
4,456
2,374
136
We could have known the real safe frequencies for these CPUs if Youtubers were not lazy, did not just drive around in their Ferraris and instead invested some money where it really matters.


In the meantime, on another forum, I tried to compute the effects of different current density and temperature on shortening the time before failure using Black's electromigration equation, and the effects were dramatic. Provided that I made no mistake.

A CPU running at 5 GHz and 1.2 V, with the resulting low current densities and temperature, can last years; a CPU running at 5.7 GHz, 1.4 V and 100°C will have a dramatically shorter life.
I ran my 13900K at "auto" BIOS settings. It degraded in about 4 months.
It's been a year on my 14900K at 5.5/4.3, no HT, manual voltage 1.3V with no problems. 1.3V of course means load voltages from 1.15 to about 1.2. It never sits at 1.3 because at low load it clocks down into C-states and lower voltages.
 

scannall

Golden Member
Jan 1, 2012
1,960
1,678
136
How? It took over a year for game developers who have access to telemetry and bug reports to notice the 13900K was having problems. Which reviewer could have replicated that amount of monkeys at keyboards? Especially when they all want to get reviews out in time for launch embargo.
In the original reviews I looked at the thermals and wow, OK, that's bad. More heat density than the burner on your electric stove. What could possibly go wrong? I wouldn't expect "click-boys" to go down that rabbit hole, but none of the good ones did either. Maybe, I dunno, do what Kocicak did? Using Black's electromigration equation would have been a good start.
 

Josh128

Senior member
Oct 14, 2022
292
405
96
In the original reviews I looked at the thermals and wow, OK, that's bad. More heat density than the burner on your electric stove. What could possibly go wrong? I wouldn't expect "click-boys" to go down that rabbit hole, but none of the good ones did either. Maybe, I dunno, do what Kocicak did? Using Black's electromigration equation would have been a good start.
Nobody went down that hole because who are they to question some of the most knowledgeable CPU and silicon engineers on the planet? They would have been laughed at or called AMD shills by toxic fanboys.
 

scannall

Golden Member
Jan 1, 2012
1,960
1,678
136
Nobody went down that hole because who are they to question some of the most knowledgeable CPU and silicon engineers on the planet? They would have been laughed at or called AMD shills by toxic fanboys.
So? Be laughed at or called a 'fanboy'. If there is a problem that looks worth pursuing then just do it. Having integrity and a desire to pursue the truth will get stones thrown at you. Always has, always will. But it is still the right thing to do and will be rewarded in the end.
 

Kocicak

Golden Member
Jan 17, 2019
1,067
1,124
136
Somebody sent MLID a speculation that there is something wrong with the Raptor Lake design; he then added that Raptor Lake was developed very quickly, suggesting that mistakes could have happened.



I do not think this is the real problem, because every piece of technology has conditions in which it can work well long term and others in which it is stressed beyond its limits and will fail.

Intel upgraded their 10 nm process for Raptor Lake CPUs and bragged about how this new improved process can handle up to 1 GHz higher frequency. Perhaps they were too optimistic. Perhaps the real improvement was much lower. Perhaps the improvements actually hurt long-term reliability, and the process is in this respect even worse than its previous iteration.

If there was something critically wrong with Raptor Lake CPUs, they would fail in Intel's testing prior to launch. These CPUs may have one or more hotspots in them, which get stressed hard in certain conditions (a combination of frequency and voltage resulting in high current density and temperature), but when these conditions are not met, they can happily and reliably work long term.

One more thing: if a CPU reports 100°C, it does not rule out that something in the silicon die is baking at 170°C.
 

Kocicak

Golden Member
Jan 17, 2019
1,067
1,124
136
I tested how LGA1700 CPUs change shape in the socket on another forum:



I also wrote:

ONE VERY SERIOUS WARNING FOR LGA1700 CPU OWNERS:

Choose one mounting mechanism and then use ONLY THIS SOLUTION. The IHS is not the only thing that is deforming. If you bend the CPU repeatedly in different directions, SOMETHING may break, if not the whole chip, then something in it, or you may get a tear in the TIM between the chip and the IHS.

Honestly, when I first bent the 14500 one way and then the other way, I expected that the CPU would not work, but it does. ...

I just checked the shape of the 14900K out of the socket and it still has the same bulge. I will never force it to change its current shape by using a different mounting mechanism.

I have never used my current stable 14900K with any mounting mechanism other than the contact frame. However, I was quite disappointed to find out how much my CPU had bulged out. I am afraid that the silicon die itself is somewhat mechanically stressed even with the contact frame.

With the stock ILM the CPUs, including the silicon dies, are bent significantly. I cannot imagine how this could help anything; however, I still believe that the main problem is too high frequencies requiring too high voltage, causing too high current density and temperature and wearing the CPUs out too quickly.

I wrote on another forum:

I have never said that Intel chips are failing in mass numbers, but after playing with a few of 13th gen and 14th gen Intel chips I got a good idea of what these chips can comfortably handle, and what settings are pushing these chips hard, potentially damaging them in longer or even shorter periods of time.

I do not think that Intel is enforcing specs well enough, and it does not matter much, if the specs themselves are pushed to the edge.

You can see in the above screenshot from the Korean site that even a small number of failing chips can cause problems for customers and vendors, because even customers whose chips are fully functional at the moment want to get rid of them once they have lost confidence that the chips can run reliably.

BTW even when we talk about a special variant of the CPU - fully unlocked and user configurable K model - such special CPU should be in no way less reliable, while running in specs, than the normal CPUs.

It is really painful to see what Intel does to their own products. For example, a 14900K limited to 180W power draw runs really cool under an air cooler with the worst Cinebench load: after a few minutes the absolute maximum temperature over all cores was 73°C (average temperature much lower, P cores running at 4900 MHz). I just tested it quickly this morning before I left for work.

When you also limit the frequency (below spec), you never see even that temperature. When I put a heavy load on the P cores and let them run at 5200 MHz, their max temp was lower than what I mentioned.

I would like to know the comparison of the actual electric currents running in the silicon in the two variants of 14900K: my 14900K limited to 5200/4200 MHz and 180W, and a stock 14900K placed in a motherboard that does not even enforce the Intel spec power limits. The strain each silicon chip endures in these two scenarios should be somehow quantifiable.

It should also be noted that temperature sensors are not everywhere in the silicon, and the REAL MAXIMUM temperatures reached in the chips could differ dramatically between the above-mentioned settings.

I just learned that, according to Black's equation for electromigration, doubling the current density can reduce the mean time to failure to as little as a fourth, with higher temperature making it even worse.

EDIT:
I tried to calculate Black's equation for an activation energy of 0.9 eV and a current density exponent of 1.2, doubling the current density and increasing the temperature from 60°C (333 K) to 100°C (373 K), and I got a 66 times shorter time before failure. Did I make a mistake in the calculation? It seems wrong.
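
The arithmetic can be checked with a few lines of Python. This is a minimal sketch using the usual form of Black's equation, MTTF = A · J^(-n) · exp(Ea / (k·T)), with the parameter values taken from the post; whether 0.9 eV and an exponent of 1.2 are appropriate for this particular process is an assumption, and the function name is made up.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K

def mttf_reduction_factor(j_ratio, n, ea_ev, t_cold_k, t_hot_k):
    """How many times shorter the mean time to failure becomes under Black's equation.

    MTTF = A * J**(-n) * exp(Ea / (k * T)); the prefactor A cancels in the ratio.
    """
    current_term = j_ratio ** n
    thermal_term = math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1 / t_cold_k - 1 / t_hot_k))
    return current_term * thermal_term

# Values from the post: Ea = 0.9 eV, n = 1.2, doubled current density, 333 K -> 373 K.
factor = mttf_reduction_factor(j_ratio=2.0, n=1.2, ea_ev=0.9, t_cold_k=333.0, t_hot_k=373.0)
print(f"MTTF shrinks by a factor of about {factor:.0f}")
```

With these inputs the result is roughly 66, so the calculation itself appears to be right. Most of it comes from the temperature term (about 29), not from the current density term (about 2.3). With n = 2 and no temperature change the same function returns 4, which is the "a fourth" case.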

I just found something supporting my (probably not very accurate) result, that electromigration is strongly dependent on temperature.

Increasing temperature by 57/60°C caused tenfold decrease of the time to reach the same amount of failures.
 

Kocicak

Golden Member
Jan 17, 2019
1,067
1,124
136
Perhaps Intel 10nm process is not that great even after they "fixed it".

It is surely usable under some conditions, but pushing CPUs made on this process too hard does not seem to be a good idea.
It seems that there could be a real problem with oxidation, which causes some interconnects to have high resistance and fail in higher stress conditions.

 

Kocicak

Golden Member
Jan 17, 2019
1,067
1,124
136
If this is the root cause of the failures, and Intel knows about this problem and knows that, for example, 10% of CPUs are going to fail, it is IMO totally unacceptable to pass such a high failure rate on to consumers and PC makers to deal with.

This manufacturing issue must impact their server chips too. I wonder if they mitigate it by extensive testing and burn-in of their server CPUs.
 
Jul 27, 2020
19,613
13,479
146
This manufacturing issue must impact their server chips too. I wonder if they mitigate it by extensive testing and burn-in of their server CPUs.
Doubt that. Server chips have a lot of RAS features that, depending on the nature of the error, may even be able to correct errors on the fly.


Going forward, they might develop ways to figure out the bad RPL CPUs and then sell them with disabled cores or reduced speeds. Even if they do RMAs, they can always send those defective chips to China and other third world countries in trays and offer no warranty for a wholesale price. I doubt they would throw away i7/i9 chips if they are able to work fine in office PCs with DDR4-3200 or DDR5-3600.
 

cebri1

Senior member
Jun 13, 2019
264
261
136

Kocicak

Golden Member
Jan 17, 2019
1,067
1,124
136
It seems that there could be a real problem with oxidation, which causes some interconnects to have high resistance and fail in higher stress conditions.
Perhaps the problem is that these interconnects do not directly fail, but just cause some hotspots which will cause premature degradation of the lower layers of the chip.

Server chips have a lot of RAS features that, depending on the nature of the error, may even be able to correct errors on the fly.
They mention some thermal monitoring; could they disable cores that start heating up too much?

I realised now that server CPUs usually run at pretty low voltages and frequencies.

Going forward, they might develop ways to figure out the bad RPL CPUs
How would they do it? The story goes that the CPUs will start failing after a few months in service.
 

maddie

Diamond Member
Jul 18, 2010
4,878
4,951
136
Doubt that. Server chips have a lot of RAS features that, depending on the nature of the error, may even be able to correct errors on the fly.


Going forward, they might develop ways to figure out the bad RPL CPUs and then sell them with disabled cores or reduced speeds. Even if they do RMAs, they can always send those defective chips to China and other third world countries in trays and offer no warranty for a wholesale price. I doubt they would throw away i7/i9 chips if they are able to work fine in office PCs with DDR4-3200 or DDR5-3600.
Most errors are probabilistic. It's not either error proof or error prone. Even error correction fails, just at a much lower probability.
 

9949asd

Member
Jul 12, 2024
73
37
51
Correction: some chips that don't spend a great deal of time hitting TJmax. That custom loop saved his CPU.
I think it's the voltage. Nearly all Z690/Z790 boards will auto OC, and the stock voltage will be above 1.45 V. Also, not all 13th and 14th gen CPUs will fail. At least among the Intel users I know in real life, no one has had a failure.
 
Reactions: Hulk

DAPUNISHER

Super Moderator CPU Forum Mod and Elite Member
Super Moderator
Aug 22, 2001
29,484
24,222
146
This is proven untrue by the terrible failure rates on the W series boards.
Survivorship bias is powerful mojo for those experiencing it.

On top of the W series, we have word a major S.I. is failing something like 12% of Raptor Lake during the initial stress testing. That's a lot of faulty CPUs out of the gate.
 

DrMrLordX

Lifer
Apr 27, 2000
22,000
11,563
136
There is nothing proved yet.
On the contrary, Wendel's video alone had good data from people who operate thousands of these CPUs in a professional environment. More data points are coming in every day. The latest GN video has a source from a top OEM indicating immediate failure rates of 10-25% of 13th gen Raptor Lake CPUs. Keep in mind those are CPUs that probably never shipped to customers and had the failures out-of-the-gate. That doesn't include ones that can go bad after 6-12 months of use.

What is proven is that there are abnormally high failure rates for Raptor Lake CPUs.
 