Electromigration

Rubycon

Madame President
Aug 10, 2005
17,768
485
126
When a processor is pushed too far and it starts to produce errors in programs such as Prime95, would this be considered a precursor to electromigration? As in a slight decay and if it were allowed to continue the margins would decrease? Meaning the processor would no longer work reliably at the same speed, etc.?

 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
If you're only going to read one paragraph, read the last one. If you're only going to read a few, read the last 3.

Electromigration is an effect dominated by temperature and the current density of a current flowing in a wire. It's caused by electrons bumping into metal atoms and moving them around. Higher temperature makes it easier to move the metal atoms. Higher current density affects electromigration because it means there are more electrons flowing through the wires.

If you increase the voltage, you'll be increasing the current flowing through the wires. Since there's more current, and there's not more metal in the chip, the current density goes up, which worsens electromigration.

Now, a little background: a given logic gate or transistor in a circuit generally only flips once per clock cycle. As it flips, a current flows, but once it's done flipping, there's very very little current (the transistors leak - they don't shut off completely - but for the purposes of this explanation, leakage is negligible). Each flip involves moving a little bit of charge, which involves a small current flowing for a brief time (say, 1 milliamp for 100 picoseconds).

A wire in a chip running at 1 GHz might see current flowing through it for, on average, 1 second out of every 10 seconds (100 picoseconds every 1000 picoseconds, over 10 seconds). If you overclocked that chip to 2 GHz, the wire would see current flowing through it for 2 seconds out of every 10 seconds (100 picoseconds every 500 picoseconds, over 10 seconds). This will roughly double the rate of electromigration (assuming the temperature doesn't go up).
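
As a rough check on the arithmetic above, here is a minimal Python sketch. It assumes the same illustrative numbers (one 100-picosecond current pulse per clock cycle) and that the electromigration rate scales linearly with the fraction of time current flows. The function name `duty_cycle` is made up for this example.

```python
def duty_cycle(clock_hz: float, pulse_s: float = 100e-12) -> float:
    """Fraction of time current flows, assuming one pulse per clock cycle."""
    period_s = 1.0 / clock_hz
    # A pulse cannot last longer than the cycle it occurs in.
    return min(1.0, pulse_s / period_s)


base = duty_cycle(1e9)         # 1 GHz
overclocked = duty_cycle(2e9)  # 2 GHz

print(f"1 GHz: current flows {base:.0%} of the time")
print(f"2 GHz: current flows {overclocked:.0%} of the time")
print(f"relative electromigration rate: {overclocked / base:.1f}x")
```

The ratio only holds if temperature and voltage stay the same; in a real overclock both usually go up, which makes things worse.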

In practice, most of the wires on the chip flip one way, then the other, then back, and so on--so if an atom gets bumped in one direction, another atom might get bumped back in the other direction when the signal flips the other way. This largely cancels out the electromigration, because the average current cancels out. Some of the metal wires, however, conduct current predominantly in one direction (for example, wires supplying power to a circuit, or short stubs of wires within individual gates). (A picture might help here, but I can't find a decent drawing program for Linux.) A wire supplying a ground connection (low voltage) to a transistor will only see current flowing when the signal switches to 0, and the current always flows the same way. These wires are the most vulnerable to electromigration, and their electromigration rate increases linearly with clock frequency (as explained above).

When a processor is pushed too far and it starts to produce errors in programs such as Prime95, would this be considered a precursor to electromigration? As in a slight decay and if it were allowed to continue the margins would decrease?

Processors produce errors when OCing for a few reasons, but by and large, the errors you'll see are caused by logic paths not evaluating completely in the time allotted. No damage occurs from a path not evaluating completely* - if you want to imagine what's occurring, just consider a line of people passing numbers along. It takes a certain amount of time from when the first person hands off a number to when the last person receives it. If you're trying to synchronize a lot of things, you might require that they pass numbers down the line within 60 seconds (let's say they actually take 45 seconds to do it). Every 60 seconds, the first person passes on a number, and every 60 seconds, the last person shouts out the number he's currently holding. You could overclock the system by increasing how often the first person launches a new number and how quickly the last person shouts the number he's holding. At some point (say, 40-second intervals), the number won't make it all the way down the line before the last guy shouts his number...and he'll shout out an old number rather than the most recent one. The people don't try to pass numbers faster when you overclock them - they're passing just as fast whether you give them 1 second or an hour. The only difference is that you'll get a wrong answer if you try to go too fast.

Logic gates work the same way. In a given circuit, most of the gates don't "see" the clock frequency - they just switch at their own speed. If you give the circuit enough time to finish, you get the right answer. If you run it too fast, you get the wrong answer. (I'm sure a picture would have been better than this contrived example... hopefully it conveys the point.)

The point is, there isn't a sudden change in damage as you go from the speed where things are working great to the speed where they're not working (it's not like an engine, where as you increase the RPM you can hit a point where you're doing large amounts of damage pretty suddenly).

* Well, I can think of contrived situations where you could increase wear & tear, but let's ignore them.
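
The line-of-people example can be played out in a few lines of Python. This is only a toy model of the analogy, not of a real circuit: `captured_values`, the 45-second passing time and the sample numbers are all assumptions made for illustration.

```python
def captured_values(inputs, path_delay_s, period_s):
    """What the last person shouts at each tick.

    Input i is launched at i * period_s and reaches the end of the line
    path_delay_s later, no matter how fast the clock is ticking.
    """
    shouted = []
    for tick in range(1, len(inputs) + 1):
        shout_time = tick * period_s
        # Indices of all inputs that have reached the end of the line by now.
        arrived = [
            i for i in range(len(inputs))
            if i * period_s + path_delay_s <= shout_time
        ]
        # The last person holds the newest number that arrived, if any.
        shouted.append(inputs[arrived[-1]] if arrived else None)
    return shouted


numbers = [7, 8, 9]
print(captured_values(numbers, path_delay_s=45, period_s=60))  # enough time
print(captured_values(numbers, path_delay_s=45, period_s=40))  # "overclocked"
```

With a 60-second period every shout matches the number launched in that cycle; at 40 seconds each shout is one number behind. The passing time is the same in both runs; only the deadline moved.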

Meaning the processor would no longer work reliably at the same speed, etc.?

I don't think electromigration is likely to cause that. Electromigration kills in two ways:
1) It can cause a short circuit. When an atom gets knocked out of place, it has to end up somewhere. For various reasons, in electromigration-prone spots, a lot of atoms will end up at the same spot. They eventually bloat the wire at that spot to the point where it creates a short circuit (it touches another wire). Until the moment of failure, there won't really be any warning. This image shows atoms piling up in specific locations, and you can see that the circuit is getting pretty close to having a short.
2) It can cause a wire to break. If many atoms get knocked out of the same area, eventually there won't be any left, and the wire won't conduct any more. This picture is a fantastic example. Theoretically, this process could slow a chip down before it fails...but there's a catch. Remember, the rate of electromigration depends on the current density in a wire. As atoms move away from a spot, the wire there gets narrower. About the same amount of current flows, though, so the current density goes up. This speeds up electromigration, and the wire thins more...which raises the current density...which thins the wire even faster... and the wire will break pretty quickly from that point. So, I'm going to claim, "Electromigration is not going to cause a chip to slow down" because I would expect a chip in that state to end up dying completely soon after the slowdown would start. I don't feel like working out the math to tell for sure, but some members here play with enough silicon that they might know off the top of their heads.

There are, however, other effects that would cause a stressed chip to slow down. A particularly nasty one right now is called "NBTI" for Negative Bias Temperature Instability. It's an effect whose exact physical mechanism isn't entirely understood, but the results (and the kinds of things that cause it to happen) are understood pretty well: as transistors age, they get harder to turn on, and when they're on, they don't conduct current as well. NBTI is strongly affected by both voltage and temperature, and causes a pretty significant slowdown nowadays. In bad operating conditions, you can cause a chip to slow down significantly in a matter of hours. I have numbers, but unfortunately can't share them. Manufacturers have to add margin nowadays to account for this - they have to make sure that the 3 GHz chip you buy today will still work 5 years from now, so the chip they sell at 3 GHz would have passed their tests at a higher speed and would overclock well when new. I would not be surprised if a C2D operated in normal conditions for 5 years overclocks poorly on the 5th year, or if an aggressively-overclocked C2D had to be run slower after a few years.

I don't know if electromigration is a big killer nowadays. Modern transistors are operated pretty close to what I would call their "breaking points", whereas it's relatively easy to design with electromigration in mind and minimize it. I would expect other effects to be more dominant (TDDB, for example). Maybe someone who does reliability analysis can share with us what effects are mainly responsible for chip death in the short term.
 

gururu2

Senior member
Oct 14, 2007
686
1
81
this is something they warn you about when you overclock over long periods. it's physical damage to the cpu, so the accuracy of processes would deteriorate as well. i have heard that one can simply up the voltage when errors start surfacing, but that this only delays a severe system malfunction temporarily and doesn't result in more accurate readings by prime95 or other brutal stability tests.
 

wwswimming

Banned
Jan 21, 2006
3,695
1
0
electromigration can be quite spectacular at the board level.

one of my EE co-workers had a presentation to a senior VP,
where the EE was supposed to show the VP our wonderful
radio that they spent $billions developing and sold for
related prices. (military radio).

lights, camera, action, POOF.

they spent $millions doing failure analysis. found some
flux residue between traces carrying power and ground.
the tin lead molecules built a little bridge in the combination
of electrolyte and electric field. they must have finished
building the bridge about the time the VP showed up.
 

BrownTown

Diamond Member
Dec 1, 2005
5,314
1
0
When overclocking, the errors produced are because the clock is running faster than the logic paths can evaluate, so the hold times on the registers are not met and they do not always latch the correct values. That is the "short term" reason why you can only overclock to a certain point before you start getting errors. However, there are obviously long-term effects, which CTho9305 talked about, caused by the increased heat and voltages. The top-level overclockers change out their CPUs every 3-6 months, so they are not as affected by greatly reduced lifespans. Usually the more someone is into overclocking, the less time they keep the same CPU. So my parents have had the same reliable rev. B0 Northwood running for at least 6 years, but most people on this board have probably changed out their CPU within the last 2 years.
 

Special K

Diamond Member
Jun 18, 2000
7,098
0
76
Originally posted by: BrownTown
When overclocking, the errors produced are because the clock is running faster than the logic paths can evaluate, so the hold times on the registers are not met and they do not always latch the correct values. That is the "short term" reason why you can only overclock to a certain point before you start getting errors. However, there are obviously long-term effects, which CTho9305 talked about, caused by the increased heat and voltages. The top-level overclockers change out their CPUs every 3-6 months, so they are not as affected by greatly reduced lifespans. Usually the more someone is into overclocking, the less time they keep the same CPU. So my parents have had the same reliable rev. B0 Northwood running for at least 6 years, but most people on this board have probably changed out their CPU within the last 2 years.

Actually if the clock is running so fast that a logic path is not given enough time to evaluate, then it would be missing the setup time of the register, not the hold time. Hold time violations occur independent of the clock frequency (i.e., you cannot fix a hold time violation by running the clock at a lower frequency).
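
To make the distinction concrete, here is a small Python sketch of the two checks in their usual textbook form. The function names and all the delay numbers are hypothetical; `skew_ps` is taken to mean how much later the clock arrives at the receiving register than at the launching one.

```python
def setup_slack_ps(period_ps, max_path_ps, setup_ps, skew_ps=0.0):
    """Positive means the slowest path arrives in time for the capturing edge."""
    return period_ps + skew_ps - max_path_ps - setup_ps


def hold_slack_ps(min_path_ps, hold_ps, skew_ps=0.0):
    """Positive means the fastest path does not overwrite the old value too soon.

    Note that the clock period does not appear here at all.
    """
    return min_path_ps - hold_ps - skew_ps


# Hypothetical numbers, in picoseconds.
MAX_PATH, MIN_PATH, SETUP, HOLD = 350.0, 40.0, 30.0, 20.0

for period in (500.0, 300.0):  # 2 GHz, then about 3.33 GHz
    print(
        f"period {period:.0f} ps: "
        f"setup slack {setup_slack_ps(period, MAX_PATH, SETUP):+.0f} ps, "
        f"hold slack {hold_slack_ps(MIN_PATH, HOLD):+.0f} ps"
    )
```

Shortening the period drives the setup slack negative while the hold slack stays exactly where it was, which is why slowing the clock cannot fix a hold violation.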

 

BrownTown

Diamond Member
Dec 1, 2005
5,314
1
0
Originally posted by: Special K

Actually if the clock is running so fast that a logic path is not given enough time to evaluate, then it would be missing the setup time of the register, not the hold time. Hold time violations occur independent of the clock frequency (i.e., you cannot fix a hold time violation by running the clock at a lower frequency).

My bad, Special K is right: setup time is the time to evaluate the logic, and hold time is apparently the time after the clock edge that the input needs to remain in the same state in order for the register to fall into the correct state.

On another note, there are other failure modes besides electromigration for semiconductor devices that will be sped up by overclocking. Hot carrier injection is the process of carriers gaining enough energy to embed in the gate oxide. The trapped charge then changes the threshold voltage of the transistor. Since there is a certain amount of energy required for this to happen, it is a function of the electric field in the transistor. The electric field of course is a function of voltage, so increasing the voltage in order to overclock further will also exacerbate this problem.
 

aigomorla

CPU, Cases&Cooling Mod PC Gaming Mod Elite Member
Super Moderator
Sep 28, 2005
21,019
3,490
126
OMG ... CTho9305

Your post is seriously excellent. We could use one in cpu and overclocking, because this is what people there believe to be the main reason why DDR2 1066 RAM sticks die all of a sudden.

Also why motherboards can't hold overclocks for long periods of time and suddenly lose their overclock, or fail altogether.

 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: aigomorla
OMG ... CTho9305

Your post is seriously excellent. We could use one in cpu and overclocking, because this is what people there believe to be the main reason why DDR2 1066 RAM sticks die all of a sudden.

Also why motherboards can't hold overclocks for long periods of time and suddenly lose their overclock, or fail altogether.

Thanks. I know almost nothing about DRAM - the manufacturing process they use is different, so it wouldn't surprise me if the failure mechanisms are different. I also don't know anything about board-level stuff - but the actual northbridge/southbridge itself is probably going to experience the same kinds of failure mechanisms as the CPU (the manufacturing process is similar/identical).
 

CanOWorms

Lifer
Jul 3, 2001
12,404
2
0
Electromigration is still a factor for high-reliability oriented customers. In my company, we're asked to do an EM analysis from time to time, but we've never actually had an EM failure from a customer to look at.

There's a presentation from a German AMD group that actually shows the electromigration process happen through an animation of SEM images. It's pretty interesting. They present it at some conferences.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: CanOWorms
There's a presentation from a German AMD group that actually shows the electromigration process happen through an animation of SEM images. It's pretty interesting. They present it at some conferences.

Is this (from here, higher resolution if it works) the series of images you're talking about? Note that this experiment used a huge current density (>100x normal conditions if I did my math right), for 15 hrs at 230C before the first void became visible.
 

BrownTown

Diamond Member
Dec 1, 2005
5,314
1
0
Damn, at 230C won't there also be a non-trivial number of thermally generated carriers as well as the doped carriers?
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: BrownTown
Damn, at 230C won't there also be a non-trivial number of thermally generated carriers as well as the doped carriers?

It wasn't clear to me if there were any devices involved or if they just fabbed wires and hooked them up to an external power source. I didn't read the paper.
 

soydios

Platinum Member
Mar 12, 2006
2,708
0
0
CTho9305, very informative post.

question: why does decreasing the temperature of the chip increase its overclocking headroom, if the reason you seemed to give was that the signal didn't have time to propagate across the chip (sorta like an engine at higher-than-spec RPM running "out of tune")?
 

Special K

Diamond Member
Jun 18, 2000
7,098
0
76
Originally posted by: soydios
CTho9305, very informative post.

question: why does decreasing the temperature of the chip increase its overclocking headroom, if the reason you seemed to give was that the signal didn't have time to propagate across the chip (sorta like an engine at higher-than-spec RPM running "out of tune")?

Resistance increases with temperature. If your chip is getting very hot, the resistance of the logic paths will increase, which in turn will make it take longer to evaluate.
 

Onund

Senior member
Jul 19, 2007
287
0
0
Originally posted by: Special K
Originally posted by: soydios
CTho9305, very informative post.

question: why does decreasing the temperature of the chip increase its overclocking headroom, if the reason you seemed to give was that the signal didn't have time to propagate across the chip (sorta like an engine at higher-than-spec RPM running "out of tune")?

Resistance increases with temperature. If your chip is getting very hot, the resistance of the logic paths will increase, which in turn will make it take longer to evaluate.

I believe the mobility of the charge carriers in the transistors is the dominant factor. As temps increase, the mobility of the charge carriers decreases, which in turn decreases drain current. Lower currents take longer to charge parasitic capacitances, which means longer delays in the logic paths.
 

Rubycon

Madame President
Aug 10, 2005
17,768
485
126
Next step - is it possible to write a program to detect damage or change upon the onset of electromigration?

For example - a piece of software written that can not only stress the CPU but increase its FSB or multi (unlocked chips) and VCORE at the same time while checking for errors. It would be interesting to do this on a fresh cpu then run it highly overvolted and overclocked under high stress (Linpack) for 180 days or more, then profile it again.
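
A rough Python sketch of what the profiling half of such a tool might look like. Everything here is hypothetical: `is_stable` stands in for "set this frequency, run a stress test with known answers, report whether it passed", and actually changing FSB, multiplier or VCORE from software is platform-specific and not shown. It also assumes stability is monotonic (if a frequency fails, every higher one fails too).

```python
from typing import Callable


def max_stable_mhz(is_stable: Callable[[int], bool],
                   lo: int, hi: int, step: int = 25) -> int:
    """Binary-search the highest passing frequency on a grid of `step` MHz."""
    if not is_stable(lo):
        raise ValueError("chip is not stable even at the lowest frequency")
    n_lo, n_hi = 0, (hi - lo) // step
    while n_lo < n_hi:
        mid = (n_lo + n_hi + 1) // 2  # round up so the loop always progresses
        if is_stable(lo + mid * step):
            n_lo = mid       # passed: the limit is here or higher
        else:
            n_hi = mid - 1   # failed: the limit is strictly lower
    return lo + n_lo * step


# Stand-ins for a real chip: a made-up stability limit before and after aging.
fresh_chip = lambda mhz: mhz <= 3610
aged_chip = lambda mhz: mhz <= 3480

before = max_stable_mhz(fresh_chip, 2400, 4200)
after = max_stable_mhz(aged_chip, 2400, 4200)
print(f"fresh: {before} MHz, after stress period: {after} MHz, "
      f"lost {before - after} MHz")
```

Profiling the same chip before and after the 180 days, at the same voltage and temperature, would give the two numbers to compare. It would show that something degraded, but not which mechanism was responsible.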
 

fuzzybabybunny

Moderator, Digital & Video Cameras
Moderator
Jan 2, 2006
10,455
35
91
Just to get some clarification:

Logic paths function independently of clock speed and they are the determinant of how fast you can clock a processor without errors, correct? Say you have a chip at 2GHz, the logic paths will function error-free at a maximum of 2.5GHz, but you can physically clock the CPU to 3GHz. Even though you can physically clock to 3GHz, you can't run it at this speed because the logic paths are only error free before 2.5GHz, so 2.5GHz is effectively the highest overclock you can get from this chip?

Do logic paths function at different speeds depending on the individual chip? You always hear about people who get a XXX CPU that is a great overclocker, but someone else is unlucky and gets the same XXX CPU that doesn't overclock as well. Does this mean that the former got a CPU that just happened to have faster functioning logic paths than the other one? Logic path speed is luck of the draw?

Ex. I've got a Q6600. It overclocks to 3.6GHz fine and dandy at 1.5V and all other voltages set to default and temperatures being nice and controlled with my watercooling, but even if I push it a little further to 3.66GHz, no matter how much extra voltage I give vcore, MCH, and FSB, it always errors immediately in Prime95. Does this mean 3.6GHz is the maximum that my logic gates can function error-free at?

At the same time, I've heard of someone being able to overclock his Q6600 to 4.0GHz, a huge increase over mine, and still be error-free. So his logic gates just function better?

**********

I definitely agree on the electromigration slowing down CPUs. I had a Pentium D 805 @ 2.66GHz, and back in the day I could overclock that thing to 4.0GHz stable. I ran it like this for a few months, then something happened and I ran it for about a year at stock speed. Then I went back and tried to overclock it to 4.0GHz again, but it just wouldn't do it, no matter how much voltage I gave it and despite the fact that my hardware had remained completely unchanged since the first time I overclocked it. It would only do 3.66GHz max.

All this is well and good, but it's not quantitative. Just what is the rate of decay? Is it even worth it to overclock to the max if it means you can quickly burn out the CPU? Someone mentioned in the Q6600 @ 4GHz thread that this overclock futureproofs this setup for years to come. But does it really? The thing might only be able to do 3.6GHz, what, a few years, a year, a few months, from now?
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: fuzzybabybunny
Just to get some clarification:

Logic paths function independently of clock speed and they are the determinant of how fast you can clock a processor without errors, correct? Say you have a chip at 2GHz, the logic paths will function error-free at a maximum of 2.5GHz, but you can physically clock the CPU to 3GHz. Even though you can physically clock to 3GHz, you can't run it at this speed because the logic paths are only error free before 2.5GHz, so 2.5GHz is effectively the highest overclock you can get from this chip?

Correct. A given path on a particular chip may take, say, 350ps to evaluate with worst-case inputs. Clock skew and jitter may mean that the receiving flip flop has its clock tick, say, 50ps earlier than the launching flip flop, so effectively the cycle time needs to be longer than 400ps (2.5GHz or slower) for reliable operation. Note that clock skew/jitter, like other attributes that affect performance, vary chip-to-chip.
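
The same arithmetic as a tiny Python sketch, using the illustrative 350 ps and 50 ps figures from above. The function name and the second, slightly faster chip are hypothetical.

```python
def max_clock_ghz(path_delay_ps: float, skew_jitter_ps: float) -> float:
    """Highest clock (GHz) at which the slowest path still fits in one cycle."""
    min_period_ps = path_delay_ps + skew_jitter_ps
    return 1000.0 / min_period_ps  # a 1000 ps period corresponds to 1 GHz


print(max_clock_ghz(350, 50))  # 350 ps path + 50 ps skew/jitter
print(max_clock_ghz(330, 50))  # a slightly faster chip (hypothetical numbers)
```

A 20 ps difference in the slowest path is already worth over 100 MHz here, which gives a feel for how chip-to-chip variation turns into "good" and "bad" overclockers.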

Do logic paths function at different speeds depending on the individual chip? You always hear about people who get a XXX CPU that is a great overclocker, but someone else is unlucky and gets the same XXX CPU that doesn't overclock as well. Does this mean that the former got a CPU that just happened to have faster functioning logic paths than the other one? Logic path speed is luck of the draw?

Yes. There's a huge amount of variation between chips. As we move to smaller and smaller transistors, the number of dopant atoms is plummeting. When you start getting into the realm of a few hundred dopant atoms in a transistor, plus or minus a few dozen becomes significant. Note that the gates themselves could be the same, but the metal wires could differ... variation isn't limited to the transistors.

Ex. I've got a Q6600. It overclocks to 3.6GHz fine and dandy at 1.5V and all other voltages set to default and temperatures being nice and controlled with my watercooling, but even if I push it a little further to 3.66GHz, no matter how much extra voltage I give vcore, MCH, and FSB, it always errors immediately in Prime95. Does this mean 3.6GHz is the maximum that my logic gates can function error-free at?

At the same time, I've heard of someone being able to overclock his Q6600 to 4.0GHz, a huge increase over mine, and still be error-free. So his logic gates just function better?

Probably. There could be other factors (e.g. cleaner power, inaccuracy in the thermal measurements), but assuming you were using otherwise-identical systems, his CPU's circuits are faster.

All this is well and good, but it's not quantitative. Just what is the rate of decay?
You need a pretty large sample size to get good numbers. The people who do these kinds of tests tend not to release the results. There are a huge number of overclockers, but the methodology of collecting anecdotes is almost useless from a scientific perspective.

Is it even worth it to overclock to the max if it means you can quickly burn out the CPU? Someone mentioned in the Q6600 @ 4GHz thread that this overclock futureproofs this setup for years to come. But does it really? The thing might only be able to do 3.6GHz, what, a few years, a year, a few months, from now?

That's a good question, and it'll be a while before the OC community really gets an answer (if it does at all). How many Core2's have been operating continuously under load, with large overclocks for 1.5 years? Intel probably knows what sort of decay to expect under the conditions people are using... but unless you go buy 1000 CPUs and do controlled tests, you won't get a reliable answer.
 

Foxery

Golden Member
Jan 24, 2008
1,709
0
0
Originally posted by: Rubycon
Next step - is it possible to write a program to detect damage or change upon the onset of electromigration?

For example - a piece of software written that can not only stress the CPU but increase its FSB or multi (unlocked chips) and VCORE at the same time while checking for errors.

This is more or less what Prime95/Orthos does now; it runs a long series of calculations with known answers. The program stops when it gets a wrong answer, and that's your indication that you've pushed too far.

It would be interesting to do this on a fresh cpu then run it highly overvolted and overclocked under high stress (Linpack) for 180 days or more, then profile it again.

Aye. From Ctho's descriptions, it sounds like voltage is the killer, not the clock. Makes sense to me. I would also take this to imply that overclocking while remaining at a CPU's stock voltage is very "safe," i.e. should cause very little degradation.
 

Rubycon

Madame President
Aug 10, 2005
17,768
485
126
Originally posted by: Foxery
Originally posted by: Rubycon
Next step - is it possible to write a program to detect damage or change upon the onset of electromigration?

For example - a piece of software written that can not only stress the CPU but increase its FSB or multi (unlocked chips) and VCORE at the same time while checking for errors.

This is more or less what Prime95/Orthos does now; it runs a long series of calculations with known answers. The program stops when it gets a wrong answer, and that's your indication that you've pushed too far.

It would be interesting to do this on a fresh cpu then run it highly overvolted and overclocked under high stress (Linpack) for 180 days or more, then profile it again.

Aye. From Ctho's descriptions, it sounds like voltage is the killer, not the clock. Makes sense to me. I would also take this to imply that overclocking while remaining at a CPU's stock voltage is very "safe," i.e. should cause very little degradation.

Something I've found about Prime95 is it does not stress the CPU nearly as much as Linpack. This could explain why a one week prime95 stable system can still crash or produce errors.

I know this is probably confidential information but what does the manufacturer use to determine max cpu speed?
 