I initially wrote: The clock speed in and of itself is almost irrelevant if you keep the voltage the same (and can cool it enough to keep the temperature the same). While there's technically a roughly linear increase in wear-out for some of the wear-out mechanisms (since at 2x clock speed, all the transistors and wires are doing their thing twice as often in a given period of time), I think the margins are such that it shouldn't really matter. Note that the "temperature" would be the real temperature at the devices and metal layers, NOT just the temperature at the top of the die. ... but there are some important caveats and I'm having a hard time coming up with a way to explain them well.
Originally posted by: Rubycon
Originally posted by: Foxery
Originally posted by: Rubycon
Next step - is it possible to write a program to detect damage or change upon the onset of electromigration?
For example - a piece of software written that can not only stress the CPU but increase its FSB or multi (unlocked chips) and VCORE at the same time while checking for errors.
This is more or less what Prime95/Orthos does now; it runs a long series of calculations with known answers. The program stops when it gets a wrong answer, and that's your indication that you've pushed too far.
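For the curious, here's a minimal sketch in C of that "known answer" idea (not Prime95's actual algorithm, just the general shape): repeat the same computation over and over, compare against a result you already know, and treat any mismatch as the hardware telling you it's been pushed too far.

#include <stdio.h>

int main(void) {
    const unsigned long n = 1000000UL;
    const unsigned long expected = n * (n + 1) / 2;   /* known closed-form answer */
    for (unsigned long iter = 0; ; iter++) {
        volatile unsigned long sum = 0;               /* volatile so the loop isn't optimized away */
        for (unsigned long i = 1; i <= n; i++)
            sum += i;                                 /* same arithmetic every pass */
        if (sum != expected) {
            printf("error at iteration %lu: got %lu, expected %lu\n",
                   iter, (unsigned long)sum, expected);
            return 1;                                 /* the hardware computed a wrong answer */
        }
    }
}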
It would be interesting to do this on a fresh cpu then run it highly overvolted and overclocked under high stress (Linpack) for 180 days or more, then profile it again.
Aye. From Ctho's descriptions, it sounds like voltage is the killer, not the clock. Makes sense to me. I would also take this to imply that overclocking while remaining at a CPU's stock voltage is very "safe," i.e. it should cause very little degradation.
Something I've found about Prime95 is it does not stress the CPU nearly as much as Linpack. This could explain why a one-week Prime95-stable system can still crash or produce errors.
I know this is probably confidential information but what does the manufacturer use to determine max cpu speed?
Since manufacturers know what the actual circuits look like, they can write test patterns that exercise the absolute worst paths on a chip. Their test patterns also obtain a very high "coverage", exercising all (or very nearly all) of the logic on a chip. These paths would be very hard to hit with dumb luck, and even if you know a bit about chip design, you're still unlikely to be able to do it. Obtaining good coverage of the logic is impossible without knowing the exact circuit.
Manufacturers also have "back-door" methods of putting the chip into states that may be very hard to get into using normal software (e.g. getting specific values in specific places, all at the same time). While a test program would have to jump through hoops to put the processor into a worst-case configuration, the manufacturer has the ability to easily set every state element ("flip flop") into a particular state, then tick the clock at a given speed, and read the values held in every state element and make sure they all worked at the target speed.
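A purely hypothetical toy in C to make that contrast concrete: "internal" stands in for some buried piece of machine state that software can only nudge indirectly through normal operation, while scan access just writes it directly. Real scan chains are hardware and look nothing like this; the point is only why "specific values in specific places, all at the same time" is so much easier through the back door.

#include <stdint.h>
#include <stdio.h>

/* stand-in for "how normal operation updates the hidden state" */
static uint16_t step(uint16_t internal, uint16_t input) {
    return (uint16_t)(internal * 31u + input);
}

int main(void) {
    const uint16_t target = 0xDEAD;   /* the state we want the machine in for a test */

    /* "software" approach: hunt for an input that happens to land on the target state */
    uint16_t state = 0;
    long tries = 0;
    for (uint32_t input = 0; input <= 0xFFFF; input++) {
        if (step(state, (uint16_t)input) == target)
            break;
        tries++;
    }
    printf("software needed %ld guesses to reach state %04X\n", tries, target);

    /* "scan" approach: just set the state element and proceed with the test */
    state = target;
    printf("scan access set state %04X in one step\n", state);
    return 0;
}

Even this toy has only 16 bits of hidden state and a known starting point; a real chip has millions of state elements, which is why the direct-access route matters so much.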
For a simple 32-bit ripple carry adder, of the 2^64 possible input combinations, there are something like 2^32 patterns that hit the worst path (1 in 4 billion) - and that's assuming the previous values that were added during the last cycle have certain properties. It's pretty hard to hit a 1-in-4-billion shot by dumb luck. With hysteresis (an effect that's significant for SOI processes), hitting the absolute worst path becomes even harder.
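If you want to convince yourself of the rarity, here's a toy count in C, shrunk to an 8-bit adder so the search can be exhaustive. "Worst path" here means a carry generated at bit 0 that has to propagate through every higher bit; the exact count depends on how you define the path, but the flavor is the same as the 32-bit numbers above.

#include <stdio.h>

#define W 8   /* adder width; 8 bits keeps the exhaustive search tiny */

int main(void) {
    unsigned long worst = 0, total = 0;
    for (unsigned a = 0; a < (1u << W); a++) {
        for (unsigned b = 0; b < (1u << W); b++) {
            /* worst path: a carry generated at bit 0 (both inputs 1) that must
             * propagate through every remaining bit (inputs differ), so the top
             * carry-out can't settle until bit 0's carry ripples all the way up */
            int hits_worst_path = ((a & 1) && (b & 1));
            for (int i = 1; i < W && hits_worst_path; i++)
                hits_worst_path = ((a >> i) ^ (b >> i)) & 1;
            total++;
            if (hits_worst_path)
                worst++;
        }
    }
    printf("%lu of %lu input pairs (1 in %lu) hit the full carry chain\n",
           worst, total, total / worst);
    return 0;
}

For the 8-bit case this prints 128 of 65536 pairs (1 in 512); the fraction keeps halving as the adder gets wider, which is how you end up in billions-to-one territory at 32 bits.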
Consider a structure on a chip where entries fill up from the 0th to the 15th, and any one of the entries can be read out from the right-hand side. Let's pretend this is a queue holding outstanding instructions to issue. You want to read out the oldest one at all times, and normally, instructions enter the 0th slot and leave right away (and all of the other entries shift right to fill the hole). Sometimes, though, an instruction will be blocked (let's say it needs a value that wasn't in the CPU cache), and it gets stuck. Other instructions queue up behind it, and we'll only allow them to execute if they don't need data that isn't ready yet.
In the real world, highly-optimized code like Prime95 will be designed specifically not to fill up this queue, because the events that make this queue fill up force a lot of resources on the chip to sit idle (wasted performance). It might never read from the 15th entry. Other software might routinely fill up this buffer and read from the 15th entry. Note that the wire from the 15th entry is going to be longer, making its access take longer. However, software that behaves like this is going to keep the CPU pretty cool (the CPU is most likely going to be "stalling" a lot), so even though you're hitting the longest path, the transistors are cold so everything is fast. You might not see anything crash until you're running something that's normally highly-optimized and heats everything up, but sometimes changes its behavior in another part of the program and fills up this buffer (or maybe you're running Prime95 and an unoptimized program at the same time, and Windows multitasks between them).
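Here's a toy model of that kind of queue in C (my own sketch, not any real CPU's issue queue): a blocked entry drifts out toward slot 15 as work piles up behind it, and the long wire from slot 15 only gets used once the structure is completely full.

#include <stdbool.h>
#include <stdio.h>

#define SLOTS 16

struct entry { bool valid; bool ready; int tag; };

static struct entry q[SLOTS];

/* new work enters at slot 0; everything else shifts one slot toward the "old" end */
static bool insert(int tag, bool ready) {
    if (q[SLOTS - 1].valid)
        return false;                        /* queue full: the front end stalls */
    for (int i = SLOTS - 1; i > 0; i--)
        q[i] = q[i - 1];
    q[0] = (struct entry){ true, ready, tag };
    return true;
}

/* read out the oldest ready entry; the wire from slot 15 is the long one */
static int issue(void) {
    for (int i = SLOTS - 1; i >= 0; i--) {
        if (q[i].valid && q[i].ready) {
            int tag = q[i].tag;
            for (int j = i; j > 0; j--)      /* collapse the hole left behind */
                q[j] = q[j - 1];
            q[0].valid = false;
            printf("issued tag %d from slot %d\n", tag, i);
            return tag;
        }
    }
    return -1;                               /* nothing ready: the CPU stalls */
}

int main(void) {
    /* one instruction blocked on a cache miss, then ready work piles up behind it */
    insert(0, false);
    for (int t = 1; t < SLOTS; t++)
        insert(t, true);
    issue();                     /* the oldest *ready* entry sits in slot 14 */
    q[SLOTS - 1].ready = true;   /* the miss data finally arrives... */
    issue();                     /* ...and only now is the long wire from slot 15 used */
    return 0;
}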
Both of these examples are contrived and simplified.
There are also parts of the chip that real-world software almost never uses.
Going back to mathematical units, there are special types of floating point numbers that are very rare, called "denormals", that behave very slightly differently from regular floating point numbers. There are also values used to represent "Not a Number" (NaN), which are handled specially. Normal programs try to avoid using both of these features. If a critical path was in the logic that handled these cases, Prime95 would never hit it.
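A quick standard-C demonstration of both oddballs, in case they're unfamiliar: denormals live below FLT_MIN, and NaNs propagate through arithmetic. Ordinary numeric code essentially never produces either.

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    volatile float zero = 0.0f;            /* volatile so the division happens at run time */
    float denormal      = FLT_MIN / 4.0f;  /* smaller than the smallest normal float */
    float not_a_number  = zero / zero;     /* the classic way to produce a NaN */

    printf("FLT_MIN   = %g\n", FLT_MIN);
    printf("denormal  = %g (%s)\n", denormal,
           fpclassify(denormal) == FP_SUBNORMAL ? "classified as subnormal"
                                                : "not subnormal here (flushed to zero?)");
    printf("NaN       = %f (isnan = %d)\n", not_a_number, isnan(not_a_number));
    printf("NaN + 1.0 = %f (NaN propagates through arithmetic)\n", not_a_number + 1.0f);
    return 0;
}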
Another example of logic that's almost never used is logic that handles "segments" (a feature of x86 that makes the baby Jesus cry). It's not used by Windows or Linux (or any modern OS, except maybe OpenBSD, which IIRC only uses limited parts of it and only on certain CPUs). That said, having a critical path in logic that's rarely used seems like it'd be pretty dumb, because making that path take an extra cycle would obviously not affect chip performance significantly.
Something I've found about Prime95 is it does not stress the CPU nearly as much as Linpack.
It's hard to come up with a concise definition of "stress" that's accurate. Are you stressing the SSE execution units? The integer units? The front-end (branch prediction + instruction decoder)? For all you know, the critical path could be anywhere, and even though optimized LINPACK code generates a lot of heat, it probably only really pounds on the SSE units. IIRC, the "burnk7" program, which maximized Athlon temperatures, left most of the integer units almost completely idle.
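As a crude illustration of how different "stress" can be, here are two toy loops in C (my own examples, nothing like the real LINPACK or burnk7 kernels): one leans on the floating-point hardware, the other on the integer multiplier and shifter. A chip can run one of these for hours and still have a marginal path that only the other one touches.

#include <stdint.h>
#include <stdio.h>

static double fp_stress(long iters) {
    double a = 1.000001, b = 0.999999, sum = 0.0;
    for (long i = 0; i < iters; i++) {
        sum += a * b;                 /* keeps the FP multiplier and adder busy */
        a *= 1.0000001;
        b *= 0.9999999;
    }
    return sum;
}

static uint64_t int_stress(long iters) {
    uint64_t x = 0x123456789ABCDEF0ull;
    for (long i = 0; i < iters; i++) {
        x = x * 6364136223846793005ull + 1442695040888963407ull; /* integer multiply/add */
        x ^= x >> 33;                                            /* shifts and XORs */
    }
    return x;
}

int main(void) {
    printf("fp  result: %g\n",   fp_stress(100000000L));
    printf("int result: %llx\n", (unsigned long long)int_stress(100000000L));
    return 0;
}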