I initially wrote: The clock speed in and of itself is almost irrelevant if you keep the voltage the same (and can cool it enough to keep the temperature the same). While there's technically a roughly linear increase in wear-out for some of the wear-out mechanisms (since at 2x clock speed, all the transistors and wires are doing their thing twice as often in a given period of time), I think the margins are such that it shouldn't really matter. Note that the "temperature" would be the real temperature at the devices and metal layers, NOT just the temperature at the top of the die. ... but there are some important caveats and I'm having a hard time coming up with a way to explain them well.
Originally posted by: Rubycon
Originally posted by: Foxery
Originally posted by: Rubycon
Next step - is it possible to write a program to detect damage or change upon the onset of electromigration?
For example - a piece of software written that can not only stress the CPU but increase its FSB or multi (unlocked chips) and VCORE at the same time while checking for errors.
This is more or less what Prime95/Orthos does now; it runs a long series of calculations with known answers. The program stops when it gets a wrong answer, and that's your indication that you've pushed too far.
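For the curious, here's a minimal sketch in C of that "known answer" idea (not Prime95's actual algorithm, just the general shape): repeat the same computation over and over, compare against a result you already know, and treat any mismatch as the hardware telling you it's been pushed too far.

#include <stdio.h>

int main(void) {
    const unsigned long n = 1000000UL;
    const unsigned long expected = n * (n + 1) / 2;   /* known closed-form answer */
    for (unsigned long iter = 0; ; iter++) {
        volatile unsigned long sum = 0;               /* volatile so the loop isn't optimized away */
        for (unsigned long i = 1; i <= n; i++)
            sum += i;                                 /* same arithmetic every pass */
        if (sum != expected) {
            printf("error at iteration %lu: got %lu, expected %lu\n",
                   iter, (unsigned long)sum, expected);
            return 1;                                 /* the hardware computed a wrong answer */
        }
    }
}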
It would be interesting to do this on a fresh cpu then run it highly overvolted and overclocked under high stress (Linpack) for 180 days or more, then profile it again.
Aye. From Ctho's descriptions, it sounds like voltage is the killer, not the clock. Makes sense to me. I would also take this to imply that overclocking while remaining at a CPU's stock voltage is very "safe," i.e. it should cause very little degradation.
Something I've found about Prime95 is it does not stress the CPU nearly as much as Linpack. This could explain why a one-week Prime95-stable system can still crash or produce errors.
I know this is probably confidential information but what does the manufacturer use to determine max cpu speed?
Since manufacturers know what the actual circuits look like, they can write test patterns that exercise the absolute worst paths on a chip. Their test patterns also obtain a very high "coverage", exercising all (or very nearly all) of the logic on a chip. These paths would be very hard to hit with dumb luck, and even if you know a bit about chip design, you're still unlikely to be able to do it. Obtaining good coverage of the logic is impossible without knowing the exact circuit.
Manufacturers also have "back-door" methods of putting the chip into states that may be very hard to get into using normal software (e.g. getting specific values in specific places, all at the same time). While a test program would have to jump through hoops to put the processor into a worst-case configuration, the manufacturer has the ability to easily set every state element ("flip flop") into a particular state, then tick the clock at a given speed, and read the values held in every state element and make sure they all worked at the target speed.
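A purely hypothetical toy in C to make that contrast concrete: "internal" stands in for some buried piece of machine state that software can only nudge indirectly through normal operation, while scan access just writes it directly. Real scan chains are hardware and look nothing like this; the point is only why "specific values in specific places, all at the same time" is so much easier through the back door.

#include <stdint.h>
#include <stdio.h>

/* stand-in for "how normal operation updates the hidden state" */
static uint16_t step(uint16_t internal, uint16_t input) {
    return (uint16_t)(internal * 31u + input);
}

int main(void) {
    const uint16_t target = 0xDEAD;   /* the state we want the machine in for a test */

    /* "software" approach: hunt for an input that happens to land on the target state */
    uint16_t state = 0;
    long tries = 0;
    for (uint32_t input = 0; input <= 0xFFFF; input++) {
        if (step(state, (uint16_t)input) == target)
            break;
        tries++;
    }
    printf("software needed %ld guesses to reach state %04X\n", tries, target);

    /* "scan" approach: just set the state element and proceed with the test */
    state = target;
    printf("scan access set state %04X in one step\n", state);
    return 0;
}

Even this toy has only 16 bits of hidden state and a known starting point; a real chip has millions of state elements, which is why the direct-access route matters so much.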
For a simple 32-bit ripple carry adder, of the 2^64 possible input combinations, there are something like 2^32 patterns that hit the worst path (1 in 4 billion) - and that's assuming the previous values that were added during the last cycle have certain properties. It's pretty hard to hit a 1-in-4-billion shot by dumb luck. With hysteresis (an effect that's significant for SOI processes), hitting the absolute worst path becomes even harder.
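If you want to convince yourself of the rarity, here's a toy count in C, shrunk to an 8-bit adder so the search can be exhaustive. "Worst path" here means a carry generated at bit 0 that has to propagate through every higher bit; the exact count depends on how you define the path, but the flavor is the same as the 32-bit numbers above.

#include <stdio.h>

#define W 8   /* adder width; 8 bits keeps the exhaustive search tiny */

int main(void) {
    unsigned long worst = 0, total = 0;
    for (unsigned a = 0; a < (1u << W); a++) {
        for (unsigned b = 0; b < (1u << W); b++) {
            /* worst path: a carry generated at bit 0 (both inputs 1) that must
             * propagate through every remaining bit (inputs differ), so the top
             * carry-out can't settle until bit 0's carry ripples all the way up */
            int hits_worst_path = ((a & 1) && (b & 1));
            for (int i = 1; i < W && hits_worst_path; i++)
                hits_worst_path = ((a >> i) ^ (b >> i)) & 1;
            total++;
            if (hits_worst_path)
                worst++;
        }
    }
    printf("%lu of %lu input pairs (1 in %lu) hit the full carry chain\n",
           worst, total, total / worst);
    return 0;
}

For the 8-bit case this prints 128 of 65536 pairs (1 in 512); the fraction keeps halving as the adder gets wider, which is how you end up in billions-to-one territory at 32 bits.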
Consider a structure on a chip where entries fill up from the 0th to the 15th, and any one of the entries can be read out from the right-hand side. Let's pretend this is a queue holding outstanding instructions to issue. You want to read out the oldest one at all times, and normally, instructions enter the 0th slot and leave right away (and all of the other entries shift right to fill the hole). Sometimes, though, an instruction will be blocked (let's say it needs a value that wasn't in the CPU cache), and it gets stuck. Other instructions queue up behind it, and we'll only allow them to execute if they don't need data that isn't ready yet.
In the real world, highly-optimized code like Prime95 will be designed specifically not to fill up this queue, because the events that make this queue fill up force a lot of resources on the chip to sit idle (wasted performance). It might never read from the 15th entry. Other software might routinely fill up this buffer and read from the 15th entry. Note that the wire from the 15th entry is going to be longer, making its access take longer. However, software that behaves like this is going to keep the CPU pretty cool (the CPU is most likely going to be "stalling" a lot), so even though you're hitting the longest path, the transistors are cold so everything is fast. You might not see anything crash until you're running something that's normally highly-optimized and heats everything up, but sometimes changes its behavior in another part of the program and fills up this buffer (or maybe you're running Prime95 and an unoptimized program at the same time, and Windows multitasks between them).
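Here's a toy model of that kind of queue in C (my own sketch, not any real CPU's issue queue): a blocked entry drifts out toward slot 15 as work piles up behind it, and the long wire from slot 15 only gets used once the structure is completely full.

#include <stdbool.h>
#include <stdio.h>

#define SLOTS 16

struct entry { bool valid; bool ready; int tag; };

static struct entry q[SLOTS];

/* new work enters at slot 0; everything else shifts one slot toward the "old" end */
static bool insert(int tag, bool ready) {
    if (q[SLOTS - 1].valid)
        return false;                        /* queue full: the front end stalls */
    for (int i = SLOTS - 1; i > 0; i--)
        q[i] = q[i - 1];
    q[0] = (struct entry){ true, ready, tag };
    return true;
}

/* read out the oldest ready entry; the wire from slot 15 is the long one */
static int issue(void) {
    for (int i = SLOTS - 1; i >= 0; i--) {
        if (q[i].valid && q[i].ready) {
            int tag = q[i].tag;
            for (int j = i; j > 0; j--)      /* collapse the hole left behind */
                q[j] = q[j - 1];
            q[0].valid = false;
            printf("issued tag %d from slot %d\n", tag, i);
            return tag;
        }
    }
    return -1;                               /* nothing ready: the CPU stalls */
}

int main(void) {
    /* one instruction blocked on a cache miss, then ready work piles up behind it */
    insert(0, false);
    for (int t = 1; t < SLOTS; t++)
        insert(t, true);
    issue();                     /* the oldest *ready* entry sits in slot 14 */
    q[SLOTS - 1].ready = true;   /* the miss data finally arrives... */
    issue();                     /* ...and only now is the long wire from slot 15 used */
    return 0;
}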
Both of these examples are contrived and simplified.
There are also parts of the chip that real-world software almost never uses.
Going back to mathematical units, there are special types of floating point numbers that are very rare, called "denormals", that behave very slightly differently from regular floating point numbers. There are also values used to represent "Not a Number" (NaN), which are handled specially. Normal programs try to avoid using both of these features. If a critical path was in the logic that handled these cases, Prime95 would never hit it.
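A quick standard-C demonstration of both oddballs, in case they're unfamiliar: denormals live below FLT_MIN, and NaNs propagate through arithmetic. Ordinary numeric code essentially never produces either.

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    volatile float zero = 0.0f;            /* volatile so the division happens at run time */
    float denormal      = FLT_MIN / 4.0f;  /* smaller than the smallest normal float */
    float not_a_number  = zero / zero;     /* the classic way to produce a NaN */

    printf("FLT_MIN   = %g\n", FLT_MIN);
    printf("denormal  = %g (%s)\n", denormal,
           fpclassify(denormal) == FP_SUBNORMAL ? "classified as subnormal"
                                                : "not subnormal here (flushed to zero?)");
    printf("NaN       = %f (isnan = %d)\n", not_a_number, isnan(not_a_number));
    printf("NaN + 1.0 = %f (NaN propagates through arithmetic)\n", not_a_number + 1.0f);
    return 0;
}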
Another example of logic that's almost never used is logic that handles "segments" (a feature of x86 that makes the baby Jesus cry). It's not used by Windows or Linux (or any modern OS, except maybe OpenBSD, which IIRC only uses limited parts of it and only on certain CPUs). That said, having a critical path in logic that's rarely used seems like it'd be pretty dumb, because making that path take an extra cycle would obviously not affect chip performance significantly.
Something I've found about Prime95 is it does not stress the CPU nearly as much as Linpack.
It's hard to come up with a concise definition of "stress" that's accurate. Are you stressing the SSE execution units? The integer units? The front-end (branch prediction + instruction decoder)? For all you know, the critical path could be anywhere, and even though optimized LINPACK code generates a lot of heat, it probably only really pounds on the SSE units. IIRC, the "burnk7" program, which maximized Athlon temperatures, left most of the integer units almost completely idle.
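As a crude illustration of how different "stress" can be, here are two toy loops in C (my own examples, nothing like the real LINPACK or burnk7 kernels): one leans on the floating-point hardware, the other on the integer multiplier and shifter. A chip can run one of these for hours and still have a marginal path that only the other one touches.

#include <stdint.h>
#include <stdio.h>

static double fp_stress(long iters) {
    double a = 1.000001, b = 0.999999, sum = 0.0;
    for (long i = 0; i < iters; i++) {
        sum += a * b;                 /* keeps the FP multiplier and adder busy */
        a *= 1.0000001;
        b *= 0.9999999;
    }
    return sum;
}

static uint64_t int_stress(long iters) {
    uint64_t x = 0x123456789ABCDEF0ull;
    for (long i = 0; i < iters; i++) {
        x = x * 6364136223846793005ull + 1442695040888963407ull; /* integer multiply/add */
        x ^= x >> 33;                                            /* shifts and XORs */
    }
    return x;
}

int main(void) {
    printf("fp  result: %g\n",   fp_stress(100000000L));
    printf("int result: %llx\n", (unsigned long long)int_stress(100000000L));
    return 0;
}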