IDC, GaiaHunter must have been referring to this question:
Thanks for the information. So do you believe, in fact, that these teething problems can very well last for a month or more before being detected? Charlie was particularly adamant in his article about how these "problems" should only take about 1 week, tops.
The best I can gather from Charlie's innuendo-laden article is that he's promulgating a rather embellished strawman argument.
Whether the strawman is of his own invention or he is simply the mouthpiece of some disgruntled industry veterans, I can't really determine, nor do I care to invest the time to discern the difference.
There is a whole gamut of calibration issues that can and do occur (not sure if you know my past, but I was a process development engineer at Texas Instruments, so I have first-hand experience in the subject matter). What I get out of the shadowy stories so far is that someone is implying that one of the most innocuous and simplistic calibration issues transpired, or is claimed to have transpired, and the "meat" of the story is that it took "too long", per someone else's expectations, to resolve.
Do calibration issues like the one semi-described in Charlie's article occur? Yes, they do. Do they take a week to resolve? Yes, in some instances. Can they take a month to detect and resolve? Yes, in some instances.
Does it mean something sinister is afoot if it takes a month to detect and resolve? No, not at all.
Does it mean QA and metrology issues abound if it takes a month to detect and resolve? That depends entirely on the specific process metric that has been thrown off by the miscalibration. Only the most simplistic process metrics are readily detected at a metrology step, like a film thickness measurement (called a "tool qual") or a particle count measurement.
There are process metrics which require much more time-intensive analysis and characterization to quantify and to determine whether the process variation is a problem. Subtle shifts in a deposited film's density or refractive index can result in order-of-magnitude shifts in the lifetime reliability of the film, or in its leakage characteristics, etch rates, etc.
Even further, process integration deals with the effects of sequences of process events: a shift in process A might not affect the outcome of process A in any meaningful way, but it may have a knock-on effect down the line and cause process D to go awry. Isolating the problem to process D takes time; determining that process D is going awry because of a subtle shift in process A takes even more time.
Resolving those kinds of cause-and-effect chains (called "root-cause determination" in industry-speak) can take days, or it can take months. When it takes months, that doesn't mean something sinister is afoot; it speaks to the complexity of the underlying issue. 40nm is obviously a complex thing to master: count on one hand how many foundries have figured out enough of the issues with 40nm to have it healthy enough to put into production today. (You should arrive at an answer of "1".)
I guess it isn't so much a question as your opinion - you certainly didn't bash to the ground the possibility of a calibration issue causing delays in production/diminishing yields.
Absolutely, I wouldn't. That is the routine and mundane part; this happens all the time, during development as well as during production ramp. No two tools are the same, and no two chambers on the same tool are the same.
It takes time to figure out what the critical metrics are for a new process and a new tool; you can't characterize everything under the sun because resources are limited. So you make judicious choices about what is most critical for tool-matching and chamber-matching during new tool installs and releases. When something like this occurs, you call it a "lesson learned", and the checklist of things to verify are calibrated identically, chamber after chamber and tool after tool, grows by one.
Everything that was already on that checklist came from prior lessons learned. Node development is a cumulative process: the knowledge of how to make node N-1 production-worthy is applied to making node N (a more complex node) equally production-worthy.
What is interesting is that everything I write here is self-evident and known to everyone in the industry; I'm not bestowing upon you some great super-secret of the industrial world that makes the difference between great fabs and crappy ones. Everyone knows this stuff. So when Charlie goes off saying he's got engineering sources telling him differently, it just makes me sigh: he's either making up fictitious engineering sources, or his sources are so far out of the loop they shouldn't be passing themselves off as authorities on the subject in the first place (and Charlie should be smart enough to vet them so he doesn't get led astray).
At any rate, I don't doubt that Charlie believes something sinister is afoot; his opinion is expressed genuinely. And unless you've been involved first-hand in process development and tool releases for capacity ramps, I'm sure his "stories" seem to have some merit. But from where I'm sitting it is all mountains out of molehills: take a routine, mundane aspect of life in the fab and work it up to be so unique and rare that it is only explainable by invoking cloak-and-dagger sinister ne'er-do-wells.
I'm sure it helps that sensationalism generates more hits than simply reporting the mundane.
Let's assume they end up having 10x more available 58xx GPUs in 1-2 months. What kind of problems do you envision that could have reduced the yields of the 58xx series (I believe there is/was talk that the RV870 had decent yields which then went down, correct me if I'm wrong) but allow them to increase production 10x (or even double, triple, whatever) in such a short span? And forgive me if I'm asking a question you can't really answer/speculate on for whatever reasons, even if it is only because it could be so many things you could throw a dart at them to decide which, and you wish not to do so.
Well, I think first we need to make a distinction between yields and capacity. TSMC said the yields did not decrease; they simply failed to improve.
The chamber mismatch issue (assuming that is not fictitious as well) is a capacity issue: until you resolve the disconnect with the tool, you simply "blacklist" it from the available tools for production at that point in the flow. This reduces capacity, or if it was a new tool release then it simply means capacity did not increase above what it had been prior to the tool being brought into the fab.
My take on the situation is that there are two things here, unrelated at the process level, causing supply issues. One is that yields did not improve, and yield improvements are like "free" capacity increases: yield going from 30% to 60% means you get 2x more chips, the same effect as yields staying at 30% but doubling capacity and wafer starts.
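If it helps to see the arithmetic, here is a toy back-of-envelope sketch; the wafer-start and gross-die numbers are invented for illustration, and only the 30%/60% yield figures come from the example above.

```python
# Back-of-envelope: yield improvement as "free" capacity.
# Only the 30% -> 60% yields come from the example above;
# wafer starts and gross dice per wafer are hypothetical.

wafer_starts = 1000     # wafers started per month (illustrative)
gross_dice = 400        # die candidates per wafer (illustrative)

def good_dice(starts, gross, yield_frac):
    """Sellable dice = wafer starts x gross dice per wafer x yield."""
    return starts * gross * yield_frac

at_30 = good_dice(wafer_starts, gross_dice, 0.30)   # 120,000 good dice
at_60 = good_dice(wafer_starts, gross_dice, 0.60)   # 240,000 good dice

# Doubling yield doubles output -- exactly the same effect as
# doubling wafer starts while yield stays stuck at 30%:
assert at_60 == good_dice(2 * wafer_starts, gross_dice, 0.30)
```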
Obviously during the early stages of a new node's ramp both are happening in parallel: yield-limiting issues are being root-caused and fixes implemented, while new tools are being installed and capacity is increasing. If you hit a snag and yields don't improve, it will cause supply issues; if capacity doesn't increase (tools aren't released to production because there seems to be some kind of issue with them), then that will cause a supply issue too.
If both happen at the same time you don't just double your issues; it is a quadrupling (figuratively) of the problem, because the planned supply assumed the yield gain and the capacity gain would multiply together.
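To put figurative numbers on that multiplication (both 2x factors below are assumptions chosen only to mirror the yield example above, not TSMC's actual numbers):

```python
# Figurative "quadrupling": planned supply assumed a yield gain
# AND a capacity gain, and the two multiply. The 2x factors are
# hypothetical, not TSMC's actual numbers.

baseline = 1.0                  # normalized good-die output today
planned_yield_gain = 2.0        # e.g. 30% -> 60% yield
planned_capacity_gain = 2.0     # e.g. new tools double wafer starts

planned = baseline * planned_yield_gain * planned_capacity_gain  # 4.0
actual = baseline   # yields stalled AND the new tools weren't released

print(planned / actual)  # 4.0 -> actual supply is 1/4 of plan, not 1/2
```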
So yes, absolutely a sizable discrepancy between planned supply and actual supply can be created through the combination of some rather simplistic yield improvement delays and new tool release delays. It is not common but at the same time it is not rare.
Where yield gets conflated with capacity in the most recent debacle is that if you are releasing new tools to increase capacity, those new tools can cause yields to decline if they are not fully matched. So you must make a choice as a fab planner: (1) release the tool and increase capacity but take the added hit in yields, or (2) keep the tool in the engineering release phase (not released to production) and take the hit in capacity while keeping your yields from being impacted by the new tool's mismatch.
The choice between (1) and (2) above is actually made all the time during capacity expansions, and it changes dynamically on a day-by-day basis, because sometimes the yield hit is minor enough that the added capacity is worth it for the time being.
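As a sketch of how that call might be weighed on any given day; every number here is invented for illustration, and in reality the mismatched tool would only hit the wafers actually routed through it:

```python
# Hypothetical fab-planner trade-off: release a mismatched tool
# (option 1) vs. keep it on engineering hold (option 2).
# All numbers are invented for illustration.

gross_dice = 400        # die candidates per wafer (illustrative)
base_starts = 1000      # wafer starts/month without the new tool
extra_starts = 200      # added starts if the new tool is released
base_yield = 0.60       # line yield with matched tools only
mismatch_yield = 0.55   # line yield if the mismatched tool runs product

# Option 1: release the tool -- more starts, lower yield.
release = (base_starts + extra_starts) * gross_dice * mismatch_yield

# Option 2: hold the tool -- fewer starts, yield unaffected.
hold = base_starts * gross_dice * base_yield

print(round(release), round(hold))  # 264000 vs 240000 -> release wins today
```

Swap in a steeper assumed yield hit, say 0.45 instead of 0.55, and the comparison flips to 216,000 vs 240,000 good dice, which is exactly why the call gets revisited day by day.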
I know this is a lengthy post; I am trying to answer your question by empowering you with more background information than you probably wanted to be exposed to. If you want a more succinct answer, let me know and I will try to distill it down to a Cliff's Notes version.