Mimsy Were the Borogoves

Mimsy Were the Technocrats: As long as we keep talking about it, it’s technology.

Doubling down on failure

Jerry Stratton, April 11, 2010

Failure happens when we pretend something cannot fail. There is no success without accepting the possibility of failure—without accepting that possibility, you’ll freeze when failure inevitably occurs, or you’ll fall back on a pre-calculated response that exacerbates the problem because it doesn’t allow for failure. You’ll either not notice that failure has occurred, or you’ll pretend that it isn’t really a failure. Minor setbacks become major meltdowns.

When a business is too big to fail, its owners take greater risks, and continue doubling down on failure. Why not? Their upside is infinite, and the downside is covered by someone else. And we end up with economic meltdowns.

Here’s an interesting technological parable I ran across a couple of months ago. Malaysia Airlines Flight 124 out of Perth, Western Australia, was a Boeing 777 whose air data inertial reference unit (ADIRU) had six accelerometers. The software was designed to detect bad information when one failed, so the accelerometers were at least partially redundant. One had failed years earlier and, because they were redundant, was never replaced; the software continued to work fine: it ignored bad data from the one bad accelerometer in favor of good data from the five good accelerometers. On August 1, 2005, a second accelerometer failed. The software now started using data from the bad accelerometers. The aircraft suddenly pitched nose up and climbed rapidly, then dropped 4,000 feet, then rose 2,000 feet.

The comments on the IEEE article about the incident say things like “software engineers, guard your inputs and bound your outputs!” But I suspect that the real problem was less about programmable boundaries and more about political ones. The article doesn’t say how the software knew that one accelerometer was bad, but that’s probably the first problem: software can’t know that an accelerometer has failed; it can only guess, by comparing data from all of the accelerometers and tossing outliers (or by using some other algorithm to recognize bad data). When two go bad, the software has two of its six inputs providing bad information, and less ability to detect bad outliers.
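
The report doesn’t publish the ADIRU’s actual fault-detection logic, so take this as a sketch, in Python, of the kind of consensus check described above: compare each unit to the group and toss anything that strays too far. The threshold and the names are assumptions, not Boeing’s code.

    import statistics

    DEVIATION_LIMIT = 0.5  # hypothetical threshold, in g; the real limit isn't public


    def detect_bad_accelerometers(readings):
        """Flag accelerometers whose output strays too far from the group consensus.

        `readings` maps an accelerometer id to its current output in g.
        Returns the set of ids that look like outliers.
        """
        consensus = statistics.median(readings.values())
        return {unit for unit, value in readings.items()
                if abs(value - consensus) > DEVIATION_LIMIT}


    # With one bad unit out of six, the consensus is still dominated by good data:
    readings = {1: 1.00, 2: 1.01, 3: 0.99, 4: 1.00, 5: 9.80, 6: 1.02}
    print(detect_bad_accelerometers(readings))  # {5}

Every accelerometer that goes bad is one less good vote in that consensus, so the check gets weaker exactly when it’s needed most.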

The programmers most likely were checking the incoming data; that’s how the software knew that the bad accelerometer’s data was out of bounds. Judging from the Aviation Safety Investigation Report, what happened was that one accelerometer failed and no one cared: the software detected the failure every time the system started up and isolated the bad accelerometer. But when a second failure occurred soon after startup, the system was left unable to re-detect that the first one had failed.

The ADIRU in the B777 aircraft was a fault tolerant, system redundant unit. The ADIRU had internal system redundancy and automatically made allowances for internal component faults to ensure the unit’s overall functionality.

With only one erroneous input, the system was designed to automatically stop accepting that input and divert to another input source for information. That event would not require any action by the flight crew, and was intended to minimise the number of checklist items that a crew would need to action.

The certification of the ADIRU operational program software (OPS) was dependent on it being tested against the requirements specified in the initial design. The conditions involved in this event were not identified in the testing requirements, so were not tested.

When you decide that something can’t fail, you must identify and guard against all possible failures. This is impossible. You have thus guaranteed failure.

Fortunately for the crew and passengers, the bad data from the accelerometers was mitigated by good data from other sensors—something that wasn’t considered necessary but had been put in place anyway.

The effect of the software error was partially offset by the inclusion of mid-value select (MVS) within the primary flight computer. The MVS function was included in the primary flight computer to moderate the effect of anomalous outputs from the ADIRU. Analysis and testing during initial development indicated that these theorized outputs could not occur, and the MVS function was deemed no longer necessary. However, a decision was made by the aircraft manufacturer to retain the MVS function in the PFC.

The mitigating effects of the mid-value select and secondary attitude and air-data reference unit on the primary flight computer response to the erroneous accelerometer outputs was not an intended function, but did prevent a more severe upset event from occurring.
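
Mid-value select is a common voting scheme in redundant control systems: take three versions of the same signal and pass along the middle one, so that no single wild value can ever be the one selected. A minimal sketch in Python, not Boeing’s implementation:

    def mid_value_select(a, b, c):
        """Return the middle of three redundant signals.

        A single wild value is always either the highest or the lowest of the
        three, so the selected output stays within the range of the two
        healthy signals.
        """
        return sorted((a, b, c))[1]

    # One wildly erroneous channel is simply outvoted:
    print(mid_value_select(1.00, 1.02, 98.6))  # 1.02

It’s exactly the kind of belt-and-suspenders check that was deemed unnecessary and, fortunately, kept anyway.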

What to do when accelerometers fail is a policy decision that needs to come from outside the software department. In this case, because of the redundancy in the system, one accelerometer failing wasn’t an issue that needed fixing. But that’s only the most obvious policy decision. That the system reset its knowledge of bad data on every restart was also a policy decision. That the reset was automatic and—probably—unable to be turned off manually by the pilot was another policy decision.
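
To make that restart decision concrete, here is a toy sketch in Python of a fault list that is rebuilt from scratch at every power-up. The names are hypothetical; it illustrates the policy choice, not the ADIRU’s actual design.

    class SensorPool:
        """Toy model of the restart policy described above; names are hypothetical."""

        def __init__(self, sensor_ids):
            self.sensor_ids = set(sensor_ids)
            self.isolated = set()  # units to ignore; rebuilt at every power-up

        def power_up(self, detect_faults):
            # Policy decision: forget every previously detected failure and try to
            # rediscover it from scratch. A different policy would reload this set
            # from non-volatile memory before adding any newly detected faults.
            self.isolated = set(detect_faults())

        def trusted(self, readings):
            # Whatever isn't currently isolated is trusted, including a unit that
            # failed years ago but slipped past this flight's detection pass.
            return {s: v for s, v in readings.items()
                    if s in self.sensor_ids and s not in self.isolated}

Latching the fault list across flights would be roughly the same amount of code; the difference is purely a decision about what the system is allowed to forget.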

The problem here wasn’t that the failed component remained in place; it was that the system—both people and software—pretended there was no failure. When a second failure occurred, it was more catastrophic because the system was designed to forget failure and to keep trying the same solution even after the failure had doubled.

Accept failure; embrace it; otherwise, you’re doomed to fail spectacularly. Success doesn’t mean pouring money into bad systems. It means letting failures fail so that the failure is obvious and remedial measures can be taken. New solutions can be found, but only when we realize that the old solutions aren’t working. Allow competing solutions; imagine if the B777 had only used data from the accelerometers. The flight would have crashed.

“Too big to fail” and “too important to fail” are never true. Anything can fail. What we really mean when we say that is that we’re unwilling to acknowledge when failure occurs. We’re unwilling to correct our failures, and we’re unwilling to start over to build a better system with what we’ve learned, or should have learned. We try to pretend that we can fix failure by adding new layers of failure.

In technology, I’ve noticed the very obvious axiom that it is easy to take a working, simple, easy-to-understand system and transform it into a complex, expert-only system. But once that happens, it’s difficult to reverse the mistake. Outside of closed systems, the simple system tends to win because of market dynamics. Which is why people who want control prefer closed systems. Whether they’re technology experts or politicians, they prefer broken systems that nobody can correct. The problem with these systems is that they never become less complex. When they fail, we add more layers of failure to them rather than rethinking the problem.

I can just remember life under the government-created monopoly called AT&T. We had to pay special monthly fees simply to use phones not sold by them; so we didn’t, and so no one seriously marketed phones to the general consumer. I remember getting my first Conair phone in about 1985, after the AT&T divestiture, and wondering if it would really work.[1]

Nowadays AT&T would be “too big to fail”. And instead of forcing it out of its monopoly—paving the way for the plethora of phone options available to us today—we’d be propping it up like the Postal Service, without any idea of the lower prices, technological advances, and useful features waiting for us if we just let failure occur. The same thing is set to happen to health care and is happening in the auto industry.

With health care, we’re adding new layers of failure. With the auto industry, instead of allowing failed companies to fail, making way for new, innovative startups, we’ve doubled down on failure. People aren’t going to stop buying cars if GM and Chrysler go out of business. They’ll continue buying cars from whoever rises up to take the place of those failed companies. What great automotive advancements are we missing because we’ve doubled down on failure?

  [1] It did. I kept that phone from move to move for, I think, about ten to fifteen years.
