Mimsy Were the Borogoves

Mimsy Were the Technocrats: As long as we keep talking about it, it’s technology.

Anticipating failure

Jerry Stratton, November 4, 2002

The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at and repair.—Douglas Adams

When Douglas Adams wrote that in his satire Mostly Harmless, he was talking about ventilation. Somewhere in the galaxy air conditioning had given way to “climate control”.

Something even sexier and smarter than air-conditioning… to be sure that mere people didn’t muck up the sophisticated calculations which the system was making on their behalf, all the windows in the buildings were built sealed shut.

“But what if we want to have the windows open?”

“You won’t want to have the windows open with new Breathe-O-Smart.”

“Yes, but supposing we just wanted to have them open for a little bit?”

“You won’t want to have them open even for a little bit. The new Breathe-O-Smart system will see to that.”

“Okay, so what if the Breathe-O-Smart breaks down or goes wrong or something?”

“Ah! One of the smartest features of the Breathe-O-Smart is that it cannot possibly go wrong. So. No worries on that score. Enjoy your breathing now, and have a nice day.”

When major heatwaves coincided with major failures of Breathe-O-Smart and people started dying of asphyxiation, Breathe-O-Smart Inc. “issued a statement that best results were achieved by using their systems in temperate climates.” The universal governments responded by requiring the warning label about “things that cannot possibly go wrong” on all mechanical, electrical, or digital devices.

I have heard, and been part of, nearly that exact conversation at various times when we in network services have planned our upgrades. Too often, we seal the windows shut. We assume that our upgrades will never fail. As a result, too often we practically design our upgrades and modifications to fail in the most spectacular way possible. We schedule our upgrades for late at night, when we’re tired, and then when we run into problems we implement fixes without even testing them in the real world. We leave thinking things are working, and the next morning users come in and find out that they can’t access their e-mail, or that e-mail to them is bouncing back, or that they’re losing e-mail completely.

Or worse, we schedule our upgrades for a Thursday or even Friday evening, virtually ensuring that any problems persist for the entire weekend and making it that much more time-consuming to fix them the next week.

Part of “making it possible to repair” an upgrade that goes wrong is making sure that the person fixing it is available. Wednesday night upgrades are better than pre-weekend upgrades; Monday, Tuesday, or Wednesday morning upgrades are better yet, since they mean that the person responsible for fixing any problems will be there, at the system, watching it for any signs of failure or loss of service, instead of being at home sleeping off a late night spent implementing fixes with just enough rubber band and baling wire to let them go home.

Our most recent upgrade included at least one example of this. I only know of that one example because I was one of the “users” affected; there were probably others. Some of our users have chosen to use SpamBouncer to block viruses and filter spam. Some of them also use similar technology to archive professional discussion lists.

I had been asking for months before the upgrade whether there were going to be any problems with that; the answer was always that it was under control.

Well, we did that upgrade on a Thursday night, ensuring that when things went wrong, not only would the people able to fix it be at home asleep after a long night, but that when they did come in they’d be in a rush to fix everything before the weekend.

So on Friday morning I came in to discover that the upgrade was filtering out all of my real mail, but ensuring that spam mail arrived in my mailbox. And it was doing this to all users of the spambouncing system. The fix should have been relatively easy… except that it was virtually impossible to get at and repair. The system administrators had actually hand-edited each individual SpamBouncer user’s configuration files, and had done so incorrectly.

From my perspective the next morning, it was obviously incorrect. I’m sure that at 11 o’clock at night it was less obviously incorrect. Working late and tired instead of early and fresh was the first error. But fixing the problem after the error meant hand-editing each individual user’s configuration file again—at the same time that lots of other users were complaining about lost mail, too, for many other reasons. All on a Friday, when the lost mail had had all night and a significant portion of the morning to build up.

The solution was not only obviously wrong, but it was nearly impossible to get at and repair. For some users it wasn’t fixed until well into the next week. They were losing mail (except viruses and spam, of course) for three or four days. The person who implemented the upgrade quit to avoid the stress he’d created.

The first question I asked on Friday morning was “Couldn’t we edit some central configuration file instead?” The answer was, “Well, I suppose we could.” In the end, we did do that, although we still had to hand-edit each individual user’s file to remove the “fix” that was causing their mail to get diverted.
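To make the “central configuration file” idea concrete, here is a minimal sketch in Python. It is not how SpamBouncer or our mail system is actually configured; the setting names and user names are made up for illustration. The point is only that site-wide settings live in one place, each user keeps a small set of their own overrides, and a site-wide mistake can be corrected with a single edit.

```python
from typing import Dict, Mapping


def effective_settings(central: Mapping[str, str],
                       user: Mapping[str, str]) -> Dict[str, str]:
    """Return the central defaults overlaid with one user's own overrides."""
    merged = dict(central)   # start from the site-wide settings
    merged.update(user)      # the user's own choices win where they exist
    return merged


# Hypothetical setting names, invented for this example.
central = {"filter_enabled": "yes", "spam_folder": "Junk"}

# Hypothetical users; each dict holds only what that user changed themselves.
users = {
    "alice": {},
    "bob": {"spam_folder": "Spam"},
}

# Correcting a site-wide mistake is one edit in one place,
# and nobody's personal overrides are touched.
central["filter_enabled"] = "no"

settings = {name: effective_settings(central, overrides)
            for name, overrides in users.items()}
```

With this layout, undoing a bad change means reverting one dictionary (or one file), not revisiting every user’s files by hand.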

The solution should have been a simple one; all that was needed was for the right question to be asked: “how can we do this in a way that is easier to fix if we do it wrong?” We almost never ask that question. Instead, we almost belligerently do things in ways that make it hard to fix if we do it wrong, and, worse, that make it more likely that we will in fact do it wrong and that when it goes wrong it will go spectacularly wrong.

Part of the answer is to assume that you will, in some important way, fail. Do your best to ensure that you don’t, but also do your best to ensure that when you do fail, because you will, you will be able to easily fix the problem.

  • Work fresh, not tired or hurried.
  • Stay after the work is completed! Be there when the users arrive, fresh and with the changes you’ve made still in mind. Don’t make them call you and wake you up. If you aren’t going to be able to stay for a few hours after the change, you don’t have time to make the change.
  • Make configuration changes at central locations. Let users make configuration changes at their location. Don’t overwrite their work without telling them.
  • Leave lots of time to fix any problems that arise; one day before the weekend will not be enough.
  • Know when to give up and roll back. Let yourself open the windows.
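The last guideline, knowing when to roll back, is easier to follow if the rollback is built in before the change is made. The following is a rough Python sketch under my own assumptions: the file name, the `.bak` suffix, and the `check` function are all hypothetical, and a real check would test whether mail is actually being delivered rather than look for a line of text.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def apply_with_rollback(config: Path, new_text: str,
                        check: Callable[[Path], bool]) -> bool:
    """Write new_text to config; restore the previous copy if check fails."""
    backup = config.with_name(config.name + ".bak")
    shutil.copy2(config, backup)        # keep the known-good version first
    config.write_text(new_text)
    try:
        ok = check(config)
    except Exception:
        ok = False                      # a crashing check counts as a failure
    if not ok:
        shutil.copy2(backup, config)    # open the window: roll back
    return ok


# Demonstration in a throwaway directory, with a deliberately misspelled change.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "central.conf"    # hypothetical central file
    cfg.write_text("filter_enabled=yes\n")
    applied = apply_with_rollback(
        cfg,
        "fliter_enabled=no\n",          # typo made at 11 o'clock at night
        lambda p: "filter_enabled=" in p.read_text(),
    )
    restored = cfg.read_text()
```

Here the misspelled change fails the check, so the known-good file is put back automatically instead of waiting for someone to notice the next morning.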

These guidelines all boil down to one simple question: if something goes wrong, will we be able to fix it easily and quickly? Work for success, but assume that we will fail. If we assume that our work can never break, it will be virtually impossible to get at and repair when it does fail.

Whenever a computer “expert” claims that you won’t have to “open the window” and that it is okay to seal it shut, require that somewhere on their upgrade they have to include Douglas Adams’s quote about air conditioning.
