If a tree falls in a forest, it kills Schrödinger's cat

June 2nd, 2010

Schrödinger used to have this quantum cat which was alive and dead at the same time as long as nobody opened the box, and it was the very act of looking at the cat that made it either alive or dead. Now, I'm not sure about this quantum stuff, but if you ask me you'd always find a dead cat upon opening the box, killed by the act of not looking. In fact, if you open any random box nobody was looking into, chances are you'll find a dead cat there. Let me give an example.

I recently chatted with a former employee of a late company I'll call Excellence (the real name was even worse). Excellence was a company with offices right across the street that kept shrinking until the recent financial crisis. It then had to simultaneously fire almost all of its remaining employees, carefully selected as their best during the previous years when other employees were continuously fired at a lower rate. Giving us a whole lot of great hires, including MV, the co-worker in this story (though he was among those who guessed where things were going and crossed the street a bit earlier).

Rumors tell that to the extent possible, Excellence used to live up to expectations created by its name. In particular, without being encouraged or forced by customers to comply to any Software Development Methodology such as the mighty and dreadful CMM, they had (as CMM puts it) not only Established, but rigorously Maintained an elaborate design, documentation and review process which preceded every coding effort. Other than praise, MV had little to say about the process, except perhaps that occasionally someone would invent something awfully complicated that made code incomprehensible, having gone just fine through the review process because of sounding too smart for anyone to object.

Now, in our latest conversation about how things were at Excellence, MV told how he once had to debug a problem in a core module of theirs, to which no changes had been made in years. There, he stumbled upon a loop searching for a value. He noticed that when the value was found, the loop wouldn't terminate – a continue instead of a break kind of thing. Since the right value tended to be found pretty early through the loop, and because it was at such a strategic place, test cases everyone was running took minutes instead of seconds to finish. Here's a dead cat in a gold-plated box for ya, and one buried quite deeply.

My own professional evolution shaped my mind in such a way that it didn't surprise me in the slightest that this slipped past the reviewer(s). What surprised me was how this slipped past the jittery bodybuilder. You see, we have this Integration Manager, one of his hobbies being bodybuilding (a detail unrelated, though not entirely, to his success in his work), and one thing he does after integrating changes is he looks at the frame rate. When the frame rate seems low, he pops up a window with the execution profile, where the program is split into about 1000 parts. If your part looks heavier than usual, or if it's something new that looks heavy compared to the rest, it'll set him off.

So I asked MV how come that the cat, long before it got dead and buried, didn't set off the jittery bodybuilder. He said they didn't have one for it to set off. They were translating between the formats of different programs. Not that performance didn't matter – they worked on quite large data sets. But to the end user, automatic translation taking hours had about the same value as automatic translation taking seconds – the alternative was manual translation taking weeks or months. So they took large-scale performance implications of their decisions into account during design reviews. Then once the code was done and tested, it was done right, so if it took N cycles to run, it was because it took N cycles to do whatever it did right.

And really, what is the chance that the code does everything right according to a detailed spec it is tested against, but there's a silly bug causing it to do it awfully slowly? If you ask me – the chance is very high, and more generally:

Though not looking at performance followed from a reasonable assessment of the situation,
Performance was bad, and bad enough to become an issue (though an unacknowledged one), when it wasn't looked at,
Although the system in general was certainly "looked at", apparently from more eyes and angles than "an average system", but it didn't help,
So either you have a jittery bodybuilder specifically and continuously eyeballing something, or that something sucks.

Of course you can save effort using jittery automated test programs. For example, we've been running a regression testing system for about a year. I recently decided to look at what's running through it, beyond the stuff it reports as errors that ought to be fixed (in this system we try to avoid false positives to the point of tolerating some of the false negatives, so it doesn't complain loudly about every possible problem). I found that:

It frequently ran out of disk space. It was OK for it to run out of disk space at times, but it did it way too often now. That's because its way of finding out the free space on the various partitions was obsoleted by the move of the relevant directories to network-attached storage.
At some of the machines, it failed to get licenses to one of the compilers it needed – perhaps because the env vars were set to good values with most users but not all, perhaps because of a compiler upgrade it didn't take into account. [It was OK for it to occasionally fail to get a license (those are scarce) - then it should have retried, and at the worst case report a license error. However, the compiler error messages it got were new to it, so it thought something just didn't compile. It then ignored the problem on otherwise good grounds.]
Its way of figuring out file names from module names failed for one module which was partially renamed recently (don't ask). Through an elaborate path this resulted in tolerating false negatives it really shouldn't.
...

And I'm not through with this thing yet, which to me exemplifies the sad truth that while you can have a cat looking at other cats to prevent them from dying, a human still has to look at that supervisor cat, or else it dies, leading to the eventual demise of the others.

If you don't look at a program, it rots. If you open a box, there's a dead cat in it. And if a tree falls in a forest and no one is around to hear it, it sucks.