If a tree falls in a forest, it kills Schrödinger's cat

Schrödinger used to have this quantum cat which was alive and dead at the same time as long as nobody opened the box, and it was the very act of looking at the cat that made it either alive or dead. Now, I'm not sure about this quantum stuff, but if you ask me you'd always find a dead cat upon opening the box, killed by the act of not looking. In fact, if you open any random box nobody was looking into, chances are you'll find a dead cat there. Let me give an example.

I recently chatted with a former employee of a late company I'll call Excellence (the real name was even worse). Excellence had offices right across the street, and it kept shrinking until the recent financial crisis, when it had to fire almost all of its remaining employees at once – employees carefully selected as their best during the previous years, when the others were continuously fired at a lower rate. This gave us a whole lot of great hires, including MV, the co-worker in this story (though he was among those who guessed where things were going and crossed the street a bit earlier).

Rumor has it that, to the extent possible, Excellence used to live up to the expectations created by its name. In particular, without being encouraged or forced by customers to comply with any Software Development Methodology such as the mighty and dreadful CMM, they had (as CMM puts it) not only Established, but rigorously Maintained an elaborate design, documentation and review process which preceded every coding effort. Other than praise, MV had little to say about the process, except perhaps that occasionally someone would invent something awfully complicated that made the code incomprehensible, having sailed through the review process because it sounded too smart for anyone to object.

Now, in our latest conversation about how things were at Excellence, MV told me how he once had to debug a problem in a core module of theirs, to which no changes had been made in years. There, he stumbled upon a loop searching for a value. He noticed that when the value was found, the loop wouldn't terminate – a continue instead of a break kind of thing. Since the right value tended to be found pretty early in the loop, and because it was at such a strategic place, test cases everyone was running took minutes instead of seconds to finish. Here's a dead cat in a gold-plated box for ya, and one buried quite deeply.
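For concreteness, here's a hypothetical reconstruction in C++ (not Excellence's actual code) of the kind of loop MV found – when the value appears once, both versions return the same answer, so tests pass, but one keeps scanning long after it's done:

```cpp
#include <cstddef>
#include <vector>

// find_slow: the bug - the value is located early, but a 'continue'
// keeps the loop scanning to the end of the data.
int find_slow(const std::vector<int>& v, int key) {
    int found = -1;
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (v[i] == key) {
            found = (int)i;
            continue;  // bug: should have been break - iterates over all the rest
        }
    }
    return found;
}

// find_fast: the fix - stop as soon as the value is found.
int find_fast(const std::vector<int>& v, int key) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (v[i] == key)
            return (int)i;
    }
    return -1;
}
```

Functionally identical on the covered cases; only the cycle count gives the bug away, and nobody was counting cycles.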

My own professional evolution shaped my mind in such a way that it didn't surprise me in the slightest that this slipped past the reviewer(s). What surprised me was how it slipped past the jittery bodybuilder. You see, we have this Integration Manager, one of whose hobbies is bodybuilding (a detail not entirely unrelated to his success at work), and one thing he does after integrating changes is look at the frame rate. When the frame rate seems low, he pops up a window with the execution profile, where the program is split into about 1000 parts. If your part looks heavier than usual, or if it's something new that looks heavy compared to the rest, it'll set him off.
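The bodybuilder's routine could be sketched as code, roughly like this (a hypothetical simplification – the names, the 20% threshold and the flat per-part cost representation are all made up, not our actual tooling):

```cpp
#include <map>
#include <string>
#include <vector>

// flag_heavy_parts: compare the current per-part cost against a baseline
// profile; flag parts that are new, or heavier than usual by more than
// the tolerance (20% by default - an arbitrary choice for this sketch).
std::vector<std::string> flag_heavy_parts(
        const std::map<std::string, double>& baseline,
        const std::map<std::string, double>& current,
        double tolerance = 1.2) {
    std::vector<std::string> flagged;
    for (const auto& [part, cost] : current) {
        auto it = baseline.find(part);
        if (it == baseline.end() || cost > it->second * tolerance)
            flagged.push_back(part);  // new part, or heavier than usual
    }
    return flagged;
}
```

The point isn't the code, which is trivial – it's that someone runs it (or its equivalent in eyeballs) after every integration.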

So I asked MV how come the cat, long before it got dead and buried, didn't set off the jittery bodybuilder. He said they didn't have one for it to set off. They were translating between the formats of different programs. Not that performance didn't matter – they worked on quite large data sets. But to the end user, automatic translation taking hours had about the same value as automatic translation taking seconds – the alternative was manual translation taking weeks or months. So they took the large-scale performance implications of their decisions into account during design reviews. Then once the code was done and tested, it was considered done right, so if it took N cycles to run, it was because it took N cycles to do whatever it did right.

And really, what is the chance that the code does everything right according to a detailed spec it is tested against, but there's a silly bug causing it to do it awfully slowly? If you ask me – the chance is very high, and more generally:

  • Though not looking at performance followed from a reasonable assessment of the situation,
  • Performance was bad, and bad enough to become an issue (though an unacknowledged one), when it wasn't looked at,
  • Although the system in general was certainly "looked at", apparently by more eyes and from more angles than "an average system", it didn't help,
  • So either you have a jittery bodybuilder specifically and continuously eyeballing something, or that something sucks.

Of course you can save effort using jittery automated test programs. For example, we've been running a regression testing system for about a year. I recently decided to look at what's running through it, beyond the stuff it reports as errors that ought to be fixed (in this system we try to avoid false positives to the point of tolerating some of the false negatives, so it doesn't complain loudly about every possible problem). I found that:

  • It frequently ran out of disk space. It was OK for it to run out of disk space at times, but now it was happening way too often. That's because its way of finding out the free space on the various partitions was obsoleted by the move of the relevant directories to network-attached storage.
  • At some of the machines, it failed to get licenses for one of the compilers it needed – perhaps because the env vars were set to good values for most users but not all, perhaps because of a compiler upgrade it didn't take into account. [It was OK for it to occasionally fail to get a license (those are scarce) – then it should have retried, and in the worst case reported a license error. However, the compiler error messages it got were new to it, so it thought something just didn't compile. It then ignored the problem on otherwise good grounds.]
  • Its way of figuring out file names from module names failed for one module which was partially renamed recently (don't ask). Through an elaborate path this resulted in tolerating false negatives it really shouldn't.
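To illustrate the first item: the robust way to ask the free-space question is to ask it about the directory you actually write to, so the answer survives a move to network-attached storage. A sketch (assuming C++17's std::filesystem; not our system's actual code, which predates it):

```cpp
#include <cstdint>
#include <filesystem>

// free_bytes_at: ask the filesystem that actually holds 'dir' how much
// space is available there. Unlike a table of partitions fixed at setup
// time, this keeps giving the right answer after 'dir' moves elsewhere.
std::uintmax_t free_bytes_at(const std::filesystem::path& dir) {
    return std::filesystem::space(dir).available;
}
```

Of course, the interesting failure mode wasn't the missing one-liner – it was that nobody noticed the old answer had quietly become wrong.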

And I'm not through with this thing yet, which to me exemplifies the sad truth that while you can have a cat looking at other cats to prevent them from dying, a human still has to look at that supervisor cat, or else it dies, leading to the eventual demise of the others.

If you don't look at a program, it rots. If you open a box, there's a dead cat in it. And if a tree falls in a forest and no one is around to hear it, it sucks.

14 comments

#1 Antiguru on 06.02.10 at 5:53 pm

Well, if there's one thing I'm convinced is useless, it's got to be profilers. I came around on debuggers, but only because they actually became useful over time; there's an inherent problem with profiling, as with using a debugger, in that it seems to completely turn off people's brains.

So you get results like someone coming to you and saying that the performance bottleneck of an app is string concatenation. So you give a noncommittal grunt instead of shaming them publicly in the interest of a happy workplace, and off they go rewriting string classes, making stringbuffers for concatenation, and then they throw a meeting about it and two of the would-be alpha coders get into a long argument about whether char* or string is better to use. As if any time, ever, the performance of string concatenation mattered to someone who made a sane program.

Then the boss offers you a delicious can of soda if you find out why 'the boys' have been in a big uproar the last couple of days. So you go and look and see wtf is really happening. Oh! A parser. Fancy that. A parser written without a scanner! And no one knows what that is, or why string concatenation and parsers don't go together well except in throwaway code (which, if it were, ought to be in Perl or some language that is actually good with strings). Likewise, ideas like tokens and grammars are totally alien.

After about the third time this happens, the flaw's apparent no matter how apathetic you are. The problem is that a profiler can help with obvious mistakes like you describe, but usually performance problems are not related to stupid mistakes – and the ones which are, are easy for any peabrain to find anyway.

Of course we all get annoyed by quotes from CSC 101 about not optimizing, or optimizing only through algorithm choice, but algorithms aside, there's always something holding you back, and usually it's something like bloated code size, disk access, oversimple blocking, etc. etc. All things you can't figure out by instruction counting, and nowadays instruction counting is never the limiting factor anyway, so it becomes doubly useless. Not to mention you can't optimize everything, so hunting through each scrap is useless, especially when the people causing problems usually cause them system-wide, not in a way that shows up in the code they own. Such as when your newly minted CS grad is assigned some init process that allocates a huge amount of memory, mostly in tiny chunks via string concatenation, hopelessly fragmenting your memory before any real work gets done.

Ultimately, I think the only real thing is testing, plus having people who know what they are doing. It's just like having a rehearsal for a big play. It's guaranteed to never work without rehearsal, no matter how great the actors or the play, but with enough rehearsal even the worst play with awful actors could at least complete without a major mishap. Performance gains can also be astounding, but it's almost always either a waste of time, or requires so much detailed knowledge of every bit of C++ and your app that it's a big waste of the effort of what's probably your best technical asset.

#2 raghava on 06.03.10 at 6:19 am

@antiguru: :) Seriously! Profilers (and the kind of issues they identify) are simply overrated. I agree a hundred percent about the play analogy; but it also has to be said that the actors need to be smart enough (to a minimum extent at least) and have some presence of mind, for Murphy will always be around.

@Yossi: A good post, thanks. Once, my team was asked to look into a peculiar problem with a strange pattern, involving elements in a list. A deeper inspection of the underlying library (which was built some five years ago, with no fixes made since) revealed a dead cat: an off-by-one error. Why did it not stink till now? Nobody had tested it with a case that would have failed. :D

#3 njn on 06.03.10 at 12:04 pm

Profilers are useless? That's the dumbest idea I've heard in a while.

#4 Tommy on 06.03.10 at 8:10 pm

Having recently adopted two adorable kittens, I thought you were going to make the obvious point "cats in boxes die". I tried to transport my two kittens in a box, but they quickly escaped and crawled underneath the brake pedal, so cats not in boxes almost made us all die.

Instead, what I'm getting out of this is that cats need to be looked after by humans or other cats, as long as a human is watching a cat somewhere. I'm not watching my cat chew an electrical wire right now, I hope that it doesn't die.

I completely empathize with the desire to automate everything. I'd like to automate as much of my life as possible. Clearly, the important thing to consider is that the automation process is easily made fallible, and that someone is going to notice eventually.

Problems are better noticed by a jittery bodybuilder in the office than by way of a run-over cat on the street.

#5 Yossi Kreinin on 06.04.10 at 4:21 am

@Antiguru: if performance isn't measured, it becomes crap; I'm not saying that it can't become crap otherwise or that sometimes it's OK when it's crap, just that it will become crap if not measured. As to profilers – you generally need custom ones, timers at logically meaningful places plus counters of "things" (pixels processed/basic blocks optimized/whatever).
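Such a custom profiler can be as small as a scoped timer plus a counter of "things" – a minimal sketch, with made-up names:

```cpp
#include <chrono>
#include <cstdio>

// ScopedTimer: a timer at a logically meaningful place, paired with a
// counter of "things" processed inside the scope, so the report comes
// out in units of the problem (time per pixel, per basic block, etc.)
// rather than per function.
struct ScopedTimer {
    const char* label;
    const long long* items;  // incremented by the code inside the scope
    std::chrono::steady_clock::time_point start;
    ScopedTimer(const char* l, const long long* n)
        : label(l), items(n), start(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("%s: %lld us, %lld items\n", label,
                    static_cast<long long>(us), *items);
    }
};

// Usage: wrap the meaningful stage, count the meaningful things.
//   long long pixels = 0;
//   {
//       ScopedTimer t("blur", &pixels);
//       /* ...process the frame, incrementing pixels... */
//   }
```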

@raghava: that an uncovered case didn't work is perfectly legitimate and basically unavoidable; what is avoidable is covered cases that work defectively, though.

@Tommy: it seems cats and death go well together. Hence the nine lives, presumably.

#6 Aaron Davies on 06.04.10 at 5:36 am

from the tao:

A well-used door needs no oil on its hinges.
A swift-flowing stream does not grow stagnant.
Neither sound nor thoughts can travel through a vacuum.
Software rots if not used.

These are great mysteries.

#7 gus3 on 06.04.10 at 7:36 pm

"Premature optimization is the root of all evil." But after a certain point, a well-designed profiler can expose contradictory assumptions between a subroutine and its caller.

#8 Amir Barak on 06.07.10 at 6:38 am

Of course sometimes dead cats were simply put on top of leaky drainpipes; especially it seems when it isn't your code and sometimes because it IS your code [or mine :P]

I would always hesitate before touching anything that's been running for years [profiled badly or not]…

#9 Yossi Kreinin on 06.07.10 at 7:59 am

Well, I'm conservative enough to hesitate to touch anything that's been running for weeks if not days, it's just that sometimes you have to.

#10 Antiguru on 06.18.10 at 1:04 am

Not using a profiler doesn't mean never measuring performance. It just means that you don't get sucked into a false idea of how optimization is supposed to work, which seems to be the case for everyone I personally know who uses them.

An argument I got into recently was over virtual methods being slow. My idea was to use pure virtual methods to accomplish some tasks in a sort of base node object. Mr Scruples just couldn't allow the word virtual to pass his fingers, let alone a pure virtual.

He even made up a case and executed it a billion times or so to prove his point. His version was a little faster, but not a lot. But I knew for a fact the virtual method was actually faster in a real-world app. The virtual lookup is static data. Instead there's this dumb 'control' object, an object with its own function pointer. So it's not static, but is there for every single listener.

So I made a real-world case and ta-dah! It's obvious the virtual method is faster in the case we actually care about, and by a wider margin than in the BS case.

The reason is that as the bits fly, none of this crap is in cache any more, and what's worse, it's not together in memory and has some bizarre indirection, so you wind up doing 2-3 cache misses on every single lookup when things are not in cache. Plus, it takes up more cache, and pretty much everything is a node, so this was also degrading the whole system by helping choke the cache out.

Of course not everyone just knows this off the top of their head, but you are going to solve 100X more performance issues by knowing than by just measuring and coming to an obvious conclusion.

#11 Yossi Kreinin on 06.18.10 at 3:56 am

Of course you need some mental model of why things could take the time they take to make sense of measurements; and, when you write optimized code, obviously you're not doing measurement-driven iteration (change, measure, change, measure…) but rather have some overall plan of an optimized implementation, having weighed alternatives in your head rather than done measurements for N implementations.

All I'm saying is that on top of that, when you have an app coded by reasonable people, where performance isn't then continuously measured, it becomes crap.

#12 6502 on 07.03.10 at 1:50 pm

One phrase I found (sadly) true is that if everything works then it's just because you didn't look closely enough.

It's *so* common in complex systems to look for a bug and then after some hunting you think you nailed it down and say "AH!… here it is!! when we introduced that change this part of the program wasn't updated correctly and .. blah blah blah" and indeed you found a true genuine bug!

But it's another one, not the one you were chasing…

dead cats are everywhere…

Andrea

PS: For code efficiency it's even worse… especially when people start doing "optimizations" without using a decent (i.e. passive) profiler first. I've seen a major code rewrite because of a single operation that was O(n^2), was taking 99.9% of cpu time, and could have been coded as O(1) with minimal impact on the rest of the program.
After the change, computation in one use case went down from 25 minutes to 3 seconds (and those 3 seconds were mostly the update of the progress bar that was added to inform users that the system wasn't in fact dead).

#13 noop on 01.12.11 at 10:34 am

Yes, profilers are useless in most cases. The problem is that a good programmer will use a correct and efficient method from the start, and a bad programmer can't improve by only using a profiler or some other tool instead of reading some good books.
The existence of profilers and optimizing compilers is used as an excuse for writing sloppy code. Today many students are taught: "write working code first, think about speed later (usually never)".
P.S. Sorry for bad english.

#14 Oisín on 06.12.12 at 3:22 pm

@noop:
"Existence of profilers and optimizing compilers is being used as excuse for writing sloppy code. Today many students are taught: “write working code first, think about speed later( usually never )”."

People are taught this for a good reason. If you try to write "correct and efficient" code from the start as you recommend, you are making your task artificially more difficult, and therefore increasing the likelihood of making errors.

Instead, a more robust approach would be to focus on writing correct code first, with some decent test coverage, before reviewing your design and code and refactoring/optimising.

This is not to say that you should purposely write "sloppy" code or reject all thoughts of performance, just that the focus should not be on writing optimal code in the first pass. To presume that you can write correct and well-optimised code on the first attempt is optimistic and rather dangerous.

Going back to profilers, I don't buy the idea that they only help people to solve trivial optimisation problems.
A more important use is to justify where optimisation is worthwhile.

For example, imagine a programmer comes to a team meeting claiming that some important feature was currently taking O(n^3) time when it could be O(n log n) with another algorithm, and that we should replace the algorithm immediately.
Someone raises their hand and asks how much time this would save. Nobody answers, so the programmer does a profiling run and discovers that the "slow" algorithm currently takes 50ms, which is dwarfed by the slow network queries that are performed immediately afterwards.

Like any tool, profilers can be misused, but they certainly have some value.
