Code, data and interactive programming

February 16th, 2008

"Are code and data the same thing?" I haven't conducted a poll, but I think the following answers are the most popular ones:

My answer is "No". I'll now try to explain briefly why my opinion makes sense in general. After that, I plan to get to the point, which is how the code/data distinction matters in interactive programming environments.

I think that, um, everything is data, at least everything I can think about. I mean, the only things that can technically enter my brain are either data or cigarette smoke, because I don't do drugs. And I hope that the effect of passive smoking is negligible, so it's just data.

In particular, code is data. But not all data is code. Code is a special kind of data: a tangle of blocks, each defining values and pointing to other blocks that it depends on.

What this means, and of course everybody knows it, is that you can't make any sense of code in the general case. That is, the only way to compute the values defined by the blocks of code is to "run" the code – keep chasing the links between the blocks, computing the values they define as you go. You can't even prove that this process will terminate given an arbitrary bulk of code, not to mention proving its correctness.
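If a toy example helps, here's one in Python (my own illustration, nothing profound): the only way to find out what this function computes for a given n is to run it, and whether the loop terminates for every positive input is the Collatz conjecture, which nobody has managed to prove either way.

# To know what this returns for a given n, you essentially have to run it.
# Whether it terminates for *every* positive n is an open problem.
def steps_to_one(n):
    count = 0
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        count += 1
    return count

print(steps_to_one(27))  # 111, but only running it tells you that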

Now, an image, for example, isn't code. Well, philosophically, it is, because if they show you an image and it's really ugly, you'll say "ewww". So the image was in fact a program giving instructions to your brain. The output of your brain's image compiler is the following code in the human body assembly language:

MOV R0, dev_mouth
MOV R1, disgust_string
CALL write
RET
disgust_string:
.asciz "ewww"

More interestingly, you can write a program that processes images, and this particular image may be the one that makes your program so confused that it never terminates. However, this doesn't mean that the image itself is "code". The image doesn't have interconnected blocks defining values. Even if the image is a screenshot of code.

An image is a two-dimensional array of pixels, a nice, regular data structure. You don't have to "run" it in order to do useful things with it, like computing its derivatives or e-mailing it to your friends so they'll go "ewww". And programs doing that can be proven to terminate, unless you have an infinitely slow connection to the outgoing mail server.
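To make that concrete, here's a minimal Python sketch (the 4x4 "image" is made up): computing a horizontal derivative is a couple of loops bounded by the image dimensions, so it obviously terminates, and it needs no notion of "running" the image.

# A hypothetical 4x4 grayscale "image": a flat, regular array of pixels.
image = [
    [10, 10, 50, 50],
    [10, 10, 50, 50],
    [10, 10, 50, 50],
    [10, 10, 50, 50],
]

# Horizontal derivative: differences between neighboring pixels in each row.
# The loops are bounded by the image dimensions, so this provably terminates.
dx = [[row[i + 1] - row[i] for i in range(len(row) - 1)] for row in image]
print(dx)  # [[0, 40, 0], [0, 40, 0], [0, 40, 0], [0, 40, 0]]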

So what I'm saying is, code is a special kind of data, having blocks which define values and depend on each other. Does it really matter whether a particular piece of data is "code" according to this definition? I think it does. One reason is the above-mentioned fact that you can't really make sense of code. Many people realize the practical drawbacks of this, and so in many contexts, they use data-driven programming instead of the arguably more natural "code-driven" programming.
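If a concrete illustration helps, here's a made-up Python sketch of the difference. The table version is plain data: any program can read it, validate it, or diff two versions of it. The if-chain version can only be run.

# "Code-driven": the mapping is buried in code; the only thing another
# program can do with it is call it.
def shipping_cost_code(country):
    if country == "US":
        return 5
    elif country == "CA":
        return 8
    else:
        return 20

# "Data-driven": the same knowledge as plain data, plus a trivial interpreter.
SHIPPING_COST = {"US": 5, "CA": 8}

def shipping_cost_data(country):
    return SHIPPING_COST.get(country, 20)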

Everything you represent as "data" can be processed by many different programs, which is good. Everything you represent as "code" can only be processed by a kind of interpreter, which is bad. I'm not talking about the difficulty of parsing the syntax, which doesn't exist with Lisp or Forth, isn't a very big deal with C or Java and is a full-featured nightmare with C++ or Perl. I'm talking about the semantics – for many purposes, you can't really "understand" what you've parsed without running it, and this is common to all Turing-complete languages.

But this isn't going to be about the inability to analyze code. This is going to be about the somewhat more basic problem with code – that of blocks which point to each other. In order to explain what I mean, I'll use the example of 3 interactive programming environments – Matlab, Unix shells and Python, listed in decreasing order of quality (as interactive environments, not programming languages).

Interactive programming is the kind of programming where the stuff you define is kept around without much effort on your part. The other kind of programming is when you compile and run your code, it computes things, it exits, and those things are gone. Clearly interactive programming is nicer, because it makes looking at data and trying out code on it easy.

Or so it should be; in practice, it looks like more people prefer "batch programming", so there might be some drawbacks in the actual interactive environments out there. What makes for a good interactive environment, and what spoils the fun? Let's look at some well-known gotchas with existing environments.

Some of the most upset people I've seen near computers were the ones whose Matlab session had been running for months when their machine crashed. It turned out that they had a load of data there – measurements, results of heavy computations, symbolic equations and their solutions – and now it's all gone. GAAA!! This doesn't happen with batch programming that much, because you send the output of programs to persistent storage.

This problem, nasty as it may be, looks easy to fix – just have the system periodically save the workspace in the background. Perhaps Matlab already has this. I wouldn't know, because I tend to manually save things every few minutes, ever since the childhood trauma of losing a file I loved. Anyway, this doesn't look like an inherent problem of interactive computing, just an awfully common implementation problem. For example, do Unix shells, by default, save the command history of each separate concurrent session you run? I think you know the answer.

Speaking of Unix shells. Ever had the pleasure of typing "rm -rf *" in the wrong directory because of command completion from history? GAAA!! OK. Ought to calm down. Let's do Fault Analysis. Why did this happen? The command string with "rm" in it is, basically, code; shell code invokes processes. This code depends on another piece of code, the one that determines the current directory. The command string doesn't have a fixed meaning – you must run getcwd in order to figure it out.
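The same point in Python terms, if that's clearer (a sketch for a Unix box; the directories are whatever you happen to have lying around): the "meaning" of anything mentioning * or a relative path is a function of hidden state that you can only discover by asking the system at run time.

import glob
import os

# The same expression means different things depending on hidden process
# state, namely the current directory.
print(os.getcwd(), glob.glob('*'))

os.chdir('/tmp')                     # change the hidden state...
print(os.getcwd(), glob.glob('*'))   # ...and the "same" code now means something else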

The shell couldn't really warn us about the problem, either. That's because the meaning of "rm" is defined by the code at /bin/rm (or by some other program in the $PATH which happens to be called "rm"). Since the shell can't understand that code without running it, it can't estimate the potential danger. And if the shell warned us about every command completed from history that originally ran in a different directory than the current one, the completion would likely be more annoying than useful.

At some point I got fed up with Unix shells, and attempted to switch to a Python shell. I tried IPython and pysh, and I still use IDLE at home on my XP box. I ought to say that Python shells suck, and I don't just mean "suck as a replacement for a Unix shell", but also "suck as a way to develop Python code". The single biggest problem is that when you change your code, you must reload modules. It's unclear which modules should be reloaded, there's no way to just reload everything, and ultimately you end up with a mix of old code and new code, which does something, but you aren't quite sure what exactly.
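In today's Python, the classic symptom looks something like this (a sketch; mymodule and Thing are made-up names):

import importlib
import mymodule                 # hypothetical module defining class Thing

obj = mymodule.Thing()          # an object created before you edit the code

# ...edit mymodule.py, then:
importlib.reload(mymodule)

# The module-level name now refers to a brand-new class object, but obj
# still points to the old one: old and new code coexist in one session.
print(isinstance(obj, mymodule.Thing))   # False
print(type(obj) is mymodule.Thing)       # False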

Die-hard Pythonistas refuse to acknowledge there's a problem, though they do bend their programming style to work around it. What they do is write all of their code in one file, and use execfile instead of import to make sure everything is indeed redefined, the Zen of Python with its love of namespaces be damned. Sure, an interesting project in Python can be just 5000 lines worth of code, but I don't like navigating a file that big. And sometimes you do need more lines, you know.

Another thing they do is implement __repr__ in their classes so that print displays their objects, and they'll invest a lot of effort into making eval(repr(obj)) work. The fact that eval'able strings aren't necessarily the most readable way to produce debug prints doesn't seem to bother them. Nor do the contortions they have to go through to solve the prosaic problem of making references to other objects display reasonably. One way to do it is to use dictionary keys instead of pointers, so that member object references aren't expanded into a full object description when they are printed. If you don't know why they're doing this, you'll find their code fairly puzzling.
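Roughly, the style looks like this (my own sketch, not anybody's real code): __repr__ strings that eval back into equivalent objects, and references by dictionary key instead of by pointer, so that printing one object doesn't drag in a full description of everything it refers to.

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __repr__(self):
        # eval(repr(p)) gives back an equivalent Point; how readable this is
        # as a debug print is another question.
        return "Point(%r, %r)" % (self.x, self.y)

points = {"a": Point(0, 0), "b": Point(3, 4)}

class Segment:
    # References by key rather than by pointer: printing a Segment shows two
    # short keys instead of a dump of the Point objects it refers to.
    def __init__(self, start_key, end_key):
        self.start_key, self.end_key = start_key, end_key
    def __repr__(self):
        return "Segment(%r, %r)" % (self.start_key, self.end_key)

print(eval(repr(points["a"])))   # Point(0, 0)
print(Segment("a", "b"))         # Segment('a', 'b')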

I find the struggle to make interactive Python programming work very depressing. It reminds me of me, before the invincible idiocy of C++ crushed my spirit. People have a tendency to assume that things are well thought-out and hence should work.

We have this extremely widespread language called C++, and it's centered around type-based static binding. And it's easy to see how this could help a compiler spot errors and optimize the code. Therefore, this programming style can be a good way of writing software, if applied consistently. Ha!

We have this Python language, and several shells for it. Quite obviously, interactive programming is a good way to speed up the development cycle. Therefore, adapting our Python code for interactive programming will pay off, if we do it consistently. Ha!

But I digress. This isn't about the trusting nature of software developers, nor is it a comparison between C++ and Python, mind you. They're hard to compare, since they are very different beasts: Python is a programming language, and C++ is a karmic punishment. So I should get back to the topic of interactive programming.

Here's my opinion on the example programming environments I used in this entry.

Matlab is a great one, unless you lose your workspace. I've used it for extended stretches on several occasions, and it just never itched; nothing went wrong.

Unix shells are good in terms of their ability to preserve your data (everything is a flat, self-contained string of bytes). I'd love them if they didn't suck so badly as programming languages. Since they do, I only use shell scripting for one-shot throwaway things, like debugging (fiddling with log files and core dumps).

Python is awful. So when I'm on Unix, I run Python processes from the shell, and sometimes use Python's reflection to make my batch programming just a bit more interactive. For example, if you have a Python function f(a,b,c), you can have your command line parser introspect its arguments and add the command line options -a, -b and -c.
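Something along these lines, in today's Python (a sketch; f and its parameters are made up, and a real parser would also want types and defaults):

import argparse
import inspect

def f(a, b, c):
    print(a, b, c)

def run_from_command_line(func):
    # Build one command line option per parameter of func.
    parser = argparse.ArgumentParser()
    for name in inspect.signature(func).parameters:
        parser.add_argument("-" + name)
    args = parser.parse_args()
    return func(**vars(args))

if __name__ == "__main__":
    run_from_command_line(f)    # e.g.: python script.py -a 1 -b 2 -c 3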

So much for specific examples. What's the generic rule? I think it's this: pointer-happy systems can't be interactive. That's because interactive programming is about saving your data objects. And this is only useful when the current value of a preserved object is clear to you. Otherwise, you can't use the object, so what's the point?

When you have pointers in your objects, the objects aren't self-contained, and when the pointed objects are redefined, it isn't clear what should happen with the pointing objects. Should they point to the new thing or the old thing? Either answer can be counter-intuitive to you, and the whole point of interactive programming is to let you enter a state of flow, and if you scratch your head and can't easily guess what the old object means right now, you aren't in a state of flow.

In particular, pointers to code are the worst kind of pointers, because code is the most intertwined data of your program, and a pointer to a single block of code basically points to the entire code base. When an object points to an old function, and the function was redefined, and the system keeps the old definition around, you may easily get a call sequence with both the new function and the old function, which is probably ridiculous. And if you make the old object point to the new function, the function might simply fail to work with that object, and you just can't tell whether it will work or not without running it, remember?
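In Python terms the situation looks something like this (a sketch; greet and the handler list are made up), and it's the single-function version of what happens to whole modules when you reload them:

def greet(name):
    return "hello, " + name

handlers = [greet]              # an object pointing to a piece of code

def greet(name):                # "reloading": the name now refers to new code
    return "HELLO, " + name.upper()

# One call sequence, two generations of the "same" function.
print(greet("world"))           # HELLO, WORLD (the new definition)
print(handlers[0]("world"))     # hello, world (the old one, still reachable)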

For example, Python is a good interactive calculator, because arithmetic expressions are self-contained. Even if they do contain references to variables, it's fairly clear what happens when you change a variable – all expressions mentioning it will now recompute differently. Note that arithmetic expressions aren't Turing-complete and can't have cyclic references. Now, if you use Python's object-oriented features, then you have objects which point to their class definition which is a bunch of code pointers, and now when you reload the module defining the class, what good are your old objects?

This is why I don't believe in Numeric Python. The whole point of using Python is to use its pointer-happy features, like classes and hairy data structures and dynamically defined functions and stuff. Numeric programming of the kind you do in Matlab tends to use flat, simple objects, which has the nice side-effect of making interactive programming work very well. If you use a numeric library inside a pointer-happy language like Python, quite soon the other libraries you use will make interactive programming annoying. So you'll either move to batch programming or suffer in denial like the die-hard Python weenie you are. Someone using Matlab will be better off, since interactive programming is more productive than batch programming, when it works well.

So the bottom line is that I think interactive programming has limited applicability, since "general-purpose" programming environments pretty much have to be pointer-happy. That is, if a language doesn't make it very easy to create a huge intertwined mess of code and data pointers, I don't see how it can be usable outside of a fairly restricted domain. And even in the "flat" environments like Matlab or Unix, while old data objects can be useful, old commands are, and ought to be, a two-edged sword. Because they are code, and code is never self-contained and thus has a great potential to do the wrong thing when applied in a new context.

This whole claim is one of those things I'm not quite sure about. From my experience, it took me quite some time to realize which interactive features help me and which get in the way with each environment I tried. So I can't know what happens in Lisp or Smalltalk or Tcl or Excel or Emacs, in terms of (1) applicability to "general-purpose" tasks, (2) the amount of self-contained data compared to the kind with pointers, especially pointers to code and (3) the extent to which the thing is itchy and annoying at times. So comments are most welcome. In particular, if you know of an environment that, put simply, isn't more itchy than Matlab but isn't less general-purpose than Python, that would be very interesting.