I want a struct linker

September 26th, 2008

Here's a problem I've seen a lot (it's probably classified as an "Antipattern" or a "Code Smell" and as such has a name in the appropriate circles, but I wouldn't know, so I'll leave it nameless).

You have some kind of data structure that you pass around a lot. Soon, the most valuable thing about the structure isn't the data it keeps, but the fact that it's available all the way through some hairy flow of control. If you want to have your data flow through all those pipes, just add it to The Data Structure. (To antipattern classification enthusiasts: I don't think we have a god object yet, because we really do want to pass our data through that flow, and it's right to have one kind of structure for that rather than, say, propagating N+1 function parameters.)

Now suppose the structure holds an open set of data. For example, a spam filter could have a data structure to which various passes add various cues they extract from the message, and other passes can access those cues. We don't want the structure to know what passes exist and what cues they extract, so that you can add a pass without changing the structure.

I don't think there's a good way to do it in a 3GL. In C or C++, you can:

Aggregate the cue structures by value (which means you have to recompile everything once you change/add/remove a member from any of them)
Keep pointers to the cue structures and use forward declarations to avoid recompilation (a bit slower, and you still have to recompile when you add/remove a whole cue structure)
Keep an array of void* or base class pointers (not debugger-friendly, and requires a registration procedure to size the array according to the number of passes and to hand out dynamically computed cue indexes to everyone who wants to access them)
Keep a key -> void* map (slower still and even less debugger-friendly, and you either need registration to compute the keys from cue names, or you use the C substitute for interning: pointers to global variables, such as &g_my_cue_key, as keys)
Keep a string -> void* map (no registration or pseudo-interning, but really slow)
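To make the "pointers to global variables as keys" option concrete, here's a minimal sketch in C. Everything in it (CueMap, cue_map_put, cue_map_get, the fixed capacity, the linear search) is made up for illustration; a real map would use hashing or something smarter.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CUES 16  /* arbitrary capacity, an assumption of this sketch */

typedef struct {
    const void *keys[MAX_CUES];  /* addresses of per-cue key variables */
    void *values[MAX_CUES];      /* pointers to the cue structures */
    size_t count;
} CueMap;

/* Each pass owns one global. Only its address matters, never its value:
   the linker makes every address unique, which is the poor man's interning. */
char g_viagra_cue_key;
char g_cialis_cue_key;

typedef struct { int mentions; } ViagraCue;  /* hypothetical cue contents */

/* Store value under key; returns 0 on success, -1 if the map is full. */
int cue_map_put(CueMap *map, const void *key, void *value) {
    assert(key != NULL);
    for (size_t i = 0; i < map->count; i++) {
        if (map->keys[i] == key) {
            map->values[i] = value;  /* overwrite an existing entry */
            return 0;
        }
    }
    if (map->count == MAX_CUES) {
        return -1;
    }
    map->keys[map->count] = key;
    map->values[map->count] = value;
    map->count++;
    return 0;
}

/* Look up key by pointer comparison; returns NULL if no pass stored that cue. */
void *cue_map_get(const CueMap *map, const void *key) {
    for (size_t i = 0; i < map->count; i++) {
        if (map->keys[i] == key) {
            return map->values[i];
        }
    }
    return NULL;
}
```

A pass would call cue_map_put(&map, &g_viagra_cue_key, &cue), and any other pass that knows about g_viagra_cue_key gets the cue back with cue_map_get. The container never learns what a ViagraCue is, which is the point, and the debugger shows you a pile of void*, which is the price.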

On top of the JVM or .NET, you have pretty much the same options, plus the option to generate the cue container structure dynamically. Each cue would define an interface and the container structure would implement all those interfaces. The debugger would display containers nicely, and the code accessing them wouldn't depend on the container class. I'd guess nobody does that, though, because the class generation part is likely somewhat gnarly.

In a 4GL, you can add attributes to class objects at run time. This is similar to keeping a key->pointer map in a 3GL, except the name interning is handled by the system, as it should be, and you don't confuse debuggers because you're using a standard feature of the object system. Which solves everything except for the speed issue, which is of course dwarfed by other 4GL speed issues.

Now, I used to think of it as one of the usual speed vs convenience trade-offs, but I no longer think it is, because a struct linker could solve it.

Suppose you could have "distributed" struct/class definitions in an offset-based language; you could write "dstruct SpamCues { ViagraCue viagra; CialisCue cialis; }" in the Medication spam filter module, and "dstruct SpamCues { FallicSymbolsCue fallic; SizeDescriptionsCue size; }" in the Penis Enlargement module. The structure is thus defined by all modules linked into the application.

When someone gets a SpamCues structure and accesses cues.viagra, the compiler generates a load-from-pointer-with-offset instruction – for example, in MIPS assembly it's spelled "lw destreg, offset(ptrreg)". However, the offset would be left for the linker to resolve, just the way it's done today for pointers in "move reg, globalobjectlabel" and "jump globalfunclabel".

This way, access to "distributed" structures would be as fast as access to "normal" structures. And you would preserve most optimizations related to adjacent offsets. For example, if your machine supports multiple loads, so a rectangle structure with 4 int members can be loaded into 4 registers with "ldm rectptrreg,{R0-R3}" or something, it could still be done because the compiler would know that the 4 members are adjacent; the only unknown thing would be the offset of the rectangle inside the larger struct.
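For comparison, here's roughly what the same access looks like when you emulate it by hand at run time. This is a sketch under assumptions, not a design: dstruct_reserve, the module init functions and the Rect member are all hypothetical names, and it needs C11 for _Alignof. Each module reserves its slot at startup and keeps the offset in a global; a struct linker would do that assignment at link time and fold the offset into the instruction's immediate field instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { int mentions; } ViagraCue;   /* hypothetical cue contents */
typedef struct { int x0, y0, x1, y1; } Rect;  /* 4 adjacent ints */

/* Total size of the distributed struct so far; grows as modules register. */
static size_t g_spam_cues_size;

/* Offsets a struct linker would have resolved as link-time constants. */
static size_t g_viagra_offset;
static size_t g_rect_offset;

/* Reserve size bytes at the given alignment and return the member's offset. */
size_t dstruct_reserve(size_t size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);  /* power of two */
    size_t offset = (g_spam_cues_size + align - 1) & ~(align - 1);
    g_spam_cues_size = offset + size;
    return offset;
}

/* Registration: every module adds its members before any object exists. */
void medication_module_init(void) {
    g_viagra_offset = dstruct_reserve(sizeof(ViagraCue), _Alignof(ViagraCue));
}

void layout_module_init(void) {
    g_rect_offset = dstruct_reserve(sizeof(Rect), _Alignof(Rect));
}

/* Allocate a zeroed object; call only after all modules have registered. */
void *spam_cues_new(void) {
    assert(g_spam_cues_size > 0);
    return calloc(1, g_spam_cues_size);
}

/* Member access is base + offset; here the offset costs an extra load
   from a global instead of living inside the load instruction. */
ViagraCue *spam_cues_viagra(void *cues) {
    return (ViagraCue *)((char *)cues + g_viagra_offset);
}

Rect *spam_cues_rect(void *cues) {
    return (Rect *)((char *)cues + g_rect_offset);
}
```

Once spam_cues_rect has returned a pointer, the four ints behind it sit at fixed, compile-time-known offsets from each other, so the adjacent-members optimizations survive; only the rectangle's base offset is late-bound.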

One issue the linker could have on some architectures is handling very large offsets that don't fit into the instruction encoding of load-from-pointer-with-offset forms. Well, I'd live even with the dumbest solution where you always waste an instruction to increment a pointer in case the offset is too large. And then you could probably do better than that, similarly to the way "far calls" (calls to functions at addresses too far from the point of call for the offset to fit into 28 bits or whatever the branch offset encoding size is on your machine) are handled today.

The whole thing can fail in the presence of run-time dynamic loading, as with dlopen/LoadLibrary: if you already have objects of the structure, and your language can't relocate objects because it uses native pointers, then the dynamically loaded module won't be able to add members to a dstruct, since it can't update the existing objects. Well, I can live with that limitation.

If the language generates native object files, there's the problem of maintaining compatibility with the object file format. I think this could "almost" be done, by mapping a distributed structure to a custom section .dstruct.SpamCues, and implementing members (viagra, cialis, fallic, size) as global objects in that section. Then if an equivalent of a linker script says that the base address of .dstruct.SpamCues is 0, then &viagra will resolve to the offset of the member inside the structure. The change to automatically map sections named .dstruct.* to 0 surely isn't more complicated than the handling of stuff like .gnu.linkonce, inflicted upon us by the idiocy of C++ templates and the likes of them.
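Something close to this can be faked today, with caveats. The sketch below assumes GCC or Clang targeting ELF with a GNU-ld-compatible linker, which provides __start_SECNAME and __stop_SECNAME symbols for sections whose names are valid C identifiers. That's why the section is called dstruct_SpamCues here rather than .dstruct.SpamCues. Instead of mapping the section to address 0, it subtracts the section's start address at run time, so it's the less efficient variant that adds an offset to the object pointer. The placeholder object names and accessor functions are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { int mentions; } ViagraCue;  /* hypothetical cue contents */
typedef struct { int mentions; } CialisCue;

/* Placeholder objects: never used as storage. Only their position inside
   the section matters, because that position is the member's offset. */
#define DSTRUCT_MEMBER(type, name) \
    __attribute__((section("dstruct_SpamCues"), used)) type name = {0}

DSTRUCT_MEMBER(ViagraCue, dstruct_viagra);
DSTRUCT_MEMBER(CialisCue, dstruct_cialis);

/* Section bounds, provided by the linker (assumed GNU ld behavior). */
extern char __start_dstruct_SpamCues[];
extern char __stop_dstruct_SpamCues[];

/* Offset of a member = address of its placeholder minus the section start. */
#define DSTRUCT_OFFSET(name) \
    ((size_t)((uintptr_t)&(name) - (uintptr_t)__start_dstruct_SpamCues))

/* Size of the whole distributed struct = size of the section. */
size_t spam_cues_size(void) {
    return (size_t)((uintptr_t)__stop_dstruct_SpamCues -
                    (uintptr_t)__start_dstruct_SpamCues);
}

/* Allocate a zeroed object big enough for every linked-in member. */
void *spam_cues_new(void) {
    size_t size = spam_cues_size();
    assert(size > 0);
    return calloc(1, size);
}

ViagraCue *spam_cues_viagra(void *cues) {
    return (ViagraCue *)((char *)cues + DSTRUCT_OFFSET(dstruct_viagra));
}

CialisCue *spam_cues_cialis(void *cues) {
    return (CialisCue *)((char *)cues + DSTRUCT_OFFSET(dstruct_cialis));
}
```

Other modules could drop their own DSTRUCT_MEMBER lines into the same section from other object files, and the linker would concatenate them without any registration code. What it doesn't give you is the part I actually want: the offset ends up as a run-time subtraction and addition, not as an immediate in the load instruction.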

And here's why I'll probably never see a struct linker:

If the language uses a native linker, a small change must be made to that linker so that it can patch the encodings of load/store instructions in ways it previously didn't (currently it only has to deal with resolving pointers, not offsets). Since that code is platform-specific, the small change has to be repeated for every target, so it's actually quite large.
You could compromise and avoid that change by generating less efficient code which uses the already available linker ability to resolve the "address" of the viagra object in the zero-based .dstruct.SpamCues section – the code can add that "address" (offset, really) to &cues. But that could still force changes in the compiler back-end because now it has to generate assembly code adding what looks like 2 addresses, which makes no sense today and might be unsupported if the back-end preserves type information.
The previous items assume that the portable "front-end" work to support something like dstruct isn't a big deal. However, I'd guess that not enough people would benefit from it/realize they'd benefit from it to make it appear in a mainstream language and its front-ends.
I could roll my own compiler for a language similar to a mainstream one, with a bunch of additions like this struct linker thingie. Two problems with this. One – it's too hard to parse all the crud in a mainstream language (even if it isn't C++) to make it worth the trouble, unless your compiler does something really grand; a bunch of nice features probably isn't worth it. Two – most programmers take a losing approach towards their career where they want to put mainstream languages on their resume so that losers at the other end can scan their resumes for those languages; if your code is spelled in a dialect, you'll scare off the losers forming the backbone of our industry.

It still amazes me how what really matters isn't what can be done, but what's already done. It's easier to change goddamn hardware than it is to change "software infrastructure" like languages, software tools, APIs, protocols and all kinds of that shit. I mean, here we have something that's possible and even easy to do, and yet completely impractical.

Guess I'll have to roll my own yet-another-distributed-reflective-registration bullshit. Oh well.