Small Pleasures of Programming
One of the little secrets of computer programming is that it is actually quite a bit of fun. You discover little intellectual puzzles to wrestle with and solve on your way to building something bigger. And then instead of throwing away the puzzle solution like Saturday’s completed Sudoku, you get to tuck it into your app and keep it around in your bag of tricks.
I have a couple of recent examples that give a flavor of these experiences.
The app I am working on, for a variety of reasons, occasionally needs to take a census block ID — a 15-character string — and map it to the ID of the voting district that contains it — a 9-to-12-character string. The Census Bureau publishes this information and makes it freely available on its website. We imported it and stored it in probably the simplest JSON/JavaScript representation you could imagine: just an object with block IDs as the keys and voting district IDs as the values. The processing that needed to do the mapping would load the JSON file and then use the object directly to map from key to value.
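For a flavor of what that looked like (the IDs here are made up for illustration):

```js
// Naïve mapping: census block ID (15 characters) -> voting district ID.
// The IDs below are made up for illustration.
const blockToDistrict = {
  "482011000011000": "48201A001",
  "482011000011001": "48201A001",
  "482011000011002": "48201A002"
  // ...roughly one million entries for a state like Texas
};

// A lookup is just an object property access.
const districtId = blockToDistrict["482011000011001"];
```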
These files are stored by state, so to pick the worst case, Texas has almost one million census blocks and this structure is almost 30 megabytes when stored on disk. In memory it expands to 50 MB. JSON (JavaScript Object Notation — a way to take a JavaScript object and pass it as a string or store it in a file) can be pretty wordy, so data sometimes gets smaller when parsed into memory, but this file is mostly strings, so it actually expands once it is parsed and stored in memory.
The processing that required this mapping ability previously had to load a number of even bigger files, so despite its girth, it wasn’t quite on the list of top issues. But as is common, some other changes and improvements suddenly made the download and processing costs associated with this mapping table something we cared about.
Now again looking at Texas, that 30 MB JSON file was actually only 4 MB when compressed, which is how it was stored and transmitted. Compression is a good, quick way to get some intuition about the “real” information content of a block of data. The compressed form might not be the best form for processing, but it does give you a sense of how much redundancy there is in the data. In this case, there is a ton of redundancy, and it is pretty obvious when you think about the content. Each voting district is made up of about 65 census blocks, so each value (the 9-to-12-character voting district ID) is repeated 65 times on average. The block IDs themselves have a very structured form: the first 2 characters are the state ID (and so are the same for every key in a file organized by state), the next 3 characters are the county ID, then the census tract, the census block group, and finally the individual block number.
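To make that structure concrete, here is a sketch of pulling a block ID apart. The 2/3/6/4 split below matches the standard census block GEOID layout (the block group is the first digit of the block number), and the sample ID is made up:

```js
// Decompose a 15-character census block GEOID into its structured pieces.
// Layout: state (2) + county (3) + tract (6) + block (4);
// the first digit of the block is the block group.
function splitBlockId(blockId) {
  return {
    state: blockId.slice(0, 2),
    county: blockId.slice(2, 5),
    tract: blockId.slice(5, 11),
    blockGroup: blockId.slice(11, 12),
    block: blockId.slice(11, 15)
  };
}

console.log(splitBlockId("482011000011001"));
// { state: "48", county: "201", tract: "100001", blockGroup: "1", block: "1001" }
```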
Given all this redundancy, it seemed likely to me that we could come up with a much denser data structure that could be stored, transmitted, loaded and used directly in this dense representation without having to blow it up in memory. In my JavaScript programming, when I need to speed something up or reduce memory usage (often the same thing), I have recently become a fan of identifying where the naïve JavaScript data representation is unnecessarily bloated and using a compact binary representation instead. Maybe it’s my C/C++ roots.
In this case, I had two areas to focus on: how to store the values and how to store the keys. Since I knew the values were duplicated 65 times on average, a trivial solution would be to keep a single copy of each unique string and store a reference to it rather than the literal string (compilers and languages often do this automatically internally for some parts of your program). I thought I could do a little better, since the values also have a ton of internal redundancy — many values share the same first part, and many share the same last few distinguishing characters. A simple solution was to break each string into two pieces (dynamically) and find the splits that minimized the string storage. This is essentially an approximation of what a general compression algorithm does (find common shared sequences and then reference them compactly).
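Here is a simplified sketch of that idea (not the production code): pick a split point, keep one copy of each unique prefix and suffix, and store each value as a pair of small indices. The real version searched over split points to minimize total storage; this sketch just takes a fixed split position.

```js
// Simplified sketch: split each value into a prefix and suffix, keep one copy
// of each unique piece, and store each value as a pair of small indices.
function buildSplitTables(values, splitAt) {
  const prefixes = [], suffixes = [];
  const prefixIndex = new Map(), suffixIndex = new Map();
  const encoded = values.map(v => {
    const p = v.slice(0, splitAt), s = v.slice(splitAt);
    if (!prefixIndex.has(p)) prefixIndex.set(p, prefixes.push(p) - 1);
    if (!suffixIndex.has(s)) suffixIndex.set(s, suffixes.push(s) - 1);
    return [prefixIndex.get(p), suffixIndex.get(s)];
  });
  return { prefixes, suffixes, encoded };
}

const { prefixes, suffixes, encoded } = buildSplitTables(
  ["48201A001", "48201A002", "48339B001"], 5);
const decode = ([pi, si]) => prefixes[pi] + suffixes[si];
console.log(decode(encoded[0])); // "48201A001"
```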
For the keys, when you have a large amount of redundancy in your key space, a trie is a good choice. Coding this up was mostly about designing how to pack it into a single block of data, using offsets from the start of the buffer rather than pointers to walk through the data structure. As is typical when you start playing around with memory in tricky ways, this took a bit of testing and debugging to get working.
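To give a flavor of what the packed structure looks like (a toy layout, not the actual format), each trie node below is written into a single ArrayBuffer and the lookup walks byte offsets instead of following object pointers:

```js
// Toy packed-trie layout: each node is written into one ArrayBuffer as
//   uint32 value       (0xffffffff means "no value stored at this node")
//   uint8  childCount
//   childCount * { uint8 charCode, uint32 childByteOffset }
function buildPackedTrie(entries) {
  // First build an ordinary object trie...
  const NO_VALUE = 0xffffffff;
  const root = { children: new Map(), value: NO_VALUE };
  for (const [key, value] of entries) {
    let node = root;
    for (const ch of key) {
      if (!node.children.has(ch))
        node.children.set(ch, { children: new Map(), value: NO_VALUE });
      node = node.children.get(ch);
    }
    node.value = value;
  }
  // ...then lay the nodes out and record the byte offset of each one...
  const nodes = [];
  let size = 0;
  (function layout(node) {
    node.offset = size;
    size += 4 + 1 + node.children.size * 5; // value + childCount + children
    nodes.push(node);
    for (const child of node.children.values()) layout(child);
  })(root);
  // ...and finally write every node into one buffer.
  const view = new DataView(new ArrayBuffer(size));
  for (const node of nodes) {
    let at = node.offset;
    view.setUint32(at, node.value); at += 4;
    view.setUint8(at, node.children.size); at += 1;
    for (const [ch, child] of node.children) {
      view.setUint8(at, ch.charCodeAt(0)); at += 1;
      view.setUint32(at, child.offset); at += 4;
    }
  }
  return view;
}

function lookup(view, key) {
  let offset = 0; // start at the root node
  for (const ch of key) {
    const childCount = view.getUint8(offset + 4);
    let next = -1;
    for (let i = 0; i < childCount; i++) {
      const at = offset + 5 + i * 5;
      if (view.getUint8(at) === ch.charCodeAt(0)) { next = view.getUint32(at + 1); break; }
    }
    if (next < 0) return undefined; // key not present
    offset = next;
  }
  const value = view.getUint32(offset);
  return value === 0xffffffff ? undefined : value;
}

// The stored values would be indices into the prefix/suffix tables above.
const packed = buildPackedTrie([
  ["482011000011000", 0], ["482011000011001", 0], ["482011000011002", 1]]);
console.log(lookup(packed, "482011000011001")); // 0
```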
I was pretty happy with the final results. That Texas file ended up as a compact buffer of 2.7 MB that could be directly loaded into memory as a single block and used without further transformation. It compressed to less than 1 MB for storage and transmission (which argued that there was even more room for improvement in my solution). But I already had a 20X improvement in memory use and significant improvements in storage and transmission so I stopped there.
This was a fun little problem. It was well-isolated and easily described, so I didn’t have to spend a ton of time on gnarly integration and complex testing and deployment issues. I was able to use intuitions from a lifetime of programming, and it also required some careful bit fiddling to get right. And then it actually worked and solved the problem.
I will say that the general problem of “compact dictionaries” (which this was a special case of) is such a well-studied problem that I felt a little guilty writing something from scratch. Surely there was something I could grab off the shelf? A search didn’t find anything, and I had the advantage of being retired: I could suppress the guilt and just have fun solving the problem.
My second example is a little different. In this case, it was the fun of learning something new and then putting it to effective use.
I was trying to solve a trio of problems that turned out to be related in a surprising way. The core of our app loads and displays voting district shapes. Voting districts are created from the underlying census block shapes. The memory required to store and process these shapes ends up having a big influence on our overall performance.
The most basic way of improving performance is to take the high-resolution shapes provided by the Census Bureau and simplify them by removing points. This helps everywhere (in memory use and in every stage of processing), so it is the obvious thing to focus on. The extra points either make no visual difference when displayed on a computer screen or make no effective difference because they don’t change how the user interacts with the application. The underlying census shapes need to distinguish whether a single house is in one census block or another, but for our purposes an approximation works just fine, since none of our other data (census and elections) is available at that resolution.
So shape simplification is our first problem.
The second problem is how to combine a set of shapes into a single shape, or more generally “polygon union”. The core place we had this problem was taking a set of voting district shapes and combining them into a single congressional (or state legislative) district shape. Polygon union is a well-studied problem in computer graphics, but the general solution is both memory- and processing-intensive, and when we integrated an open-source solution into our app, it was a significant load on the interactive performance of the application. I had to build a complex asynchronous work-slicing mechanism on top of it to prevent it from locking up our app for seconds at a time.
The third problem was how to determine whether two shapes are contiguous. This is core to how we analyze a redistricting plan. In our application, this happens “offline” to produce a contiguity graph. It did not need to be especially fast, but it had originally been solved with a set of tools that were separate from our core tool set, so we had motivation to try to unify this.
There are a lot of simplification algorithms to choose from, but the additional challenge in our case is that we have a set of shapes that we draw together on a surface. The shapes cover the surface, and when you simplify, you can run into two related issues if you make different simplification decisions for two shapes that lie next to each other. Making different decisions happens easily, because one shape may have a single continuous line while multiple shapes on the other side border that line. Maintaining the integrity of those smaller shapes means keeping more points, while the larger shape can simplify and remove those points. When you make different simplification decisions, you end up either leaving holes in the map when you draw adjacent shapes, or you have overlapping shapes, which causes visual anomalies when you fill shapes with partial opacity (because the overlapping areas are drawn twice and appear darker).
The gaps also cause problems when you want to combine shapes (the polygon union problem I mentioned above), because you end up generating shapes with lots of little anomalous holes that aren’t in the real data. Both issues make the processing more expensive and introduce visual and analytic anomalies.
In researching this, I came across the TopoJSON libraries written by Mike Bostock. Mike is quite well-known in the visualization community; he formerly did innovative work at the New York Times and is a co-author of an important open-source visualization library, d3.js. Investigating this is where I got to learn something new.
The TopoJSON libraries take a different approach to the problems I was facing. The core insight is that when you have a set of shapes like the outlines of the US states or the voting districts of a state, these shapes actually have the important characteristic of dividing up the surface — they don’t overlap and they fully cover the part of the surface you care about. You really have a topology (hence the package’s name) and the polygons share the line segments or arcs of this topology.
So instead of thinking of your data as a set of shapes, you really have a set of topological arcs that divide up your surface. This ends up being key to all the processing that follows. This is so cool!
This kind of breakthrough recurs often: you have some general problem that is “theoretically hard”, but by identifying a key insight you can solve a different, simpler problem in a way that cuts through the complexity with much better performance.
The first stage of processing is to take your polygons and split them into arcs, broken at the points where the polygons intersect. Then you eliminate duplicate arcs by identifying all the polygons that share an arc and having them reference a single copy of it.
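The published topojson-server package exposes exactly this step as topology(). A minimal runnable sketch, using two toy squares that share an edge:

```js
import { topology } from "topojson-server";

// Two adjacent unit squares that share their middle edge.
const squares = {
  type: "FeatureCollection",
  features: [
    { type: "Feature", properties: { id: "A" }, geometry: { type: "Polygon",
      coordinates: [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]] } },
    { type: "Feature", properties: { id: "B" }, geometry: { type: "Polygon",
      coordinates: [[[1, 0], [2, 0], [2, 1], [1, 1], [1, 0]]] } }
  ]
};

// topology() cuts the rings into arcs at the shared points and de-duplicates
// them; both squares end up referencing the one shared middle arc.
const topo = topology({ squares });
console.log(topo.arcs.length);                        // the shared collection of arcs
console.log(topo.objects.squares.geometries[0].arcs); // arc indexes, not coordinates
```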
With this representation — each polygon described as a set of arcs drawn from a shared collection — the key challenges I described above become straightforward.
Now when you simplify, you simplify at the granularity of these shared arcs. This automatically guarantees that two polygons that share an edge make the same simplification decisions on both sides of the edge. That edge is specified by the shared simplified arc.
The union problem becomes even easier. Given a collection of polygons described as a set of arcs, you simply remove any arcs that are referenced by more than one polygon. The arcs that remain describe the boundary of the merged shape (there are a few subtleties to deal with disjoint shapes and holes when reconstructing the final shape).
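In topojson-client this is the merge() function. Continuing the toy topology from the sketch above:

```js
import { merge } from "topojson-client";

// Union of the two squares: the arc referenced by both polygons (the shared
// middle edge) drops out, leaving the outline of the combined 2x1 rectangle.
const combined = merge(topo, topo.objects.squares.geometries);
console.log(combined.type); // "MultiPolygon"
```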
The contiguity problem is also trivial in this representation — two shapes are contiguous if they share an arc.
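topojson-client also exposes a neighbors() helper built on the same idea, which is essentially a ready-made contiguity graph (again continuing the toy example):

```js
import { neighbors } from "topojson-client";

// neighbors() returns, for each geometry, the indexes of the geometries that
// share at least one arc with it: exactly a contiguity graph.
const adjacency = neighbors(topo.objects.squares.geometries);
console.log(adjacency); // [[1], [0]]: each square touches the other
```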
The other thing I found remarkable was that all this functionality (and other features I’ve skipped over) was implemented in a few hundred lines of code. A very neat piece of work. I’m more familiar with codebases like Microsoft Office, where you might have hundreds of thousands of lines of code just dealing with copy/paste (which, to be fair, really is a semantically complex problem — let me tell you about copy and paste in HTML tables some time).
I got a chance to dive deeper into this bit of code because things didn’t quite “just work”. I personally find it difficult to read and understand a piece of code if I am not digging in to try to fix a bug or extend it in some way. I just don’t have the persistence to give it the attention it requires if I’m not trying to achieve some explicit goal.
In this case, there were a few issues that needed to be addressed before we could make use of it in our app so I got a much deeper appreciation for the work as well as the additional satisfaction of solving the problems I encountered along the way.
The challenges with simplification were basically the converse of the problems with simplifying at the granularity of polygons. The same simplification decision is made for two polygons that share an edge, but inconsistent decisions could be made for separate edges of the same polygon. The result would be a polygon that crosses over itself. This typically only happened for narrow, twisting shapes, for example a census block that had been drawn to cover the path of a river or stream. Unfortunately, these self-crossing polygons produced exactly the sort of visual and processing anomalies I was trying to avoid in the first place!
This gave me an opportunity to dive deeper into how simplification works inside TopoJSON. The library treats simplification as a two-step process. The first step is a configurable process that computes a weight for each point, most commonly the area of the triangle formed by that point and its adjacent points. You can think of the area of that triangle as measuring how important that point is to showing the true path of the line. The second step is to specify some weight limit and remove the points that fall below it (while always keeping the key points that mark polygon intersections).
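In the published topojson-simplify package these two steps show up as presimplify(), which attaches a weight to each point (planar triangle area by default), and simplify(), which drops points below a threshold. A sketch, reusing the toy topology from above just to have something runnable:

```js
import { presimplify, simplify, quantile, planarTriangleArea } from "topojson-simplify";

// Step 1: compute a weight for every point (stored alongside its coordinates).
const weighted = presimplify(topo, planarTriangleArea);

// Step 2: quantile() picks a weight threshold from the distribution of point
// weights; simplify() then drops the points below it (arc endpoints get an
// infinite weight, so they are always kept).
const threshold = quantile(weighted, 0.8);
const simplified = simplify(weighted, threshold);
```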
The approach I took was an iterative one. I knew that if I kept all the points, I would clearly have a set of well-formed shapes. So I would run the simplification process, identify the malformed shapes, and step-wise artificially increase the “weight” of the points along the arcs referenced by those degenerate polygons. This was both possible and reasonably efficient because the weights were explicitly exposed as an output of the first stage of the simplification process. At the start, I wasn’t certain this would still result in sufficient simplification, but in practice it ended up working very well. The fact that the library had exposed this intermediate stage rather than treating the whole process as a black box made this reasonably straightforward.
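Here is a rough sketch of the shape of that loop (not the actual code). It assumes weighted is the output of presimplify() as above, that polygons is the list of Polygon geometries from the topology, and that isSelfCrossing() stands in for the real degeneracy test; presimplify() leaves each point as [x, y, weight], which is what makes the weight boosting easy:

```js
// Second stage of simplification done by hand: drop points whose weight is
// below the threshold (arc endpoints carry an Infinity weight, so they survive).
function simplifyArcs(weighted, minWeight) {
  return weighted.arcs.map(arc => arc.filter(p => p[2] >= minWeight));
}

// Iteratively boost the weights along the arcs of any polygon whose simplified
// form crosses itself, then re-run the cheap filtering stage.
function simplifyWithoutSelfCrossings(weighted, minWeight, polygons, isSelfCrossing) {
  for (let pass = 0; pass < 10; pass++) {
    const arcs = simplifyArcs(weighted, minWeight);
    const bad = polygons.filter(poly => isSelfCrossing(poly, arcs));
    if (bad.length === 0) return arcs;
    for (const poly of bad)
      for (const ring of poly.arcs)        // a Polygon's arcs are rings of arc indexes
        for (const a of ring)
          for (const point of weighted.arcs[a < 0 ? ~a : a]) // negative index = reversed arc
            point[2] *= 2;                 // the real code was more careful about how much to boost
  }
  return simplifyArcs(weighted, minWeight); // give up after a bounded number of passes
}
```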
The development process had a few twists and turns. I would process tens of millions of shapes covering the US and a few hundred would end up giving me trouble. What was it about these shapes? So I would journey to a spot along Boston Harbor where a mostly rectangular shape had a long thin finger covering a wharf. Or to a set of shapes in Colorado carefully drawn by a geographer to cover a narrow winding stream. Or to Minnesota, which uses far more shapes than you might expect, carefully drawn around all its bodies of water. Land of ten thousand lakes. I got to pursue a general bug in the TopoJSON merge algorithm when dealing with polygons with self-touching holes (think of a figure eight). Why did it only happen a few dozen times in the whole US? And then I would see a set of holes — a string of pearls — carefully drawn around a line of islands in the middle of a river.
The second problem was that the representation TopoJSON used for points and line sequences was consistent with other JavaScript packages and formats for geographical processing (essentially, a point was represented by a two-element array and a line sequence as an array of points) but was horribly memory-inefficient. A C programmer thinks of an array as perhaps the simplest and most compact of data structures, but JavaScript treats arrays as a special form of a general object or hash table, just with special string keys of the form “0”, “1”, etc. The upshot is that this simple point / line sequence representation used a ton of memory and actually dominated our memory usage. This was true despite the fact that TopoJSON starts off with a big win of almost 50% over formats organized by polygon, by encoding the points on a shared edge only once.
I had previously come up with a packed representation that essentially stored all the points for a complex set of shapes in one binary buffer that gave me two orders of magnitude improvement in memory usage (I love JavaScript but it was just shocking to me how bad the naïve representation was). I was looking to see if I could apply the same improvement to the TopoJSON package.
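To give a flavor of the packed idea (a simplified sketch, not the actual format): put every coordinate into one flat Float64Array and keep, per line sequence, just a starting index and a count, instead of building an array of two-element arrays for each line.

```js
// Simplified sketch of the packed-coordinate idea.
// Instead of  arcs = [ [[x, y], [x, y], ...], ... ]  (an object per point),
// store every coordinate in one flat Float64Array plus per-arc offsets.
function packArcs(arcs) {
  const totalPoints = arcs.reduce((n, arc) => n + arc.length, 0);
  const coords = new Float64Array(totalPoints * 2);
  const starts = new Uint32Array(arcs.length);
  const counts = new Uint32Array(arcs.length);
  let at = 0;
  arcs.forEach((arc, i) => {
    starts[i] = at;
    counts[i] = arc.length;
    for (const [x, y] of arc) {
      coords[2 * at] = x;
      coords[2 * at + 1] = y;
      at++;
    }
  });
  return { coords, starts, counts };
}

// Read back point j of arc i; the packed form stays flat, and a small
// [x, y] array is only built when you actually need one.
function getPoint(packed, i, j) {
  const at = packed.starts[i] + j;
  return [packed.coords[2 * at], packed.coords[2 * at + 1]];
}
```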
Here was where the small overall size of the package (in lines of code) as well as its clean design really helped. I was able to integrate the packed format with just a few dozen additional lines of code (and the existing code only needed to change in a couple of places). Even better, the core processing I needed to do with the TopoJSON library, merge, could be done directly from the packed representation. The previous approach had kept the polygons packed but had needed to unpack the coordinates in order to pass them to the library we were using for polygon union.
When finally integrated, the completed work reduced our app’s memory usage by hundreds of megabytes and essentially eliminated the most compute- and memory-intensive part of our interactive application as an issue. And along the way I got to learn about an interesting technology and explore some of the oddities of US geography. What fun!