The Wisdom to Know the Difference

God grant me the serenity to accept the things I cannot change, courage to change the things I can, and wisdom to know the difference.

Apologies to Reinhold Niebuhr for borrowing the Serenity Prayer as the lead-in to this short note. I want to look at three examples where I was solving the wrong problem, without the wisdom to recognize it. These stuck with me because the failing in each case was not in my ability to solve a problem but in my ability to see what the real problem, or opportunity, was.

In my early days at BBN, my team suffered from that dreaded calamity every government-funded research team experiences at one time or another, the funding gap. We were going to get continued funding on our research into multimedia mail systems, but it was not showing up for some months. So I was loaned out to another team that was working on a project to provide integrated reports on workplace chemical exposure. A new regulation had been created requiring every workplace to post information about chemicals in use in that workplace, including specific safety information about each chemical. The system we were building could take a chemical term, query a variety of online government database systems and integrate all the information into a single report.

This was way before the days of the web and even before wide use of the Internet. Our PC-based tool dialed up these online systems by modem and then interacted with them using a text-based command line. The remote database systems were designed for interactive use, so our system simulated a user querying the system and then effectively “screen-scraped” the responses. The initial prototype had been coded on top of a BBN data analysis tool. Our task was to convert this to a C-based PC application. Much of the effort was porting individual interaction sub-routines that had been coded to interact with particular database screens.

This was rather mind-numbing (as porting can sometimes be) as well as irritatingly foolish. Why hard-code these interactions? Any time a database system made a user interface change (which they were wont to do), the interaction routine needed to be re-coded and the system needed to be recompiled and redeployed. After a couple of days of porting, I decided to take a different tack and designed an interpreted mini-language to describe the interactions. Each individual interaction was described in a simple text file. The language had constructs optimized for the problem domain, which simplified writing the interaction routines. I spent another couple of days finishing the individual routines, which were now easier to write in the special-purpose language as well as easier to validate in the interpreter. When I revealed this to both my BBN manager and our sponsor, they were thrilled. I was quite proud of myself.
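To make the idea of an interpreted interaction language concrete, here is a minimal sketch in Python. It is not the original language, whose syntax I have not reproduced here; the three commands (send, expect, capture), the sample script, and the fake remote system are all invented for illustration.

```python
import re
from typing import Callable, Dict


def run_script(script: str,
               send: Callable[[str], None],
               read: Callable[[], str]) -> Dict[str, str]:
    """Interpret a tiny, hypothetical interaction script.

    Commands (all invented for this sketch):
      send <text>             send a line to the remote system
      expect <regex>          read one response; fail if the regex is absent
      capture <name> <regex>  read one response; store group 1 under <name>
    """
    fields: Dict[str, str] = {}
    for lineno, raw in enumerate(script.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        command, _, rest = line.partition(" ")
        if command == "send":
            send(rest)
        elif command == "expect":
            response = read()
            if not re.search(rest, response):
                raise RuntimeError(f"line {lineno}: expected {rest!r}, got {response!r}")
        elif command == "capture":
            name, _, pattern = rest.partition(" ")
            response = read()
            match = re.search(pattern, response)
            if match is None:
                raise RuntimeError(f"line {lineno}: no match for {pattern!r} in {response!r}")
            fields[name] = match.group(1)
        else:
            raise ValueError(f"line {lineno}: unknown command {command!r}")
    return fields


class FakeRemote:
    """Stands in for the modem session so the sketch runs without a network."""

    def __init__(self, replies):
        self.replies = list(replies)
        self.sent = []

    def send(self, text: str) -> None:
        self.sent.append(text)

    def read(self) -> str:
        return self.replies.pop(0)


# A hypothetical script for one database screen, kept as plain text.
SCRIPT = """
# log in, run a query, pull one field out of the reply
expect LOGIN:
send guest
expect QUERY>
send FIND BENZENE
capture cas CAS NUMBER: ([0-9-]+)
"""

remote = FakeRemote(["LOGIN:", "QUERY>", "NAME: BENZENE  CAS NUMBER: 71-43-2"])
report = run_script(SCRIPT, remote.send, remote.read)
```

The point of the design is that when a remote screen changes, only the text script changes; the interpreter, and therefore the compiled application, stays put.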

But looking back, the system was still quite fragile, even if breaks could be repaired more easily. In fact, a better approach would have been to recognize that the entire system could be made more robust by exposing the data explicitly as an internet (pre-web) service, explicitly designed and constrained for interaction through an API. As an early developer of the Internet, BBN was well-positioned for this type of work. But I was too satisfied by my small step forward and missed the larger leap.

The second example is from an early workspace conferencing system (mid-’80s) that I helped build called MMConf. We were experimenting with long-running shared workspaces and would leave a session to a remote site across the country running overnight. When we returned in the morning, we would often find the TCP connection had broken at some point overnight, requiring us to restart the session by hand and lose the workspace context. Now these were days when there were only a few thousand hosts on the Internet, virtually all hard-wired. It seemed very curious to me that the connection should break. In fact, there was no real “connection”; these were just packets flowing over the network. But at some point SunOS had decided to close the socket, which the application saw as a broken TCP connection.

I was working with Jon Postel at the time and he also found this curious. We explored for some time but never got to a clear underlying root cause. Perhaps the session was closed because of inactivity. Perhaps some extended delay on a remote system resulted in the overall socket timing out while we were trying to send some data. Ultimately, we just ended up living with the behavior. Looking back, I regret the failure to research the behavior more deeply since I’m certain it would have given me a deeper understanding of the characteristics of the systems I was dealing with. These types of mysteries are often perfect opportunities for deeper learning.

The more important thing that is clear in retrospect is that I needed to design for failure. Connections will break. That is inevitable (and perhaps a little more obvious in our wireless, partially connected world). It needed to be the responsibility of higher layers in the system to take end-to-end accountability for keeping the effective connection alive by restarting broken connections, as well as ensuring the higher-level conferencing protocols were robust in the face of failures at this lower level. This was a straightforward application of the end-to-end argument that I missed applying.
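A minimal sketch of what "the higher layer owns the session" might look like, in Python. This is not how MMConf worked; the class and parameter names are hypothetical, and the flaky transport exists only to simulate a dropped connection. A real conferencing protocol would also need sequence numbers or acknowledgments so that a resend after a drop is not applied twice, which this sketch leaves out.

```python
import time
from typing import Callable, List, Optional


class ResilientSession:
    """Owns the session and treats any single connection as disposable."""

    def __init__(self,
                 connect: Callable[[], object],
                 max_attempts: int = 5,
                 base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._connect = connect          # factory that opens a fresh connection
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._sleep = sleep              # injectable so tests need not wait
        self._transport: Optional[object] = None
        self.failures = 0                # how many sends hit a broken connection

    def send(self, data: bytes) -> None:
        for attempt in range(self._max_attempts):
            try:
                if self._transport is None:
                    self._transport = self._connect()
                self._transport.send(data)
                return
            except OSError:
                # Drop the dead connection, back off exponentially, retry.
                self._transport = None
                self.failures += 1
                self._sleep(self._base_delay * 2 ** attempt)
        raise ConnectionError(f"gave up after {self._max_attempts} attempts")


class FlakyTransport:
    """Simulated connection that can fail on its first send."""

    def __init__(self, log: List[bytes], fail_once: bool) -> None:
        self.log = log
        self.fail_once = fail_once

    def send(self, data: bytes) -> None:
        if self.fail_once:
            self.fail_once = False
            raise ConnectionResetError("simulated overnight drop")
        self.log.append(data)


delivered: List[bytes] = []
connections: List[FlakyTransport] = []


def connect() -> FlakyTransport:
    # Only the very first connection is set up to fail.
    transport = FlakyTransport(delivered, fail_once=(len(connections) == 0))
    connections.append(transport)
    return transport


session = ResilientSession(connect, sleep=lambda seconds: None)
session.send(b"draw line")
session.send(b"draw circle")
```

Because the session object outlives any one connection, the workspace context can live with it rather than being lost when a socket closes.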

The third example arose while working on Microsoft FrontPage and is an example of how working in a large organization can make you stupid. FrontPage had a feature called “Broken Links View” that would display all the broken links in your web site (and let you drill in to fix them). There were two approaches to computing whether a link was broken. If the link was internal to the web site, FrontPage could use its knowledge of the complete list of web pages in the site to determine whether the link was broken because no such page existed. For external links, FrontPage would explicitly ping the site (using an HTTP HEAD request) to determine whether the link was valid.

Partway through the FrontPage 2000 development cycle, we noticed that when Word saved using its HTML file format (a major new feature under development at the same time for Word 2000), it would write a link in the main Word HTML file to a manifest.xml file that contained the list of any additional embedded files (images, mostly) that were logically part of the overall document. When there were no additional embedded files, Word would omit generating a manifest.xml file but would still emit a link to this missing file. This showed up as a broken link and polluted the Broken Links View in FrontPage, making it much harder to identify the real broken links and significantly reducing the value of a view that, in normal usage, you wanted to drive down to empty.

A FrontPage tester entered a bug about this behavior and assigned it to the developer responsible for the Broken Links View. Conveniently, since we were now officially part of Office, we could just transfer the bug to the developer in Word (well, actually a shared team) responsible for the root-cause misbehavior. Such an advantage to be part of this wider organization! The bug then fell off my radar, and I was somewhat horrified to find out later that the shared team had decided to ship with that behavior rather than fix it since, from their perspective, there was no real bad consequence.

I was chagrined at myself after this, but not for failing to follow up and get the bug fixed. I should have recognized that this was one instance of a wider class of problems and that we really should have provided a mechanism to filter or ignore certain classes of links in order to make the view more useful. That is clearly what an external competitor without a private line to the Word development team would have done, and it would have placed accountability directly back on the team (us) with the most on the line. Clearly you can fail in both directions: hacking around problems rather than addressing root causes, or focusing so much on root causes that you fail to address the issue with what might be a simple workaround or even an elegant feature extension.
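The filtering mechanism I wish we had built could have been quite small. Here is a sketch in Python, assuming a simple list of glob-style ignore patterns; the function name, the patterns, and the sample paths are all hypothetical and are not anything FrontPage actually shipped.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def visible_broken_links(broken: Iterable[str],
                         ignore_patterns: Iterable[str]) -> List[str]:
    """Return the broken links that match none of the ignore patterns."""
    patterns = list(ignore_patterns)
    return [
        link for link in broken
        if not any(fnmatchcase(link, pattern) for pattern in patterns)
    ]


# Hypothetical contents of the view before filtering.
broken = [
    "report_files/manifest.xml",      # the kind of link Word left dangling
    "products/old-page.htm",          # a real broken internal link
    "http://example.com/missing",     # a real broken external link
]

# Hypothetical rule: hide any manifest.xml that sits inside a subfolder.
shown = visible_broken_links(broken, ["*/manifest.xml"])
```

With a rule like this in place, the view can still be driven down to empty regardless of what another team decides to ship.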

All three of these examples raise interesting questions about system decomposition. One of the reasons I have always been so enamored with exploring the end-to-end argument and its applications over my career is that these decomposition questions end up being some of the most interesting issues to address. They often raise both architectural and organizational issues, and complex systems development typically needs to address both.