Where Have All the Hangs Gone?

Don’t you just hate when an application hangs on your Windows PC? Why do modern web apps and mobile apps hang less often than older applications on a PC do? It turns out there are some interesting and deep reasons why this is the case. I want to explore them here.

Let’s first run through Application Architecture 101. A typical application is broken into a data model and a view of that data model that is rendered onto a screen. User input (touches, mouse clicks, keystrokes) is used to invoke commands that alter the data model (or alter how that model is viewed, for example by panning or zooming). If the command updates the data model, the view then updates to reflect the new model state. If that basic loop of command/model update/view update happens fast enough, the application feels responsive. Applications carefully design their data models and the commands that can be performed on them to be efficient and have predictable performance. The view is carefully designed so it can be updated quickly as the model changes (e.g. as I discussed in my MVC post).

Despite this, many applications have some core computation that cannot be guaranteed to complete in the sub-second interval required for good responsiveness. For example, Word might need to repaginate (lay out content into pages) for a document that is thousands of pages long. Excel might have a complex model that in some cases can take literally hours to recalculate. A computer-aided-design tool might need to run a complex set of design rules in response to some change in the design. For those kinds of critical functions, application designers do the work to break the computation into small chunks that can be interleaved with additional user editing operations. The application continues to feel responsive even as it finishes the expensive recalculation “in the background”. Before we had threads, this computation was executed “at idle” on the same thread that interpreted user commands. Threading capabilities in the operating system did not initially make much difference in this core design because the hard problem was still how to interleave background updates to the data model with additional user edits and then how to reflect those changes incrementally in the view. As PCs got true multi-processing capabilities, using multiple threads could give the application access to more of the hardware’s full power, and application designers did the work to leverage real threading for some key application processing (e.g. Excel can use multiple threads to perform recalc).
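To make the “at idle” pattern concrete, here is a minimal single-threaded sketch in Python. Everything in it is hypothetical and for illustration only: `repaginate` is a toy stand-in for real pagination, `run_event_loop` is a toy message loop, and user input is simulated as a map from tick number to command. It also ignores the genuinely hard part described above, which is what to do when an edit invalidates work the background task has already done.

```python
from typing import Iterator


def repaginate(paragraph_lengths: list[int], lines_per_page: int,
               pages: list[int], chunk: int = 2) -> Iterator[None]:
    """Assign a page number to each paragraph, pausing after every `chunk` paragraphs."""
    pages.clear()
    page, used = 1, 0
    for i, length in enumerate(paragraph_lengths):
        if used > 0 and used + length > lines_per_page:
            page, used = page + 1, 0      # paragraph does not fit: start a new page
        used += length
        pages.append(page)
        if (i + 1) % chunk == 0:
            yield                         # hand control back to the event loop


def run_event_loop(input_by_tick: dict[int, str], idle_task: Iterator[None]) -> list[str]:
    """Toy loop: user input always wins; otherwise run one slice of idle work."""
    log: list[str] = []
    tick, task_done = 0, False
    while not task_done or tick <= max(input_by_tick, default=-1):
        if tick in input_by_tick:
            log.append(f"command:{input_by_tick[tick]}")
        elif not task_done:
            try:
                next(idle_task)
                log.append("idle:slice")
            except StopIteration:
                task_done = True
                log.append("idle:done")
        tick += 1
    return log


pages: list[int] = []
log = run_event_loop({1: "insert"}, repaginate([10, 10, 10, 10, 10], 25, pages))
print(log)    # ['idle:slice', 'command:insert', 'idle:slice', 'idle:done']
print(pages)  # [1, 1, 2, 2, 3]
```

The user’s command at tick 1 is handled between two slices of pagination rather than waiting for the whole computation to finish, which is the property that keeps the application feeling responsive.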

So why does an application hang? An application appears to hang when it is not processing user input or not updating the view to let the user know it is processing user input.

Some hangs are “just bugs”. Because of a coding error, the application might be in an infinite loop or might be in some kind of multi-threaded deadlock. The application is completely stuck and will need to be explicitly terminated or it will just hang forever. Our experience in Office was that these types of hangs are actually pretty rare in the wild, since internal testing tends to find these before the product is released.

There is another kind of “mini-hang” where the application is taking a long time to process an operation because of some characteristic of the specific data model being edited. This might happen if the application is being used to edit a document that is at the outer limits of the design point for the application. In some cases this happens when the document is “just big” (e.g. you insert a column into a table that is thousands of pages long in a Word document). In other cases it might happen because the content is unusual in some way and the application was not designed to deal with it (early browsers were notoriously bad at dealing with deeply nested tables because of algorithms with running times that were exponential in table depth).

The experience in Office was that neither of these was the cause of most hangs. The most common hangs happened when the application was calling some application programming interface (API) that it expected to return quickly and reliably but instead returned quickly sometimes and very slowly at other times. If it was reliably slow, we would design the application to expect and handle that. It was the unpredictability that typically led to hangs. And that’s a deeper story.

In the early days of networked computing, there were a lot of different experiments in how to program distributed systems. My early experience was at BBN, where I worked with some of the early designers of the Internet. The applications I wrote interacted with other components through asynchronous TCP-based request-response protocols. We understood the interactions were asynchronous and that performance could vary widely across the wide area network. We understood the other components could fail and that communications could fail and we designed for it.

When I arrived at Microsoft in 1996, I found that the Remote Procedure Call paradigm had “won” here. RPC was formally introduced by Andrew Birrell and Bruce Nelson of Xerox PARC in 1984, although there were a number of other similar efforts going on at the time across the industry and in academia. The basic strategy for RPC is to make remote interactions have the same core programming structure as local procedure calls. Reading their paper now, I am particularly struck by the amount of effort spent comparing the relative performance of local and remote procedure calls. From that day forward, Moore’s Law would drive those comparative performance numbers exponentially apart. Even more than pure performance, there are critical issues of how to deal with variance in performance and especially variance in timing of error discovery and the way that failures occur in a distributed system. As I discussed in How to Think About Cancellation, the only fundamental mechanism you have for error discovery in a distributed system is the timeout. In addition, your knowledge about what failed is inherently limited. You basically just need to give up. This is completely different from the failure characteristics for a local procedure call and ends up having large implications for the design of the application, including the design of the overall user experience.
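The point about timeouts can be sketched in a few lines of Python using `asyncio.wait_for`. The `remote_call` function is a hypothetical stand-in that only sleeps to simulate network and server latency; no real RPC library is involved.

```python
import asyncio


async def remote_call(delay_s: float) -> str:
    """Hypothetical remote request; `delay_s` simulates network and server latency."""
    await asyncio.sleep(delay_s)
    return "ok"


async def call_with_timeout(delay_s: float, timeout_s: float) -> str:
    """Return the reply, or give up after `timeout_s` without learning what went wrong."""
    try:
        return await asyncio.wait_for(remote_call(delay_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        # From here the caller cannot tell a slow server from a dead server,
        # a lost request, or a lost reply. All it knows is that time ran out.
        return "gave up"


print(asyncio.run(call_with_timeout(delay_s=0.01, timeout_s=1.0)))  # ok
print(asyncio.run(call_with_timeout(delay_s=1.0, timeout_s=0.01)))  # gave up
```

A local procedure call has no equivalent of the second outcome, which is why hiding the remote call behind a local-looking signature hides the one failure mode the application most needs to design for.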

APIs like MAPI (Messaging Application Programming Interface, the API used to communicate between Outlook and Exchange) that embodied the RPC paradigm arose at the same time as Windows 95 and the introduction of Win32 and OS threads in common usage (Win32 was first introduced in WinNT but its use exploded with Windows 95). The operating system also made widely available the ability to transparently access remote files using the same basic file APIs used for local file access. This model for adding new OS capabilities was one that we would see in many other areas (e.g. networked printers). The basic API stays the same but new capabilities are transparently provided under the covers. The benefit of this approach for OS developers is that applications do not need to be modified to take advantage of these capabilities. This means that users get the benefit of these new capabilities from day one of the OS release rather than requiring a longer drawn-out period as applications are modified to take advantage of the new capability. Again, the problem is that performance, performance variance and error handling behavior are radically different for APIs designed in this way, and this has (or should have) implications for application design and behavior.

The use of threads in concert with RPC made it possible to write APIs that combined communication and local processing behind a single synchronous API. However, essentially all of the API designs that arose during this time shared some basic problems that made them hard to use in real applications. The first problem was that there was no explicit API pattern to indicate whether an API would interact with the network or block in some way. In fact, as I mentioned above, an API might change to start interacting with the network between one OS release and another. There was also no consistent pattern for cancellation (actually, usually no ability to cancel at all) and no pattern for an API to provide progress or feedback. In practice, providing feedback on network interaction can be crucially important because of the wide range of layers where failures can occur and out-of-band actions that a user can take to address failures at different layers (from a physical connection being broken to a network access point failing to the server failing or being unavailable to the operation proceeding successfully but merely taking longer than expected). The idea that an application could “wrap a thread around the API” in order to use it and still maintain responsiveness was naïve but served as rationalization for this approach.
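For contrast, here is a rough sketch of what an API looks like when asynchrony, progress and cancellation are explicit in its shape. It uses Python’s `asyncio`, and the names `save_to_server` and `demo` are invented for this illustration; the network is simulated with a short sleep per chunk.

```python
import asyncio
from typing import Callable, Optional


async def save_to_server(chunks: list[bytes],
                         on_progress: Callable[[int, int], None]) -> int:
    """Hypothetical upload: async by signature, reports progress, and is cancellable."""
    sent = 0
    for i, chunk in enumerate(chunks, start=1):
        await asyncio.sleep(0.01)         # simulated round trip; a cancel lands here
        sent += len(chunk)
        on_progress(i, len(chunks))       # caller can update the view as work proceeds
    return sent


async def demo(cancel_after_s: Optional[float]) -> tuple[str, list[int]]:
    progress: list[int] = []
    task = asyncio.create_task(
        save_to_server([b"abc"] * 5, lambda done, total: progress.append(done)))
    if cancel_after_s is not None:
        await asyncio.sleep(cancel_after_s)
        task.cancel()                     # e.g. the user pressed Cancel
    try:
        return f"sent {await task} bytes", progress
    except asyncio.CancelledError:
        return "cancelled", progress


print(asyncio.run(demo(None)))  # ('sent 15 bytes', [1, 2, 3, 4, 5])
print(asyncio.run(demo(0.0)))   # ('cancelled', [])
```

The caller never blocks, can show progress while the operation runs, and has a defined way to abandon it. None of that is available when the same work sits behind a synchronous call that merely looks local.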

I had a very early experience with the risks of radically changing the behavior of an API back at BBN when the core network API “GetHostByName” was extended to integrate with the newly deployed Domain Name System. Prior to DNS, this API would read a local “/etc/hosts” file to map hostnames to their 4-byte IP addresses. This was a purely local operation that had predictable performance. When DNS support was added, the API changed to either quickly return a cached value or to dynamically query a remote server and block until the response was received. This made its latency widely variable, and the predictable result was that applications hung in places they never had before. Eventually a new API was developed, AsyncGetHostByName, that enabled applications to handle this variance more cleanly.

At Microsoft, much early work in developing these APIs was focused on exposing servers (e.g. file and email servers) for desktop PCs connected to local area networks where networks were hardwired and variance was more controlled. Over the late 90s and into the 2000s, the rise of wide-area access and laptops with wireless access led to much greater variance in performance and a much wider set of error states.

As you hopefully can see, there was quite a bit of cognitive dissonance going on. New capabilities were being exposed in ways specifically designed to make it easy for applications to adopt by not requiring application design changes. At the same time, the changes in behavior of these APIs absolutely demanded significant changes in application design. The direct consequence of the failure to make those application design changes was hangs.

Windows XP (in 2001) added the “Red X” in the corner of the window frame that made it easy for a user to kill an unresponsive application. At the same time, Office XP introduced the “Watson” system for reporting application crashes, which included reports on user termination of hanging applications. This was the first concrete data (of course there was much anecdotal information before then) that made it clear that hangs were a significantly larger problem than crashes. This information led to major investments across the Office suite to address these issues over the next decade and a half. These have involved large design changes, especially in how data is locally cached and then asynchronously opened and saved to services. These changes ultimately had large implications for user experience. Despite this work, there are still scattered responsiveness issues throughout the applications that arise from these very basic early design issues.

I should note that the Windows Runtime (WinRT) APIs introduced in Windows 8 and expanded in Windows 10 were specifically designed to address these problems. They made a much bigger bet on explicitly exposing asynchrony and having common patterns for cancellation and progress reporting. These design changes were a direct consequence of the learning that happened over this period and design discussions between Windows and Office to incorporate them into new APIs.

What about web apps?

Clearly, there have been and still are lots of bad web apps. However, there is a class of web apps I want to talk about that is best represented by that early archetype, Google Maps. In the early 2000s, sites like MapQuest were starting to battle native applications like Microsoft Streets and Trips in the mapping application space. MapQuest was amazing but was just painful to use. You would click-click-click on that stupid arrow to move around or zoom the map and wait for the screen to be refreshed. Google Maps burst onto the scene in 2005 as a revelation. It was not the first application to use the techniques that later came to be called “AJAX” (Outlook Web Access probably deserves that honor) but it combined a number of key characteristics that best defined this class of applications, both architecturally and in terms of user experience and developer mind-share.

Asynchronous. Google Maps interacts with the remote service asynchronously. In some sense, there is no choice. JavaScript execution is single-threaded (this was in the days before web workers) and so code cannot be written to synchronously wait for a network response (and the browser would terminate a JavaScript application that hung in this way). Additionally, operating in the browser there was no confusion that network response was going to be anything but widely variable based on local conditions. The application needed to be designed for this variability. More deeply than this, the application was written so that its local state was always fully consistent and ready for subsequent user interactions. Asynchronous responses then update this local state. This may seem trivially obvious but it is actually fundamental. The asynchronous request does not leave the application in some awkward indeterminate state — the local application state is always well-defined and ready for user interaction despite the existence of outstanding asynchronous requests. This is the core of the point I made in the post Loose Coupling in Asynchronous Systems.
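Here is a small sketch of that idea, written in Python’s `asyncio` rather than browser JavaScript. The `MapState`, `fetch_tile` and `pan_to` names are made up for the illustration and say nothing about how Google Maps itself was implemented. The local state is updated immediately on every user action, and a late reply is applied only if it still matches what the user is looking at.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class MapState:
    center: tuple[int, int] = (0, 0)
    tiles: dict[tuple[int, int], str] = field(default_factory=dict)  # loaded images


async def fetch_tile(pos: tuple[int, int], delay_s: float) -> str:
    await asyncio.sleep(delay_s)          # simulated, widely variable network latency
    return f"image@{pos}"


async def pan_to(state: MapState, pos: tuple[int, int], delay_s: float) -> None:
    state.center = pos                    # local state is complete before any reply
    image = await fetch_tile(pos, delay_s)
    if state.center == pos:               # drop replies the user has already panned past
        state.tiles[pos] = image


async def demo() -> MapState:
    state = MapState()
    slow = asyncio.create_task(pan_to(state, (1, 0), delay_s=0.05))
    await asyncio.sleep(0)                # let the first request get under way
    fast = asyncio.create_task(pan_to(state, (2, 0), delay_s=0.01))
    await asyncio.gather(slow, fast)
    return state


final = asyncio.run(demo())
print(final.center)  # (2, 0)
print(final.tiles)   # {(2, 0): 'image@(2, 0)'}
```

At no point between the two pans is `state` in a half-updated condition, so a third user action could be accepted at any moment, whatever requests are still outstanding.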

Local feedback is immediate. For the common actions of panning and zooming, Google Maps would offer direct and immediate local feedback. When panning, the existing map surface would immediately move and a temporary surface would be displayed for newly revealed areas with no image available. These would then get filled in as the asynchronous requests for those map tiles completed. For zooming, Google Maps would optically zoom existing tiles (simply scale available images from the original zoom level using browser capabilities) and then asynchronously replace those areas with requested images from the service at the new zoom level. Despite the application being written in JavaScript, key functionality (like image loading and scaling) was effectively implemented in much more performant C++ by making use of native browser capabilities.
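The optical zoom trick is easy to sketch if you assume a quadtree tile scheme, where each zoom level doubles the number of tiles in each direction. That scheme is my assumption for this example, as are the 256-pixel tile size and the `placeholder_for` helper. Given a tile that has not arrived yet, the helper says which already-loaded parent tile to borrow from and which quarter of it to stretch.

```python
def placeholder_for(zoom: int, x: int, y: int,
                    tile_px: int = 256) -> tuple[tuple[int, int, int], tuple[int, int, int, int]]:
    """Return (parent_tile, source_rect) to stretch while tile (zoom, x, y) loads.

    Assumes tile (z, x, y) covers exactly one quadrant of tile (z - 1, x // 2, y // 2).
    source_rect is (left, top, width, height) in pixels within the parent tile.
    """
    if zoom == 0:
        raise ValueError("zoom level 0 has no parent tile")
    half = tile_px // 2
    parent = (zoom - 1, x // 2, y // 2)
    source_rect = ((x % 2) * half, (y % 2) * half, half, half)
    return parent, source_rect


print(placeholder_for(3, 5, 2))  # ((2, 2, 1), (128, 0, 128, 128))
```

The view can draw that quarter of the parent at double size right away, then swap in the sharp tile whenever the asynchronous request completes.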

This experience was not only radically better than other web-based mapping applications, it was way better than native applications like Streets and Trips that depended on relatively slow CD-based IO to fetch image or map data and requested that data synchronously, blocking the user experience. This comparison was an even better lesson that an application that embraced its constraints could deliver a superior user experience.

The surface is virtualized. This just means that the application kept around only the information necessary to show the user feedback on the screen. New data was only fetched on demand and old data was discarded when no longer needed. This kept the running size of the application in the browser reasonable. Virtualization is key for great performance for both native and web applications. The 2D nature of maps makes virtualization relatively straightforward here but helped clarify the effectiveness of the technique.
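A minimal sketch of a virtualized 2D surface follows, assuming fixed-size square tiles addressed by (column, row). The function names and the 256-pixel tile size are invented for the example. A real implementation would usually also keep a margin of tiles around the viewport, which this sketch leaves out.

```python
def visible_tiles(left: int, top: int, width: int, height: int,
                  tile_px: int = 256) -> set[tuple[int, int]]:
    """Tile coordinates that intersect the viewport rectangle (all values in pixels)."""
    x0, x1 = left // tile_px, (left + width - 1) // tile_px
    y0, y1 = top // tile_px, (top + height - 1) // tile_px
    return {(x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)}


def update_viewport(loaded: dict[tuple[int, int], str],
                    needed: set[tuple[int, int]]) -> list[tuple[int, int]]:
    """Discard tiles that scrolled out of view; return the tiles still to be fetched."""
    for key in list(loaded):
        if key not in needed:
            del loaded[key]               # old data is dropped as soon as it is unneeded
    return sorted(needed - set(loaded))


loaded = {(0, 0): "a", (1, 0): "b"}       # what a 512x256 viewport at x=0 had loaded
to_fetch = update_viewport(loaded, visible_tiles(256, 0, 512, 256))  # pan right one tile
print(to_fetch)  # [(2, 0)]
print(loaded)    # {(1, 0): 'b'}
```

Memory use is bounded by the viewport size rather than by the size of the map, which is the whole point of the technique.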

Leverage the service. A web application inherently has a service backend that is guaranteed to be present (the page needs to be served from somewhere). In the case of Google Maps, this service could maintain commonly requested map tiles in a distributed memory cache so that service requests did not require any disk IO to complete. So while the application was designed to receive responses asynchronously and deal with significant variance in response times, in fact the actual latency could start to compete with local disk seek and random read times.
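As a toy, single-process stand-in for that kind of memory cache, here is a small least-recently-used cache in front of a slow loader. The `TileCache` class and the LRU eviction policy are my own choices for the illustration; a real distributed cache is a different beast, and nothing here describes what Google actually ran.

```python
from collections import OrderedDict
from typing import Callable


class TileCache:
    """Small LRU cache; `load` stands in for the slow path (disk read or rendering)."""

    def __init__(self, load: Callable[[str], bytes], capacity: int) -> None:
        self._load = load
        self._capacity = capacity
        self._items = OrderedDict()       # key -> tile bytes, oldest first
        self.misses = 0

    def get(self, key: str) -> bytes:
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        self.misses += 1
        value = self._load(key)           # only a miss pays for the slow path
        self._items[key] = value
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict the least recently used entry
        return value


cache = TileCache(load=lambda key: key.encode(), capacity=2)
for key in ["a", "b", "a", "c", "b"]:
    cache.get(key)
print(cache.misses)  # 4
```

Popular tiles stay in memory and are served without touching the slow path, while rarely requested ones fall out. That is what lets the typical request be fast even though the client is built to tolerate slow ones.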

Note that there is nothing about these characteristics that is unique to web applications. It was just that there was much more clarity that the operating environment in the browser required these approaches. As we were planning the next version of Office in 2006, I used the example of Google Maps to drive a much bigger investment in performance. It captured many of the points I had been arguing for internally for the previous five years around embracing asynchrony for network-connected applications, and by this point every application was a network-connected application.

What about mobile apps?

Like the web application space, there were many characteristics that lent “moral clarity” to arguments about responsive application design in the mobile space.

Touch. Touch-based interfaces — and especially touch-based interfaces that combine animation and motion — have much tighter responsiveness timing requirements compared to mouse or keyboard-based interfaces. The eye is very sensitive to variance and glitches in motion. This ended up driving changes in application architecture to meet these essentially real-time 60-frames-per-second requirements (a topic probably worth another whole post). While much effort was spent on responsiveness in the previous decades, it was primarily “best effort” and glitches were generally accepted as a normal part of the trade-offs one made in designing the application. Those trade-offs were not acceptable for touch-based stick-to-my-finger interfaces.

Mobile networking. The characteristics of mobile networking drove similar clarity as for web-based applications. Application developers understood that networking would vary widely in bandwidth, latency and error rates and needed to design for it. In addition, much focus was spent on application networking behavior because poor behavior could have significant implications for overall device battery life.

Ecosystem enforcement. Apple took a much stronger role in enforcing responsiveness standards than previous operating systems. The OS itself would directly terminate an application that was not responsive after a short delay. Android and Windows 8 later added similar features. These turned a nuanced discussion of whether to fix some problematic user experience into a clear application failure that needed repair. The curated app store also let Apple directly enforce these responsiveness requirements even before an application was released.

The improvements in application responsiveness we have seen over the last decade are a direct consequence of significant architectural changes in the way applications are built and the way OS capabilities are exposed. While some of this was from the overall industry “getting smarter”, a large part was due to the fact that the fundamental physics of distributed applications were clearer in the web and mobile spaces. This helped the industry avoid some of the design errors that occurred during the early days of PC networking.