Asynchronous Issues in the Word Web App

The Microsoft Word Web App had an interesting outage a couple of years back that was related to one of my special areas of interest: asynchronous issues in systems design. The incident ended up shedding light on a couple of interesting design patterns.

In this case, the Word Service was opening a Word document and processing it in order to send a representation down to the browser. That processing had two main components: converting the main structure and text of the document into the data format required by the browser, and dispatching embedded images to an image validation service that both checked the images for errors and produced a web-optimized version of each image for download to the client.

The initial version of the service would block as it waited for the image validation request to complete. In order to improve end-to-end latency, a change was made to make the image validation asynchronous to the continued processing of the main structure and text. The Word Service would finish its processing of the text in parallel with the image service and then finally rendezvous back and block waiting for the image service to complete at the end before completing the request to the client. The intent was to reduce overall latency by parallelizing these two key parts of the processing. Early testing indeed showed the intended improvement.
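The shape of that change can be sketched in a few lines. This is only an illustration, not the Word Service's actual code: the function names (validate_image, convert_text, process_document) are hypothetical, and the sleeps stand in for the real work.

```python
import asyncio


async def validate_image(image: str) -> str:
    # Hypothetical stand-in for a call to the image validation service.
    await asyncio.sleep(0.01)
    return f"optimized:{image}"


async def convert_text(text: str) -> str:
    # Hypothetical stand-in for converting the document structure and text.
    await asyncio.sleep(0.01)
    return text.upper()


async def process_document(text: str, images: list[str]) -> tuple[str, list[str]]:
    # Dispatch every image request up front without waiting on any of them.
    image_tasks = [asyncio.create_task(validate_image(i)) for i in images]
    # Keep converting the text while the image requests are in flight.
    converted = await convert_text(text)
    # Rendezvous: block until all image requests have completed.
    optimized = await asyncio.gather(*image_tasks)
    return converted, list(optimized)
```

Note that nothing in this sketch bounds how many image requests a single document can have in flight, which is exactly the property that matters later in the story.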

There’s an interesting refactoring pattern here in taking a synchronous algorithm and changing it to use asynchronous processing. In an asynchronous system, issues of resource management, prioritization, throttling and congestion control usually need to be dealt with more explicitly. In a synchronous design these issues are often implicit in the way the system is coded, which is “simpler” but provides less opportunity for explicitly managing these important concerns (or even recognizing that you are managing them at all).

In this case, the modified system would not need to process any more images per unit time than it had already demonstrated it could handle. So there was little concern about the primary end-to-end issue you might have to deal with in an asynchronous system: providing “back-pressure” through a pipeline of components so that the slowest component does not get overrun.

However, a different but related problem that often arises is fairness: preventing one client from dominating resource usage. A simple analogy is a store with a single checkout counter. It can gracefully provide great throughput with little or no waiting if each customer brings up one or two items. But then someone shows up with an overloaded shopping cart and the line starts backing up and backing up. Pretty soon customers entering the store see the long line and just leave.

Essentially that’s what happened here. In the synchronous design, requests to the image validation service were implicitly throttled since only one image request was issued per document at a time. Most documents have zero, one or two images, but occasionally one document is requested that has many images. It can quickly issue these asynchronous requests and swamp the image validation service (it shows up at the counter with an overflowing shopping cart). Other requests to the image service from other documents back up behind it. Now all these requests are taking a long time since they all are backed up behind this one slow request. The overall Word service ended up being unable to respond to new requests as the number of outstanding requests (with associated resource usage) built up.
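A toy model makes the effect concrete. The sketch below assumes a single-server image service that handles requests strictly first-in-first-out, with every image taking one time unit and all documents submitting at time zero. The function name and the numbers are made up for illustration.

```python
from collections import deque


def fifo_completion_times(arrivals, service_time=1):
    """Return {doc: time its last image finishes} for a single FIFO server.

    arrivals is a list of (doc, image_count) pairs, submitted in order at
    time zero. Each document enqueues all of its images at once.
    """
    queue = deque()
    for doc, count in arrivals:
        queue.extend([doc] * count)

    clock = 0
    done = {}
    while queue:
        doc = queue.popleft()
        clock += service_time
        done[doc] = clock  # overwritten until the document's last image
    return done


# A 50-image document arrives between two small ones.
times = fifo_completion_times([("small-1", 1), ("big", 50), ("small-2", 2)])
print(times)  # {'small-1': 1, 'big': 51, 'small-2': 53}
```

The second small document needs only two units of work but finishes at time 53, because it is stuck behind the overflowing shopping cart.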

There are a number of ways one could imagine fixing this. One approach is for the consuming client to simply self-limit the number of requests it has outstanding. This is essentially what browsers do by limiting the number of open connections they will make to a single site, even when there are many resources to fetch and they could improve local latency by requesting them all at the same time. Another approach is for the service to manage fairness internally, perhaps by processing requests in a round-robin fashion rather than strictly first-in-first-out. In some cases it can be a good approach to let the service receive and queue outstanding requests, if the service can do a better job of overall optimization by having a larger set of requests to reason over. A classic example is a disk controller trying to optimize the latency of reads and seeks on a spinning disk. Queuing at the service might not make sense, however, if the resource requirements of holding the queued requests are high. In that case we would rather throttle on the requesting side.
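To illustrate the service-side option, here is a small sketch of round-robin scheduling in the same kind of toy model: a single server, one time unit per image, all documents submitting at time zero. It is a hypothetical scheduler written for this example, not something the image validation service is known to have used.

```python
from collections import deque


def round_robin_completion_times(arrivals, service_time=1):
    """Return {doc: time its last image finishes} under round-robin service.

    arrivals is a list of (doc, image_count) pairs. The server handles one
    image from the document at the head of the line, then sends that
    document to the back if it still has images left.
    """
    pending = deque((doc, count) for doc, count in arrivals if count > 0)
    clock = 0
    done = {}
    while pending:
        doc, remaining = pending.popleft()
        clock += service_time
        if remaining > 1:
            pending.append((doc, remaining - 1))
        else:
            done[doc] = clock
    return done


times = round_robin_completion_times([("small-1", 1), ("big", 50), ("small-2", 2)])
print(times)  # {'small-1': 1, 'small-2': 5, 'big': 53}
```

The total work is unchanged (everything is finished at time 53), but the small document that arrived behind the large one now completes at time 5 rather than waiting for all 50 images ahead of it.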

The actual fix was to limit each client of the image validation service to at most two outstanding requests at any one time. This was simple to implement and captured almost all the benefits, since most documents only contain a few images. For documents with many images, the image validation time dominates anyway, so there was no benefit to issuing all requests asynchronously at once: the requests end up serialized inside the image service regardless.
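A client-side limit like this is typically a few lines of code. The sketch below uses an asyncio semaphore to cap outstanding requests at two. The names (MAX_OUTSTANDING, validate_image, validate_all) and the stats dictionary are hypothetical and exist only to make the cap observable; I am assuming the limit applies per requesting document, and the real implementation may have differed.

```python
import asyncio

MAX_OUTSTANDING = 2  # the limit described in the text


async def validate_image(image, stats):
    # Hypothetical stand-in for the image service call; it records how
    # many requests are in flight so the cap can be checked afterwards.
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.001)
    stats["in_flight"] -= 1
    return f"optimized:{image}"


async def validate_all(images, stats):
    limiter = asyncio.Semaphore(MAX_OUTSTANDING)

    async def limited(image):
        # At most MAX_OUTSTANDING coroutines get past this point at once.
        async with limiter:
            return await validate_image(image, stats)

    # gather returns results in the same order as the input images.
    return await asyncio.gather(*(limited(i) for i in images))


stats = {"in_flight": 0, "peak": 0}
results = asyncio.run(validate_all([f"img{i}.png" for i in range(10)], stats))
print(stats["peak"])  # 2
```

A document with one or two images behaves exactly as in the unthrottled design, while a document with many images can no longer flood the service.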

The pattern we saw here is a common one: an asynchronous design opens up opportunities for better resource management and allows for intentional design tradeoffs, but requires more explicit management of those choices. As systems grow in complexity, exposing these important design choices as intentional parts of the system design allows the system to grow more gracefully than leaving them buried as implicit choices in synchronous coding patterns distributed throughout the code.

One interesting part of this story is that when the problem was discovered, the team was able to make a configuration change in the service to revert to the previous behavior and quickly restore system health. One of the changes the Office team had put in place in how they implemented features like this was the ability to flight them in production and control the execution path through configuration options. This is an approach that Office used fairly rarely when developing client software on multi-year schedules, but it has become common practice (essentially mandatory) in the services world. It was critical in being able to respond quickly and mitigate the issue before the team was able to deploy a deeper fix. They were also able to use comprehensive data analysis on the characteristics of real documents and real usage to validate that the change to throttle to two images at a time would capture almost all the benefit. This is obviously very different from the previous client world with long update intervals and relatively limited telemetry.
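As a rough illustration of controlling the execution path through configuration, consider the sketch below. The setting names and the ServiceConfig type are invented for this example and say nothing about how Office's flighting system is actually built. The point is only that the old synchronous path stays in the code and can be selected without a new deployment.

```python
from dataclasses import dataclass


@dataclass
class ServiceConfig:
    # Hypothetical settings, assumed to be changeable in production.
    async_image_validation: bool = True
    max_outstanding_image_requests: int = 2


def choose_image_strategy(config: ServiceConfig) -> tuple[str, int]:
    """Return (mode, allowed outstanding requests) for image validation."""
    if not config.async_image_validation:
        # Previous behavior: one blocking request at a time.
        return ("synchronous", 1)
    return ("asynchronous", config.max_outstanding_image_requests)


# Mitigation during the incident: flip the flag to restore the old path.
print(choose_image_strategy(ServiceConfig(async_image_validation=False)))
# ('synchronous', 1)

# Deeper fix: asynchronous, but throttled.
print(choose_image_strategy(ServiceConfig()))
# ('asynchronous', 2)
```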