Layering Lessons from Outlook’s Key-Value Store
At the start of the Office 2016 product cycle (2012–2013), the Outlook team made the decision to build a new mobile client on top of a new architecture and codebase rather than leveraging the existing Outlook desktop code. This post focuses on one part of that work — the design of the lowest levels of the client storage stack — because I think it provides a great example of how deeply layers interact in complex applications and how to design that layering in an effective way.
Let’s start with a little background. The original Outlook desktop client was designed in the early nineties in parallel with a separate team building the first versions of the Exchange mail server. The product strategy was built around a rich client loosely coupled with a rich server. The loose coupling allowed independent development and deployment of new features on either the client or the server. At one level, the strategy was wildly effective. Exchange effectively swept the table in the market for commercial email and Outlook became the default commercial email client, helping drive the success and demand for new versions of the full Office client suite over the last two decades.
At another level, the strategy was showing its age. Many of the challenges in stabilizing and delivering rich functionality over the years (sophisticated calendaring features, rules-based handling of email, etc.) required parallel development of functionality on both the client and server and required building that functionality on very different technology stacks. The very heterogeneous environment of email clients that developed with the rise of mobile platforms also tended to drive more functionality into the service where it could be implemented consistently rather than requiring compatible — and complex — application logic running on all endpoints.
In the modern mobile+cloud world, the engineering team leverages the ability to ship new versions of the service and compatible clients on a rapid cadence and radically reduce the latency in deployment. In the old client/server world, product cycles would run 2–3 years long and then deployment cycles for those new clients and compatible servers could stretch out from 1–6 years. This new engineering environment would allow the team to build and deploy new integrated client and service features at a much faster rate. Instead of a loose coupling between client and server, the team could view the client as an integrated component — really a projection of the service. As part of this design, they would push to unify as much of the complex business logic for new features in the service rather than requiring compatible implementations across a range of heterogeneous clients. That old approach had proven fragile and slowed innovation speed. Viewing the client as a projection of the service was much more in line with modern mobile and cloud architectural strategies.
The desktop Outlook client uses a sync-based model for communicating with the service. Essentially this means that the application implements commands as local changes to the data model which are then synchronized at a granular level with the server state. The advantage of this approach is it allows new functionality to be built in the client as a composition of granular changes to the data model without requiring updates to the client-service synchronization mechanism or requiring deployment of new service functionality. The significant disadvantage of this approach is that these commands represent a build-up of complex business logic in the client — exactly what we want to avoid in a mobile+cloud architecture where we want to push complex functionality and business logic into the service as much as possible.
This led the Outlook team building the new client to come up with an approach that they called “Commands Up / Data Down + Lies”. Commands are executed by invoking commands in the service. The service provides new data which is always “the truth”. The local client uses “lies” overlaying the real data model in order to provide immediate feedback to the user but these lies are eventually discarded when the real truth is returned from the service. The data the client holds no longer needs to granularly represent a sync-ready version of the service data model but can be tuned for presentation purposes. This is both true in terms of how the service prepares the data to send to the client as well as how the client stores the data in memory and on disk.
The client maintains a local cache of a subset of service data in order to be able to buffer against connectivity variability, provide fast startup and to tune service load (that is, all the typical reasons to cache on the local device).
As the team was working through this design, they took an explicitly transparent approach to layering. They wanted to ensure that data could be received from the service, accessed in memory and stored on disk with minimal processing and minimal additional data structures. Loading the message store on startup should have similar characteristics — requiring minimal separate IO requests to the cache and minimal processing and building up additional in-memory data structures. This focus on minimal processing was based on an understanding that startup speed and the ability to fault in new parts of the cache quickly is directly determined by the number of independent IO operations the application needs to make and then the amount of memory it needs to touch as it interprets that data and builds up additional data structures. If you can tune these things you can reveal that we really are carrying around supercomputers in our pockets.
Whenever you design some layer, there is often a decision about whether there is an element of “magic happens here” — especially around performance — or whether the behavior of the layer is much more predictable and easy for a higher calling layer to model and control. For example, in my article Model-View-Controller and Loose Coupling, I talked about how the layer that computed how to update the screen in VI or Emacs was effectively an opaque black box with “rocket science” — the rest of the system could just assume that the screen would be updated optimally. That isolation enabled a very loose coupling between the model and the view. Facebook’s React framework operates as a similar black box to optimize updating the browser DOM for web applications.
It is often much harder to practically accomplish this kind of magic when dealing with storage in any kind of complex application. Certainly a relational database paired with a query processor can provide significant optimization around building query plans that looks like rocket science. This enables a single data model to support a wide range of flexible queries. However, if you look at virtually all complex applications, they have highly tuned their data model and query pattern to control the overall IO behavior — building appropriate indices or denormalizing the data in some way to better match the applications load requirements (which also then results in building up non-trivial business logic in order to keep that denormalized data model consistent). I do not have a logical proof for why this is the case, but it seems universal based on experience.
Once you are tuning your data access patterns, you typically want the simplest pass-through to the underlying storage rather than having to “trick” any sophisticated intermediate layer into behaving predictably. That can be unstable over time, especially as the software (and software team) evolves and load patterns change or the intermediate layer changes.
All this reasoning led the Outlook team to prefer using a relatively simple key-value store rather than any more complicated underlying layer. I want to dive into the design of that layer a bit here.
As I mentioned, the design for this key-value (KV) store was intentionally kept simple in API and behavior so that its performance would be transparent and predictable. The database loads quickly (generally two IO operations) and loading a value from a key generally requires 1 or 2 IOs (depending on whether there is a hash conflict bucket to indirect through). The database performs no caching for higher layers. The implementation is small and was completed by a couple developers in three months with very few issues after initial development.
The store only allows single threaded access. The system is generally IO bound and could be kept much simpler by assuming a single thread manages all access to the file. The core APIs allow you to read the value associated with a key, store a value by key, delete a key, start a transaction, and end a transaction (by commit or abort). All stores and deletes wrapped in a transaction are guaranteed to be committed atomically. This is about as simple an API as one could imagine. It explicitly places demands on higher layers, e.g. to ensure that write transactions are chunked into reasonable sizes if you want to intersperse reads for responsiveness through to the user interface. Higher layers are responsible to cluster data into appropriate-sized values so that distinct IO operations are optimized for the app overall. There was a clear trade-off made in keeping this layer simple and predictable and placing more demands on higher layers, but demands that could be filled by leveraging easy-to-understand building blocks. It is not explicit in the single-threaded design but an additional constraint is there can only be a single transaction active at any time. This means that there is no need for mechanism and overhead to support Isolation at this layer (the I in ACID database properties).
The implementation is quite straight-forward. The file contains a header block that points to a hashtable that maps hashed keys to offsets in the file where the combined key and value are stored. At the start of each 256MB section of the file is a free-list bitvector that tracks free space in 512 byte chunks. In fact, there are actually two copies of the hash-table and two copies of each free-list in the file. The header always specifies the current hashtable and current free-list. As key-values are stored, the in-memory copy of the hashtable and free-list are updated. To commit the transaction these changes are flushed to disk by writing to the second location for the hashtable and free-list and then the header is flushed to point to the new current version. Each subsequent transaction then alternates between pointing to these two copies in the file. A transaction can be committed without forcing flushing to the file — this ensures atomicity but allows the application to trade-off Durability (in the ACID sense) for performance (so a crash will not result in corruption but might result in some minor loss of work).
There are some minor additional optimizations or features. I mentioned there is no caching for IO, but in fact the KV store will batch together multiple KV writes into a single block and then flush it out as a single operation. There are a couple reasons for this. The first is that it is relatively simple to convert temporal locality of write operations into spatial locality (just batch them together). This is not the case for reads of course. Additionally, small write operations are in some sense “worse” than small read operations on most storage devices because data is virtually always managed as blocks at the lower levels and a non-block-aligned write operation requires a read-merge-write sequence. So there is more incentive to help out in clustering writes even if the overall responsibility for keeping the number of independent IOs small by clustering data into a single value is the ultimate responsibility of higher layers. Another reason for not buffering reads is that it introduces hard to optimize lifetime management issues for those caches. Because reads are not buffered, the storage for the read data is completely owned by the higher layers. I’ll talk more about that below.
One other feature that this layer provides is error detection using a CRC of KV blocks. Some error rate is inherent in all storage systems and providing checks here for this long-running storage is cheap (since all the bytes need to be touched anyway as they are being read and written).
The abort transaction capability is implemented by a small log that tracks the in-memory changes to the hashtable and free-lists. If the transaction is aborted, those changes are restored. Any new key-values persisted to the file can be ignored (since that space is implicitly reclaimed by restoring the free-list). This is another example where the functionality is made available to a large extent because it is simple and cheap (in both memory and CPU) to provide at this level.
Fragmentation can be a problem for any database that supports a wide range of value sizes. The KV store addresses this using a pretty simple technique that in pseudo-code looks like upper layers calling “while (GoodTimeToDefrag() && store->ShouldDefrag()) store->DoDefragStep(500);” where 500 defines a time quantum for defragmentation to execute. Defragmentation might explicitly reduce the file size which is important to prevent a long-running file store from staying at the lifetime high water mark.
The database does not store attachments or even large message bodies (these are stored directly in the file system, which is optimized for this use case). Again, this allows the KV store to optimize for a smaller set of design points. In particular, large blob support would probably introduce the need to support some type of streaming access for both reads and writes. The file system already manages this quite well. It does mean that higher levels are responsible for maintaining consistency between the KV store and these blobs in the file system, similar to the challenge for the OneNote cache which takes a similar approach of offloading large blobs.
Finding the right balance of functionality for a specific software layer is always trickier in the process of designing than it is in retrospect looking at the description of the final result. The striking thing in looking at Outlook’s KV store is how careful they were to not get overly ambitious about piling on functionality. They worked hard to meet the goal of predictable behavior, especially around performance, and only implementing that extended functionality that could be implemented in a simple and straightforward way at this layer without adding complexity and performance overhead or performance variability. Having components (and teams that own components!) that “know their place” is crucially important in trying to optimize end-to-end behavior.
The next layer in the stack is called “COLA” for “Collections and Objects Layer”. Objects are a property bag of schematized attribute IDs and typed values. They are designed to be stored and accessed from a single allocated block of memory. A collection is an array of such objects and supports insert, delete and find. The collection layer aggregates objects into contiguous chunks of approximately 32Kb, a value determined experimentally based on client load. The collection has an indirection table that maps from a range in the array to a specific one of these object clusters. Objects are not stored as independent KV entries — they are always part of a collection.
During the build process a schema description file is used to define the object schemas and generate C++ classes to access individual fields from the packed property bag. The schema file also defines “persistence groups” which are used to specify which objects should be packed together into a single key-value. This provides flexibility in how the in-memory objects are packed on disk so it can be tuned as features and usage change. Schema transitions require a migration of the cache file, which keeps the running state simple (even if managing the user experience during migration is non-trivial).
Collections are explicitly designed to represent the ordering as presented to the user so they are stored and accessed with good locality. The objects and collections are designed to have good locality and require minimal additional data structures built up around them. This means that the process of starting the application and loading the dataset involves both minimal IO operations and minimal allocations and creation of additional data structures — which makes it wicked fast. Note that Outlook is heavily leveraging that it is a read-dominated application — it is mostly reading and displaying content, not modifying it.
The thing that struck me about this whole design was this focus on a very clear end-to-end goal. This was achieved not by building a mysterious magic black box, but by building a component that was very transparent about its behavior guarantees and allowed higher layers to leverage those behaviors to achieve the end goal. Such an approach also ends up being very evolvable over time, since the responsibilities stay clear between layers.