Never Write Your Own Database
Note: I originally authored this post while at Microsoft with Vincent Lascaux (the engineer primarily responsible for the work in OneNote described here) and decided to republish in this forum since that version no longer seems to be available. Thanks Vincent for all the detailed information and feedback!
There’s an old adage in software engineering — “never write your own database”. So why did the OneNote team go ahead and write one as part of upgrading OneNote’s local cache implementation?
First, it’s probably necessary to more carefully define what I mean by a “database” in this context.
Most interesting applications need to be able to persist their application state and the user content they create and manage to durable storage. An application can either use its own custom logic for persistence or make use of some pre-existing database management software (DBMS) to handle storing its data. That pre-existing DBMS might be some kind of relational store (e.g. SQLLite, ESENT) or a simpler key-value store (or frequently, a relational store used as a key-value store).
If you start exploring these technologies, it doesn’t require much time browsing the Wikipedia page for ESENT (for example) to get a sense for some of the complexity involved in building a database. While reading through that description, you can almost do an archeological analysis of what were the initial core capabilities and what were additional capabilities developed over time, clearly driven from evolving client requirements (clustered indexes, sparse indexes, covering indexes, versioned columns, etc. etc.). It gives the clear sense that this is a component that has stood the test of time and served as the basis of some key products and how they evolved in the market place.
There are two rather counter-balancing reactions when evaluating whether to use such a robust and functional component. The first reaction is “holy crap, there is no way that I’m smart enough to duplicate that tech”! That’s a good and righteous reaction! “If I see so far, it is only because I stand on the shoulders of giants…”
The other reaction is: “Wow, there’s a lot of stuff there that I don’t need. I wonder what I’m paying for that?” One of the key principles of engineering is TANSTAAFL — “there ain’t no such thing as a free lunch”. Ultimately, when you look under the covers, that extra capability is implemented with extra code, IO, memory and CPU, just as it would be if you implemented it on your own. Sometimes (rarely!) it’s completely pay-for-play and you only pay the cost when you use those extra features. Other times the component’s implementers have built in certain assumptions and trade-offs and those features are likely to constrain future evolution or place constraints on how the component is integrated into a larger system. It is the complexity of analyzing this, even when you understand the core design characteristics of the component, that often leads to the practice of building a full functional prototype to validate your specific performance requirements under your own simulated (or actual) load patterns.
Let’s get back to the problem of persisting application state.
Some applications operate over storage in a relatively simple manner. A file on disk is completely read into data structures in memory, modified, and then the entire file is written back to disk. A simple trick of writing the new file under a separate name and then renaming the new file to replace the old file only after it has been successfully written provides safety against storage corruption.
The next approach is to represent the application’s state using multiple files on disk; the step up in complexity is surprisingly steep. An important challenge is that generally there are interdependencies between the files (e.g. an index of some kind) and file systems do not generally provide any transactional guarantees (or the tradeoffs in using the support they do provide is too great). In the face of failures (guaranteed to occur at scale) the state of the different files can become inconsistent. This means the application must build in techniques for recognizing and repairing this inconsistent “corrupted” state. The very fact that the files are independently accessible to other applications and services through the file system also makes this kind of inconsistency a state that needs to be designed for. However, the interaction with any single file can be still relatively simple — just open and rewrite it in toto. In aggregate, there may be other challenges depending on the scale at which the file system is leveraged (e.g. if 10s, 100s, 1000s of files need to be managed and potentially split across directories to optimize file system performance characteristics that scale with number of files per directory). This approach of using independent files can work well when the files are independently consistent and where the typical file size is large enough that storage space efficiency is not a special concern (due to fragmentation or wasted space due to file system metadata and page size). An example use that meets those requirements is storing email attachments as independent files on disk rather than within a single mail cache file. Storing large independent blobs is a problem file systems have been tuned for, so this offloading can narrow the scope of design points a single file database needs to be targeted at.
The key differentiating approach in a database as I am using the term here is that the single file is only partially read into memory and only incrementally rewritten. For example, OneNote provides a virtualized experience over a potentially very large space (10s of GBs) of notebooks. They cannot be fully loaded into memory and yet the application would like to provide the illusion that all the data is available for browsing and editing at any time. The application would like to boot to a working state quickly (that is, only do a small amount of IO on startup). Modifications typically only involve a small subset of the data and the application would like to write those changes to disk quickly and at low cost. As I browse over new parts of the data model, I would like to quickly fault that data in (e.g. navigate to a new section in OneNote). An additional requirement is that while most writes are small and incremental, the special case of syncing new content needs to be very fast and efficient for bulk transfers.
Even with this rather simple description of requirements, we already have somewhat conflicting constraints (which is the nature of engineering!). We would like the data packed together so we can reduce total number and size of IOs, but we would like to avoid having to repack it all together when saving back changes to only a small subset of that data. If we only write the changed data our data ends up getting more fragmented over time while if we always repack we end up doing more IO — essentially having to rewrite data that hasn’t changed in order to get better packing. One fundamental change in approach that we’ve seen in the move away from desktop AC-powered machines to battery powered (that is, power constrained) mobile devices is moving away from the assumption that we can handle some of these tradeoffs through sporadic high-cost maintenance processes (e.g. running a whole file compaction or defragmentation process “at 3AM”). Instead, steady-state performance is the focus and costs need to be amortized over time. It was already the case even before the rise of smartphones that our assumptions about our ability to run code at these times was suspect (battery-constrained laptops became dominant starting in the mid-2000’s; “green” AC-powered machines are automatically shut down at night; as disk capacities exploded but IO per second did not, just rewriting all the data could take so long that it could not complete in any maintenance window that was available).
We also need to think about how to implement the core database ACID properties (Atomic, Consistent, Isolated, Durable). While there is a long technical history on approaches and techniques here, they are generally embedded in specific implementations rather than available as a simple set of APIs to call. Trying to roll your means joining a mysterious cabal of wizards familiar with file system arcana including things like when is a flush really a flush, what bizarre things virus checkers can do, where and when IO requests can be reordered, how backup systems really work, how sync engines work, how this varies across OS type, etc.
The advantage of rolling your own is that you get to optimize and control things at a level that is just not possible when working through a complex external component. In some sense this is the eternal component dependency problem. When does an external component solve a hard problem better than you could yourself? Determining “better” is frequently a hard question that includes aspects like performance, functionality and development time/cost as well as a significant element of “better over time”. Over time your own requirements will change, your view of the importance to your product of optimizing this layer will change, the underlying technology landscape will change, and the component itself will change (or frequently, not change, which is problematic when layered on top of that changing technology landscape).
OK, that was all just background. Let’s dive into the details of the OneNote work.
OneNote has been morphing for some time into a purely cloud-connected application with data stored in the cloud rather than operating on files stored on local disk. This work was focused on optimizing the OneNote “cache” — the set of files that store a cached copy of cloud notebooks locally to support offline usage and fast and responsive read and write. The original OneNote persistence format stored each section in a separate file (where each section contains a set of pages). A notebook was represented as a folder hierarchy containing section files and other nested folders. A table of contents file stored certain metadata about the folders themselves. OneNote’s original cache format merged all cached notebooks into a single file on disk. A section file is a “database” in the meaning we are using it here (it was designed to be partially read and incrementally rewritten) and the cache file built on top of the core file and object management primitives that were used for reading and writing section files.
The cache rewrite effort had some clear goals and one significant constraint. As I talked about above, OneNote’s cache file was designed to have periodic large-scale cleanup processes run that are not appropriate in a more mobile resource-constrained world. So an important goal was eliminating any need for periodic maintenance so that in practical usage the cache would use less storage and require less memory and CPU (since for many devices that maintenance never actually succeeded in running). Another goal was to address a long-time inefficiency related to OneNote’s revision-based storage model. OneNote supports a “save-less” model where all changes are automatically persisted to storage. In order to do this efficiently, it appends a new revision to the page that only contains the changed content. For heavily edited pages (a small but highly valuable subset), these revision chains could get long, resulting in pages that were slower to load and required more memory and CPU. An important goal was to ensure that the cost of loading a page was in line with the actual complexity of the page presented to the user rather than proportional to its editing history.
The significant constraint was that the work should be limited to the storage layer itself and not require a massive rewrite of the overall application. This ended up driving many of the requirements. When dealing with large, complex systems, a key problem is how to evolve the architecture without just throwing everything up in the air and hoping it lands on time. That approach is a thing of the past in our world of continuous engineering and delivery. In fact, even in the past it was problematic since it deferred many of the trickiest and time-consuming integration tasks and was a source of many a late project.
In OneNote’s case, this incremental approach drove significant requirements:
- The new system should equal or improve on the performance (memory, CPU, storage) of the existing solution (that was really the whole point of the project).
- It needed to support the ability to snapshot pages (while continuing to allow creation of new page versions concurrently). The sync infrastructure, as well as versioning, indexing and other features rely on the ability to cheaply create snapshots of pages so that these features can operate over a page at the same time it might be being edited by the user. Storing multiple versions of a graph that contains dozens to tens of thousands of nodes in an efficient manner turned out to be a tricky endeavor. This requirement is a good example where a complex system comes to depend on some key operation being highly efficient. As a system evolves, maintaining these implicit performance guarantees is hugely important. Word has a similar requirement where it requires that “cloning” a document is very cheap. It is a non-trivial constraint for new features to maintain that characteristic.
- Efficient computation of changes between versions. The sync engine relies on fast change computation to upload user changes to the server.
- Multiple concurrent writers. Sync happens concurrently with editing and requires the ability for multiple writers of the same page.
- Other subtle odd requirements. The existing code required that data could go unreferenced for a short period of time without being deleted. It also required that embedded files be exposed as real files while still maintaining ACID contracts. In practice, one of the main challenges of re-architecting a complex component or layer in a software system is even recognizing these subtle requirements since they are often implicit in the way the component is used rather than an explicit part of the API contract.
The team looked carefully at existing database solutions (remember that old adage about never writing your own database?). Ultimately it was really the list of integration requirements above that drove them to write their own database rather than results of an analysis around raw performance or robustness. These kinds of evaluations are notoriously tricky, since you can really fail in both directions — you can underestimate the cost of stabilizing your own low level code in the wild world of real machines and real usage patterns or you can underestimate the cost of delivering your full solution on top of an existing complex component that has its own design point and constraints, especially if you are not starting from scratch but rather are trying to match those component constraints to the existing constraints of your own application. The fact that the team was building on their established expertise in this space gave them confidence that they had a good understanding of where the bear traps were in building their own.
One of the earliest decisions they made was to split the cache into a separate file per page (with a single master index file). Garbage collecting space allocated to deleted pages and deleted content or closed notebooks had been a major issue with the single file cache design. By splitting out pages, this garbage collection problem was offloaded to the file system, which is highly tuned to deal with it. Additionally, the typical size of a single page is relatively small (although in line with sizes that the file system is tuned for) so that opened opportunities for other simplifications. There were other benefits — it also avoided problems that had arisen with cache files that were exceptionally large (> 4GB). Fortuitously, it also made integrating with Enterprise Data Protection APIs (a new Windows 10 feature) much simpler since each page could be separately encrypted according to the appropriate polices for that content. I said “fortuitously” but in reality this is a good example where having a low-level design that is more in line with the user’s high level model of their content ends up avoiding internal implementation complexity that doesn’t align well with value delivered. This is the kind of “accidental complexity” that can slow down development over time in a large code base.
As alluded to above, this does introduce a separate challenge with multi-file consistency. The team took an interesting approach here. The index file points to the location of the root of the page graph and free space map in each page cache file. When persisting a multi-file transaction, all page cache files are flushed to disk before the index file. Each page file now has two graph roots (and their related free space map) the old graph and the new graph (which typically share many children). Both are fully consistent and available in the file, depending on where you start looking. The index file is then flushed to disk to point to the new graph roots and new free maps. Transactional consistency for creation and deletion of pages requires an additional small twist. A data structure in the index file (flushed to disk as part of the transaction) describes the files to delete and the files created. New files are created in a separate folder and then moved to the main cache folder after the transaction completes. On restart, these data structures are processed to ensure any actions that didn’t fully complete at last run are finalized. Deletion is idempotent (deleting multiple times results in the same result) so is fairly straightforward. Finalizing the creation transaction is also idempotent (any files no longer in the sub-folder have already been successfully moved to the main folder) and files present in the subfolder but not referenced in the creation list are related to an unfinished new page transaction and can be safely deleted (or made available as part of a “lost pages” recovery feature).
To address the problem of long revision chains, they made a fundamental change in the page-level data structures. Pages are still defined as revision chains, but each revision now stores the lookup table (now a BTree) for all the objects the revision references. The old version only referenced changed objects in the revision and needing to walk down the revision chain explicitly to find any objects that were not changed in that revision.
The effect is that the cost of loading a version of a page (especially the current version) now reflects the complexity of the page, rather than its editing history. This change does make the computation of changes between revisions different. As noted above, this is one of the operations that much of the system assumes is fast. The team spent quite a bit of time redesigning the differencing algorithm (including building out an extensive set of unit and component tests) to ensure that it had both the reliability and the performance characteristics that the rest of the system required, especially over a broad set of use cases. This is another good example of investing a lot of effort (and surrounding engineering infrastructure to ensure the property is maintained over time as the product evolves) in one particular feature — compute the difference between two page versions — so that the rest of the system can be written to just assume it is reliable and fast. In fact, by writing the differencing algorithm as a general capability to compare any two revisions rather than assuming a chaining relationship, it opened up the feature for other uses (e.g. user-level compare of page versions that were not related by simple revision chaining but more complex branching).
Fast page snapshot was another feature required of the new implementation. The transactional consistency mechanism that required having multiple instances of the page graph be valid at the same time also could support a robust snapshot mechanism (simply by continuing to hold on to both instances of modified page graphs during the scope of the transaction). By building on the whole cache transaction mechanism it also expanded the capability of the snapshot from a single page in the old system to allowing snapshots of the entire notebook. That is not a mechanism that is currently being used but there are multiple features in early stages of planning that are considering making use of this more fully functional capability.
Support for multiple writers allows both transactions to proceed independently but then are merged into a single transaction at commit time. This depends on leveraging OneNote’s robust merge capability, including the ability to represent conflicting edits directly in the content model. There are a few cases where the content model cannot be merged and in that case the transaction is backed out. Combining write transactions does have the consequence that a long-running transaction could delay other transactions from committing so care needs to be taken about how this mechanism is used by higher level processes (like sync).
One of the key approaches they took to minimizing memory was to use memory mapping for the page cache file and minimize the number of additional data structures that needed to be allocated and maintained in order to represent the page in memory. So this storage layer both requires less memory and less CPU since the storage data structures can be directly referenced without additional allocations and data copying. OneNote typically then builds an in-memory graph from the storage structures for ease of overall manipulation. The team is now looking at more opportunities where certain processing (especially bulk operations like sync and indexing with very specific functional requirements) can be implemented directly on the storage layers rather than requiring a graph to be “hydrated”. Experience shows that optimizations like this can be surprisingly effective.
In the longer term, these changes around flexible threading models and highly efficient processing models sets OneNote up for building on their investment in owning the low-level storage layers. For the OneNote team, it seems clear that at least in this case, that old adage proved false.