I have been trying for a while to come up with a way of organizing the most critical insights and intuitions I end up using when I construct new systems or analyze existing ones.
The “design pattern” community categorizes and enumerates 100’s of patterns but I have never found those collections especially useful. I never go looking for software pattern #217 to help me solve some problem or understand why a system I am looking at was built the way it was.
As I thought about it, the most useful intuitions are a small enough set that I have internalized so well that they come to mind immediately as I think about the design problem I am trying to solve or system I am trying to understand. Many of these have a basis in “the physical reality of computing” and that seemed a good framing for talking about them.
Computing is a physical process. We think of the digital world as distinct from the physical world — an alternative reality. But computing takes place in the physical world. The core elements of computing always come down to physical elements of memory, computation that reads and writes that memory and communication between physical computing nodes.
Memory state is represented with real electrons, magnetic charge or other physical state that requires real space. Computation requires time, space and energy to access and alter this physical state. Communication between computing nodes requires time and energy to transmit photons and electrons across physical distance. Any of these physical processes are susceptible to failures although failures of communication are most common.
This all might seem trivially obvious. But I’ve found valuable insights when I return to this physical reality as I’ve designed new systems or worked to understand existing ones.
The primary insight is the most obvious: this is all there is. Just like a chemical system always comes down to how a limited set of atomic elements bind together, any computing system always comes down to how memory, computation over that memory and communication are combined.
Some software layer or component might be incredibly complex and difficult to understand. It is exactly this complexity that makes a simpler perspective so powerful. The physical reality is a tool that lets you look at some big pile of code and ask “what is it really doing?” — and that must ultimately come down to the primitives of touching memory, computation and communication.
I want to run with this idea a bit but let me first give a concrete example from my career where that perspective could have been useful.
Microsoft Outlook and Exchange had a long-running “odd couple” pattern of interaction, despite being critically dependent on each other’s product and business success. Part of this was due to the typical front-end/back-end tension where the back-end developers are heavily focused on the “-abilities” — especially scalability, reliability, and manageability. The front-end developers are heavily focused on features, features, features. This was made worse because Outlook was paying a “strategy tax” and building on the re-purposed Cairo database APIs from Windows 95. Cairo was the abandoned once and future file system database that was originally supposed to ship in Windows 95 (or whatever it would have been called if it shipped at the target date of 1993).
These APIs completely obscured what was actually happening under the covers — what was the pattern of communication, which requests and responses were batched together and which APIs would launch a request to the server immediately or just queue it locally. The Outlook team initially assumed the API layer had to be doing caching and batching but ultimately needed to jack up the entire foundation of their application and insert their own caching layer once they discovered nothing of the sort was happening. Exchange complained that Outlook’s behavior was like an attack from a swarm of bees but Exchange was responsible for this very opaque API layer.
Much of the confusion was because the Outlook team programmed against an API layer that they hoped would behave like they wanted rather than really understanding how this most basic layer of communication and caching behaved under the covers. The point of the API was to allow the development of these rich graphical applications but (initially) ignored the key design point that application and server performance is driven by a full-stack understanding of how data needs to be clustered in memory and over the wire to deliver the intended user experience. It took a lot of time and a lot of smart engineers to tune this system. And all the time the underlying parameters of hardware capability and load requirements were changing, further obscuring the key issues.
To be fair to the Outlook and Exchange teams, while almost every application follows these patterns of communication today, they were blazing a trail and were continually pushing the edges of what was possible on the hardware of the time.
I have lots more examples but they tend to have this flavor — they play out over long periods and there is lots of environmental noise that obscures the core performance behavior that really should be driving the discussion. Bringing it back to the basics of memory, computation and communication can help clarify everyone’s understanding of the core issues.
OK, let’s jump up a bit.
Because we can really break it down to these three core elements, it is change in these elements that really drive new system designs, capabilities and opportunities. For memory, we care about capacity — how many bytes can I store?, latency — how fast can I get a byte once I decide I need it?, and bandwidth — how many bytes can I retrieve and process per unit of time?
For computation, since the late nineties, the speed of processing is really driven by how reliably I can use that memory bandwidth. Instruction pipelining, vector processing, prefetching, multi-level caches, even multiple threads of execution — all these processor design elements are focused on actually leveraging the memory bandwidth. High performance programming techniques are designed around having predictable memory behavior with high locality so that memory bandwidth can be optimized.
We often think about GPUs as focused on delivering performance because of the parallel processing of multiple cores. But the really key design point is around providing the memory bandwidth to serve those cores. This interview with Jensen Huang of Nvidia has a great discussion of designing their GPU around memory bandwidth.
Change in communication performance has been one of the most impactful elements driving new design points for system and application design over the last 40 years. As with memory, for communication it is latency and bandwidth and changes in latency and bandwidth that drive new design opportunities.
I first heard one of the most basic system design issues around the changing tradeoffs here framed by Turing Award winner Jim Gray. “Do you bring the data to the computation or the computation to the data”? Given our basic primitives of memory (data), computation and communications, it seems obvious that this has to be one of our most basic design choices — and one that can change as the relative performance characteristics of each change.
Bringing the computation to the data reduces the latency and bandwidth constraints of transmitting that data across a communication channel. Features like flexible stored procedures in our database layer as well as sophisticated query capabilities are some of the tools providing the mechanisms to move processing closer to the data. In the early days of diskless workstations, I would do this in an ad-hoc way by explicitly running a remote shell directly on the file server to do the final IO intensive link step when building our large C++ application.
Well that seems obvious. Why wouldn’t I want to move the computation closer to the data? There are a variety of challenges. The data tier is trying to manage its use of local memory and computational resources to optimize access to the underlying stable storage. Allowing and managing arbitrary computation at this layer is quite hard — any programmer can fall out of bed and design an O(N²) algorithm that could tank the performance of this shared resource — and allowing arbitrary computation also generally means that requests will have much wider variance in lifetime and memory requirements so it becomes harder to provide fair, efficient service that optimizes access to the key resource being managed. Moving this computation off the data tier makes the job of that data tier much easier as well as providing more flexibility when performing computation on that data — including for example decisions on how to use memory resources local to the computation for caching. We often see only limited capabilities at the data tier like filtering and aggregations that can provide significant communications benefit but with very predictable resource demands.
This pressure to move the computation closer to the data and then counter-vailing pressures due to complex resource management challenges means that this design decision can swing back and forth as underlying technology and tradeoffs change. We see this when new capabilities appear (e.g. “serverless” functionality like AWS Lambda) as well as changing design decisions within the apps built on these layers.
There are other valuable insights that come from thinking about computing as a physical process. The fact that speed of light is a fixed constraint and we are limited by three physical dimensions means that we will always have some memory that is closer and some that is farther from our computation. We will always have memory hierarchies. Some memory is closer and faster and some is farther away and slower. There is always more space farther away (growing with the radius cubed) so that slower memory can always, at least potentially, have much larger capacity than the faster, closer memory. (Obviously the technologies in use and the costs for any specific memory layer also play a big role in performance and capacity variance here.)
This inherent hierarchical structure means that caching is always a fundamental performance approach. Keeping some piece of data that you are going to access again closer — available faster — is always a key way of speeding up some processing. Exactly how to cache correctly and effectively ends up being a fascinating and complicated question for any real system, often very sensitive to the characteristics of the hardware, the particular problem being solved and the shape of the load on the system.
Another form of caching is to take the result of some computation — essentially some walk over memory compressed into a smaller form — and storing that for later access. Compression is often an important part of any caching strategy since a particular application may only need some subset of a dataset and that dataset is frequently not stored in a form optimized for a specific application. Compression into a small, fixed-size hash, checksum or timestamp is a very useful special case that often provides a quick way of testing for equality between arbitrarily large datasets.
The principle of locality — keeping things that are accessed together in time close together in space — is another key organizing concept for computer systems that matches a strong physical intuition.
Locality is critical to almost all performance work. A carpenter or a cook placing their tools and materials understands this. For a programmer, keeping a computation limited to L1 memory. Designing a data model so all the data needed to respond to a request is clustered together in one block of memory. Buffering data together so it can be written as a single block to storage. Packing a request into a single packet to send in a request to the server.
In fact, the vast increase in memory capacity at every level of system design makes thinking about locality that much more important. It is easy to ignore this — our favorite object-oriented programming tools almost beg us to ignore it. But locality is important for very real physical reasons that have not changed — that will not change — because they are anchored in these physical realities.
A lot of application design work is spent de-normalizing the data model (trading off using more storage capacity) to improve locality for the specific types of requests the application makes.
Latency and Bandwidth
Latency and bandwidth have direct physical meaning and correlates in the real world. Perhaps more interesting is intuition around how latency and bandwidth change. Latency is hard to change — we would probably be surprised if we heard someone had managed to transmit a fiber-optic signal coast-to-coast with half the latency. We would not be surprised if we heard that the bandwidth had doubled — perhaps because a new fiber optic line was lit up right next to the old. This plays out at every level of system architecture — bandwidth universally improves faster than latency. This means that designs that can leverage bandwidth improvements while being robust to much slower latency improvements end up evolving better over time. These end up generally using asynchronous techniques, batching requests to generate bigger responses that can leverage those bandwidth improvements and make smart use of caching.
Because latency is hard to change, when it does change significantly you often see big opportunities in new system designs and tradeoffs. Sometimes we see it because of a change in technology — e.g. moving from rotating magnetic hard disks to solid-state drives or changes in the datacenter networking interconnect. More common is when memory capacity increases so much at one level of the memory hierarchy that we can use that layer as a cache that satisfies most requests for some piece of data. In-memory databases or distributed in-memory caches like redis or memcached took advantage of exploding dynamic RAM capacities and enabled new designs and approaches because results could be returned with memory latency speeds rather than with the latency of stable storage.
These new design points also show up when growing memory capacity intersects with something about the real world — I can now hold all the map tiles for the US in server memory or can store 18 pictures every time I click the camera so taking a picture is really taking a video. This last example wasn’t because of a radical new programming technique — it was because some improvement in the basic layer of memory bandwidth and capacity intersected with the real world.
If 5G actually results in significant end-to-end latency improvements (a business question as much as a technical question), it will create surprising new opportunities for system designs.
Because computing is embedded in the real world of three dimensions and the limitations of the speed of light, everything takes time. Asynchrony is the natural way of modeling this. You make a request and a response arrives some time later. Treating something as synchronous involves hidden overhead (implicitly holding onto and managing resources while waiting). It means you are fighting the real behavior of the system and typically less responsive over time to the constantly changing ratios of performance in different layers of the system. That synchronous “wait” is getting longer and longer relative to the other changing dimensions of system performance and yet is essentially hidden behind a single line of code in your program.
Asynchronous request and response is effectively what is going on at every layer of the system (even if hidden from you as the programmer). This also means it makes sense to have a good grounding in the core challenge of this structure of interaction. One often focuses on the success case, but it is really how communication fails that drives most designs. From the point of view of the client, the forms of failure are relatively simple. You can fail because the request was never received by the server. The request can be received but processing it fails. Or you can fail to receive the response after the request is processed successfully. That’s it. The challenge is, you don’t know which of those three forms of failure occurred. In fact, you essentially have to wait forever to know that any failure has occurred. Timeouts are the only fundamental way to recognize failure. They are either implemented by you or are implemented by some layer underneath you. This also means retries are the fundamental recovery strategy.
This fundamental structure means you want to keep timeouts short and be able to retry aggressively. It’s hard to have short timeouts if a request has lots of variance in timing. The high variance means you can’t tell if a request has failed or if it is just taking a long time. You also can’t retry aggressively if requests are expensive to make — either for the sender or receiver (or anything in the middle for more complex end to end systems). You can only keep timeouts short by keeping latency low and predictable. If you keep retries cheap, you allow more flexibility in how to apply them.
This basic pattern means you need to design your communication semantics around failure. You don’t have a request like “pay Judy $10”. If you have to retry that, Judy could end up getting paid over and over. You say “pay invoice 17”. If you retry that, your “server” can recognize that invoice 17 is already paid. (This is idempotence.) You also don’t have to wait as long before retrying since the consequences of requesting again are low. You can also ask the question “was invoice 17 paid?” — that is, you have a way to locally reconstruct the state on the other side of the connection even if you never received a response because of one of these three forms of failure. Most well-designed communication looks like this. The request can be safely retried and the success or failure of the request can be discovered if the immediate response is never received.
Batching and Chunking
Anyone who has packed a moving truck understands the value of batching and chunking. You batch by packing a bunch of small things together into a single consistent sized box that can be efficiently and easily processed — thrown on a hand-cart with a bunch of other boxes and then neatly packed into the truck. You chunk as you break down some big piece of furniture and divide it into easy to move pieces. It is those awkward big pieces that can’t be chunked that cause all the trouble. Digital systems benefit by being able to transform virtually everything into a stream of bits that can be uniformly batched or chunked as necessary. Chunking and batching is happening at every layer of our systems even if an API hides this.
Batching is so effective not only because it simplifies processing to have a consistent size of request to manage but also because most systems have a significant fixed cost to do any work at all. To put it another way, the marginal cost of stuffing an additional request into a batch is low. Batching a bunch of requests together often results in a major end-to-end performance win. An amazing amount of performance work is spent looking for opportunities to batch work and eliminate that fixed overhead of independent requests. Well-designed systems often are structured to be able to automatically batch when requests backlog. So the more requests backlog, the more they get batched and the lower the cost of processing each request.
Queuing, Congestion and Resource Management
Any asynchronous system has to deal with congestion and resource management. Each node in an asynchronous network is managing some resource and has to face the problem of being asked to do more than it is capable of. Generally each node is trying to optimize by ensuring it is getting lots of work done while also minimizing the latency of any specific request (or in some cases just letting a requestor know quickly that they are not going to be served). I’d rather be told there is no space at the restaurant than cluelessly wait three hours for a table.
I like to use a restaurant example as a real-world physical system that reveals the structure of many of the core challenges. A hostess gates access to the first of the constrained resources, the limited number of seats and tables. The waitstaff takes orders to the kitchen and brings food to the tables. Overall the restaurant would like to maximize throughput as well as reduce latency at each stage (keeping the customers happy). It doesn’t help to have lots of tables but not have enough waitstaff to take the orders. Or to take the orders quickly and then backup in the kitchen. In this case, the capacity of all the stages are managed to minimize explicit queuing and explicit congestion management. The critical dynamic management of the system happens at the front end where customers have to wait to get seated (and “requests” are failed at the front when customers are turned away or walk away).
Queuing that happens is often in the service of batching — so the waitstaff waits for all the drinks to be made before carrying them to the table or the busser waits for all the diners to be finished before clearing the plates in one trip.
In fact, this characteristic that resource/congestion management is somewhat “hidden” because the resources of the various components have been tuned to balance and most control happens at the edges of the system is a very common pattern. Allowing a request to get part way through the system and then wait often means you are holding on to and backing up other resources. If the kitchen gets backed up, you have all those diners filling up your tables but not eating.
For large teams, this hidden characteristic of how the system works can often mean that parts of the team don’t understand a basic principle around how the system is dynamically tuned, which can be dangerous.
Complexity is probably the most critical aspect of system designs that resists physical intuition. Our common experience is with physical objects whose surfaces naturally provide data isolation and hiding. Interactions require actual contact that is visible from inspection. Our digital systems explode in complexity as the state space of the system grows exponentially. A line of code hidden in a file reaches out and touches a global defined elsewhere (or calls a subroutine that does the same thing — no fundamental difference), combining the two state spaces. That combination is multiplicative — combining two 16-state spaces results in 256 states to reason over, not 32. Spaghetti code is vastly more complicated than a plate of spaghetti — it moves and writhes and constantly changes and can’t be easily picked apart or understood.
People are continually fooled by growing complexity, especially when starting work on a new system and projecting initial rapid progress into the future. New systems start small but complexity challenges emerge quickly.
It is exactly this resistance to intuition that drives the need to intentionally keep things simple, isolate complexity and design interactions in a way that does not tightly bind together different components and their states. The primary lesson about complexity is that your intuitions about complexity are poor.
Each of the topics discussed above have entire books (or bookshelves!) written about them. Despite that detail and complexity, having a basic intuition about these common structures and repeating patterns that you can access quickly is important when working on any system problem. Understanding that they are not “just choices” but grounded in fundamental physical realities can help you ask the right questions and get to good answers.