Office 365 — Complexity and Strategy

Terry Crowley
Apr 14, 2017

When you look at Microsoft’s strategy with Office 365, it’s relatively easy to relate it to the original bundling strategy Microsoft took with Office — initially bringing Word, Excel and PowerPoint together in one easy-to-buy package and then adding more applications over time — Outlook, OneNote, Access, and others. Version 1 of the Office bundle was essentially just a couple of boxes shrink-wrapped together. Over time, large investments went into making that marketing strategy an engineering reality. Shared consistent features around user experience, command structure, customization and extensibility support, copy and paste and other content interoperability, setup and administration, localization and internationalization, open and save experience — these were all significant investments that made a set of independent programs into a true suite. Some of these features were great for users and some were targeted at buyers — IT or business decision makers in enterprises.

Office 365 is going through a similar transition. Office 365 is made up of three key service workloads — Exchange for email, SharePoint/OneDrive for document collaboration and Skype for communications, as well as rights to the client applications on all devices. The clients are a key part of the offering. Large changes were made in engineering systems and architecture to move the clients to a rapid service-like delivery cycle and to provide consistent and robust cross-platform support. The clients are now a projection of the service onto most devices. Some of the “intelligent” features you are seeing show up in the clients are examples of both that faster cadence and that service projection.

Other workloads are being added, like Planner for task management and Teams for chat. When looking at the transition that is happening, it is easy to focus on the surface features around consistent user experience. There is work going on around providing a common navigation experience (so within the web interface you can reach any of the applications from any other) as well as other common features like notifications and user settings. Consistent social features like @mentions, #hashtags and simple social gestures like commenting and “likes” are showing up in all the workloads. These are tied to common backend services like directory lookup, search indexing by hashtag or consistent notification aggregation so they can be unified across workloads. Underlying all this is a consistent user identity and group model. (Actually, consistency on the group model is really a work in progress.)

It is easy to focus on these high-level features and miss the depth of work that is going on. The point of this post is not to cheerlead that work but rather to look at it through the lens of the ideas I discussed in Complexity and Strategy and try to anticipate some of the implications — for Microsoft and for competitors. I will note that none of this information is confidential — Microsoft touts this work in conferences and to enterprises but I think it is under-appreciated more widely.

The business and technical strategy starts with physical concrete and corporate structure. In order to deliver on requirements of data sovereignty and local control, Microsoft has taken an approach to datacenter construction that allows it to build smaller datacenters and, for some locales (currently China and Germany), to have an independent local corporate entity actually run the datacenter and gate access to the data. This protects against some of the legal risk as well as meeting data sovereignty requirements (e.g. the recent case where the US government tried to force Microsoft to turn over data owned by a non-US company stored in an Irish datacenter — Microsoft won that case but it shows some of the dangers). Datacenter size impacts the technical architecture for features like whole-datacenter failover as well as the ability to engineer for lower latency by keeping a user’s data close to where they are located geographically. These issues of failure domain size have large implications for being able to meet rigorous availability service level agreements. Microsoft is continuing to build out its “edge” system with even smaller instances that allow it to do local authentication with minimal latency and quickly route workload-specific traffic to the appropriate datacenter for that user and tenant with minimal intermediate hops.
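To make the edge idea concrete, here is a purely illustrative sketch (not Microsoft’s actual implementation; the tenant map, region names and helper functions are all hypothetical) of what “authenticate locally at the edge, then forward to the tenant’s home datacenter for that workload” might look like:

```typescript
// Hypothetical sketch of edge routing: validate the caller locally, then
// forward the request to the datacenter that holds that tenant's data for
// the requested workload, avoiding extra intermediate hops.

type Workload = "exchange" | "sharepoint" | "skype";

// Hypothetical directory mapping tenants to a home datacenter per workload;
// in practice something like this would be replicated out to every edge node.
const tenantHomes: Record<string, Partial<Record<Workload, string>>> = {
  "contoso.com": { exchange: "eu-west-dc2", sharepoint: "eu-west-dc1" },
};

function validateTokenLocally(token: string): boolean {
  // Stand-in for checking a token signature against cached signing keys so
  // that authentication does not require a round trip to a central service.
  return token.length > 0;
}

function routeRequest(tenant: string, workload: Workload, token: string): string {
  if (!validateTokenLocally(token)) {
    throw new Error("authentication failed at edge");
  }
  const home = tenantHomes[tenant]?.[workload];
  if (!home) {
    throw new Error(`no home datacenter known for ${tenant}/${workload}`);
  }
  return home; // the edge then proxies the request on to this datacenter
}

console.log(routeRequest("contoso.com", "exchange", "example-token")); // "eu-west-dc2"
```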

Office 365 and Azure worked hard to have one consistent strategy from the datacenter up so that Azure investments are directly leveraged by Office 365. That includes ensuring that when governments and enterprises have specific standards around data sovereignty, security, privacy, auditing and operational practices (e.g. how secret keys are managed, data encryption at rest, how administrative access is granted or logged, background checks on personnel, etc.), Azure hardware, operations and base services can meet those standards as well. This ensures O365 services can build on Azure services and deliver on these and stricter compliance promises. For many of these features, you are only as strong as your weakest link — as easy as it is to tie services together, it is also easy to lose support for many of these compliance features.
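To give a flavor of what just one of those standards, data encryption at rest with managed keys, means at the code level, here is a minimal, generic envelope-encryption sketch using Node’s built-in crypto module. This is the textbook pattern, not Azure’s or Office 365’s actual design, and wrapKeyWithKms / unwrapKeyWithKms are hypothetical stand-ins for calls to a key-management service:

```typescript
import { randomBytes, createCipheriv, createDecipheriv } from "crypto";

// Generic envelope-encryption sketch: encrypt each record with a fresh data
// key, then protect ("wrap") that data key with a master key held by a
// key-management service. Only the wrapped key is stored next to the data.

// Hypothetical stand-ins for a key-management service; a real service exposes
// wrap/unwrap operations over a master key it never releases.
function wrapKeyWithKms(dataKey: Buffer): Buffer { return Buffer.from(dataKey); }
function unwrapKeyWithKms(wrapped: Buffer): Buffer { return Buffer.from(wrapped); }

interface EncryptedRecord {
  wrappedKey: Buffer; // per-record data key, protected by the KMS master key
  iv: Buffer;
  authTag: Buffer;
  ciphertext: Buffer;
}

function encryptAtRest(plaintext: Buffer): EncryptedRecord {
  const dataKey = randomBytes(32); // fresh AES-256 key for this record
  const iv = randomBytes(12);      // GCM nonce
  const cipher = createCipheriv("aes-256-gcm", dataKey, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return { wrappedKey: wrapKeyWithKms(dataKey), iv, authTag: cipher.getAuthTag(), ciphertext };
}

function decryptAtRest(record: EncryptedRecord): Buffer {
  const dataKey = unwrapKeyWithKms(record.wrappedKey);
  const decipher = createDecipheriv("aes-256-gcm", dataKey, record.iv);
  decipher.setAuthTag(record.authTag);
  return Buffer.concat([decipher.update(record.ciphertext), decipher.final()]);
}

const record = encryptAtRest(Buffer.from("quarterly forecast"));
console.log(decryptAtRest(record).toString()); // "quarterly forecast"
```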

The joint strategy between divisions might seem obvious, but in fact it is always tremendously difficult to have different large organizations with different perspectives on success and competition agree on very basic hard design issues that have long-term implications. I lived this for two decades of Office, Windows and Developer Division cooperation — or attempts at cooperation. That cartoon that showed the different divisions pointing guns at each other was not exactly right but it was not exactly wrong either. Getting agreement between O365 and Azure was a big step — one example of the advantage of having an engineer for a CEO.

There is a deep set of these “compliance” features. Some of them are completely under the covers (e.g. the routing support) and some are explicit real features — for example, support for legal hold and discovery across all workloads, data recovery standards, reviewable audit logs or data loss prevention features. At the time I left, there was a spreadsheet of service workload requirements with 1007 rows (I am virtually certain it is longer now — they never get shorter) describing the features and operational requirements expected of each workload. Some of those requirements were met “automatically” by building on available infrastructure while many required explicit workload-specific investment to implement and then had significant ongoing operational and design implications for each team. Not all workloads supported all these features (especially the ones that were clearly “features” rather than operational requirements around security, privacy, availability and performance). There was a consistent push to move all workloads upwards in their level of support.

In the longer term, the shared data layer is the fabric that will support features driven by machine intelligence operating over these large data sets. “Data is the new oil” is the term used to describe features enabled by having consistent access to these large data sets. To date this is mostly aspirational but is an additional large motivation for driving consistency across workloads here.

For an external developer sitting here deploying some NodeJS service to an AWS Elastic Beanstalk instance connecting to an AWS S3 storage layer, both the features and architectural issues I am discussing here are completely foreign. There is a layer of complexity that is completely absent from 99% of cloud services being built.
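For a sense of the contrast, the entire “storage layer” for that kind of service can be a handful of lines. A minimal sketch, assuming the AWS SDK v3 for JavaScript and a hypothetical bucket name, with no tenant geography, legal hold, audit logging or key-management decisions anywhere in sight:

```typescript
import { S3Client, PutObjectCommand, GetObjectCommand } from "@aws-sdk/client-s3";

// Minimal persistence for a typical small cloud service: one region, one
// bucket, no compliance machinery. Bucket name and region are hypothetical.
const s3 = new S3Client({ region: "us-east-1" });
const BUCKET = "example-notes-app";

export async function saveDocument(id: string, body: string): Promise<void> {
  await s3.send(new PutObjectCommand({ Bucket: BUCKET, Key: id, Body: body }));
}

export async function loadDocument(id: string): Promise<string> {
  const result = await s3.send(new GetObjectCommand({ Bucket: BUCKET, Key: id }));
  // In SDK v3 the response body is a stream; transformToString() buffers it.
  return await result.Body!.transformToString();
}
```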

What are the implications of this complexity?

To the extent the features provided by this complex layer are valued, this complexity represents a moat. It is hard to overstate how much work and investment is going on here. The architectural decisions to support these complex features are driving multi-billion dollar datacenter design decisions. The hands-off corporate structure I described above requires real engineering support to make operations efficient in this model. Decisions to take dependencies on lower levels of shared support for storage and networking in order to implement these requirements are made early on in workload design and end up having deep implications. It is not something you patch on.

Some of the user interface consistency within the O365 suite can be provided by each workload independently — similar to how most desktop applications independently support the cut/copy/paste metaphor. Other features require deeper integration — for example around identity, groups, permissions, sharing, administration, search and discovery. These heavily leverage shared infrastructure. Compliance features generally differ from most user-facing features both in being less visible and in requiring deeper integration into the overall technical stack.

Perhaps the key question is whether these features are valued. This transition to cloud services is happening in an environment where we have already experienced a generation of enterprise “shelfware”. Software-as-a-service is tightly linked to the trend of consumerization of IT, where design and usability are a primary focus. SaaS sometimes extends this into the enterprise with models based on usage charging rather than blanket per-seat charges independent of actual deployment and use. There is real risk in focusing on features that appeal to the CIO, Chief Security Officer or legal department rather than to the individual end user.

The risks show up in a number of ways. The most obvious risk is opportunity cost. If you are building compliance features, you are not building other user-facing features that could drive usage. A more subtle problem is around building on a deep technical stack that addresses these hard issues around compliance but might compromise on other functional or — more likely — performance requirements. When you get pulled into this kind of dynamic it feels like you have made a “deal with the devil” — you make great progress early on but then find that all ongoing development is more difficult. (This is often the dynamic that causes the distance from demo to product to be so large.)

There is an additional cost in situations where you have a dominant primary client for the infrastructure — an 800-pound gorilla. Exchange serves that role for Office 365 since it currently dominates infrastructure usage. This makes it extremely difficult to make infrastructure changes that are not motivated by (or at the least have no negative consequences for) this primary client. In addition, when changes are made for the primary client (which happens relatively often since much of this layer originated out of the Exchange service), they inevitably have costs and consequences for the other applications. For the Office clients, we used to call this the “Office tax”. It meant that you had to allocate some percentage of your team just to keep up with what was going on in the underlying infrastructure, even when it was not adding value to your application — or at least not value commensurate with the cost.

An additional challenge is that it ends up being difficult to fully measure the costs and benefits. I used to think about this as the “Polish keyboard” problem. When I was the FrontPage dev manager, I used to assign out all incoming bugs. One morning I received a bug from an overseas tester that “FrontPage will not launch using install-on-demand on a Bulgarian system with a Polish keyboard”. I had no interest in install-on-demand (a setup feature focused on corporate environments being built by a shared Office team). I did not even know we ran on Bulgarian systems and had no idea what a Polish keyboard was or why it would have any impact on whether we could run or not. I was completely certain the dev manager for DreamWeaver (our primary competitor) did not have to worry about that bug. I was paying the Office tax.

A further cost that you see play out over the long term is the “consistency tax”. As a suite, you are selling consistency as a primary value proposition. But this means that in order to add a new consistent feature, you need to get resources allocated and design buy-in across the suite. So inevitably costs, especially coordination costs, are higher. Most features have some shared component that can be leveraged, but typically require at least some investment on a per-app basis and that investment might vary significantly across applications. This results in very different perspectives on cost and benefit across the suite that need to get resolved as part of the design and resource allocation process.

A mitigation of these risks is that services have much better telemetry on real usage and deployment. From the CEO and CFO down, there is a clear understanding that market power comes with usage, not with deals signed.

So how will this play out?

Compliance features are a basket of features that are motivated by the value they provide to customers but end up generating a complex interdependency. Some are absolutely required and some are nice to have. Various features become dependent on the underlying shared infrastructure so are not easily pulled apart. I should be clear that this generated complexity is in no way a goal — like any engineering team, the Office 365 team is motivated to make things as simple as possible in order to reduce operational cost, design in robustness and enable future evolution. But the natural dynamic of looking for sharable components to build on and then leveraging them for new functionality results in a complex layer with multiple consumers pulling in multiple directions.

It can always be difficult to tease apart all the reasons customers make some particular choice and in fact there typically is not one single reason. Will a customer go with Teams instead of Slack or Planner instead of Trello because of purchase advantages, user experience consistency, administrative consistency or these other deeper compliance reasons built on a deep set of shared design choices? Or will Microsoft struggle to compete on core usability in these other categories because of constraints and resource allocation decisions made to build and maintain this deep shared stack and see these usability issues dominate purchase decisions?

To the extent the suite strategy, and in particular the strategy of building out a deep shared compliance layer from the datacenter on up, is effective, the obvious advice for Microsoft is “don’t overthink that strategy”. That is, continue the laser focus on moving enterprises over to the primary Exchange workload in Office 365 (last reported as 85 million active users and growing rapidly) since that forms the concrete foundation for every other part of the strategy. Continue the work to move the other primary workloads, SharePoint and Skype, over to the cloud. Bet that the suite advantage, especially compliance broadly, is the sustainable differentiation for other categories such as Teams and Planner and focus on the basic usability of their core functionality rather than thrashing on a per-category differentiation strategy. This seems obvious externally but can sometimes be surprisingly hard to recognize internally since much of this shared functionality is seen as overhead and “tax” (as I described above) by the team building the category app.

This complex layer does make some things harder for category apps. Just like the negotiation advice of “don’t argue past yes”, the suite advice is “don’t build past yes”. Once customers are buying based on suite advantages, Microsoft needs to balance the costs this complex infrastructure places on category teams in order to let them focus on and prioritize their core functionality and usability.

The level of complex infrastructure necessary to deliver compliance consistently across workloads is a much bigger barrier than most people appreciate. Some characteristics of services make them easier to integrate than client-side applications, which might lead to the assumption that per-category dynamics rather than suite dynamics will dominate this next generation of productivity competition. But the set of features under the compliance umbrella is definitely not easy to build with loosely coupled services. To the extent these features drive purchase decisions, they will drive the market towards suites.

How should a competitor respond?

Looking at that complex layer, a competitor could view it optimistically in a few ways. The first is to believe that Office 365 is essentially “over-building” this layer of functionality for its most demanding clients (financial, government) and that this will then constrain its ability to compete effectively on the core user-facing functionality which drives purchase decisions for smaller, less-demanding customers. That is, accept that addressing these requirements really does end up being hard and simply stay away from customers with those high demands. This is obviously a risky long-term path since these demanding customers are also some of the best-paying ones. A freemium business strategy depends on the ability to monetize those well-paying customers.

Another argument could be made that Exchange and SharePoint are carrying complex on-premises requirements over into their cloud instantiations. The argument is that “born-in-the-cloud” applications have less complex legacy requirements and can leverage this to deliver some of these features using simpler approaches that do not place as many constraints on future development. This only makes sense if you can really identify the concrete features that might hold them back. For a time it appeared that SharePoint’s long history as a business process platform was hampering their ability to compete effectively on basic file sync and share. With upgrades to their core sync infrastructure on client and service over the last year, they seem to be digging out of that hole. While I was inside Microsoft, I used the cautionary tale of Lotus Notes’ competition with Exchange. Notes was born as a flexible app-building platform with email as essentially one “app”. As email became the dominant workload, this lack of architectural focus made the flexibility and extensibility of Notes more of a negative than a positive in their ability to optimize for email. SharePoint’s flexibility in comparison to some of the other file-sharing platforms presented a similar risk. A key driver of the improvements over the past couple of years has been building out comprehensive telemetry on what the actual user experience with the system was. This enabled a much clearer focus on driving the overall quality of the experience.

Going head-to-head in a suite-based competition will be extremely difficult given Exchange’s momentum — email continues to be an “anchor tenant” in decisions about suites and directories. I may be old school, but I do not see the chat platform choice driving the same momentum. How Microsoft Teams competes over the next year (especially in actual usage) will be exceptionally revealing about the longer-term evolution of the productivity market.
