Written by: Kumar Venkateswar, Office 365 Senior Product Manager

I’d like to introduce myself. My name is Kumar Venkateswar, and I’m a senior product manager on the Office 365 Technical Product Management team. Some of you may recognize my name from my previous role in the Exchange engineering team, where I worked in both the high availability team and the core storage team.

I’d like to take some time to write about service availability in Office 365. Some of my colleagues have already written about the approach that Microsoft is taking with Exchange Online, SharePoint Online, and Lync Online. Because they’re covering some of the details of those components, I thought I would fill in some of the details that don’t belong to any particular workload and talk a little bit about how we look at the suite from an overall availability perspective.

As a start, let’s look at some of the availability features that distinguish Office 365 from on-premises systems that are based on Exchange Server 2010, SharePoint Server 2010, and Lync Server 2010. This isn’t to say that these features would be impossible to implement in an on-premises environment. Some of our largest enterprise customers have been able to achieve very high availability. Take a look at this case study, which shows an on-premises Exchange deployment with 99.999% availability. However, in general, that level of availability isn’t practical for many of our Office 365 customers to achieve on-premises.
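To put a figure like 99.999% in perspective, here is a minimal sketch that converts an availability percentage into the downtime it allows per year. The function name `allowed_downtime_minutes` is made up for illustration, the calculation assumes a 365-day year, and the 99.9% and 99.99% tiers are included only for comparison; they are not figures from this post.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, implied by an availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    # The unavailable fraction of the year is (1 - availability).
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Illustrative tiers only; 99.999% is the figure from the case study above.
for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} minutes per year")
```

Under these assumptions, 99.999% availability leaves a little over five minutes of downtime in an entire year, which is why so few on-premises deployments reach it.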

Office 365 provides these high availability features:

  • Redundancy at many levels. With redundancy at the disk, NIC, power supply, server, switch, aggregator, egress network, and datacenter levels, and more, we can recover easily from common component failures, and functionality and data remain available after the recovery, even in disaster scenarios.
  • Automated monitoring and recovery systems with 24/7 on-call engineering teams. Dev, Test, PM, Operations, and even Management are standing by to fix anything that the automated systems are not able to handle.
  • Minimal to zero downtime for back-end updates, upgrades, and patches. Our redundant systems let us apply these changes with little or no interruption to the service. In addition, we provide at least 5 business days of advance notice if the engineering teams think there may be downtime. For Exchange Online, we don’t exclude scheduled downtime in our SLA.
  • Resilient Office software stack. Whether it’s a rare server-side outage or a more common WiFi outage (like in the airport I’m typing this in), the software stack that Office uses is resilient to failure. That means your users will continue to be productive online or off, which is something that not all cloud providers offer.

One of the biggest benefits is that all of these features are delivered at the kind of scale most organizations are not able to reach, so the cost is much lower than in on-premises systems.

I’ve heard from customers that there’s some apprehension about the recent outages, which I completely understand. Our team takes every outage seriously. We go through a root cause analysis for every major incident and provide a Post Incident Review (PIR) to explain what happened. We continue to learn from each incident and apply fixes for all our customers, including those who didn’t experience the failure, to prevent the issue from happening to them.

I feel it’s important to note that our record for outages, without adjusting for population, is comparable to that of other popular cloud services. I look at the unadjusted numbers because most of the outage types we have seen don’t vary with population, so there’s no reason to adjust them that way unless you’re trying to make your availability numbers look better.

That said, we will continue to improve. Office 365 is starting out near the front of the cloud services pack, and it is engineered for continuous improvement, both in the software itself and in the processes around it. For instance, the recovery time objective (RTO) that is quoted in the service description is not wishful thinking; it’s the recovery time that we believe we will meet based on quarterly tests of disaster recovery. That way, we provide the predictability that you need to set end user expectations, which is a little different from some of our competitors who don’t believe in that transparency.

The process doesn’t stop with testing, either. If we do have an outage, we provide service credits, counting every single user impacted. In addition, our senior executives, including Steve Ballmer, are informed of outages, and after the problem is corrected by the on-call engineer who was paged, there are post-incident reviews. All of these steps ensure that everyone at Microsoft, from the leaf node engineer improving software so she doesn’t get paged in the middle of the night to the C-suite executive who wants to improve the quality of service delivery, helps make sure that you’re able to meet your end user availability goals. That’s the availability promise of Office 365, and it’s what we deliver to you.