Rajesh Jha, Corporate Vice President, Office 365 Engineering
I lead the engineering organization that builds, operates and supports our Office 365 service. A core principle of running Office 365 is service reliability and we take this extremely seriously.
On Monday and Tuesday of this week, some of our Office 365 customers hosted in our North America datacenters experienced unrelated service issues with our Lync Online and Exchange Online services. First, I want to apologize on behalf of the Office 365 team for the impact and inconvenience this has caused. Email and real-time communications are critical to your business, and my team and I fully recognize our accountability and responsibility as your partner and service provider.
I’d also like to take this opportunity to summarize the two unrelated issues that some of our customers hosted in North America experienced over the last couple of days:
On June 23rd, 2014, a service issue occurred in the Lync Online service, preventing some users from logging into Lync. On June 24th, 2014, an unrelated Exchange Online issue occurred that resulted in prolonged delays for external email (messages entering or leaving the company) for some customers. For a small subset of customers, email couldn’t be accessed at all. During the Exchange Online issue, we also experienced a problem with our Service Health Dashboard (SHD) publishing process, which meant that not all impacted customers were notified in a timely way. We realize this was frustrating, and the publishing problem has since been addressed.
We have a full understanding of the issues, and the root causes of both the Exchange Online and Lync Online issues have already been fixed. You can expect a Post-Incident Report (PIR) in your SHD, which will contain a detailed analysis of what happened, how we responded and how we will prevent similar issues in the future. I’d also like to provide some technical details on what happened:
In the case of the Lync Online issue, we saw a brief loss of client connectivity in our North America datacenters due to external network failures. Even though connectivity was restored within minutes, the ensuing traffic spike overloaded several network elements, leaving some of our customers unable to access Lync functionality for an extended period.
In the case of the Exchange Online issue, the trigger was an intermittent failure in a directory role that caused a directory partition to stop responding to authentication requests. This caused a small set of customers to lose email access. Given the unique nature of this specific failure, the recovery time was prolonged, but the impact was still contained to a small set of customers. Unfortunately, this failure also exposed a previously unknown code flaw in the broader mail delivery system, which led to mail flow delays for a larger set of customers. Our recovery strategy was two-pronged: 1) we partitioned the mail delivery system away from the failed directory partition, and 2) we directly addressed the root cause of the directory partition failure. In addition to fixing the root cause trigger, we are working on further layers of hardening for this pattern.
While we have fixed the root causes of the issues, we will learn from this experience and continue improving our proactive monitoring, prevention, recovery and defense-in-depth systems. I appreciate the trust you have placed in our service. My team and I are committed to continuously earning and maintaining your trust every day. Once again, I apologize for the recent service issues.