Dear OneLogin Customer,
I wanted to reach out to you personally and sincerely apologize for Monday’s service disruption. I realize that OneLogin is an integral part of your infrastructure and that any outage significantly impacts your ability to operate.
The outage on Monday was caused by a combination of two factors: customers running old AD Connectors, and inefficiencies in how OneLogin applies large numbers of mappings to users when those users are updated. Isolating the root cause took longer than expected. Only after shutting down subsystems one by one and restarting the database were we able to isolate the issue. We have updated the incident report accordingly.
A consistent message in the feedback we have received is that we could have communicated much better during the outage. We completely agree, and to improve communication about our operational status in the future, we are launching these initiatives:
- We are implementing a third-party status page tool where customers can subscribe to updates and see the current status of our service’s various components. This should be in place in the next couple of days.
- OneLogin’s current Service Notification Page is visible only to logged-in users, which prevented some customers from seeing updates. During a service disruption, these updates will be available on the new status page, which operates independently of the OneLogin service.
On the technical side, we have started initiatives to prevent a recurrence:
- Updating users in accounts with many mappings can take several seconds in some cases. We have started optimizing that part of the code to run more efficiently.
- A few hundred of our customers have older versions of the AD Connector, which update users too frequently, in some cases generating four times as much traffic as necessary. We have been working with customers to upgrade to the most recent version of the AD Connector, which does not have this issue, and we will continue to do so in the coming weeks.
Everyone here at OneLogin is extremely committed to ensuring that we don’t have a recurrence of this week’s events and that we earn back the trust you place in us. In 2014, our average monthly uptime was 99.99%, and that will continue to be what we strive for despite this week’s setback.
Our commitment to earn back your trust starts with me. If you have any questions or feedback, feel free to reach out to me directly at firstname.lastname@example.org.