At Deputy, we are fully committed to delivering high standards of service. That includes keeping you up to date on the continuous investments we are making in our platform as we deliver scheduling solutions to more and more workplaces across the globe, every day.
Our latest investment in our platform stems from a significant outage that occurred in January 2021. At Deputy, we take these matters extremely seriously. We understand the knock-on impact for business owners, team managers, and employees alike when our platform is not available. Deputy is a critical investment that you have made in your business: it allows you to track your employees' shifts and ultimately ensures your teams are paid fairly and on time. So when our system goes down, the impact is felt far and wide.
Our mission is to support thriving workplaces in every community, and we believe that trust and transparency underpin any thriving workplace, including our own. We want our loyal customers to understand how important their business is to us, and we want to retain your trust in Deputy as your team scheduling solution.
A transparent recount of the Deputy platform outage
January 25 is a unique and busy workday in Australia, especially for our customers. It’s the day before a public holiday, it’s the end of the month, and this year, it fell on a Monday, when many of our customers export timesheets and run payroll simultaneously.
At 8:39 a.m., our monitoring alerted us that the platform was experiencing slow response times. At the same time, our customer support team started receiving hundreds of customer chats, indicating that our customers were having trouble accessing Deputy.
The company triggered an incident at this time and our incident response team formed to respond to the growing crisis.
Technical investigations began immediately. Our software engineering team hadn’t released any new features that day, and there were no new code or infrastructure changes of the kind that might ordinarily trigger these types of alerts.
Our code continued to pass all of our automated quality tests. Traffic to the login page had grown naturally through the morning, as expected. In short, nothing obvious appeared to be wrong, yet our customers were unable to log in. It didn’t add up.
Meanwhile, symptoms were surfacing. For those who are interested in the technical detail, our elastic scaling kept adding more web servers to try to cope with the increasing load. Digging one level deeper, we could see our databases were under 10 to 20 times the usual load. Digging further, we found that Redis, the in-memory data store we use as a cache to drive high performance, was showing abnormally high utilisation. It was at this point that we confirmed Redis was the single point of failure: our scalable databases and elastic web servers were all waiting on our single Redis instance, resulting in a cascading failure. The following charts detail our discovery of this single point of failure.
Traffic pattern on the morning of January 25. Traffic followed normal patterns.
Behind the scenes, each web server started seeing significant over-utilisation (requests per instance).
Meanwhile, the databases were seeing significant load (connections per database).
The root cause: Redis CPU started showing signs of over-utilisation after 8:30 a.m., which in turn caused database and web server utilisation to hit an unsustainable peak (utilisation % of Redis).
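For readers who like to see the idea in code, here is a minimal sketch of why adding web servers could not help once a single shared store was saturated. The function name and all of the numbers are hypothetical and chosen only for illustration; they are not measurements from the incident.

```python
def effective_throughput(web_servers: int, per_server_rps: float, shared_store_rps: float) -> float:
    """Requests per second the system can complete when every request
    must also pass through one shared store (the bottleneck)."""
    web_capacity = web_servers * per_server_rps
    return min(web_capacity, shared_store_rps)


# Hypothetical figures: each web server handles 100 requests/second,
# while the single shared store tops out at 1,500 requests/second.
for servers in (10, 20, 40, 80):
    print(servers, effective_throughput(servers, per_server_rps=100, shared_store_rps=1500))
```

In this simplified model, throughput stops improving as soon as the web tier can offer more work than the shared store can absorb. Every server added after that point only adds more waiting requests.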
By midday, we had provisioned a new version of Redis and effectively restarted all of our processes and systems, and by 2:30 p.m., Deputy was again accessible to our customers. However, the Redis risk remained: provisioning a new version was only a patch fix. In fact, the problem recurred in a smaller, more controlled way on a few occasions over the following weeks, including on 29 March 2021, when our customers experienced short intervals of being unable to log in to Deputy.
What did this outage teach us?
Firstly, that we never wanted to put our customers through this situation again. Secondly, as our customer base had grown, we had reached natural limiting factors within our technology infrastructure. This growing pain had created a single failure point, resulting in outages for our valued customers, at a critical time when their employees were due to be paid.
During this incident, it quickly became apparent that we needed a more scalable solution to reduce the chance of this occurring again. To share an analogy, we had one cash register for a very, very busy supermarket. Even as the supermarket reached peak capacity, we still had one cash register. With the new architecture in place, we can keep opening cash registers to serve our customers as demand grows.
Our team went to work, consulting with our AWS enterprise architecture team and working nights and weekends to develop a scalable, distributed Redis, with the mantra in mind that we did not want to let our customers down again.
With the new architecture in place, our infrastructure is now more trustworthy than ever. We now have 10 times the Redis clusters to effectively spread and orchestrate workloads, and we continue to add new clusters as our customer base grows. In short, our infrastructure now reflects our requirements for today’s customers and future-proofs us for our growth ambitions.
This incident was a key catalyst for doubling down on the journey we’ve been on to improve system resilience and to systematically remove any single points of failure that may exist as our customer utilisation expands.
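To illustrate the general idea of spreading workloads across several clusters, here is a minimal sketch of hash-based key routing. It is not a description of our production routing: the function name, the key format, and the cluster count of 10 are assumptions made for the example.

```python
import hashlib
from collections import Counter


def pick_cluster(key: str, cluster_count: int) -> int:
    """Map a key to a cluster index using a stable hash, so the same key
    always lands on the same cluster."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cluster_count


CLUSTER_COUNT = 10  # illustrative value only

# Hypothetical session keys, used to show how load spreads out.
keys = [f"session:{i}" for i in range(10_000)]
load = Counter(pick_cluster(k, CLUSTER_COUNT) for k in keys)

for cluster, count in sorted(load.items()):
    print(f"cluster {cluster}: {count} keys")
```

With a scheme like this, no single cluster sees all of the traffic, so one overloaded node no longer stalls every request. Simple modulo routing reshuffles most keys whenever the cluster count changes, which is why real deployments usually prefer hash slots or consistent hashing.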
Following this outage, we have made the following investments in technology to ensure our customers receive the experience they expect from Deputy:
Redis has been reworked and re-architected
Additional monitoring, alerts and logs have been introduced in the application
Circuit breakers have been implemented to reduce the likelihood of cascading failures
Elastic computing scaling rules have been adjusted to better handle scaling up when required
Key engineering leadership has been hired, both in our new VP of Engineering and in the formation of a Site Reliability team
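For those curious about how a circuit breaker limits cascading failures, here is a minimal sketch of the pattern. It is an illustration under stated assumptions rather than our production code: the class name, the thresholds, and the injectable clock are all hypothetical.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so the timing can be tested
        self.failures = 0
        self.opened_at = None                       # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of queueing behind a struggling dependency.
                raise CircuitOpenError("dependency unavailable; failing fast")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the breaker.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to a shared dependency, for example `breaker.call(fetch_from_cache, key)`, where `fetch_from_cache` is a hypothetical function. When the breaker is open, the caller can fall back to a default or a friendly error straight away rather than tying up web servers and database connections while waiting.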
Thank you for choosing Deputy
We understand this was an upsetting outage for our customers, especially on a payroll day, before a public holiday. We responded quickly to correct the situation, and have systematically dealt with the root cause of the issue to ensure our customers are not impacted in this way again in the future.
Thank you for your patience and understanding. We do not take for granted the trust you have placed in Deputy. We will continue to be on a journey to make Deputy highly available and your trusted partner, while being open and transparent as we strive for continuous improvement.