In 2013, we wrote about Deputy’s infrastructure, and as we’ve grown significantly over the past 10 years, we thought we’d give you an update.
To keep up with our scale, our technology platform continues to innovate and expand. This is a deep-dive technical post giving you an inside, under-the-hood look at how we scale to handle 340,000+ workplaces scheduling shifts with high availability, concurrency, reliability, and resilience.
Deputy was born in the cloud over a decade ago, so building a distributed, scalable system is in our DNA. As an early cloud trailblazer, we designed the system well before AWS Virtual Private Clouds as we know them now existed, meaning that some of our architecture carries this heritage. Over the last couple of years, we’ve been striving to continually modernise our architecture.
Key Platform Elements
PHP and Golang for main business logic
RDS, DynamoDB for databases
AWS Cloud Infrastructure (see detail below)
API-first, developer-centric design principle. In fact, all of our web and mobile applications use our public-facing API, which many partners and customers also leverage to customise or automate Deputy inside their business
DeXML scripting language for developers who wish to customise and extend Deputy to meet the unique business requirements of complex organisations
Deputy runs on AWS. We utilise a vast array of AWS services to run the Deputy platform, including (in no particular order):
EC2, based on Amazon Linux 2, for most of our computation workloads.
ECS + ECR for our container-based services.
Transit Gateway to provide cross account and region connectivity through the platform.
ElastiCache with Redis for caching and session storage.
RDS + Aurora for our multi-tenanted customer data storage.
S3 + Glacier for any static assets and long-term storage requirements.
SQS + SNS to manage asynchronous processing.
ALB to manage the inbound traffic to our compute clusters.
Route53 for all DNS requirements.
DynamoDB for globally available data and service-based storage requirements.
Autoscaling Groups to ensure our compute clusters can continue serving traffic as demand changes.
Rekognition to provide the facial recognition functionality for our touchless clock-in feature.
Firehose to stream data and logs for our data processing pipeline.
API Gateway to provide the routing for some of our Lambda-based services.
Lambda for some services, but also for helping manage our workloads and supplementing our infrastructure in general.
ACM to manage all of our TLS certificates.
CloudWatch to provide some monitoring and log aggregation.
SES for sending out transactional emails from across the platform.
AWS Backup to provide backup capabilities for DynamoDB.
System Architecture Design
The above is a heavily simplified visualisation of the Deputy platform. A few key points:
This represents a single region; the design is generally replicated across our us-west-2, ap-southeast-2, and eu-west-2 regions.
Our incoming traffic is funnelled through a shared ingress VPC.
The Availability Zone is also a single representation. Our compute resources are spread across at least three zones within each region to give us high availability.
We ensure that all incoming requests come through specific services such as ALBs, keeping all of our actual code and data sources in private locations.
The Aurora box represents multiple individual clusters, depending on the region.
We have kept our multi-tenant approach to database utilisation, meaning that each customer has an individual database. For this we use Aurora clusters, placing multiple customer databases into each cluster. As our customer base grows, so does our number of Aurora clusters.
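The post doesn’t describe the placement logic itself, so here is a minimal sketch of database-per-customer routing in Go. The function names, hash-based placement, and database naming scheme are all assumptions for illustration, not Deputy’s actual implementation:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// clusterFor maps a customer ID to one of numClusters Aurora clusters.
// Hypothetical: a stable hash simply illustrates how many per-customer
// databases can be spread over a growing set of shared clusters.
func clusterFor(customerID string, numClusters int) int {
	h := fnv.New32a()
	h.Write([]byte(customerID))
	return int(h.Sum32()) % numClusters
}

// databaseName returns the individual database for a customer within
// its cluster (naming scheme is made up for this sketch).
func databaseName(customerID string) string {
	return "deputy_" + customerID
}

func main() {
	for _, id := range []string{"acme", "globex"} {
		fmt.Printf("%s -> cluster %d, db %s\n", id, clusterFor(id, 4), databaseName(id))
	}
}
```

The key property is that routing is deterministic: every request for a given customer lands on the same cluster and database without any shared lookup state.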
A closer look: Heavy computational capabilities purpose-built for workforce management
Our system design was purpose-built to handle unique use cases in workforce management, such as Data Driven Auto Scheduling. To automatically create schedules for thousands of employees, the system needs to take into account each individual employee’s availability, their preferences, and how much they are legally allowed to work. Deputy balances these considerations against staffing requirements and the cost of each shift for every available employee, making sure an optimal number of staff is working at critical times to create an efficient schedule for our customers.
How do we do this? Our system spawns new parallel processes on demand, concurrently pursuing multiple scheduling permutations. To complete this quickly, we scale up the computational load, running thousands of processes concurrently and asynchronously, each of which runs thousands of simulations to find the most appropriate person to assign to each shift. These processes run against the customer’s database, and once they are complete the user can review the schedule.
Even when many customers perform this action at the same time, our system is designed to scale up and down elegantly, as required.
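The fan-out pattern described above can be sketched in Go. In this toy version the names and the scoring model are hypothetical, and goroutines stand in for the real spawned processes; it concurrently scores candidate employees for a shift and keeps the best one:

```go
package main

import (
	"fmt"
	"sync"
)

// Candidate is a hypothetical employee/shift pairing whose Score folds
// in availability, preference, compliance limits, and shift cost.
type Candidate struct {
	Employee string
	Score    float64
}

// bestAssignment fans candidate scoring out across goroutines and keeps
// the highest-scoring employee — a toy stand-in for running many
// scheduling simulations concurrently and asynchronously.
func bestAssignment(candidates []Candidate) Candidate {
	results := make(chan Candidate, len(candidates))
	var wg sync.WaitGroup
	for _, c := range candidates {
		wg.Add(1)
		go func(c Candidate) {
			defer wg.Done()
			// A real simulation would evaluate a full roster permutation here.
			results <- c
		}(c)
	}
	wg.Wait()
	close(results)
	best := Candidate{Score: -1}
	for c := range results {
		if c.Score > best.Score {
			best = c
		}
	}
	return best
}

func main() {
	best := bestAssignment([]Candidate{
		{"alice", 0.72}, {"bob", 0.91}, {"carol", 0.64},
	})
	fmt.Println("assign shift to", best.Employee) // prints "assign shift to bob"
}
```

Because each simulation is independent, this shape scales out naturally: adding workers speeds up the search without any coordination beyond collecting results.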
Future view: Where we’re going
Technology innovation is at the heart of continuing to trailblaze in workforce management, enabling us to deliver unique, compelling capabilities.
By 2025 we have some ambitious goals!
Developer-first flexible API and Low Code / No Code Scripting languages for partners and customers to use Deputy as “workforce management as a service” in their organisations and technology ecosystem
Adopt serverless completely
Shift to fully distributed micro-service architecture
Embrace edge computing to provide real-time, heavy computational benefits to run optimal scheduling scenarios
Artificial Intelligence and Machine Learning native platforms driving smart automation in the workforce
If you’re interested in contributing to our journey to this future, please apply to join our engineering team!
Why are there 2 separate Autoscaling Groups in the diagram?
Our core web application, which runs on the EC2 instances within these groups, executes several different workloads, so to better distribute this load we’ve created separate clusters that operate independently. Because of this split, we can tune the resources for each group, meaning we can better handle the traffic we receive.
How does your monitoring and logging work?
Most of our logs are sent through to Datadog, some of them via CloudWatch (such as those generated by Lambda). All of our monitoring also uses Datadog, with the agent running within EC2 instances or as sidecars within our ECS services. By pushing all of this data into Datadog we can create a broad overview of the platform, letting us correlate particular metrics across multiple components for a better understanding of what’s happening.
What happens when something goes wrong?
While working through the enhancements and upgrades that led to the above architecture, resilience was a key focus for us. We utilise multiple Availability Zones wherever possible and run multiple instances of services where we can, to ensure that there’s always something available to handle requests.
How are deployments carried out?
Just as we have separate Autoscaling Groups for the different workloads on our core web application, each of these is actually two separate Autoscaling Groups that belong to a single Target Group. When we push a new release, the Version tag is updated on the empty Autoscaling Group, so as new instances are added they read this tag to pull that version. This is an implementation of the Blue / Green style of deployment, ensuring there are always instances available for our customers.
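As a rough illustration of this Blue / Green flow, here is a Go sketch. The struct names and fields are hypothetical, and it models only the tag-and-swap logic, not the actual AWS API calls:

```go
package main

import "fmt"

// asg models one of the two Autoscaling Groups that share a single
// Target Group. Fields are invented for this sketch.
type asg struct {
	Name    string
	Version string // the Version tag new instances read on boot
	Desired int    // desired instance count
}

// deploy tags the idle (empty) group with the new release version and
// swaps capacity, so fresh instances pull the new build while the old
// group is scaled down — always leaving instances available.
func deploy(groups []*asg, newVersion string) *asg {
	var active, idle *asg
	for _, g := range groups {
		if g.Desired == 0 {
			idle = g
		} else {
			active = g
		}
	}
	idle.Version = newVersion
	idle.Desired, active.Desired = active.Desired, 0
	return idle
}

func main() {
	blue := &asg{Name: "web-blue", Version: "1.41.0", Desired: 6}
	green := &asg{Name: "web-green", Desired: 0}
	live := deploy([]*asg{blue, green}, "1.42.0")
	// prints "now serving 1.42.0 at web-green (x6)"
	fmt.Printf("now serving %s at %s (x%d)\n", live.Version, live.Name, live.Desired)
}
```

On the next release the roles simply reverse, with the now-empty group receiving the new Version tag.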
What other tools and services do you use?
As mentioned above, we use Datadog for our monitoring and logging, but there are also other products we use across the platform.
Netlify and Storyblok for our blog and website.
Atlassian, Jira, and Asana for product and development task management.
Slack for our internal communications.
Google Apps for our office email, documents, etc.
Snowflake and Tableau for our data warehousing and visualisation.
Deputy for our internal scheduling needs. We follow a daily release process and manage the roster of who is involved via our own Deputy instance.