Lifion’s Cloud Transformation Journey
On moving to managed services in a microservice architecture
Since Lifion’s inception as ADP’s next-generation Human Capital Management (HCM) platform, we’ve made an effort to embrace relevant technology trends and advancements. From microservices and container orchestration frameworks to distributed databases, and everything in between, we’re continually exploring ways we can evolve our architecture. Our readiness to evaluate non-traditional, cutting edge technology has meant that some bets have stuck whereas others have pivoted.
One of our biggest pivots has been a shift from self-managed databases & streaming systems, running on cloud compute services (like Amazon EC2) and deployed with tools like Terraform and Ansible, towards fully cloud-managed services.
When we launched the effort to make this shift in early 2018, we began by executing a structured, planned initiative across an organization of 200+ engineers. After overcoming the initial inertia, the effort continued to gain momentum, eventually taking a life of its own, and finally becoming fully embedded in how our teams work.
Along the way, we’ve been thinking about what we can give back. For example, we’ve previously written about a node.js client for AWS Kinesis that we’re working on as an open source initiative.
AWS’s re:Invent conference is perhaps the largest global cloud community conference in the world. In late 2018, we presented our cloud transformation journey at re:Invent. As you can see in the recording, we described our journey and key learnings in adopting specific AWS managed services.
In this post, we discuss key factors that made the initiative successful, its benefits in our microservice architecture, and how managed services helped us shift our teams’ focus to our core product while improving overall reliability.
Why Services Don’t Share Databases
The notion of services sharing databases, making direct connections to the same database system and being dependent on shared schemas, is a recognized micro-service anti-pattern. With shared databases, changes in the underlying database (including schemas, scaling operations such as sharding, or even migrating to a better database) become very difficult with coordination required between multiple service teams and releases.
As Amazon.com CTO Werner Vogels writes in his blog:
Each service encapsulates its own data and presents a hardened API for others to use. Most importantly, direct database access to the data from outside its respective service is not allowed. This architectural pattern was a response to the scaling challenges that had challenged Amazon.com through its first 5 years…
And Martin Fowler on integration databases:
On the whole integration databases lead to serious problems becaue [sic] the database becomes a point of coupling between the applications that access it. This is usually a deep coupling that significantly increases the risk involved in changing those applications and making it harder to evolve them. As a result most software architects that I respect take the view that integration databases should be avoided.
The Right Tool for the Job
Applying the database per service principal means that, in practice, service teams have significant autonomy in selecting the right database technologies for their purposes. Among other factors, their data modeling, query flexibility, consistency, latency, and throughput requirements will dictate technologies that work best for them.
Up to this point, all is well — every service has isolated its data. However, when architecting a product with double digit domains, several important database infrastructure decisions need to be made:
- Shared vs dedicated clusters: Should services share database clusters with logically isolated namespaces (like logical databases in MySQL), or should each have its own expensive cluster with dedicated resources?
- Ownership: What level of ownership does a service team take for the deployment, monitoring, reliability, and maintenance of their infrastructure?
- Consolidation: Is there an agreed set of technologies that teams can pick from, is there a process for introducing something new, or can a team pick anything they like?
From Self-Managed to Fully Managed Services
When we first started building out our services, we had a sprawl of supporting databases, streaming, and queuing systems. Each of these technologies was deployed on AWS EC2, and we were responsible for the full scope of managing this infrastructure: from the OS level, to topology design, configuration, upgrades and backups.
It didn’t take us long to realize how much time we were spending on managing all of this infrastructure. When we made the bet on managed services, several of the decisions we’d been struggling with started falling into place:
- Shared vs dedicated clusters: Dedicated clusters for services, clearly preferable from a reliability and availability perspective, became easier to deploy and maintain. Offerings like SQS, DynamoDB, and Kinesis with no nodes or clusters to manage removed the concern altogether.
- Ownership: Infrastructure simplification meant that service teams were able to develop further insight into their production usages, and take greater responsibility for their infrastructure.
- Consolidation: We were now working with a major cloud provider’s service offerings, and found that there was enough breadth to span our use cases.
On our Lifion engineering blog, we’ve previously written about our Lifion Developer Platform Credos. One of these speaks to the evolutionary nature of our work:
- Build to evolve: We design our domains and services fully expecting that they will evolve over time.
- Backwards compatible, versioned: Instead of big bang releases, we use versions or feature flags letting service teams deploy at any time without coordinating dependencies.
- Managed deprecations: When deprecating APIs or features, we carefully plan the impact and ensure that consumer impact is minimal.
When we started adopting managed services, we went for drop-in replacements first (for example, Aurora MySQL is wire compatible with the previous MySQL cluster we were using). This approach helped us to get some early momentum while uncovering dimensions like authentication, monitoring, and discoverability that would help us later.
Our evolutionary architecture credo helped to ensure that the transition would be smooth for our services and our customers. Each deployment was done as a fully online operation, without customer impact. We recognize that we will undergo more evolutions, for which we intend to follow the same principles.