Environments (AWS accounts)
What we’re trying to fix
Within the Ministry of Justice, AWS accounts and the applications that sit within them tend to be split by business unit and their software development lifecycle phase (SDLC).
This means that for most business units, they will have a
production AWS account, which holds all of their production applications. This tends to span across other SDLC phases such as
test, and so on.
Whilst this works for business units working on their own applications with a finite number of cloud resources, scaling it into a platform for anyone within the Ministry of Justice to use raises some issues.
For example, if we had an account for
fictional-business-unit-production and some example applications:
example-a, which is an application with a database that holds sensitive
example-b, which is an application with a database that holds sensitive
Issue 1: blast radius
One of the risks of splitting AWS accounts at the granularity of business unit and their SDLC is the blast radius. By doing this, the blast radius has a much wider impact on the business and has a high probability of affecting other resources and applications that sit within an AWS account.
The blast radius can be affected by anything, such as security, or an Availability Zone going offline.
Access keys are leaked for an IAM user that is in
fictional-business-unit-production. The IAM user has an
AdministratorAccess policy attached. A malicious attacker with these keys now has the ability to access both sensitive datasets:
dataset A and
fictional-business-unit-production had only ever configured their EC2 instances to run in one AZ. If this AZ goes down, all of their applications do too.
Issue 2: team isolation
Similar to Issue 1, team isolation becomes extremely difficult to maintain if all applications in their SDLC stage are within one account.
team-a works on
team-b works on
example-b, both of which fall under
team-a does can be seen by
team-b even if you use a well-defined attribute-based access control (ABAC). Using ABAC requires resources to be tagged, and it has no effect on untagged resources. ABAC also doesn’t work for all resources in AWS.
team-a forgets to tag something, or creates a resource that doesn’t support tagging during creation, it’s interactable by who has access to that AWS account.
An extreme misclicking example
team-a notices an issue with one of their
production EC2 instances. They try to terminate it, and notice that they’ve just terminated a
production instance that belonged to
team-b, which disrupts
team-b‘s workflow whilst they redeploy their application. It takes time and money to reprovision these resources.
Issue 3: billing
In a shared AWS account for a business unit, untaggable billable items such as Data Transfer become impossible to attribute to an application.
Billing granularity requires resource tagging, and it is useful to know how much an application costs to run to help prioritise the refactoring, replacement, or retirement of each application.
Further to this, having more granular billing can expose expensive operations or misconfigured resources such as instances that require right-sizing.
example-a stores 100TB of data in an S3 bucket and replicates it to another region, costing $2,048.00.
example-b stores 1TB of data in an S3 bucket and replicates it to another region, costing $20.48.
Since Data Transfer isn’t taggable, you can’t easily trace it back to the originating S3 bucket.
Issue 4: cloud waste
When holding all applications in a SDLC account, it becomes difficult to retire application infrastructure, even if they’re covered by a good tagging policy. With cloud infrastructure, everything should be considered ephemeral and everything should be rebuildable from code.
If you destroy the wrong infrastructure when retiring an application, it becomes costly to recreate. If you miss the deprovisioning of some infrastructure, you create cloud waste which is also costly.
Unless you’ve retired all applications within a shared AWS account, you can’t close it, so there’s always a chance there are some resources unaccounted for after an application has been refactored, replaced, or retired.
Example of cloud waste
example-a has been retired, and is no longer of use. There are untagged resources that are thought to be part of
team-a is unsure, so they just leave it.
Example of destroying the wrong infrastructure
What we investigated
Let teams decide
In the Modernisation Platform, we want to empower teams and give them the autonomy to do whatever they need to support their application and environments. One of the issues with delegating how their infrastructure is separated is that it:
- quickly becomes hard to track
- can become messy
- doesn’t support the central alignment we’re trying to achieve
- people tend to go for the easiest option, which creates technical debt further down the line
Of that, there are some positives:
- teams truly have autonomy and empowerment to run their infrastructure in the way they believe is best
- teams can understand their own configuration as they are the ones who built it
One account, strong attribute-based access controls
Having one account is by far the easiest setup to have. You can use VPCs out of the box to enable cross-resource communication, you only have to set up IAM users once, and you can easily see all resources in one place.
Early on, we explored using strong attribute-based access control (ABAC) so teams can only view their own resources in an account, alongside strong IAM policies to stop cross-team resource interaction. Some of the issues with this is that:
- ABAC isn’t supported on all AWS resources, and requires resources to be tagged (it doesn’t work on untagged resources)
- finding the cost of an application becomes extremely hard to do, where billable items like Data Transfer aren’t easily attributable directly to a source
- deprovisioning infrastructure or applications can become tricky if there are thousands of resources in one account
- resources aren’t truly isolated, as they’re all in one account
- the blast radius is huge
Separate by application and SDLC
Separating accounts by application is the more granular middle-ground of either having one account, or separating everything. It’s very similar to separating by business unit and SDLC but goes a step further to provide granularity at an application level.
It has some issues:
- it’s not simple to do
- it’s uncommon
- you probably need an essence of centralisation, such as for networking
- it has a higher risk of cloud waste, but is easier to track
Separate by business unit and SDLC
This is how most multi-account architectures are set up and it is recommended by most cloud providers to go to at least this level of granularity. Some issues with this are listed above in What we’re trying to fix.
It’s not incorrect or wrong to do things this way, but the Modernisation Platform would like to improve on the Ministry of Justice’s current working practices.
It does have its benefits:
- there are a set number of SDLC stages a business unit will use
- it’s easy to work with, since everything for each SDLC stage is in one place
- it logically separates business units from each other
Theoretically and technically, you can separate everything into different accounts. You could have something like this:
example-a-api-production example-a-backend-production example-a-databases-production example-a-frontend-production example-a-api-development example-a-backend-development example-a-databases-development example-a-frontend-development example-a-landing-zone
The issue with this is:
- it becomes incredibly complex to maintain
- it becomes difficult to track
- it causes exponential growth in the number of AWS accounts
- it becomes hard to sensibly group things, i.e. what makes up the
frontendgroup, what needs to go into
- it is impossible to sustain as an approach as more teams are on-boarded onto the Modernisation Platform
- most legacy applications aren’t microservices, so splitting things out is hard
- huge room for error in cross-account sharing, if required
The benefits of this are:
- it becomes easy to track costs per application and layer (frontend, backend)
- it has a tiny blast radius
What we decided
We decided to use separate AWS accounts per application as a middle-ground between separating everything and using one account with strong attribute-based access controls.
Whilst there is a trade-off to more complexity, we feel it’s outweighed by the benefits of doing this.
Some of the biggest benefits are:
- granular isolation at an application level
- better cost management
- increased autonomy for teams
- reduced cloud waste
Logically separate applications
The Modernisation Platform is going to host a number of applications, which are built in different ways. One of our goals is to modernise these applications, and that can be anything from moving away from on-premise hosted databases into managed services such as AWS Relational Database Service; or moving away from bastion hosts to agent-based instance management.
Whilst that is our goal, we’re also aware it won’t happen straight away, or can’t happen for some legacy applications.
By separating applications out into their own AWS accounts we can:
- allow autonomy for all environments, including production, with more confidence due to a reduced blast radius
- isolate different team workloads from each other
- manage and control costs at an application level
- reduce cloud waste by building test suites for application infrastructure, rather than for a whole platform
- retire whole applications easily, by closing their AWS account
- sensibly group applications without having to worry about which layer (frontend, backend) it sits in
- separate legacy infrastructure from more modern infrastructure, and move them into the Cloud Platform when suitable
- use Service Control Policies attached to single accounts to restrict centrally-managed actions on an application-by-application basis
- utilise independently mapped Availability Zones so if one AZ goes down, it will have less of an impact in other accounts
More accounts aren’t always better
The risk of separating everything, down to layer (frontend, backend, database, etc) is too complex to maintain sensibly. It is easy to get carried away with splitting applications down into layers, and most of our hosted applications will be legacy, which are unlikely to be split into microservices. This makes it difficult or impossible to be cost efficient with migrations, especially if they are likely to be retired or rebuilt in the near future.
Whilst more accounts come at a free monetary cost, the time cost of defining, and maintaining them, becomes high.
The trade-off for complexity is (probably) worth it
Logically separating applications into their own AWS accounts isn’t an easy task, and can become complex.
The trade-off of having a smaller blast radius for legacy applications, better cost analysis, less cloud waste, and greater team isolation is worthwhile for the confidence we will have as a team to make infrastructure changes.
We’re always learning
Where there’s complexity, there’s an opportunity to learn and that will help us make the right decisions with our infrastructure.
We might find issues with this structure or find that it doesn’t work for our scale, and that’s OK, because we’ll have learnt exactly why by trying it.
You can view our architecture for Environments (AWS accounts) on the dedicated Environments (AWS accounts) architecture page.