Environments (AWS accounts)
What we’re trying to fix
Within the Ministry of Justice, AWS accounts and the applications that sit within them tend to be split by business unit and software development lifecycle (SDLC) phase.
This means that most business units have a `production` AWS account, which holds all of their production applications, with equivalent accounts for the other SDLC phases such as `development`, `staging`, `test`, and so on.
Whilst this works for business units working on their own applications with a finite number of cloud resources, scaling it into a platform for anyone within the Ministry of Justice to use raises some issues.
For example, if we had an account for `fictional-business-unit-production` and some example applications:
- `example-a`, which is an application with a database that holds sensitive dataset A
- `example-b`, which is an application with a database that holds sensitive dataset B
Issue 1: Blast radius
One of the risks of splitting AWS accounts at the granularity of business unit and SDLC phase is the potential size of the blast radius - the radius in which damage could occur should something go wrong. At this level of granularity the blast radius is wide: an incident has a high probability of affecting other resources and applications that sit within the same AWS account.
Many kinds of incident can expose this blast radius, from a security compromise to an Availability Zone going offline.
Example: Compromise of security
Access keys are leaked for an IAM user in `fictional-business-unit-production`. The IAM user has an `AdministratorAccess` policy attached. A malicious attacker with these keys now has the ability to access both sensitive datasets: dataset A and dataset B.
Example: Loss of availability
`fictional-business-unit-production` had only ever configured their EC2 instances to run in one AZ. If this AZ goes down, all of their applications do too.
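This single-AZ failure mode is independent of how accounts are split, but the mitigation is worth illustrating: give workloads subnets in more than one AZ. Below is a rough sketch using boto3; the Auto Scaling group name, launch template, and subnet IDs are hypothetical.

```python
# Rough sketch: an Auto Scaling group spread across subnets in two
# Availability Zones, so the loss of one AZ doesn't take every
# instance with it. The group name, launch template, and subnet IDs
# are placeholders for illustration only.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="example-a-web",  # hypothetical
    LaunchTemplate={"LaunchTemplateName": "example-a-web", "Version": "$Latest"},
    MinSize=2,
    MaxSize=4,
    # One subnet per AZ, as a comma-separated list of subnet IDs.
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",
)
```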
Issue 2: Team isolation
Similar to Issue 1, team isolation becomes extremely difficult to maintain when all of a business unit’s applications for a given SDLC stage sit in one account.
Imagine `team-a` works on `example-a` and `team-b` works on `example-b`, both of which fall under `fictional-business-unit`.
Everything `team-a` does can be seen by `team-b`, even with well-defined attribute-based access control (ABAC). ABAC requires resources to be tagged and has no effect on untagged resources, and it isn’t supported by every AWS resource type.
If `team-a` forgets to tag something, or creates a resource that doesn’t support tagging during creation, anyone who has access to that AWS account will be able to interact with it.
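As a rough illustration of that gap, here is a minimal sketch (Python with boto3; the policy, its name, and the `team` tag key are all hypothetical) of the tag-matching condition ABAC relies on. The statement only matches resources whose `team` tag equals the caller’s `team` principal tag, so an untagged resource simply falls outside the ABAC rule and is governed by whatever broader permissions exist in the account.

```python
# Minimal ABAC sketch: allow instance actions only when the resource's
# "team" tag matches the caller's "team" principal tag. Resources with
# no "team" tag never match this condition, so ABAC alone says nothing
# about them -- broader permissions in the account still apply.
# Policy name and tag key are illustrative.
import json

import boto3

abac_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowActionsOnOwnTeamsInstances",
            "Effect": "Allow",
            "Action": [
                "ec2:StartInstances",
                "ec2:StopInstances",
                "ec2:TerminateInstances",
            ],
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/team": "${aws:PrincipalTag/team}"
                }
            },
        }
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="team-abac-example",  # hypothetical name
    PolicyDocument=json.dumps(abac_policy),
)
```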
Example: Human error (misclick)
`team-a` notices an issue with one of their `production` EC2 instances. They go to terminate it, but accidentally terminate a `production` instance that belongs to `team-b` instead, which disrupts `team-b`’s workflow whilst they redeploy their application. It takes time and money to reprovision these resources, and it may affect application users.
Issue 3: Billing
In a shared AWS account for a business unit, untaggable billable items such as Data Transfer cannot easily be attributed to a particular application.
Billing granularity in a shared account requires resource tagging. It is useful to know how much an application costs to run, to help prioritise its refactoring, replacement, or retirement.
Further to this, having more granular billing can expose opportunities for optimisation: expensive operations or misconfigured resources such as instances that require right-sizing.
Example: Unbalanced untaggable billing
- `example-a` stores 100TB of data in an S3 bucket and replicates it to another region, costing $2,048.00
- `example-b` stores 1TB of data in an S3 bucket and replicates it to another region, costing $20.48
Since Data Transfer isn’t taggable, you can’t easily trace it back to the originating S3 bucket.
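For contrast, once each application lives in its own account, per-application spend (including untaggable items like Data Transfer) can be read straight from the billing data, with no tagging required. Below is a rough sketch using boto3 and AWS Cost Explorer; the dates are placeholders.

```python
# Rough sketch: last month's unblended cost per linked (member) account
# from Cost Explorer. With one AWS account per application, each group
# in the response maps directly to an application, and untaggable items
# such as Data Transfer are included. Dates are placeholders.
import boto3

ce = boto3.client("ce", region_name="us-east-1")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "LINKED_ACCOUNT"}],
)

for group in response["ResultsByTime"][0]["Groups"]:
    account_id = group["Keys"][0]
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{account_id}: ${amount:.2f}")
```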
Issue 4: Cloud waste
When holding all applications in an SDLC account, it becomes difficult to retire application infrastructure, even when it is covered by a good tagging policy. When deploying to the cloud, all infrastructure should be considered ephemeral and everything should be rebuildable from code.
If you destroy the wrong infrastructure when retiring an application, it becomes costly to recreate. If you accidentally miss some infrastructure during de-provisioning, you inadvertently create ongoing cloud waste, which is also costly.
Until you’ve retired all applications within a shared AWS account, you can’t close it, so there’s always a chance there are some resources left unaccounted for after an application has been refactored, replaced, or retired.
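One way to gauge how much is unaccounted for in a shared account is to sweep for resources missing an ownership tag. Below is a rough sketch using boto3 and the Resource Groups Tagging API; the `application` tag key is hypothetical, and the API only reports supported resource types that have been tagged at some point, which is itself part of the problem.

```python
# Rough sketch: list resources in the current account/region that are
# missing a (hypothetical) "application" tag, via the Resource Groups
# Tagging API. Resource types the API doesn't support, or resources
# that have never been tagged, may not appear at all -- which is
# exactly how orphaned resources accumulate in a shared account.
import boto3

tagging = boto3.client("resourcegroupstaggingapi")

paginator = tagging.get_paginator("get_resources")
for page in paginator.paginate():
    for resource in page["ResourceTagMappingList"]:
        tag_keys = {tag["Key"] for tag in resource.get("Tags", [])}
        if "application" not in tag_keys:
            print(resource["ResourceARN"])
```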
Example: Cloud waste from uncertainty
`example-a` has been retired and is no longer in use. There are untagged resources that are thought to be part of `example-a`, but `team-a` is unsure, so they just leave them.
Example: Destroying the wrong infrastructure
See Example: Human error (misclick).
What we investigated
Option A: Let teams decide
In the Modernisation Platform, we want to empower teams and grant them the autonomy to do whatever they need to support their applications and environments. Some complexities that can arise from delegating how infrastructure is separated are:
- people tend to go for the easiest option, which creates technical debt further down the line
- infrastructure quickly becomes hard to track
- it can become tangled and messy
- it doesn’t support the central alignment we’re trying to achieve
However, raw autonomy does give some positives:
- teams truly have autonomy and empowerment to run their infrastructure in the way they believe is best
- teams can understand their own configuration as they are the ones who built it
Option B: One account, strong attribute-based access controls
Having one account is by far the easiest setup. You can use VPCs out of the box to enable cross-resource communication, you only have to set up IAM users once, and you can easily see all resources in one place.
Early on, we explored using strong attribute-based access control (ABAC) so teams can only view their own resources in an account, alongside strict IAM policies to stop cross-team resource interaction. Some of the issues with this are that:
- ABAC isn’t supported on all AWS resources, and requires resources to be tagged (it doesn’t work on untagged resources)
- finding the cost of an application becomes extremely hard, as billable items like Data Transfer aren’t easily attributable to a source
- deprovisioning infrastructure or applications can become tricky if there are thousands of resources in one account
- resources aren’t truly isolated, as they’re all in one account
- the blast radius is huge
Option C: Accounts separated by business unit and SDLC stage
This is how most multi-account architectures are set up, and most cloud providers recommend going to at least this level of granularity. Some issues with this are listed above in What we’re trying to fix.
It’s not wrong to do things this way, but the Modernisation Platform would like to improve on the Ministry of Justice’s current working practices.
It does have its benefits:
- it maps onto the set of SDLC stages a business unit will typically use
- it’s easy to work with, since everything for each SDLC stage is in one place
- it logically separates business units from each other
Option D: Accounts separated by application and SDLC stage
Separating accounts by application is a more granular middle ground between having one account and separating everything. It’s very similar to separating by business unit and SDLC stage, but goes a step further to provide granularity at an application level.
It has some issues:
- it’s not simple to do
- it’s an uncommon pattern
- users probably need some degree of centralisation, such as for networking
- it carries a higher risk of cloud waste, although that waste is easier to track
Option E: Separate everything
Theoretically and technically, you can separate everything into different accounts; i.e. by application, application component/layer, and SDLC stage. You could have something like this for the `example-a` application:
- `example-a-api-production`
- `example-a-backend-production`
- `example-a-databases-production`
- `example-a-frontend-production`
- `example-a-api-development`
- `example-a-backend-development`
- `example-a-databases-development`
- `example-a-frontend-development`
- `example-a-landing-zone`
Some issues with this are:
- it becomes incredibly complex to maintain
- it becomes difficult to track
- it causes exponential growth in the number of AWS accounts
- it becomes hard to sensibly group things, i.e. what makes up the `frontend` group and what needs to go into the `backend` group
- it is impossible to sustain as an approach as more teams are on-boarded onto the Modernisation Platform
- most legacy applications aren’t microservices, so splitting things out is hard
- huge room for error in cross-account sharing, if required
The benefits of this are:
- it becomes easy to track costs per application and layer (frontend, backend)
- each account has a tiny blast radius
What we decided
Overview
We decided to use separate AWS accounts per application as a middle-ground between separating everything and using one account with strong attribute-based access controls.
Whilst there is a trade-off in added complexity, we feel it’s outweighed by the benefits of this approach.
Some of the biggest benefits are:
- granular isolation at an application level
- better cost management
- increased autonomy for teams
- reduced cloud waste
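As an illustration only (this isn’t the platform’s actual provisioning process), creating a member account per application comes down to an asynchronous AWS Organizations call followed by polling its status. The account name and email address below are hypothetical.

```python
# Illustrative sketch only -- not the platform's actual provisioning
# process. CreateAccount is asynchronous, so we poll the request status
# until it resolves. Account name and email address are hypothetical.
import time

import boto3

organizations = boto3.client("organizations")

create = organizations.create_account(
    Email="example-a-production@example.gov.uk",  # hypothetical address
    AccountName="example-a-production",  # one account per application and SDLC stage
)

status_id = create["CreateAccountStatus"]["Id"]
while True:
    status = organizations.describe_create_account_status(
        CreateAccountRequestId=status_id
    )["CreateAccountStatus"]
    if status["State"] != "IN_PROGRESS":
        print(status["State"], status.get("AccountId"))
        break
    time.sleep(5)
```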
Logically separate applications
The Modernisation Platform hosts a number of applications, each built in different ways. One of our goals is to facilitate modernisation of these applications through the onboarding process. That could mean:
- moving away from on-premise hosted databases into managed services such as AWS Relational Database Service (RDS)
- moving away from bastion hosts to agent-based instance management
Whilst app modernisation is our goal, we’re also aware that it won’t happen straight away, and may not be possible for some legacy applications.
By separating applications out into their own AWS accounts we can:
- allow autonomy for all environments, including production, with more confidence due to a reduced blast radius
- isolate different team workloads from each other
- manage and control costs at an application level
- reduce cloud waste by building test suites for application infrastructure, rather than for a whole platform
- retire whole applications easily, by closing their AWS account
- sensibly group applications without having to worry about which layer (frontend, backend) each part sits in
- separate legacy infrastructure from more modern infrastructure, and move them into the Cloud Platform when suitable
- use Service Control Policies attached to single accounts to restrict centrally-managed actions on an application-by-application basis (see the sketch after this list)
- utilise independently mapped Availability Zones so if one AZ goes down, it will have less of an impact in other accounts
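To expand on the Service Control Policy point above: because each application has its own account, an SCP can be created and attached to that one account, so a restriction applies per application rather than platform-wide. The sketch below uses boto3 and AWS Organizations; the policy content, name, and account ID are hypothetical.

```python
# Rough sketch: create a Service Control Policy and attach it to a
# single member account, so the restriction applies to one application
# only. Policy content, name, and account ID are hypothetical.
import json

import boto3

organizations = boto3.client("organizations")

scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyLeavingTheOrganisation",
            "Effect": "Deny",
            "Action": "organizations:LeaveOrganization",
            "Resource": "*",
        }
    ],
}

policy = organizations.create_policy(
    Name="example-a-guardrails",  # hypothetical name
    Description="Per-application guardrails",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp),
)

organizations.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="111111111111",  # the application's account ID (placeholder)
)
```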
More accounts aren’t always better
Separating everything, down to layer (frontend, backend, database, etc.), is too complex to maintain sensibly. It is easy to get carried away with splitting applications into layers, and most of our hosted applications will be legacy applications that are unlikely to be split into microservices. That makes it difficult or impossible to be cost efficient with migrations, especially for applications that are likely to be retired or rebuilt in the near future.
Whilst additional accounts are free in monetary terms, the time cost of defining and maintaining them becomes high.
The trade-off for complexity is (probably) worth it
Logically separating applications into their own AWS accounts isn’t an easy task, and can become complex.
The trade-off is worthwhile: a smaller blast radius for legacy applications, better cost analysis, less cloud waste, and greater team isolation give us confidence as a team to make infrastructure changes.
We’re always learning
Where there’s complexity, there’s an opportunity to learn and that will help us make the right decisions with our infrastructure.
We might find issues with this structure or find that it doesn’t work for our scale, and that’s OK, because we’ll have learnt exactly why by trying it.
Architecture
You can view our architecture for Environments (AWS accounts) on the dedicated Environments (AWS accounts) architecture page.