Disaster Recovery Process

This page gives direction in the event of a disaster recovery (DR) scenario.

It covers permanent loss of all infrastructure, temporary loss of all infrastructure, and loss of a single account.

Diagnose

To find out the severity of the situation, check the AWS status page here.

Regardless of the type of disaster recovery situation, contact AWS immediately. Knowing whether this is an outage of 1 hour or 1 month will dictate the next steps to take.

  • If the issue is across all accounts, log in to the Modernisation Platform account and raise a ticket at the highest severity level, whilst also engaging with our AWS representative in the #ext-aws channel.

  • If the issue is in one account, log in to that account and, if it is a production account, raise a ticket at the highest severity level. Also engage with our AWS representative in the #ext-aws channel.

  • If this occurs during working hours, create a huddle / meet and follow the escalation process outlined below.

  • If this occurs outside of working hours, we will not attempt to initiate DR because of the complexity of the process. Instead, wait until the next working day. Steps can still be taken to prepare, as described below.
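
The triage rules above can be summarised as a small decision function. This is only a sketch for discussion: the function name `triage`, its parameters and the action strings are illustrative and are not part of any platform tooling.

```python
def triage(all_accounts: bool, production: bool, working_hours: bool) -> list[str]:
    """Return the ordered actions suggested by the Diagnose rules (illustrative only)."""
    actions = []
    if all_accounts:
        # Platform-wide issue: the ticket is raised from the Modernisation Platform account.
        actions.append("Log in to the Modernisation Platform account")
        actions.append("Raise a highest-severity AWS ticket")
    else:
        # Single account: the ticket is only raised at highest severity for production.
        actions.append("Log in to the affected account")
        if production:
            actions.append("Raise a highest-severity AWS ticket")
    actions.append("Engage our AWS representative in #ext-aws")
    if working_hours:
        actions.append("Create a huddle / meet and follow the escalation process")
    else:
        actions.append("Prepare only; initiate DR on the next working day")
    return actions


if __name__ == "__main__":
    for step in triage(all_accounts=False, production=True, working_hours=False):
        print("-", step)
```

Keeping the rules in one function makes it easy to see that the out-of-hours branch never initiates DR, whatever the scope of the outage.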

Internal Escalation

Once AWS have been contacted and a ticket has been raised, please escalate to the following people.

  • Modernisation Platform Product Manager - Simon Pledger
  • Hosting Service Owner - Sean Privett
  • Deputy Director / Head of Platforms and Architecture - Martyn Inglis
  • Digital Director - Kamal Bal

Inform the following channels of the situation and put forward a contactable member of staff from our team.

#ask-modernisation-platform

#modernisation-platform

#modernisation-platform-updates

Following this process.

Disaster Recovery Types

Let’s work the problem, people. Let’s not make things worse by guessing.

  • London has been lost but no services need to migrate. See “London has been lost but returning soon”.
  • London has been lost, and all services need to migrate. See “Disaster Recovery Running outside of London”.
  • Loss of a single account. See “Single Account”.

London has been lost but returning soon.

Getting a decision

Once a ticket has been raised with AWS, and we have engaged with them and have total confirmation that the infrastructure will be returning within X number of minutes or hours, we need to pass this information up the escalation chain so that an informed decision can be made.

Only when we have this decision can we update users on what happens next.

Priority list

If there is limited availability, which is a possibility, these are the priority accounts that need to be checked before we can focus on member-accounts.

| Account | Information |
| --- | --- |
| Core-Shared-Services | Creates the core shared services account resources. |
| Core-Network-Services | Creates the core networking account resources; this is the highest priority. |
| Core-Logging | Creates the core logging account resources. |
| Core-Security | Creates the core security account resources. |
| Cloud-Platform-Transit-Gateways | Across all regions. |
| Core-VPC-* | Creates the core VPC resources in the VPC accounts, needed after the core-networking-services. |
| Modernisation-Platform | Creates key resources such as S3 state buckets for the platform. |

Test Accounts

If there is limited availability, power off all member infrastructure in the test accounts. All are useless until the platform is running.

Single Account

Things to remember

  • Don’t forget the state file; it could cause issues when rebuilding and generating plans.
  • Backups: our infrastructure is cross-region where possible, as stated in the table. However, we will need to help teams restore theirs where possible.

Will it be returning

If this is a test account, waiting may be the easier option. If AWS can confirm that this is a temporary issue and give an estimated fix time, we can pass this on to the team and agree our next step.

Rebuild account

Assuming this is a production account with no estimated time of return, please follow the steps below.

Workflows

Rebuilding an account entirely through workflows would be the next step, and in an ideal scenario this would work first time. However, when errors occur, debugging Terraform plans and applies locally saves a lot of time.

Scenarios are never as cut and dried as ‘all infrastructure has vanished from one account’, so working through and debugging the issues that occur will be quicker locally.

Member Accounts - Located here, running the new environment workflow should begin the steps of an account creation for member accounts.

Core Accounts - Located here, on the left you will see the deployment workflows for all core accounts, which are required for building the platform.

Manually

1.) In the modernisation-platform repo, run a Terraform plan against the production workspace.

2.) All changes should appear. Using your admin credentials, run an apply.

3.) Assuming the apply runs as planned, log in to the account and check that the VPCs etc. look correct.

4.) In the modernisation-platform-environments repo, run Terraform against the correct account.
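
The manual steps above could be sketched as a command sequence. This is a hedged illustration, not the team's actual procedure: the workspace name passed in, the plan file name `dr.tfplan` and the working directory are assumptions, and the real workspace names should be checked in the repository before running anything. The script only prints the commands unless `dry_run=False` is passed.

```python
import subprocess


def manual_rebuild_commands(workspace: str, plan_file: str = "dr.tfplan") -> list[list[str]]:
    """Build the Terraform commands for one directory (names are assumptions)."""
    return [
        ["terraform", "init"],
        ["terraform", "workspace", "select", workspace],
        # Save the plan so that the apply uses exactly what was reviewed.
        ["terraform", "plan", f"-out={plan_file}"],
        ["terraform", "apply", plan_file],
    ]


def run(commands: list[list[str]], cwd: str, dry_run: bool = True) -> None:
    """Print each command, and execute it only when dry_run is False."""
    for cmd in commands:
        print("$", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, cwd=cwd, check=True)


if __name__ == "__main__":
    # "production" mirrors step 1; the real workspace name may differ.
    run(manual_rebuild_commands("production"), cwd=".", dry_run=True)
```

Applying a saved plan file rather than running a bare apply means the changes reviewed in step 2 are the changes that get applied, which matters when the state of the account is uncertain.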

Things to remember

  • The state file for the lost account will still exist; it might need to be removed.
  • Any backups on the account or volumes will be lost.

Disaster Recovery Running outside of London.

If the London infrastructure is to be down for the foreseeable future, running from another region is the next step.

Informing teams

Firstly, inform the following channels of the situation and put forward a contactable member of staff from our team.

#ask-modernisation-platform

#modernisation-platform

#modernisation-platform-alerts

Supply the ETAs, lead incident numbers, links to where updates will be posted, etc. We will state how regularly we will update the incident; however, until we know what to expect, this is an unknown.

Cross Region Infrastructure in the Platform

| Technology | Information | Replicated? | Replication Region |
| --- | --- | --- | --- |
| S3 | Contains the platforms Terraform state files. | Yes | eu-west-1 |
| KMS | Contains encryption keys for the platform. | No | - |
| Secrets Manager | Contains platform secrets. | Yes | eu-west-1 |
| IAM | Across all regions. | - | - |
| DNS (Route53) | Across all regions. | - | - |
| VPC | No VPCs currently exist for DR, only reserved. | - | - |
| Firewall | Not cross region. | | |
| Transit Gateway | Not cross region. | | |
| NACLS | Not cross region. | | |
| Subnets | Not cross region. | | |
| Dynamo DB | modernisation-platform-terraform-state-lock table | No, but can be replicated | TBC |
| ACM Private Certificate Authority | Private certificate authorities (CAs) are Regional resources. To use private CAs in more than one Region, you must create your CAs in those Regions. You cannot copy private CAs between Regions | No | TBC |

Repo / Module List

Below are the repositories owned by the Modernisation Platform. The next step will be to create new tags for the current releases with any hardcoded region values changed.

Use the version number stated so these changes are easily recognisable. The versions of these modules will then need to be updated in our Modernisation Platform repository, along with any variables that need changing. This should go in one large PR with many reviewers.
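
Updating the module versions in the main repository could look something like the sketch below. It assumes modules are referenced with a Git source and a `?ref=` pin, which is standard Terraform syntax but may not match how every module is pinned here. The function name `repin_module` and the tag names `v1.0.0` and `v1.0.1-dr` are hypothetical.

```python
import re


def repin_module(tf_text: str, repo: str, new_ref: str) -> str:
    """Point every `source` that references `repo` at `new_ref` (illustrative only)."""
    pattern = re.compile(
        r'(source\s*=\s*"[^"]*/'      # source = "github.com/<org>/
        + re.escape(repo)             # exact repository name
        + r'(?:\.git)?(?://[^"?]*)?'  # optional .git suffix and //subdirectory
        + r'\?ref=)[^"]+(")'          # existing ref, replaced below
    )
    # A function replacement avoids backslash surprises in new_ref.
    return pattern.sub(lambda m: m.group(1) + new_ref + m.group(2), tf_text)


if __name__ == "__main__":
    sample = (
        'module "baselines" {\n'
        '  source = "github.com/ministryofjustice/'
        'modernisation-platform-terraform-baselines?ref=v1.0.0"\n'
        '}\n'
    )
    print(repin_module(sample, "modernisation-platform-terraform-baselines", "v1.0.1-dr"))
```

Because the pattern requires the repository name to be followed directly by the `?ref=` pin, re-pinning one module leaves references to similarly named modules untouched.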

| Module / Repo | Link | Information | Change Required |
| --- | --- | --- | --- |
| Main Repo | https://github.com/ministryofjustice/modernisation-platform | Houses all of the Terraform that builds the platform, many modules will be referenced in here and will need the variables changed. | Variable |
| Environments Repo | https://github.com/ministryofjustice/modernisation-platform-environments | Customers responsibility (although we might need to help) | - |
| Bastion Linux | https://github.com/ministryofjustice/modernisation-platform-terraform-bastion-linux | Used by teams to create a bastion EC2 instance, module tests have a hard coded region but the module itself has a variable when called, no change required to the module. | Region variable |
| ConfigurationManagement | https://github.com/ministryofjustice/modernisation-platform-configuration-management | This repository contains configuration management of the ec2 infrastructure hosted on the ModernisationPlatform. Requires hardcoded eu-west-2 to be changed. | User managed repo |
| Instance Scheduler | https://github.com/ministryofjustice/modernisation-platform-instance-scheduler | A Go lambda function for stopping and starting instances, RDS resources and autoscaling groups. | Not required for DR |
| S3 Bucket | https://github.com/ministryofjustice/modernisation-platform-terraform-s3-bucket | A Terraform module to standardise S3 buckets with sensible defaults. | No, region variable when called. |
| AWS VM Import | https://github.com/ministryofjustice/modernisation-platform-terraform-aws-vm-import | | Region variable |
| SSM Patching | https://github.com/ministryofjustice/modernisation-platform-terraform-ssm-patching | Automated patching module. | Region variable |
| EC2 Autoscaling | https://github.com/ministryofjustice/modernisation-platform-terraform-ec2-autoscaling-group | For ec2 autoscaling | Region variable |
| AMI Builds | https://github.com/ministryofjustice/modernisation-platform-ami-builds | This repository contains the Modernisation Platform AMI build code and workflows. | User managed repo |
| Terraform Baselines | https://github.com/ministryofjustice/modernisation-platform-terraform-baselines | Terraform module for enabling and configuring the MoJ Security Guidance baseline for AWS accounts, alongside some extra reasonable security, identity and compliance services. | Region variable |
| Cross Account Access | https://github.com/ministryofjustice/modernisation-platform-terraform-cross-account-access | A simple Terraform module to configure an IAM role that is assumable from another account. | No |
| DNS Certificates | https://github.com/ministryofjustice/modernisation-platform-terraform-dns-certificates | Template | No |
| Load Balancer | https://github.com/ministryofjustice/modernisation-platform-terraform-loadbalancer | | Region variable |
| GitHub OIDC Role | https://github.com/ministryofjustice/modernisation-platform-github-oidc-role | | No |
| Terraform Member VPC | https://github.com/ministryofjustice/modernisation-platform-terraform-member-vpc | This module creates the member accounts VPC and networking. | No |
| Terraform IAM Superadmins | https://github.com/ministryofjustice/modernisation-platform-terraform-iam-superadmins | This repository holds a Terraform module that creates set IAM accounts and associated configuration, such as: account password policies, administrator groups, user accounts. | No |
| EC2 Instance | https://github.com/ministryofjustice/modernisation-platform-terraform-ec2-instance | Module for an ec2 instance creation. | Region variable |
| GitHub OIDC Provider | https://github.com/ministryofjustice/modernisation-platform-github-oidc-provider | This module allows users to create an OIDC Provider and the associated IAM resources required to make use of the connect provider. | No |
| ECS Cluster | https://github.com/ministryofjustice/modernisation-platform-terraform-ecs-cluster | This repository provides 2 Terraform Modules: Cluster and Service | Region variable |
| Pagerduty | https://github.com/ministryofjustice/modernisation-platform-terraform-pagerduty-integration | | Region variable |
| Terraform Module Template | https://github.com/ministryofjustice/modernisation-platform-terraform-module-template | | No |
| Incident Response | https://github.com/ministryofjustice/modernisation-platform-incident-response | | No |
| Trusted Advisor | https://github.com/ministryofjustice/modernisation-platform-terraform-trusted-advisor | | No |
| Lambda Function | https://github.com/ministryofjustice/modernisation-platform-terraform-lambda-function | | No |

Organisational Service Control Policies

Service control policies (SCPs) are a type of organization policy that you can use to manage permissions in the organization. SCPs offer central control over the maximum available permissions for all accounts in your organization. SCPs help you to ensure your accounts stay within your organization’s access control guidelines.

Currently we can build what is required in eu-west-1. However, if the DR region were to change, we would need to check what the SCPs limit in the new region.
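
One way to check whether an SCP blocks a region is to look for `Deny` statements conditioned on the `aws:RequestedRegion` global condition key. The sketch below is an assumption-heavy illustration: `EXAMPLE_SCP` is a made-up policy, not one of the organisation's real SCPs, and the helper only understands the common `StringNotEquals` pattern with exact-case keys.

```python
import json

# Hypothetical policy for illustration; it is NOT the organisation's real SCP.
EXAMPLE_SCP = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyOutsideAllowedRegions",
      "Effect": "Deny",
      "NotAction": ["iam:*", "route53:*"],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {"aws:RequestedRegion": ["eu-west-1", "eu-west-2"]}
      }
    }
  ]
}
""")


def region_denied(scp: dict, region: str) -> bool:
    """True if a Deny statement blocks requests to `region` (simplified check)."""
    statements = scp.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Deny":
            continue
        condition = stmt.get("Condition", {}).get("StringNotEquals", {})
        allowed = condition.get("aws:RequestedRegion")
        if allowed is None:
            continue
        if isinstance(allowed, str):
            allowed = [allowed]
        if region not in allowed:
            return True
    return False


if __name__ == "__main__":
    for candidate in ("eu-west-1", "eu-central-1"):
        print(candidate, "denied" if region_denied(EXAMPLE_SCP, candidate) else "allowed")
```

A result of "denied" here only means that actions not exempted by `NotAction` would be blocked in that region; the real policies attached to the organisation remain the source of truth.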

Contact AWS

As previously mentioned, contact AWS using the following steps:

  • Raise a ticket, highest priority.
  • Contact our AWS representative for an update on the situation.
  • Organise updates; if the situation changes, our approach might need to change too.

Running the Terraform

Once the modules have been updated and tagged, and the PR that changes any variables or hardcoded mentions of the region in Modernisation-Platform has been merged, we will need to begin the rebuild of the platform, starting with networking. The steps below outline the order to follow.

Platform rebuild - Networking

This is a large undertaking. It has not been tested, only discussed within the team.

Networking Overview

Our Networking Diagram shows a high level view of how shared VPCs are connected.

Our NetworkingApproach discusses and contains a detailed view of our shared VPCs.

Our Transit Gateway and how VPCs access the internet.

Stages

Rather than rebuilding the platform through workflows, we recommend running the steps manually so that issues are easier to debug. Technically, however, you could start to run the account creation process with our GitHub workflows.

Located here, on the left you will see the deployment workflows for all core accounts, which are required for building the platform.

VPCs and Subnets

  1. Clone the modernisation-platform repo, and change the region in the code. Whilst this is largely a find and replace, the modules that are called also need changing: either change them locally and reference the local copies, or change the module itself and push it to GitHub.
  2. Core VPC rebuild. The provider would need to be eu-west-1, so check the permissions you have in that region with the credentials you are using. For example, when getting your credentials from https://moj.awsapps.com/start/, make sure the region is correct.
  3. Core Networking would be the next to run Terraform on, checking the code and modules it uses for region references and changing them.
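
The find and replace in step 1 could be previewed before any file is rewritten. The following is a minimal sketch, assuming a plain string swap from eu-west-2 to eu-west-1 is enough. The helper names `swap_region` and `preview_swap` are illustrative, and a swap like this also rewrites availability zone names such as eu-west-2a, which should be reviewed.

```python
import difflib

OLD_REGION = "eu-west-2"  # London
NEW_REGION = "eu-west-1"  # DR region named in step 2


def swap_region(text: str, old: str = OLD_REGION, new: str = NEW_REGION) -> str:
    """Plain find and replace; also rewrites AZ names such as eu-west-2a."""
    return text.replace(old, new)


def preview_swap(text: str, filename: str = "main.tf") -> str:
    """Return a unified diff of the change so it can be reviewed before writing."""
    updated = swap_region(text)
    diff = difflib.unified_diff(
        text.splitlines(keepends=True),
        updated.splitlines(keepends=True),
        fromfile=filename,
        tofile=f"{filename} (DR)",
    )
    return "".join(diff)


if __name__ == "__main__":
    sample = 'provider "aws" {\n  region = "eu-west-2"\n}\n'
    print(preview_swap(sample))
```

Reviewing the diff first helps catch places where the old region should stay, for example references to resources that remain in eu-west-2.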

With the infrastructure in London no longer available, we would use the original ranges.

Transit Gateway

To rebuild the transit gateway in a new region, first find the code, located here, in core-network-services.

Using the admin role, run a Terraform plan and apply. This hasn’t been done from scratch since the platform’s creation, so expect to debug issues.

Firewall

We make use of AWS Network Firewalls in two ways. First, we use a centralised inspection VPC to control traffic entering and exiting the Modernisation Platform to and from internal Ministry of Justice services. Second, we use inline-inspection in our egress VPCs to control traffic exiting the Modernisation Platform to the internet.

The firewalls are located in the Core-Network-Services account, so DNS, networking and other network-related infrastructure will be down.

The firewall rules can be found here

Live data & non-live data.

Peer the new transit gateway built in eu-west-1 with the MOJ transit gateway that is in eu-west-2.

Account dependencies

| Account | Information |
| --- | --- |
| Core-Shared-Services | Creates the core shared services account resources, AMIs, KMS, S3 and other shared services customers require are based in here. |
| Core-Network-Services | Creates the core networking account resources, and contains the transit gateway Terraform, as well as firewall rules. |
| Core-Logging | Creates the core logging account resources, not essential for operational function but a part of the platform. As well as for troubleshooting, logging is required for audit purposes and therefore should also be prioritised. |
| Core-Security | Creates the core security account resources. |
| Cloud-Platform-Transit-Gateways | Across all regions. |
| Core-VPC-* | Creates the core VPC resources in the VPC accounts. |
| Modernisation-Platform | Creates key resources such as S3 state buckets for the platform. |

Once Terraform has been run against these accounts, member accounts can be next, as everything the platform requires should, technically, now exist.
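
The ordering constraints stated on this page could be sketched as a dependency graph and sorted topologically. This is an illustration only: it encodes just the two constraints the page states (Core-VPC after Core-Network-Services, member accounts after all core accounts), the lowercase account identifiers are assumptions, and Cloud-Platform-Transit-Gateways is left out because it is listed as spanning all regions.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Identifiers mirror the table above; the exact names are assumptions.
CORE_ACCOUNTS = [
    "core-network-services",
    "core-shared-services",
    "core-logging",
    "core-security",
    "core-vpc",
    "modernisation-platform",
]
MEMBER_ACCOUNTS = ["cooker", "example", "sprinkler"]


def rebuild_order() -> list[str]:
    """Return one valid rebuild order for the constraints stated on this page."""
    graph = {name: set() for name in CORE_ACCOUNTS}
    # Core-VPC-* is needed after core-network-services.
    graph["core-vpc"].add("core-network-services")
    # Member accounts come last, once every core account exists.
    for member in MEMBER_ACCOUNTS:
        graph[member] = set(CORE_ACCOUNTS)
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for position, account in enumerate(rebuild_order(), start=1):
        print(position, account)
```

Accounts with no stated dependency may appear in any relative order, so any further ordering the team agrees on would need to be added to the graph explicitly.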

Member Accounts

After the networking has been completed, the core accounts created, and the modernisation platform itself built, the next step is creating member-accounts. Start with cooker, example and sprinkler, building from the mod-platform repo and then applying the Terraform in the environments repo. This will allow us to test the steps required for our customers to follow when recreating their infrastructure from the environments repo.

References

This page was last reviewed on 24 July 2024. It needs to be reviewed again on 24 January 2025 by the page owner #modernisation-platform .