
Incident Process

This document describes our incident handling process for network, operations and security issues.

This document assumes the user has a PagerDuty account and is part of the Modernisation Platform team. To use the in-Slack features, the user must authorise the PagerDuty app in Slack.

Overview:

For ease of use the key steps are documented within the overview, with further details in the sections below.

1. Confirm that the event constitutes an incident

If this incident constitutes a security breach, you must also report it here

2. Declare the incident.

Create a PagerDuty incident (ensuring it is on the external status page) and a Slack channel.

3. Assign roles.

Incident Lead and Scribe are mandatory roles

4. Managing the incident

This includes communication, and who to tell.

5. Fix the problem

6. End the incident

7. Post-incident procedure

Detailed steps follow below:

1. Confirm that the event constitutes an incident

If this incident constitutes a security breach, you must also report it here

We define an incident as an event which:

  • is unplanned, and
  • impacts end users or our direct users (developers and engineers), or
  • degrades user-facing services, or
  • increases risk to production services
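
The definition above can be read as a simple rule: the event must be unplanned, and at least one of the three impact conditions must hold. A minimal sketch in Python, assuming that reading of the "and"/"or" in the list. The `Event` type and its field names are hypothetical and exist only for illustration:

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical description of an event; field names are illustrative."""
    unplanned: bool
    impacts_users: bool              # end users or direct users (developers and engineers)
    degrades_user_facing: bool       # degrades user-facing services
    increases_production_risk: bool  # increases risk to production services


def is_incident(event: Event) -> bool:
    """Assumed reading: unplanned AND (impacts users OR degrades OR increases risk)."""
    has_impact = (
        event.impacts_users
        or event.degrades_user_facing
        or event.increases_production_risk
    )
    return event.unplanned and has_impact
```

Under this reading, planned maintenance is never an incident, even if it degrades a service.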

If this event does not constitute an incident, the appropriate response is probably to raise an issue in our GitHub repository. However, if the issue is security related, raise it here

Once you are confident that you have an incident, declare it as such.

2. Declare the incident

If an alarm has gone off, it will already have created an incident. In that case, skip this step and use the incident already created in PagerDuty.

From the #modernisation-platform channel in Slack you can use the PagerDuty Slack tool to declare an incident.

/pd trigger

Type /pd help to see other available PagerDuty commands, although most likely they will not be useful here.

Declaring the incident will launch a form for you to complete. An example of the Slack form UI:

[Image: Create a new PagerDuty incident]

Impacted Service - choose from the following depending on the type of incident:

  • Networking - Modernisation Platform
  • Operations - Modernisation Platform
  • Security - Modernisation Platform

Which Priority to assign?

We only use P1, P2 or P3

Priority | Description
P1 | The whole platform is down or unavailable, all user applications are unavailable
P2 | Part of the platform is down or unavailable, some user applications are impacted or unavailable
P3 | Part of the platform is down or unavailable, user applications are still available
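
The priority descriptions can be read as a two-question decision. A minimal sketch, assuming the only inputs are whether the whole platform is down and whether any user applications are impacted. The function and parameter names are hypothetical, and real incidents may need judgement that this does not capture:

```python
def choose_priority(whole_platform_down: bool, user_apps_impacted: bool) -> str:
    """Map the two questions onto P1, P2 or P3 as described in the table."""
    if whole_platform_down:
        # Whole platform down: all user applications are unavailable.
        return "P1"
    if user_apps_impacted:
        # Part of the platform down and some user applications impacted.
        return "P2"
    # Part of the platform down, user applications still available.
    return "P3"
```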

Tick the Create a dedicated Public Slack channel for this incident box. You should use this channel to manage the incident.

Click the Create button once you are happy with the form. This will create a new incident in PagerDuty and update the PagerDuty Status Page. The PagerDuty Status Page Slack integration will then automatically post the status in the #ask-modernisation-platform and #modernisation-platform-update channels.

The newly created Slack channel will also be available for record keeping and collaborating on the incident.

[Image: PagerDuty incident Slack channel]

At your discretion, you may also wish to notify users now via #ask-modernisation-platform and/or #modernisation-platform-update. However, please ensure that this does not replace the Status Page update (see section 4.3).

3. Assign roles

The two roles which must be filled for every incident are the Incident Lead and the Scribe.

In rare cases, the same person might fill both roles, but this is discouraged because it generally leads to poor record keeping.

To fill these roles, ask for volunteers from the team, either verbally or via #modernisation-platform. In the unlikely event that you don’t get any volunteers, appoint someone.

3.1 Incident Lead

Responsibilities:

  • coordinate our response to the incident
  • decide on any additional roles required (e.g. a communications lead may be required)
  • ensure that all required roles are filled
  • if no communications lead is required, the Incident Lead is responsible for communicating
  • ensure that all tasks which need to be handled are being done
  • make the final decision whenever we need to choose a course of action
  • set the schedule for any regular team check-ins, if those are deemed necessary
  • declare the incident closed, when appropriate
  • ensure that the post-incident process is followed

The Incident Lead needs to ensure that things are being done, not try to do everything themselves.

3.2 Scribe

Anyone can make notes on the incident, but one person (and that could be the lead) should make sure the incident is documented.

Responsibilities:

The scribe is responsible for keeping a log of the incident, including:

  • important events
  • discussion topics
  • decisions
  • actions
  • results of actions/investigations

This log is not intended to be a verbatim transcript of discussions. Rather, it should contain entries like “xxx suggested the disk might be full. yyy to investigate and report back”.
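
One way to keep log entries consistent is to give each one a timestamp and a kind before pasting it into the incident as a note. A minimal sketch, assuming a one-line text format. The `LogEntry` type, the kind labels and the output format are all hypothetical, and PagerDuty does not require any of them:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Entry kinds taken from the list above; the short labels are illustrative.
KINDS = {"event", "discussion", "decision", "action", "result"}


@dataclass
class LogEntry:
    timestamp: datetime
    kind: str
    text: str


def format_note(entry: LogEntry) -> str:
    """Render one log entry as a single line of text."""
    if entry.kind not in KINDS:
        raise ValueError(f"unknown entry kind: {entry.kind}")
    return f"{entry.timestamp:%Y-%m-%d %H:%M} [{entry.kind}] {entry.text}"


# Example usage with an illustrative timestamp.
entry = LogEntry(
    datetime(2024, 12, 18, 10, 30, tzinfo=timezone.utc),
    "action",
    "yyy to investigate and report back",
)
print(format_note(entry))
```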

4. Managing the incident

4.1 Communicating

As well as communicating with our user base, the following people should be informed for a P1 or P2:

  • Head of Platforms and Architecture

  • Head of Hosting

  • Product Manager/Delivery Manager
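
The rule above can be sketched as a lookup on priority. This is a minimal illustration with hypothetical names. The source only states who to inform for P1 and P2, so the empty result for P3 is an assumption meaning "nobody from this list", not "nobody at all":

```python
from typing import List

# Roles listed above for P1 and P2 incidents.
SENIOR_CONTACTS = [
    "Head of Platforms and Architecture",
    "Head of Hosting",
    "Product Manager/Delivery Manager",
]


def people_to_inform(priority: str) -> List[str]:
    """Return the roles to inform in addition to the user base."""
    if priority in ("P1", "P2"):
        return list(SENIOR_CONTACTS)
    # Assumption: the source gives no list for P3.
    return []
```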

4.2 Recording notes

Entries on the incident can be created from Slack using the Add a Note action on the incident post in the dedicated incident Slack channel:

[Image: PagerDuty incident actions in Slack]

or via the PagerDuty incident page by clicking the + Add Note button:

[Image: PagerDuty incident page management]

Note: other incident actions are also listed in the Slack UI, e.g. Assign Roles. However, not all of them are currently enabled (some features cost extra), so this information may need to be recorded manually through the Add a Note action instead.

Similarly, these features will not be available on the incident page.

4.3 Updating the External Status on PagerDuty

When an incident is raised, an update will become pending on the Modernisation Platform external status page (the internal status will be automatically updated).

An email will be sent to the Modernisation Platform team informing them of the pending update.

To publish the update to the external status page, click the link in the email or navigate to Status and External Status page in PagerDuty.

Fill in the details for the update and publish it. This will also post an update to the #ask-modernisation-platform and #modernisation-platform-update Slack channels.

It is important to use the External Status page, as this feeds into other services in PagerDuty that depend on the Modernisation Platform.

4.4 Transferring roles

It may be necessary to transfer roles from one team member to another, e.g. during long-running incidents. In this case, it is the responsibility of whoever is in a role to ensure that someone else takes it over.

Whoever assumes a role should announce it in the incident Slack channel (or thread, if the channel was not created), so that the team is aware.

5. Fix the problem

Please bear in mind that not every incident requires the whole team to be involved (even if they all want to join in).

Log a support ticket if necessary

If the incident cannot be resolved within the team, or if the issue lies with a 3rd party, log a support ticket with the 3rd party. For AWS support, log a call in the affected AWS account.

3rd Party | How to log a support ticket | Escalation process
AWS | Creating a support case | On the case, or post in #ext-aws

6. End the incident

The incident is resolved once the user is no longer facing issues. This may be a temporary fix, in which case an issue should be created to put a permanent fix in place.

Resolve the incident via Slack or PagerDuty with a note on the resolution.

Update the external PagerDuty status page with the resolution.

This marks the official end of the incident.

7. Post-incident procedure

After the incident is resolved:

  • A new incident report should be created and stored in the Modernisation Platform > Incidents drive directory. See an example of such a report. The notes recorded in PagerDuty during the incident can be used for the incident timeline.
  • A blameless post mortem meeting should be scheduled to identify any processes that need to be improved.
  • A runbook for how to fix this issue should be published.
This page was last reviewed on 18 December 2024. It needs to be reviewed again on 18 March 2025 by the page owner #modernisation-platform .