Incident Process
This document describes our incident handling process for network, operations and security issues.
This document assumes the user has a PagerDuty account and is part of the Modernisation Platform team. To use the in-Slack features, the user must authorise the PagerDuty app in Slack.
1. Confirm that the event constitutes an incident
If this incident constitutes a security breach, you must also report it here
We define an incident as an event which:
- is unplanned, and
- impacts end users or our direct users (developers and engineers), or
- degrades user-facing services, or
- increases risk to production services
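The definition above can be read as a simple predicate. The sketch below is illustrative only: it assumes the intended reading is "unplanned, and at least one of the three effects applies", and the `Event` class and its field names are hypothetical, not part of any tooling we use.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical description of an event, mirroring the criteria above."""
    unplanned: bool
    impacts_users: bool = False                  # end users or direct users
    degrades_user_facing_services: bool = False
    increases_production_risk: bool = False


def is_incident(event: Event) -> bool:
    """Assumed reading: unplanned AND at least one of the listed effects."""
    has_effect = (
        event.impacts_users
        or event.degrades_user_facing_services
        or event.increases_production_risk
    )
    return event.unplanned and has_effect
```

Under this reading, a planned maintenance window that affects users is not an incident, and neither is an unplanned event with none of the listed effects.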
If this event does not constitute an incident, the appropriate response is probably to raise an issue in our GitHub repository. However, if the issue is security related, raise it here
Once you are confident that you have an incident, declare it as such.
2. Declare the incident
If an alarm has gone off, it will already have created an incident. In that case, skip this step and use the incident already created in PagerDuty.
From the #modernisation-platform channel in Slack you can use the PagerDuty Slack tool to declare an incident.
/pd trigger
Type
/pd help
to see other available PagerDuty commands, although most are unlikely to be useful here.
Declaring the incident launches a Slack form for you to complete:
Impacted Service - choose from the following depending on the type of incident:
- Networking - Modernisation Platform
- Operations - Modernisation Platform
- Security - Modernisation Platform
Which Priority to assign?
Priority | Description |
---|---|
P1 | The whole platform is down or unavailable, all user applications are unavailable |
P2 | Part of the platform is down or unavailable, some user applications are impacted or unavailable |
P3 | Part of the platform is down or unavailable, user applications are still available |
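As a rough aid, the priority table can be expressed as a small decision function. This is a sketch under assumptions: the two boolean inputs are a hypothetical simplification of the descriptions in the table, and the final choice of priority remains a judgement call for whoever declares the incident.

```python
from enum import Enum


class Priority(Enum):
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"


def suggest_priority(whole_platform_down: bool, user_apps_impacted: bool) -> Priority:
    """Suggest a priority from the table above (hypothetical inputs)."""
    if whole_platform_down:
        # Whole platform down: all user applications are unavailable.
        return Priority.P1
    if user_apps_impacted:
        # Part of the platform down and some user applications affected.
        return Priority.P2
    # Part of the platform down, user applications still available.
    return Priority.P3
```

For example, `suggest_priority(whole_platform_down=False, user_apps_impacted=True)` suggests P2.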
Tick the Create a dedicated Public Slack channel for this incident box. You should use this channel to manage the incident.
Click the Create button once you are happy with the form. This creates a new incident in PagerDuty and updates the PagerDuty Status Page. The PagerDuty Status Page Slack integration then automatically posts the status in the #ask-modernisation-platform and #modernisation-platform-update channels.
The newly created Slack channel will also be available for record keeping and collaborating on the incident.
At your discretion, you may also wish to notify users now via #ask-modernisation-platform and/or #modernisation-platform-update. However, please ensure that this does not replace the Status Page update (see later in this guidance).
3. Assign roles
The two roles which must be filled for every incident are the Incident Lead and the Scribe.
In rare cases, the same person might fill both roles, but this is discouraged because it generally leads to poor record keeping.
To fill these roles, ask for volunteers from the team, either verbally or via #modernisation-platform. In the unlikely event that you don’t get any volunteers, appoint someone.
3.1 Incident Lead
Responsibilities:
- coordinate our response to the incident
- decide on any additional roles required (e.g. a communications lead may be required)
- ensure that all required roles are filled
- ensure that all tasks which need to be handled are being done
- make the final decision whenever we need to choose a course of action
- set the schedule for any regular team check-ins, if those are deemed necessary
- declare the incident closed, when appropriate
- ensure that the post-incident process is followed
The incident lead needs to ensure that things are being done, not try to do everything themselves.
3.2 Scribe
Anyone can make notes on the incident, but one person (and that could be the lead) should make sure the incident is documented.
Responsibilities:
The scribe is responsible for keeping a log of the incident, including:
- important events
- discussion topics
- decisions
- actions
- results of actions/investigations
This log is not intended to be a verbatim transcript of discussions. Rather, it should capture entries like “xxx suggested the disk might be full. yyy to investigate and report back”.
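One way to keep log entries consistent is to give each one a kind, a short summary and, where relevant, an owner. The structure below is a hypothetical sketch for illustration; `LogEntry`, `EntryKind` and `format_entry` are assumed names, and the resulting text would still be recorded through the normal incident notes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class EntryKind(Enum):
    """The kinds of entry listed in the scribe's responsibilities."""
    EVENT = "event"
    DISCUSSION = "discussion"
    DECISION = "decision"
    ACTION = "action"
    RESULT = "result"


@dataclass
class LogEntry:
    kind: EntryKind
    summary: str
    owner: Optional[str] = None  # who is following up, if anyone
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def format_entry(entry: LogEntry) -> str:
    """Render an entry as a single line suitable for pasting into a note."""
    line = f"{entry.timestamp:%H:%M} UTC [{entry.kind.value}] {entry.summary}"
    if entry.owner:
        line += f" (owner: {entry.owner})"
    return line
```

For instance, an `ACTION` entry with the summary "Check whether the disk is full" and owner "yyy" renders as one short line with the time, kind, summary and owner, which is the level of detail the log is aiming for.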
4. Managing the incident
4.1 Recording notes
Notes on the incident can be added from Slack using the Add a Note action on the incident post in the dedicated incident Slack channel, or via the PagerDuty incident page by clicking the + Add Note button.
Note: other incident actions, e.g. Assign Roles, are also listed in the Slack UI, but not all of them are currently enabled (some features cost extra). This information may therefore need to be recorded manually through the Add a Note action instead.
Similarly, these features will not be available on the incident page.
4.2 Updating the External Status on PagerDuty
When an incident is raised, an update will become pending on the Modernisation Platform external status page (the internal status will be automatically updated).
An email will be sent to the Modernisation Platform team informing them of the pending update.
To publish the update to the external status page, click the link in the email or navigate to Status and External Status page in PagerDuty.
Fill in the details for the update and publish it. This will also post an update to the #ask-modernisation-platform and #modernisation-platform-update Slack channels.
It is important to use the External Status page as this feeds into other services in PagerDuty that depend on the Modernisation Platform.
4.3 Transferring roles
It may be necessary to transfer roles from one team member to another, e.g. during long-running incidents. In this case, it is the responsibility of whoever is in a role to ensure that someone else takes it over.
Whoever assumes a role should announce it in the incident Slack channel (or thread if the channel was not created), so that the team is aware.
5. Fix the problem
Please bear in mind that not every incident requires the whole team to be involved (even if they all want to join in).
Log a support ticket if necessary
If the incident cannot be resolved within the team, or if the issue lies with a 3rd party, log a support ticket with the 3rd party. For AWS support, log a call in the AWS account affected.
3rd Party | How to log a support ticket | Escalation process |
---|---|---|
AWS | Creating a support case | On the case, or post in #ext-aws |
6. End the incident
The incident is resolved once the user is no longer facing issues. This may be a temporary fix, in which case an issue should be created to put a permanent fix in place.
Resolve the incident via Slack or PagerDuty with a note on the resolution.
Update the external PagerDuty status page with the resolution.
This marks the official end of the incident.
7. Post-incident procedure
After the incident is resolved:
- A new incident report should be created and stored in the Modernisation Platform > Incidents drive directory. See an example of such a report. The notes generated in PagerDuty during incident management can be used for the incident timeline.
- A blameless post mortem meeting should be scheduled to identify any processes that need to be improved.
- A runbook for how to fix this issue should be published.