Incident Process
This document describes our incident handling process.
This document assumes the user has a PagerDuty account and is part of the Modernisation Platform team. To use the in-Slack features, the user must authorise the PagerDuty app in Slack.
1. Confirm that the event constitutes an incident
If this is a security breach, also report it here
We define an incident as an event which:
- is unplanned, and
- impacts end users or our direct users (developers and engineers), or
- degrades user-facing services, or
- increases risk to production services
If this event does not constitute an incident, the appropriate response is probably to raise an issue in our GitHub repository.
Once you are confident that you have an incident, declare it as such.
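The definition above can be read as a simple check. The following is a minimal sketch in Python, not part of our tooling. It assumes the criteria group as "unplanned, and at least one of the three impact conditions", which is one reading of the list. The `Event` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical description of an event, used only for illustration."""
    unplanned: bool
    impacts_users: bool              # end users or direct users (developers and engineers)
    degrades_services: bool          # user-facing services are degraded
    increases_production_risk: bool  # risk to production services has increased


def is_incident(event: Event) -> bool:
    """Return True if the event is unplanned AND meets at least one impact criterion.

    Assumption: the "and ... or ... or" list is grouped this way.
    """
    return event.unplanned and (
        event.impacts_users
        or event.degrades_services
        or event.increases_production_risk
    )
```

Under this reading, planned maintenance that degrades a service is not an incident, and neither is an unplanned event with no impact on users, services or production risk.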
2. Declare the incident
If an alarm has gone off, it will already have created an incident. In that case, skip this step and use the incident already created in PagerDuty.
From the #modernisation-platform channel in Slack you can use the PagerDuty Slack tool to declare an incident.
/pd trigger
Declaring the incident will launch a form for you to complete. Please use a meaningful title and add as much information to the description as you can.
Impacted services - choose from the following depending on the type of incident:
- Networking - Modernisation Platform
- Operations - Modernisation Platform
- Security - Modernisation Platform
Which Priority to assign?
Priority | Description |
---|---|
P1 | The whole platform is down or unavailable, all user applications are unavailable |
P2 | Part of the platform is down or unavailable, some user applications are impacted or unavailable |
P3 | Part of the platform is down or unavailable, user applications are still available |
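The priority table can be expressed as a small decision function. This is an illustrative sketch only. The function name and its two boolean inputs are hypothetical, and it assumes that at least part of the platform is already known to be down or unavailable.

```python
def choose_priority(whole_platform_down: bool, user_apps_impacted: bool) -> str:
    """Map the state of the platform to a priority, following the table above.

    Assumption: the caller has already established that at least part of
    the platform is down or unavailable.
    """
    if whole_platform_down:
        # Whole platform down: all user applications are unavailable.
        return "P1"
    if user_apps_impacted:
        # Part of the platform down, some user applications impacted.
        return "P2"
    # Part of the platform down, user applications still available.
    return "P3"
```

For example, `choose_priority(whole_platform_down=False, user_apps_impacted=True)` returns `"P2"`.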
Select the option to create a dedicated public Slack channel for the incident. You should use this channel to manage the incident.
At your discretion, you may also wish to notify users now via ask-modernisation-platform and/or modernisation-platform-update. Alternatively, you may prefer to leave this until later in the process, when more information is available.
3. Assign roles
The two roles which must be filled for every incident are the Incident Lead and the Scribe.
In rare cases, the same person might fill both roles, but this is discouraged because it generally leads to poor record keeping.
To fill these roles, ask for volunteers from the team, either verbally or via #modernisation-platform. In the unlikely event that you don’t get any volunteers, appoint someone.
3.1 Incident Lead
Responsibilities:
- coordinate our response to the incident
- decide on any additional roles required (e.g. a communications lead may be required)
- ensure that all required roles are filled
- ensure that all tasks which need to be handled are being done
- make the final decision whenever we need to choose a course of action
- set the schedule for any regular team check-ins, if those are deemed necessary
- declare the incident closed, when appropriate
- ensure that the post-incident process is followed
The incident lead needs to ensure that things are being done, not try to do everything themselves.
3.2 Scribe
Anyone can make notes on the incident, but one person (who could be the lead) should make sure the incident is documented.
Responsibilities:
The scribe is responsible for keeping a log of the incident, including:
- important events
- discussion topics
- decisions
- actions
- results of actions/investigations
This log is not intended to be a verbatim transcript of discussions. Rather, it should contain entries like “xxx suggested the disk might be full. yyy to investigate and report back.”
Entries on the incident can be created from Slack using the “Add Note” option on the Incident or via PagerDuty.
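To keep entries consistent, the scribe might draft each note in a fixed shape before adding it to the incident. The sketch below is a hypothetical helper, not a PagerDuty or Slack API. The `EntryKind` values mirror the list of things to log above, and the timestamp format is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class EntryKind(Enum):
    """Kinds of log entry, taken from the scribe's list of things to record."""
    EVENT = "event"
    DISCUSSION = "discussion"
    DECISION = "decision"
    ACTION = "action"
    RESULT = "result"


@dataclass(frozen=True)
class LogEntry:
    """One short note in the incident log (hypothetical structure)."""
    kind: EntryKind
    note: str
    # Defaults to the current time in UTC when the entry is created.
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def render(self) -> str:
        """Format the entry as a single line suitable for pasting as a note."""
        return f"{self.at:%Y-%m-%d %H:%M} UTC [{self.kind.value}] {self.note}"
```

A call such as `LogEntry(EntryKind.ACTION, "yyy to investigate disk usage and report back").render()` produces one timestamped line that can be pasted into the “Add Note” dialog.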
3.3 Updating the External Status on PagerDuty
When an incident is raised, an update will become pending on the Modernisation Platform external status page (the internal status will be updated automatically).
An email will be sent to the Modernisation Platform team informing them of the pending update.
To publish the update to the external status page, click the link in the email or navigate to the Status and External Status page in PagerDuty.
Fill in the details for the update and publish it.
Transferring roles
It may be necessary to transfer roles from one team member to another, e.g. during long-running incidents. In this case, it is the responsibility of whoever is in a role to ensure that someone else takes it over.
Whoever assumes a role should announce it in the incident Slack thread, so that the team is aware.
4. Fix the problem
Please bear in mind that not every incident requires the whole team to be involved (even if they all want to join in).
Log a support ticket if necessary
If the incident cannot be resolved within the team, or if the issue lies with a 3rd party, log a support ticket with the 3rd party. For AWS support, log a call in the AWS account affected.
3rd Party | How to log a support ticket | Escalation process |
---|---|---|
AWS | Creating a support case | On the case, or post in #ext-awssupport |
5. End the incident
The incident is resolved once the user is no longer facing issues. This may be a temporary fix, in which case an issue should be created to put a permanent fix in place.
Resolve the incident via Slack or PagerDuty with a note on the resolution.
Update the external PagerDuty status page with the resolution.
This marks the official end of the incident.
6. Post-incident procedure
After the incident is resolved:
- A blameless post-mortem meeting should be scheduled to identify any processes that need to be improved
- A runbook for how to fix this issue should be published