- Designate the incident owner
- Designate the incident secretary/communicator
- Create a document for the incident and record in it:
  - summary
  - observations (link to metrics dashboards, using absolute timestamps wherever possible)
  - screenshots
  - links to logs
  - hypotheses/theories
    - who made them
    - when
    - whether they have been validated or invalidated
  - the actions taken
    - by whom
    - whether they had the desired effect
  - etc.
- If the incident was caused by a code regression, revert the change and deploy as soon as possible
- Start by reducing the impact of the incident before searching for a root cause
- When data sources do not agree, cross-check against additional data sources
- Diagram all the systems involved and their relationships to one another to identify where the problem might be
- Test your hypotheses to verify whether they hold
- Develop a procedure over time that can be followed to diagnose similar issues
- Identify the cause of the problem
  - 5 whys
- List potential solutions
- Investigate potential sources of similar problems
- Address the additional sources of risk
- Reduce incident duration
  - Identify the cause of the problem more rapidly
- Reduce incident cost
  - Reduce the number of people involved
- GCP
  - Traffic Director
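
The incident document described in the notes above (summary, observations, hypotheses with author and time, actions with actor and outcome) could be kept as structured data rather than free text. Below is a minimal sketch in Python, not a prescribed format. All class and field names (`IncidentRecord`, `Hypothesis`, `Action`, `open_hypotheses`) are hypothetical and chosen only for illustration. Timestamps are stored as timezone-aware UTC values so that they stay absolute, as the notes recommend.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


def utc_now() -> datetime:
    """Return the current time as an absolute, timezone-aware UTC timestamp."""
    return datetime.now(timezone.utc)


@dataclass
class Hypothesis:
    statement: str
    author: str                                   # who made it
    proposed_at: datetime = field(default_factory=utc_now)  # when
    validated: Optional[bool] = None              # None means not yet tested


@dataclass
class Action:
    description: str
    actor: str                                    # by whom
    taken_at: datetime = field(default_factory=utc_now)
    had_desired_effect: Optional[bool] = None     # None means not yet known


@dataclass
class IncidentRecord:
    owner: str
    communicator: str
    summary: str = ""
    observations: List[str] = field(default_factory=list)  # dashboard, screenshot, log links
    hypotheses: List[Hypothesis] = field(default_factory=list)
    actions: List[Action] = field(default_factory=list)

    def open_hypotheses(self) -> List[Hypothesis]:
        """Hypotheses that have been neither validated nor invalidated yet."""
        return [h for h in self.hypotheses if h.validated is None]
```

A possible usage, with made-up names and events:

```python
record = IncidentRecord(owner="alice", communicator="bob", summary="Elevated 5xx rate")
record.hypotheses.append(Hypothesis("Regression in the latest deploy", author="carol"))
record.actions.append(Action("Reverted the latest deploy", actor="alice"))
print(len(record.open_hypotheses()))  # hypotheses still waiting to be tested
```

Keeping `validated` and `had_desired_effect` as three-state values (`None`, `True`, `False`) makes it easy to see which hypotheses and actions still need follow-up during the incident.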
Every year, either at the beginning or end of the year.
A few hours spread over the course of a few days.
- Review the year according to various facets
- Plan the next year according to the same facets
- I use mind-mapping software to do my yearly review and plan.
Every month, either at the beginning or end of the month.
30 minutes.
- Review what was planned for the month
- Provide feedback related to the plan
- Review the yearly plan and align the monthly plan with it
- Plan next month
- I use a text editor, such as VS Code, to write my monthly reviews.