Troubleshooting/postmortem analysis culture
In this section, we will learn about troubleshooting and postmortem analysis culture.
For platform providers that offer a wide range of services to a wide range of users, fully public postmortems such as these make sense. But even if the impact of your outage isn’t as broad, if you are practising SRE it can still make sense to share postmortems with the customers who have been directly impacted.
This is the position we take on the Google Cloud Platform (GCP) Customer Reliability Engineering (CRE) team. To help customers run reliably on GCP, we teach them how to engineer increased reliability for their service by implementing SRE best practices in our work together. We identify and quantify architectural and operational risks to each customer’s service, and work with them to mitigate those risks and sustain system reliability at their SLO.
Specifically, the CRE team works with each customer to help them meet the availability target expressed by their SLOs. For this, the principal steps are to:
- Firstly, define a comprehensive set of business-relevant SLOs
- Secondly, get the customer to measure compliance with those SLOs in their monitoring platform, i.e., how much of the service error budget has been consumed (see the sketch following this list)
- Thirdly, share that live SLO information with Google support and product SRE teams (which we term shared monitoring)
- Lastly, jointly monitor and react to SLO breaches with the customer (shared operational fate)
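As an illustration of the second step, here is a minimal sketch of how error-budget consumption might be computed from request counts. The 99.9% availability target and the request counters are assumptions for illustration only, not a particular monitoring platform’s API.

```python
# Minimal sketch: SLO compliance expressed as error-budget consumption.
# The 99.9% availability target and the request counts below are
# illustrative assumptions; substitute your own SLO and metrics source.

SLO_TARGET = 0.999                 # 99.9% of requests should succeed
ERROR_BUDGET = 1 - SLO_TARGET      # fraction of requests allowed to fail

def error_budget_consumed(good_requests: int, total_requests: int) -> float:
    """Fraction of the error budget consumed over the measurement window."""
    if total_requests == 0:
        return 0.0
    failure_ratio = 1 - good_requests / total_requests
    return failure_ratio / ERROR_BUDGET

# Example: 15,000 failed requests out of 10,000,000 against a 0.1% budget.
consumed = error_budget_consumed(good_requests=9_985_000,
                                 total_requests=10_000_000)
print(f"Error budget consumed: {consumed:.0%}")   # prints "150%": SLO breached
```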
Foundations of an external postmortem
Analyzing outages and writing them up in a postmortem benefits from having a two-way flow of monitoring data between the platform operator and the service owner. This data provides an objective measure of the external impact of the incident: when did it start, how long did it last, and how severe was it?
Further, based on the monitoring data from the service owner and their own monitoring, the platform team can write their postmortem following the standard practices and our postmortem template. This results in an internally reviewed document that gives the canonical view of the incident timeline, the scope and magnitude of impact, and a set of prioritized actions to reduce the probability of recurrence, reduce the expected impact, improve detection, and/or recover from the incident more quickly.
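To make the shape of such a document concrete, here is a minimal sketch of its core fields as a Python data structure. The class and field names are illustrative assumptions and do not reproduce any particular postmortem template.

```python
# Minimal sketch of the core fields of a postmortem document, as described
# above. Names are illustrative assumptions, not an actual template.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ActionItem:
    description: str
    priority: str    # e.g. "P0", "P1"
    goal: str        # "prevent recurrence", "reduce impact", "improve detection", "recover faster"

@dataclass
class Postmortem:
    title: str
    incident_start: datetime
    incident_end: datetime
    timeline: list[str] = field(default_factory=list)    # canonical incident timeline
    impact_summary: str = ""                              # scope and magnitude of impact
    action_items: list[ActionItem] = field(default_factory=list)  # prioritized follow-ups
```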
Selecting an audience for your external postmortem
- Firstly, if your customers have defined SLOs, they know how badly the incident affected them. Generally, the greater the error budget consumed by the incident, the more interested they are in the details, and the more important it will be to share the postmortem with them. They’re also more likely to be able to give relevant feedback on the postmortem about the scope, timing and impact of the incident.
- Secondly, if your customer’s SLOs weren’t violated but this problem still affected their customers, that’s an action item for the customer’s own postmortem: what changes need to be made to either the SLO or its measurements?
- Thirdly, if your customer doesn’t have SLOs that represent the end-user experience, it’s difficult to make an objective call. Unless there are obvious reasons why the incident disproportionately affected a particular customer, you should probably default to a more generic incident report.
- Lastly, if the outage has impacted most of your customers, then you should consider whether the externalized postmortem might be the basis for writing a public postmortem or incident report, like the examples we quoted above. (A rough sketch of this decision logic follows the list.)
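The considerations above can be summarised as a rough decision sketch. The inputs and report labels below are illustrative assumptions, not a prescriptive policy.

```python
# Rough sketch of the audience-selection considerations described above.
# The inputs and report labels are illustrative assumptions.

def choose_report_type(has_end_user_slos: bool,
                       slo_violated: bool,
                       most_customers_affected: bool) -> str:
    if most_customers_affected:
        return "consider a public postmortem or incident report"
    if not has_end_user_slos:
        return "generic incident report"
    if slo_violated:
        return "share an external postmortem with the affected customer"
    return "no external postmortem needed; the customer reviews its own SLOs"
```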
Deciding how much to share, and why
Another question when writing external postmortems is how deep to get into the weeds of the outage. At one end of the spectrum you might share your entire internal postmortem with a minimum of redaction; at the other you might write a short incident summary. This is a tricky issue that we’ve debated internally.
The two factors we believe to be most important in determining whether to expose the full detail of a postmortem to a customer, rather than just a summary, are:
- Firstly, how important are the details to understanding how to defend against a future recurrence of the event?
- Secondly, how badly did the event damage their service, i.e., how much error budget did it consume? (A minimal sketch combining these two factors follows.)
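A minimal sketch of how these two factors might combine, assuming a hypothetical 50% error-budget threshold:

```python
# Sketch: full (redacted) postmortem vs. short summary, based on the two
# factors above. The 50% error-budget threshold is an illustrative assumption.

def share_full_detail(details_needed_to_defend: bool,
                      error_budget_consumed: float) -> bool:
    """True if the full redacted postmortem is worth sharing, else a summary."""
    return details_needed_to_defend or error_budget_consumed >= 0.5
```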
Postmortems should never include these three things:
- Firstly, names of humans. Rather than “John Smith accidentally kicked over a server”, say “a network engineer accidentally kicked over a server.” Internally, we try to refer to people by role rather than by name; this helps us keep a blameless postmortem culture.
- Secondly, names of internal systems. The names of your internal systems are not clarifying for your users and create a burden on them to discover how these things fit together. For example, even though we’ve discussed Chubby externally, we still refer to it in external postmortems as “our globally distributed lock system.”
- Lastly, customer-specific information. The internal version of your postmortem will likely say things like “on XX:XX, Acme Corp filed a support ticket alerting us to a problem.” It’s not your place to share this kind of detail externally, as it may create an undue burden for the reporting company (in this case, Acme Corp). Rather, simply say “on XX:XX, a customer filed…”. If you reference more than one customer, just label them Customer A, Customer B, etc. (A simple redaction sketch follows.)
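As a minimal illustration of these three rules, the sketch below applies mechanical substitutions to an internal draft before it is shared. The name and system mappings are assumptions for illustration; a human reviewer should still check the result.

```python
# Sketch: a mechanical first pass at redacting an internal postmortem draft.
# The mappings below are illustrative assumptions; always have a human
# review the output before sharing it externally.

PEOPLE_TO_ROLES = {"John Smith": "a network engineer"}
INTERNAL_SYSTEMS = {"Chubby": "our globally distributed lock system"}

def redact(text: str, customers: list[str]) -> str:
    for name, role in PEOPLE_TO_ROLES.items():
        text = text.replace(name, role)
    for system, description in INTERNAL_SYSTEMS.items():
        text = text.replace(system, description)
    if len(customers) == 1:
        text = text.replace(customers[0], "a customer")
    else:
        # Label customers as Customer A, Customer B, ... in the order given.
        for index, customer in enumerate(customers):
            text = text.replace(customer, f"Customer {chr(ord('A') + index)}")
    return text

draft = "On XX:XX, Acme Corp filed a support ticket; John Smith restarted Chubby."
print(redact(draft, customers=["Acme Corp"]))
# -> On XX:XX, a customer filed a support ticket; a network engineer
#    restarted our globally distributed lock system.
```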
Reference: Google Documentation