During the Kubernetes outage incident (https://wikitech.wikimedia.org/wiki/Incident_documentation/20190910-toolforge-kubernetes), one of the problems that came up was a lack of documentation around etcd operations.
Deliverables:
- Document the disaster recovery procedure for the v2 etcd nodes
- Document quirks of the existing v2 nodes (such as timeouts) so they are less likely to cloud root-cause analyses
- Document adding and removing nodes from the cluster