
Document some etcd cluster operations for Toolforge
Closed, Resolved (Public)

Description

During the Kubernetes outage incident (https://wikitech.wikimedia.org/wiki/Incident_documentation/20190910-toolforge-kubernetes), one of the problems that came up was a lack of documentation around etcd operations.

Deliverables:

  • Document the disaster recovery procedure for the v2 etcd nodes
  • Document quirks of the existing v2 nodes (such as timeouts) so they are less likely to cloud root-cause analyses
  • Document adding nodes to and removing nodes from the cluster