
Document some etcd cluster operations for Toolforge
Open, Normal, Public

Description

During the Kubernetes outage incident (https://wikitech.wikimedia.org/wiki/Incident_documentation/20190910-toolforge-kubernetes), one of the problems that came up was a lack of documentation around etcd operations.

Deliverables:

  • Document the disaster recovery procedure for the v2 etcd nodes
  • Document quirks of the existing v2 nodes (such as timeouts) so they are less likely to cloud root-cause analyses
  • Document adding nodes to and removing nodes from the cluster

Event Timeline

Bstorm triaged this task as Normal priority. Sep 12 2019, 6:50 PM
Bstorm created this task.
Restricted Application added a subscriber: Aklapper. Sep 12 2019, 6:50 PM
Phamhi added a subscriber: Phamhi. Edited Sep 13 2019, 11:10 AM

I have started the documentation, which can be found here: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Etcd