After doing some testing, Erik has a rough recovery plan (from https://phabricator.wikimedia.org/T295478#7501154):
- Deploy elasticsearch-repository-swift plugin to eqiad and codfw clusters
- Configure both clusters to connect to ms-fe.svc.eqiad.wmnet (swift)
- Snapshot the existing commonswiki_file index from the codfw cluster to swift, taking note of the snapshot start time
- Restore the snapshot from swift to the eqiad cluster
- Run the CirrusSearch downtime catchup procedure against eqiad for the period between the snapshot start time noted above and the point at which the cluster stops failing writes to the commonswiki index
- Undeploy elasticsearch-repository-swift from all clusters
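
For reference, a minimal sketch of the requests the snapshot and restore steps above would issue through the standard Elasticsearch snapshot API. It only builds (method, path, body) tuples and sends nothing. The repository name, snapshot name, the `swift` repository type string, and the `swift_url` / `swift_container` setting keys are assumptions for illustration and should be checked against the plugin build that actually gets deployed.

```python
from datetime import datetime, timezone

REPO = "commons_recovery"        # hypothetical repository name
SNAPSHOT = "commonswiki_file_1"  # hypothetical snapshot name
INDEX = "commonswiki_file"       # index named in the plan


def register_repo(swift_url, container):
    """Request that registers a swift-backed snapshot repository."""
    # Assumed setting keys; verify against the plugin's README.
    settings = {"swift_url": swift_url, "swift_container": container}
    return ("PUT", f"/_snapshot/{REPO}", {"type": "swift", "settings": settings})


def create_snapshot():
    """Request that snapshots only the one index, without cluster state."""
    body = {"indices": INDEX, "include_global_state": False}
    return ("PUT", f"/_snapshot/{REPO}/{SNAPSHOT}", body)


def restore_snapshot():
    """Request that restores the same index from the snapshot."""
    body = {"indices": INDEX, "include_global_state": False}
    return ("POST", f"/_snapshot/{REPO}/{SNAPSHOT}/_restore", body)


def recovery_plan(swift_url, container, now=None):
    """Group the requests per cluster and record the snapshot start time."""
    started = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "snapshot_started_at": started,  # lower bound for the catchup window
        "codfw": [register_repo(swift_url, container), create_snapshot()],
        "eqiad": [register_repo(swift_url, container), restore_snapshot()],
    }


if __name__ == "__main__":
    # Placeholder URL and container, not real endpoints.
    plan = recovery_plan("https://swift.example", "search-snapshots")
    for cluster in ("codfw", "eqiad"):
        for method, path, body in plan[cluster]:
            print(cluster, method, path, body)
```

The recorded `snapshot_started_at` value is what the catchup step would use as the start of its window.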
Some related notes:
- elasticsearch-repository-swift was never released for 6.5.4. I ended up taking the last commit targeting 6.6.0 and compiling it against 6.5.4 (set elasticsearchVersion = 6.5.4, and change gradle from 5 to 4.1). What process should we follow to include this in the plugins .deb, given that we are no longer the upstream here?
- Should we have a separate auth setup in swift for cirrussearch snapshots?
- By default, snapshot backup/restore is limited to 20MB/s per partition. Since commonswiki has 32 partitions, the cluster will limit itself to 32 × 20MB/s = 640MB/s, or just over 5 gigabits/s. I suspect this is a bit excessive for the swift cluster, or at least more than double its typical network traffic. What would a more appropriate limit be? @fgiunchedi
- During or after the restore of the snapshot, we will likely need to manually assign the commonswiki_file and commonswiki aliases to the restored index.
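
A small sketch of the throughput arithmetic and of the alias assignment from the notes above. The 20MB/s and 32-partition figures come from the note; the 1 gigabit/s target and the restored index name are made-up placeholders, not proposals. The alias body follows the standard Elasticsearch `POST /_aliases` format.

```python
def aggregate_mb_s(per_partition_mb_s, partitions):
    """Cluster-wide throughput if every partition runs at its limit."""
    return per_partition_mb_s * partitions


def mb_s_to_gbit_s(mb_s):
    """Convert megabytes/s to gigabits/s (decimal units)."""
    return mb_s * 8 / 1000


def per_partition_limit_mb_s(target_gbit_s, partitions):
    """Per-partition limit that keeps the aggregate at or under a target."""
    return target_gbit_s * 1000 / 8 / partitions


def alias_actions(index, aliases):
    """Body for POST /_aliases that adds each alias to the given index."""
    return {"actions": [{"add": {"index": index, "alias": a}} for a in aliases]}


if __name__ == "__main__":
    total = aggregate_mb_s(20, 32)
    print(total, "MB/s =", mb_s_to_gbit_s(total), "Gbit/s")
    # Hypothetical target of 1 Gbit/s across all 32 partitions.
    print(per_partition_limit_mb_s(1, 32), "MB/s per partition")
    # Hypothetical restored index name.
    print(alias_actions("commonswiki_file_restored",
                        ["commonswiki_file", "commonswiki"]))
```

Whatever aggregate limit is agreed on can be passed to `per_partition_limit_mb_s` to get the value to configure.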