Cleanup missing Commons index on Elasticsearch eqiad
Open, High | Public | 8 Estimated Story Points

Description

After doing some testing, Erik has a rough recovery plan (from https://phabricator.wikimedia.org/T295478#7501154):

  1. Deploy elasticsearch-repository-swift plugin to eqiad and codfw clusters
  2. Configure both clusters to connect to ms-fe.svc.eqiad.wmnet (swift)
  3. Snapshot the existing commonswiki_file index from the codfw cluster to swift, take note of start time
  4. Restore the snapshot from swift to the eqiad cluster.
  5. Run CirrusSearch downtime catchup procedure against eqiad for the period between starting restore and the cluster no longer failing writes to the commonswiki index.
  6. Undeploy elasticsearch-repository-swift from all clusters
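Steps 3 and 4 correspond to Elasticsearch's snapshot and restore REST endpoints. As a rough sketch, the following only builds the request paths and bodies and sends nothing. The repository name `commons_recovery` and the snapshot name `commonswiki_file_snap` are hypothetical. The concrete index behind `commonswiki_file` may carry a suffix. Registering the swift repository itself is left out because its settings depend on the plugin build.

```python
import json

# Hypothetical names, chosen for illustration only.
REPO = "commons_recovery"
SNAPSHOT = "commonswiki_file_snap"
INDEX = "commonswiki_file"  # assumption: the real index name may differ


def snapshot_request(repo, snapshot, index):
    """Describe the call that snapshots one index (sent to the source cluster)."""
    return {
        "method": "PUT",
        "path": f"/_snapshot/{repo}/{snapshot}",
        "params": {"wait_for_completion": "false"},
        # Only the one index is wanted, not cluster-wide state.
        "body": {"indices": index, "include_global_state": False},
    }


def restore_request(repo, snapshot, index):
    """Describe the call that restores that index (sent to the target cluster)."""
    return {
        "method": "POST",
        "path": f"/_snapshot/{repo}/{snapshot}/_restore",
        "body": {"indices": index, "include_global_state": False},
    }


if __name__ == "__main__":
    for request in (snapshot_request(REPO, SNAPSHOT, INDEX),
                    restore_request(REPO, SNAPSHOT, INDEX)):
        print(json.dumps(request, indent=2))
```

Recording the time at which the snapshot request is issued gives the start of the window that the catchup procedure in step 5 has to replay.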

Some related notes:

  • elasticsearch-repository-swift was never released for 6.5.4. I ended up taking the last commit targeting 6.6.0 and compiling it against 6.5.4 (set elasticsearchVersion = 6.5.4, and change gradle from 5 to 4.1). What process should we follow to include this in the plugins .deb, since we are no longer the upstream here?
  • Should we have a separate auth setup in swift for cirrussearch snapshots?
  • By default, snapshot backup/restore is limited to 20MB/s per partition. Since commonswiki is 32 partitions, the cluster will limit itself to 640MB/s, or just over 5 gigabits/s. I suspect this is a bit excessive for the swift cluster; at the very least it would more than double the typical network traffic. What would a more appropriate limit be? @fgiunchedi
  • After or during restore of the snapshot we will likely need to manually assign the commonswiki_file and commonswiki aliases to the restored index.
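The throughput estimate in the notes above can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 Gbit = 1000 Mbit) and that every partition transfers at its limit simultaneously, which is the worst case.

```python
def aggregate_throughput(per_partition_mb_s, partitions):
    """Return (MB/s, Gbit/s) if every partition transfers at its limit at once."""
    total_mb_s = per_partition_mb_s * partitions
    total_gbit_s = total_mb_s * 8 / 1000  # bytes to bits, then Mbit to Gbit
    return total_mb_s, total_gbit_s


# 20 MB/s default limit across the 32 partitions mentioned above.
total_mb_s, total_gbit_s = aggregate_throughput(20, 32)
print(f"{total_mb_s} MB/s = {total_gbit_s:.2f} Gbit/s")
```

With the figures from the note this gives 640 MB/s, or 5.12 Gbit/s.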

Event Timeline

dcausse renamed this task from Cleanup missing Commons index on Elasticsearch equiad to Cleanup missing Commons index on Elasticsearch eqiad. Mon, Nov 15, 4:56 PM
MPhamWMF set the point value for this task to 8.
TJones updated the task description.
  • I checked with network ops: if we rate limit to 2 gigabits of traffic (8MB/s/partition in the repository config), that should leave plenty of room for everything else.
  • Talked with the team and agreed to ship a locally compiled jar of elasticsearch-repository-swift in our Debian package, then take it out when we are done, prior to the 6.8 upgrade.
  • Ideally we should still check in regarding swift auth, but I don't expect any problem with us using our existing data shipping credentials for this purpose.

Next steps:
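For reference, the relation between the 2 gigabit budget and the per-partition setting can be worked out as below. The sketch assumes decimal units and the 32 partitions given in the task description. The function names are illustrative only.

```python
def per_partition_limit_mb_s(budget_gbit_s, partitions):
    """Per-partition cap (MB/s) that keeps the aggregate within the budget."""
    budget_mb_s = budget_gbit_s * 1000 / 8  # 1 Gbit/s = 125 MB/s
    return budget_mb_s / partitions


def aggregate_gbit_s(per_partition_mb_s, partitions):
    """Aggregate rate (Gbit/s) when every partition runs at its cap."""
    return per_partition_mb_s * partitions * 8 / 1000


print(per_partition_limit_mb_s(2, 32))  # exact cap for a 2 Gbit/s budget
print(aggregate_gbit_s(8, 32))          # aggregate at the rounded 8 MB/s cap
```

The exact cap is 7.8125 MB/s per partition, so the rounded 8MB/s setting corresponds to 2.048 Gbit/s in aggregate, marginally above the nominal 2 gigabits.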

  • release updated debian package
  • rolling-restart eqiad and codfw

Change 738979 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/software/elasticsearch/plugins@master] Add repository-swift plugin

https://gerrit.wikimedia.org/r/738979

Working on the plugin build. Getting an error at basically the final step:

ryankemper@apt1001:~$ sudo -E reprepro -C component/elastic65 include stretch-wikimedia /home/ryankemper/Elastic_Plugins/wmf-elasticsearch-search-plugins_6.5.4-7_amd64.changes && sudo rm -rfv ~/Elastic_Plugins
File "pool/component/elastic65/w/wmf-elasticsearch-search-plugins/wmf-elasticsearch-search-plugins_6.5.4-6~stretch_all.deb" is already registered with different checksums!
md5 expected: d24b364d64e6898dd0d6c131de039710, got: 9259317418b1bd1ce34a43b806b8e879
sha1 expected: f78b1bd75539aca14e79a9b67b594bba5ed66c2b, got: 56418574cf370dcfc9ed31f9d1782659e697e651
sha256 expected: 05854b89803030182da3e4b3dffc60f668efb66222a5be2eaec1952b668a84bd, got: 237854d649b0912fbf38dc42805b59380657622754bfefc1f0760b2c2660eead
size expected: 31251912, got: 35086032
There have been errors!

I haven't actually merged https://gerrit.wikimedia.org/r/c/operations/software/elasticsearch/plugins/+/738979, so maybe that's why it's still "seeing" 6.5.4-6 instead of 6.5.4-7.


Here's the contents of the changes file:

ryankemper@apt1001:~$ cat /home/ryankemper/Elastic_Plugins/wmf-elasticsearch-search-plugins_6.5.4-7_amd64.changes
Format: 1.8
Date: Mon, 15 Nov 2021 10:02:14 -0800
Source: wmf-elasticsearch-search-plugins
Binary: wmf-elasticsearch-search-plugins
Architecture: source all
Version: 6.5.4-7
Distribution: stretch
Urgency: medium
Maintainer: David Causse <dcausse@wikimedia.org>
Changed-By: Erik <ebernhardson@ebernhardson-ThinkPad-X1-Carbon-7th>
Description:
 wmf-elasticsearch-search-plugins - Elasticsearch plugins for search
Changes:
 wmf-elasticsearch-search-plugins (6.5.4-7) stretch; urgency=medium
 .
   * Add repository-swift plugin to support T295705
Checksums-Sha1:
 79474c5b44236358bacd58f26490c7dbc32978b5 688 wmf-elasticsearch-search-plugins_6.5.4-7.dsc
 e7ec820c548822754da4daf70505f312e2501935 40718258 wmf-elasticsearch-search-plugins_6.5.4-7.tar.gz
 56418574cf370dcfc9ed31f9d1782659e697e651 35086032 wmf-elasticsearch-search-plugins_6.5.4-6~stretch_all.deb
 0c030af7184c5fcdda276e819b52b9316b61dde7 6206 wmf-elasticsearch-search-plugins_6.5.4-7_amd64.buildinfo
Checksums-Sha256:
 1d7055d4b29b55b9048b122d7b59f856da4b97fc1b0aa80f83e90a6a2cc88a87 688 wmf-elasticsearch-search-plugins_6.5.4-7.dsc
 66f58ebc5a786f4961f83bef8b860614b9cba3041a790456096bf12154b030eb 40718258 wmf-elasticsearch-search-plugins_6.5.4-7.tar.gz
 237854d649b0912fbf38dc42805b59380657622754bfefc1f0760b2c2660eead 35086032 wmf-elasticsearch-search-plugins_6.5.4-6~stretch_all.deb
 98775ab3ca0a2f1f8302630cef6f2cb0e5eda1ba26f242bd2538cab5af23cd0e 6206 wmf-elasticsearch-search-plugins_6.5.4-7_amd64.buildinfo
Files:
 5a642fa7ea58533e033fb23fca2b5ecf 688 database optional wmf-elasticsearch-search-plugins_6.5.4-7.dsc
 1c709315d373f5ec3b1dee13dfbc72fe 40718258 database optional wmf-elasticsearch-search-plugins_6.5.4-7.tar.gz
 9259317418b1bd1ce34a43b806b8e879 35086032 database optional wmf-elasticsearch-search-plugins_6.5.4-6~stretch_all.deb
 fb08db3ad6979fc478afa8c8666ee3cf 6206 database optional wmf-elasticsearch-search-plugins_6.5.4-7_amd64.buildinfo
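
The changes file above declares Version 6.5.4-7 but lists a .deb named 6.5.4-6~stretch, which is the file reprepro rejects. A quick pre-upload check for this kind of mismatch might look like the sketch below. It is a heuristic written for this illustration, not part of reprepro or dpkg, and it assumes artifact names of the form `<source>_<version>...` with an optional `~suffix` after the version.

```python
import re

HEX_DIGEST = re.compile(r"^[0-9a-f]{32,64}$")


def files_not_matching_version(changes_text):
    """List files in a .changes whose embedded version differs from Version."""
    version = None
    names = set()
    for line in changes_text.splitlines():
        if line.startswith("Version:"):
            version = line.split(":", 1)[1].strip()
        fields = line.split()
        # Checksum and Files entries are indented and start with a hex digest.
        if line.startswith(" ") and len(fields) >= 3 and HEX_DIGEST.match(fields[0]):
            names.add(fields[-1])
    if version is None:
        raise ValueError("no Version field found")
    mismatched = []
    for name in sorted(names):
        parts = name.split("_")
        file_version = parts[1] if len(parts) > 1 else ""
        ok = (file_version == version
              or file_version.startswith(version + ".")
              or file_version.startswith(version + "~"))
        if not ok:
            mismatched.append(name)
    return mismatched


SAMPLE = """\
Version: 6.5.4-7
Files:
 5a642fa7ea58533e033fb23fca2b5ecf 688 database optional wmf-elasticsearch-search-plugins_6.5.4-7.dsc
 9259317418b1bd1ce34a43b806b8e879 35086032 database optional wmf-elasticsearch-search-plugins_6.5.4-6~stretch_all.deb
"""

print(files_not_matching_version(SAMPLE))
```

Run against the excerpt, it reports only the `6.5.4-6~stretch_all.deb` entry.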

The problem was https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/elasticsearch/plugins/+/refs/heads/master/debian/rules#8, which had not been bumped. Bumped and rebuilt.

Looks like we can maybe use this snippet to infer the build number automatically, if we wish to do so: https://git.dpkg.org/cgit/dpkg/dpkg.git/tree/scripts/mk/pkg-info.mk

(For now I left it manual)
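
For illustration, the idea behind inferring the version automatically is to read it from the top entry of debian/changelog instead of hard-coding it; as far as I can tell, the linked pkg-info.mk does this through dpkg-parsechangelog. The Python sketch below shows the same idea by parsing the changelog header line directly. The function name and regular expression are assumptions made for this example, not part of the packaging.

```python
import re

# Header line format: "<source> (<version>) <distribution>; urgency=<level>"
HEADER = re.compile(r"^(?P<source>\S+) \((?P<version>[^)]+)\) (?P<dist>[^;]+);")


def changelog_version(changelog_text):
    """Return the version from the first entry of a Debian changelog."""
    lines = changelog_text.lstrip().splitlines()
    if not lines:
        raise ValueError("empty changelog")
    match = HEADER.match(lines[0])
    if match is None:
        raise ValueError(f"unrecognised changelog header: {lines[0]!r}")
    return match.group("version")


# Header taken from the Changes field of the .changes file earlier in this task.
entry = "wmf-elasticsearch-search-plugins (6.5.4-7) stretch; urgency=medium\n"
print(changelog_version(entry))
```

With something like this, the version used in debian/rules would follow the changelog and could not fall behind it.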

Change 738979 merged by Ryan Kemper:

[operations/software/elasticsearch/plugins@master] Add repository-swift plugin

https://gerrit.wikimedia.org/r/738979

Mentioned in SAL (#wikimedia-operations) [2021-11-18T18:57:47Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-18T20:01:57Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-18T20:05:00Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-18T20:51:21Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-18T22:44:12Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-18T22:52:45Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-18T23:47:32Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-19T02:42:14Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad plugin upgrade + restart - ryankemper@cumin1001 - T295705

Thanks for reaching out @Gehel. Since I see that, from the network's POV, the plan is to go 8MB/s max per partition (i.e. 2 gigabits total), I'd say let's start with a test at 4MB/s/partition and see how things look? (cc @MatthewVernon)

Mentioned in SAL (#wikimedia-operations) [2021-11-22T16:39:16Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-22T16:44:46Z] <ryankemper> T295705 Upgrading relforge elasticsearch packages: ryankemper@cumin1001:~$ sudo cumin -b 2 'relforge*' 'DEBIAN_FRONTEND=noninteractive sudo apt-get -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" install elasticsearch-oss wmf-elasticsearch-search-plugins'

Mentioned in SAL (#wikimedia-operations) [2021-11-22T16:49:19Z] <ryankemper> [Elastic] T295705 Downtimed relforge* for 2 hours in order to performing a manual rolling restart of the two hosts relforge1003 and relforge1004

Mentioned in SAL (#wikimedia-operations) [2021-11-22T16:52:47Z] <ryankemper> [Elastic] T295705 Restarting first relforge host: ryankemper@relforge1004:~$ sudo systemctl restart elasticsearch_6@relforge-eqiad.service elasticsearch_6@relforge-eqiad-small-alpha.service logstash.service

Mentioned in SAL (#wikimedia-operations) [2021-11-22T16:55:53Z] <ryankemper> [Elastic] T295705 Restarting second and final relforge host: ryankemper@relforge1003:~$ sudo systemctl restart elasticsearch_6@relforge-eqiad.service elasticsearch_6@relforge-eqiad-small-alpha.service logstash.service

Mentioned in SAL (#wikimedia-operations) [2021-11-22T16:58:14Z] <ryankemper> [Elastic] T295705 Rolling restart w/ plugin upgrade of relforge is complete

Mentioned in SAL (#wikimedia-operations) [2021-11-22T17:01:16Z] <ryankemper> T295705 Beginning rolling restart w/ plugin upgrade of cloudelastic: ryankemper@cumin1001:~$ sudo cookbook sre.elasticsearch.rolling-operation cloudelastic "cloudelastic plugin upgrade + restart" --upgrade --nodes-per-run 3 --start-datetime 2021-11-22T16:59:38 --task-id T295705 on tmux rolling_restarts_cloudelastic

Mentioned in SAL (#wikimedia-operations) [2021-11-22T17:50:53Z] <ryankemper@cumin1001> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-22T19:49:55Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-22T20:01:57Z] <ryankemper@cumin1001> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-23T02:55:06Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-23T02:55:13Z] <ryankemper> T295705 ryankemper@cumin1001:~$ sudo cookbook sre.elasticsearch.rolling-operation codfw "codfw plugin upgrade + restart" --upgrade --nodes-per-run 2 --start-datetime 2021-11-18T18:55:54 --task-id T295705 on tmux rolling_restarts_codfw

Mentioned in SAL (#wikimedia-operations) [2021-11-23T02:57:26Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart with plugin upgrade (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-23T02:58:03Z] <ryankemper> T295705 elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPSConnectionPool(host='search.svc.codfw.wmnet', port=9243): Read timed out. (read timeout=60)) Probably transient failure; will wait 10 mins and try again

Mentioned in SAL (#wikimedia-operations) [2021-11-23T03:06:01Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705

Change 740710 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] cirrussearch: temporarily disable saneitizer

https://gerrit.wikimedia.org/r/740710

Change 740710 merged by Ryan Kemper:

[operations/puppet@production] cirrussearch: temporarily disable saneitizer

https://gerrit.wikimedia.org/r/740710

Mentioned in SAL (#wikimedia-operations) [2021-11-23T04:17:38Z] <ryankemper> T295705 Properly disabled the sane-itizer; we don't want it running until after we (a) complete rolling restarts and (b) restore the missing commonswikI_file index (which is blocked on the restarts)

Change 740711 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] cirrussearch: s/sanitizer/saneitizer

https://gerrit.wikimedia.org/r/740711

Mentioned in SAL (#wikimedia-operations) [2021-11-23T05:10:48Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) restart with plugin upgrade (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-23T05:26:33Z] <ryankemper> T295705 Rolling restart of codfw complete. elastic2044 was manually restarted earlier today so the cookbook didn't restart it (b/c we pass in a datetime cutoff threshold) so I'm manually upgrading and restarting that host

Mentioned in SAL (#wikimedia-operations) [2021-11-23T05:28:59Z] <ryankemper> T295705 Downtimed elastic2044 for one hour and doing a full reboot for good measure. Already ran the plugin upgrade: DEBIAN_FRONTEND=noninteractive sudo apt-get -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" install elasticsearch-oss wmf-elasticsearch-search-plugins

The snapshot started earlier today; it turns out I logged it to the previous ticket instead of this one:

Mentioned in SAL (#wikimedia-operations) [2021-11-23T17:35:41Z] <ebernhardson> T295478 start snapshot of commonswiki_file from cirrus codfw -> swift eqiad

It's taking its time, running at around half of the configured throttling limit (seeing ~80MB/s of network out above normal). For a one-off snapshot I don't imagine it's worth figuring out why. Expecting to start the restore sometime later today, and then start CirrusSearch's catchup routine tomorrow.

The restore ran much faster than the snapshot: it started at 16:10 and finished around 19:00, after which elastic took another hour to spread replicas across the cluster. I then ran the catchup procedure for the ~15 hours since the snapshot was taken; it ran better than in the past and took ~45 minutes to replay 110k updates.

The cluster indices should now be back to normal. The remaining steps are to move traffic back to eqiad, drop the snapshots from swift, and remove the plugin from the clusters.

Change 741734 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/software/elasticsearch/plugins@master] Revert "Add repository-swift plugin"

https://gerrit.wikimedia.org/r/741734

Change 742497 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] Move CirrusSearch traffic back to eqiad

https://gerrit.wikimedia.org/r/742497

Snapshots have been deleted from swift. The snapshot configuration has been dropped from the eqiad, codfw and relforge clusters. The plugin removal should be mergeable; I imagine we won't need a specific rolling restart and the plugin removal can roll out whenever.

Change 742497 merged by jenkins-bot:

[operations/mediawiki-config@master] Move CirrusSearch traffic back to eqiad

https://gerrit.wikimedia.org/r/742497

Mentioned in SAL (#wikimedia-operations) [2021-11-29T20:00:16Z] <ebernhardson@deploy1002> Synchronized wmf-config/InitialiseSettings.php: T295705 Move CirrusSearch traffic back to eqiad (duration: 00m 56s)