
Cleanup missing Commons index on Elasticsearch eqiad
Closed, Resolved · Public · 8 Estimated Story Points

Description

After doing some testing, Erik has a rough recovery plan (from https://phabricator.wikimedia.org/T295478#7501154):

  1. Deploy elasticsearch-repository-swift plugin to eqiad and codfw clusters
  2. Configure both clusters to connect to ms-fe.svc.eqiad.wmnet (swift)
  3. Snapshot the existing commonswiki_file index from the codfw cluster to swift, take note of start time
  4. Restore the snapshot from swift to the eqiad cluster.
  5. Run the CirrusSearch downtime catchup procedure against eqiad for the period between the start of the restore and the point when the cluster is no longer failing writes to the commonswiki index.
  6. Undeploy elasticsearch-repository-swift from all clusters

Some related notes:

  • elasticsearch-repository-swift was never released for 6.5.4; I ended up taking the last commit targeting 6.6.0 and compiling it against 6.5.4 (changing elasticsearchVersion to 6.5.4 and dropping gradle from 5 to 4.1). What process should we follow to include this in the plugins .deb, since we are no longer the upstream here?
  • Should we have a separate auth setup in swift for cirrussearch snapshots?
  • By default snapshot backup/restore is limited to 20MB/s per partition. Since commonswiki is 32 partitions the cluster will limit itself to 640MB/s, or over 5 gigabits/s. I suspect this is a bit excessive for the swift cluster, or at the very least it would more than double the typical network traffic. What would a more appropriate limit be (a sketch of where this would be configured follows these notes)? @fgiunchedi
  • After or during restore of the snapshot we likely need to manually assign the commonswiki_file and commonswiki aliases to it.
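
For reference, a rough sketch of the API calls this plan implies. Everything here is illustrative: the repository type and swift_* setting names depend on the elasticsearch-repository-swift plugin build, the repository/snapshot/index names are placeholders, the endpoints are shown as localhost:9200 for brevity, and the 8mb rate-limit values are placeholders pending the rate-limit question above.

# 1. Register the swift repository on both clusters, with snapshot/restore rate limits.
curl -XPUT 'http://localhost:9200/_snapshot/cirrus_swift' -H 'Content-Type: application/json' -d '{
  "type": "swift",
  "settings": {
    "swift_url": "https://ms-fe.svc.eqiad.wmnet/auth/v1.0",
    "swift_container": "cirrussearch-snapshots",
    "swift_username": "<cirrus swift user>",
    "swift_password": "<secret>",
    "max_snapshot_bytes_per_sec": "8mb",
    "max_restore_bytes_per_sec": "8mb"
  }
}'

# 2. On codfw: snapshot commonswiki_file to swift.
curl -XPUT 'http://localhost:9200/_snapshot/cirrus_swift/commonswiki_file_recovery?wait_for_completion=false' \
  -H 'Content-Type: application/json' -d '{"indices": "commonswiki_file"}'

#    On eqiad: restore it from swift.
curl -XPOST 'http://localhost:9200/_snapshot/cirrus_swift/commonswiki_file_recovery/_restore' \
  -H 'Content-Type: application/json' -d '{"indices": "commonswiki_file"}'

# 3. Point the commonswiki_file and commonswiki aliases at the restored concrete index.
curl -XPOST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d '{
  "actions": [
    {"add": {"index": "<restored concrete index>", "alias": "commonswiki_file"}},
    {"add": {"index": "<restored concrete index>", "alias": "commonswiki"}}
  ]
}'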

Event Timeline

dcausse renamed this task from "Cleanup missing Commons index on Elasticsearch equiad" to "Cleanup missing Commons index on Elasticsearch eqiad". Nov 15 2021, 4:56 PM
MPhamWMF set the point value for this task to 8.
TJones updated the task description.
TJones updated the task description.
  • I checked with network ops: if we rate limit to 2 gigabits/s of traffic (8MB/s per partition in the repository config), it should leave plenty of room for everything else.
  • Talked with the team; we agreed to ship a locally compiled jar of elasticsearch-repository-swift in our debian package, and then take it out when we are done, prior to the 6.8 upgrade.
  • Ideally we should still check in regarding swift auth, but I don't suspect there is any problem with us using our existing data shipping credentials for this purpose.

next steps:

  • release updated debian package
  • rolling-restart eqiad and codfw

Change 738979 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/software/elasticsearch/plugins@master] Add repository-swift plugin

https://gerrit.wikimedia.org/r/738979

Working on the plugin build. Getting an error at basically the final step:

ryankemper@apt1001:~$ sudo -E reprepro -C component/elastic65 include stretch-wikimedia /home/ryankemper/Elastic_Plugins/wmf-elasticsearch-search-plugins_6.5.4-7_amd64.changes && sudo rm -rfv ~/Elastic_Plugins
File "pool/component/elastic65/w/wmf-elasticsearch-search-plugins/wmf-elasticsearch-search-plugins_6.5.4-6~stretch_all.deb" is already registered with different checksums!
md5 expected: d24b364d64e6898dd0d6c131de039710, got: 9259317418b1bd1ce34a43b806b8e879
sha1 expected: f78b1bd75539aca14e79a9b67b594bba5ed66c2b, got: 56418574cf370dcfc9ed31f9d1782659e697e651
sha256 expected: 05854b89803030182da3e4b3dffc60f668efb66222a5be2eaec1952b668a84bd, got: 237854d649b0912fbf38dc42805b59380657622754bfefc1f0760b2c2660eead
size expected: 31251912, got: 35086032
There have been errors!

I haven't actually merged https://gerrit.wikimedia.org/r/c/operations/software/elasticsearch/plugins/+/738979 so maybe that's why it's still "seeing" 6.5.4-6 instead of 6.5.4-7


Here's the contents of the changes file:

ryankemper@apt1001:~$ cat /home/ryankemper/Elastic_Plugins/wmf-elasticsearch-search-plugins_6.5.4-7_amd64.changes
Format: 1.8
Date: Mon, 15 Nov 2021 10:02:14 -0800
Source: wmf-elasticsearch-search-plugins
Binary: wmf-elasticsearch-search-plugins
Architecture: source all
Version: 6.5.4-7
Distribution: stretch
Urgency: medium
Maintainer: David Causse <dcausse@wikimedia.org>
Changed-By: Erik <ebernhardson@ebernhardson-ThinkPad-X1-Carbon-7th>
Description:
 wmf-elasticsearch-search-plugins - Elasticsearch plugins for search
Changes:
 wmf-elasticsearch-search-plugins (6.5.4-7) stretch; urgency=medium
 .
   * Add repository-swift plugin to support T295705
Checksums-Sha1:
 79474c5b44236358bacd58f26490c7dbc32978b5 688 wmf-elasticsearch-search-plugins_6.5.4-7.dsc
 e7ec820c548822754da4daf70505f312e2501935 40718258 wmf-elasticsearch-search-plugins_6.5.4-7.tar.gz
 56418574cf370dcfc9ed31f9d1782659e697e651 35086032 wmf-elasticsearch-search-plugins_6.5.4-6~stretch_all.deb
 0c030af7184c5fcdda276e819b52b9316b61dde7 6206 wmf-elasticsearch-search-plugins_6.5.4-7_amd64.buildinfo
Checksums-Sha256:
 1d7055d4b29b55b9048b122d7b59f856da4b97fc1b0aa80f83e90a6a2cc88a87 688 wmf-elasticsearch-search-plugins_6.5.4-7.dsc
 66f58ebc5a786f4961f83bef8b860614b9cba3041a790456096bf12154b030eb 40718258 wmf-elasticsearch-search-plugins_6.5.4-7.tar.gz
 237854d649b0912fbf38dc42805b59380657622754bfefc1f0760b2c2660eead 35086032 wmf-elasticsearch-search-plugins_6.5.4-6~stretch_all.deb
 98775ab3ca0a2f1f8302630cef6f2cb0e5eda1ba26f242bd2538cab5af23cd0e 6206 wmf-elasticsearch-search-plugins_6.5.4-7_amd64.buildinfo
Files:
 5a642fa7ea58533e033fb23fca2b5ecf 688 database optional wmf-elasticsearch-search-plugins_6.5.4-7.dsc
 1c709315d373f5ec3b1dee13dfbc72fe 40718258 database optional wmf-elasticsearch-search-plugins_6.5.4-7.tar.gz
 9259317418b1bd1ce34a43b806b8e879 35086032 database optional wmf-elasticsearch-search-plugins_6.5.4-6~stretch_all.deb
 fb08db3ad6979fc478afa8c8666ee3cf 6206 database optional wmf-elasticsearch-search-plugins_6.5.4-7_amd64.buildinfo
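
The checksum mismatch above can be cross-checked directly; a couple of hedged commands for comparing what reprepro has registered against the freshly built artifacts (paths as in the session above):

# What the apt repo currently has registered for this package in the elastic65 component.
sudo -E reprepro -C component/elastic65 list stretch-wikimedia wmf-elasticsearch-search-plugins

# Checksum of the freshly built deb, to compare against the "expected" values in the error.
sha256sum /home/ryankemper/Elastic_Plugins/wmf-elasticsearch-search-plugins_6.5.4-6~stretch_all.deb

Worth noting that the .changes file for 6.5.4-7 still lists a 6.5.4-6~stretch binary deb, which in hindsight is the hint that something in the build was still using the old version string.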

The hard-coded version at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/elasticsearch/plugins/+/refs/heads/master/debian/rules#8 was the problem. Bumped it and rebuilt.

Looks like we can maybe use this snippet to infer the build number automatically, if we wish to do so: https://git.dpkg.org/cgit/dpkg/dpkg.git/tree/scripts/mk/pkg-info.mk

(For now I left it manual)
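
For reference, pkg-info.mk ultimately derives those values from dpkg-parsechangelog, so a minimal sketch of pulling the version automatically (run from the plugins repo checkout; the sed step is just illustrative) would be:

# Full version from debian/changelog, e.g. 6.5.4-7
dpkg-parsechangelog --show-field Version

# Just the Debian revision part, e.g. 7
dpkg-parsechangelog --show-field Version | sed 's/.*-//'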

Change 738979 merged by Ryan Kemper:

[operations/software/elasticsearch/plugins@master] Add repository-swift plugin

https://gerrit.wikimedia.org/r/738979

Mentioned in SAL (#wikimedia-operations) [2021-11-18T18:57:47Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-18T20:01:57Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-18T20:05:00Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-18T20:51:21Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-18T22:44:12Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-18T22:52:45Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-18T23:47:32Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-19T02:42:14Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad plugin upgrade + restart - ryankemper@cumin1001 - T295705

Thanks for reaching out @Gehel. Since from the network's POV the plan is to cap at 8MB/s per partition (i.e. 2 gigabits/s total), I'd say let's start with a test at 4MB/s per partition and see how things look? (cc @MatthewVernon)

Mentioned in SAL (#wikimedia-operations) [2021-11-22T16:39:16Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-22T16:44:46Z] <ryankemper> T295705 Upgrading relforge elasticsearch packages: ryankemper@cumin1001:~$ sudo cumin -b 2 'relforge*' 'DEBIAN_FRONTEND=noninteractive sudo apt-get -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" install elasticsearch-oss wmf-elasticsearch-search-plugins'

Mentioned in SAL (#wikimedia-operations) [2021-11-22T16:49:19Z] <ryankemper> [Elastic] T295705 Downtimed relforge* for 2 hours in order to performing a manual rolling restart of the two hosts relforge1003 and relforge1004

Mentioned in SAL (#wikimedia-operations) [2021-11-22T16:52:47Z] <ryankemper> [Elastic] T295705 Restarting first relforge host: ryankemper@relforge1004:~$ sudo systemctl restart elasticsearch_6@relforge-eqiad.service elasticsearch_6@relforge-eqiad-small-alpha.service logstash.service

Mentioned in SAL (#wikimedia-operations) [2021-11-22T16:55:53Z] <ryankemper> [Elastic] T295705 Restarting second and final relforge host: ryankemper@relforge1003:~$ sudo systemctl restart elasticsearch_6@relforge-eqiad.service elasticsearch_6@relforge-eqiad-small-alpha.service logstash.service

Mentioned in SAL (#wikimedia-operations) [2021-11-22T16:58:14Z] <ryankemper> [Elastic] T295705 Rolling restart w/ plugin upgrade of relforge is complete

Mentioned in SAL (#wikimedia-operations) [2021-11-22T17:01:16Z] <ryankemper> T295705 Beginning rolling restart w/ plugin upgrade of cloudelastic: ryankemper@cumin1001:~$ sudo cookbook sre.elasticsearch.rolling-operation cloudelastic "cloudelastic plugin upgrade + restart" --upgrade --nodes-per-run 3 --start-datetime 2021-11-22T16:59:38 --task-id T295705 on tmux rolling_restarts_cloudelastic

Mentioned in SAL (#wikimedia-operations) [2021-11-22T17:50:53Z] <ryankemper@cumin1001> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-22T19:49:55Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-22T20:01:57Z] <ryankemper@cumin1001> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) restart with plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-23T02:55:06Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-23T02:55:13Z] <ryankemper> T295705 ryankemper@cumin1001:~$ sudo cookbook sre.elasticsearch.rolling-operation codfw "codfw plugin upgrade + restart" --upgrade --nodes-per-run 2 --start-datetime 2021-11-18T18:55:54 --task-id T295705 on tmux rolling_restarts_codfw

Mentioned in SAL (#wikimedia-operations) [2021-11-23T02:57:26Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart with plugin upgrade (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-23T02:58:03Z] <ryankemper> T295705 elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPSConnectionPool(host='search.svc.codfw.wmnet', port=9243): Read timed out. (read timeout=60)) Probably transient failure; will wait 10 mins and try again

Mentioned in SAL (#wikimedia-operations) [2021-11-23T03:06:01Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart with plugin upgrade (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705

Change 740710 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] cirrussearch: temporarily disable saneitizer

https://gerrit.wikimedia.org/r/740710

Change 740710 merged by Ryan Kemper:

[operations/puppet@production] cirrussearch: temporarily disable saneitizer

https://gerrit.wikimedia.org/r/740710

Mentioned in SAL (#wikimedia-operations) [2021-11-23T04:17:38Z] <ryankemper> T295705 Properly disabled the sane-itizer; we don't want it running until after we (a) complete rolling restarts and (b) restore the missing commonswikI_file index (which is blocked on the restarts)

Change 740711 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] cirrussearch: s/sanitizer/saneitizer

https://gerrit.wikimedia.org/r/740711

Mentioned in SAL (#wikimedia-operations) [2021-11-23T05:10:48Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) restart with plugin upgrade (2 nodes at a time) for ElasticSearch cluster search_codfw: codfw plugin upgrade + restart - ryankemper@cumin1001 - T295705

Mentioned in SAL (#wikimedia-operations) [2021-11-23T05:26:33Z] <ryankemper> T295705 Rolling restart of codfw complete. elastic2044 was manually restarted earlier today so the cookbook didn't restart it (b/c we pass in a datetime cutoff threshold) so I'm manually upgrading and restarting that host

Mentioned in SAL (#wikimedia-operations) [2021-11-23T05:28:59Z] <ryankemper> T295705 Downtimed elastic2044 for one hour and doing a full reboot for good measure. Already ran the plugin upgrade: DEBIAN_FRONTEND=noninteractive sudo apt-get -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" install elasticsearch-oss wmf-elasticsearch-search-plugins

The snapshot started earlier today; it turns out I logged it to the previous ticket instead of this one:

Mentioned in SAL (#wikimedia-operations) [2021-11-23T17:35:41Z] <ebernhardson> T295478 start snapshot of commonswiki_file from cirrus codfw -> swift eqiad

It's taking its time, running at around half of where the throttling limits were set (seeing ~80MB/s of network out above normal). For a one-off snapshot I don't imagine it's worth figuring out why. Expecting to start the restore sometime later today, and then start CirrusSearch's catchup routine tomorrow.
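
For anyone following along, the standard snapshot APIs give a running view of progress (repository and snapshot names here are the illustrative ones from the sketch in the description):

# Per-shard progress and byte totals for the in-flight snapshot.
curl -s 'http://localhost:9200/_snapshot/cirrus_swift/commonswiki_file_recovery/_status?pretty'

# During the restore on the eqiad side, active shard recoveries.
curl -s 'http://localhost:9200/_cat/recovery?v&active_only=true'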

The restore ran much faster than the snapshot: it started at 16:10, finished around 19:00, and then Elasticsearch took another hour to spread replicas across the cluster. The catchup procedure, covering the ~15 hours since the snapshot was taken, ran better than in the past and took ~45 minutes to replay 110k updates.

The cluster indices should now be back to normal. The remaining steps are to move traffic back to eqiad, drop the snapshots from swift, and remove the plugin from the clusters.

Change 741734 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/software/elasticsearch/plugins@master] Revert "Add repository-swift plugin"

https://gerrit.wikimedia.org/r/741734

Change 742497 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] Move CirrusSearch traffic back to eqiad

https://gerrit.wikimedia.org/r/742497

Snapshots have been deleted from swift. The snapshot configuration has been dropped from the eqiad, codfw and relforge clusters. The plugin removal should be mergeable; I imagine we won't need a specific rolling restart, and the removal can roll out whenever.
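
For reference, on each cluster the cleanup amounts to something like the following (same illustrative names as earlier; deleting a snapshot removes its data from swift, while deleting the repository only removes the configuration):

curl -XDELETE 'http://localhost:9200/_snapshot/cirrus_swift/commonswiki_file_recovery'
curl -XDELETE 'http://localhost:9200/_snapshot/cirrus_swift'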

Change 742497 merged by jenkins-bot:

[operations/mediawiki-config@master] Move CirrusSearch traffic back to eqiad

https://gerrit.wikimedia.org/r/742497

Mentioned in SAL (#wikimedia-operations) [2021-11-29T20:00:16Z] <ebernhardson@deploy1002> Synchronized wmf-config/InitialiseSettings.php: T295705 Move CirrusSearch traffic back to eqiad (duration: 00m 56s)

Change 740711 merged by Bking:

[operations/puppet@production] cirrussearch: s/sanitizer/saneitizer

https://gerrit.wikimedia.org/r/740711

Change 752724 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] cirrussearch: Reenable saneitizer

https://gerrit.wikimedia.org/r/752724

In addition to the patch to reenable the saneitizer, the patch to remove the swift plugin is also waiting for review: https://gerrit.wikimedia.org/r/c/operations/software/elasticsearch/plugins/+/741734/

Change 741734 merged by DCausse:

[operations/software/elasticsearch/plugins@master] Revert "Add repository-swift plugin"

https://gerrit.wikimedia.org/r/741734

Change 752724 merged by Bking:

[operations/puppet@production] cirrussearch: Reenable saneitizer

https://gerrit.wikimedia.org/r/752724

Saneitizer has been re-disabled (https://gerrit.wikimedia.org/r/c/operations/puppet/+/758317) as it was causing update lag. This needs to be investigated before we can close this task.

I had a look over the graphs, and I suspect there is no great answer to be found by just tuning numbers until everything works.

  • Looking at 30-day graphs, deploying the saneitizer had no noticeable effect on mean or tail latency of the ElasticaWrite job runtime. If our bottleneck was at the elasticsearch layer I would expect these latencies to increase, which suggests we should keep looking elsewhere.
  • In the past we've come to a similar conclusion and increased the allowed job runner parallelism, a symptom of implementing multi-cluster replication inside the mediawiki job queue. Looking at the number of jobs submitted over 4-hour time slices[1] (enough to smooth the graph), deploying the saneitizer increased the insertion rate from ~160k to ~300k jobs per 4 hours, roughly doubling the rate we insert jobs into the queue.
  • Our current job configuration allows for 100 concurrent job runners per cluster it's writing to, so around 300 total. All three pools were insufficient, although partition 1 (codfw) fell significantly further behind than the others. This suggests additional per-request latency has an outsized effect.

My interpretation of the above is that Elasticsearch is willing and able to take more indexing load; codfw falling the furthest behind while having significantly more compute available than cloudelastic suggests to me that there isn't enough parallelism available to deliver all the changes to Elasticsearch. We could potentially increase job runner parallelism again; I suspect (without hard data, although it could be looked up) that most of the ElasticaWrite jobs are idling in a curl call waiting for elasticsearch to respond and aren't doing much else. Even if we can increase parallelism, we also need to find a way to reinstate @dcausse's prior work that made these jobs self-throttling. In the past these jobs would query the job queue, find out the queues were busy, and reschedule themselves to try again later. In that case only the saneitizer ran behind and not the entire indexing pipeline.

@Pchelolo Do you have any intuitions about where the job queue limits are? I'm sure the node runner distributing jobs will be fine, but do we have to worry about the capacity of the mw job runner instances themselves?

[1] increase(kafka_server_BrokerTopicMetrics_MessagesIn_total{cluster="kafka_main",kafka_cluster="main-eqiad",topic=~"^eqiad\\.mediawiki\\.job\\.cirrusSearchElasticaWrite$"}[4h])

I suspect sum (phpfpm_processes_total{cluster="jobrunner"}) by (cluster, state) from prometheus is a reasonable guide for live concurrency of the jobrunners. This suggests the jobrunner infrastructure is typically running ~500-600 concurrent requests. We suspect, though, that ElasticaWrite jobs are already consuming 300 concurrent runners, and the graphs don't really show any different activity before/after saneitizer deployment. Consuming 50% of all runners but not changing the graphs noticeably doesn't seem right; this likely needs more investigation.

Looking over T266762 and T255684 from previous occurrences, I think we can get a view of the live per-partition concurrency from prometheus with https://grafana.wikimedia.org/goto/0v1yWzanz [1]. This suggests (although I don't fully trust the numbers, so please review the query) that while we are configured for a concurrency of 100 per partition, 300 total, what's actually being delivered is between 10 and 30 per pod. This would align with the data above showing that phpfpm processes did not noticeably change when deploying the saneitizer. Surprisingly, per this graph, concurrency *dropped* when the saneitizer was deployed. All of this still feels mysterious; there is some unexplained factor. Perhaps that factor is that cpjobqueue can't deliver the configured concurrency, but beyond this graph I'm not sure where else to look for definitive proof.

[1]

label_replace(
  cpjobqueue_normal_rule_processing{rule=~".*cirrusSearchElasticaWrite$", rule!~".*-partitioner-mediawiki-job-.*", quantile="0.5"}
    * on (rule, kubernetes_pod_name) irate(cpjobqueue_normal_rule_processing_count{rule=~".*cirrusSearchElasticaWrite$", rule!~".*-partitioner-mediawiki-job-.*"}[5m]),
  "rule_short",
  "$2$4 ($1$3)",
  "rule",
  ".*-(partitioned|partitioner)-mediawiki-job-(.+)|.*-mediawiki-(job)-(.+)|"
)

I'm not sure how much more we should do here ourselves; I've filed T300914 to ask the platform engineering team to look into cpjobqueue.

One issue with the deployment was fixed by platform eng and our jobs are now draining again. Once it's drained I'll manually trigger a round of saneitizer to verify it can complete in less than the 2 hours we expect.

Delivered concurrency while draining the job queue is around 15-20, suggesting also that our configured concurrency of 100 is much too high. If cpjobqueue can be updated to deliver something closer to the configured concurrency we should evaluate if this can be brought down to perhaps half the current value.

Hugh has made some great improvements in the rate we see through cpjobqueue, but it's still insufficient. https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/762964 adds an option to cirrus to un-isolate the clusters. This will reduce our usage of the job queue by doing writes to the eqiad and codfw clusters directly from the LinksUpdate jobs instead of enqueuing separate jobs to run them. We built this isolation initially because cloudelastic fell behind on updates, and that forced the main clusters to fall behind as well. The updated solution keeps cloudelastic isolated but removes the isolation between prod clusters; if one falls behind, both will. This will have to be carefully monitored on deployment, because currently the jobrunners can't keep up on codfw. The main thing we don't want to see is codfw pulling down eqiad.

Change 765577 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] cirrus: Reduce write isolation to only cloudelastic

https://gerrit.wikimedia.org/r/765577

Change 765577 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: Reduce write isolation to only cloudelastic

https://gerrit.wikimedia.org/r/765577

Mentioned in SAL (#wikimedia-operations) [2022-02-24T21:29:07Z] <brennen@deploy1002> Synchronized wmf-config/CirrusSearch-production.php: Config: [[gerrit:765577|cirrus: Reduce write isolation to only cloudelastic (T295705)]] (duration: 00m 55s)

In terms of draining the already existing backlog of writes, the reduction in write isolation is so far looking reasonable. The CirrusSearchLinksUpdate backlog has stabilized around 1-2 minutes, with prioritized updates staying sub-second. Worth noting that prioritization didn't actually work while writes were isolated; this change brings back the functionality to prioritize edits over non-revision-based updates.

The codfw backlog had 8M+ updates to run through; they have been clearing out and should catch up in a few hours. Once that backlog is cleared we can manually run the saneitizer and see how things perform.

This ticket has been going on long enough, and the primary purpose is resolved. I created a new ticket to cover restoring the functionality that was turned off.