
Restore lost index in cloudelastic
Open, Unbreak Now!, Public

Description

During T309343, I inadvertently started a reimage before the cluster could finish moving all its shards, causing one shard to be lost.

Creating this ticket to track deleting and restoring the missing index commonswiki_file_1647920262.

Details

Project | Branch | Lines +/-
mediawiki/extensions/CirrusSearch | master | +16 -10
operations/puppet | production | +2 -2
mediawiki/extensions/CirrusSearch | wmf/1.39.0-wmf.19 | +16 -10
mediawiki/extensions/CirrusSearch | wmf/1.39.0-wmf.18 | +16 -10
operations/mediawiki-config | master | +1 -0
operations/puppet | production | +1 -1
operations/puppet | production | +2 -2
operations/puppet | production | +2 -0
operations/puppet | production | +2 -2
operations/puppet | production | +2 -0
operations/puppet | production | +4 -4
operations/puppet | production | +6 -2
operations/puppet | production | +15 -11
operations/puppet | production | +26 -2
labs/private | master | +1 -0
operations/software/elasticsearch/plugins | master | +6 -0
operations/software/elasticsearch/plugins | master | +2 -2
operations/software/elasticsearch/plugins | master | +1 -1
operations/software/elasticsearch/plugins | master | +24 -0
operations/software/elasticsearch/plugins | master | +57 -63

Event Timeline


Looks like cloudelastic1002, elastic1074, and thanos-fe1002 all "live" in rack B2. That doesn't guarantee faster upload speeds, but I am going to start there.

Specifically, I will attempt to create Elastic snapshot config on elastic1074 following this guide: https://www.elastic.co/guide/en/elasticsearch/plugins/6.8/repository-s3-client.html

Once the config is created, we will attempt to push a snapshot to the thanos-swift cluster.

@RKemper and I paired on this today.

Puppet kept resetting the instances' keystore file; we'll have to deal with that eventually. Also, the elasticsearch-keystore CLI app hard-codes the keystore file's path to /etc/elasticsearch/elasticsearch.keystore, whereas we run multiple instances out of /etc/elasticsearch/, each with its own keystore file.

So we had to create a new keystore file, then copy it to the instance's subdirectory (/etc/elasticsearch/relforge-eqiad in our case), taking care to preserve the expected ownership and mode (root:elasticsearch, 640).
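A minimal sketch of that step (paths as used for the relforge instance above; exact filenames may differ):

# Create the keystore with the stock tooling (it writes /etc/elasticsearch/elasticsearch.keystore),
# then copy it into the per-instance config directory.
sudo /usr/share/elasticsearch/bin/elasticsearch-keystore create
sudo cp /etc/elasticsearch/elasticsearch.keystore /etc/elasticsearch/relforge-eqiad/elasticsearch.keystore
# Restore the ownership and mode elasticsearch expects.
sudo chown root:elasticsearch /etc/elasticsearch/relforge-eqiad/elasticsearch.keystore
sudo chmod 640 /etc/elasticsearch/relforge-eqiad/elasticsearch.keystore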

Following the guide linked above, we assembled the following cURL command:

curl -X PUT "localhost:9200/_snapshot/T309648?pretty" -H 'Content-Type: application/json' -d'
{
  "type": "s3",
  "settings": {
    "bucket": "T309648",
    "client": "default",
    "endpoint": "https://thanos-swift.discovery.wmnet/auth/v1.0"
  }
}

which fails with the following error:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "repository_verification_exception",
        "reason" : "[T309648] path  is not accessible on master node"
      }
    ],
    "type" : "repository_verification_exception",
    "reason" : "[T309648] path  is not accessible on master node",
    "caused_by" : {
      "type" : "i_o_exception",
      "reason" : "Unable to upload object [tests-n26XKbzCTPKjCo1wkIPelw/master.dat] using a single upload",
      "caused_by" : {
        "type" : "sdk_client_exception",
        "reason" : "Unable to load credentials from service endpoint",
        "caused_by" : {
          "type" : "socket_timeout_exception",
          "reason" : "connect timed out"
        }
      }
    }
  },
  "status" : 500
}

Most likely, the problem is with the elastic S3 plugin settings, as we have confirmed interoperability with the S3 API and thanos-swift cluster in T302494 . Will revisit this tomorrow.

Continued work today with the help of @RKemper and @EBernhardson . We discovered the following:

  • The error listed above occurs when the sensitive values (s3 secret key and access key) aren't in the elastic keystore, so the error messages above are a bit misleading.
  • Access to the thanos-swift endpoint via its S3 API works fine using the official Amazon libraries (boto).
  • The S3 API and/or client libraries are considerably stricter about names than swift itself. I created a bucket (or "container" in swift-speak) called T309648. Trying to access this bucket with the boto libraries (or with the Elastic s3 plugin) causes exceptions.
  • There is a concept of "bucket access style" that I don't fully understand, but it could be part of our problem.

Summary of investigation today:

  • Turned on debug logging for "com.amazonaws" and "org.elasticsearch.repositories.s3" in elasticsearch
  • There are two ways of accessing an s3 bucket: virtual hosted-style and path-style access. S3 deprecated path-style access and intended to turn it off, but then left it in.
  • Our flink updater uses virtual hosted-style access, so that works in prod. But I suspect it only works in k8s.
  • Elasticsearch via the aws client keeps attempting <bucket>.thanos-swift.discovery.wmnet, but virtual hosted-style DNS entries are not available in prod (both URL forms are sketched after this list).
  • This got a bit confusing because invalid bucket names, like T309648 with its capital letter, fall back to path-style access (but then get a failure response from the service because the bucket name is invalid).
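For illustration, the two request styles differ only in where the bucket name goes (the bucket name here is the one used for the later repository config; the object key is a placeholder):

# Path-style access: the bucket is part of the URL path, so the existing DNS entry is enough.
https://thanos-swift.discovery.wmnet/elasticsearch-snapshot/<object-key>

# Virtual hosted-style access: the bucket becomes a subdomain, which would need DNS entries we don't have in prod.
https://elasticsearch-snapshot.thanos-swift.discovery.wmnet/<object-key>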

Resolution:

git diff
diff --git a/plugins/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3Service.java b/plugins/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3Service.java
index d77ace639f1..1677dd20c42 100644
--- a/plugins/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3Service.java
+++ b/plugins/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3Service.java
@@ -150,6 +150,7 @@ class S3Service extends AbstractComponent implements Closeable {
         // We do this because directly constructing the client is deprecated (was already deprecated in 1.1.223 too)
         // so this change removes that usage of a deprecated API.
         builder.withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(endpoint, null));
+        builder.enablePathStyleAccess();
 
         return builder.build();
     }
  • Compile with (needs java 11+) ./gradlew plugins:repository-s3:assemble and the plugin will be in plugins/repository-s3/build/distributions/repository-s3-6.8.23-SNAPSHOT.zip
  • Tested this custom plugin on relforge, looks to work as expected
  • Won't need the custom plugin in 7.10; they brought this option back.

Still todo:

  • I had to bypass the keystore with -Des.allow_insecure_properties=true in the jvm.options file. Ideally we will figure out the keystore; that might be easier now that we have the rest of the connection verified working.

For the keystore:

  • By default calling elasticsearch-keystore invokes java with the wrong es.path.conf:
/bin/java -Xms4m -Xmx64m -XX:+UseSerialGC -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=oss -Des.distribution.type=deb -cp /usr/share/elasticsearch/lib/* org.elasticsearch.common.settings.KeyStoreCli list
  • We can override by setting ES_PATH_CONF when invoking elasticsearch-keystore:
sudo su -g elasticsearch root /usr/bin/env ES_PATH_CONF=/etc/elasticsearch/relforge-eqiad /usr/share/elasticsearch/bin/elasticsearch-keystore list
  • Used the same invocation to create s3.client.default.{access,secret}_key in the relforge-eqiad keystore (see the sketch after this list)
  • su -g elasticsearch root ... is necessary to ensure the resulting keystore will be owned by root:elasticsearch and not root:root. If it's owned by root:root, elasticsearch will refuse to start.
  • Repeated for relforge1004, although most of this was already set up.
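A minimal sketch of adding the two secure settings, reusing the same ES_PATH_CONF override ("add" prompts for the value; the actual values come from the thanos-swift credentials):

sudo su -g elasticsearch root /usr/bin/env ES_PATH_CONF=/etc/elasticsearch/relforge-eqiad /usr/share/elasticsearch/bin/elasticsearch-keystore add s3.client.default.access_key
sudo su -g elasticsearch root /usr/bin/env ES_PATH_CONF=/etc/elasticsearch/relforge-eqiad /usr/share/elasticsearch/bin/elasticsearch-keystore add s3.client.default.secret_key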

Change 803628 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/software/elasticsearch/plugins@master] Use a custom repository-s3 snapshot

https://gerrit.wikimedia.org/r/803628

Change 803628 merged by Bking:

[operations/software/elasticsearch/plugins@master] Use a custom repository-s3 snapshot

https://gerrit.wikimedia.org/r/803628

Change 804003 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/software/elasticsearch/plugins@master] Bump changelog for custom repository-s3 snapshot

https://gerrit.wikimedia.org/r/804003

Change 804003 merged by Ryan Kemper:

[operations/software/elasticsearch/plugins@master] Bump changelog for custom repository-s3 snapshot

https://gerrit.wikimedia.org/r/804003

Mentioned in SAL (#wikimedia-operations) [2022-06-08T23:11:21Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - ryankemper@cumin1001 - T309648

Mentioned in SAL (#wikimedia-operations) [2022-06-08T23:15:30Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - ryankemper@cumin1001 - T309648

dduvall added subscribers: dduvall, Agusbou2015.

I plan to ignore this during train deployment today as it sounds like a fix in progress, and the message length will now be truncated by https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/804396 to allow the structured logging to be parsed and filtered correctly. Thanks for the help with that, @EBernhardson !

Mentioned in SAL (#wikimedia-operations) [2022-06-09T18:53:26Z] <ryankemper> T309648 Copied newly built wmf-elasticsearch-search-plugins from stretch to bullseye (root@apt1001:/home/ryankemper# reprepro copy bullseye-wikimedia stretch-wikimedia wmf-elasticsearch-search-plugins); then ran apt update on relforge*; new plugin package showing as available now: 6.8.23-3~stretch 1001

Mentioned in SAL (#wikimedia-operations) [2022-06-09T18:54:08Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - ryankemper@cumin1001 - T309648

Mentioned in SAL (#wikimedia-operations) [2022-06-09T18:58:27Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - ryankemper@cumin1001 - T309648

Mentioned in SAL (#wikimedia-operations) [2022-06-09T19:21:32Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - ryankemper@cumin1001 - T309648

Mentioned in SAL (#wikimedia-operations) [2022-06-09T19:21:41Z] <ryankemper@cumin1001> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - ryankemper@cumin1001 - T309648

Per the Elastic Secure Settings docs,

"All secure settings are node-specific settings that must have the same value on every node. Therefore you must run this command on every node."
[...]

However, we tested this today in deployment-prep and it does appear that you can copy the keystore from one instance to another and it is still readable (tested with the elasticsearch-keystore list command). That said, we could only see the keys in the keystore, not the values, as our version of elasticsearch-keystore is too old for the "show" command. We should verify the values are readable as well, but we're fairly confident it will work.

Also from the docs:

"Modifications to the keystore do not take effect until you restart Elasticsearch." In other words, we will need to restart the whole fleet once we deploy the keystore file.

There are a few ways we could do this:

  • Copying the keystore file to all instances
  • Maintaining a single keystore file per host and symlinking it to all instances
  • Echoing the values into each keystore file, similar to the process described here

I'm leaning towards option 2, but will discuss with colleagues and get back.
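A rough sketch of what option 2 could look like (instance directory names are illustrative; this is only a sketch, not what was deployed):

# One canonical keystore per host, symlinked into each instance's config directory.
for instance_dir in /etc/elasticsearch/*/ ; do
    sudo ln -sf /etc/elasticsearch/elasticsearch.keystore "${instance_dir}elasticsearch.keystore"
done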

Change 807623 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] [wip] elastic: temp keystore for index restoration

https://gerrit.wikimedia.org/r/807623

Change 807650 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[labs/private@master] elastic: add fake elasticsearch.keystore

https://gerrit.wikimedia.org/r/807650

Change 807650 merged by Ryan Kemper:

[labs/private@master] elastic: add fake elasticsearch.keystore

https://gerrit.wikimedia.org/r/807650

Per pairing yesterday, we've decided to deploy the keystore file via puppet.

Private puppet is necessary because the file contains secrets (thanos cluster pw, to be specific).

We will disable puppet and roll out in an orderly fashion, as "modifications to the keystore do not take effect until you restart Elasticsearch."

Mentioned in SAL (#wikimedia-operations) [2022-06-28T19:08:18Z] <ryankemper> T309648 Disabling puppet across all cirrus hosts in order to test out https://gerrit.wikimedia.org/r/c/operations/puppet/+/807623: ryankemper@cumin1001:~$ sudo -E cumin 'R:elasticsearch::instance' 'disable-puppet "T309648"'

Change 807623 merged by Ryan Kemper:

[operations/puppet@production] elastic: configure keystore values for restore

https://gerrit.wikimedia.org/r/807623

Mentioned in SAL (#wikimedia-operations) [2022-06-28T19:14:39Z] <ryankemper> T309648 Enabling puppet on just elastic2053 and running puppet agent. Expecting to see result of https://gerrit.wikimedia.org/r/807623 being that the new s3 user/pass creds are added to the elasticsearch keystore

Change 809243 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: fix s3 user/pass logic

https://gerrit.wikimedia.org/r/809243

Change 809243 merged by Ryan Kemper:

[operations/puppet@production] elastic: fix s3 user/pass logic

https://gerrit.wikimedia.org/r/809243

Change 809267 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: elasticsearch-keystore takes from stdin

https://gerrit.wikimedia.org/r/809267

Change 809267 merged by Ryan Kemper:

[operations/puppet@production] elastic: elasticsearch-keystore takes from stdin

https://gerrit.wikimedia.org/r/809267

Change 809271 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: use full paths for shell cmds

https://gerrit.wikimedia.org/r/809271

Change 809271 merged by Ryan Kemper:

[operations/puppet@production] elastic: use full paths for shell cmds

https://gerrit.wikimedia.org/r/809271

Change 809276 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: don't mutate keystore group

https://gerrit.wikimedia.org/r/809276

Change 809276 merged by Ryan Kemper:

[operations/puppet@production] elastic: don't mutate keystore group

https://gerrit.wikimedia.org/r/809276

Change 809282 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elasic: grep path differs btw OS vers

https://gerrit.wikimedia.org/r/809282

Change 809282 merged by Ryan Kemper:

[operations/puppet@production] elasic: grep path differs btw OS vers

https://gerrit.wikimedia.org/r/809282

Change 809284 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: grep path differs btw OS vers

https://gerrit.wikimedia.org/r/809284

Change 809284 merged by Ryan Kemper:

[operations/puppet@production] elastic: grep path differs btw OS vers

https://gerrit.wikimedia.org/r/809284

Change 809287 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: delegate echo loc to PATH

https://gerrit.wikimedia.org/r/809287

Change 809287 merged by Ryan Kemper:

[operations/puppet@production] elastic: delegate echo loc to PATH

https://gerrit.wikimedia.org/r/809287

Well, it took us about 8 patches to get it right, but we now have puppet properly handling the keystore logic to add the s3 user/pass.
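Roughly, the shape of the logic those patches converged on, sketched as shell (the real implementation lives in the puppet patches above; the instance path and variable name here are illustrative):

# Add a secure setting from stdin only if the keystore doesn't already contain it.
if ! ES_PATH_CONF=/etc/elasticsearch/<instance> /usr/share/elasticsearch/bin/elasticsearch-keystore list | grep -q '^s3.client.default.access_key$'; then
    echo "$s3_access_key" | ES_PATH_CONF=/etc/elasticsearch/<instance> /usr/share/elasticsearch/bin/elasticsearch-keystore add --stdin s3.client.default.access_key
fi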

Next up is to actually try the index restoration from codfw -> cloudelastic

Mentioned in SAL (#wikimedia-operations) [2022-06-29T04:36:12Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart to pickup swift-s3 plugin - ryankemper@cumin1001 - T309648

Mentioned in SAL (#wikimedia-operations) [2022-06-29T05:56:19Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart to pickup swift-s3 plugin - ryankemper@cumin1001 - T309648

Mentioned in SAL (#wikimedia-operations) [2022-06-29T06:02:36Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart to pickup swift-s3 plugin - ryankemper@cumin1001 - T309648

Mentioned in SAL (#wikimedia-operations) [2022-06-29T06:04:11Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart to pickup swift-s3 plugin - ryankemper@cumin1001 - T309648

Mentioned in SAL (#wikimedia-operations) [2022-06-29T13:13:00Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart to pickup swift-s3 plugin - bking@cumin1001 - T309648

Mentioned in SAL (#wikimedia-operations) [2022-06-29T16:22:08Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart to pickup swift-s3 plugin - bking@cumin1001 - T309648

Mentioned in SAL (#wikimedia-operations) [2022-06-29T18:27:06Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster restart to pickup swift-s3 plugin - bking@cumin1001 - T309648

Mentioned in SAL (#wikimedia-operations) [2022-06-29T23:30:14Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster restart to pickup swift-s3 plugin - bking@cumin1001 - T309648

  • We can override by setting ES_PATH_CONF when invoking elasticsearch-keystore:
sudo su -g elasticsearch root /usr/bin/env ES_PATH_CONF=/etc/elasticsearch/relforge-eqiad /usr/share/elasticsearch/bin/elasticsearch-keystore list

Turns out this (su -g) only works on Debian 10. For the Debian 9 hosts we need to use sg to change the group:

sudo sg elasticsearch -c "/usr/bin/env ES_PATH_CONF=/etc/elasticsearch/relforge-eqiad /usr/share/elasticsearch/bin/elasticsearch-keystore list"

@EBernhardson figured out the proper repository settings, specifically the "endpoint" value should be the bare domain, WITHOUT "/auth/v1.0" at the end.

So the API call to register a snapshot repo should look like:

curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_snapshot/<repo_name> -d '
{
  "type": "s3",
  "settings": {
    "bucket": "snapperson",
    "client": "default",
    "endpoint": "https://thanos-swift.discovery.wmnet",
    "path_style_access": "true"
  }
}'

Started a snapshot in prod via the following command:

curl -XPUT "localhost:9200/_snapshot/elastic_snaps/snapshot_t309648?wait_for_completion=true&pretty" -H 'Content-Type: application/json' -d'
{
  "indices": "commonswiki_file",
  "include_global_state": false,
  "metadata": {
    "taken_by": "bking",
    "taken_because":  "T309648"
  }
}
'

The above failed; a new snapshot was started with the following API call:

curl -X PUT "localhost:9200/_snapshot/elastic_snaps/snapshot_t309648_attempt_2?pretty" -H 'Content-Type: application/json' -d'
> {
>   "indices": "commonswiki_file",
>   "include_global_state": false,
>   "metadata": {
>     "taken_by": "bking",
>     "taken_because":  "T309648"
>   }
> }
> '
{
  "accepted" : true
}

Restarted all nodes on cloudelastic to enable the S3 plugin.
Started the restore with the following command:

curl -H 'Content-type: Application/json' -XPOST  http://127.0.0.1:9200/_snapshot/elastic_snaps/snapshot_t309648_attempt_2/_restore \
>  -d '
> {
>   "indices": "commonswiki_file_1647921177",
>   "include_global_state": false
>   }
>   '
{"accepted":true}

Change 811350 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] cirrus: Disable commonswiki writes to cloudelastic

https://gerrit.wikimedia.org/r/811350

Change 811355 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] cloudelastic: Increase primary cluster heap from 45G to 55G

https://gerrit.wikimedia.org/r/811355

Change 811355 merged by Ryan Kemper:

[operations/puppet@production] cloudelastic: Increase primary cluster heap from 45G to 55G

https://gerrit.wikimedia.org/r/811355

It looks like the snapshot taken on July 1st only required ~1 hr to create. I'm deleting the existing snapshots with the following commands, and will create a new one to minimize the number of missed updates that will have to be corrected by the Saneitizer.

drop snapshots:

curl -XDELETE http://localhost:9200/_snapshot/elastic_snaps/snapshot_t309648
curl -XDELETE http://localhost:9200/_snapshot/elastic_snaps/snapshot_t309648_attempt_2

create new snapshot:

 :) (ebernhardson@cloudelastic1004)-~$ curl -X PUT "localhost:9200/_snapshot/elastic_snaps/snapshot_t309648_attempt_3?pretty" -H 'Content-Type: application/json' -d'
> {
>   "indices": "commonswiki_file",
>   "include_global_state": false,
>   "metadata": {
>     "taken_by": "ebernhardson",
>     "taken_because":  "T309648"
>   }
> }
>
> '
{
  "accepted" : true
}


Wasn't paying enough attention, ran the command from cloudelastic. That's not going to work. Deleted the snapshot and started a 4th attempt from codfw this time:

 :) (ebernhardson@elastic2050)-~$ curl -X PUT "localhost:9200/_snapshot/elastic_snaps/snapshot_t309648_attempt_4?pretty" -H 'Content-Type: application/json' -d'{
>   "indices": "commonswiki_file",
>   "include_global_state": false,
>   "metadata": {
>     "taken_by": "ebernhardson",
>     "taken_because":  "T309648"
>   }
> }'
{
  "accepted" : true
}

Change 811350 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: Disable commonswiki writes to cloudelastic

https://gerrit.wikimedia.org/r/811350

Mentioned in SAL (#wikimedia-operations) [2022-07-05T20:24:54Z] <ebernhardson@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:811350|cirrus: Disable commonswiki writes to cloudelastic (T309648)]] (duration: 03m 23s)

Change 811372 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] job queue: Squelch errors related to unwritable cloudelastic

https://gerrit.wikimedia.org/r/811372

Change 811279 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@wmf/1.39.0-wmf.18] job queue: Squelch errors related to unwritable cloudelastic

https://gerrit.wikimedia.org/r/811279

Change 811280 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@wmf/1.39.0-wmf.19] job queue: Squelch errors related to unwritable cloudelastic

https://gerrit.wikimedia.org/r/811280

Change 811374 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: disable saneitizer for perf reasons

https://gerrit.wikimedia.org/r/811374

Change 811372 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] job queue: Squelch errors related to unwritable cloudelastic

https://gerrit.wikimedia.org/r/811372

Change 811279 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@wmf/1.39.0-wmf.18] job queue: Squelch errors related to unwritable cloudelastic

https://gerrit.wikimedia.org/r/811279

Change 811280 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@wmf/1.39.0-wmf.19] job queue: Squelch errors related to unwritable cloudelastic

https://gerrit.wikimedia.org/r/811280

Mentioned in SAL (#wikimedia-operations) [2022-07-05T21:19:47Z] <ebernhardson@deploy1002> Synchronized php-1.39.0-wmf.19/extensions/CirrusSearch/includes/Job/ElasticaWrite.php: Backport: [[gerrit:811280|job queue: Squelch errors related to unwritable cloudelastic (T309648)]] (duration: 03m 43s)

Mentioned in SAL (#wikimedia-operations) [2022-07-05T21:27:36Z] <ebernhardson@deploy1002> Synchronized php-1.39.0-wmf.18/extensions/CirrusSearch/includes/Job/ElasticaWrite.php: Backport: [[gerrit:811279|job queue: Squelch errors related to unwritable cloudelastic (T309648)]] (duration: 03m 37s)

Mentioned in SAL (#wikimedia-operations) [2022-07-05T21:35:20Z] <ebernhardson@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: Revert: [[gerrit:811350|cirrus: Disable commonswiki writes to cloudelastic (T309648)]] (duration: 03m 42s)


Mentioned in SAL (#wikimedia-operations) [2022-07-05T22:28:22Z] <ryankemper> T309648 Manually restarting cloudelastic1006 before proceeding to a normal rolling restart of cloudelastic

Mentioned in SAL (#wikimedia-operations) [2022-07-05T22:48:44Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (2 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - ryankemper@cumin1001 - T309648

Mentioned in SAL (#wikimedia-operations) [2022-07-05T23:15:38Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (2 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - ryankemper@cumin1001 - T309648

Updated the cloudelastic snapshot repository settings to increase the restore throttle, which by default is 40mb/s:

curl -H 'Content-Type: application/json'  -XPUT https://cloudelastic.wikimedia.org:9243/_snapshot/elastic_snaps -d '{
    "type": "s3",
    "settings": {
        "bucket": "elasticsearch-snapshot",
        "client": "default",
        "path_style_access": "true",
        "endpoint": "https://thanos-swift.discovery.wmnet",
        "max_restore_bytes_per_sec": "512mb"
    }
}'

Started the restore:

$ curl -XPOST -H 'Content-Type: application/json' https://cloudelastic.wikimedia.org:9243/_snapshot/elastic_snaps/snapshot_t309648_attempt_4/_restore -d '{
    "indices": "commonswiki_file_1647921177",
    "include_global_state": false
}'
{"accepted":true} :)

On restore the index uses the same settings as the source cluster, which means 2 replicas instead of the 1 that cloudelastic expects. Updated to match expectations:

$ curl -H 'Content-Type: application/json' -XPUT https://cloudelastic.wikimedia.org:9243/commonswiki_file/_settings -d '{"index":{"auto_expand_replicas": "0-1"}}'                         
{"acknowledged":true} 

Recovery failed, similar errors to last time. Using shardId 14 as an example:

[commonswiki_file_1647921177/kwe-zy2bRpS6EaXeNmvoSA][[commonswiki_file_1647921177][14]]
Caused by: org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 1073741824; received: 905052160

Elasticsearch tries multiple times for each shard. Suspiciously, it fails (for the same shardId) with the same amount of expected data but a differing amount of data received each time. This suggests that we are getting prematurely closed connections between elasticsearch and thanos-swift. Not clear why yet.

Caused by: java.io.IOException: Premature end of Content-Length delimited message body (expected: 1073741824; received: 704643072
Caused by: java.io.IOException: Premature end of Content-Length delimited message body (expected: 1073741824; received: 1019478016
Caused by: org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 1073741824; received: 914489344
Caused by: java.io.IOException: Premature end of Content-Length delimited message body (expected: 1073741824; received: 1028653056

Reviewed the cluster overview dashboards for thanos; one suspicious metric is that disk utilization per host on the thanos-be100[1234] hosts goes from ~50% to 100% as soon as the restore starts. Notably, only 1 disk (out of 14) goes that high, but perhaps our user account maps to a particular disk or some such. Unsure. It's only a weak guess, but I'm changing some settings to be gentler on thanos and trying the restore again:

$ curl -XDELETE https://cloudelastic.wikimedia.org:9243/commonswiki_file_1647921177
{"acknowledged":true}
updating [cluster.routing.allocation.node_concurrent_outgoing_recoveries] from [8] to [1]
updating [cluster.routing.allocation.node_concurrent_incoming_recoveries] from [8] to [1]
$ curl -XPUT -H 'Content-Type: application/json' https://cloudelastic.wikimedia.org:9243/_cluster/settings -d '{"persistent": {"cluster.routing.allocation.node_concurrent_recoveries": 1}}' 
{"acknowledged":true,"persistent":{"cluster":{"routing":{"allocation":{"node_concurrent_recoveries":"1"}}}},"transient":{}}
$ curl -H 'Content-Type: application/json'  -XPUT https://cloudelastic.wikimedia.org:9243/_snapshot/elastic_snaps -d '{
>     "type": "s3",
>     "settings": {
>         "bucket": "elasticsearch-snapshot",
>         "client": "default",
>         "path_style_access": "true",
>         "endpoint": "https://thanos-swift.discovery.wmnet",
>         "max_restore_bytes_per_sec": "40mb"
>     }
> }'
{"acknowledged":true} 

Reviewed _cluster/settings and reset things I'd previously tuned back to the defaults. Not clear this will help, but I'm trying to push thanos-swift less hard to see if anything starts working. Most of these settings were aimed at improving cluster-restart speed and how long it takes to shuffle shards around the cluster post-restart.

updating [cluster.routing.allocation.cluster_concurrent_rebalance] from [8] to [2]
$ curl -XPUT -H 'Content-Type: application/json' https://cloudelastic.wikimedia.org:9243/_cluster/settings -d '{"transient":{"cluster.routing.allocation.cluster_concurrent_rebalance": null}}'
{"acknowledged":true,"persistent":{},"transient":{}} 
updating [indices.recovery.max_concurrent_file_chunks] from [4] to [1]
$ curl -H 'Content-Type: application/json'  -XPUT https://cloudelastic.wikimedia.org:9243/_cluster/settings -d '{"transient":{"indices.recovery.max_concurrent_file_chunks": null}}'
{"acknowledged":true,"persistent":{},"transient":{}}

So far this is unsuccessful. Attempting a restore reports failing shards within 10 minutes. Additionally, even though we told it to only perform 1 restore per node, it starts up 18 (= 3 per node) shard recoveries in parallel. As a next step I'm deleting the existing snapshots and will update the snapshot settings to include chunk_size: 1gb. I have no particular proof this will help, but I'm guessing that if the problem is http connections closing unexpectedly, maybe having smaller files to transfer will help out.
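Re-registering the repository with that setting would look roughly like the earlier repository updates (sketch only, run on the cluster taking the snapshot; chunk_size is a standard repository-s3 setting):

curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_snapshot/elastic_snaps -d '{
    "type": "s3",
    "settings": {
        "bucket": "elasticsearch-snapshot",
        "client": "default",
        "path_style_access": "true",
        "endpoint": "https://thanos-swift.discovery.wmnet",
        "chunk_size": "1gb"
    }
}'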

One other suspicious piece: the cluster-overview dashboard for the eqiad thanos cluster showed significantly increased disk utilization, up to a solid 100% on one of the disks for all 4 backend nodes, starting at about 23:30, which is the same time we started the restore. There was a similar spike yesterday at around the same time, but not every day for the past 7 days. Planning to wait until thanos has settled down before attempting to snapshot+restore again.

Per this morning's conversation with @EBernhardson , we are going to shift our focus away from the restore just long enough to upgrade our production clusters to bullseye. As I write this, we are reimaging cloudelastic to bullseye.
See https://phabricator.wikimedia.org/T309343 for more details.