
Restore lost index in cloudelastic
Open, Unbreak Now!, Public

Description

During T309343, I inadvertently started a reimage before the cluster could finish moving all its shards, causing one shard to be lost.

Creating this ticket to track deleting and restoring the missing index commonswiki_file_1647920262.

Details

Project | Branch | Lines +/-
mediawiki/extensions/CirrusSearch | master | +16 -10
operations/puppet | production | +2 -2
mediawiki/extensions/CirrusSearch | wmf/1.39.0-wmf.19 | +16 -10
mediawiki/extensions/CirrusSearch | wmf/1.39.0-wmf.18 | +16 -10
operations/mediawiki-config | master | +1 -0
operations/puppet | production | +1 -1
operations/puppet | production | +2 -2
operations/puppet | production | +2 -0
operations/puppet | production | +2 -2
operations/puppet | production | +2 -0
operations/puppet | production | +4 -4
operations/puppet | production | +6 -2
operations/puppet | production | +15 -11
operations/puppet | production | +26 -2
labs/private | master | +1 -0
operations/software/elasticsearch/plugins | master | +6 -0
operations/software/elasticsearch/plugins | master | +2 -2
operations/software/elasticsearch/plugins | master | +1 -1
operations/software/elasticsearch/plugins | master | +24 -0
operations/software/elasticsearch/plugins | master | +57 -63

Event Timeline


Looks like cloudelastic1002, elastic1074, and thanos-fe1002 all "live" in rack B2. That doesn't guarantee faster upload speeds, but I am going to start there.

Specifically, I will attempt to create Elastic snapshot config on elastic1074 following this guide: https://www.elastic.co/guide/en/elasticsearch/plugins/6.8/repository-s3-client.html

Once the config is created, we will attempt to push a snapshot to the thanos-swift cluster.

@RKemper and I paired on this today.

Puppet kept resetting the instances' keystore file; we'll have to deal with that eventually. Also, the elasticsearch-keystore CLI app hard-codes the keystore file's path to /etc/elasticsearch/elasticsearch.keystore, whereas we run multiple instances out of /etc/elasticsearch/, each with its own keystore file.

So we had to create a new keystore file, then copy it to the instance's subdirectory (/etc/elasticsearch/relforge-eqiad in our case), taking care to preserve the expected ownership and mode (root:elasticsearch, 640).
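A minimal sketch of that step (paths as used for the relforge instance above; exact filenames may differ):

# Create the keystore with the stock tooling (it writes /etc/elasticsearch/elasticsearch.keystore),
# then copy it into the per-instance config directory.
sudo /usr/share/elasticsearch/bin/elasticsearch-keystore create
sudo cp /etc/elasticsearch/elasticsearch.keystore /etc/elasticsearch/relforge-eqiad/elasticsearch.keystore
# Restore the ownership and mode elasticsearch expects.
sudo chown root:elasticsearch /etc/elasticsearch/relforge-eqiad/elasticsearch.keystore
sudo chmod 640 /etc/elasticsearch/relforge-eqiad/elasticsearch.keystore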

Following the guide linked above, we assembled the following cURL command:

curl -X PUT "localhost:9200/_snapshot/T309648?pretty" -H 'Content-Type: application/json' -d'
{
  "type": "s3",
  "settings": {
    "bucket": "T309648",
    "client": "default",
    "endpoint": "https://thanos-swift.discovery.wmnet/auth/v1.0"
  }
}

which fails with the following error:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "repository_verification_exception",
        "reason" : "[T309648] path  is not accessible on master node"
      }
    ],
    "type" : "repository_verification_exception",
    "reason" : "[T309648] path  is not accessible on master node",
    "caused_by" : {
      "type" : "i_o_exception",
      "reason" : "Unable to upload object [tests-n26XKbzCTPKjCo1wkIPelw/master.dat] using a single upload",
      "caused_by" : {
        "type" : "sdk_client_exception",
        "reason" : "Unable to load credentials from service endpoint",
        "caused_by" : {
          "type" : "socket_timeout_exception",
          "reason" : "connect timed out"
        }
      }
    }
  },
  "status" : 500
}

Most likely, the problem is with the elastic S3 plugin settings, as we have confirmed interoperability with the S3 API and thanos-swift cluster in T302494 . Will revisit this tomorrow.

Continued work today with the help of @RKemper and @EBernhardson . We discovered the following:

  • The error listed above occurs when the sensitive values (s3 secret key and access key) aren't in the elastic keystore, so the error messages above are a bit misleading.
  • Access to the thanos-swift endpoint via its S3 API works fine using the official Amazon libraries (boto).
  • The S3 API and/or client libraries are considerably stricter about names than swift itself. I created a bucket (or "container" in swift-speak) called T309648. Trying to access this bucket with the boto libraries (or with the Elastic s3 plugin) causes exceptions.
  • There is a concept of "bucket access style" that I don't fully understand, but it could be part of our problem.

Summary of investigation today:

  • Turned on debug logging for "com.amazonaws" and "org.elasticsearch.repositories.s3" in elasticsearch
  • There are two ways of accessing an s3 bucket: virtual hosted-style and path-style access. S3 deprecated path-style access and intended to turn it off, but then left it in.
  • Our flink updater uses virtual hosted-style access, so that works in prod. But I suspect it only works in k8s.
  • Elasticsearch via the aws client keeps attempting <bucket>.thanos-swift.discovery.wmnet, but virtual hosted-style DNS entries are not available in prod (both URL forms are sketched after this list).
  • This got a bit confusing because invalid bucket names, like T309648 with its capital letter, fall back to path-style access (but then get a failure response from the service because the bucket name is invalid).
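For illustration, the two request styles differ only in where the bucket name goes (the bucket name here is the one used for the later repository config; the object key is a placeholder):

# Path-style access: the bucket is part of the URL path, so the existing DNS entry is enough.
https://thanos-swift.discovery.wmnet/elasticsearch-snapshot/<object-key>

# Virtual hosted-style access: the bucket becomes a subdomain, which would need DNS entries we don't have in prod.
https://elasticsearch-snapshot.thanos-swift.discovery.wmnet/<object-key>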

Resolution:

git diff
diff --git a/plugins/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3Service.java b/plugins/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3Service.java
index d77ace639f1..1677dd20c42 100644
--- a/plugins/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3Service.java
+++ b/plugins/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3Service.java
@@ -150,6 +150,7 @@ class S3Service extends AbstractComponent implements Closeable {
         // We do this because directly constructing the client is deprecated (was already deprecated in 1.1.223 too)
         // so this change removes that usage of a deprecated API.
         builder.withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(endpoint, null));
+        builder.enablePathStyleAccess();
 
         return builder.build();
     }
  • Compile with (needs java 11+) ./gradlew plugins:repository-s3:assemble and the plugin will be in plugins/repository-s3/build/distributions/repository-s3-6.8.23-SNAPSHOT.zip
  • Tested this custom plugin on relforge, looks to work as expected
  • Won't need the custom plugin in 7.10; they brought this option back.

Still todo:

  • I had to bypass the keystore with -Des.allow_insecure_properties=true in the jvm.options file. Ideally we will figure out the keystore; that might be easier now that we have the rest of the connection verified working.

For the keystore:

  • By default calling elasticsearch-keystore invokes java with the wrong es.path.conf:
/bin/java -Xms4m -Xmx64m -XX:+UseSerialGC -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=oss -Des.distribution.type=deb -cp /usr/share/elasticsearch/lib/* org.elasticsearch.common.settings.KeyStoreCli list
  • We can override by setting ES_PATH_CONF when invoking elasticsearch-keystore:
sudo su -g elasticsearch root /usr/bin/env ES_PATH_CONF=/etc/elasticsearch/relforge-eqiad /usr/share/elasticsearch/bin/elasticsearch-keystore list
  • Used the same invocation to create s3.client.default.{access,secret}_key in the relforge-eqiad keystore (see the sketch after this list)
  • su -g elasticsearch root ... is necessary to ensure the resulting keystore will be owned by root:elasticsearch and not root:root. If it's owned by root:root, elasticsearch will refuse to start.
  • Repeated for relforge1004, although most of this was already set up.
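A minimal sketch of adding the two secure settings, reusing the same ES_PATH_CONF override ("add" prompts for the value; the actual values come from the thanos-swift credentials):

sudo su -g elasticsearch root /usr/bin/env ES_PATH_CONF=/etc/elasticsearch/relforge-eqiad /usr/share/elasticsearch/bin/elasticsearch-keystore add s3.client.default.access_key
sudo su -g elasticsearch root /usr/bin/env ES_PATH_CONF=/etc/elasticsearch/relforge-eqiad /usr/share/elasticsearch/bin/elasticsearch-keystore add s3.client.default.secret_key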

Change 803628 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/software/elasticsearch/plugins@master] Use a custom repository-s3 snapshot

https://gerrit.wikimedia.org/r/803628

Change 803628 merged by Bking:

[operations/software/elasticsearch/plugins@master] Use a custom repository-s3 snapshot

https://gerrit.wikimedia.org/r/803628

Change 804003 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/software/elasticsearch/plugins@master] Bump changelog for custom repository-s3 snapshot

https://gerrit.wikimedia.org/r/804003

Change 804003 merged by Ryan Kemper:

[operations/software/elasticsearch/plugins@master] Bump changelog for custom repository-s3 snapshot

https://gerrit.wikimedia.org/r/804003

Mentioned in SAL (#wikimedia-operations) [2022-06-08T23:11:21Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - ryankemper@cumin1001 - T309648

Mentioned in SAL (#wikimedia-operations) [2022-06-08T23:15:30Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - ryankemper@cumin1001 - T309648

dduvall added subscribers: dduvall, Agusbou2015.

I plan to ignore this during train deployment today as it sounds like a fix in progress, and the message length will now be truncated by https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/804396 to allow the structured logging to be parsed and filtered correctly. Thanks for the help with that, @EBernhardson !

Mentioned in SAL (#wikimedia-operations) [2022-06-09T18:53:26Z] <ryankemper> T309648 Copied newly built wmf-elasticsearch-search-plugins from stretch to bullseye (root@apt1001:/home/ryankemper# reprepro copy bullseye-wikimedia stretch-wikimedia wmf-elasticsearch-search-plugins); then ran apt update on relforge*; new plugin package showing as available now: 6.8.23-3~stretch 1001

Mentioned in SAL (#wikimedia-operations) [2022-06-09T18:54:08Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - ryankemper@cumin1001 - T309648

Mentioned in SAL (#wikimedia-operations) [2022-06-09T18:58:27Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - ryankemper@cumin1001 - T309648

Mentioned in SAL (#wikimedia-operations) [2022-06-09T19:21:32Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - ryankemper@cumin1001 - T309648

Mentioned in SAL (#wikimedia-operations) [2022-06-09T19:21:41Z] <ryankemper@cumin1001> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - ryankemper@cumin1001 - T309648

Per the Elastic Secure Settings docs,

"All secure settings are node-specific settings that must have the same value on every node. Therefore you must run this command on every node."
[...]

However, we tested this today in deployment-prep and it does appear that you can copy the keystore from one instance to another and it is still readable (tested with the elasticsearch-keystore list command). That said, we could only see the keys in the keystore, not the values, as our version of elasticsearch-keystore is too old for the "show" command. We should verify the values are readable as well, but we're fairly confident it will work.

Also from the docs:

"Modifications to the keystore do not take effect until you restart Elasticsearch." In other words, we will need to restart the whole fleet once we deploy the keystore file.

There are a few ways we could do this:

  • Copying the keystore file to all instances
  • Maintaining a single keystore file per host and symlinking it to all instances
  • Echoing the values into each keystore file, similar to the process described here

I'm leaning towards option 2, but will discuss with colleagues and get back.
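A rough sketch of what option 2 could look like (instance directory names are illustrative; this is only a sketch, not what was deployed):

# One canonical keystore per host, symlinked into each instance's config directory.
for instance_dir in /etc/elasticsearch/*/ ; do
    sudo ln -sf /etc/elasticsearch/elasticsearch.keystore "${instance_dir}elasticsearch.keystore"
done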

Change 807623 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] [wip] elastic: temp keystore for index restoration

https://gerrit.wikimedia.org/r/807623

Change 807650 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[labs/private@master] elastic: add fake elasticsearch.keystore

https://gerrit.wikimedia.org/r/807650

Change 807650 merged by Ryan Kemper:

[labs/private@master] elastic: add fake elasticsearch.keystore

https://gerrit.wikimedia.org/r/807650

Per pairing yesterday, we've decided to deploy the keystore file via puppet.

Private puppet is necessary because the file contains secrets (thanos cluster pw, to be specific).

We will disable puppet and roll out in an orderly fashion, as "modifications to the keystore do not take effect until you restart Elasticsearch."

Mentioned in SAL (#wikimedia-operations) [2022-06-28T19:08:18Z] <ryankemper> T309648 Disabling puppet across all cirrus hosts in order to test out https://gerrit.wikimedia.org/r/c/operations/puppet/+/807623: ryankemper@cumin1001:~$ sudo -E cumin 'R:elasticsearch::instance' 'disable-puppet "T309648"'

Change 807623 merged by Ryan Kemper:

[operations/puppet@production] elastic: configure keystore values for restore

https://gerrit.wikimedia.org/r/807623

Mentioned in SAL (#wikimedia-operations) [2022-06-28T19:14:39Z] <ryankemper> T309648 Enabling puppet on just elastic2053 and running puppet agent. Expecting to see result of https://gerrit.wikimedia.org/r/807623 being that the new s3 user/pass creds are added to the elasticsearch keystore

Change 809243 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: fix s3 user/pass logic

https://gerrit.wikimedia.org/r/809243

Change 809243 merged by Ryan Kemper:

[operations/puppet@production] elastic: fix s3 user/pass logic

https://gerrit.wikimedia.org/r/809243

Change 809267 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: elasticsearch-keystore takes from stdin

https://gerrit.wikimedia.org/r/809267

Change 809267 merged by Ryan Kemper:

[operations/puppet@production] elastic: elasticsearch-keystore takes from stdin

https://gerrit.wikimedia.org/r/809267

Change 809271 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: use full paths for shell cmds

https://gerrit.wikimedia.org/r/809271

Change 809271 merged by Ryan Kemper:

[operations/puppet@production] elastic: use full paths for shell cmds

https://gerrit.wikimedia.org/r/809271

Change 809276 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: don't mutate keystore group

https://gerrit.wikimedia.org/r/809276

Change 809276 merged by Ryan Kemper:

[operations/puppet@production] elastic: don't mutate keystore group

https://gerrit.wikimedia.org/r/809276

Change 809282 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elasic: grep path differs btw OS vers

https://gerrit.wikimedia.org/r/809282

Change 809282 merged by Ryan Kemper:

[operations/puppet@production] elasic: grep path differs btw OS vers

https://gerrit.wikimedia.org/r/809282

Change 809284 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: grep path differs btw OS vers

https://gerrit.wikimedia.org/r/809284

Change 809284 merged by Ryan Kemper:

[operations/puppet@production] elastic: grep path differs btw OS vers

https://gerrit.wikimedia.org/r/809284

Change 809287 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: delegate echo loc to PATH

https://gerrit.wikimedia.org/r/809287

Change 809287 merged by Ryan Kemper:

[operations/puppet@production] elastic: delegate echo loc to PATH

https://gerrit.wikimedia.org/r/809287

Well, it took us about 8 patches to get it right, but we now have puppet properly handling the keystore logic to add the s3 user/pass.
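Roughly, the shape of the logic those patches converged on, sketched as shell (the real implementation lives in the puppet patches above; the instance path and variable name here are illustrative):

# Add a secure setting from stdin only if the keystore doesn't already contain it.
if ! ES_PATH_CONF=/etc/elasticsearch/<instance> /usr/share/elasticsearch/bin/elasticsearch-keystore list | grep -q '^s3.client.default.access_key$'; then
    echo "$s3_access_key" | ES_PATH_CONF=/etc/elasticsearch/<instance> /usr/share/elasticsearch/bin/elasticsearch-keystore add --stdin s3.client.default.access_key
fi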

Next up is to actually try the index restoration from codfw -> cloudelastic

Mentioned in SAL (#wikimedia-operations) [2022-06-29T04:36:12Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart to pickup swift-s3 plugin - ryankemper@cumin1001 - T309648

Mentioned in SAL (#wikimedia-operations) [2022-06-29T05:56:19Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart to pickup swift-s3 plugin - ryankemper@cumin1001 - T309648

Mentioned in SAL (#wikimedia-operations) [2022-06-29T06:02:36Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart to pickup swift-s3 plugin - ryankemper@cumin1001 - T309648

Mentioned in SAL (#wikimedia-operations) [2022-06-29T06:04:11Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart to pickup swift-s3 plugin - ryankemper@cumin1001 - T309648

Mentioned in SAL (#wikimedia-operations) [2022-06-29T13:13:00Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart to pickup swift-s3 plugin - bking@cumin1001 - T309648

Mentioned in SAL (#wikimedia-operations) [2022-06-29T16:22:08Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart to pickup swift-s3 plugin - bking@cumin1001 - T309648

Mentioned in SAL (#wikimedia-operations) [2022-06-29T18:27:06Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster restart to pickup swift-s3 plugin - bking@cumin1001 - T309648

Mentioned in SAL (#wikimedia-operations) [2022-06-29T23:30:14Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster restart to pickup swift-s3 plugin - bking@cumin1001 - T309648

  • We can override by setting ES_PATH_CONF when invoking elasticsearch-keystore:
sudo su -g elasticsearch root /usr/bin/env ES_PATH_CONF=/etc/elasticsearch/relforge-eqiad /usr/share/elasticsearch/bin/elasticsearch-keystore list

Turns out this (su -g) only works on Debian 10. For the Debian 9 hosts we need to use sg to change the group:

sudo sg elasticsearch -c "/usr/bin/env ES_PATH_CONF=/etc/elasticsearch/relforge-eqiad /usr/share/elasticsearch/bin/elasticsearch-keystore list"

@EBernhardson figured out the proper repository settings, specifically the "endpoint" value should be the bare domain, WITHOUT "/auth/v1.0" at the end.

So the API call to register a snapshot repo should look like:

curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_snapshot/<repo_name> -d '
{
  "type": "s3",
  "settings": {
    "bucket": "snapperson",
    "client": "default",
    "endpoint": "https://thanos-swift.discovery.wmnet",
    "path_style_access": "true"
  }
}'

Started a snapshot in prod via the following command:

curl -XPUT "localhost:9200/_snapshot/elastic_snaps/snapshot_t309648?wait_for_completion=true&pretty" -H 'Content-Type: application/json' -d'
{
  "indices": "commonswiki_file",
  "include_global_state": false,
  "metadata": {
    "taken_by": "bking",
    "taken_because":  "T309648"
  }
}
'

The above failed; a new snapshot was started with the following API call:

curl -X PUT "localhost:9200/_snapshot/elastic_snaps/snapshot_t309648_attempt_2?pretty" -H 'Content-Type: application/json' -d'
> {
>   "indices": "commonswiki_file",
>   "include_global_state": false,
>   "metadata": {
>     "taken_by": "bking",
>     "taken_because":  "T309648"
>   }
> }
> '
{
  "accepted" : true
}

Restarted all nodes on cloudelastic to enable the S3 plugin.
Started the restore with the following command:

curl -H 'Content-type: Application/json' -XPOST  http://127.0.0.1:9200/_snapshot/elastic_snaps/snapshot_t309648_attempt_2/_restore \
>  -d '
> {
>   "indices": "commonswiki_file_1647921177",
>   "include_global_state": false
>   }
>   '
{"accepted":true}

Change 811350 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] cirrus: Disable commonswiki writes to cloudelastic

https://gerrit.wikimedia.org/r/811350

Change 811355 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] cloudelastic: Increase primary cluster heap from 45G to 55G

https://gerrit.wikimedia.org/r/811355

Change 811355 merged by Ryan Kemper:

[operations/puppet@production] cloudelastic: Increase primary cluster heap from 45G to 55G

https://gerrit.wikimedia.org/r/811355

It looks like the snapshot taken on July 1st only required ~1 hr to create. I'm deleting the existing snapshots with the following commands, and will create a new one to minimize the number of missed updates that will have to be corrected by the Saneitizer.

drop snapshots:

curl -XDELETE http://localhost:9200/_snapshot/elastic_snaps/snapshot_t309648
curl -XDELETE http://localhost:9200/_snapshot/elastic_snaps/snapshot_t309648_attempt_2

create new snapshot:

 :) (ebernhardson@cloudelastic1004)-~$ curl -X PUT "localhost:9200/_snapshot/elastic_snaps/snapshot_t309648_attempt_3?pretty" -H 'Content-Type: application/json' -d'
> {
>   "indices": "commonswiki_file",
>   "include_global_state": false,
>   "metadata": {
>     "taken_by": "ebernhardson",
>     "taken_because":  "T309648"
>   }
> }
>
> '
{
  "accepted" : true
}


Wasn't paying enough attention, ran the command from cloudelastic. That's not going to work. Deleted the snapshot and started a 4th attempt from codfw this time:

 :) (ebernhardson@elastic2050)-~$ curl -X PUT "localhost:9200/_snapshot/elastic_snaps/snapshot_t309648_attempt_4?pretty" -H 'Content-Type: application/json' -d'{
>   "indices": "commonswiki_file",
>   "include_global_state": false,
>   "metadata": {
>     "taken_by": "ebernhardson",
>     "taken_because":  "T309648"
>   }
> }'
{
  "accepted" : true
}

Change 811350 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: Disable commonswiki writes to cloudelastic

https://gerrit.wikimedia.org/r/811350

Mentioned in SAL (#wikimedia-operations) [2022-07-05T20:24:54Z] <ebernhardson@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:811350|cirrus: Disable commonswiki writes to cloudelastic (T309648)]] (duration: 03m 23s)

Change 811372 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] job queue: Squelch errors related to unwritable cloudelastic

https://gerrit.wikimedia.org/r/811372

Change 811279 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@wmf/1.39.0-wmf.18] job queue: Squelch errors related to unwritable cloudelastic

https://gerrit.wikimedia.org/r/811279

Change 811280 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@wmf/1.39.0-wmf.19] job queue: Squelch errors related to unwritable cloudelastic

https://gerrit.wikimedia.org/r/811280

Change 811374 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: disable saneitizer for perf reasons

https://gerrit.wikimedia.org/r/811374

Change 811372 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] job queue: Squelch errors related to unwritable cloudelastic

https://gerrit.wikimedia.org/r/811372

Change 811279 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@wmf/1.39.0-wmf.18] job queue: Squelch errors related to unwritable cloudelastic

https://gerrit.wikimedia.org/r/811279

Change 811280 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@wmf/1.39.0-wmf.19] job queue: Squelch errors related to unwritable cloudelastic

https://gerrit.wikimedia.org/r/811280

Mentioned in SAL (#wikimedia-operations) [2022-07-05T21:19:47Z] <ebernhardson@deploy1002> Synchronized php-1.39.0-wmf.19/extensions/CirrusSearch/includes/Job/ElasticaWrite.php: Backport: [[gerrit:811280|job queue: Squelch errors related to unwritable cloudelastic (T309648)]] (duration: 03m 43s)

Mentioned in SAL (#wikimedia-operations) [2022-07-05T21:27:36Z] <ebernhardson@deploy1002> Synchronized php-1.39.0-wmf.18/extensions/CirrusSearch/includes/Job/ElasticaWrite.php: Backport: [[gerrit:811279|job queue: Squelch errors related to unwritable cloudelastic (T309648)]] (duration: 03m 37s)

Mentioned in SAL (#wikimedia-operations) [2022-07-05T21:35:20Z] <ebernhardson@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: Revert: [[gerrit:811350|cirrus: Disable commonswiki writes to cloudelastic (T309648)]] (duration: 03m 42s)


Mentioned in SAL (#wikimedia-operations) [2022-07-05T22:28:22Z] <ryankemper> T309648 Manually restarting cloudelastic1006 before proceeding to a normal rolling restart of cloudelastic

Mentioned in SAL (#wikimedia-operations) [2022-07-05T22:48:44Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (2 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - ryankemper@cumin1001 - T309648

Mentioned in SAL (#wikimedia-operations) [2022-07-05T23:15:38Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (2 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart - ryankemper@cumin1001 - T309648

Updated the cloudelastic snapshot repository settings to increase the restore throttle, which by default is 40mb/s:

curl -H 'Content-Type: application/json'  -XPUT https://cloudelastic.wikimedia.org:9243/_snapshot/elastic_snaps -d '{
    "type": "s3",
    "settings": {
        "bucket": "elasticsearch-snapshot",
        "client": "default",
        "path_style_access": "true",
        "endpoint": "https://thanos-swift.discovery.wmnet",
        "max_restore_bytes_per_sec": "512mb"
    }
}'

Started the restore:

$ curl -XPOST -H 'Content-Type: application/json' https://cloudelastic.wikimedia.org:9243/_snapshot/elastic_snaps/snapshot_t309648_attempt_4/_restore -d '{
    "indices": "commonswiki_file_1647921177",
    "include_global_state": false
}'
{"accepted":true} :)

On restore the index uses the same settings as the source cluster, which means 2 replicas instead of the 1 that cloudelastic expects. Updated to match expectations:

$ curl -H 'Content-Type: application/json' -XPUT https://cloudelastic.wikimedia.org:9243/commonswiki_file/_settings -d '{"index":{"auto_expand_replicas": "0-1"}}'                         
{"acknowledged":true} 

Recovery failed, similar errors to last time. Using shardId 14 as an example:

[commonswiki_file_1647921177/kwe-zy2bRpS6EaXeNmvoSA][[commonswiki_file_1647921177][14]]
Caused by: org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 1073741824; received: 905052160

Elasticsearch tries multiple times for each shard. Suspiciously, it fails (for the same shardId) with the same amount of expected data but a differing amount of data received each time. This suggests that we are getting prematurely closed connections between elasticsearch and thanos-swift. Not clear why yet.

Caused by: java.io.IOException: Premature end of Content-Length delimited message body (expected: 1073741824; received: 704643072
Caused by: java.io.IOException: Premature end of Content-Length delimited message body (expected: 1073741824; received: 1019478016
Caused by: org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 1073741824; received: 914489344
Caused by: java.io.IOException: Premature end of Content-Length delimited message body (expected: 1073741824; received: 1028653056

Reviewed the cluster overview dashboards for thanos; one suspicious metric is that disk utilization per host on the thanos-be100[1234] hosts goes from ~50% to 100% as soon as the restore starts. Notably, only 1 disk (out of 14) goes that high, but perhaps our user account maps to a particular disk or some such. Unsure. It's only a weak guess, but I'm changing some settings to be gentler on thanos and trying the restore again:

$ curl -XDELETE https://cloudelastic.wikimedia.org:9243/commonswiki_file_1647921177
{"acknowledged":true}
updating [cluster.routing.allocation.node_concurrent_outgoing_recoveries] from [8] to [1]
updating [cluster.routing.allocation.node_concurrent_incoming_recoveries] from [8] to [1]
$ curl -XPUT -H 'Content-Type: application/json' https://cloudelastic.wikimedia.org:9243/_cluster/settings -d '{"persistent": {"cluster.routing.allocation.node_concurrent_recoveries": 1}}' 
{"acknowledged":true,"persistent":{"cluster":{"routing":{"allocation":{"node_concurrent_recoveries":"1"}}}},"transient":{}}
$ curl -H 'Content-Type: application/json'  -XPUT https://cloudelastic.wikimedia.org:9243/_snapshot/elastic_snaps -d '{
>     "type": "s3",
>     "settings": {
>         "bucket": "elasticsearch-snapshot",
>         "client": "default",
>         "path_style_access": "true",
>         "endpoint": "https://thanos-swift.discovery.wmnet",
>         "max_restore_bytes_per_sec": "40mb"
>     }
> }'
{"acknowledged":true} 

Reviewed _cluster/settings and reset things I'd previously tuned back to the defaults. Not clear this will help, but I'm trying to push thanos-swift less hard to see if anything starts working. Most of these settings were aimed at improving cluster-restart speed and how long it takes to shuffle shards around the cluster post-restart.

updating [cluster.routing.allocation.cluster_concurrent_rebalance] from [8] to [2]
$ curl -XPUT -H 'Content-Type: application/json' https://cloudelastic.wikimedia.org:9243/_cluster/settings -d '{"transient":{"cluster.routing.allocation.cluster_concurrent_rebalance": null}}'
{"acknowledged":true,"persistent":{},"transient":{}} 
updating [indices.recovery.max_concurrent_file_chunks] from [4] to [1]
$ curl -H 'Content-Type: application/json'  -XPUT https://cloudelastic.wikimedia.org:9243/_cluster/settings -d '{"transient":{"indices.recovery.max_concurrent_file_chunks": null}}'
{"acknowledged":true,"persistent":{},"transient":{}}

So far this is unsuccessful. Attempting a restore reports failing shards within 10 minutes. Additionally, even though we told it to only perform 1 restore per node, it starts up 18 (= 3 per node) shard recoveries in parallel. As a next step I'm deleting the existing snapshots and will update the snapshot settings to include chunk_size: 1gb. I have no particular proof this will help, but I'm guessing that if the problem is http connections closing unexpectedly, maybe having smaller files to transfer will help out.
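Re-registering the repository with that setting would look roughly like the earlier repository updates (sketch only, run on the cluster taking the snapshot; chunk_size is a standard repository-s3 setting):

curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_snapshot/elastic_snaps -d '{
    "type": "s3",
    "settings": {
        "bucket": "elasticsearch-snapshot",
        "client": "default",
        "path_style_access": "true",
        "endpoint": "https://thanos-swift.discovery.wmnet",
        "chunk_size": "1gb"
    }
}'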

One other suspicious piece: the cluster-overview dashboard for the eqiad thanos cluster showed significantly increased disk utilization, up to a solid 100% on one of the disks for all 4 backend nodes, starting at about 23:30, which is the same time we started the restore. There was a similar spike yesterday at around the same time, but not every day for the past 7 days. Planning to wait until thanos has settled down before attempting to snapshot+restore again.

Per this morning's conversation with @EBernhardson , we are going to shift our focus away from the restore just long enough to upgrade our production clusters to bullseye. As I write this, we are reimaging cloudelastic to bullseye.
See https://phabricator.wikimedia.org/T309343 for more details.