
Upgrade relforge to elasticsearch 6.8.23
Closed, Resolved | Public | 3 Estimated Story Points

Description

See parent task for more details.

AC:

  • relforge cluster is running elasticsearch 6.8.23

Event Timeline

Change 763479 had a related patch set uploaded (by Gehel; author: Gehel):

[operations/puppet@production] elasticsearch: upgrade deployment-prep to elasticsearch 6.8

https://gerrit.wikimedia.org/r/763479

The current upgrade cookbook needs to be adapted to include the steps that were executed manually on deployment-prep (write access to /etc/elasticsearch is required during the upgrade). The cookbooks also rely on a hardcoded list of servers in Spicerack, which should be addressed in T278378 first.
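For illustration only, the "relax, then restore" idea behind the write-access step could look roughly like the sketch below. It is not the cookbook's actual code: the helper name `relaxed_permissions` and the choice of user and group write bits are assumptions, and it operates on a local path, whereas the real cookbook would run commands on remote hosts.

```python
import os
import stat
from contextlib import contextmanager


@contextmanager
def relaxed_permissions(path, extra_mode=stat.S_IWUSR | stat.S_IWGRP):
    """Temporarily add permission bits to `path`, restoring them on exit.

    Hypothetical helper: the name and the default bits are illustrative.
    """
    # Remember the permission bits as they were before the upgrade step.
    original = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, original | extra_mode)
    try:
        yield
    finally:
        # Restore the original mode even if the upgrade step raised.
        os.chmod(path, original)
```

Under these assumptions, the upgrade step would run inside `with relaxed_permissions("/etc/elasticsearch"):`, so that the directory returns to its original mode whether or not the step succeeds.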

Gehel set the point value for this task to 3. (Mar 7 2022, 4:59 PM)

Change 769109 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/cookbooks@master] elastic: relax & restore perms during upgrade

https://gerrit.wikimedia.org/r/769109

Mentioned in SAL (#wikimedia-operations) [2022-03-09T20:48:13Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - ryankemper@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-09T20:49:21Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - ryankemper@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-09T20:51:15Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - ryankemper@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-09T21:06:06Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - ryankemper@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-09T21:10:11Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - ryankemper@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-09T21:10:14Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - ryankemper@cumin1001 - T301955

Change 769109 merged by jenkins-bot:

[operations/cookbooks@master] elastic: relax & restore perms during upgrade

https://gerrit.wikimedia.org/r/769109

Change 769789 had a related patch set uploaded (by Bking; author: Bking):

[operations/cookbooks@master] elastic: add missing restart flag

https://gerrit.wikimedia.org/r/769789

Change 763479 merged by Razzi:

[operations/puppet@production] elasticsearch: upgrade relforge to elasticsearch 6.8

https://gerrit.wikimedia.org/r/763479

Change 769789 merged by jenkins-bot:

[operations/cookbooks@master] elastic: add missing restart flag

https://gerrit.wikimedia.org/r/769789

Mentioned in SAL (#wikimedia-operations) [2022-03-10T22:02:52Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-10T22:02:56Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-10T22:04:06Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-10T22:04:46Z] <bking@cumin1001> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-10T22:05:53Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-10T22:08:03Z] <bking@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-15T21:55:37Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-15T21:56:47Z] <bking@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955

Change 771072 had a related patch set uploaded (by Bking; author: Bking):

[operations/cookbooks@master] elasticsearch: remove custom restart handling

https://gerrit.wikimedia.org/r/771072

bking updated Other Assignee, removed: bking.

Change 771072 merged by Bking:

[operations/cookbooks@master] elasticsearch: remove custom restart handling

https://gerrit.wikimedia.org/r/771072

Mentioned in SAL (#wikimedia-operations) [2022-03-21T21:45:44Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-21T21:59:15Z] <ryankemper> T301955 Downtimed relforge for 2 days; stuck in yellow status during upgrade b/c replica shards cannot be scheduled to a host of lower elasticsearch version than primary shards. Working on patch for our rolling-operation cookbook to disable replication during operation

Mentioned in SAL (#wikimedia-operations) [2022-03-21T22:26:39Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-21T22:29:12Z] <ryankemper> T301955 Lifted downtime on relforge now that cluster upgrade is complete and cluster is back to green status

RKemper subscribed.

Upgrade complete.

Note that we ran into the following, which we had to work around by manually upgrading the second host:

{"index":"queries_27012021","shard":3,"primary":false,"current_state":"unassigned","unassigned_info":{"reason":"CLUSTER_RECOVERED","at":"2022-03-21T21:46:37.871Z","last_allocation_status":"no_attempt"},"can_allocate":"no","allocate_explanation":"cannot allocate because allocation is not permitted to any of the nodes","node_allocation_decisions":[{"node_id":"E7e7HF1YTvSql8UdZVrLBQ","node_name":"relforge1003-relforge-eqiad","transport_address":"10.64.5.37:9300","node_attributes":{"hostname":"relforge1003","rack":"A2","fqdn":"relforge1003.eqiad.wmnet","row":"A"},"node_decision":"no","deciders":[{"decider":"same_shard","decision":"NO","explanation":"the shard cannot be allocated to the same node on which a copy of the shard already exists [[queries_27012021][3], node[E7e7HF1YTvSql8UdZVrLBQ], [P], s[STARTED], a[id=4DkiEULDRum86eYAs1T9_g]]"}]},{"node_id":"JYN55FKeSpSEuEqGsMzjIA","node_name":"relforge1004-relforge-eqiad","transport_address":"10.64.21.126:9300","node_attributes":{"hostname":"relforge1004","rack":"B2","row":"B","fqdn":"relforge1004.eqiad.wmnet"},"node_decision":"no","deciders":[{"decider":"node_version","decision":"NO","explanation":"cannot allocate replica shard to a node with version [6.5.4] since this is older than the primary version [6.8.23]"}]}]}

This problem is only a hard blocker on relforge because it is a two-host cluster: once one host is upgraded, the only other host is on the older version and cannot accept replicas. Production does not have that constraint. However, the row awareness / allocation constraint will complicate things there, so we'll want to be sure to remove that constraint before we upgrade production.
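As a rough aid for spotting this condition during future upgrades, the sketch below scans an allocation-explain response like the one above and lists the nodes rejected by the `node_version` decider. The function name `version_blocked_nodes` is hypothetical, and the sample is a trimmed copy of the output quoted in this comment, not a fresh API response.

```python
import json


def version_blocked_nodes(explain):
    """Return names of nodes where the node_version decider said NO."""
    blocked = []
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if (decider.get("decider") == "node_version"
                    and decider.get("decision") == "NO"):
                blocked.append(node.get("node_name"))
                break  # one match per node is enough
    return blocked


# Trimmed from the allocation-explain output quoted above.
sample = json.loads("""
{
  "index": "queries_27012021",
  "shard": 3,
  "primary": false,
  "can_allocate": "no",
  "node_allocation_decisions": [
    {"node_name": "relforge1003-relforge-eqiad",
     "deciders": [{"decider": "same_shard", "decision": "NO"}]},
    {"node_name": "relforge1004-relforge-eqiad",
     "deciders": [{"decider": "node_version", "decision": "NO"}]}
  ]
}
""")

print(version_blocked_nodes(sample))
```

On the sample, only relforge1004 is reported, since relforge1003 is rejected by `same_shard` (it already holds the primary) rather than by a version mismatch.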

Mentioned in SAL (#wikimedia-operations) [2022-04-13T13:13:32Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-04-13T13:13:41Z] <bking@cumin2002> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-04-13T13:14:58Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-04-13T13:15:05Z] <bking@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-04-13T13:16:31Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-04-13T13:19:34Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-04-13T14:23:23Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-04-13T14:23:26Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-04-13T14:24:01Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-04-13T14:27:07Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955