Restart elasticsearch clusters to apply readahead changes
Closed, Resolved | Public | 3 Estimated Story Points

Description

Overview

We performed a round of restarts following https://gerrit.wikimedia.org/r/c/operations/puppet/+/632319, but later uncovered an issue that prevented the readahead changes from taking effect. We need to restart again so that the readahead-shrinking change is actually applied.

AC

  • Restart performed on all relevant elasticsearch nodes (relforge, cloudelastic, eqiad, codfw)
    • relforge
    • cloudelastic
    • eqiad
    • codfw
  • Verified new readahead values are taking effect
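One way to carry out the verification step above is to read the kernel's per-device readahead setting from sysfs. This is only a sketch under stated assumptions: the helper names are hypothetical, the expected value is whatever the puppet change is meant to set (it is not stated in this task), and `/sys/block/<device>/queue/read_ahead_kb` is the standard Linux attribute rather than anything specific to this deployment.

```python
from pathlib import Path


def read_readahead_kb(sys_block=Path("/sys/block")):
    """Return {device: read_ahead_kb} for every block device under sys_block."""
    values = {}
    for attr in sorted(Path(sys_block).glob("*/queue/read_ahead_kb")):
        # attr is <sys_block>/<device>/queue/read_ahead_kb
        device = attr.parent.parent.name
        values[device] = int(attr.read_text().strip())
    return values


def find_mismatched_readahead(expected_kb, sys_block=Path("/sys/block")):
    """Return only the devices whose readahead differs from expected_kb.

    expected_kb is an assumption supplied by the caller; this task does
    not state the target value.
    """
    return {
        dev: kb
        for dev, kb in read_readahead_kb(sys_block).items()
        if kb != expected_kb
    }
```

Calling `find_mismatched_readahead(16)` on a host (16 being a placeholder, not the real target) would list the devices still using a different readahead. Devices that the udev rule is not meant to cover, such as loop devices, would also show up and need to be filtered by hand.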

DEPLOY NOTES

We'll deploy the production cirrus clusters (eqiad + codfw) first, and then cloudelastic and relforge later.

This is because the readahead udev rule doesn't work on cloudelastic100[5,6], which use a different partition. (See https://phabricator.wikimedia.org/T265699 to track progress on that fix.)

Event Timeline

CBogen set the point value for this task to 3. Oct 26 2020, 6:50 PM

Mentioned in SAL (#wikimedia-operations) [2020-10-28T02:58:43Z] <ryankemper> T266492 Beginning rolling restart of codfw cirrus cluster, 3 nodes at a time, on ryankemper@cumin2001 tmux session elasticsearch_restart_codfw

Mentioned in SAL (#wikimedia-operations) [2020-10-28T04:43:45Z] <ryankemper> T266492 Finished rolling restart of codfw cirrus cluster

RKemper updated the task description.
Gehel triaged this task as High priority. Oct 28 2020, 1:29 PM

Mentioned in SAL (#wikimedia-operations) [2020-10-29T01:17:24Z] <ryankemper> T266492 Beginning rolling restart of eqiad cirrus cluster, 3 nodes at a time, on ryankemper@cumin1001 tmux session elasticsearch_restart_eqiad

Mentioned in SAL (#wikimedia-operations) [2021-01-13T07:04:35Z] <ryankemper> T266492 T268779 T265699 Restarting cloudelastic to apply new readahead changes, this will also verify cloudelastic support works in our elasticsearch spicerack code. Only going one node at a time because cloudelastic elasticsearch indices only have 1 replica shard per index.

Mentioned in SAL (#wikimedia-operations) [2021-01-13T22:53:02Z] <ryankemper> T266492 T268779 T265699 Restarting cloudelastic to apply new readahead changes, this will also verify cloudelastic support works in our elasticsearch spicerack code. Only going one node at a time because cloudelastic elasticsearch indices only have 1 replica shard per index

Mentioned in SAL (#wikimedia-operations) [2021-01-13T22:53:09Z] <ryankemper> T266492 T268779 T265699 sudo -i cookbook sre.elasticsearch.rolling-restart cloudelastic "cloudelastic cluster restart" --task-id T266492 --nodes-per-run 1
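
The reasoning behind `--nodes-per-run 1` can be sketched as follows. With one replica, each shard has two copies, so restarting two nodes together risks taking both copies of some shard offline. The snippet is a simplified illustration and not the spicerack cookbook's logic: the shard-to-node map, the node names, and the function name are all hypothetical.

```python
from itertools import combinations


def unsafe_batches(shard_copies, nodes, nodes_per_run):
    """Return node batches that would take every copy of some shard offline.

    shard_copies maps a shard name to the set of nodes holding a copy
    (primary plus replicas).
    """
    unsafe = []
    for batch in combinations(sorted(nodes), nodes_per_run):
        down = set(batch)
        # A batch is unsafe if all copies of any shard live on nodes in it.
        if any(copies <= down for copies in shard_copies.values()):
            unsafe.append(batch)
    return unsafe


# Hypothetical three-node cluster, one replica per shard (two copies each).
nodes = {"node-a", "node-b", "node-c"}
shard_copies = {
    "index1[0]": {"node-a", "node-b"},
    "index1[1]": {"node-b", "node-c"},
    "index2[0]": {"node-a", "node-c"},
}

# No single node holds both copies of a shard, so one at a time is safe.
print(unsafe_batches(shard_copies, nodes, 1))
# In this layout every pair of nodes holds both copies of some shard.
print(unsafe_batches(shard_copies, nodes, 2))
```

In this toy layout, restarting one node at a time always leaves a copy of every shard available, while any two-node batch does not.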

Mentioned in SAL (#wikimedia-operations) [2021-01-14T00:00:32Z] <ryankemper> T266492 T268779 T265699 Rolling restart of cloudelastic was successful

Mentioned in SAL (#wikimedia-operations) [2021-01-14T00:10:03Z] <ryankemper> T266492 Beginning rolling restart of relforge

Mentioned in SAL (#wikimedia-operations) [2021-01-14T00:13:38Z] <ryankemper> sudo -i cookbook sre.elasticsearch.rolling-restart relforge "relforge cluster restart" --task-id T266492 --nodes-per-run 1 --without-lvs

Mentioned in SAL (#wikimedia-operations) [2021-01-14T00:22:23Z] <ryankemper> T266492 Restart of relforge successful