Switchover es5 master from es1023 to es1024
Closed, Resolved, Public

Description

Let's switch the es5 master over from es1023 to es1024, which is already running Buster and MariaDB 10.4.

The idea is to move writes to es4 for a few minutes and then do the switchover on es5 master.

Steps:

  • Give weight 50 to es1024
  • Disable alerts on es5 hosts
  • switchover.py --timeout=15 --only-slave-move es1023.eqiad.wmnet es1024.eqiad.wmnet
  • Disable puppet es1024 and es1023
  • Merge puppet change to promote es1024 to master https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/607236/
  • Disable writes for es5 on MW: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/606663/
  • !log Starting es5 failover from es1023 to es1024
  • ./switchover.py --skip-slave-move es1023 es1024
  • Depool es1023 entirely
  • First 10.4 master, so let's double check that all the slaves are well connected.
  • Leave replication stopped on es1023: it will still be running 10.1 while its master runs 10.4, so reimage it before starting replication.
  • dbctl --scope eqiad section es5 set-master es1024
  • Enable and run puppet on es1023 and es1024
  • events_coredb_master.sql on the new master es1024
  • events_coredb_slave.sql on the new slave es1023
  • Revert the above patch to make es5 writable again.
  • Change es5-master DNS https://gerrit.wikimedia.org/r/c/operations/dns/+/609899/
  • Disable notifications on es1023
  • Reimage es1023
  • Slowly repool es1023 and remove weight from es1024
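
The command-line steps above can be sketched as a dry-run checklist. This is only an illustration, not the tooling actually used: the `Step` class and the `es5_switchover_plan` and `render` helpers are hypothetical names, the list is an abbreviated subset of the steps, and the script only prints the commands quoted in the description without running anything.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One checklist entry; command is None for manual steps."""
    description: str
    command: Optional[str] = None


def es5_switchover_plan(old: str, new: str,
                        section: str = "es5", dc: str = "eqiad") -> List[Step]:
    """Build an abbreviated, ordered plan using the commands quoted in the task."""
    def fqdn(host: str) -> str:
        return f"{host}.{dc}.wmnet"

    return [
        Step("Move replicas under the new master",
             f"switchover.py --timeout=15 --only-slave-move {fqdn(old)} {fqdn(new)}"),
        Step(f"Disable writes for {section} in the MediaWiki config"),
        Step("Switch the master role",
             f"./switchover.py --skip-slave-move {old} {new}"),
        Step("Record the new master in dbctl",
             f"dbctl --scope {dc} section {section} set-master {new}"),
        Step(f"Revert the MediaWiki config change to make {section} writable again"),
    ]


def render(plan: List[Step]) -> str:
    """Format the plan as numbered lines; nothing is executed."""
    lines = []
    for number, step in enumerate(plan, 1):
        suffix = f": {step.command}" if step.command else " (manual)"
        lines.append(f"{number}. {step.description}{suffix}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(es5_switchover_plan("es1023", "es1024")))
```

Keeping the plan as data makes it easy to review the order before the maintenance window; the manual steps (gerrit merges, puppet, reimage) would still need to be done by hand.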

@jcrespo I would appreciate a review of the above simplified steps

Date & time: Tuesday 7th July at 05:00 AM UTC

Event Timeline

Marostegui triaged this task as Medium priority. (Jun 18 2020, 9:58 AM)
Marostegui moved this task from Triage to Pending comment on the DBA board.

We cannot put es5 in RO from MW

We used to be able to do it: we could take it out of the write rotation and write only to es4. That allows a switchover without read-only mode or any impact on the application (make es1, es2, es3 and es5 read-only, leave only es4 as r/w). I don't know how that is done now with the new schema; you should ask the dbctl/MW experts.

@CDanis can you confirm whether es5 can be set to RO with: dbctl --scope eqiad section es5 ro "Maintenance on es5" and then reverted with: dbctl --scope eqiad section es5 rw?
I don't recall if we've ever done an es switchover with dbctl yet!

As per the chat yesterday on -databases, I am going to assume we cannot do it and we need to go via MW to disable es5 momentarily, make everything write to es4, do the switchover and then revert that change.
Going to change the steps above accordingly

Change 606663 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool cluster27 (es5) from writes.

https://gerrit.wikimedia.org/r/606663

Looks like you already figured this out, but just commenting to confirm: a config deploy editing $wgDefaultExternalStore and 'templateOverridesByCluster' => in db-eqiad.php will be needed.

We could make dbctl (or rather the glue code that translates dbctl JSON into Mediawiki data structures) handle this; I don't think it would be too hard.

We are not fully sure about templateOverridesByCluster though. The last time we set an es section to RO we didn't change it there, we only changed $wgDefaultExternalStore, see: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/454210/
This is the patch I have put up for review, which does exactly that: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/606663/
However, I would like @aaron, @Krinkle or @tstarling to confirm whether we also need to set 'is static' => true.
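
As a rough illustration of what depooling a cluster from writes amounts to, here is a small Python model of the idea (the real change is a PHP edit to db-eqiad.php). It assumes the write targets are a list of store URLs of the form DB://clusterNN; cluster27 for es5 comes from the patch title, while cluster26 for es4 and the `writable_stores` helper are assumptions made for this sketch.

```python
from typing import Iterable, List


def writable_stores(default_stores: Iterable[str],
                    depooled_clusters: Iterable[str]) -> List[str]:
    """Return the store URLs that remain write targets after depooling.

    default_stores models $wgDefaultExternalStore as a list of
    'DB://<cluster>' strings (an assumption about the format).
    """
    depooled = {f"DB://{cluster}" for cluster in depooled_clusters}
    remaining = [store for store in default_stores if store not in depooled]
    if not remaining:
        # New blobs must always have somewhere to go.
        raise ValueError("refusing to depool every writable external store cluster")
    return remaining


if __name__ == "__main__":
    # Hypothetical starting point: es4 (cluster26) and es5 (cluster27) both writable.
    stores = ["DB://cluster26", "DB://cluster27"]
    print(writable_stores(stores, ["cluster27"]))
```

The guard against an empty result reflects the point made earlier in the thread: the switchover is only transparent if at least one section (here es4) stays read-write.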

@jcrespo let's do this Tuesday 30th at 05:00 AM UTC?

Change 607236 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote es1024 to es5 master

https://gerrit.wikimedia.org/r/607236

Moved this to Tuesday 7th July at 05:00 AM UTC, as I will be off on the 1st of July and I want to keep an eye on things after the switchover and over the following days.

Mentioned in SAL (#wikimedia-operations) [2020-07-06T13:26:34Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Reduce es1024 weight in preparation for tomorrow's switchover T255755', diff saved to https://phabricator.wikimedia.org/P11750 and previous config saved to /var/cache/conftool/dbconfig/20200706-132634-marostegui.json

Change 607236 merged by Marostegui:
[operations/puppet@production] mariadb: Promote es1024 to es5 master

https://gerrit.wikimedia.org/r/607236

Change 606663 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool cluster27 (es5) from writes.

https://gerrit.wikimedia.org/r/606663

Mentioned in SAL (#wikimedia-operations) [2020-07-07T05:02:35Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Disable es5 writes T255755 (duration: 00m 56s)

Mentioned in SAL (#wikimedia-operations) [2020-07-07T05:12:36Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote es1024 to es5 master T255755', diff saved to https://phabricator.wikimedia.org/P11758 and previous config saved to /var/cache/conftool/dbconfig/20200707-051236-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-07-07T05:16:21Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool es1023 entirely T255755', diff saved to https://phabricator.wikimedia.org/P11759 and previous config saved to /var/cache/conftool/dbconfig/20200707-051620-marostegui.json

Change 609899 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Update es5-master alias

https://gerrit.wikimedia.org/r/609899

Mentioned in SAL (#wikimedia-operations) [2020-07-07T05:26:37Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Enable es5 writes T255755 (duration: 00m 56s)
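
For reference, the time es5 spent out of the write rotation can be estimated from the two "Synchronized wmf-config/db-eqiad.php" SAL timestamps above. This is only an approximation: each sync is reported as taking 56 seconds, and it is an assumption here that the logged times are close enough to when the change took effect. The `window` helper is a hypothetical name.

```python
from datetime import datetime, timedelta

SAL_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def window(start: str, end: str) -> timedelta:
    """Elapsed time between two SAL timestamps (both UTC)."""
    return datetime.strptime(end, SAL_FORMAT) - datetime.strptime(start, SAL_FORMAT)


writes_disabled = "2020-07-07T05:02:35Z"  # "Disable es5 writes" sync
writes_enabled = "2020-07-07T05:26:37Z"   # "Enable es5 writes" sync

delta = window(writes_disabled, writes_enabled)
print(delta)  # 0:24:02
```

So es5 was excluded from writes for roughly 24 minutes, during which new writes were meant to go to es4 only.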

Change 609899 merged by Marostegui:
[operations/dns@master] wmnet: Update es5-master alias

https://gerrit.wikimedia.org/r/609899

Change 609900 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] es1023: Disable notifications

https://gerrit.wikimedia.org/r/609900

Change 609900 merged by Marostegui:
[operations/puppet@production] es1023: Disable notifications

https://gerrit.wikimedia.org/r/609900

Change 609904 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Reimage es1023 to Buster

https://gerrit.wikimedia.org/r/609904

Change 609904 merged by Marostegui:
[operations/puppet@production] install_server: Reimage es1023 to Buster

https://gerrit.wikimedia.org/r/609904

I have written some documentation about failing over es hosts, as it is slightly different from the normal sX failovers: https://wikitech.wikimedia.org/wiki/MariaDB#External_store_section_failover_checklist

Mentioned in SAL (#wikimedia-operations) [2020-07-07T06:29:28Z] <marostegui> Reimage es1023 to Buster T255755

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['es1023.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007070632_marostegui_1791.log.

Change 609914 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Do not format es10[12]* and es20[12]*

https://gerrit.wikimedia.org/r/609914

Change 609914 merged by Marostegui:
[operations/puppet@production] install_server: Do not format es10[12]* and es20[12]*

https://gerrit.wikimedia.org/r/609914

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['es1023.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007070720_marostegui_11923.log.

Completed auto-reimage of hosts:

['es1023.eqiad.wmnet']

and were ALL successful.

Change 609969 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] es1023: Enable notifications

https://gerrit.wikimedia.org/r/609969

Change 609969 merged by Marostegui:
[operations/puppet@production] es1023: Enable notifications

https://gerrit.wikimedia.org/r/609969

Mentioned in SAL (#wikimedia-operations) [2020-07-07T08:19:12Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool es1023 after reimage T255755', diff saved to https://phabricator.wikimedia.org/P11768 and previous config saved to /var/cache/conftool/dbconfig/20200707-081909-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-07-07T08:31:45Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool es1023 after reimage T255755', diff saved to https://phabricator.wikimedia.org/P11769 and previous config saved to /var/cache/conftool/dbconfig/20200707-083144-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-07-07T09:10:15Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly repool es1023 after reimage T255755', diff saved to https://phabricator.wikimedia.org/P11770 and previous config saved to /var/cache/conftool/dbconfig/20200707-091015-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-07-07T09:23:57Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Fully repool es1023 after reimage T255755', diff saved to https://phabricator.wikimedia.org/P11771 and previous config saved to /var/cache/conftool/dbconfig/20200707-092357-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-07-07T09:26:35Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Remove weight from es1024 as it is the current master T255755', diff saved to https://phabricator.wikimedia.org/P11772 and previous config saved to /var/cache/conftool/dbconfig/20200707-092635-marostegui.json
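
The gradual repool above took four dbctl commits. The actual percentages used are not recorded in this task, so the following is only a sketch of one way to derive an evenly spaced ramp; the `repool_schedule` helper and the resulting values are hypothetical.

```python
from typing import List


def repool_schedule(steps: int) -> List[int]:
    """Evenly spaced pooling percentages ending at 100."""
    if steps < 1:
        raise ValueError("at least one step is required")
    return [round(100 * i / steps) for i in range(1, steps + 1)]


if __name__ == "__main__":
    # Four steps, matching the number of repool commits in the timeline.
    print(repool_schedule(4))  # [25, 50, 75, 100]
```

In practice the wait between steps matters as much as the percentages, since a freshly reimaged host needs time to warm up before taking full traffic.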

Marostegui updated the task description.

Everything was done. Thanks everyone for helping out!