
Failover x1 master db1120 -> db1103
Closed, Resolved · Public

Description

When: Thursday 21st, 06:15 AM UTC

  • Team calendar invite

Affected wikis:
The x1 cluster is used by MediaWiki at WMF for databases that are "global" or "cross-wiki" in nature and are typically associated with a MediaWiki extension.

Affected features:

BounceHandler
Cognate
ContentTranslation
Echo
Flow
GrowthExperiments
ReadingLists
UrlShortener
WikimediaEditorTasks

Checklist:

NEW primary: db1103
OLD primary: db1120

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1120.eqiad.wmnet h=db1103.eqiad.wmnet
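pt-config-diff exits non-zero when the two configurations differ, so the comparison above can gate the rest of the procedure on its exit status. A minimal sketch using the same hosts and defaults file as above; the top-level call is left commented out so nothing runs against production when merely sourcing this:

```shell
# Wrapper around the comparison above. pt-config-diff exits 0 when the two
# hosts' configurations match and non-zero when differences are found.
configs_match() {
  sudo pt-config-diff --defaults-file /root/.my.cnf \
    h=db1120.eqiad.wmnet h=db1103.eqiad.wmnet
}

# Example gate (commented out; run manually before proceeding):
# configs_match && echo "configs match, safe to proceed" \
#   || echo "config differences found, review before the failover" >&2
```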

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover x1 T313398" 'A:db-section-x1'
  • Set NEW primary with weight 0 (and depool it from the API or vslow/dump groups if present).
sudo dbctl instance db1103 set-weight 0
sudo dbctl config commit -m "Set db1103 with weight 0 T313398"
  • Topology changes: move all replicas under the NEW primary
sudo db-switchover --timeout=25 --only-slave-move db1120 db1103
  • Disable puppet on both nodes:
sudo cumin 'db1120* or db1103*' 'disable-puppet "primary switchover T313398"'
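After the --only-slave-move step, the topology can be eyeballed from the NEW primary. A hedged sketch, assuming the replicas set report_host (so SHOW SLAVE HOSTS lists them) and that the db-mysql wrapper passes client options such as -BN (bare batch output) through to the client; the call itself is left commented out:

```shell
# List the replicas currently reporting to a given primary.
# SHOW SLAVE HOSTS columns are: Server_id, Host, Port, Master_id;
# awk picks the Host column.
replicas_under() {
  sudo db-mysql "$1" -BN -e "SHOW SLAVE HOSTS" | awk '{ print $2 }'
}

# All x1 replicas should now appear under the NEW primary:
# replicas_under db1103
```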

Failover:

  • Log the failover:
!log Starting x1 eqiad failover from db1120 to db1103 - T313398
  • Set section read-only:
sudo dbctl --scope eqiad section x1 ro "Maintenance until 06:15 UTC - T313398"
sudo dbctl config commit -m "Set x1 eqiad as read-only for maintenance - T313398"
  • Check x1 is indeed read-only
  • Switch primaries:
sudo db-mysql db1120 -e "set global read_only=1"
sudo db-switchover --skip-slave-move db1120 db1103
echo "===== db1120 (OLD)"; sudo db-mysql db1120 -e 'show slave status\G'
echo "===== db1103 (NEW)"; sudo db-mysql db1103 -e 'show slave status\G'
sudo db-mysql db1103 -e "set global read_only=0"
  • Promote the NEW primary in dbctl and remove read-only:
sudo dbctl --scope eqiad section x1 set-master db1103
sudo dbctl config commit -m "Promote db1103 to x1 primary and set section read-write T313398"
  • Re-enable and run puppet on both hosts:
sudo cumin 'db1120* or db1103*' 'run-puppet-agent -e "primary switchover T313398"'
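The read-only checks in the steps above (the "Check x1 is indeed read-only" step and the read_only toggles) can be scripted as a small helper. This is a sketch under the assumption that the db-mysql wrapper accepts MySQL client options such as -BN for bare batch output; the example calls are commented out:

```shell
# Check that a host's global read_only flag matches what this stage of the
# failover expects; prints OK/FAIL and returns non-zero on a mismatch.
check_read_only() {
  local host="$1" expected="$2" actual
  actual=$(sudo db-mysql "$host" -BN -e "SELECT @@global.read_only")
  if [ "$actual" = "$expected" ]; then
    echo "OK: $host read_only=$actual"
  else
    echo "FAIL: $host read_only=$actual (expected $expected)" >&2
    return 1
  fi
}

# Before the switch: check_read_only db1120 1
# After promotion:  check_read_only db1103 0
```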

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db1103 heartbeat -e "delete from heartbeat where file like 'db1120%';"
  • Change events for the query killer:
events_coredb_master.sql on the new primary db1103
events_coredb_slave.sql on the new slave db1120
sudo dbctl instance db1120 set-candidate-master --section x1 true
sudo dbctl instance db1103 set-candidate-master --section x1 false
(dborch1001): sudo orchestrator-client -c untag -i db1103 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db1120 --tag name=candidate
sudo db-mysql db1115 zarcillo -e "select * from masters where section = 'x1';"
  • (If needed): Depool db1120 for maintenance.
sudo dbctl instance db1120 depool
sudo dbctl config commit -m "Depool db1120 T313398"
  • Change db1120's weight to mimic db1103's previous weight:
sudo dbctl instance db1120 edit
  • Apply outstanding schema changes to db1120 (if any)
  • Update/resolve this ticket.
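As a hedged post-cleanup sanity check (same assumption as above about the db-mysql wrapper accepting -BN): after deleting the OLD primary's rows from the heartbeat table, no rows referencing db1120 binlogs should remain.

```shell
# Count leftover heartbeat rows that still reference the OLD primary's
# binlog files; should print 0 after the cleanup step above.
stale_heartbeat_rows() {
  sudo db-mysql db1103 heartbeat -BN \
    -e "SELECT COUNT(*) FROM heartbeat WHERE file LIKE 'db1120%';"
}

# Expect 0:
# stale_heartbeat_rows
```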

Event Timeline

Marostegui added a project: User-notice.
Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2022-07-21T05:14:39Z] <root@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: Primary switchover x1 T313398

Mentioned in SAL (#wikimedia-operations) [2022-07-21T05:14:59Z] <root@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: Primary switchover x1 T313398

Mentioned in SAL (#wikimedia-operations) [2022-07-21T05:17:53Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db1103 with weight 0 T313398', diff saved to https://phabricator.wikimedia.org/P31560 and previous config saved to /var/cache/conftool/dbconfig/20220721-051752-root.json

Change 815830 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1103 to x1 master

https://gerrit.wikimedia.org/r/815830

Change 815831 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Update x1-master CNAME

https://gerrit.wikimedia.org/r/815831

Change 815830 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1103 to x1 master

https://gerrit.wikimedia.org/r/815830

Mentioned in SAL (#wikimedia-operations) [2022-07-21T06:08:56Z] <marostegui> Starting x1 eqiad failover from db1120 to db1103 - T313398

Mentioned in SAL (#wikimedia-operations) [2022-07-21T06:10:01Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db1103 to x1 primary and set section read-write T313398', diff saved to https://phabricator.wikimedia.org/P31564 and previous config saved to /var/cache/conftool/dbconfig/20220721-061001-root.json

Change 815831 merged by Marostegui:

[operations/dns@master] wmnet: Update x1-master CNAME

https://gerrit.wikimedia.org/r/815831

Mentioned in SAL (#wikimedia-operations) [2022-07-21T06:11:45Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1120 T313398', diff saved to https://phabricator.wikimedia.org/P31565 and previous config saved to /var/cache/conftool/dbconfig/20220721-061145-root.json

Marostegui updated the task description.

This was done.

From an editor perspective, would this task have resulted in any lost actions (e.g. If I saved an edit with an Echo @ping during that time, would the Notification have been silently&unavoidably lost)?
Or, would it only have resulted in some very short delays (for things like Echo pings), and (I assume) briefly unavailable features (like GrowthExperiments) ?

(Context: The answer will determine if/how I mention this in Tech News, within the existing line about s7.
E.g. I might add (bold) "... some wikis were in read-only mode and some features unavailable for a few minutes ...")

From an editor perspective, would this task have resulted in any lost actions (e.g. If I saved an edit with an Echo @ping during that time, would the Notification have been silently&unavoidably lost)?
Or, would it only have resulted in some very short delays (for things like Echo pings), and (I assume) briefly unavailable features (like GrowthExperiments) ?

I don't think anything would have been lost; things were either delayed or retried. The downtime was only around 34 seconds, so very, very brief.

Great, thank you. In that case, I don't think this needs a distinct mention in Tech News then. Removing tag accordingly. Have a good weekend. :)

The downtime was around 34 seconds though, so very very brief.

A quick note about 30 seconds being brief. Volunteer-me once deleted fr.wp's main page for a history merge. Between the deletion of the page and its restoration, literally 20 seconds passed. Within those 20 seconds, one person messaged me to ask why I had deleted the main page and to request that it be restored asap.

What reassures me about failovers is that things are not lost, just delayed. But a lot can be noticed in a 30-second interval, as my example shows; we shouldn't forget that. ;)

I do understand your point @Trizek-WMF but I think it's a bit different:

  • Read-only limits writes (editing, etc.). Deleting the main page (while quite an adventure, I admit) heavily impacts reads, and Wikipedia has a massive disparity between reads and writes. Especially given that around 3% of all reads to Wikipedia go to the main page. For example, French Wikipedia had 24K edits yesterday but 1B page views last month. A napkin calculation gives a ratio of roughly 1,408 page views per edit for French Wikipedia.
  • We intentionally chose the time when human editing is at its lowest to minimize the impact. For example, in the whole minute of 6:00 AM UTC yesterday, there were only eight edits on French Wikipedia; six edits the day before, and so on. I admit it's a bit inconvenient for me to wake up at such an hour (especially as a night owl), but I personally suggested that time knowing the implications.
  • It doesn't really break anything. Bots are expected to wait and retry in such cases; pywikibot does it automatically. And neither editor (VE or traditional) would lose edits; it just gives you a message along the lines of "hit the submit button again in a minute". It is a very similar workflow to when your edit token expires (the "session lost" error), which I get far more often than a read-only error (the session-lost error can be reproduced by keeping the edit tab open for too long). And in thirty seconds, if you try again, it's already saved.
  • In SRE practice, we have the concept of an error budget. We have far more edits failing due to intermittent network issues, race conditions, deadlocks, etc. Even if we just take edits and assume 8 fail every day (e.g. read-only taking a full minute and us doing it every day, which we don't), the ratio of failed edits is 0.03%, well within our error budget. A more realistic calculation would be something along the lines of 0.0014%, or 14 ppm. That's extremely low.
  • Last but not least, we have to increase the maintenance of our databases. We are addressing really long-standing high-priority issues (e.g. links normalization, revision migration, etc.) on top of addressing tech debt (e.g. standardizing timestamp fields) on top of regular maintenance (e.g. kernel reboots for security updates, MariaDB upgrades, OS upgrades, random power or network issues, etc.). This year, we did four times as many schema changes as we did last year. Part of this is inevitable: if we don't address the first category (high-priority schema improvements), no one will be able to edit Wikipedia in two or three years. Commons is already in an extremely dangerous state, approaching 2TB.
    • We really have to streamline the process of master switchovers; I have already made it so that the task and patches are created automatically (example: T314369)
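For the curious, the napkin numbers in this comment can be reproduced directly. The figures are taken from the comment itself (~24K edits/day and ~1B page views/month on French Wikipedia, a worst case of 8 failed edits/day); the views-per-edit result lands in the same ballpark as the 1,408 quoted above:

```shell
# Figures quoted in the comment above.
edits_per_day=24000
views_per_month=1000000000
failed_per_day=8

# Page views per edit, approximating a month as 30 days (~1,400:1).
awk -v v="$views_per_month" -v e="$edits_per_day" \
  'BEGIN { printf "page views per edit: ~%d\n", v / (e * 30) }'

# Worst-case failed-edit ratio (~0.03%).
awk -v f="$failed_per_day" -v e="$edits_per_day" \
  'BEGIN { printf "worst-case failed-edit ratio: %.2f%%\n", f / e * 100 }'
```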

P.S. Regarding deleting the main page, a similar story from English Wikipedia:

One admin was discussing the deletion of the main page in IRC and asked if the technical ability to delete pages with over 5,000 revisions, like the main page, had ever been re-enabled. Another admin (jokingly) commented that he had tested it and found that the main page still couldn't be deleted. The first admin thought he would test it for himself. The main page got deleted.[1][a]