
Failover x1 master db1120 -> db1103
Closed, Resolved · Public

Description

When: Thursday 21st, 06:15 AM UTC

  • Team calendar invite

Affected wikis:
The x1 cluster is used by MediaWiki at WMF for databases that are "global" or "cross-wiki" in nature and are typically associated with a MediaWiki extension.

Affected features:

BounceHandler
Cognate
ContentTranslation
Echo
Flow
GrowthExperiments
ReadingLists
UrlShortener
WikimediaEditorTasks

Checklist:

NEW primary: db1103
OLD primary: db1120

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1120.eqiad.wmnet h=db1103.eqiad.wmnet
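pt-config-diff exits non-zero when the two configurations differ, so the comparison above can gate the rest of the procedure on its exit status. A minimal sketch using the same hosts and defaults file as above; the top-level call is left commented out so nothing runs against production when merely sourcing this:

```shell
# Wrapper around the comparison above. pt-config-diff exits 0 when the two
# hosts' configurations match and non-zero when differences are found.
configs_match() {
  sudo pt-config-diff --defaults-file /root/.my.cnf \
    h=db1120.eqiad.wmnet h=db1103.eqiad.wmnet
}

# Example gate (commented out; run manually before proceeding):
# configs_match && echo "configs match, safe to proceed" \
#   || echo "config differences found, review before the failover" >&2
```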

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover x1 T313398" 'A:db-section-x1'
  • Set NEW primary with weight 0 (and depool it from the API or vslow/dump groups if present).
sudo dbctl instance db1103 set-weight 0
sudo dbctl config commit -m "Set db1103 with weight 0 T313398"
  • Topology changes: move all replicas under the NEW primary
sudo db-switchover --timeout=25 --only-slave-move db1120 db1103
  • Disable puppet on both nodes:
sudo cumin 'db1120* or db1103*' 'disable-puppet "primary switchover T313398"'
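After the --only-slave-move step, the topology can be eyeballed from the NEW primary. A hedged sketch, assuming the replicas set report_host (so SHOW SLAVE HOSTS lists them) and that the db-mysql wrapper passes client options such as -BN (bare batch output) through to the client; the call itself is left commented out:

```shell
# List the replicas currently reporting to a given primary.
# SHOW SLAVE HOSTS columns are: Server_id, Host, Port, Master_id;
# awk picks the Host column.
replicas_under() {
  sudo db-mysql "$1" -BN -e "SHOW SLAVE HOSTS" | awk '{ print $2 }'
}

# All x1 replicas should now appear under the NEW primary:
# replicas_under db1103
```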

Failover:

  • Log the failover:
!log Starting x1 eqiad failover from db1120 to db1103 - T313398
  • Set section read-only:
sudo dbctl --scope eqiad section x1 ro "Maintenance until 06:15 UTC - T313398"
sudo dbctl config commit -m "Set x1 eqiad as read-only for maintenance - T313398"
  • Check x1 is indeed read-only
  • Switch primaries:
sudo db-mysql db1120 -e "set global read_only=1"
sudo db-switchover --skip-slave-move db1120 db1103
echo "===== db1120 (OLD)"; sudo db-mysql db1120 -e 'show slave status\G'
echo "===== db1103 (NEW)"; sudo db-mysql db1103 -e 'show slave status\G'
sudo db-mysql db1103 -e "set global read_only=0"
  • Promote the NEW primary in dbctl and remove read-only:
sudo dbctl --scope eqiad section x1 set-master db1103
sudo dbctl config commit -m "Promote db1103 to x1 primary and set section read-write T313398"
  • Re-enable and run puppet on both hosts:
sudo cumin 'db1120* or db1103*' 'run-puppet-agent -e "primary switchover T313398"'
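The read-only checks in the steps above (the "Check x1 is indeed read-only" step and the read_only toggles) can be scripted as a small helper. This is a sketch under the assumption that the db-mysql wrapper accepts MySQL client options such as -BN for bare batch output; the example calls are commented out:

```shell
# Check that a host's global read_only flag matches what this stage of the
# failover expects; prints OK/FAIL and returns non-zero on a mismatch.
check_read_only() {
  local host="$1" expected="$2" actual
  actual=$(sudo db-mysql "$host" -BN -e "SELECT @@global.read_only")
  if [ "$actual" = "$expected" ]; then
    echo "OK: $host read_only=$actual"
  else
    echo "FAIL: $host read_only=$actual (expected $expected)" >&2
    return 1
  fi
}

# Before the switch: check_read_only db1120 1
# After promotion:  check_read_only db1103 0
```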

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db1103 heartbeat -e "delete from heartbeat where file like 'db1120%';"
  • Change events for the query killer:
events_coredb_master.sql on the new primary db1103
events_coredb_slave.sql on the new slave db1120
sudo dbctl instance db1120 set-candidate-master --section x1 true
sudo dbctl instance db1103 set-candidate-master --section x1 false
(dborch1001): sudo orchestrator-client -c untag -i db1103 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db1120 --tag name=candidate
sudo db-mysql db1115 zarcillo -e "select * from masters where section = 'x1';"
  • (If needed): Depool db1120 for maintenance.
sudo dbctl instance db1120 depool
sudo dbctl config commit -m "Depool db1120 T313398"
  • Change db1120's weight to mimic db1103's previous weight:
sudo dbctl instance db1120 edit
  • Apply outstanding schema changes to db1120 (if any)
  • Update/resolve this ticket.
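As a hedged post-cleanup sanity check (same assumption as above about the db-mysql wrapper accepting -BN): after deleting the OLD primary's rows from the heartbeat table, no rows referencing db1120 binlogs should remain.

```shell
# Count leftover heartbeat rows that still reference the OLD primary's
# binlog files; should print 0 after the cleanup step above.
stale_heartbeat_rows() {
  sudo db-mysql db1103 heartbeat -BN \
    -e "SELECT COUNT(*) FROM heartbeat WHERE file LIKE 'db1120%';"
}

# Expect 0:
# stale_heartbeat_rows
```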

Event Timeline

Marostegui added a project: User-notice.
Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2022-07-21T05:14:39Z] <root@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: Primary switchover x1 T313398

Mentioned in SAL (#wikimedia-operations) [2022-07-21T05:14:59Z] <root@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: Primary switchover x1 T313398

Mentioned in SAL (#wikimedia-operations) [2022-07-21T05:17:53Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db1103 with weight 0 T313398', diff saved to https://phabricator.wikimedia.org/P31560 and previous config saved to /var/cache/conftool/dbconfig/20220721-051752-root.json

Change 815830 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1103 to x1 master

https://gerrit.wikimedia.org/r/815830

Change 815831 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Update x1-master CNAME

https://gerrit.wikimedia.org/r/815831

Change 815830 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1103 to x1 master

https://gerrit.wikimedia.org/r/815830

Mentioned in SAL (#wikimedia-operations) [2022-07-21T06:08:56Z] <marostegui> Starting x1 eqiad failover from db1120 to db1103 - T313398

Mentioned in SAL (#wikimedia-operations) [2022-07-21T06:10:01Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db1103 to x1 primary and set section read-write T313398', diff saved to https://phabricator.wikimedia.org/P31564 and previous config saved to /var/cache/conftool/dbconfig/20220721-061001-root.json

Change 815831 merged by Marostegui:

[operations/dns@master] wmnet: Update x1-master CNAME

https://gerrit.wikimedia.org/r/815831

Mentioned in SAL (#wikimedia-operations) [2022-07-21T06:11:45Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1120 T313398', diff saved to https://phabricator.wikimedia.org/P31565 and previous config saved to /var/cache/conftool/dbconfig/20220721-061145-root.json

Marostegui updated the task description.

This was done.

From an editor perspective, would this task have resulted in any lost actions (e.g. If I saved an edit with an Echo @ping during that time, would the Notification have been silently&unavoidably lost)?
Or, would it only have resulted in some very short delays (for things like Echo pings), and (I assume) briefly unavailable features (like GrowthExperiments) ?

(Context: The answer will determine if/how I mention this in Tech News, within the existing line about s7.
E.g. I might add (bold) "... some wikis were in read-only mode and some features unavailable for a few minutes ...")

From an editor perspective, would this task have resulted in any lost actions (e.g. If I saved an edit with an Echo @ping during that time, would the Notification have been silently&unavoidably lost)?
Or, would it only have resulted in some very short delays (for things like Echo pings), and (I assume) briefly unavailable features (like GrowthExperiments) ?

I don't think anything would have been lost; things were either delayed or retried. The downtime was only around 34 seconds, so very, very brief.

Great, thank you. In that case, I don't think this needs a distinct mention in Tech News then. Removing tag accordingly. Have a good weekend. :)

The downtime was around 34 seconds though, so very very brief.

A quick note about 30 seconds being brief. Volunteer-me once deleted fr.wp's main page for a history merge. Between the deletion of the page and its restoration, literally 20 seconds passed. Within those 20 seconds, one person messaged me to ask why I had deleted the main page and to request that it be restored asap.

What reassures me about failovers is that things are not lost, just delayed. But a lot can be noticed in a 30-second interval, as my example shows; we shouldn't forget that. ;)

I do understand your point @Trizek-WMF but I think it's a bit different:

  • Read-only limits writes (editing, etc.). Deleting the main page (while quite an adventure, I admit) heavily impacts reads, and Wikipedia has a massive disparity between reads and writes. Especially given that around 3% of all reads to Wikipedia go to the main page. For example, French Wikipedia had 24K edits yesterday but 1B page views last month. A napkin calculation gives a ratio of roughly 1,408 page views per edit for French Wikipedia.
  • We intentionally chose the time when human editing is at its lowest to minimize the impact. For example, in the whole minute of 6:00 AM UTC yesterday, there were only eight edits on French Wikipedia; six edits the day before, and so on. I admit it's a bit inconvenient for me to wake up at such an hour (especially as a night owl), but I personally suggested that time knowing the implications.
  • It doesn't really break anything. Bots are expected to wait and retry in such cases; pywikibot does it automatically. And neither editor (VE or traditional) would lose edits; it just gives you a message along the lines of "hit the submit button again in a minute". It is a very similar workflow to when your edit token expires (the "session lost" error), which I get far more often than a read-only error (the session-lost error can be reproduced by keeping the edit tab open for too long). And in thirty seconds, if you try again, it's already saved.
  • In SRE practice, we have the concept of an error budget. We have far more edits failing due to intermittent network issues, race conditions, deadlocks, etc. Even if we just take edits and assume 8 fail every day (e.g. read-only taking a full minute and us doing it every day, which we don't), the ratio of failed edits is 0.03%, well within our error budget. A more realistic calculation would be something along the lines of 0.0014%, or 14 ppm. That's extremely low.
  • Last but not least, we have to increase the maintenance of our databases. We are addressing really long-standing high-priority issues (e.g. links normalization, revision migration, etc.) on top of addressing tech debt (e.g. standardizing timestamp fields) on top of regular maintenance (e.g. kernel reboots for security updates, MariaDB upgrades, OS upgrades, random power or network issues, etc.). This year, we did four times as many schema changes as we did last year. Part of this is inevitable: if we don't address the first category (high-priority schema improvements), no one will be able to edit Wikipedia in two or three years. Commons is already in an extremely dangerous state, approaching 2TB.
    • We really have to streamline the process of master switchovers; I have already made it so that the task and patches are created automatically (example: T314369)
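For the curious, the napkin numbers in this comment can be reproduced directly. The figures are taken from the comment itself (~24K edits/day and ~1B page views/month on French Wikipedia, a worst case of 8 failed edits/day); the views-per-edit result lands in the same ballpark as the 1,408 quoted above:

```shell
# Figures quoted in the comment above.
edits_per_day=24000
views_per_month=1000000000
failed_per_day=8

# Page views per edit, approximating a month as 30 days (~1,400:1).
awk -v v="$views_per_month" -v e="$edits_per_day" \
  'BEGIN { printf "page views per edit: ~%d\n", v / (e * 30) }'

# Worst-case failed-edit ratio (~0.03%).
awk -v f="$failed_per_day" -v e="$edits_per_day" \
  'BEGIN { printf "worst-case failed-edit ratio: %.2f%%\n", f / e * 100 }'
```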

P.S. Regarding deleting the main page, a similar story from English Wikipedia:

One admin was discussing the deletion of the main page in IRC and asked if the technical ability to delete pages with over 5,000 revisions, like the main page, had ever been re-enabled. Another admin (jokingly) commented that he had tested it and found that the main page still couldn't be deleted. The first admin thought he would test it for himself. The main page got deleted.[1][a]