Prepare db1018 and s2-slaves for s2 master failover
Closed, ResolvedPublic

Event Timeline

jcrespo created this task.Jan 29 2016, 3:13 PM
jcrespo raised the priority of this task from to High.
jcrespo updated the task description.
jcrespo added projects: Operations, DBA.
jcrespo added a subscriber: jcrespo.
Restricted Application added a subscriber: Aklapper.Jan 29 2016, 3:13 PM
jcrespo renamed this task from Prepare db1018 for s2 master failover to Prepare db1018 and s2-slaves for s2 master failover.Jan 29 2016, 3:18 PM
jcrespo set Security to None.

Change 267261 had a related patch set uploaded (by Jcrespo):
Depool db1018 for maintenance

https://gerrit.wikimedia.org/r/267261

Change 267261 merged by Jcrespo:
Depool db1018 for maintenance

https://gerrit.wikimedia.org/r/267261

Change 267294 had a related patch set uploaded (by Jcrespo):
Install Jessie on db1018

https://gerrit.wikimedia.org/r/267294

Change 267294 merged by Jcrespo:
Install Jessie on db1018

https://gerrit.wikimedia.org/r/267294

jcrespo moved this task from Triage to In progress on the DBA board.Jan 29 2016, 11:33 PM

Change 267678 had a related patch set uploaded (by Jcrespo):
Pool db1018; Depool db1054

https://gerrit.wikimedia.org/r/267678

jcrespo added a subscriber: greg.Feb 1 2016, 7:42 PM

Adding @greg, as this (setting the s2 shard as read-only) needs Release Engineering coordination, and sadly we are on a timer here.

  • Decide a date for the maintenance
  • Warn affected users (editors of s2 shard)
  • Stop all colliding maintenance at the time
greg added a subscriber: Johan.Feb 1 2016, 7:48 PM

How bad is the time crunch/when is the latest you would *comfortably* want to do this?

For the warning users part, we'll need some help from @Johan (ping, just heads up for now)

The plan is to make an attempt this week or early next week. Best case scenario: 1-10 seconds in read-only mode, and almost no user-visible impact.

If that doesn't work, we rollback and fix the issues.

Between the 16th and the 21st of February, all edits will fail; a few days before that, maintenance (long-running queries) will fail.

List of wikis affected:

mysql -e "SHOW DATABASES like '%wik%'"
+------------------+
| Database (%wik%) |
+------------------+
| bgwiki           |
| bgwiktionary     |
| cswiki           |
| enwikiquote      |
| enwiktionary     |
| eowiki           |
| fiwiki           |
| idwiki           |
| itwiki           |
| l10nwiki         |
| nlwiki           |
| nowiki           |
| plwiki           |
| ptwiki           |
| svwiki           |
| thwiki           |
| trwiki           |
| zhwiki           |
+------------------+
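For the user-facing announcement, the database names above need to be turned into public hostnames. A minimal Python sketch of that mapping follows. The suffix table and the function name `dbname_to_hostname` are illustrative assumptions, not an existing tool; it only covers the project types that appear in the s2 list, and it assumes `l10nwiki` has no public site.

```python
# Hypothetical helper: map an s2 database name to its public hostname.
# The suffix table is an assumption covering only the projects listed above.
# Longer suffixes come first so "wiktionary" is not matched as plain "wiki".
PROJECT_SUFFIXES = [
    ("wiktionary", "wiktionary.org"),
    ("wikiquote", "wikiquote.org"),
    ("wiki", "wikipedia.org"),
]

# Assumption: databases without a public wiki are skipped.
NON_PUBLIC = {"l10nwiki"}


def dbname_to_hostname(dbname):
    """Return the public hostname for a database name, or None if not public."""
    if dbname in NON_PUBLIC:
        return None
    for suffix, domain in PROJECT_SUFFIXES:
        if dbname.endswith(suffix):
            lang = dbname[: -len(suffix)]
            return f"{lang}.{domain}"
    raise ValueError(f"unrecognised database name: {dbname}")


if __name__ == "__main__":
    for name in ["bgwiki", "bgwiktionary", "enwikiquote", "l10nwiki", "zhwiki"]:
        print(name, "->", dbname_to_hostname(name))
```

Database names with underscores or special-case projects are not handled by this sketch.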

The best time for this would be between 1am and 7am UTC, but that is not a hard requirement: the difference in load is not that big, and it is better to pick a time when there are people around to act in case of problems.

Krenair added a subscriber: Krenair.Feb 1 2016, 8:25 PM

Interesting. I have never noticed that l10nwiki database before. All it contains is two empty tables (localisation and localisation_file_hash).

greg added a comment.Feb 1 2016, 8:30 PM

Looks like the communities of the affected wikis aren't all in one part of the globe (based on my hugely hand-wavy assessment), so whatever time of day is best for you, @jcrespo, is good.

Re which date: Let's do early next week to give people a heads up.

Just for my edification and for the email announcement, the reason for this failover is.... ?

jcrespo added a comment.EditedFeb 1 2016, 8:54 PM

Hardware maintenance (there is an LVM/fs problem that prevents the partition from growing), OS upgrade and mariadb upgrade (5.5 -> 10) on the s2 master server. I have the long version if you want, too :-)

greg added a comment.Feb 1 2016, 9:07 PM

That's good enough for me :)

Tuesday 9 Feb, 23:00 UTC, does that work for anyone on your team?

greg added a comment.Feb 1 2016, 9:30 PM

The hour before afternoon SWAT should be OK, yeah. Anything specific you need from us? We'll all be around during the time, but if you need someone's undivided attention let me know and we can pick someone :)

No, the only complex thing is DBA-specific. I will need just the usual attention as if it was a deployment (logs monitoring for higher rates of errors, etc). I will do all of that, but having an extra pair of eyes familiar with mediawiki and its configuration will be enough.

This should be trivial, and that server has acted as a master for codfw for a few months already, but I must be prepared for rollback should something strange happen. This will eventually happen for all masters, but the first one is the most complicated. By the time we get to the last one I will not even need to notify anybody! :-)

Change 267678 merged by Jcrespo:
Pool db1018; Depool db1021

https://gerrit.wikimedia.org/r/267678

All slaves (except dbstore1001) have been prepared.

Disk is at 2% right now.

Change 268709 had a related patch set uploaded (by Jcrespo):
Depool db1018 for reimaging

https://gerrit.wikimedia.org/r/268709

Change 268709 merged by Jcrespo:
Depool db1018 for reimaging

https://gerrit.wikimedia.org/r/268709

db1018 has been reinstalled with jessie/MariaDB 10.0.23 and it is ready for switchover. The only pending task is to prepare and decide on the ordered list of steps for the failover and for the potential rollback.

@greg I realized that I said 1-10 seconds of unavailability, but that doesn't take into account that currently, it takes a bit over a minute to deploy a file change to all appservers. This means that the downtime will start when we deploy the change to set the wikis as read-only (~1+ minute) + actual failover + sanity check + deployment to set the wikis as read-write (1+ minutes). That may be up to 3 minutes, in total, in read-only mode.
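The estimate above is a sum of four phases. A rough Python sketch of that arithmetic follows; the function name is hypothetical, the 60-second deploy times and 10-second failover come from the figures in this thread, and the 50-second sanity check is purely an assumed placeholder, since no duration was given for it.

```python
def read_only_window_seconds(deploy_read_only, failover, sanity_check, deploy_read_write):
    """Total time the wikis spend in read-only mode, in seconds."""
    return deploy_read_only + failover + sanity_check + deploy_read_write


if __name__ == "__main__":
    total = read_only_window_seconds(
        deploy_read_only=60,   # "a bit over a minute" to deploy a file change
        failover=10,           # upper end of the 1-10 second estimate
        sanity_check=50,       # ASSUMPTION: not stated anywhere in this task
        deploy_read_write=60,  # second deployment, again about a minute
    )
    print(f"{total / 60:.1f} minutes")  # prints "3.0 minutes"
```

With these inputs the two deployments account for two thirds of the window, which is why the actual failover time barely matters for the user-visible impact.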

All this adds to the reasons why database pooling should be outside of code deployment (tracked on T119626), where this would be instantaneous, but that is outside of the scope of this particular maintenance.

differences in schema in db1018.eqiad.wmnet
**********************************

======================================================================
 [-(`il_to`,`il_from_namespace`,`il_from`)-] {+(`il_from_namespace`,`il_to`,`il_from`)+}
======================================================================
 [-(`pl_namespace`,`pl_title`,`pl_from_namespace`,`pl_from`)-] {+(`pl_from_namespace`,`pl_namespace`,`pl_title`,`pl_from`)+}
======================================================================
======================================================================
 [-(`tl_namespace`,`tl_title`,`tl_from_namespace`,`tl_from`)-] {+(`tl_from_namespace`,`tl_namespace`,`tl_title`,`tl_from`)+}
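Each diff line above uses `[-(...)-]` for the old index definition and `{+(...)+}` for the new one. A small Python sketch for reading those lines follows; the function name `parse_index_change` and the regular expression are illustrative assumptions, not part of the tool that produced the output. It extracts both column lists and checks whether only the column order changed.

```python
import re

# Matches "[-(old columns)-] {+(new columns)+}" as seen in the output above.
DIFF_RE = re.compile(r"\[-\((?P<old>.*?)\)-\]\s*\{\+\((?P<new>.*?)\)\+\}")


def parse_index_change(line):
    """Return (old_columns, new_columns) for a diff line, or None if no match."""
    match = DIFF_RE.search(line)
    if match is None:
        return None

    def split_columns(text):
        # Columns are comma-separated and wrapped in backticks.
        return [col.strip().strip("`") for col in text.split(",")]

    return split_columns(match["old"]), split_columns(match["new"])


if __name__ == "__main__":
    line = " [-(`il_to`,`il_from_namespace`,`il_from`)-] {+(`il_from_namespace`,`il_to`,`il_from`)+}"
    old, new = parse_index_change(line)
    print("old order:", old)
    print("new order:", new)
    # True when the same columns are indexed and only their order differs.
    print("reordered only:", sorted(old) == sorted(new))
```

For all three lines shown, the old and new definitions contain the same columns, with the `*_from_namespace` column moved to the front.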

Change 269381 had a related patch set uploaded (by Jcrespo):
s2-master now points to db1018 (instead of db1024)

https://gerrit.wikimedia.org/r/269381

Change 269389 had a related patch set uploaded (by Jcrespo):
Final configuration after failover (db1018 as the new s2-master)

https://gerrit.wikimedia.org/r/269389

Change 269391 had a related patch set uploaded (by Jcrespo):
Enabling read only mode for s2 before its master failover

https://gerrit.wikimedia.org/r/269391

aude added a subscriber: aude.Feb 9 2016, 2:23 PM

I suggest we enable a (non-intrusive) central notice banner on the affected wikis, such as https://meta.wikimedia.org/wiki/Special:CentralNoticeBanners/edit/Genericmaintenancenotice to let them (i'd say logged in users only) know a bit in advance.

I think https://gerrit.wikimedia.org/r/#/c/269391/2/wmf-config/db-eqiad.php is also helpful, but I think it would only show on edit and such (and uncached places), while central notice would work for everyone.

aude added a comment.Feb 9 2016, 2:26 PM

and https://meta.wikimedia.org/wiki/Special:CentralNoticeLogs lists some of the staff members that handle banners, though any meta admin can do this. (if we choose to do this)

(me no longer one of the admins, due to inactivity there :/)

@Johan is that something you can help with?

Pinging @Jalexander and @Pcoombe for their help here as well.

@greg @jcrespo I can help with this. Is there a page/email you would like the "Read more" link to point at? Or should we just leave that out?

@Pcoombe

There is:

https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160209T2300

But we can create a page or email with the following information:

Between 23:00 and 23:59 UTC, February 9th 2016 there is scheduled a maintenance window that will affect some of the wikis hosted by the Wikimedia Foundation. The maintenance is needed in order to perform necessary hardware, operating system and MariaDB upgrades of one of the master database servers. During the upgrade, content on affected wikis be available at all times, but edits may fail during approximately 5 minutes within that schedule (these wikis will be in "read only mode"). The following wikis will be affected:
bg.wikipedia.org
bg.wiktionary.org
cs.wikipedia.org
en.wikiquote.org
en.wiktionary.org
eo.wikipedia.org
fi.wikipedia.org
id.wikipedia.org
it.wikipedia.org
nl.wikipedia.org
no.wikipedia.org
pl.wikipedia.org
pt.wikipedia.org
sv.wikipedia.org
th.wikipedia.org
tr.wikipedia.org
zh.wikipedia.org
All other wikis will not be affected by the maintenance.
We apologize in advance for this disruption and will try to minimize the duration of the maintenance work.

Can someone review this? My communication skills are usually horrible, and I may be missing important information or it may not be understandable enough (I am not a native English speaker).

greg added a comment.Feb 9 2016, 7:08 PM

My simple/simplifying edits:

Between 23:00 and 23:59 UTC, February 9th 2016 there is a scheduled maintenance window that will affect some of the wikis hosted by the Wikimedia Foundation. The maintenance is needed in order to perform necessary hardware, operating system and database upgrades. During the upgrade, content on affected wikis will be available at all times, but edits may fail during approximately 5 minutes within that schedule (these wikis will be in "read only mode"). The following wikis will be affected:
bg.wikipedia.org
bg.wiktionary.org
cs.wikipedia.org
en.wikiquote.org
en.wiktionary.org
eo.wikipedia.org
fi.wikipedia.org
id.wikipedia.org
it.wikipedia.org
nl.wikipedia.org
no.wikipedia.org
pl.wikipedia.org
pt.wikipedia.org
sv.wikipedia.org
th.wikipedia.org
tr.wikipedia.org
zh.wikipedia.org
All other wikis will not be affected by this maintenance.
We apologize in advance for this disruption and will try to minimize the duration of the maintenance work.

I've sent an email with that text to wikitech, too.

Okay, there are CentralNotice banners set up for logged-in users between 22:45 and 23:59 UTC on the affected wikis, linking to the above page.

Change 269391 merged by Jcrespo:
Enabling read only mode for s2 before its master failover

https://gerrit.wikimedia.org/r/269391

Change 269381 merged by Jcrespo:
s2-master now points to db1018 (instead of db1024)

https://gerrit.wikimedia.org/r/269381

Change 269389 abandoned by Jcrespo:
Final configuration after failover (db1018 as the new s2-master)

Reason:
Merged other change with similar results.

https://gerrit.wikimedia.org/r/269389

Change 269581 had a related patch set uploaded (by Jcrespo):
Updating new master on codfw configuration too (just in case)

https://gerrit.wikimedia.org/r/269581

Change 269581 merged by Jcrespo:
Updating new master on codfw configuration too (just in case)

https://gerrit.wikimedia.org/r/269581

Failover was done successfully (https://wikitech.wikimedia.org/wiki/Planned_Maintenance-February_9_2016); pending issues are being tracked on T126436.

jcrespo closed this task as Resolved.Feb 10 2016, 2:17 PM
jcrespo claimed this task.