Prepare a disaster recovery plan for failing over Phabricator
Closed, Resolved, Public

Description

I split this off from T137928: Deploy phabricator to phab2001.codfw.wmnet to unblock that task because it doesn't actually depend on this one being completed.

See T164810: Switch phabricator production to codfw for a high-level outline of the steps. We should document the process more formally, and somewhere other than Phabricator, since anything stored there won't be available when Phabricator is offline.

Phabricator isn't considered the highest-priority system to bring back online in the event of a disaster. Nonetheless, it should be possible to recover it relatively quickly so that we can use Phabricator when coordinating the recovery of other systems.

Wikitech page for Disaster Recovery plan: https://wikitech.wikimedia.org/wiki/Phabricator/Disaster_Recovery

Event Timeline

mmodell created this task.

Consider storing the information on the wikitech wiki. There is wikitech-static, a copy of that wiki that is kept completely outside normal WMF infrastructure for this very reason: to be available in the event of a disaster.

https://wikitech-static.wikimedia.org/wiki/Main_Page

note: the steps are a bit different for failing over between data centers vs within a single data center.

From @Dzahn via IRC:

07:48:42	<mutante>	for eqiad/codfw parts of it are all prepared in hiera
07:48:47	<mutante>	and applied per dc
07:49:02	<mutante>	for eqiad we have IPs applied via hostname
07:49:08	<mutante>	for codfw by role
07:49:23	<mutante>	this inconsistency was actually nice for a switch to phab1002 in this case
07:49:35	<mutante>	i could just set other IPs for phab1002 also by host
Alroilim removed mmodell as the assignee of this task.
Alroilim set Due Date to Feb 1 2019, 9:00 PM.
Alroilim updated the task description.
Alroilim removed subscribers: Ladsgroup, jcrespo, Aklapper and 5 others.
Restricted Application changed the subtype of this task from "Task" to "Deadline". · Feb 2 2019, 7:17 PM
Gopavasanth assigned this task to mmodell.
Gopavasanth added subscribers: Ladsgroup, jcrespo, Aklapper.
Restricted Application changed the subtype of this task from "Deadline" to "Task". · Feb 23 2019, 6:15 AM

Mentioned in SAL (#wikimedia-operations) [2019-07-19T22:36:01Z] <mutante> phab2001 - switching apache to php-fpm and worker instead of mpm-prefork (to match phab1001) (T190568 T137928 T190572)

Change 529847 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Point m3-master codfw to dbproxy2003

https://gerrit.wikimedia.org/r/529847

Change 529847 merged by Marostegui:
[operations/dns@master] wmnet: Point m3-master codfw to dbproxy2003

https://gerrit.wikimedia.org/r/529847

The proxy at codfw is now provisioned.
It points to the codfw databases, which are read-only.
In case of a disaster, if we had to switch everything to codfw, they would need to be set as writable.

root@cumin1001:~# mysql --skip-ssl -hm3-master.codfw.wmnet
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 319933
Server version: 10.1.39-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

root@m3-master.codfw.wmnet[(none)]> select @@hostname;
+------------+
| @@hostname |
+------------+
| db2065     |
+------------+
1 row in set (0.04 sec)

root@m3-master.codfw.wmnet[(none)]> show global variables like 'read_only';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only     | ON    |
+---------------+-------+
1 row in set (0.04 sec)
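
A minimal sketch of how the check above could be scripted, assuming the `mysql` client output has already been captured as text. The function names, `SAMPLE` and `ENABLE_WRITES_SQL` are hypothetical and not part of any existing tooling. The script only parses the boxed table and reports whether `read_only` is ON. Whether other steps (for example, stopping replication) are needed before enabling writes is not covered here.

```python
# Hypothetical helper: parse the boxed table printed by the mysql client
# and report whether the server is read-only.

def parse_mysql_table(text):
    """Return the table rows as a list of dicts keyed by the header cells."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders (+---+), prompts and the row-count footer
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells  # first "|" line is the header row
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def is_read_only(text):
    """True if the captured output shows read_only = ON."""
    for row in parse_mysql_table(text):
        if row.get("Variable_name") == "read_only":
            return row["Value"].upper() == "ON"
    raise ValueError("read_only not found in output")


# Output copied from the session above.
SAMPLE = """\
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only     | ON    |
+---------------+-------+
1 row in set (0.04 sec)
"""

# Statement an admin would run by hand to make the host writable.
# Assumption: this is the only variable that needs changing.
ENABLE_WRITES_SQL = "SET GLOBAL read_only = 0;"

if __name__ == "__main__":
    if is_read_only(SAMPLE):
        print("Host is read-only; to enable writes an admin would run:")
        print(ENABLE_WRITES_SQL)
```

The statement is only printed, never executed, so that making a database writable stays a deliberate manual step.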

We are requesting to keep a second phab server in eqiad (T232887). That would allow us to failover within eqiad.

Dzahn renamed this task from "Prepare a disaster recovery plan for failing over from phab1001 to phab2001 (or phab2001 to 1001)" to "Prepare a disaster recovery plan for failing over Phabricator". · Oct 24 2019, 12:26 AM

Some scenarios that we should describe and test:

  1. A simple failure of the phabricator server, e.g. a disk failure or other hardware failure on phab1001
  2. Complete datacenter failover, e.g. some major event takes down eqiad and we need to fail over to codfw
  3. The master database fails and we need to fail over to a slave, promoting it to become the new master

#2 has the proxies, databases and CNAMEs ready (T190572#5413180). The hosts are read-only, so an admin would need to set read_only=OFF if we switch over.
#3 is already handled by the proxies.
If the master goes down, the proxy would automatically fail over to the existing slave (which is read-only), and an admin would then need to set read_only=OFF on it.
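
The behaviour described for #3 can be illustrated with a small model. This is only a sketch of the logic as described in this comment, not the actual proxy configuration. The `Backend` class, `pick_backend` and the host names are hypothetical.

```python
# Toy model of the failover described above: the proxy prefers the master,
# falls back to the slave, and a read-only target still needs an admin to
# set read_only=OFF before Phabricator can write to it.
from dataclasses import dataclass


@dataclass
class Backend:
    name: str        # hypothetical host label
    healthy: bool    # whether the proxy's health check passes
    read_only: bool  # current value of the read_only variable


def pick_backend(master, slave):
    """Return (backend, needs_admin) for the backend the proxy would use.

    needs_admin is True when the chosen backend is read-only, meaning an
    admin has to set read_only=OFF by hand.
    """
    if master.healthy:
        return master, master.read_only
    if slave.healthy:
        return slave, slave.read_only
    raise RuntimeError("no healthy backend available")


if __name__ == "__main__":
    master = Backend("db-master", healthy=False, read_only=False)
    slave = Backend("db-slave", healthy=True, read_only=True)
    chosen, needs_admin = pick_backend(master, slave)
    print(chosen.name, "needs read_only=OFF:", needs_admin)
```

The point of the model is that the failover itself is automatic, while the step that makes the new target writable remains manual.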

I believe we can close this as resolved?