Description

I split this off from T137928: Deploy phabricator to phab2001.codfw.wmnet to unblock that task because it doesn't actually depend on this one being completed.

See T164810: Switch phabricator production to codfw for a high-level outline of the steps. We should document the process more formally (and someplace besides phabricator since it won't be available when phabricator is offline).

Phabricator isn't considered the highest-priority system to bring back online in the event of a disaster. Nonetheless, it should be possible to recover it relatively quickly so that we can use Phabricator when coordinating the recovery of other systems.

Wikitech page for Disaster Recovery plan: https://wikitech.wikimedia.org/wiki/Phabricator/Disaster_Recovery

Details

	Subject	Repo	Branch	Lines +/-
	wmnet: Point m3-master codfw to dbproxy2003	operations/dns	master	+2 -0


Related Objects

Status	Assigned	Task
Resolved	• mmodell	T182832 Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state
Resolved	Paladox	T125357 /maniphest/report/project/ : Maximum execution time of 10 seconds exceeded
Resolved	Dzahn	T151070 Move Phabricator from PHP 7.0 to PHP 7.2
Resolved	Dzahn	T238956 switch prod Phabricator from phab1003 to phab1001
Resolved	Dzahn	T190568 Reimage both phab1001 and phab2001 to stretch / buster
Resolved	Joe	T154658 Prepare and improve the datacenter switchover procedure
Resolved	LSobanski	T156937 Provide cross-dc redundancy (active-active or active-passive) to all important misc services
Invalid	None	T164810 Switch phabricator production to codfw
Resolved	• mmodell	T152129 reinstall iridium (phabricator) as phab1001 with jessie
Resolved	• mmodell	T137928 Deploy phabricator to phab2001.codfw.wmnet
Resolved	• mmodell	T168699 Verify that the codfw lvs is configured correctly for Phabricator
Resolved	• mmodell	T190572 Prepare a disaster recovery plan for failing over Phabricator
Declined	• mmodell	T232883 Make PHD run on the backup phabricator server (phab2001, currently)

Event Timeline

• mmodell triaged this task as Medium priority.Mar 23 2018, 8:22 PM

• mmodell created this task.

Restricted Application added a project: Release-Engineering-Team (Kanban).Mar 23 2018, 8:22 PM

• mmodell mentioned this in T137928: Deploy phabricator to phab2001.codfw.wmnet.Mar 23 2018, 8:23 PM

• mmodell updated the task description.

• mmodell updated the task description.Mar 23 2018, 8:31 PM

Consider storing the information on the Wikitech wiki. wikitech-static is a copy of it that is kept completely outside the normal WMF infrastructure for this very reason: to be available in the event of a disaster.

https://wikitech-static.wikimedia.org/wiki/Main_Page

greg added a project: Phabricator.Apr 23 2018, 4:42 PM

greg added a project: Documentation.Apr 23 2018, 5:05 PM

jcrespo added a project: DBA.Jun 8 2018, 12:48 PM

jcrespo moved this task from Triage to Blocked external/Not db team on the DBA board.

note: the steps are a bit different for failing over between data centers vs within a single data center.

From @Dzahn via IRC:

07:48:42	<mutante>	for eqiad/codfw parts of it are all prepared in hiera
07:48:47	<mutante>	and applied per dc
07:49:02	<mutante>	for eqiad we have IPs applied via hostname
07:49:08	<mutante>	for codfw by role
07:49:23	<mutante>	this inconsistency was actually nice for a switch to phab1002 in this case
07:49:35	<mutante>	i could just set other IPs for phab1002 also by host

• mmodell updated the task description.Jun 8 2018, 12:57 PM

• mmodell moved this task from Backlog to In-progress on the Release-Engineering-Team (Kanban) board.Jun 25 2018, 4:54 PM

• srodlund moved this task from Backlog to Completed, Blocked, Wrong Category, Refinement Needed, or Older Needs Review on the Documentation board.Aug 16 2018, 5:48 PM

Aklapper moved this task from To Triage to Infrastructure on the Phabricator board.Aug 29 2018, 6:36 PM

• mmodell added a project: User-MModell.Sep 10 2018, 4:51 PM

• mmodell moved this task from Backlog to Soon on the User-MModell board.Sep 10 2018, 4:58 PM

• mmodell moved this task from In-progress to Backlog on the Release-Engineering-Team (Kanban) board.Oct 22 2018, 4:36 PM

Ladsgroup subscribed.Jan 15 2019, 10:00 PM

• mmodell added a parent task: T137928: Deploy phabricator to phab2001.codfw.wmnet.Jan 23 2019, 7:56 PM

• Alroilim closed this task as Declined.Feb 2 2019, 7:17 PM

• Alroilim removed • mmodell as the assignee of this task.

• Alroilim set Due Date to Feb 1 2019, 9:00 PM.

• Alroilim removed projects: User-MModell, DBA, Documentation, Phabricator, Release-Engineering-Team (Kanban).

• Alroilim updated the task description.

• Alroilim removed subscribers: Ladsgroup, jcrespo, Aklapper and 5 others.

Restricted Application changed the subtype of this task from "Task" to "Deadline".Feb 2 2019, 7:17 PM

Gopavasanth reopened this task as Open.Feb 2 2019, 7:43 PM

Gopavasanth assigned this task to • mmodell.

Gopavasanth added subscribers: Ladsgroup, jcrespo, Aklapper.

Aklapper removed Due Date.Feb 23 2019, 6:15 AM

Aklapper added projects: User-MModell, DBA, Documentation, Phabricator, Release-Engineering-Team (Kanban).

Aklapper updated the task description.

Aklapper added subscribers: Paladox, ArielGlenn, Dzahn, • mmodell.

Restricted Application changed the subtype of this task from "Deadline" to "Task".Feb 23 2019, 6:15 AM

greg edited projects, added Release-Engineering-Team (Backlog); removed Release-Engineering-Team (Kanban).May 22 2019, 10:20 PM

• Marostegui mentioned this in T202367: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4].May 29 2019, 11:50 AM

• Phabricator_maintenance edited projects, added Release-Engineering-Team-TODO; removed Release-Engineering-Team (Backlog).Jun 12 2019, 11:52 PM

• Phabricator_maintenance moved this task from Should be empty (use Release-Engineering-Team) to Later / Need volunteer on the Release-Engineering-Team-TODO board.Jun 12 2019, 11:55 PM

greg added a project: Release-Engineering-Team.Jun 21 2019, 10:35 PM

greg edited projects, added Release-Engineering-Team (Development services); removed Release-Engineering-Team.Jun 24 2019, 7:50 PM

greg moved this task from Later / Need volunteer to Soon-ish on the Release-Engineering-Team-TODO board.Jul 6 2019, 4:55 AM

Mentioned in SAL (#wikimedia-operations) [2019-07-19T22:36:01Z] <mutante> phab2001 - switching apache to php-fpm and worker instead of mpm-prefork (to match phab1001) (T190568 T137928 T190572)

Change 529847 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Point m3-master codfw to dbproxy2003

https://gerrit.wikimedia.org/r/529847

gerritbot added a project: Patch-For-Review.Aug 13 2019, 4:59 AM

Change 529847 merged by Marostegui:
[operations/dns@master] wmnet: Point m3-master codfw to dbproxy2003

https://gerrit.wikimedia.org/r/529847

Maintenance_bot removed a project: Patch-For-Review.Aug 14 2019, 5:10 AM

The proxy at codfw is now provisioned.
It points to the codfw databases, which are read-only.
In case of a disaster, if we had to switch everything to codfw, they would need to be set as writable:

root@cumin1001:~# mysql --skip-ssl -hm3-master.codfw.wmnet
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 319933
Server version: 10.1.39-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

root@m3-master.codfw.wmnet[(none)]> select @@hostname;
+------------+
| @@hostname |
+------------+
| db2065     |
+------------+
1 row in set (0.04 sec)

root@m3-master.codfw.wmnet[(none)]> show global variables like 'read_only';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only     | ON    |
+---------------+-------+
1 row in set (0.04 sec)
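The check above could also be scripted. The following is only a minimal sketch: it assumes the tabular output of the mysql client has been captured as text, and it does not connect to any database. The helper names `parse_read_only` and `statement_to_make_writable` are hypothetical, not part of any existing tooling.

```python
# Sketch only: parses captured mysql client output; nothing here talks to a server.
from typing import Optional

# Sample taken from the session above.
SAMPLE_OUTPUT = """\
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| read_only     | ON    |
+---------------+-------+
"""


def parse_read_only(client_output: str) -> bool:
    """Return True if the captured table reports read_only as ON."""
    for line in client_output.splitlines():
        # Data rows look like: | read_only     | ON    |
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) == 2 and cells[0] == "read_only":
            return cells[1].upper() in ("ON", "1")
    raise ValueError("read_only row not found in output")


def statement_to_make_writable(is_read_only: bool) -> Optional[str]:
    """Return the SQL an admin would run, or None if the host is already writable."""
    return "SET GLOBAL read_only = OFF;" if is_read_only else None


if __name__ == "__main__":
    print(statement_to_make_writable(parse_read_only(SAMPLE_OUTPUT)))
```

The statement is only produced, never executed, so flipping the host to writable stays a deliberate manual step for an admin.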

We are requesting to keep a second phab server in eqiad (T232887). That would allow us to fail over within eqiad.

Dzahn renamed this task from Prepare a disaster recovery plan for failing over from phab1001 to phab2001 (or phab2001 to 1001) to Prepare a disaster recovery plan for failing over Phabricator.Oct 24 2019, 12:26 AM

mark mentioned this in T232887: The phabricator server, WMF7426, was given to us temporarily, we would like to make it permanent.Oct 24 2019, 3:57 PM

• mmodell closed subtask T232883: Make PHD run on the backup phabricator server (phab2001, currently) as Resolved.Nov 6 2019, 6:05 PM

• mmodell reopened subtask T232883: Make PHD run on the backup phabricator server (phab2001, currently) as Open.

Some scenarios that we should describe and test:

1. A simple failure of the phabricator server, e.g. a disk failure or other hardware failure on phab1001
2. Complete datacenter failover, e.g. some major event takes down eqiad and we need to fail over to codfw
3. Master database fails, we need to fail over to a slave and swap the slave to become a master

#2 has the proxies, databases and CNAMEs ready (T190572#5413180). The hosts are read-only, so an admin would need to set read_only=OFF if a switch were needed.
#3 is already done with the proxies.
If the master goes down, the proxy would automatically fail over to the existing slave (which is read-only), and an admin would need to set read_only=OFF on it.
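The master-failure behaviour described above can be illustrated with a toy model. This is a sketch under stated assumptions, not the real proxy configuration: the `Database` and `Proxy` classes and the host names `db-master` and `db-replica` are hypothetical, and health is reduced to a single boolean.

```python
# Toy model of the described behaviour; not the actual dbproxy setup.
from dataclasses import dataclass


@dataclass
class Database:
    name: str
    healthy: bool = True
    read_only: bool = False


class Proxy:
    """Routes to the master while it is healthy, otherwise to the replica."""

    def __init__(self, master: Database, replica: Database) -> None:
        self.master = master
        self.replica = replica

    def backend(self) -> Database:
        if self.master.healthy:
            return self.master
        if self.replica.healthy:
            return self.replica
        raise RuntimeError("no healthy backend")

    def write(self, statement: str) -> str:
        db = self.backend()
        if db.read_only:
            # Failover alone does not make the replica writable.
            raise PermissionError(f"{db.name} is read-only")
        return f"{statement!r} applied on {db.name}"


def demo() -> str:
    master = Database("db-master")
    replica = Database("db-replica", read_only=True)
    proxy = Proxy(master, replica)

    master.healthy = False          # master goes down; proxy now picks the replica
    try:
        proxy.write("UPDATE ...")   # rejected: replica is still read-only
    except PermissionError:
        replica.read_only = False   # the manual step: admin sets read_only=OFF
    return proxy.write("UPDATE ...")


if __name__ == "__main__":
    print(demo())
```

The point of the model is that the automatic part (routing) and the manual part (making the replica writable) are separate, so writes keep failing between the two.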