
Reimage both phab1001 and phab2001 to stretch / buster
Open, High, Public

Description

We should reimage phab1001 and phab2001 to stretch now that T187127 is resolved, which unblocks the reimage.

This reimage is needed for T182832

We should reimage phab2001 first, then phab1001.

blockers noted in meeting:

  • ensure testing is possible from deployment servers (firewall holes)
  • ensure replacement server has 64GB RAM like prod server
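The RAM blocker above could be checked on the replacement host with something like the following. This is only a minimal sketch: the function names and the 5% tolerance are assumptions for illustration (the kernel reserves some memory, so MemTotal on a 64GB machine reads slightly below 64 GiB), not part of any existing tooling.

```python
def mem_total_gib(meminfo_text: str) -> float:
    """Return MemTotal from /proc/meminfo-style text, converted to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # Line format: "MemTotal:       65853520 kB" (value is in KiB)
            kib = int(line.split()[1])
            return kib / (1024 * 1024)
    raise ValueError("MemTotal not found in meminfo text")


def has_enough_ram(meminfo_text: str,
                   required_gib: float = 64,
                   tolerance: float = 0.05) -> bool:
    """True if MemTotal is within `tolerance` of `required_gib` or above it.

    The tolerance is a hypothetical allowance for kernel-reserved memory.
    """
    return mem_total_gib(meminfo_text) >= required_gib * (1 - tolerance)


if __name__ == "__main__":
    # On the host itself, read the real file.
    with open("/proc/meminfo") as f:
        print(has_enough_ram(f.read()))
```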

todo:

  • install OS on phab1003
  • switch prod traffic to phab1003
  • reinstall phab2001 with stretch
  • reinstall phab1001 with buster or decom it and keep phab1003 as prod server
  • if needed: switch traffic back
  • give back phab1003 or keep as permanent failover in the same DC | requested to keep in T232887

Event Timeline

Dzahn changed the task status from Open to Stalled. Apr 17 2019, 10:02 PM

blocked on T215335

Dzahn updated the task description. Apr 18 2019, 9:15 PM
Dzahn updated the task description.
Dzahn updated the task description. May 8 2019, 2:51 PM

The blocking ticket is closed; what else is needed for this to move forward?

Change 510597 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: enable php-fpm

https://gerrit.wikimedia.org/r/510597

Change 510614 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: enable php-fpm, disable logmails on phab1003

https://gerrit.wikimedia.org/r/510614

Change 510614 merged by Dzahn:
[operations/puppet@production] phabricator: enable php-fpm, disable logmails on phab1003

https://gerrit.wikimedia.org/r/510614

Mentioned in SAL (#wikimedia-operations) [2019-05-15T21:47:24Z] <mutante> phab1003 - ip -6 addr del 2620:0:861:ed1a::3:16/128 dev lo - remove extra service IP for phab's separate sshd, duplicated with phab1001 (T190568)

Change 510623 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: add vcs listen addresses for phab1003

https://gerrit.wikimedia.org/r/510623

Change 510623 merged by Dzahn:
[operations/puppet@production] phabricator: add vcs listen addresses for phab1003

https://gerrit.wikimedia.org/r/510623

Dzahn changed the task status from Stalled to Open. May 23 2019, 4:50 AM
Dzahn claimed this task.
Dzahn updated the task description.
Dzahn updated the task description. May 23 2019, 4:14 PM

Phabricator has been switched to phab1003 as the prod server now and that meant:

  • php 5 to php 7.2
  • mod_php to php-fpm
  • jessie to stretch (and the httpd version with it)
Dzahn added a comment. May 23 2019, 4:16 PM

I now think we could simplify the process: keep phab1003 as the prod server and only discuss whether we want to keep phab1001 as a permanent standby in the same DC or give it back to the spares pool.

Next phab2001 will also need to be reinstalled with stretch.

@Dzahn: agreed. I don't know who should decide if we keep phab1001 or return it. We've proven now that the migration is doable in a sort-of short amount of time, so a warm standby would certainly improve availability in the event of a major failure with the primary machine.

Dzahn added a comment. May 23 2019, 4:19 PM

I will bring this up in my next subteam discussion meeting, which should be in a week. Until then I will hold on to phab1001. Maybe we keep it as is for a few days and then install stretch.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

phab2001.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/201905291932_dzahn_103379_phab2001_codfw_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2019-05-29T19:32:40Z] <mutante> phba2001 - reinstalling with stretch - upgrade from jessie (T190568)

Change 513202 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: switch phab2001 to stretch installer

https://gerrit.wikimedia.org/r/513202

Change 513202 merged by Dzahn:
[operations/puppet@production] install_server: switch phab2001 to stretch installer

https://gerrit.wikimedia.org/r/513202

Completed auto-reimage of hosts:

['phab2001.codfw.wmnet']

Of which those FAILED:

['phab2001.codfw.wmnet']

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

phab2001.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/201905292034_dzahn_151516_phab2001_codfw_wmnet.log.

Completed auto-reimage of hosts:

['phab2001.codfw.wmnet']

Of which those FAILED:

['phab2001.codfw.wmnet']

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

phab2001.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/201905292217_dzahn_171903_phab2001_codfw_wmnet.log.

Dzahn updated the task description. May 29 2019, 11:31 PM

Change 513242 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: activate logmail on phab1003, disable on phab1001

https://gerrit.wikimedia.org/r/513242

Change 513242 merged by Dzahn:
[operations/puppet@production] phabricator: activate logmail on phab1003, disable on phab1001

https://gerrit.wikimedia.org/r/513242

Completed auto-reimage of hosts:

['phab2001.codfw.wmnet']

Of which those FAILED:

['phab2001.codfw.wmnet']
Joe moved this task from Backlog to Doing on the serviceops board. Jun 21 2019, 7:37 AM
Dzahn added a comment. Jul 9 2019, 9:13 PM

Next we need to make a decision whether we keep phab1003 as the prod host permanently (why not, I guess?) and then decom phab1001, or go with the original plan: reinstall phab1001, switch back to it, and give the temp host phab1003 back to the pool. Maybe dcops prefer the latter because we asked for a temp host only, but maybe it doesn't matter to them at all, since we would be giving another host back instead.

Next we need to make a decision whether we keep phab1003 as the prod host permanently (why not i guess?)

What's the current procedure to switch over the active Phab server, just a DNS name change? IIRC we can't switch over to codfw currently due to the Phab DB not being replicated, is that still correct? Phabricator is an important service and the Phabricator server is currently a SPOF. If the procedure to fail over the active Phab server is non-intrusive, I'd suggest reimaging phab1001 to Stretch and keeping it as the failover server (similar to what we do for e.g. cloudnet, cloudservices and cloudcontrol). If we run into hardware issues with the active Phab server, we can then easily switch over to the failover host in eqiad. If the failover procedure is non-intrusive, this might also simplify Phab maintenance a lot, as new changes could more easily be staged on the secondary host (e.g. by switching /etc/hosts manually to that server). And if we e.g. reboot the Phab host, we could switch to the secondary host and minimise user-visible downtime.
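For illustration, the manual /etc/hosts switch mentioned above could look roughly like this on a tester's own machine. It is a sketch only: the hostname phabricator.example.org and the address 192.0.2.10 (a documentation-range IP) are placeholders, not the real service name or the address of any phab host, and the helper names are made up for this example.

```python
import ipaddress


def hosts_override(address: str, hostname: str) -> str:
    """Build one /etc/hosts line pointing `hostname` at `address`."""
    ipaddress.ip_address(address)  # raises ValueError on a malformed address
    return f"{address}\t{hostname}"


def apply_override(hosts_text: str, address: str, hostname: str) -> str:
    """Return hosts-file text with `hostname` mapped only to `address`.

    Existing entries that list `hostname` are dropped; comments, blank
    lines and all other entries are kept unchanged.
    """
    kept = []
    for line in hosts_text.splitlines():
        # Ignore trailing comments, then look at the names after the address.
        names = line.split("#", 1)[0].split()[1:]
        if hostname not in names:
            kept.append(line)
    kept.append(hosts_override(address, hostname))
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    # Placeholder values; print the result instead of writing /etc/hosts.
    with open("/etc/hosts") as f:
        print(apply_override(f.read(), "192.0.2.10", "phabricator.example.org"), end="")
```

Printing the result rather than writing the file keeps the sketch safe to run; reverting the test is a matter of removing the added line again.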

Change 510597 merged by Dzahn:
[operations/puppet@production] phabricator: enable php-fpm in Hiera on both hosts

https://gerrit.wikimedia.org/r/510597

Mentioned in SAL (#wikimedia-operations) [2019-07-19T22:36:01Z] <mutante> phab2001 - switching apache to php-fpm and worker instead of mpm-prefork (to match phab1001) (T190568 T137928 T190572)

Dzahn changed the task status from Open to Stalled. Sep 13 2019, 6:54 PM
Dzahn renamed this task from "Reimage both phab1001 and phab2001 to stretch" to "Reimage both phab1001 and phab2001 to stretch / buster". Sep 13 2019, 7:01 PM
Dzahn updated the task description.
Dzahn changed the status of subtask T137928: Deploy phabricator to phab2001.codfw.wmnet from Stalled to Open. Sep 13 2019, 7:22 PM

Change 536698 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: switch phab1001 from jessie to buster

https://gerrit.wikimedia.org/r/536698

Dzahn changed the task status from Stalled to Open. Sep 13 2019, 10:47 PM

Change 536698 merged by Dzahn:
[operations/puppet@production] DHCP: switch phab1001 from jessie to buster

https://gerrit.wikimedia.org/r/536698

Change 536701 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: apply spare::system role to phab1001

https://gerrit.wikimedia.org/r/536701

Change 536701 merged by Dzahn:
[operations/puppet@production] site: apply spare::system role to phab1001

https://gerrit.wikimedia.org/r/536701

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

phab1001.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201909132312_dzahn_241683_phab1001_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['phab1001.eqiad.wmnet']

and were ALL successful.

Change 536712 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site/phabricator: apply phab role on phab1001

https://gerrit.wikimedia.org/r/536712

Dzahn updated the task description. Sep 14 2019, 7:48 AM

Change 536712 merged by Dzahn:
[operations/puppet@production] site/phabricator: apply phab role on phab1001

https://gerrit.wikimedia.org/r/536712

Change 541666 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: support buster with PHP 7.3 packages

https://gerrit.wikimedia.org/r/541666

Mentioned in SAL (#wikimedia-operations) [2019-10-08T23:28:53Z] <mutante> phab1001 - replacing tin.eqiad.wmnet with deploy1001.eqiad.wmnet in phabricator/deployment-cache/.config:git_server - wondering if we can ever get rid of tin (T190568)

Change 541666 merged by Dzahn:
[operations/puppet@production] phabricator: support buster with PHP 7.3 packages

https://gerrit.wikimedia.org/r/541666

Change 541930 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator::httpd: support stretch/buster with/without php-fpm

https://gerrit.wikimedia.org/r/541930

Change 541930 merged by Dzahn:
[operations/puppet@production] phabricator::httpd: support stretch/buster with/without php-fpm

https://gerrit.wikimedia.org/r/541930

Change 541967 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: install s-nail instead of heirloom-mailx on buster

https://gerrit.wikimedia.org/r/541967

Change 541967 merged by Dzahn:
[operations/puppet@production] phabricator: install s-nail instead of heirloom-mailx on buster

https://gerrit.wikimedia.org/r/541967

Dzahn added a comment. Wed, Oct 9, 11:52 PM

Next we need to make a decision whether we keep phab1003 as the prod host permanently (why not i guess?)

What's the current procedure to switch over the active Phab server, just a DNS name change?

IIRC we can't switchover to codfw currently due to the Phab DB not being replicated, is that still correct?

Not anymore now. DBA unblocked that.

Phabricator is an important service and the Phabricator server is currently a SPOF. If the procedure to fail over the active Phab server is non-intrusive, I'd suggest reimaging phab1001 to Stretch and keeping it as the failover server (similar to what we do for e.g. cloudnet, cloudservices and cloudcontrol). If we run into hardware issues with the active Phab server, we can then easily switch over to the failover host in eqiad. If the failover procedure is non-intrusive, this might also simplify Phab maintenance a lot, as new changes could more easily be staged on the secondary host (e.g. by switching /etc/hosts manually to that server). And if we e.g. reboot the Phab host, we could switch to the secondary host and minimise user-visible downtime.

Talked with 20after and we agree; meanwhile there is T232887, which asks to keep the second server in eqiad permanently.

@Muehlenhoff Currently moving to buster is blocked by T235140

Change 541993 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: install packages and apt::repo independent of using php-fpm

https://gerrit.wikimedia.org/r/541993

Change 542191 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: install s-nail instead of heirloom-mailx on any distro

https://gerrit.wikimedia.org/r/542191

Change 541993 abandoned by Dzahn:
phabricator: fix duplicate installation of PHP packages

Reason:
in favor of https://gerrit.wikimedia.org/r/c/operations/puppet/+/542193

https://gerrit.wikimedia.org/r/541993

Change 542191 merged by Dzahn:
[operations/puppet@production] phabricator: install s-nail instead of heirloom-mailx on any distro

https://gerrit.wikimedia.org/r/542191