
Reimage both phab1001 and phab2001 to stretch / buster
Closed, Resolved, Public

Description

We should reimage phab1001 and phab2001 to stretch now that the reimage has been unblocked by the resolution of T187127.

This reimage is needed for T182832

We should reimage phab2001 first, then phab1001.

blockers noted in meeting:

  • ensure testing is possible from deployment servers (firewall holes)
  • ensure replacement server has 64GB RAM like prod server

todo:

  • install OS on phab1003
  • switch prod traffic to phab1003
  • reinstall phab2001 with stretch
  • reinstall phab1001 with buster
  • reinstall phab2001 with buster
  • switch production from phab1003 to phab1001 -> T238956
  • give back phab1003 or keep as permanent failover in the same DC | requested to keep in T232887 -> decom in T238956
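The ordering above matters: the standby is reimaged before the production host, so a failed install never takes down the active server. A minimal sketch of that sequencing, assuming `wmf-auto-reimage` is invoked once per host as in the log entries below; the exact flags and the injectable `run` callable are illustrative assumptions:

```python
import subprocess

def reimage_in_order(hosts, run=subprocess.run):
    """Reimage hosts one at a time, stopping at the first failure.

    Mirrors the plan in the task description: the standby (phab2001)
    is reimaged before the production host (phab1001), so a failed
    install never touches the active server.
    """
    results = {}
    for host in hosts:
        # wmf-auto-reimage is the WMF tooling seen in this task's log;
        # invoking it with just the hostname is an assumption here.
        proc = run(["wmf-auto-reimage", host])
        results[host] = proc.returncode
        if proc.returncode != 0:
            break  # do not proceed to the next host after a failure
    return results
```

Stopping at the first non-zero exit code means phab1001 is never reimaged if the phab2001 reinstall fails, matching the "standby first" plan.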

Details

Related Gerrit Patches:
operations/puppet : production | ATS/varnish: add phabricator-new to point to phab1001
operations/puppet : production | install_server: switch phab2001 to buster
operations/puppet : production | add spare::system role to phab1001
operations/puppet : production | phabricator: install s-nail instead of heirloom-mailx on any distro
operations/puppet : production | phabricator: fix duplicate installation of PHP packages
operations/puppet : production | phabricator: install s-nail instead of heirloom-mailx on buster
operations/puppet : production | phabricator::httpd: support stretch/buster with/without php-fpm
operations/puppet : production | phabricator: support buster with PHP 7.3 packages
operations/puppet : production | site/phabricator: apply phab role on phab1001
operations/puppet : production | site: apply spare::system role to phab1001
operations/puppet : production | DHCP: switch phab1001 from jessie to buster
operations/puppet : production | phabricator: enable php-fpm in Hiera on both hosts
operations/puppet : production | phabricator: activate logmail on phab1003, disable on phab1001
operations/puppet : production | install_server: switch phab2001 to stretch installer
operations/puppet : production | phabricator: add vcs listen addresses for phab1003
operations/puppet : production | phabricator: enable php-fpm, disable logmails on phab1003
operations/puppet : production | phabricator: allow http from deployment hosts on stand-by servers
operations/puppet : production | phabricator: Add new var phabricator_server_new
operations/puppet : production | phabricator: Set phabricator_server_failover to phab1002
operations/dns : master | assign wmf4727 as phab1002

Related Objects

Event Timeline


@Dzahn: agreed. I don't know who should decide if we keep phab1001 or return it. We've proven now that the migration is doable in a sort-of short amount of time, so a warm standby would certainly improve availability in the event of a major failure with the primary machine.

Dzahn added a comment. May 23 2019, 4:19 PM

I will bring this up in my next subteam discussion meeting, which should be in a week. Until then I will hold on to phab1001. Maybe we wait a few days to keep it as is, and then install stretch.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

phab2001.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/201905291932_dzahn_103379_phab2001_codfw_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2019-05-29T19:32:40Z] <mutante> phab2001 - reinstalling with stretch - upgrade from jessie (T190568)

Change 513202 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: switch phab2001 to stretch installer

https://gerrit.wikimedia.org/r/513202

Change 513202 merged by Dzahn:
[operations/puppet@production] install_server: switch phab2001 to stretch installer

https://gerrit.wikimedia.org/r/513202

Completed auto-reimage of hosts:

['phab2001.codfw.wmnet']

Of which those FAILED:

['phab2001.codfw.wmnet']

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

phab2001.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/201905292034_dzahn_151516_phab2001_codfw_wmnet.log.

Completed auto-reimage of hosts:

['phab2001.codfw.wmnet']

Of which those FAILED:

['phab2001.codfw.wmnet']

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

phab2001.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/201905292217_dzahn_171903_phab2001_codfw_wmnet.log.

Dzahn updated the task description. May 29 2019, 11:31 PM

Change 513242 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: activate logmail on phab1003, disable on phab1001

https://gerrit.wikimedia.org/r/513242

Change 513242 merged by Dzahn:
[operations/puppet@production] phabricator: activate logmail on phab1003, disable on phab1001

https://gerrit.wikimedia.org/r/513242

Completed auto-reimage of hosts:

['phab2001.codfw.wmnet']

Of which those FAILED:

['phab2001.codfw.wmnet']
Joe moved this task from Backlog to Doing on the serviceops board. Jun 21 2019, 7:37 AM
Dzahn added a comment. Jul 9 2019, 9:13 PM

Next we need to make a decision whether we keep phab1003 as the prod host permanently (why not i guess?). Then we either decom phab1001, or we go with the original plan: reinstall phab1001, switch back to it, and give the temp host phab1003 back to the pool (maybe dcops prefer this because we asked for a temp host only, but maybe it doesn't matter to them at all since we would be giving another host back instead).

Next we need to make a decision whether we keep phab1003 as the prod host permanently (why not i guess?)

What's the current procedure to switch over the active Phab server, just a DNS name change? IIRC we can't switch over to codfw currently due to the Phab DB not being replicated, is that still correct? Phabricator is an important service and the Phabricator server is currently a SPOF. If the procedure to fail over the active Phab server is non-intrusive, I'd suggest reimaging phab1001 to stretch and keeping it as the failover server (similar to what we do for cloudnet, cloudservices, cloudcontrol, e.g.). If we run into hardware issues with the active Phab server, we can then easily switch over to the failover host in eqiad. If the failover procedure is non-intrusive, this might also simplify Phab maintenance a lot, as new changes could more easily be staged on the secondary host (e.g. by manually pointing /etc/hosts at that server). And if we e.g. reboot the Phab host, we could switch to the secondary host and minimise user-visible downtime.
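The /etc/hosts trick mentioned here can also be done without editing any file, by connecting to the standby's address while presenting the production Host header. A minimal sketch, assuming the standby answers plain HTTP; the function name and the hostnames in the usage example are illustrative:

```python
import http.client

def check_standby(addr, host_header, path="/", port=80, timeout=5):
    """Request `path` from a specific server address while presenting
    the production Host header -- the same effect as temporarily
    pointing /etc/hosts at the standby, without touching the file.

    Returns the HTTP status code the standby answers with.
    """
    conn = http.client.HTTPConnection(addr, port, timeout=timeout)
    try:
        conn.request("GET", path, headers={"Host": host_header})
        return conn.getresponse().status
    finally:
        conn.close()
```

Usage would look like `check_standby("phab1002.eqiad.wmnet", "phabricator.wikimedia.org")` (hypothetical names), letting you stage and verify a change on the secondary before any DNS switch.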

Change 510597 merged by Dzahn:
[operations/puppet@production] phabricator: enable php-fpm in Hiera on both hosts

https://gerrit.wikimedia.org/r/510597

Mentioned in SAL (#wikimedia-operations) [2019-07-19T22:36:01Z] <mutante> phab2001 - switching apache to php-fpm and worker instead of mpm-prefork (to match phab1001) (T190568 T137928 T190572)

Dzahn changed the task status from Open to Stalled. Sep 13 2019, 6:54 PM
Dzahn renamed this task from "Reimage both phab1001 and phab2001 to stretch" to "Reimage both phab1001 and phab2001 to stretch / buster". Sep 13 2019, 7:01 PM
Dzahn updated the task description.
Dzahn changed the status of subtask T137928: Deploy phabricator to phab2001.codfw.wmnet from Stalled to Open. Sep 13 2019, 7:22 PM

Change 536698 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: switch phab1001 from jessie to buster

https://gerrit.wikimedia.org/r/536698

Dzahn changed the task status from Stalled to Open.Sep 13 2019, 10:47 PM

Change 536698 merged by Dzahn:
[operations/puppet@production] DHCP: switch phab1001 from jessie to buster

https://gerrit.wikimedia.org/r/536698

Change 536701 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: apply spare::system role to phab1001

https://gerrit.wikimedia.org/r/536701

Change 536701 merged by Dzahn:
[operations/puppet@production] site: apply spare::system role to phab1001

https://gerrit.wikimedia.org/r/536701

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

phab1001.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/201909132312_dzahn_241683_phab1001_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['phab1001.eqiad.wmnet']

and were ALL successful.

Change 536712 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site/phabricator: apply phab role on phab1001

https://gerrit.wikimedia.org/r/536712

Dzahn updated the task description. Sep 14 2019, 7:48 AM

Change 536712 merged by Dzahn:
[operations/puppet@production] site/phabricator: apply phab role on phab1001

https://gerrit.wikimedia.org/r/536712

Change 541666 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: support buster with PHP 7.3 packages

https://gerrit.wikimedia.org/r/541666

Mentioned in SAL (#wikimedia-operations) [2019-10-08T23:28:53Z] <mutante> phab1001 - replacing tin.eqiad.wmnet with deploy1001.eqiad.wmnet in phabricator/deployment-cache/.config:git_server - wondering if we can ever get rid of tin (T190568)

Change 541666 merged by Dzahn:
[operations/puppet@production] phabricator: support buster with PHP 7.3 packages

https://gerrit.wikimedia.org/r/541666

Change 541930 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator::httpd: support stretch/buster with/without php-fpm

https://gerrit.wikimedia.org/r/541930

Change 541930 merged by Dzahn:
[operations/puppet@production] phabricator::httpd: support stretch/buster with/without php-fpm

https://gerrit.wikimedia.org/r/541930

Change 541967 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: install s-nail instead of heirloom-mailx on buster

https://gerrit.wikimedia.org/r/541967

Change 541967 merged by Dzahn:
[operations/puppet@production] phabricator: install s-nail instead of heirloom-mailx on buster

https://gerrit.wikimedia.org/r/541967

Dzahn added a comment. Oct 9 2019, 11:52 PM

Next we need to make a decision whether we keep phab1003 as the prod host permanently (why not i guess?)

What's the current procedure to switch over the active Phab server, just a DNS name change?

IIRC we can't switch over to codfw currently due to the Phab DB not being replicated, is that still correct?

Not anymore now. DBA unblocked that.

Phabricator is an important service and the Phabricator server is currently a SPOF. If the procedure to fail over the active Phab server is non-intrusive, I'd suggest reimaging phab1001 to stretch and keeping it as the failover server (similar to what we do for cloudnet, cloudservices, cloudcontrol, e.g.). If we run into hardware issues with the active Phab server, we can then easily switch over to the failover host in eqiad. If the failover procedure is non-intrusive, this might also simplify Phab maintenance a lot, as new changes could more easily be staged on the secondary host (e.g. by manually pointing /etc/hosts at that server). And if we e.g. reboot the Phab host, we could switch to the secondary host and minimise user-visible downtime.

Talked with 20after and we agree; meanwhile there is T232887 asking to keep the second server in eqiad permanently.

@Muehlenhoff Currently moving to buster is blocked by T235140

Change 541993 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: install packages and apt::repo independent of using php-fpm

https://gerrit.wikimedia.org/r/541993

Change 542191 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: install s-nail instead of heirloom-mailx on any distro

https://gerrit.wikimedia.org/r/542191

Change 541993 abandoned by Dzahn:
phabricator: fix duplicate installation of PHP packages

Reason:
in favor of https://gerrit.wikimedia.org/r/c/operations/puppet/+/542193

https://gerrit.wikimedia.org/r/541993

Change 542191 merged by Dzahn:
[operations/puppet@production] phabricator: install s-nail instead of heirloom-mailx on any distro

https://gerrit.wikimedia.org/r/542191

Change 550902 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] add spare::system role to phab1001

https://gerrit.wikimedia.org/r/550902

Change 550902 abandoned by Dzahn:
add spare::system role to phab1001

https://gerrit.wikimedia.org/r/550902

Change 551286 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] ATS/varnish: add phabricator-new to point to phab1001

https://gerrit.wikimedia.org/r/551286

Change 551287 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: switch phab2001 to buster

https://gerrit.wikimedia.org/r/551287

Change 551287 merged by Dzahn:
[operations/puppet@production] install_server: switch phab2001 to buster

https://gerrit.wikimedia.org/r/551287

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

phab2001.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/201911182229_dzahn_139445_phab2001_codfw_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2019-11-18T22:31:37Z] <mutante> phab2001 - reinstalling with buster (T190568)

Completed auto-reimage of hosts:

['phab2001.codfw.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2019-11-18T23:37:34Z] <mutante> phab2001 - restart ssh-phab service after reimaging (some race condition: binding to the IP before getting it on the interface after fresh install); reschedule pybal checks (T190568)

Mentioned in SAL (#wikimedia-operations) [2019-11-19T00:39:41Z] <mutante> phab2001 - rsyncing /srv/repos data from phab1003 (T190568)

19:10 < mutante> !log phab2001 - restart ssh-phab service after repooling it after buster reinstall, it wasn't listening on the IPv6 IP, causing LVS/pybal alerts


After the reinstall, due to a puppet race condition, the ssh-phab service was started before the IPv6 service IP was added to the interface.

Therefore the additional SSH daemon was listening only on IPv4, not on IPv6, which made pybal mark the backend as down; after repooling, this caused Icinga alerts for backends that are marked down but pooled.

I had to restart the service with systemctl restart ssh-phab to make it listen on the IPv6 address as well; then ssh 2620:0:860:103:10:192:32:149 from lvs2002 worked again and the Icinga alerts recovered.

PYBAL CRITICAL - CRITICAL - git-ssh4_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled: git-ssh6_22: Servers phab2001-vcs.codfw.wmnet are marked down but pooled
19:11 <+icinga-wm> RECOVERY - PyBal IPVS diff check on lvs2002 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
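The failure mode above (the daemon bound only on IPv4 after the race) could be caught before repooling with a quick dual-stack connect check. A minimal sketch; the function names, addresses, and port in any real invocation are assumptions:

```python
import socket

def listening(addr, port, family, timeout=2):
    """Return True if a TCP connect to (addr, port) succeeds."""
    try:
        with socket.socket(family, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            s.connect((addr, port))
        return True
    except OSError:
        return False

def dual_stack_ok(v4_addr, v6_addr, port):
    # Both families should answer before the backend is repooled;
    # after the race described above, only the IPv4 check would pass.
    return {"ipv4": listening(v4_addr, port, socket.AF_INET),
            "ipv6": listening(v6_addr, port, socket.AF_INET6)}
```

Running this against the VCS service IP before repooling would have shown the IPv6 gap that pybal later flagged, without waiting for Icinga alerts.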

Dzahn updated the task description. Wed, Nov 20, 12:16 AM
Dzahn updated the task description. Wed, Nov 20, 12:19 AM
  • phab1001 is on buster
  • phab2001 is now also on buster
  • next: declare maintenance window to switch prod from phab1003 to phab1001
  • decom phab1003 and give back to the pool

Simplified the steps accordingly in the ticket header.

Technically this ticket is resolved. "Switch prod from phab1003 to phab1001" and "decom phab1003" are probably both separate (sub)tasks. Then again, can a ticket be resolved while subtasks are still open? We often say yes :p

Dzahn closed this task as Resolved. Fri, Nov 22, 10:10 PM
Dzahn removed a project: Patch-For-Review.
Dzahn updated the task description.

Change 551286 abandoned by Dzahn:
ATS/varnish: add phabricator-new to point to phab1001

Reason:
not needed anymore

https://gerrit.wikimedia.org/r/551286