Reimage both phab1001 and phab2001 to stretch
Open, High, Public

Description

We should reimage phab1001 and phab2001 to stretch now that T187127 is resolved and the reimage is unblocked.

This reimage is needed for T182832

We should reimage phab2001 first then phab1001

blockers noted in meeting:

  • ensure testing is possible from deployment servers (firewall holes)
  • ensure replacement server has 64GB RAM like prod server

todo:

  • install OS on phab1003
  • switch prod traffic to phab1003
  • reinstall phab2001 with stretch
  • reinstall phab1001 with stretch or decom it and keep phab1003 as prod server
  • if needed: switch traffic back
  • give back phab1003 or keep as permanent failover in the same DC

Event Timeline

We could also avoid downtime by temporarily reusing mw1298 (former image scaler) and reinstalling it as phab1002 with stretch. Then we can switch to phab1002 and reimage phab1001 (with an eventual switchback to phab1001/stretch) without having a Phabricator downtime of > 2 hours. The specs are roughly the same: phab1001 has a slightly more powerful CPU than mw1298, but both have 64 GB RAM, and according to Prometheus CPU usage is usually ~25%, so that should be fine.

Dzahn added a comment. May 25 2018, 5:26 PM

Yes, fully agree. We had already made a very similar plan on IRC, just wasn't sure which specific server to pick. I'll go with your suggestion of mw1298. Thanks for checking the specs. Sounds good!

Change 435211 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] rename wmf6937 from mw1298 to phab1002

https://gerrit.wikimedia.org/r/435211

+1 this sounds like a good plan.

Change 435211 abandoned by Dzahn:
assign wmf4727 as phab1002

https://gerrit.wikimedia.org/r/435211

Using the mw server has not been approved (T195623). We will have to use another spare machine with just 32GB RAM.

Dzahn removed Dzahn as the assignee of this task. Jun 21 2018, 12:02 PM

Meanwhile we have wmf4727 and it is in site.pp and using the phabricator puppet role. The actual switch from phab1001 to phab1002 has to be coordinated with Mukunda. Since i will be on vacation for a while i am temp. unassigning it from me. If others get a chance to do it that would be great. Otherwise i will take it back after my return. It shouldn't be blocked by me though.

Are the repos being resynced to phab1002?

The path to be rsynced is /srv/repos

The code is there to allow a user to do it. But it's not auto-syncing in the background. It needs a human to run the command. Though.. this is unexpected because auto_sync is set to true.

if $active_server != undef {
    rsync::quickdatacopy { 'srv-repos':
        ensure      => present,
        source_host => $active_server,
        dest_host   => $passive_server,
        auto_sync   => true,
        module_path => '/srv/repos',
    }
}
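For anyone who has to run the sync by hand, here is a minimal sketch of what such a manual pull could look like. It assumes the active host exposes an rsync daemon module named after the resource title ('srv-repos'); that is a guess based on the snippet above, not a confirmed detail of the generated sync script. The function name and the choice of flags are illustrative only.

```python
import shlex


def build_repo_sync_command(source_host, module="srv-repos", dest_path="/srv/repos"):
    """Return an rsync argv that pulls repository data from the active host.

    Assumption: the module name matches the quickdatacopy title; the real
    sync script generated by puppet may use different options.
    """
    source = f"rsync://{source_host}/{module}/"
    # -a keeps permissions and timestamps; --delete drops files that no
    # longer exist on the source so both copies stay identical.
    return ["rsync", "-a", "--delete", source, dest_path.rstrip("/") + "/"]


cmd = build_repo_sync_command("phab1001.eqiad.wmnet")
print(shlex.join(cmd))
```

Run on the passive host (phab1002 here), this only prints the command; reviewing it before executing it in a screen session seems safer than running it blindly.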

Change 441384 had a related patch set uploaded (by Paladox; owner: Paladox):
[operations/puppet@production] phabricator: Add new var phabricator_server_new

https://gerrit.wikimedia.org/r/441384

Change 447949 had a related patch set uploaded (by Paladox; owner: Paladox):
[operations/puppet@production] phabricator: Set phabricator_server_failover to phab1002

https://gerrit.wikimedia.org/r/447949

Mentioned in SAL (#wikimedia-operations) [2018-07-26T01:32:14Z] <mutante> phab1001 - rm /usr/local/sbin/sync-srv-repos that has a reference to non-existing server iridium.eqiad.wmnet (formerly phab) (T190568)

Change 447949 merged by Dzahn:
[operations/puppet@production] phabricator: Set phabricator_server_failover to phab1002

https://gerrit.wikimedia.org/r/447949

Mentioned in SAL (#wikimedia-operations) [2018-07-26T01:42:20Z] <mutante> phab1002 - starting rsync of repo data from phab1001 in a screen session after gerrit:447949 T190568

Change 441384 abandoned by Paladox:
phabricator: Add new var phabricator_server_new

https://gerrit.wikimedia.org/r/441384

@Dzahn Which steps are still missing to fail over to phab1002? It looks like we are close to finally being able to do it :)

@elukey @20after4 sorry for not replying earlier here. i think the only missing step to failover to phab1002 is that we set a maintenance window together / add it to deployment calendar / make an announcement. unless i forgot something

Change 486368 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: allow http from deployment hosts on stand-by servers

https://gerrit.wikimedia.org/r/486368

Change 486368 merged by Dzahn:
[operations/puppet@production] phabricator: allow http from deployment hosts on stand-by servers

https://gerrit.wikimedia.org/r/486368

Dzahn added a comment. Jan 24 2019, 8:47 PM

one follow-up was to ensure testing is possible with apache-fast-test, curl, etc. from deployment_servers.

This was previously not possible because firewall rules only allowed http connections from caching servers (varnish) to their backends.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/486181 added a firewall hole to allow connections from deployment_servers to the production server

and the additional https://gerrit.wikimedia.org/r/c/operations/puppet/+/486368 was needed to also allow these connections to standby (non-active) servers because the puppet code says that firewall holes are not opened at all unless on the "active server" (prod). Now we are saying "if on a standby server then allow from deployment server" but that still keeps the holes for caching servers and email communication closed.
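The resulting rule set can be summarised as a small decision function. This is only a model of the behaviour described above, written in Python for readability; the real logic lives in the puppet role, and the function names and group labels below are made up for illustration.

```python
def allowed_http_sources(is_active_server):
    """Return the source groups that may open HTTP connections to a host."""
    if is_active_server:
        # Production host: caches reach their backend, and deployment
        # hosts may connect for testing.
        return {"caching_servers", "deployment_servers"}
    # Stand-by host: only deployment hosts get a hole, for testing.
    return {"deployment_servers"}


def mail_allowed(is_active_server):
    """Email communication stays closed unless the host is the active server."""
    return is_active_server


for active in (True, False):
    label = "active" if active else "stand-by"
    print(label, sorted(allowed_http_sources(active)), "mail:", mail_allowed(active))
```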

This unblocked testing the phab1002 installation because now we can compare like so:

[deploy1001:~] $ apache-fast-test phab.url phab1001.eqiad.wmnet
testing 13 urls on 1 servers, totalling 13 requests
spawning threads..

http://bugs.wikimedia.org
 * 301 Moved Permanently https://bugs.wikimedia.org/
http://bugzilla.wikimedia.org
 * 301 Moved Permanently https://bugzilla.wikimedia.org/
http://phab.wmfusercontent.org
 * 301 Moved Permanently https://phab.wmfusercontent.org/
http://phabricator.wikimedia.org
 * 301 Moved Permanently https://phabricator.wikimedia.org/
http://phabricator.wikimedia.org/T166013
 * 301 Moved Permanently https://phabricator.wikimedia.org/T166013
http://phabricator.wikimedia.org/maniphest/task/create/l
 * 301 Moved Permanently https://phabricator.wikimedia.org/maniphest/task/create/l
http://phabricator.wikimedia.org/maniphest/task/edit/form/1/
 * 301 Moved Permanently https://phabricator.wikimedia.org/maniphest/task/edit/form/1/
http://phabricator.wikimedia.org/project/sprint/board/foo
 * 301 Moved Permanently https://phabricator.wikimedia.org/project/sprint/board/foo
https://bugzilla.wikimedia.org
 * 302 Found https://phabricator.wikimedia.org
https://phab.wmfusercontent.org
 * 302 Found https://phabricator.wikimedia.org
https://phabricator.wikimedia.org
 * 200 OK 34008
https://phabricator.wikimedia.org/T166013
 * 200 OK 93005
https://phabricator.wikimedia.org/maniphest/task/edit/form/1/
 * 200 OK 12135
[deploy1001:~] $ apache-fast-test phab.url phab1002.eqiad.wmnet
testing 13 urls on 1 servers, totalling 13 requests
spawning threads..

http://bugs.wikimedia.org
 * 301 Moved Permanently https://bugs.wikimedia.org/
http://bugzilla.wikimedia.org
 * 301 Moved Permanently https://bugzilla.wikimedia.org/
http://phab.wmfusercontent.org
 * 301 Moved Permanently https://phab.wmfusercontent.org/
http://phabricator.wikimedia.org
 * 301 Moved Permanently https://phabricator.wikimedia.org/
http://phabricator.wikimedia.org/T166013
 * 301 Moved Permanently https://phabricator.wikimedia.org/T166013
http://phabricator.wikimedia.org/maniphest/task/create/l
 * 301 Moved Permanently https://phabricator.wikimedia.org/maniphest/task/create/l
http://phabricator.wikimedia.org/maniphest/task/edit/form/1/
 * 301 Moved Permanently https://phabricator.wikimedia.org/maniphest/task/edit/form/1/
http://phabricator.wikimedia.org/project/sprint/board/foo
 * 301 Moved Permanently https://phabricator.wikimedia.org/project/sprint/board/foo
https://bugzilla.wikimedia.org
 * 302 Found https://phabricator.wikimedia.org
https://phab.wmfusercontent.org
 * 302 Found https://phabricator.wikimedia.org
https://phabricator.wikimedia.org
 * 200 OK 31245
https://phabricator.wikimedia.org/T166013
 * 429 Too Many Requests
https://phabricator.wikimedia.org/maniphest/task/edit/form/1/
 * 429 Too Many Requests

phab.url is a plain text file with a list of URLs. tested it works on phab1001, 1002 and also 2001.

Regarding "* 429 Too Many Requests": someone will want to whitelist the deploy1001 IP so it is not hit by the rate limiter.
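Comparing the two runs by eye gets tedious, so here is a rough sketch of how the status codes could be diffed automatically. It assumes the output format shown above (a URL line followed by a " * <status> ..." line); the function names are hypothetical and this is not part of apache-fast-test itself. The sample strings are short excerpts from the two runs.

```python
def parse_fast_test(output):
    """Map each tested URL to the HTTP status code on the line that follows it."""
    results = {}
    current_url = None
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith(("http://", "https://")):
            current_url = stripped
        elif stripped.startswith("* ") and current_url is not None:
            # "* 200 OK 34008" -> 200
            results[current_url] = int(stripped.split()[1])
            current_url = None
    return results


def status_differences(reference, candidate):
    """Return {url: (reference_status, candidate_status)} where they differ."""
    ref, cand = parse_fast_test(reference), parse_fast_test(candidate)
    return {url: (ref[url], cand.get(url)) for url in ref if ref[url] != cand.get(url)}


run_phab1001 = """\
http://phabricator.wikimedia.org
 * 301 Moved Permanently https://phabricator.wikimedia.org/
https://phabricator.wikimedia.org
 * 200 OK 34008
https://phabricator.wikimedia.org/T166013
 * 200 OK 93005
"""

run_phab1002 = """\
http://phabricator.wikimedia.org
 * 301 Moved Permanently https://phabricator.wikimedia.org/
https://phabricator.wikimedia.org
 * 200 OK 31245
https://phabricator.wikimedia.org/T166013
 * 429 Too Many Requests
"""

differences = status_differences(run_phab1001, run_phab1002)
print(differences)
```

Only status codes are compared; response sizes are ignored on purpose because they differ between hosts even when both return 200.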

Dzahn updated the task description. Jan 24 2019, 8:58 PM
Dzahn changed the task status from Open to Stalled. Apr 17 2019, 10:02 PM

blocked on T215335

Dzahn updated the task description. Apr 18 2019, 9:15 PM
Dzahn updated the task description.
Dzahn updated the task description. May 8 2019, 2:51 PM

The blocking ticket is closed; what else is needed for this to move forward?

Change 510597 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: enable php-fpm

https://gerrit.wikimedia.org/r/510597

Change 510614 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: enable php-fpm, disable logmails on phab1003

https://gerrit.wikimedia.org/r/510614

Change 510614 merged by Dzahn:
[operations/puppet@production] phabricator: enable php-fpm, disable logmails on phab1003

https://gerrit.wikimedia.org/r/510614

Mentioned in SAL (#wikimedia-operations) [2019-05-15T21:47:24Z] <mutante> phab1003 - ip -6 addr del 2620:0:861:ed1a::3:16/128 dev lo - remove extra service IP for phab's separate sshd, duplicated with phab1001 (T190568)

Change 510623 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: add vcs listen addresses for phab1003

https://gerrit.wikimedia.org/r/510623

Change 510623 merged by Dzahn:
[operations/puppet@production] phabricator: add vcs listen addresses for phab1003

https://gerrit.wikimedia.org/r/510623

Dzahn changed the task status from Stalled to Open. May 23 2019, 4:50 AM
Dzahn claimed this task.
Dzahn updated the task description.
Dzahn updated the task description. May 23 2019, 4:14 PM

Phabricator has been switched to phab1003 as the prod server now and that meant:

  • php 5 to php 7.2
  • mod_php to php-fpm
  • jessie to stretch (and the httpd version with it)

Dzahn added a comment. May 23 2019, 4:16 PM

I am now thinking we could make the process easier by simply keeping phab1003 as the prod server. Then we only need to discuss whether we want to keep phab1001 as a permanent stand-by in the same DC or give it back to the spares pool.

Next phab2001 will also need to be reinstalled with stretch.

@Dzahn: agreed. I don't know who should decide if we keep phab1001 or return it. We've proven now that the migration is doable in a sort-of short amount of time, so a warm standby would certainly improve availability in the event of a major failure with the primary machine.

Dzahn added a comment. May 23 2019, 4:19 PM

I will bring this up in my next subteam discussion meeting, which should be in a week. Until then i will hold on to phab1001. Maybe we keep it as is for a few days and then install stretch.

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

phab2001.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/201905291932_dzahn_103379_phab2001_codfw_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2019-05-29T19:32:40Z] <mutante> phba2001 - reinstalling with stretch - upgrade from jessie (T190568)

Change 513202 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: switch phab2001 to stretch installer

https://gerrit.wikimedia.org/r/513202

Change 513202 merged by Dzahn:
[operations/puppet@production] install_server: switch phab2001 to stretch installer

https://gerrit.wikimedia.org/r/513202

Completed auto-reimage of hosts:

['phab2001.codfw.wmnet']

Of which those FAILED:

['phab2001.codfw.wmnet']

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

phab2001.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/201905292034_dzahn_151516_phab2001_codfw_wmnet.log.

Completed auto-reimage of hosts:

['phab2001.codfw.wmnet']

Of which those FAILED:

['phab2001.codfw.wmnet']

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

phab2001.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/201905292217_dzahn_171903_phab2001_codfw_wmnet.log.

Dzahn updated the task description. May 29 2019, 11:31 PM

Change 513242 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: activate logmail on phab1003, disable on phab1001

https://gerrit.wikimedia.org/r/513242

Change 513242 merged by Dzahn:
[operations/puppet@production] phabricator: activate logmail on phab1003, disable on phab1001

https://gerrit.wikimedia.org/r/513242

Completed auto-reimage of hosts:

['phab2001.codfw.wmnet']

Of which those FAILED:

['phab2001.codfw.wmnet']

Joe moved this task from Backlog to Doing on the serviceops board. Jun 21 2019, 7:37 AM
Dzahn added a comment. Jul 9 2019, 9:13 PM

Next we need to make a decision whether we keep phab1003 as the prod host permanently (why not i guess?) and then decom phab1001, or whether we go with the original plan: reinstall phab1001, switch back to it, and then give the temp host phab1003 back to the pool. Maybe dcops prefer the latter because we asked for a temp host only, but maybe it doesn't matter to them at all, since we would be giving another host back instead.

Next we need to make a decision whether we keep phab1003 as the prod host permanently (why not i guess?)

What's the current procedure to switch over the active Phab server, just a DNS name change? IIRC we can't switch over to codfw currently because the Phab DB is not replicated; is that still correct? Phabricator is an important service and the Phabricator server is currently a SPOF. If the procedure to fail over the active Phab server is non-intrusive, I'd suggest reimaging phab1001 to Stretch and keeping it as the failover server (similar to what we do for e.g. cloudnet, cloudservices, cloudcontrol). If we run into hardware issues with the active Phab server, we can then easily switch over to the failover host in eqiad. If the failover procedure is non-intrusive, this might also simplify Phab maintenance a lot, as new changes could more easily be staged on the secondary host (e.g. by manually pointing /etc/hosts at that server). And if we e.g. reboot the Phab host, we could switch to the secondary host and minimise user-visible downtime.
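To illustrate the /etc/hosts idea: a tester could point the public hostname at the secondary host on their own machine only. The sketch below just computes the new file content and does not write anything. The helper name is hypothetical, and the addresses are placeholders from the documentation ranges, not the real IPs of any phab host.

```python
def with_override(hosts_text, ip_address, hostname):
    """Return hosts-file text in which `hostname` resolves to `ip_address`.

    Limitation of this sketch: an existing line that names `hostname` is
    dropped as a whole, including any other names listed on that line.
    """
    kept = []
    for line in hosts_text.splitlines():
        # Ignore trailing comments, then split into address + names.
        fields = line.split("#", 1)[0].split()
        if hostname in fields[1:]:
            continue
        kept.append(line)
    kept.append(f"{ip_address} {hostname}")
    return "\n".join(kept) + "\n"


original = "127.0.0.1 localhost\n198.51.100.7 phabricator.wikimedia.org\n"
updated = with_override(original, "192.0.2.10", "phabricator.wikimedia.org")
print(updated, end="")
```

Removing the override line afterwards restores normal DNS resolution, so this only affects the machine doing the test.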

Change 510597 merged by Dzahn:
[operations/puppet@production] phabricator: enable php-fpm in Hiera on both hosts

https://gerrit.wikimedia.org/r/510597

Mentioned in SAL (#wikimedia-operations) [2019-07-19T22:36:01Z] <mutante> phab2001 - switching apache to php-fpm and worker instead of mpm-prefork (to match phab1001) (T190568 T137928 T190572)