Deploy phabricator to phab2001.codfw.wmnet
Open, High, Public

Description

Now that T137838: setup phab2001.codfw.wmnet (WMF6405) is done, it's time to get some redundancy in phabricator.

There are a lot of pieces to this and most of these will eventually split off as subtasks but I'm writing them here for now:

  • get phabricator puppet roles working 100% on jessie
  • figure out how to get email working from codfw
  • mirror all repositories to phab2001.codfw.wmnet
  • prepare a disaster recovery plan for failing over from iridium to phab2001
  • rename iridium to phab1001.eqiad.wmnet

related courtesy link: Phabricator Clustering Introduction

Volans added a subscriber: Volans. Aug 16 2016, 6:21 PM

@Dzahn @mmodell python-phabricator (0.6.1-1) has been backported into Jessie (from Debian Stretch) in T142097.
A sanity check should be done to ensure that it works properly with the current code in phab_epipe.py, given that our package for Trusty is version 0.4.0.

@RobH: FYI, see my previous post. The python-phabricator package should affect only phab_epipe.py if phab2001 becomes active, and you're the major user of that script 😉

I don't think it will be an issue. I've looked at phab_epipe.py and the changelog for python-phabricator. The only potential issue I know of is the newer token-based authentication scheme employed by phabricator. We might have to take advantage of the features added in 0.5.0 to use tokens; however, it seems the old certificate-based authentication should still work?

yes it looks like user+certificate auth is still supported

Change 303740 merged by Dzahn:
phabricator: add systemd unit file for phd service

https://gerrit.wikimedia.org/r/303740

Dzahn added a comment. Aug 16 2016, 8:09 PM

Added systemd unit file. systemd now knows phd as a service.

root@phab2001:/etc/systemd/system# systemctl status phd.service
● phd.service - phabricator-phd

Loaded: loaded (/etc/systemd/system/phd.service; disabled)
Active: failed (Result: exit-code) since Tue 2016-08-16 20:06:40 UTC; 29s ago

The failure to start is because "Failed at step EXEC spawning /srv/phab/phabricator/bin/phd: No such file or directory". /srv/phab/phabricator/ is there but ./bin is not. The next step would be to ensure that it exists before the service starts.
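
A pre-flight check along these lines could confirm the binary is present before the unit is started. This is only a sketch: the helper name exec_target_ready is hypothetical, and the path is the one from the error message above.

```python
import os

# Path taken from the systemd error message in this comment.
PHD_BIN = "/srv/phab/phabricator/bin/phd"


def exec_target_ready(path):
    """Return True if path is a regular file that the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


if __name__ == "__main__":
    if exec_target_ready(PHD_BIN):
        print("ok: %s exists and is executable" % PHD_BIN)
    else:
        print("missing: %s (deploy the phabricator repository first)" % PHD_BIN)
```

On a fresh host this would report the file as missing until the repository has been deployed.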

That directory is part of the phabricator repository, it should be there. I'll look into this.

mmodell added a comment. Edited Aug 16 2016, 10:22 PM

@Dzahn: we need a phab-deploy user on phab2001 and it doesn't seem to be set up. False alarm, it's there

mmodell added a comment. Edited Aug 16 2016, 10:47 PM

@Dzahn: I was able to do a scap3 deployment from tin and initialize the submodules. After doing that plus one more puppet run I was able to start phd.

I'm not sure how to fix the circular dependency with puppet depending on files deployed by scap and scap depending on files deployed with puppet.

@mmodell ah, cool! i think it's ok if puppet works after the first scap deploy on a fresh host

I see Icinga also reports "phd" as running (PROCS OK: 19 processes with UID = 997 (phd) ) now. Nice.

Earlier we got paged about it not running then recovering on phab2001. Since this is still going to be worked on i disabled notifications. When this ticket gets resolved they should be enabled again.

@Dzahn: I believe that it would work after two puppet runs even without a scap deploy, now that I have it set up right on tin.

We do have an issue with running services that have not been properly configured for clustering so I've disabled puppet and phd on phab2001 for now.

Also, see https://gerrit.wikimedia.org/r/#/c/305149

I suggest we keep the monitoring active for both DCs but make sure we only get SMS for the active data center.

monitor::service for example has "critical => true/false" where critical means paging and "contactgroups" is being used already where "sms" means paging.

We could put in Hiera somewhere which is the currently active DC and then change the contact groups based on that. When you switch to the other DC it would be ideally just flipping that one master value in Hiera that does this (and other things).
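
The selection logic could look roughly like the sketch below. It is written in Python only to illustrate the idea; in practice it would be expressed in Puppet and Hiera. The function name contact_groups and the base group "admins" are assumptions, while "sms" is the paging group mentioned above.

```python
def contact_groups(host_dc, active_dc, base=("admins",), paging_group="sms"):
    """Build the contact group string for a host.

    Only hosts in the currently active data center get the paging group,
    so monitoring stays on in both DCs but only the active one pages.
    """
    groups = list(base)
    if host_dc == active_dc:
        groups.append(paging_group)
    return ",".join(groups)


if __name__ == "__main__":
    active = "eqiad"  # the single master value that would be flipped on failover
    print("iridium :", contact_groups("eqiad", active))
    print("phab2001:", contact_groups("codfw", active))
```

Switching the active data center then only means changing the one value passed as active_dc.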

phab2001 and iridium will need to be able to ssh to each other for phabricator cluster services to work correctly. I don't see a way to tunnel through a bastion, though it might be possible.

Change 305277 had a related patch set uploaded (by Dzahn):
phabricator: allow ssh between instances for cluster support

https://gerrit.wikimedia.org/r/305277

Dzahn edited the task description. Aug 17 2016, 8:35 PM

Change 305277 merged by Dzahn:
phabricator: allow ssh between servers for cluster support

https://gerrit.wikimedia.org/r/305277

Dzahn added a comment. Aug 18 2016, 8:33 PM

phab2001 and iridium will need to be able to ssh to each other for phabricator cluster services to work correctly.

There are now iptables rules via ferm for this, on both servers:

ACCEPT     tcp  --  iridium.eqiad.wmnet  anywhere             tcp dpt:ssh
ACCEPT     tcp  --  phab2001.codfw.wmnet  anywhere             tcp dpt:ssh

I can't confirm that ssh from one to another works yet though.

a) also need to allow IPv6, it's not in ip6tables

but also

b) ssh: connect to host 10.64.32.150 port 22: No route to host
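
To tell apart the two cases (firewall rule missing for one address family versus no route at all), a small probe that tries every resolved address separately may help. This is a sketch using only the Python standard library; the function name probe_tcp is hypothetical.

```python
import socket


def probe_tcp(host, port=22, family=socket.AF_UNSPEC, timeout=5.0):
    """Try a TCP connect to each address host resolves to.

    Returns a list of (address, ok, error) tuples, one per resolved address,
    so IPv4 and IPv6 results show up separately.
    """
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return [(host, False, str(exc))]
    results = []
    for fam, socktype, proto, _canonname, sockaddr in infos:
        with socket.socket(fam, socktype, proto) as sock:
            sock.settimeout(timeout)
            try:
                sock.connect(sockaddr)
                results.append((sockaddr[0], True, ""))
            except OSError as exc:
                # e.g. "No route to host" or "Connection refused"
                results.append((sockaddr[0], False, str(exc)))
    return results
```

Calling probe_tcp("iridium.eqiad.wmnet") from phab2001 (and the reverse) would list one result per address, which makes a missing ip6tables rule or a routing problem visible per address family.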

@Dzahn: I think iridium is somehow specially segregated in the network. How that works, I'm not sure.

Dzahn added a comment. Edited Aug 18 2016, 8:55 PM

@mmodell yea, looks like it. we gotta add netops in that case.

Change 305591 had a related patch set uploaded (by Dzahn):
phabricator: don't run phd on inactive server yet

https://gerrit.wikimedia.org/r/305591

Change 305591 merged by Dzahn:
phabricator: don't run phd on inactive server yet

https://gerrit.wikimedia.org/r/305591

Volans added a comment. Sep 5 2016, 1:38 PM

@Dzahn @mmodell FYI, it looks like Salt is not properly configured for phab2001; it doesn't respond to a normal test.ping or other commands.

Change 324369 had a related patch set uploaded (by Dzahn):
phab: fix systemd unit file name of ssh-phab

https://gerrit.wikimedia.org/r/324369

Change 324369 merged by Dzahn:
phab: fix systemd unit file name of ssh-phab

https://gerrit.wikimedia.org/r/324369

Dzahn added a comment. Nov 30 2016, 1:24 AM

I re-enabled puppet on phab2001 today and checked that it now gets the right secondary IP on the interface (see the subtask about networking, which is now resolved). It was fine and did not break git-ssh anymore.

It got a bunch of other updates because puppet was disabled for a while.

Then after that last fix above about the systemd unit file, now puppet runs on phab2001 and there are no errors. I am checking the first checkbox about getting the phab roles to work on jessie.


Info: Applying configuration version '1480468216'
Notice: /Stage[main]/Phabricator::Vcs/Service[ssh-phab]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Phabricator::Vcs/Service[ssh-phab]: Unscheduling refresh on Service[ssh-phab]
Notice: Finished catalog run in 15.92 seconds

Dzahn added a comment. Edited Nov 30 2016, 1:26 AM

@Volans puppet runs again on phab2001, it got a bunch of pending upgrades, it also started reporting in icinga again. (though not in production just yet)

salt issue is fixed:

[neodymium:~] $ sudo salt 'phab2001*' cmd.run 'hostname'
phab2001.codfw.wmnet:
    phab2001

Dzahn edited the task description. Nov 30 2016, 1:27 AM
Dzahn added a comment. Dec 1 2016, 5:02 PM

https://gerrit.wikimedia.org/r/#/c/324408/

https://gerrit.wikimedia.org/r/#/c/324551/

^ Since these changes, we can now override the phab server name(s) in hiera, for Apache config and git-ssh config.

This allowed Paladox to get this up and running:

F4931606 :)

Change 324796 had a related patch set uploaded (by Paladox):
Phabricator: rsync /srv/repos from iridium to phab2001

https://gerrit.wikimedia.org/r/324796

Change 324797 had a related patch set uploaded (by Dzahn):
varnish misc: add phab2001 as a backend for phab-new

https://gerrit.wikimedia.org/r/324797

Change 324796 abandoned by Paladox:
Phabricator: rsync /srv/repos from iridium to phab2001

https://gerrit.wikimedia.org/r/324796

Change 324832 had a related patch set (by Paladox) published:
Phabricator: Set domain for phab2001 in codfw

https://gerrit.wikimedia.org/r/324832

Change 324833 had a related patch set (by Paladox) published:
Phabricator: Set phabricator active server for iridium and phab2001

https://gerrit.wikimedia.org/r/324833

Change 324851 had a related patch set uploaded (by 20after4):
phabricator: cluster.addresses to whitelist iridium and phab2001

https://gerrit.wikimedia.org/r/324851

Change 324796 restored by Dzahn:
Phabricator: rsync /srv/repos from iridium to phab2001

https://gerrit.wikimedia.org/r/324796

Change 324832 merged by Dzahn:
Phabricator: Set domain for phab2001 in codfw

https://gerrit.wikimedia.org/r/324832

Change 324796 merged by Dzahn:
Phabricator: allow rsyncing /srv/repos from active to passive server

https://gerrit.wikimedia.org/r/324796

Change 325067 had a related patch set uploaded (by Dzahn):
phabricator: use FQDN instead of short hostname in ferm rules

https://gerrit.wikimedia.org/r/325067

Change 325067 merged by Dzahn:
phabricator: use FQDN instead of short hostname in ferm rules

https://gerrit.wikimedia.org/r/325067

Change 325077 had a related patch set uploaded (by Dzahn):
phab2001: fix hosts_allowed in rsync config

https://gerrit.wikimedia.org/r/325077

Change 325077 merged by Dzahn:
phab2001: fix hosts_allowed in rsync config

https://gerrit.wikimedia.org/r/325077

Change 325079 had a related patch set uploaded (by Dzahn):
phab2001: allow rsync from iridium over IPv6 too

https://gerrit.wikimedia.org/r/325079

Change 325079 merged by Dzahn:
phab2001: allow rsync from iridium over IPv6 too

https://gerrit.wikimedia.org/r/325079

Dzahn added a comment. Dec 3 2016, 3:50 AM

We can now rsync from iridium to phab2001.

I have deleted the old contents of /srv/repos on phab2001 and started a new sync, which has finished.

the command line was, on iridium:

rsync -avp /srv/repos/ rsync://phab2001.codfw.wmnet:/phab-srv-repos

size of all repos is about 26G

sent 23,198,077,242 bytes  received 27,050,398 bytes  5,876,060.12 bytes/sec
total size is 23,078,007,366  speedup is 0.99
dzahn@iridium:/srv/repos$ du -hs .
26G     .
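
The rsync summary can be sanity-checked with a little arithmetic. The sketch below assumes rsync reports bytes/sec as (sent + received) divided by elapsed time and speedup as total size divided by (sent + received); both assumptions are consistent with the numbers above.

```python
# Figures copied from the rsync summary in this comment.
sent = 23_198_077_242        # bytes sent
received = 27_050_398        # bytes received
rate = 5_876_060.12          # bytes/sec as reported
total_size = 23_078_007_366  # "total size is"

transferred = sent + received
elapsed_s = transferred / rate        # approximate wall-clock duration
speedup = total_size / transferred    # below 1.0 means protocol overhead exceeded any savings
total_gib = total_size / 2**30        # file sizes in GiB, for comparison with du

print("elapsed: about %.0f s (%.1f min)" % (elapsed_s, elapsed_s / 60))
print("speedup: %.2f" % speedup)
print("total size: %.1f GiB" % total_gib)
```

So the full copy took roughly 66 minutes. The total file size works out to about 21.5 GiB, which is smaller than the 26G shown by du, presumably because du counts allocated disk blocks rather than file lengths.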
Zppix changed the status of subtask T152132: Setup test domain for phab2001 from "Open" to "Stalled". Dec 19 2016, 9:50 PM

Change 324833 abandoned by Paladox:
Phabricator: Set phabricator active server for iridium and phab2001

https://gerrit.wikimedia.org/r/324833

@mmodell hi,

any updates on these checkboxes, please?

  • figure out how to get email working from codfw
  • mirror all repositories to phab2001.codfw.wmnet
  • prepare a disaster recovery plan for failing over from iridium to phab2001

I forgot: did we rsync the repos?

Also, are emails working from codfw?

And is there a disaster recovery plan for failing over from iridium to phab2001?

I'm not sure about email, but last I knew this was blocked on getting the varnish backend set up... That may have changed, I'll verify.

mmodell edited the task description. Jan 23 2017, 6:39 PM
Dzahn added a comment. Jan 23 2017, 7:19 PM

@Paladox @mmodell that patch is waiting for refactoring of varnish misc-web backends. it has been mentioned in ops meeting. Brandon will probably merge something today that will unblock that so we can get the backend soon.

Dzahn added a comment. Jan 23 2017, 7:24 PM

re: rsync - rsyncd has been set up and we manually rsynced once, AFAIR. But I don't think we have anything (cron) that does this automatically yet. Maybe we don't want that, though.

re: rename iridium - that would be T152129. We can either first make phab2001 the live server and then reinstall/rename iridium, or the other way around. Right now it's kind of a circular dependency. I tend to think we should first finish this one here, switch to it, and then touch iridium. That would mean removing that checkbox from this ticket.
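
If the repo sync were ever automated, a wrapper could refuse to run anywhere except on the active server. The sketch below only builds the command line; the function name build_rsync_argv and the idea of passing in the active host are assumptions, while the source, destination, and -avp flags are the ones from the manual run on iridium.

```python
SRC = "/srv/repos/"
DEST = "rsync://phab2001.codfw.wmnet:/phab-srv-repos"


def build_rsync_argv(this_host, active_host, dry_run=True):
    """Return the rsync argument list, or None when this host is not the active one.

    dry_run adds rsync's --dry-run flag so nothing is transferred while testing.
    """
    if this_host != active_host:
        return None  # never push from the passive server
    argv = ["rsync", "-avp"]
    if dry_run:
        argv.append("--dry-run")
    argv += [SRC, DEST]
    return argv


if __name__ == "__main__":
    argv = build_rsync_argv("iridium.eqiad.wmnet", "iridium.eqiad.wmnet")
    print(" ".join(argv) if argv else "not the active server, skipping")
```

The returned list could be handed to subprocess.run from a cron job or timer, but whether automatic syncing is wanted at all is still an open question here.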

Change 324797 abandoned by Dzahn:
varnish misc: add phab2001 as a backend for phab-new

Reason:
we can't make phab2001 the production phab server since we don't have per-service routing yet and we don't want to send unencrypted traffic across datacenters

https://gerrit.wikimedia.org/r/324797

Change 339763 had a related patch set uploaded (by Dzahn; owner: Paladox):
Phabricator: Migrate to base::service_unit for ssh-phab

https://gerrit.wikimedia.org/r/339763

Change 340158 had a related patch set (by Paladox) published:
Phabricator: Migrate to base::service_unit for phd

https://gerrit.wikimedia.org/r/340158

Change 341589 had a related patch set uploaded (by Dzahn):
[operations/puppet] phabricator: fix file names of systemd/upstart templates

https://gerrit.wikimedia.org/r/341589

Change 341589 merged by Dzahn:
[operations/puppet] phabricator: fix file names of systemd/upstart templates

https://gerrit.wikimedia.org/r/341589

Change 339763 merged by Dzahn:
[operations/puppet] Phabricator: Migrate to base::service_unit for ssh-phab

https://gerrit.wikimedia.org/r/339763

Dzahn added a comment. Tue, Mar 7, 8:28 PM

12:29 < mutante> !log phab2001 - phab-ssh service converted to base::service_unit and with working systemd unit file. 'systemctl ssh-phab status' is active (running) (T158434)

Change 341747 had a related patch set uploaded (by Dzahn):
[operations/puppet] phabricator: monitor PHD service only on active server

https://gerrit.wikimedia.org/r/341747

Change 341747 merged by Dzahn:
[operations/puppet] phabricator: monitor PHD service only on active server

https://gerrit.wikimedia.org/r/341747

Change 340158 merged by Dzahn:
[operations/puppet] Phabricator: Migrate to base::service_unit for phd

https://gerrit.wikimedia.org/r/340158

Mentioned in SAL (#wikimedia-operations) [2017-03-09T00:36:37Z] <mutante> iridium - temp. disable puppet | phab1001 - converting service to base::service_unit (T137928)