Deploy phabricator to phab2001.codfw.wmnet
Open, Stalled, Normal, Public

Description

Now that T137838: setup phab2001.codfw.wmnet (WMF6405) is done, it's time to get some redundancy in phabricator.

There are a lot of pieces to this, and most of them will eventually be split off as subtasks, but I'm listing them here for now:

  • get phabricator puppet roles working 100% on jessie
  • figure out how to get email working from codfw
  • mirror all repositories to phab2001.codfw.wmnet
  • prepare a disaster recovery plan for failing over from phab1001 to phab2001 (or from phab2001 to phab1001)
  • rename iridium to phab1001.eqiad.wmnet

related courtesy link: Phabricator Clustering Introduction

Change 305591 merged by Dzahn:
phabricator: don't run phd on inactive server yet

https://gerrit.wikimedia.org/r/305591

Volans added a comment. Sep 5 2016, 1:38 PM

@Dzahn @mmodell FYI, it looks like Salt is not properly configured for phab2001; it doesn't respond to a normal test.ping or to other commands.

Change 324369 had a related patch set uploaded (by Dzahn):
phab: fix systemd unit file name of ssh-phab

https://gerrit.wikimedia.org/r/324369

Change 324369 merged by Dzahn:
phab: fix systemd unit file name of ssh-phab

https://gerrit.wikimedia.org/r/324369

Dzahn added a comment. Nov 30 2016, 1:24 AM

I re-enabled puppet on phab2001 today and checked that it now gets the right secondary IP on the interface (see the networking subtask, which is now resolved). It was fine and no longer broke git-ssh.

It got a bunch of other updates because puppet was disabled for a while.

After the last fix above to the systemd unit file, puppet now runs on phab2001 without errors. I am ticking the first checkbox, the one about getting the phab roles to work on jessie.


Info: Applying configuration version '1480468216'
Notice: /Stage[main]/Phabricator::Vcs/Service[ssh-phab]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Phabricator::Vcs/Service[ssh-phab]: Unscheduling refresh on Service[ssh-phab]
Notice: Finished catalog run in 15.92 seconds

Dzahn added a comment. (Edited) Nov 30 2016, 1:26 AM

@Volans puppet runs again on phab2001. It got a bunch of pending upgrades and has started reporting in icinga again (though it is not in production just yet).

The salt issue is fixed:

[neodymium:~] $ sudo salt 'phab2001*' cmd.run 'hostname'
phab2001.codfw.wmnet:
    phab2001

Dzahn updated the task description. Nov 30 2016, 1:27 AM
Dzahn added a comment. Dec 1 2016, 5:02 PM

https://gerrit.wikimedia.org/r/#/c/324408/

https://gerrit.wikimedia.org/r/#/c/324551/

^ Since these changes, we can override the phab server name(s) in hiera for both the Apache config and the git-ssh config.

This allowed Paladox to get this up and running:

F4931606 :)

Change 324796 had a related patch set uploaded (by Paladox):
Phabricator: rsync /srv/repos from iridium to phab2001

https://gerrit.wikimedia.org/r/324796

Change 324797 had a related patch set uploaded (by Dzahn):
varnish misc: add phab2001 as a backend for phab-new

https://gerrit.wikimedia.org/r/324797

Change 324796 abandoned by Paladox:
Phabricator: rsync /srv/repos from iridium to phab2001

https://gerrit.wikimedia.org/r/324796

Change 324832 had a related patch set (by Paladox) published:
Phabricator: Set domain for phab2001 in codfw

https://gerrit.wikimedia.org/r/324832

Change 324833 had a related patch set (by Paladox) published:
Phabricator: Set phabricator active server for iridium and phab2001

https://gerrit.wikimedia.org/r/324833

Change 324851 had a related patch set uploaded (by 20after4):
phabricator: cluster.addresses to whitelist iridium and phab2001

https://gerrit.wikimedia.org/r/324851

Change 324796 restored by Dzahn:
Phabricator: rsync /srv/repos from iridium to phab2001

https://gerrit.wikimedia.org/r/324796

Change 324832 merged by Dzahn:
Phabricator: Set domain for phab2001 in codfw

https://gerrit.wikimedia.org/r/324832

Change 324796 merged by Dzahn:
Phabricator: allow rsyncing /srv/repos from active to passive server

https://gerrit.wikimedia.org/r/324796

Change 325067 had a related patch set uploaded (by Dzahn):
phabricator: use FQDN instead of short hostname in ferm rules

https://gerrit.wikimedia.org/r/325067

Change 325067 merged by Dzahn:
phabricator: use FQDN instead of short hostname in ferm rules

https://gerrit.wikimedia.org/r/325067

Change 325077 had a related patch set uploaded (by Dzahn):
phab2001: fix hosts_allowed in rsync config

https://gerrit.wikimedia.org/r/325077

Change 325077 merged by Dzahn:
phab2001: fix hosts_allowed in rsync config

https://gerrit.wikimedia.org/r/325077

Change 325079 had a related patch set uploaded (by Dzahn):
phab2001: allow rsync from iridium over IPv6 too

https://gerrit.wikimedia.org/r/325079

Change 325079 merged by Dzahn:
phab2001: allow rsync from iridium over IPv6 too

https://gerrit.wikimedia.org/r/325079

Dzahn added a comment. Dec 3 2016, 3:50 AM

We can now rsync from iridium to phab2001.

I deleted the old contents of /srv/repos on phab2001 and started a new sync, which has finished.

The command line, run on iridium, was:

rsync -avp /srv/repos/ rsync://phab2001.codfw.wmnet:/phab-srv-repos

The size of all repos is about 26G:

sent 23,198,077,242 bytes  received 27,050,398 bytes  5,876,060.12 bytes/sec
total size is 23,078,007,366  speedup is 0.99
dzahn@iridium:/srv/repos$ du -hs .
26G     .
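
The rsync summary above can be sanity-checked with a little arithmetic. The sketch below assumes that rsync's "speedup" is the total file size divided by the bytes actually transferred (sent plus received), and that the reported rate is roughly the transferred bytes divided by the elapsed time. The function names are made up for illustration.

```python
# Figures copied from the rsync summary above.
SENT = 23_198_077_242
RECEIVED = 27_050_398
TOTAL_SIZE = 23_078_007_366
RATE = 5_876_060.12  # bytes/sec as reported by rsync


def speedup(total_size: int, sent: int, received: int) -> float:
    """Total file size divided by the bytes that crossed the wire."""
    return total_size / (sent + received)


def approx_duration_minutes(sent: int, received: int, rate: float) -> float:
    """Rough transfer time, assuming rate = transferred bytes / elapsed time."""
    return (sent + received) / rate / 60


if __name__ == "__main__":
    print(f"speedup: {speedup(TOTAL_SIZE, SENT, RECEIVED):.2f}")
    print(f"duration: ~{approx_duration_minutes(SENT, RECEIVED, RATE):.0f} min")
```

A speedup just under 1 is what a full initial copy looks like: the destination had been emptied, so nothing could be reused and slightly more bytes than the payload were sent. Under the same assumption, the run took a little over an hour.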

Zppix changed the status of subtask T152132: Setup test domain for phab2001 from Open to Stalled. Dec 19 2016, 9:50 PM

Change 324833 abandoned by Paladox:
Phabricator: Set phabricator active server for iridium and phab2001

https://gerrit.wikimedia.org/r/324833

@mmodell hi,

are there any updates on these checkboxes, please?

  • figure out how to get email working from codfw
  • mirror all repositories to phab2001.codfw.wmnet
  • prepare a disaster recovery plan for failing over from iridium to phab2001

I forgot: did we rsync the repos?

Also, is email working from codfw?

And is there a disaster recovery plan for failing over from iridium to phab2001?

I'm not sure about email, but last I knew this was blocked on getting the varnish backend set up... That may have changed, I'll verify.

mmodell updated the task description. Jan 23 2017, 6:39 PM
Dzahn added a comment. Jan 23 2017, 7:19 PM

@Paladox @mmodell That patch is waiting for the refactoring of the varnish misc-web backends. It has been mentioned in the ops meeting. Brandon will probably merge something today that will unblock it, so we can get the backend soon.

Dzahn added a comment. Jan 23 2017, 7:24 PM

re: rsync - rsyncd has been set up and we manually rsynced once, afair. But I don't think we have anything (cron) that does this automatically yet. Maybe we don't want that, though.

re: rename iridium - that would be T152129. We can either first make phab2001 the live server and then reinstall/rename iridium, or do it the other way around. Right now it's kind of a circular dependency. I tend to think we should first finish this task, switch to phab2001, and then touch iridium. That would mean removing that checkbox from this ticket.

Change 324797 abandoned by Dzahn:
varnish misc: add phab2001 as a backend for phab-new

Reason:
we can't make phab2001 the production phab server since we don't have per-service routing yet and we don't want to send unencrypted traffic across datacenters

https://gerrit.wikimedia.org/r/324797

Change 339763 had a related patch set uploaded (by Dzahn; owner: Paladox):
Phabricator: Migrate to base::service_unit for ssh-phab

https://gerrit.wikimedia.org/r/339763

Change 340158 had a related patch set (by Paladox) published:
Phabricator: Migrate to base::service_unit for phd

https://gerrit.wikimedia.org/r/340158

Change 341589 had a related patch set uploaded (by Dzahn):
[operations/puppet] phabricator: fix file names of systemd/upstart templates

https://gerrit.wikimedia.org/r/341589

Change 341589 merged by Dzahn:
[operations/puppet] phabricator: fix file names of systemd/upstart templates

https://gerrit.wikimedia.org/r/341589

Change 339763 merged by Dzahn:
[operations/puppet] Phabricator: Migrate to base::service_unit for ssh-phab

https://gerrit.wikimedia.org/r/339763

Dzahn added a comment. Mar 7 2017, 8:28 PM

12:29 < mutante> !log phab2001 - phab-ssh service converted to base::service_unit and with working systemd unit file. 'systemctl ssh-phab status' is active (running) (T158434)

Change 341747 had a related patch set uploaded (by Dzahn):
[operations/puppet] phabricator: monitor PHD service only on active server

https://gerrit.wikimedia.org/r/341747

Change 341747 merged by Dzahn:
[operations/puppet] phabricator: monitor PHD service only on active server

https://gerrit.wikimedia.org/r/341747

Change 340158 merged by Dzahn:
[operations/puppet] Phabricator: Migrate to base::service_unit for phd

https://gerrit.wikimedia.org/r/340158

Mentioned in SAL (#wikimedia-operations) [2017-03-09T00:36:37Z] <mutante> iridium - temp. disable puppet | phab1001 - converting service to base::service_unit (T137928)

mmodell updated the task description. Jun 6 2017, 9:09 AM
mmodell changed the status of subtask T143175: Configure phabricator clustering for daemons and repositories from Open to Stalled.
mmodell updated the task description. Jul 17 2017, 4:28 PM
mmodell updated the task description.
mmodell changed the task status from Open to Stalled. Jul 31 2017, 6:46 PM

Currently blocked on Traffic. Attempting to make some headway on T163938: setup/install phab1001.eqiad.wmnet in the meantime.

Change 368957 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phab1001: add interface::add_ip6_mapped

https://gerrit.wikimedia.org/r/368957

Change 370179 had a related patch set uploaded (by Paladox; owner: Paladox):
[operations/puppet@production] phabricator: Auto rsync phab1001 to phab2001 (codfw)

https://gerrit.wikimedia.org/r/370179

Change 370179 merged by Dzahn:
[operations/puppet@production] phabricator: Auto rsync phab1001 to phab2001 (codfw)

https://gerrit.wikimedia.org/r/370179

Dzahn added a comment. Aug 5 2017, 12:16 AM

< mutante> !log phab2001 - changing UID/GID for phd user from 997:997 to 498:498 to make it match phab1001, to fix rsync breaking permissions. (rsync forces --numeric-ids when fetching from and rsyncd configured with chroot=yes). chown -R phw:www-data /srv/repos/
< mutante> !log "reserved" UID 498 for phd on https://wikitech.wikimedia.org/wiki/UID | phab2001: find -exec chown to fix all the files , restart cron
< mutante> Notice: /Stage[main]/Profile::Phabricator::Main/Rsync::Quickdatacopy[srv-repos]/Cron[rsync-srv-repos]/ensure: created
< mutante> the change is fine.. there is something that stops sync though
< mutante> !log phab2001 - installing various package upgrades, apt-get autoremove old kernel images
< mutante> !log phab2001 rebooting
< mutante> !log phab2001 - removed outdated /etc/hosts entries, that fixed rsync, syncing /srv/repos/ from phab1001
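
The permission problem in the log above comes from rsync preserving numeric IDs: files owned by UID 498 (phd on phab1001) arrive on phab2001 still owned by 498, while phd there was 997. A minimal sketch of how one might spot such mismatches before syncing follows. The passwd excerpts are illustrative (only the phd UIDs 498 and 997 come from the log), and `uid_mismatches` is a hypothetical helper.

```python
def parse_passwd(text: str) -> dict[str, int]:
    """Map username -> UID from passwd-format text (name:pw:uid:gid:...)."""
    users = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        users[fields[0]] = int(fields[2])
    return users


def uid_mismatches(source: str, dest: str) -> dict[str, tuple[int, int]]:
    """Users present on both hosts whose numeric UIDs differ."""
    src, dst = parse_passwd(source), parse_passwd(dest)
    return {
        name: (src[name], dst[name])
        for name in sorted(src.keys() & dst.keys())
        if src[name] != dst[name]
    }


# Illustrative passwd excerpts; only the phd UIDs are taken from the log.
PHAB1001 = """\
www-data:x:33:33::/var/www:/usr/sbin/nologin
phd:x:498:498::/nonexistent:/bin/false
"""
PHAB2001 = """\
www-data:x:33:33::/var/www:/usr/sbin/nologin
phd:x:997:997::/nonexistent:/bin/false
"""

if __name__ == "__main__":
    for name, (src_uid, dst_uid) in uid_mismatches(PHAB1001, PHAB2001).items():
        print(f"{name}: UID {src_uid} on source, {dst_uid} on destination")
```

Aligning the UID on the one passive host, as done here, avoids having to touch the rsyncd chroot setting or the active server.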


< mutante> twentyafterfour: i'll change the UID of the "phd" user on phab2001, so that it matches the UID on phab1001. then we won't have permission issues with rsync anymore
<+twentyafterfour> mutante: good idea
< mutante> since there are no running processes owned by phd and that particular UID is also not taken
< mutante> i will change it, then run find with -exec
..
< mutante> if you wonder "but normally rsync can handle this and do it by username"
< mutante> then the answer is "yea, normally, but
< mutante> "rsync forces --numeric-ids when fetching content from an rsyncd configured with use chroot = yes"
< mutante> also https://phabricator.wikimedia.org/T79786#1831969
< mutante> which is the same problem ,just mw user
< mutante> in our case we just have to fix 1 box though and not all appservers ...
< mutante> twentyafterfour: do you recall editing /etc/hosts on phab2001 to test something?
< mutante> the rsync stuff wasnt working and it was behaving really weird.. like if you use "ping phab1001 or telnet1001" you got the old IP, but host/dig showed the correct new IP.. then i found we have them in /etc/hosts. remove or update .. hmm
< mutante> removes them
< mutante> it's copying all the repo data over to codfw now since a while
< mutante> initial run is the longest. then it will just be the diff every 10 min
< mutante> feels better to have another backup
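
The symptom described in the log above, where ping returned the old IP while host/dig showed the new one, is typical of a stale /etc/hosts entry: on a default glibc setup ping resolves names through NSS, which consults /etc/hosts before DNS, whereas host and dig query DNS directly. A minimal sketch of a check for this follows. The addresses are RFC 5737 documentation addresses rather than the real ones, the DNS answers are assumed to have been collected separately (for example with dig), and `stale_entries` is a hypothetical helper.

```python
import ipaddress


def parse_hosts(text: str) -> dict[str, str]:
    """Map each hostname/alias in hosts-file text to its first listed address."""
    mapping: dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            mapping.setdefault(name, address)
    return mapping


def stale_entries(hosts_text: str, dns: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {name: (hosts_addr, dns_addr)} where the two disagree."""
    stale = {}
    for name, hosts_addr in parse_hosts(hosts_text).items():
        dns_addr = dns.get(name)
        if dns_addr is not None and (
            ipaddress.ip_address(hosts_addr) != ipaddress.ip_address(dns_addr)
        ):
            stale[name] = (hosts_addr, dns_addr)
    return stale


# Illustrative data only (RFC 5737 documentation addresses).
SAMPLE_HOSTS = """\
127.0.0.1   localhost
192.0.2.10  phab1001.eqiad.wmnet phab1001  # left over from a test
"""
SAMPLE_DNS = {"phab1001.eqiad.wmnet": "198.51.100.20", "phab1001": "198.51.100.20"}

if __name__ == "__main__":
    for name, (old, new) in stale_entries(SAMPLE_HOSTS, SAMPLE_DNS).items():
        print(f"{name}: /etc/hosts says {old}, DNS says {new}")
```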

:).

Is codfw meant to be slow? It feels slower than phab1001 did when it had the phabricator-new domain.

Dzahn added a comment. (Edited) Aug 5 2017, 1:19 AM

Is codfw meant to be slow?

No :) But it was pretty busy with the huge initial rsync. And without "phd" running, what did you use that felt slow?

Viewing tasks and viewing repos. This was also slow before the rsync started.

Dzahn added a comment. Aug 5 2017, 2:04 AM

Both servers are the same model, same number of CPUs, same RAM, same puppet role.. just that phab2001 is a lot LESS busy...

Can I ask which database phab2001 is connecting to right now?

TerraCodes updated the task description. Aug 7 2017, 8:24 AM

Can I ask which database phab2001 is connecting to right now?

I’m presuming that it’s connected to the same one as phab1001, as I see no configs changing the database.

Dzahn added a comment. (Edited) Aug 7 2017, 1:23 PM

The answer should be none, as the phd service isn't running there and nothing changed about phab2001 recently.

@Dzahn, thanks. I saw Paladox's comment:

Is codfw meant to be slow? It feels slower than phab1001 did when it had the phabricator-new domain.

And I was worried for a moment that we were doing cross-dc queries while publicly facing. Thanks for the clarification.

Dzahn added a comment. Aug 7 2017, 2:28 PM

Yea, so the service isn't up, but the config is indeed like this:

hieradata/role/eqiad/phabricator_server.yaml:phabricator::mysql::master: "m3-master.eqiad.wmnet"
hieradata/role/codfw/phabricator_server.yaml:phabricator::mysql::master: "m3-master.eqiad.wmnet"

So we need to talk about the right one for codfw.. at some point. But nothing changed here since about February, the recent activity was all about migrating from iridium to phab1001 (within eqiad).

Paladox was looking at repos in the browser, but there should be no db connections as long as phd stays stopped afaict.
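
The two hiera lines above show the codfw role pointing at the eqiad master. Below is a small sketch of a check for that kind of cross-datacenter setting. It assumes hostnames follow the `<name>.<dc>.wmnet` pattern seen above and that the input is in the same "path:key: value" form as the quoted lines. `cross_dc_settings` is a hypothetical helper, not something from the puppet repo.

```python
import re

# The two lines quoted above, in "path:key: value" grep-style form.
GREP_OUTPUT = '''\
hieradata/role/eqiad/phabricator_server.yaml:phabricator::mysql::master: "m3-master.eqiad.wmnet"
hieradata/role/codfw/phabricator_server.yaml:phabricator::mysql::master: "m3-master.eqiad.wmnet"
'''

LINE_RE = re.compile(
    r'^hieradata/role/(?P<role_dc>\w+)/[^:]+:(?P<key>\S+): "(?P<host>[^"]+)"$'
)


def datacenter_of(hostname: str) -> str:
    """Return the datacenter label of a <name>.<dc>.wmnet hostname."""
    parts = hostname.split(".")
    if len(parts) != 3 or parts[2] != "wmnet":
        raise ValueError(f"unexpected hostname form: {hostname}")
    return parts[1]


def cross_dc_settings(grep_output: str) -> list[tuple[str, str, str]]:
    """List (role_dc, key, host) where the host lives in another datacenter."""
    found = []
    for line in grep_output.splitlines():
        match = LINE_RE.match(line)
        if match and datacenter_of(match["host"]) != match["role_dc"]:
            found.append((match["role_dc"], match["key"], match["host"]))
    return found


if __name__ == "__main__":
    for role_dc, key, host in cross_dc_settings(GREP_OUTPUT):
        print(f"{role_dc} role: {key} points at {host}")
```

Run against the two quoted lines, only the codfw entry is reported, which matches the open question about which master codfw should use.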

Dzahn added a comment. Aug 7 2017, 2:36 PM

Just to make sure, I checked with tcpdump port 3306 and there was nothing, even when Apache is up. (on phab1001 there is constant activity).

I think we added m3-master.codfw.wmnet not long ago (maybe it was another misc server). If we fail over the app, we should probably fail over the db too, while keeping the db in read-only mode.

Dzahn added a comment. Aug 7 2017, 2:42 PM

Thanks, @jcrespo! Unfortunately I was wrong: there would have been cross-dc queries when you accessed the URL https://phabricator-new.wikimedia.org/diffusion/, even without the phd service. Just the / URL wasn't working. I stopped Apache and disabled puppet to prevent this from happening again. I will follow up with a patch to make sure Apache is always stopped on the passive server.

Dzahn added a comment. Aug 7 2017, 5:01 PM

I think we added m3-master.codfw.wmnet not long ago (maybe it was another misc server)

It looks like we have m2 and m5 in codfw, but not m3 yet.

Change 370498 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: open firewall holes only on active_server

https://gerrit.wikimedia.org/r/370498

Change 370498 merged by Dzahn:
[operations/puppet@production] phabricator: open firewall holes only on active_server

https://gerrit.wikimedia.org/r/370498

mmodell changed the status of subtask T164810: Switch phabricator production to codfw from Open to Stalled. Aug 28 2017, 7:12 AM
mmodell lowered the priority of this task from High to Normal.