
Deploy phabricator to phab2001.codfw.wmnet
Closed, Resolved (Public)

Description

Now that T137838: setup phab2001.codfw.wmnet (WMF6405) is done, it's time to get some redundancy in phabricator.

There are a lot of pieces to this and most of these will eventually split off as subtasks but I'm writing them here for now:

related courtesy link: Phabricator Clustering Introduction

Details

Repo               Branch      Lines +/-
operations/dns     master      +2 -0
operations/puppet  production  +2 -3
operations/puppet  production  +4 -0
operations/dns     master      +0 -2
operations/puppet  production  +2 -8
operations/puppet  production  +10 -4
operations/puppet  production  +1 -1
operations/puppet  production  +12 -28
operations/puppet  production  +8 -6
operations/puppet  production  +4 -20
operations/puppet  production  +2 -2
operations/puppet  production  +5 -0
operations/puppet  production  +4 -1
operations/puppet  production  +27 -0
operations/puppet  production  +4 -2
operations/puppet  production  +3 -3
operations/puppet  production  +5 -3
operations/puppet  production  +3 -0
operations/puppet  production  +0 -0
operations/puppet  production  +9 -1
operations/puppet  production  +12 -0
operations/puppet  production  +29 -1
operations/puppet  production  +1 -1

Related Objects

Status    Assigned
Resolved  Joe
Resolved  LSobanski
Invalid   None
Resolved  mmodell
Resolved  Paladox
Resolved  Dzahn
Resolved  Dzahn
Resolved  Dzahn
Resolved  mmodell
Resolved  mmodell
Resolved  RobH
Resolved  MoritzMuehlenhoff
Resolved  Dzahn
Invalid   None
Declined  Dzahn
Resolved  mmodell
Declined  None
Resolved  mmodell
Resolved  mmodell
Declined  mmodell

Event Timeline


can I ask which database is phab2001 connecting to right now?

> can I ask which database is phab2001 connecting to right now?

I'm presuming that it's connected to the same one as phab1001, as I see no configs changing the database.

The answer should be none, as the phd service isn't running there and nothing changed about phab2001 recently.

@Dzahn, thanks. I saw Paladox's comment:

> Is codfw meant to be slow? It feels slower than phab1001 when it had the phabricator-new domain.

And I was worried for a moment that we were doing cross-dc queries while publicly facing. Thanks for the clarification.

Yea, so the service isn't up, but the config is indeed like this:

hieradata/role/eqiad/phabricator_server.yaml:phabricator::mysql::master: "m3-master.eqiad.wmnet"
hieradata/role/codfw/phabricator_server.yaml:phabricator::mysql::master: "m3-master.eqiad.wmnet"

So we need to talk about the right one for codfw at some point. But nothing has changed here since about February; the recent activity was all about migrating from iridium to phab1001 (within eqiad).
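The two hieradata lines above show the codfw role pointing at an eqiad database master. A minimal sketch of how such a mismatch could be flagged automatically, assuming grep-style input lines and hostnames of the form name.dc.wmnet (the regex, the function name, and the hostname assumption are illustrative and not part of operations/puppet):

```python
import re

# Matches grep-style lines: hieradata/role/<dc>/<file>:<key>: "<host>"
LINE_RE = re.compile(
    r'^hieradata/role/(?P<role_dc>[a-z0-9]+)/[^:]+:(?P<key>[\w:]+):\s*"(?P<host>[^"]+)"$'
)


def find_cross_dc(lines):
    """Return (role_dc, host) pairs where the host's DC differs from the role's DC.

    Assumption: the DC is the second-to-last label of the hostname,
    e.g. "m3-master.eqiad.wmnet" -> "eqiad".
    """
    mismatches = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a line in the expected format
        host_dc = match.group("host").split(".")[-2]
        if host_dc != match.group("role_dc"):
            mismatches.append((match.group("role_dc"), match.group("host")))
    return mismatches


GREP_OUTPUT = [
    'hieradata/role/eqiad/phabricator_server.yaml:phabricator::mysql::master: "m3-master.eqiad.wmnet"',
    'hieradata/role/codfw/phabricator_server.yaml:phabricator::mysql::master: "m3-master.eqiad.wmnet"',
]

for role_dc, host in find_cross_dc(GREP_OUTPUT):
    print(f"{role_dc} role points at {host}")
```

With the two lines quoted above, only the codfw role is reported.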

Paladox was looking at repos in the browser, but there should be no db connections as long as phd stays stopped afaict.

Just to make sure, I checked with tcpdump port 3306 and there was nothing, even when Apache was up. (On phab1001 there is constant activity.)
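As a complement to the packet capture, a snapshot of established connections can be read from /proc/net/tcp on Linux. This is a different technique from tcpdump: it only sees connections open at the moment of the check, so it would miss short-lived queries. A minimal sketch, assuming IPv4 and a little-endian host; the function names and the sample data are hypothetical:

```python
import socket

MYSQL_PORT = 3306
TCP_ESTABLISHED = "01"  # state code used in /proc/net/tcp


def decode_addr(field):
    """Decode a field such as '0100007F:0CEA' into ('127.0.0.1', 3306)."""
    ip_hex, port_hex = field.split(":")
    # The IPv4 address is stored as little-endian hex on x86 hosts.
    ip = socket.inet_ntoa(bytes.fromhex(ip_hex)[::-1])
    return ip, int(port_hex, 16)


def mysql_peers(proc_net_tcp_text, port=MYSQL_PORT):
    """Return remote IPs of established connections to the given remote port."""
    peers = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        remote_ip, remote_port = decode_addr(fields[2])
        if fields[3] == TCP_ESTABLISHED and remote_port == port:
            peers.append(remote_ip)
    return peers


# Made-up sample in the /proc/net/tcp layout (columns truncated for brevity).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue
   0: 0100007F:0050 00000000:0000 0A 00000000:00000000
   1: 0A00000A:C350 1400000A:0CEA 01 00000000:00000000
"""

print(mysql_peers(SAMPLE))
```

On a real host the input would be the contents of /proc/net/tcp; an empty result on the passive server would be consistent with the tcpdump observation.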

I think we added m3-master.codfw.wmnet not a long time ago (maybe it was another misc server). If we fail over the app, we should probably fail over the db too, while keeping the db read-only.

Thanks, @jcrespo! Unfortunately I was wrong: there would have been cross-dc queries when you accessed the URL https://phabricator-new.wikimedia.org/diffusion/, even without the phd service. Just the / URL wasn't working. I stopped Apache and disabled puppet to prevent this from happening again. I will follow up with a patch to make sure Apache is always stopped on the passive server.

> I think we added m3-master.codfw.wmnet not a long time ago (maybe it was another misc server)

It looks like we have m2 and m5 in codfw, but not m3 yet.

Change 370498 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: open firewall holes only on active_server

https://gerrit.wikimedia.org/r/370498

Change 370498 merged by Dzahn:
[operations/puppet@production] phabricator: open firewall holes only on active_server

https://gerrit.wikimedia.org/r/370498

mmodell lowered the priority of this task from High to Medium. Aug 28 2017, 7:12 AM
mmodell changed the status of subtask T164810: Switch phabricator production to codfw from Open to Stalled.

Change 389803 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] remove phabricator-new hostname

https://gerrit.wikimedia.org/r/389803

Change 389804 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] phab: dc failover behind the primary public name

https://gerrit.wikimedia.org/r/389804

Change 389804 merged by BBlack:
[operations/puppet@production] phab: dc failover behind the primary public name

https://gerrit.wikimedia.org/r/389804

Change 389803 merged by BBlack:
[operations/dns@master] remove phabricator-new hostname

https://gerrit.wikimedia.org/r/389803

So BBlack and I went over this stuff today on IRC. I'll paste an excerpt from the transcript of that IRC conversation, for posterity:

@BBlack

right now git-ssh.wikimedia.org is in DNS and statically configured to the same IP address as git-ssh.eqiad.wikimedia.org
and behind that IP, we have all the LVS configuration/infrastructure in eqiad that forwards that traffic down into phab1001-vcs.eqiad.wmnet
right now in DNS, there's also IPs and hostnames for git-ssh.codfw.wikimedia.org, but apparently we haven't configured the LVS part of this to forward traffic down into phab2001-vcs.codfw.wmnet yet (which is fairly straightforward)
what's missing is adding something like the entries in:
https://github.com/wikimedia/operations-dns/blob/master/geo-resources
which is what drives the public/global x-dc failover of our primary HTTPS entrypoints, out at the geodns level.
and then pointing git-ssh.wikimedia.org at that geodns stuff instead of mapping it directly and statically to eqiad

@mmodell

since phabricator in codfw is only 90% configured we definitely don't want to direct any traffic there yet

@BBlack

there's some quibbles to sort out on setting that up, because unlike our other services it only lives at the core DCs and not the edges, but nothing fundamental

@mmodell

that all sounds pretty acceptable

@BBlack

and then... for that kind of geodns routing/failover, the assumption is everything is active/active by default
when you want to disable a site due to some level of outage/failure, you make temporary commits to https://github.com/wikimedia/operations-dns/blob/master/admin_state
which mark a given site+service as DOWN and then resolve all lookups to the remaining one(s)
it's all doable, maybe not today, but it's not that far off
a few hours of digging around how to structure some things

@mmodell

ok so all I need is an ssh tunnel between codfw and eqiad for phabricator to use for proxying connections then it can support active/active git+ssh connections

@BBlack

and in the net there will be 3 different switches to flip to change the DC-routing (codfw-vs-eqiad-vs-active/active), which has its up- and down-sides
2 different hieradata switches in cache_misc for the main phab HTTP stuff + aphlict websockets (which you can move in the same commit)
and a different switch over in the DNS repo for git-ssh
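The geodns behaviour described in the transcript (every site active by default, with a site+service marked DOWN so that lookups resolve to the remaining ones) can be sketched as a toy model. This does not reproduce gdnsd or the real admin_state file syntax; the function name and the dictionary keyed by (site, service) are assumptions made for illustration only:

```python
def resolve_sites(service, sites, admin_state):
    """Return the sites that may answer lookups for a service.

    admin_state maps (site, service) to a forced state; anything not
    explicitly marked "DOWN" is treated as active (active/active default).
    """
    return [
        site
        for site in sites
        if admin_state.get((site, service)) != "DOWN"
    ]


SITES = ["eqiad", "codfw"]

# Default: no overrides, both core DCs answer.
print(resolve_sites("git-ssh", SITES, {}))

# Temporary override: codfw is marked DOWN for git-ssh only.
override = {("codfw", "git-ssh"): "DOWN"}
print(resolve_sites("git-ssh", SITES, override))
```

In this model the override is scoped to one service, so marking git-ssh DOWN in codfw leaves other services there unaffected, which matches the separate per-service switches mentioned in the transcript.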

Re: all of the above about git-ssh: I pushed https://gerrit.wikimedia.org/r/#/c/389871/ and @ayounsi fixed up the router ACLs, so the public entrypoint git-ssh.codfw.wikimedia.org into phab2001-vcs works now. Also pushed the TTL reduction for the real service hostname git-ssh.wikimedia.org in https://gerrit.wikimedia.org/r/#/c/389869/ .

So we're in an "ok" place on all failover-related things.

We can fail over (or configure active/active) the basic HTTP+Aphlict stuff with hieradata commits around https://phabricator.wikimedia.org/source/operations-puppet/browse/production/hieradata/role/common/cache/misc.yaml;361fd6cb1d452047b1bbbfda0d29b6c43561a0dd$99 .

For git-ssh, presently it would be manual DNS failover (no active/active) by changing which of the git-ssh per-DC IPs are pointed to by git-ssh.wikimedia.org around https://phabricator.wikimedia.org/diffusion/ODNS/browse/master/templates/wikimedia.org;e52acadd3f9e506dbccc232e03462c92acc36949$307 .

There's remaining work to do to get active/active/failover capabilities for git-ssh at the GeoDNS level.

Thanks again to @BBlack for helping out here!

I sent a test mail from phab2001 to @Paladox, which was received.

Alroilim renamed this task from Deploy phabricator to phab2001.codfw.wmnet to https://phabricator.wikimedia.org/tag/user-mmodell/. Feb 2 2019, 7:11 PM
Alroilim renamed this task from https://phabricator.wikimedia.org/tag/user-mmodell/ to https://phabricator.wikimedia.org.
Alroilim closed this task as Declined.
Alroilim removed mmodell as the assignee of this task.
Alroilim lowered the priority of this task from Medium to Lowest.
Alroilim set Due Date to Feb 1 2019, 9:00 PM.
Alroilim updated the task description.
Restricted Application changed the subtype of this task from "Task" to "Deadline". Feb 2 2019, 7:11 PM
Paladox renamed this task from https://phabricator.wikimedia.org to Deploy phabricator to phab2001.codfw.wmnet. Feb 2 2019, 7:27 PM
Paladox reopened this task as Open.
Paladox assigned this task to mmodell.
Paladox raised the priority of this task from Lowest to Medium.
Paladox removed Due Date.
Paladox updated the task description.
Paladox added subscribers: BBlack, Paladox, Marostegui and 8 others.
Restricted Application changed the subtype of this task from "Deadline" to "Task". Feb 2 2019, 7:27 PM
Paladox changed the task status from Open to Stalled. Feb 2 2019, 7:28 PM

Change 324851 abandoned by 20after4:
phabricator: cluster.addresses to whitelist iridium and phab2001

Reason:
obsolete

https://gerrit.wikimedia.org/r/324851

Mentioned in SAL (#wikimedia-operations) [2019-07-19T22:36:01Z] <mutante> phab2001 - switching apache to php-fpm and worker instead of mpm-prefork (to match phab1001) (T190568 T137928 T190572)

Dzahn changed the task status from Stalled to Open. Sep 13 2019, 7:22 PM

Change 551284 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] re-add phabricator-new to point to caching layer

https://gerrit.wikimedia.org/r/551284

Change 551285 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: use codfw db servers for codfw server

https://gerrit.wikimedia.org/r/551285

Change 551285 merged by Dzahn:
[operations/puppet@production] phabricator: use codfw db servers for codfw server

https://gerrit.wikimedia.org/r/551285

  • phab2001 has been upgraded to buster.
  • puppet run shows no errors (except when I try to start phd, T232883#5676791)
  • /srv/deployment/phabricator is over 500M on both hosts, but the sizes differ slightly.

@20after4 Could you deploy to make them the same version? I see:

phab2001: deployment -> deployment-cache/revs/e4e2b2271aad4bb9fb65a421952f7846dab59dc4

vs.

phab1003: deployment -> deployment-cache/revs/61f10999d8837a8c9dbeea12f67b2554daf057ab
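The comparison above can be scripted. A minimal sketch, assuming each input line has the "host: deployment -> deployment-cache/revs/<sha>" shape quoted in this comment; the helper names are hypothetical and this is not part of the deployment tooling:

```python
def parse_rev(line):
    """Split 'host: deployment -> deployment-cache/revs/<sha>' into (host, sha)."""
    host, target = line.split(":", 1)
    return host.strip(), target.rsplit("/", 1)[-1].strip()


def revs_match(lines):
    """Return True when every host points at the same deployed revision."""
    revs = {parse_rev(line)[1] for line in lines}
    return len(revs) == 1


SYMLINKS = [
    "phab2001: deployment -> deployment-cache/revs/e4e2b2271aad4bb9fb65a421952f7846dab59dc4",
    "phab1003: deployment -> deployment-cache/revs/61f10999d8837a8c9dbeea12f67b2554daf057ab",
]

for line in SYMLINKS:
    print(parse_rev(line))
print("same version:", revs_match(SYMLINKS))
```

For the two symlinks quoted above the revisions differ, so the check reports a mismatch until both hosts are deployed to the same revision.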

Change 551284 abandoned by Dzahn:
re-add phabricator-new to point to caching layer

Reason:
not needed anymore

https://gerrit.wikimedia.org/r/551284

> @20after4 Could you deploy to make them the same version?

done