
networking: allow ssh between iridium and phab2001
Closed, Resolved (Public)

Description

splitting out from the parent task to deploy phab2001

We need to allow ssh between the phabricator servers for cluster support.

That is between iridium.eqiad.wmnet (CNAME phab1001.eqiad.wmnet) and phab2001.codfw.wmnet

We already added iptables rules via ferm (in parent task), but apparently we need network ACL changes in addition to that.

for example:

@phab2001:~# ssh 10.64.32.150
ssh: connect to host 10.64.32.150 port 22: No route to host

Event Timeline

There are no ACLs between private-* subnets, irrespective of datacenters (there are very few exceptions).

This isn't network-related. I logged in to debug this but some things look very broken:

faidon@phab2001:~$ traceroute 10.64.32.150
traceroute to 10.64.32.150 (10.64.32.150), 30 hops max, 60 byte packets
 1  iridium-vcs.eqiad.wmnet (10.64.32.186)  2998.463 ms !H  2998.431 ms !H  2998.422 ms !H

iridium-vcs.eqiad.wmnet? What is that? It pings:

faidon@phab2001:~$ ping 10.64.32.186
PING 10.64.32.186 (10.64.32.186) 56(84) bytes of data.
64 bytes from 10.64.32.186: icmp_seq=1 ttl=64 time=0.033 ms

It looks like an alias to iridium?

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether f0:1f:af:e8:c5:27 brd ff:ff:ff:ff:ff:ff
    inet 10.64.32.150/22 brd 10.64.35.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet 10.64.32.186/21 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 2620:0:861:103:10:64:32:186/128 scope global deprecated 
       valid_lft forever preferred_lft 0sec
    inet6 2620:0:861:103:10:64:32:150/64 scope global

Which is it, /21 or /22? One of them is an alias presumably, in which case it needs to be /32, i.e. no subnet mask.
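
To make the prefix-length question concrete, here is a minimal sketch using Python's standard `ipaddress` module. The addresses and prefix lengths are the ones from the eth0 output above; the variable names are only illustrative. It shows what network each prefix length claims to be directly connected, and what the alias would look like as a /32.

```python
import ipaddress

# Addresses and prefix lengths as reported on eth0 in the output above.
primary = ipaddress.ip_interface("10.64.32.150/22")
alias_as_configured = ipaddress.ip_interface("10.64.32.186/21")
# What the alias would look like as a single-host address (assumed fix).
alias_as_host = ipaddress.ip_interface("10.64.32.186/32")

for iface in (primary, alias_as_configured, alias_as_host):
    net = iface.network
    print(f"{iface.with_prefixlen:>18} -> network {net}, "
          f"{net.num_addresses} addresses, broadcast {net.broadcast_address}")

# The /21 claims a connected network that fully contains the real /22,
# while a /32 covers nothing but the alias address itself.
assert primary.network.subnet_of(alias_as_configured.network)
assert primary.ip not in alias_as_host.network
```

With a /32 the alias adds no second, conflicting idea of which network is on-link.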

What is this and why does it exist? The puppet manifests, even after following all(?) of their indirections, are very light on details. The name is not descriptive at all.

Finally, on another note, please don't do a CNAME from a roleNNNN to a star-name address. It's just way more confusing. Either rename it, or leave it as-is.

iridium-vcs.eqiad.wmnet? What is that? It pings:
It looks like an alias to iridium?
Which is it, /21 or /22? One of them is an alias presumably, in which case it needs to be /32, i.e. no subnet mask.
What is this and why does it exist? The puppet manifests, even after following all(?) of their indirections, are very light on details. The name is not descriptive at all.

I don't know about this part either, wasn't involved in that afair. I found this (https://gerrit.wikimedia.org/r/#/c/243806/) so I think @chasemp might know more.

Finally, on another note, please don't do a CNAME from a roleNNNN to a star-name address. It's just way more confusing. Either rename it, or leave it as-is.

ok. yes, iridium is supposed to be reinstalled (trusty to jessie too) and renamed. adding the CNAME was to be a temp-only thing, so we could start using the name in ferm rules and the mysql setup before that has happened. the puppet roles aren't ready for cluster and jessie yet. it kind of unblocked getting that going and was originally suggested by Jaime on a review for a ferm change.

regarding that reinstall, btw @20after4 my thoughts were like: once phab2001 is installed and working, are we going to make that the active server and serve phabricator out of codfw, reinstall iridium to actual phab1001 and then switch back? Or is it more like we'll need downtime to reinstall iridium? We _could_ also decouple the hostname change from the trusty->jessie upgrade, but let's get it working on jessie, right?

iridium-vcs.eqiad.wmnet? What is that? It pings:

as Greg pointed out that is T100519

regarding that reinstall, btw @20after4 my thoughts were like: once phab2001 is installed and working, are we going to make that the active server and serve phabricator out of codfw, reinstall iridium to actual phab1001 and then switch back? Or is it more like we'll need downtime to reinstall iridium? We _could_ also decouple the hostname change from the trusty->jessie upgrade, but let's get it working on jessie, right?

Yeah switching phab2001 to be the primary server sounds good to me. Assuming we get everything working in codfw without issues then I think we should go that route.

WARNING: The following contains Flagrant Optimism™ which may not accurately reflect reality.

I'd also be ok with a little downtime for the reinstall. If we can avoid wiping the data on /srv/ then reimaging the system and bringing it back up with puppet shouldn't take more than an hour or two.

The custom ssh setup happened in T100519 (Phabricator needs to expose ssh), and we need to duplicate that stuff in codfw.

ssh from iridium to phab2001 still isn't working.. :(

ok from phab2001 I can ssh to 10.64.32.186 but not to 10.64.32.150 and host iridium.eqiad.wmnet resolves to:

iridium.eqiad.wmnet has address 10.64.32.150
iridium.eqiad.wmnet has IPv6 address 2620:0:861:103:10:64:32:150

Mentioned in SAL (#wikimedia-operations) [2016-10-21T21:40:02Z] <mutante> phab2001 that IP was also on iridium/phab1001, it should not be hardcoded in puppet, causing issues in T143363

10.64.32.186 is hardcoded in puppet in several places

hieradata/role/eqiad/phabricator/main.yaml:  - "10.64.32.186"
hieradata/role/eqiad/phabricator/main.yaml:  - "[2620:0:861:103:10:64:32:186]"
modules/role/manifests/phabricator/main.pp:        address   => '10.64.32.186',
modules/role/manifests/phabricator/main.pp:        address   => '2620:0:861:103:10:64:32:186',
modules/role/manifests/phabricator/main.pp:        rule => 'saddr (0.0.0.0/0 ::/0) daddr (10.64.32.186/32 208.80.154.250/32 2620:0:861:103:10:64:32:186/128 2620:0:861:ed1a::3:16/128) proto tcp dport (22) ACCEPT;',

This added the same IP on both servers, iridium/phab1001 AND phab2001, causing these problems. (It was noticed by @Volans when he saw today how the salt-minion check on phab2001 started flapping because it could not reach the saltmaster anymore.)

root@phab2001:~# ip a s | grep 10.64

inet 10.64.32.186/21 scope global eth0
inet6 2620:0:861:103:10:64:32:186/128 scope global deprecated

root@iridium:~# ip a s | grep 10.64

inet 10.64.32.150/22 brd 10.64.35.255 scope global eth0
inet 10.64.32.186/21 scope global eth0
inet6 2620:0:861:103:10:64:32:186/128 scope global deprecated 
inet6 2620:0:861:103:10:64:32:150/64 scope global
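
The overlap between the two listings can also be found mechanically. The following is a minimal sketch, not a tool that was used in this task: it parses the `inet`/`inet6` lines pasted above (the helper name `addresses` is made up for illustration) and intersects the two sets.

```python
import ipaddress

def addresses(ip_output: str) -> set:
    """Collect the addresses from 'inet'/'inet6' lines of `ip a s` output."""
    found = set()
    for line in ip_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] in ("inet", "inet6"):
            # Drop the prefix length; only the address itself matters here.
            found.add(ipaddress.ip_interface(fields[1]).ip)
    return found

# Output copied from the two hosts above.
phab2001 = """
inet 10.64.32.186/21 scope global eth0
inet6 2620:0:861:103:10:64:32:186/128 scope global deprecated
"""
iridium = """
inet 10.64.32.150/22 brd 10.64.35.255 scope global eth0
inet 10.64.32.186/21 scope global eth0
inet6 2620:0:861:103:10:64:32:186/128 scope global deprecated
inet6 2620:0:861:103:10:64:32:150/64 scope global
"""

shared = addresses(phab2001) & addresses(iridium)
# IPv4 and IPv6 objects cannot be compared directly, so sort on a tuple.
for addr in sorted(shared, key=lambda a: (a.version, int(a))):
    print("configured on both hosts:", addr)
```

Any address that shows up in the intersection is configured on both machines at once, which is the conflict described here.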

After seeing this we removed the IP from the interface on phab2001

14:37 < mutante> !log phab2001 - ip addr del 10.64.32.186/21 dev eth0

and started the salt-minion and right after:

14:38 < icinga-wm> RECOVERY - salt-minion processes on phab2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion

Also, now you can ssh to both 10.64.32.186 and 10.64.32.150 from phab2001.

Currently 10.64.32.186 is iridium-vcs.eqiad.wmnet and phab2001 does not have a second IP.

So it looks to me like what we need is:

  • rename iridium-vcs.eqiad.wmnet to phab1001-vcs.eqiad.wmnet
  • create the equivalent phab2001-vcs.codfw.wmnet with a new IP in 10.192.32.0/22
  • change puppet/hiera to make sure phab1001 gets phab1001-vcs and phab2001 gets phab2001-vcs IPs assigned on their interfaces, don't hardcode just one secondary IP
  • adjust ferm rules / ACLs accordingly

Change 317290 had a related patch set uploaded (by Dzahn):
rename iridium-vcs to phab1001-vcs

https://gerrit.wikimedia.org/r/317290

Change 317291 had a related patch set uploaded (by Dzahn):
add phab2001-vcs.codfw.wmnet

https://gerrit.wikimedia.org/r/317291

Change 317295 had a related patch set uploaded (by Dzahn):
phabricator: add vcs::listen_addresses for codfw

https://gerrit.wikimedia.org/r/317295

Change 317296 had a related patch set uploaded (by Dzahn):
add git-ssh.codfw.wikimedia.org service IP

https://gerrit.wikimedia.org/r/317296

Change 317291 merged by Dzahn:
add phab2001-vcs.codfw.wmnet

https://gerrit.wikimedia.org/r/317291

Change 317296 merged by Dzahn:
add git-ssh.codfw.wikimedia.org service IP

https://gerrit.wikimedia.org/r/317296

[radon:~] $ host git-ssh.wikimedia.org
git-ssh.wikimedia.org has address 208.80.154.250
git-ssh.wikimedia.org has IPv6 address 2620:0:861:ed1a::3:16

[radon:~] $ host git-ssh.eqiad.wikimedia.org
git-ssh.eqiad.wikimedia.org has address 208.80.154.250
git-ssh.eqiad.wikimedia.org has IPv6 address 2620:0:861:ed1a::3:16

[radon:~] $ host git-ssh.codfw.wikimedia.org
git-ssh.codfw.wikimedia.org has address 208.80.153.250
git-ssh.codfw.wikimedia.org has IPv6 address 2620:0:860:ed1a::3:fa

[radon:~] $ host phab2001-vcs.codfw.wmnet
phab2001-vcs.codfw.wmnet has address 10.192.32.149
phab2001-vcs.codfw.wmnet has IPv6 address 2620:0:860:103:10:192:32:149
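
As a sanity check on the new record, here is a small sketch that verifies the IPv4 address falls inside the 10.192.32.0/22 range named in the plan earlier in this task. It also reproduces the pattern visible in the addresses above, where the last four groups of the IPv6 address spell out the IPv4 octets. The helper `embed_v4` and the reading of `2620:0:860:103` as the /64 prefix are assumptions drawn from the output, not a documented convention.

```python
import ipaddress

def embed_v4(prefix64: str, v4: str) -> ipaddress.IPv6Address:
    """Build an IPv6 address whose last four groups spell the IPv4 octets.

    The decimal octets are written verbatim as hex groups, e.g.
    10.192.32.149 becomes ...:10:192:32:149 (hypothetical helper).
    """
    octets = v4.split(".")
    return ipaddress.ip_address(f"{prefix64}:{':'.join(octets)}")

vcs_v4 = ipaddress.ip_address("10.192.32.149")
codfw_range = ipaddress.ip_network("10.192.32.0/22")

print(vcs_v4 in codfw_range)
print(embed_v4("2620:0:860:103", "10.192.32.149"))
```

The same helper applied to `2620:0:861:103` and 10.64.32.186 gives the eqiad address seen earlier in this task.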

@Dzahn, the same but more readable:

hostname                      IPv?  Address
git-ssh.wikimedia.org         IPv4  208.80.154.250
git-ssh.wikimedia.org         IPv6  2620:0:861:ed1a::3:16
git-ssh.eqiad.wikimedia.org   IPv4  208.80.154.250
git-ssh.eqiad.wikimedia.org   IPv6  2620:0:861:ed1a::3:16
git-ssh.codfw.wikimedia.org   IPv4  208.80.153.250
git-ssh.codfw.wikimedia.org   IPv6  2620:0:860:ed1a::3:fa
phab2001-vcs.codfw.wmnet      IPv4  10.192.32.149
phab2001-vcs.codfw.wmnet      IPv6  2620:0:860:103:10:192:32:149

So 208.80.154 is in eqiad and 153 is codfw?

So 208.80.154 is in eqiad and 153 is codfw?

Yes, so each DC has several rows and each row has a network. And yea, 154 is eqiad and 153 is codfw. Like this:

; 208.80.154.0/26 (public1-a-eqiad) (.0 - .63)
; 208.80.154.64/26 (public1-c-eqiad) (.64 - .127)
; 208.80.154.128/26 (public1-b-eqiad) (.128 - .191)

; 208.80.153.0/27 (public1-a-codfw)
; 208.80.153.32/27 (public1-b-codfw)
; 208.80.153.64/27 (public1-c-codfw)
; 208.80.153.96/27 (public1-d-codfw)
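
The row-to-network mapping can be turned into a lookup. This is a minimal sketch over only the networks quoted above, so it is not the full allocation; the function name `find_subnet` and the sample addresses in the last lines are made up for illustration.

```python
import ipaddress

# Networks copied from the zone-file comments above. This is only the
# excerpt quoted in this task, not the complete list of public networks.
SUBNETS = {
    "public1-a-eqiad": "208.80.154.0/26",
    "public1-c-eqiad": "208.80.154.64/26",
    "public1-b-eqiad": "208.80.154.128/26",
    "public1-a-codfw": "208.80.153.0/27",
    "public1-b-codfw": "208.80.153.32/27",
    "public1-c-codfw": "208.80.153.64/27",
    "public1-d-codfw": "208.80.153.96/27",
}

def find_subnet(address: str):
    """Return the name of the listed network containing address, or None."""
    ip = ipaddress.ip_address(address)
    for name, cidr in SUBNETS.items():
        if ip in ipaddress.ip_network(cidr):
            return name
    return None

# Sample addresses chosen for illustration only.
print(find_subnet("208.80.154.70"))
print(find_subnet("208.80.153.40"))
```

Note that the git-ssh service addresses ending in .250 fall outside the ranges quoted here, so this lookup returns `None` for them; they belong to networks not shown in the excerpt.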

Change 317295 merged by Dzahn:
phabricator: add vcs::listen_addresses for codfw

https://gerrit.wikimedia.org/r/317295

Change 318662 had a related patch set uploaded (by 20after4):
Move config for git-ssh(phabricator) to hiera

https://gerrit.wikimedia.org/r/318662

Change 317290 merged by Dzahn:
rename iridium-vcs to phab1001-vcs

https://gerrit.wikimedia.org/r/317290

iridium-vcs renamed:

[radon:~] $ host phab1001-vcs.eqiad.wmnet
phab1001-vcs.eqiad.wmnet has address 10.64.32.186
phab1001-vcs.eqiad.wmnet has IPv6 address 2620:0:861:103:10:64:32:186

[radon:~] $ host iridum-vcs.eqiad.wmnet
Host iridum-vcs.eqiad.wmnet not found: 3(NXDOMAIN)

Change 322034 had a related patch set uploaded (by Dzahn):
conftool/phabricator: replace iridium-vcs with phab1001-vcs

https://gerrit.wikimedia.org/r/322034

Change 322034 merged by Dzahn:
conftool/phabricator: replace iridium-vcs with phab1001-vcs

https://gerrit.wikimedia.org/r/322034

Change 318662 merged by Dzahn:
Move config for git-ssh(phabricator) to hiera

https://gerrit.wikimedia.org/r/318662

This should be resolved now, I think?

well, going by the original task description example:

was:

@phab2001:~# ssh 10.64.32.150
ssh: connect to host 10.64.32.150 port 22: No route to host

is now:

[phab2001:~] $ ssh 10.64.32.150
Password:

so that looks like yes to me
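
The before/after comparison above comes down to whether a TCP connection to port 22 can be opened at all. A minimal sketch of that check using Python's standard `socket` module follows; the function name `port_open` and the 3-second timeout are arbitrary choices for illustration, and it only tests reachability, not authentication.

```python
import socket

def port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "No route to host", connection refused, and timeouts.
        return False
```

Run from phab2001, `port_open("10.64.32.150")` would distinguish the original "No route to host" failure from the state where sshd answers and prompts for a password.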

[phab2001:~] $ ssh 10.64.32.150
Password: 


[iridium:~] $ ssh phab2001.codfw.wmnet
The authenticity of host 'phab2001.codfw.wmnet (10.192.32.147)' can't be established.
ECDSA key fingerprint is 0e:bb:b9:81:e1:e7:0d:ed:86:ac:d7:08:a7:5e:f2:80.
Are you sure you want to continue connecting (yes/no)?