
git-ssh.wikimedia.org and IPv6 are broken after switch to phab1001
Closed, Resolved · Public

Description

After switching to phab1001 from iridium, two problems remain which we (@Dzahn and I) have been unable to figure out.

  1. IPv6 connections to/from phab1001 don't work. This left Phabricator unable to send mail for a while because it was connecting to mx1001 via IPv6. This is temporarily worked around by hard-coding the IPv4 addresses for mx1001 and mx2001 in Phabricator's config.
  2. git-ssh.wikimedia.org refuses connections. The service is up on phab1001 and it's listening on the right IPs/port 22. pybal complained initially but we resolved that by depooling/repooling the node.

The git-ssh.wikimedia.org problem might be unrelated to the IPv6 problem, but I suspect they share the same root cause.

I think we need help from netops on this one.

Event Timeline

Restricted Application added a subscriber: Aklapper.
mmodell triaged this task as High priority. Aug 4 2017, 3:04 AM
mmodell added a project: Phabricator.
mmodell added a subscriber: faidon.

Note that this is high priority but not UBN, simply because git-ssh is barely used currently. Phabricator supports git over https, which is the method the vast majority of users are using. Additionally, since Phabricator is not the authoritative host for many repos, there just aren't many people trying to push to Phabricator over ssh.

Diffusion is the master repo for some subset of Toolforge projects. It's not a huge number of people impacted, but it is certainly non-zero. I'll send out a short announcement on labs-l about this ticket.

Not sure what the following IP is:

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
    inet6 2620:0:861:103:10:64:32:186/128 scope global 
       valid_lft forever preferred_lft forever

But it was preferred over the regular host IP as the source address for outgoing IPv6 connections, and since it was not reachable, return traffic had no way back to the host.
Removing it solved the issue:

ayounsi@phab1001:~$ sudo ip -6 addr del 2620:0:861:103:10:64:32:186/128 dev eth0

Note that this IP was unreachable anyway, and because of the urgency of the task, it seemed safe to remove it.

ayounsi@phab1001:~$ telnet mx1001.wikimedia.org 25
Trying 2620:0:861:3:208:80:154:76...
Connected to mx1001.wikimedia.org.
Escape character is '^]'.
220 mx1001.wikimedia.org ESMTP Exim 4.84_2 Fri, 04 Aug 2017 03:37:21 +0000
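
One observation about the address removed above, offered as a hint rather than a confirmed explanation: read as decimal, its last four groups spell 10.64.32.186, the same digits as the IPv4 backend address that comes up later in this task. The sketch below checks that pattern with Python's standard `ipaddress` module. The helper name `embedded_ipv4` is made up for illustration, and whether this host's addressing scheme actually intends the mapping is an assumption.

```python
import ipaddress


def embedded_ipv4(addr6):
    """Return the IPv4 address spelled by the last four groups, or None.

    The hex digits of each group are deliberately read as decimal,
    e.g. the group "0186" is taken to mean the octet 186.
    """
    groups = ipaddress.IPv6Address(addr6).exploded.split(":")[4:]
    try:
        octets = [int(g) for g in groups]
    except ValueError:
        # A group contains a-f, so it cannot be a decimal octet.
        return None
    if any(o > 255 for o in octets):
        return None
    return ipaddress.IPv4Address(".".join(str(o) for o in octets))


# The address that was removed from eth0 on phab1001.
print(embedded_ipv4("2620:0:861:103:10:64:32:186"))
# The mx1001 address from the telnet output follows the same pattern.
print(embedded_ipv4("2620:0:861:3:208:80:154:76"))
```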

OK, so that takes care of the SMTP/IPv6 issue; however, git-ssh still doesn't work. So I guess I was wrong about them being related.

Change 370144 had a related patch set uploaded (by 20after4; owner: 20after4):
[operations/puppet@production] Revert "PHAB: hard-code IP address for smtp"

https://gerrit.wikimedia.org/r/370144

The IP that was removed also existed on iridium before, on eth0, where I removed it:

00:23 mutante: iridium sudo ip addr del 2620:0:861:103:10:64:32:186/128 dev eth0

and then added it back on phab1001 the same way, same interface:

00:24 mutante: phab1001 sudo ip addr add 2620:0:861:103:10:64:32:186/128 dev eth0

The git-ssh issue is due to LVS not knowing where to forward the packets.

ayounsi@lvs1002:~$ sudo ipvsadm -Ln
TCP  208.80.154.250:22 wrr

(no backend servers)
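
A check like the one above could be scripted. The following is a minimal sketch, assuming the usual `ipvsadm -Ln` layout in which each virtual service line starts with the protocol and each backend line starts with `->`. The function name, the sample header lines, and the second service (which uses documentation-only addresses) are illustrative and not taken from lvs1002.

```python
def services_without_backends(ipvsadm_output):
    """Return the virtual services that list no real servers."""
    backend_counts = {}
    current = None
    for line in ipvsadm_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] in ("TCP", "UDP", "FWM") and len(fields) >= 2:
            # Start of a new virtual service, e.g. "TCP  208.80.154.250:22 wrr".
            current = f"{fields[0]} {fields[1]}"
            backend_counts[current] = 0
        elif fields[0] == "->" and current is not None:
            # A real server that belongs to the current virtual service.
            backend_counts[current] += 1
    return [svc for svc, count in backend_counts.items() if count == 0]


# Assumed sample: only the 208.80.154.250:22 line comes from this task.
SAMPLE = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  208.80.154.250:22 wrr
TCP  192.0.2.10:80 wrr
  -> 198.51.100.5:80              Route   10     0          0
"""

print(services_without_backends(SAMPLE))
```

The column header line also begins with `->`, but it appears before any service line, so it is not counted as a backend.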

After investigating:
The backend "VIP" server is set to 10.64.32.186 in the vlan private1-c-eqiad as it was hosted on a server in row C.

That VIP has been moved to phab1001 in row B, in the vlan private1-b-eqiad.

So the LVS server is sending packets to the wrong vlan:

10.64.16.0/22 dev eth1.1018  proto kernel  scope link  src 10.64.17.2  <-- private1-b-eqiad
10.64.32.0/22 dev eth2.1019  proto kernel  scope link  src 10.64.33.2  <-- private1-c-eqiad
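
To make the mismatch concrete, here is a small sketch using Python's `ipaddress` module and the two connected routes shown above. It is not how LVS itself decides: it only mimics a connected-route lookup. The row B address `10.64.16.100` is a made-up example, not the address that was eventually assigned.

```python
import ipaddress

# The two connected routes from the LVS server, keyed by interface.
ROUTES = {
    "eth1.1018": ipaddress.ip_network("10.64.16.0/22"),  # private1-b-eqiad
    "eth2.1019": ipaddress.ip_network("10.64.32.0/22"),  # private1-c-eqiad
}


def egress_interface(addr):
    """Return the interface whose connected network contains addr, or None."""
    ip = ipaddress.ip_address(addr)
    for dev, net in ROUTES.items():
        if ip in net:
            return dev
    return None


# The old backend address still resolves to the row C VLAN interface,
# even though phab1001 now sits in row B.
print(egress_interface("10.64.32.186"))
# A hypothetical row B address would go out the private1-b-eqiad interface.
print(egress_interface("10.64.16.100"))
```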

Change 370145 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] phab: fix IP for phab1001-vcs

https://gerrit.wikimedia.org/r/370145

Change 370145 merged by Dzahn:
[operations/dns@master] phab: fix IP for phab1001-vcs

https://gerrit.wikimedia.org/r/370145

Change 370146 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] phabricator: fix IP for git-ssh.eqiad

https://gerrit.wikimedia.org/r/370146

Change 370146 merged by Dzahn:
[operations/puppet@production] phabricator: fix IP for git-ssh.eqiad

https://gerrit.wikimedia.org/r/370146

Picked a new IP in the 10.64.16.0/22 network (row B) and used that instead.

Change 370144 merged by Dzahn:
[operations/puppet@production] Revert "PHAB: hard-code IP address for smtp"

https://gerrit.wikimedia.org/r/370144

Mentioned in SAL (#wikimedia-operations) [2017-08-04T05:23:21Z] <mutante> phab1001 sudo ip addr del 10.64.32.186/32 dev eth0 (T172478)

Xionox restarted pybal after this, and then:

23:04 <+icinga-wm> RECOVERY - PyBal backends health check on lvs1005 is OK: PYBAL OK - All pools are healthy
23:06 <+icinga-wm> RECOVERY - PyBal backends health check on lvs1002 is OK: PYBAL OK - All pools are healthy

23:08 < twentyafterfour> woo! works, who fixed it? :D

We can now reach the ssh service. Tested externally over both IPv4 and IPv6. There is apparently another issue with Phabricator itself that keeps sshd from allowing clones, but it is no longer an LVS/pybal problem.

Change 370153 had a related patch set uploaded (by Dzahn; owner: 20after4):
[operations/puppet@production] PHAB: move the ssh hook somewhere sshd won't complain about

https://gerrit.wikimedia.org/r/370153

Change 370153 merged by Dzahn:
[operations/puppet@production] PHAB: move the ssh hook somewhere sshd won't complain about

https://gerrit.wikimedia.org/r/370153

00:21 < mutante> twentyafterfour: try again
00:22 < twentyafterfour> mutante: works!!!