
move tools proxy nodes to eqiad1
Closed, ResolvedPublic

Description

The proxy nodes use redis to sync state, so we should be able to do this gracefully. There will be a lot of issues with firewalls and flannel connectivity though.
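
Roughly, the state being synced is a mapping from tool routes to backend URLs kept in Redis; a minimal sketch of that idea follows (the key names and layout here are illustrative assumptions, not the actual dynamicproxy schema):

import redis

# Illustrative only: key names and layout are assumptions, not the real
# dynamicproxy schema.
r = redis.StrictRedis(host='localhost', port=6379)

# Registering a backend for a tool (the sort of thing kube2proxy does per
# webservice).
r.hset('prefix:mytool', '.*', 'http://192.168.1.10:8000')

# Any proxy node that can see the same Redis data can resolve the route.
routes = r.hgetall('prefix:mytool')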

Event Timeline

Andrew triaged this task as Medium priority. Edited Jan 14 2019, 3:46 PM
Andrew created this task.

I have two new nodes, tools-proxy-03 and tools-proxy-04, that are syncing with redis and /should/ be able to serve up webservices.

The 'active' node in eqiad1 is tools-proxy-03; if I substitute its public IP into my /etc/hosts then I can contact SGE-based webservices. K8s services don't work though.

To test: alias tools.wmflabs.org to 185.15.56.5 (aka tools-proxy-03). Then hit a tool with your browser, and check out /var/log/nginx on tools-proxy-03.tools.eqiad.wmflabs
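
If you would rather not edit /etc/hosts, an equivalent spot check (a sketch; the tool path here is just an example) is to aim requests at the new proxy's IP while presenting the normal hostname:

import requests

# Talk to tools-proxy-03 directly by IP while sending the usual Host header.
# The tool path is an arbitrary example; certificate verification is disabled
# because we connect by IP rather than by the name on the certificate.
resp = requests.get(
    'https://185.15.56.5/admin/',
    headers={'Host': 'tools.wmflabs.org'},
    verify=False,
)
print(resp.status_code, len(resp.content))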

The flannel interface isn't showing up on the new proxies right now. On tools-proxy-01 it looks like this:

4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
    link/ether e2:c6:20:3b:7e:43 brd ff:ff:ff:ff:ff:ff
    inet 192.168.175.0/17 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::e0c6:20ff:fe3b:7e43/64 scope link
       valid_lft forever preferred_lft forever

That interface is missing on tools-proxy-03.

The flannel interface is now there. ferm needed a restart on the flannel etcd servers (and I added a security group rule to allow the port across regions).

Flannel and kube-proxy both work now, but the connection is still not possible. Checking kube workers.

Ok, tools-proxy-03 can now reach flannel network nodes in eqiad!
I had to add UDP port 8472 to both ends of the setup.
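
For reference, the equivalent rule can be added with openstacksdk along these lines (a sketch; the cloud name, security group name, and source range are placeholders, not the values actually used):

import openstack

# Placeholders: adjust the cloud, security group, and source CIDR to the
# real environment.
conn = openstack.connect(cloud='cloudvps')
group = conn.network.find_security_group('k8s-workers')

# Flannel's VXLAN backend uses UDP 8472, so both ends need it open.
conn.network.create_security_group_rule(
    security_group_id=group.id,
    direction='ingress',
    ethertype='IPv4',
    protocol='udp',
    port_range_min=8472,
    port_range_max=8472,
    remote_ip_prefix='172.16.0.0/21',
)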

Mentioned in SAL (#wikimedia-cloud) [2019-01-14T22:03:04Z] <bstorm_> T213711 Added ports needed for etcd-flannel to work on the etcd security group in eqiad

Mentioned in SAL (#wikimedia-cloud) [2019-01-14T22:03:53Z] <bstorm_> T213711 Added UDP port needed for flannel packets to work to k8s worker sec groups in both eqiad and eqiad1-r

Set my hosts file to point at 03, and it works like a charm. This is ready to go.

Restarted flannel on proxy-04, and it got its flanneld interface up.

Change 484335 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: change the default proxy options to real proxy servers

https://gerrit.wikimedia.org/r/484335

Mentioned in SAL (#wikimedia-cloud) [2019-01-15T18:29:50Z] <bstorm_> T213711 installed python3-requests=2.11.1-1~bpo8+1 python3-urllib3=1.16-1~bpo8+1 on tools-proxy-03, which stopped the bleeding

Change 484550 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: pin python3 requests to backports for jessie on proxies

https://gerrit.wikimedia.org/r/484550

Change 484550 merged by Bstorm:
[operations/puppet@production] toolforge: pin python3 requests to backports for jessie on proxies

https://gerrit.wikimedia.org/r/484550

Brief story of https://gerrit.wikimedia.org/r/484550 and the bug it fixes:

kube2proxy runs as a service on the active urlproxy instance for Toolforge. This python script connects to the Kubernetes API and watches for pods with the tools.wmflabs.org/webservice=true label starting and stopping. The watching is done via the /api/v1/watch/services endpoint of the API service using the python3 requests library. The code here functionally looks like this:

import json

import requests


def watch_services(bearer_token):
    session = requests.Session()
    # KubeAuth adds the 'Authorization:' header needed to see into all namespaces
    session.auth = KubeAuth(bearer_token)
    resp = session.get(
        'https://our-k8s-master:6443/api/v1/watch/services',
        params={
            'labelSelector': 'tools.wmflabs.org/webservice=true',
            'resourceVersion': 0,
            'watch': True
        },
        stream=True
    )
    # Each line of the streaming response is one JSON-encoded watch event.
    for line in resp.iter_lines():
        yield json.loads(line.decode('utf-8'))

With requests v2.4.3, the iter_lines() polling does not pass the latest response line out to the calling code until a new line comes into the backing buffer. With requests v2.7.0 and later, this buffer draining problem seems to be fixed.

It appears that this was known on 2015-09-30 when requests v2.7.0 was installed on both tools-proxy-01 and tools-proxy-02 using pip. This was actually done via Puppet per T111916: Add support to dynamicproxy for kubernetes based web services and rOPUPb4b3223e49d9: dynamicproxy: add support for kubernetes.

Yes, it looks like it's being read from /usr/local (the requests library) so it was definitely imported from pip3...

Let's put a fixed package in our repos instead...

There is no trace on that ticket of that being done, but the pip installer was removed from the Puppet manifest in rOPUP242439e352a2: toollabs: Cleanup kube2proxy.

When @Andrew built the new tools-proxy-03 and tools-proxy-04 instances, the Puppet manifest installed python3-requests v2.4.3, which has the bug, from the Debian Jessie apt repo.

Found a DNS A record in Designate for www.tools.wmflabs.org. Changed it to a CNAME of tools.wmflabs.org to reduce the number of things that need to be fiddled with when failing the proxy over to a new host. https://tools.wmflabs.org/sal/log/AWhS1kv2zCcrHSwqte3k

webservicemonitor needed restarting on tools-services-02 to get it to re-read /etc/active-proxy and start talking to the new active proxy https://tools.wmflabs.org/sal/log/AWhTUtUZzCcrHSwqtjgq

@Bstorm found several instances in both the kubernetes and grid engine worker pools that did not have the expected security groups applied via Horizon. This kept jobs on those nodes unreachable from the new proxy even after we had identified the problem and opened more ports in the security group.
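
A quick way to audit for that kind of drift (a sketch, with a placeholder cloud and security group name) is to list the project's instances and flag any missing the expected group:

import openstack

# 'tools-webservice' is a placeholder for whichever security group the
# worker pools are expected to carry.
EXPECTED = 'tools-webservice'

conn = openstack.connect(cloud='cloudvps')
for server in conn.compute.servers(details=True):
    names = {sg['name'] for sg in (server.security_groups or [])}
    if EXPECTED not in names:
        print('%s is missing %s' % (server.name, EXPECTED))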

Logging the hiera setting change done at 2019-01-15T14:28:23 here to help with recreating the timeline of events: https://wikitech.wikimedia.org/w/index.php?title=Hiera:Tools&diff=next&oldid=1813249

Change 484609 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: kube2proxy: validate requests library version

https://gerrit.wikimedia.org/r/484609

Change 484609 merged by Bstorm:
[operations/puppet@production] toolforge: kube2proxy: validate requests library version

https://gerrit.wikimedia.org/r/484609
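
A minimal sketch of the kind of startup guard that change describes (the real patch may differ in detail):

import pkg_resources
import requests

MINIMUM = '2.7.0'

# Refuse to run against a requests library old enough to have the
# iter_lines() buffering problem described above.
if (pkg_resources.parse_version(requests.__version__)
        < pkg_resources.parse_version(MINIMUM)):
    raise RuntimeError(
        'kube2proxy needs requests >= %s, found %s'
        % (MINIMUM, requests.__version__))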

The old proxies tools-proxy-01 and tools-proxy-02 are now shut down. Let's give this a few days and then delete them if there's no incident.

Change 484335 merged by Bstorm:
[operations/puppet@production] toolforge: change the default proxy options to real proxy servers

https://gerrit.wikimedia.org/r/484335

Another entanglement we missed:

[20:46]  <chicocvenancio>	!log paws moving paws_public  proxy_pass to https://172.16.6.39 in paws-proxy-01

@Fuzheado pointed out on IRC that PAWS-public was offline today; I changed the paws-public host on paws-proxy-01 to point to one of the new proxies.

Until T195217#4230520 is done, paws-proxy-01 handles PAWS ingress after receiving requests from the tools proxies. It sends PAWS requests to the PAWS k8s cluster nodeports, and PAWS-public requests back to a hardcoded IP of one of the proxies to allow rendering of the files (plus some Lua voodoo to translate global IDs into usernames).

Things have been doing fine without the old proxies.

Change 499803 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] jessie-backports: create new component for kube2proxy

https://gerrit.wikimedia.org/r/499803

Change 499803 merged by Jbond:
[operations/puppet@production] jessie-backports: create new component for kube2proxy

https://gerrit.wikimedia.org/r/499803