The proxy nodes use redis to sync state, so we should be able to do this gracefully. There will be a lot of issues with firewalls and flannel connectivity though.
Description
Details
Status | Assigned | Task
---|---|---
Resolved | Bstorm | T204530 cloudvps: tools and toolsbeta trusty deprecation
Resolved | Bstorm | T213711 move tools proxy nodes to eqiad1
Resolved | Chicocvenancio | T214613 Move paws-proxy-01 to eqiad1-r
Event Timeline
I have two new nodes, tools-proxy-03 and tools-proxy-04, that are syncing with Redis and /should/ be able to serve up webservices.
The 'active' node in eqiad1 is tools-proxy-03; if I substitute its public IP into my /etc/hosts, I can reach SGE-based webservices. K8s services don't work though.
To test: alias tools.wmflabs.org to 185.15.56.5 (aka tools-proxy-03) in /etc/hosts, hit a tool with your browser, and check /var/log/nginx on tools-proxy-03.tools.eqiad.wmflabs.
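An alternative to editing /etc/hosts is to aim a request at the proxy's IP while presenting the public hostname in the Host header. A minimal sketch with Python's stdlib (the IP and hostname come from the comment above; the helper name and path argument are mine):

```python
import urllib.request

def probe_request(proxy_ip, hostname, path="/"):
    """Build a request aimed at a specific proxy IP while presenting
    the public hostname, mimicking an /etc/hosts override."""
    url = "https://{}{}".format(proxy_ip, path)
    return urllib.request.Request(url, headers={"Host": hostname})

req = probe_request("185.15.56.5", "tools.wmflabs.org", "/")
print(req.full_url, req.get_header("Host"))
```

Actually sending it needs TLS verification relaxed (the certificate matches the hostname, not the raw IP), so for a quick manual check the browser-plus-/etc/hosts route above is still simpler.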
Flannel interface isn't showing up right now. On tools-proxy-01:

```
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
    link/ether e2:c6:20:3b:7e:43 brd ff:ff:ff:ff:ff:ff
    inet 192.168.175.0/17 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::e0c6:20ff:fe3b:7e43/64 scope link
       valid_lft forever preferred_lft forever
```

That interface is not there on tools-proxy-03.
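The "is flannel.1 there" check can be scripted across the proxies by looking for the interface under /sys/class/net; a small sketch (the helper name is mine):

```python
import os

def has_interface(name):
    """Return True if a network interface with this name exists.
    Linux exposes one directory per interface under /sys/class/net."""
    return os.path.exists(os.path.join("/sys/class/net", name))

for iface in ("flannel.1", "lo"):
    print(iface, has_interface(iface))
```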
Flannel interface is now there. ferm needed a restart on the flannel etcd servers (and I added a security group rule to allow the port across regions).
Flannel and kube-proxy both work now, but connections still fail. Checking the k8s workers.
Ok, tools-proxy-03 can now reach flannel network nodes in eqiad!
I had to add UDP port 8472 (flannel's VXLAN port) to both ends of the setup.
Mentioned in SAL (#wikimedia-cloud) [2019-01-14T22:03:04Z] <bstorm_> T213711 Added ports needed for etcd-flannel to work on the etcd security group in eqiad
Mentioned in SAL (#wikimedia-cloud) [2019-01-14T22:03:53Z] <bstorm_> T213711 Added UDP port needed for flannel packets to work to k8s worker sec groups in both eqiad and eqiad1-r
Change 484335 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: change the default proxy options to real proxy servers
Mentioned in SAL (#wikimedia-cloud) [2019-01-15T18:29:50Z] <bstorm_> T213711 installed python3-requests=2.11.1-1~bpo8+1 python3-urllib3=1.16-1~bpo8+1 on tools-proxy-03, which stopped the bleeding
Change 484550 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolforge: pin python3 requests to backports for jessie on proxies
Change 484550 merged by Bstorm:
[operations/puppet@production] toolforge: pin python3 requests to backports for jessie on proxies
Brief story of https://gerrit.wikimedia.org/r/484550 and the bug it fixes:
kube2proxy runs as a service on the active urlproxy instance for Toolforge. This Python script connects to the Kubernetes API and watches for pods with the tools.wmflabs.org/webservice=true label starting and stopping. The watching is done via the /api/v1/watch/services endpoint of the API server, using the python3 requests library. Functionally, the code looks like this:
```python
import json

import requests

def watch_webservices(bearer_token):
    session = requests.Session()
    # KubeAuth adds the 'Authorization:' header needed to see into all namespaces
    session.auth = KubeAuth(bearer_token)
    resp = session.get(
        'https://our-k8s-master:6443/api/v1/watch/services',
        params={
            'labelSelector': 'tools.wmflabs.org/webservice=true',
            'resourceVersion': 0,
            'watch': True,
        },
        stream=True,
    )
    for line in resp.iter_lines():
        yield json.loads(line.decode('utf-8'))
```
With requests v2.4.3, iter_lines() does not pass a completed response line out to the calling code until a new line comes into the backing buffer. With requests v2.7.0 and later, this buffer-draining problem seems to be fixed.
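The failure mode is easier to see with the delimiter splitting separated from HTTP. A sketch of the line reassembly that a healthy iter_lines() performs over arbitrary chunk boundaries (my own illustration, not the requests source; the v2.4.3 bug effectively held completed lines back until a later chunk arrived):

```python
def lines_from_chunks(chunks):
    """Reassemble newline-delimited records from a chunked stream.
    Each completed line is yielded as soon as its newline arrives;
    only a trailing partial line stays buffered. On a long-poll
    watch stream, holding completed lines back instead means the
    consumer never sees an event until the *next* event arrives."""
    pending = b""
    for chunk in chunks:
        pending += chunk
        while b"\n" in pending:
            line, pending = pending.split(b"\n", 1)
            yield line
    if pending:  # stream closed with a partial record
        yield pending

gen = lines_from_chunks(iter([b'{"type": "ADDED"}\n', b'{"type": "MODI']))
print(next(gen))  # b'{"type": "ADDED"}' -- emitted promptly, mid-stream
```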
It appears that this was known on 2015-09-30 when requests v2.7.0 was installed on both tools-proxy-01 and tools-proxy-02 using pip. This was actually done via Puppet per T111916: Add support to dynamicproxy for kubernetes based web services and rOPUPb4b3223e49d9: dynamicproxy: add support for kubernetes.
That ticket does not record the change being made, but the pip installer was later removed from the Puppet manifest in rOPUP242439e352a2: toollabs: Cleanup kube2proxy.
When @Andrew built the new tools-proxy-03 and tools-proxy-04 instances, the Puppet manifest installed python3-requests v2.4.3 from the Debian Jessie apt repo, which has the bug.
Found a DNS A record in Designate for www.tools.wmflabs.org. Changed it to a CNAME of tools.wmflabs.org to reduce the number of things that need to be fiddled with when failing the proxy over to a new host. https://tools.wmflabs.org/sal/log/AWhS1kv2zCcrHSwqte3k
webservicemonitor needed restarting on tools-services-02 to get it to re-read /etc/active-proxy and start talking to the new active proxy https://tools.wmflabs.org/sal/log/AWhTUtUZzCcrHSwqtjgq
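webservicemonitor discovers the active proxy from /etc/active-proxy. Assuming that file simply holds the proxy hostname (an assumption on my part, not a confirmed format), re-reading it looks like:

```python
import os
import tempfile

def read_active_proxy(path="/etc/active-proxy"):
    """Return the active proxy host recorded in the file
    (assumed format: a single hostname, newline-terminated)."""
    with open(path) as f:
        return f.read().strip()

# Demo against a temp file rather than the real /etc/active-proxy:
with tempfile.NamedTemporaryFile("w", delete=False) as f:
    f.write("tools-proxy-03.tools.eqiad.wmflabs\n")
active = read_active_proxy(f.name)
os.unlink(f.name)
print(active)
```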
@Bstorm found several instances in both the kubernetes and grid engine worker pools that did not have the expected security groups applied via Horizon. This kept jobs on those nodes from being reachable from the new proxy, even after we had identified the problem and opened more ports in the security group.
Logging the hiera setting change done at 2019-01-15T14:28:23 here to help with recreating the timeline of events: https://wikitech.wikimedia.org/w/index.php?title=Hiera:Tools&diff=next&oldid=1813249
Change 484609 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: kube2proxy: validate requests library version
Change 484609 merged by Bstorm:
[operations/puppet@production] toolforge: kube2proxy: validate requests library version
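The version validation merged above can be sketched as a startup guard that refuses to run under a requests older than the known-good 2.7.0 (the exact threshold, message, and mechanics of the real patch may differ):

```python
def version_tuple(v):
    """'2.7.0' -> (2, 7, 0), so versions compare numerically
    instead of lexically ('2.11' must sort after '2.7')."""
    return tuple(int(part) for part in v.split("."))

def check_requests_version(installed, minimum="2.7.0"):
    """Fail fast if the installed requests predates the iter_lines fix."""
    if version_tuple(installed) < version_tuple(minimum):
        raise RuntimeError(
            "requests %s is too old; watch streaming needs >= %s"
            % (installed, minimum))

check_requests_version("2.11.1")  # the backported version: passes
print(version_tuple("2.4.3") < version_tuple("2.7.0"))  # True: buggy
```

The tuple comparison matters because a plain string compare would wrongly rank "2.11.x" below "2.7.0".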
The old proxies tools-proxy-01 and tools-proxy-02 are now shut down. Let's give this a few days and then delete them if there's no incident.
Change 484335 merged by Bstorm:
[operations/puppet@production] toolforge: change the default proxy options to real proxy servers
Another entanglement we missed:
[20:46] <chicocvenancio> !log paws moving paws_public proxy_pass to https://172.16.6.39 in paws-proxy-01
@Fuzheado pointed out on IRC that PAWS-public was offline today; I changed the paws-public host on paws-proxy-01 to point to one of the new proxies.
Until T195217#4230520 is done, paws-proxy-01 handles PAWS ingress after receiving requests from the tools proxies. It sends PAWS requests on to the PAWS k8s cluster nodeports, and sends PAWS-public requests back to a hardcoded IP of one of the proxies to allow rendering of the files (plus some Lua voodoo to translate global IDs into usernames).
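That split can be summarized as a tiny dispatch table. A hedged Python sketch of the logic only: the real proxy matches on the request URL/host in nginx, the nodeport backend here is a placeholder, and the only real value is the 172.16.6.39 IP from the log line above:

```python
def route(target,
          nodeport_backend="http://paws-k8s-node:NODEPORT",  # placeholder
          public_backend="https://172.16.6.39"):
    """Mimic paws-proxy-01's split: PAWS traffic -> the PAWS k8s
    cluster nodeports; PAWS-public -> a hardcoded tools-proxy IP
    so the files get rendered."""
    return public_backend if target == "paws-public" else nodeport_backend

print(route("paws"))
print(route("paws-public"))
```

The hardcoded-IP leg is exactly the entanglement that bit us here: it had to be repointed by hand when the active proxy changed.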
Change 499803 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] jessie-backports: create new component for kube2proxy
Change 499803 merged by Jbond:
[operations/puppet@production] jessie-backports: create new component for kube2proxy