Details
Status | Assigned | Task
---|---|---
Unknown Object (Task) | |
Resolved | Krinkle | T134871 Move stream.wikimedia.org (rcstream) behind cache_misc
Resolved | BBlack | T134870 Support websockets in cache_misc
Declined | BBlack | T107749 HTTP/1.1 keepalive for local nginx->varnish conns
Event Timeline
Change 287956 had a related patch set uploaded (by BBlack):
cache_misc: add stream.wm.o
Change 294307 had a related patch set uploaded (by BBlack):
cache_misc: pass all stream.wm.o
Change 294323 had a related patch set uploaded (by BBlack):
stream: use hash(X-Client-IP) for backend selection
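For readers unfamiliar with hash-based backend selection, here is a minimal Python sketch of the idea: the same client IP always maps to the same backend, presumably so that every request from one socket.io client lands on the same rcstream host. The backend names come from this task (rcs100[12]); the SHA-256-plus-modulo scheme and the function name `pick_backend` are assumptions for illustration, not what the Varnish director in the patch actually does.

```python
import hashlib

# Backend names as mentioned in this task; the selection scheme below is
# an illustrative assumption, not the actual Varnish hash director.
BACKENDS = ["rcs1001", "rcs1002"]


def pick_backend(client_ip, backends=BACKENDS):
    """Deterministically map a client IP (e.g. the X-Client-IP value) to a backend."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


if __name__ == "__main__":
    # Documentation-range addresses (RFC 5737), used purely as examples.
    for ip in ("192.0.2.10", "198.51.100.7", "203.0.113.42"):
        print(ip, "->", pick_backend(ip))
```

Note that plain modulo hashing reshuffles most clients whenever the backend list changes; that is acceptable for a sketch but is one reason production load balancers often use consistent hashing instead.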
This seems to be working now. It's fully configured on cache_misc; the only remaining step is switching the DNS resolution for stream.wm.o to cache_misc. Until then it can be tested by hacking local DNS resolution.
I've been testing it with a minimal python client as shown in https://wikitech.wikimedia.org/wiki/RCStream#Python , but with the constructor changed from socketIO_client.SocketIO('stream.wikimedia.org', 80) to socketIO_client.SocketIO('https://stream.wikimedia.org').
If I change the constructor back to using port 80, the client fails with: websocket._exceptions.WebSocketBadStatusException: Handshake status 301, indicating it can't handle the HTTP->HTTPS redirect before upgrading HTTP to websockets.
The current stream.wm.o (which uses LVS to talk directly to rcs100[12]) doesn't redirect on port 80, allowing cleartext rcstream connections or encrypted ones (client's choice, but I guess they have to be explicit in their config).
Is 301->HTTPS not legal for websockets for some reason? Are lots of clients going to break if we do this regardless?
We've got ~13 days until we need to renew the existing SSL cert (or not, if we can switch to cache_misc).
Looking into the Websockets RFC ( https://tools.ietf.org/html/rfc6455 ), it says in section 4.1:
Once the client's opening handshake has been sent, the client MUST wait for a response from the server before sending any further data. The client MUST validate the server's response as follows:
1. If the status code received from the server is not 101, the client handles the response per HTTP [RFC2616] procedures. In particular, the client might perform authentication if it receives a 401 status code; the server might redirect the client using a 3xx status code (but clients are not required to follow them), etc. Otherwise, proceed as follows.
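To make the RFC text concrete, here is a rough Python sketch of how a client could decide what to do with the handshake response status: upgrade on 101, optionally follow a 3xx by reconnecting to the matching ws/wss URL, and fail otherwise. The function name `next_step` and the `follow_redirects` flag are hypothetical; the clients tested in this task evidently take the "not required to follow" branch and simply error out.

```python
from urllib.parse import urlsplit, urlunsplit

# Map redirect-target schemes onto the matching WebSocket scheme.
SCHEME_MAP = {"http": "ws", "https": "wss", "ws": "ws", "wss": "wss"}


def next_step(status, location=None, follow_redirects=True):
    """Return (action, url) for a WebSocket opening-handshake response.

    action is "upgrade" (status 101), "redirect" (3xx with a usable
    Location header, and the client opted in), or "fail".
    """
    if status == 101:
        return ("upgrade", None)
    if 300 <= status < 400 and location and follow_redirects:
        parts = urlsplit(location)
        scheme = SCHEME_MAP.get(parts.scheme)
        if scheme is None:
            # Relative or unknown-scheme Location: not handled in this sketch.
            return ("fail", None)
        url = urlunsplit((scheme, parts.netloc, parts.path, parts.query, ""))
        return ("redirect", url)
    return ("fail", None)


if __name__ == "__main__":
    print(next_step(101))
    print(next_step(301, "https://stream.wikimedia.org/"))
    print(next_step(301, "https://stream.wikimedia.org/", follow_redirects=False))
```

A client that never follows redirects behaves like the last call: it is spec-compliant, but it breaks as soon as the server starts answering port 80 with a 301.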
Another datapoint: in the nginx logs on rcs1001, most successful socket.io requests seem to be non-SSL (the first count is from the plain-HTTP log, the second from the SSL log):
root@rcs1001:/var/log/nginx# grep -v rcstream_status rcstream_access.log|grep -c socket.io
1254
root@rcs1001:/var/log/nginx# grep -v rcstream_status rcstream_ssl_access.log|grep -c socket.io
77
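For anyone who wants to repeat this count without shell access to grep, a small Python equivalent of the pipeline might look like the following. The helper names are made up for this sketch, and it uses plain substring matching, whereas grep treats `socket.io` as a regex in which the dot matches any character, so counts could differ slightly on unusual log lines.

```python
def count_matching(lines, include, exclude):
    """Count lines containing `include` but not `exclude`.

    Roughly equivalent to: grep -v EXCLUDE | grep -c INCLUDE
    (substring match here, not regex).
    """
    return sum(1 for line in lines if include in line and exclude not in line)


def count_in_log(path, include="socket.io", exclude="rcstream_status"):
    """Apply count_matching to a log file on disk."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_matching(handle, include, exclude)


if __name__ == "__main__":
    # Synthetic sample lines, not real log data.
    sample = [
        "GET /socket.io/1/ HTTP/1.1",
        "GET /rcstream_status HTTP/1.1",
        "GET /favicon.ico HTTP/1.1",
    ]
    print(count_matching(sample, "socket.io", "rcstream_status"))
```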
Obviously, if we can't fix the existing non-SSL clients in a timely fashion (or can't assume they can handle redirects), our other option is to punch a temporary hole in cache_misc and make it not force redirects for stream.wm.o (and set a timeline for removing the hole).
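As a rough illustration of what such a hole amounts to, here is the decision logic in Python: plain-HTTP requests get a 301 to HTTPS unless the Host is on an exemption list. The real logic lives in the frontend VCL, not Python; the names `TLS_EXEMPT_HOSTS` and `frontend_action` are invented for this sketch and do not mirror the actual patch.

```python
# Hosts allowed to stay on plain HTTP for now (the "temporary hole").
TLS_EXEMPT_HOSTS = {"stream.wikimedia.org"}


def frontend_action(host, scheme, path="/"):
    """Return (status, location) for a request hitting the cache frontend.

    (None, None) means "serve the request as-is"; (301, url) means
    "redirect to HTTPS".
    """
    # Normalize the Host header: lower-case, drop any :port suffix.
    hostname = host.lower().split(":", 1)[0]
    if scheme == "https" or hostname in TLS_EXEMPT_HOSTS:
        return (None, None)
    return (301, "https://" + hostname + path)


if __name__ == "__main__":
    print(frontend_action("stream.wikimedia.org", "http", "/socket.io/"))
    print(frontend_action("example.wikimedia.org", "http", "/page"))
```

Removing the hole later is then a one-line change: take the host out of the exemption set, which corresponds to reverting the exception patch.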
Tried the sample JS client code too, from https://wikitech.wikimedia.org/wiki/RCStream#JavaScript . Same basic results. It works fine if I prepend https:// to the connect() string, but otherwise fails (and it fails silently, too, heh. Go nodejs!).
Change 294346 had a related patch set uploaded (by BBlack):
frontend VCL: stream.wm.o TLS exception
With the TLS hole punched above, the clients do work correctly with plain HTTP. This is status-quo, as the current service also allows non-TLS and most clients are using non-TLS today. I think at this point it's worth taking this route - it will get rcstream moved to standard termination, at which point we can:
- remove its public LVS service
- remove the HTTPS listener on rcs100x
- not renew the cert that's expiring in 13 days.
- open a separate ticket about switching off unencrypted HTTP for rcstream (reverting https://gerrit.wikimedia.org/r/#/c/294346/ ) at a later date after announcements and validation, etc.
What's needed now is some confirmation beyond my manual testing with the sample python and js client code from wikitech. Can someone confirm that real clients work (with local DNS hacks to remap stream.wm.o -> misc-web-lb)?
We just tested this via /etc/hosts entry + http://codepen.io/Krinkle/pen/laucI/?editors=0010 and it worked. The implementations that we are familiar with are all based on one of these three clients (node.js, Python, frontend js) so we should be good. The plan of record (per IRC discussion) is for me to announce the change on wikitech-l tomorrow (Thursday, June 16), giving people two days' advance notice, and using the opportunity to remind people to use https. @BBlack will do the actual switch at his convenience.
Change 295385 had a related patch set uploaded (by BBlack):
stream.wm.o: move to cache_misc in DNS
Change 298525 had a related patch set uploaded (by BBlack):
Remove old rcstream public LVS config
Change 298530 had a related patch set uploaded (by BBlack):
Remove stream-lb.eqiad hostname
Change 298564 had a related patch set uploaded (by BBlack):
Remove old rcstream public LVS config in conftool-data
Change 298566 had a related patch set uploaded (by BBlack):
remove rcstream lvs::realserver config
Change 298564 merged by BBlack:
Remove old rcstream public LVS config in conftool-data
Change 304023 had a related patch set uploaded (by BBlack):
rcstream: remove internal TLS listener