
Move stream.wikimedia.org (rcstream) behind cache_misc
Closed, Resolved · Public

Details

Related Gerrit Patches:
operations/puppet (production): rcstream: remove internal TLS listener
operations/dns (master): Remove stream-lb.eqiad hostname
operations/puppet (production): remove rcstream lvs::realserver config
operations/puppet (production): Remove old rcstream public LVS config in conftool-data
operations/puppet (production): Remove old rcstream public LVS config
operations/dns (master): stream.wm.o: move to cache_misc in DNS
operations/puppet (production): frontend VCL: stream.wm.o TLS exception
operations/puppet (production): stream: use hash(X-Client-IP) for backend selection
operations/puppet (production): cache_misc: pass all stream.wm.o
operations/puppet (production): cache_misc: add stream.wm.o

Event Timeline

BBlack created this task. May 10 2016, 12:27 PM
Restricted Application added a project: Operations. May 10 2016, 12:27 PM
Restricted Application added subscribers: Zppix, Aklapper.
BBlack added a parent task: Unknown Object (Task). May 10 2016, 1:35 PM

Change 287956 had a related patch set uploaded (by BBlack):
cache_misc: add stream.wm.o

https://gerrit.wikimedia.org/r/287956

chasemp triaged this task as Normal priority. May 11 2016, 7:30 PM

Change 287956 merged by BBlack:
cache_misc: add stream.wm.o

https://gerrit.wikimedia.org/r/287956

Change 294307 had a related patch set uploaded (by BBlack):
cache_misc: pass all stream.wm.o

https://gerrit.wikimedia.org/r/294307

Change 294307 merged by BBlack:
cache_misc: pass all stream.wm.o

https://gerrit.wikimedia.org/r/294307

Change 294323 had a related patch set uploaded (by BBlack):
stream: use hash(X-Client-IP) for backend selection

https://gerrit.wikimedia.org/r/294323

Change 294323 merged by BBlack:
stream: use hash(X-Client-IP) for backend selection

https://gerrit.wikimedia.org/r/294323
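
For readers unfamiliar with the idea behind hash(X-Client-IP) backend selection, here is a minimal Python sketch of the general technique: hash the client address and use the result to pick a backend, so that repeated requests from one client land on the same server. This is not the VCL from the patch. The SHA-256 hash and the modulo step are assumptions made for illustration, and `pick_backend` is a hypothetical helper; only the backend names come from this task.

```python
import hashlib

# Backend names as mentioned in this task (rcs100[12]).
BACKENDS = ["rcs1001", "rcs1002"]


def pick_backend(client_ip: str, backends=BACKENDS) -> str:
    """Deterministically map a client IP string to one backend.

    Assumption: SHA-256 plus modulo stands in for whatever hashing the
    cache layer really uses; only the stickiness property matters here.
    """
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


if __name__ == "__main__":
    # Documentation-range addresses, not real clients.
    for ip in ("198.51.100.7", "203.0.113.42"):
        print(ip, "->", pick_backend(ip))
```

The same input always yields the same backend, which is presumably the property wanted for clients that make several HTTP requests before the websocket upgrade.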

This seems to be working now. It's fully configured on cache_misc, other than switching the DNS resolution for stream.wm.o to cache_misc, and can be tested by hacking local DNS resolution.

I've been testing it with a minimal Python client as shown in https://wikitech.wikimedia.org/wiki/RCStream#Python , but with the constructor changed from socketIO_client.SocketIO('stream.wikimedia.org', 80) to socketIO_client.SocketIO('https://stream.wikimedia.org').

If I change the constructor back to using port 80, the client fails with: websocket._exceptions.WebSocketBadStatusException: Handshake status 301, indicating it can't handle the HTTP->HTTPS redirect before upgrading HTTP to websockets.

The current stream.wm.o (which uses LVS to talk directly to rcs100[12]) doesn't redirect on port 80, allowing cleartext rcstream connections or encrypted ones (client's choice, but I guess they have to be explicit in their config).

Is 301->HTTPS not legal for websockets for some reason? Are lots of clients going to break if we do this regardless?

We've got ~13 days until we need to renew the existing SSL cert (or not, if we can switch to cache_misc).

Looking into the Websockets RFC ( https://tools.ietf.org/html/rfc6455 ), it says in section 4.1:

Once the client's opening handshake has been sent, the client MUST
   wait for a response from the server before sending any further data.
   The client MUST validate the server's response as follows:

   1.  If the status code received from the server is not 101, the
       client handles the response per HTTP [RFC2616] procedures.  In
       particular, the client might perform authentication if it
       receives a 401 status code; the server might redirect the client
       using a 3xx status code (but clients are not required to follow
       them), etc.  Otherwise, proceed as follows.
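
One way to read that clause, sketched in Python: a client only proceeds on 101, and whether it follows a 3xx is its own choice. The function name, the `follows_redirects` flag, and the returned labels are all hypothetical and not taken from any websocket library; the point is only that a client which does not opt in to redirects ends up failing on a 301, which matches the behaviour seen above.

```python
def handshake_outcome(status: int, follows_redirects: bool = False) -> str:
    """Classify a server's reply to a websocket opening handshake.

    Loosely follows RFC 6455 section 4.1, step 1. The return values are
    illustrative labels, not part of the RFC.
    """
    if status == 101:
        # Switching Protocols: the upgrade succeeded.
        return "upgraded"
    if status == 401:
        # The client might perform authentication and retry.
        return "authenticate"
    if 300 <= status < 400:
        # Redirects are allowed, but clients are not required to follow them.
        return "follow-redirect" if follows_redirects else "failed"
    return "failed"


if __name__ == "__main__":
    for status in (101, 301, 401, 500):
        print(status, handshake_outcome(status))
```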

Another data point: in the nginx logs on rcs1001, most successful operations seem to be non-SSL:

root@rcs1001:/var/log/nginx# grep -v rcstream_status rcstream_access.log|grep -c socket.io
1254
root@rcs1001:/var/log/nginx# grep -v rcstream_status rcstream_ssl_access.log|grep -c socket.io
77
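
To put those two counts in proportion, here is a small Python sketch that mirrors the grep pipeline (drop lines containing rcstream_status, count lines containing socket.io) and computes the non-SSL share. The function names are hypothetical. Note that these are request lines from one host's current logs, not distinct clients, so the ratio is only a rough indicator.

```python
def count_socketio_requests(lines) -> int:
    """Equivalent of: grep -v rcstream_status | grep -c socket.io"""
    return sum(
        1
        for line in lines
        if "rcstream_status" not in line and "socket.io" in line
    )


def plain_share(plain_count: int, ssl_count: int) -> float:
    """Fraction of counted requests that arrived over plain HTTP."""
    total = plain_count + ssl_count
    return plain_count / total if total else 0.0


if __name__ == "__main__":
    # Counts taken from the grep output above.
    print(f"{plain_share(1254, 77):.1%}")  # prints 94.2%
```

On these numbers, roughly 94% of the logged socket.io requests on rcs1001 were non-SSL.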

Obviously, if we can't fix the existing non-SSL clients in a timely fashion (or can't assume they can handle redirects), our other option is to punch a temporary hole in cache_misc and make it not force redirects for stream.wm.o (and set a timeline for removing the hole).

Tried the sample JS client code too, from https://wikitech.wikimedia.org/wiki/RCStream#JavaScript . Same basic results. It works fine if I prepend https:// to the connect() string, but otherwise fails (and it fails silently, too, heh. Go nodejs!).

Change 294346 had a related patch set uploaded (by BBlack):
frontend VCL: stream.wm.o TLS exception

https://gerrit.wikimedia.org/r/294346

Change 294346 merged by BBlack:
frontend VCL: stream.wm.o TLS exception

https://gerrit.wikimedia.org/r/294346

BBlack added a comment. Edited Jun 14 2016, 4:25 PM

With the TLS hole punched above, the clients do work correctly with plain HTTP. This is the status quo, as the current service also allows non-TLS and most clients are using non-TLS today. I think at this point it's worth taking this route: it will get rcstream moved to standard termination, at which point we can:

  1. remove its public LVS service
  2. remove the HTTPS listener on rcs100x
  3. not renew the cert that's expiring in 13 days.
  4. open a separate ticket about switching off unencrypted HTTP for rcstream (reverting https://gerrit.wikimedia.org/r/#/c/294346/ ) at a later date after announcements and validation, etc.

What's needed now is some confirmation beyond my manual testing with the sample python and js client code from wikitech. Can someone confirm that real clients work (with local DNS hacks to remap stream.wm.o -> misc-web-lb)?
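
For Python-based clients, one alternative to editing /etc/hosts is to override name resolution inside the test process. The sketch below temporarily patches `socket.getaddrinfo` so that one hostname resolves to a chosen address. `override_dns` is a hypothetical helper, and 192.0.2.10 is a placeholder from the documentation address range, not the real misc-web-lb address (which is not given in this task). Libraries that resolve names without going through `socket.getaddrinfo` would not be affected.

```python
import socket
from contextlib import contextmanager


@contextmanager
def override_dns(hostname: str, ip: str):
    """Resolve `hostname` to `ip` within this process while the block runs."""
    original = socket.getaddrinfo

    def patched(host, *args, **kwargs):
        # Swap in the replacement address only for the targeted hostname.
        if host == hostname:
            host = ip
        return original(host, *args, **kwargs)

    socket.getaddrinfo = patched
    try:
        yield
    finally:
        # Always restore the real resolver, even if the body raised.
        socket.getaddrinfo = original


if __name__ == "__main__":
    # Placeholder address; substitute the real cache_misc frontend IP.
    with override_dns("stream.wikimedia.org", "192.0.2.10"):
        infos = socket.getaddrinfo(
            "stream.wikimedia.org", 443, type=socket.SOCK_STREAM
        )
        print(infos[0][4][0])
```

Because only resolution changes, the client still sends the original hostname in the Host header and in TLS SNI, which is what a test against the new frontend needs.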

BBlack moved this task from Triage to In Progress on the Traffic board. Jun 14 2016, 10:01 PM

@ori @Krinkle - any thoughts or pointers on getting this tested more broadly and then switching before the cert expiry date?

ori raised the priority of this task from Normal to High. Jun 15 2016, 6:45 PM
ori added a project: Performance-Team.
ori added a comment. Jun 15 2016, 11:25 PM

> @ori @Krinkle - any thoughts or pointers on getting this tested more broadly and then switching before the cert expiry date?

We just tested this via /etc/hosts entry + http://codepen.io/Krinkle/pen/laucI/?editors=0010 and it worked. The implementations that we are familiar with are all based on one of these three clients (node.js, Python, frontend js) so we should be good. The plan of record (per IRC discussion) is for me to announce the change on wikitech-l tomorrow (Thursday, June 16), giving people two days' advance notice, and using the opportunity to remind people to use https. @BBlack will do the actual switch at his convenience.

@ori did notification go out? We're now 7 days from cert expiry.

ori moved this task from Inbox to Doing on the Performance-Team board. Jun 20 2016, 6:05 PM
ori added a comment. Jun 20 2016, 7:14 PM

> @ori did notification go out? We're now 7 days from cert expiry.

It did now :) https://lists.wikimedia.org/pipermail/wikitech-l/2016-June/085928.html

Change 295385 had a related patch set uploaded (by BBlack):
stream.wm.o: move to cache_misc in DNS

https://gerrit.wikimedia.org/r/295385

Change 295385 merged by BBlack:
stream.wm.o: move to cache_misc in DNS

https://gerrit.wikimedia.org/r/295385

Krinkle closed this task as Resolved. Jul 7 2016, 10:31 PM
Krinkle claimed this task.

Change 298525 had a related patch set uploaded (by BBlack):
Remove old rcstream public LVS config

https://gerrit.wikimedia.org/r/298525

Change 298530 had a related patch set uploaded (by BBlack):
Remove stream-lb.eqiad hostname

https://gerrit.wikimedia.org/r/298530

Change 298564 had a related patch set uploaded (by BBlack):
Remove old rcstream public LVS config in conftool-data

https://gerrit.wikimedia.org/r/298564

Change 298566 had a related patch set uploaded (by BBlack):
remove rcstream lvs::realserver config

https://gerrit.wikimedia.org/r/298566

Change 298525 merged by BBlack:
Remove old rcstream public LVS config

https://gerrit.wikimedia.org/r/298525

Change 298564 merged by BBlack:
Remove old rcstream public LVS config in conftool-data

https://gerrit.wikimedia.org/r/298564

Change 298566 merged by BBlack:
remove rcstream lvs::realserver config

https://gerrit.wikimedia.org/r/298566

Change 298530 merged by BBlack:
Remove stream-lb.eqiad hostname

https://gerrit.wikimedia.org/r/298530

Change 304023 had a related patch set uploaded (by BBlack):
rcstream: remove internal TLS listener

https://gerrit.wikimedia.org/r/304023

Change 304023 merged by BBlack:
rcstream: remove internal TLS listener

https://gerrit.wikimedia.org/r/304023