This includes puppetizing the svc, HTTPS, routing, and LVS setup in production.
Description
Details
Status | Assigned | Task
---|---|---
Declined | Xqt | T125197 Give hint to the current socketIO_client in ImportError of rcstream.py
Declined | None | T91393 RCStream is not accessible from python client due to using socket-io 1.0 while only socket-io 0.9 is offered
Declined | None | T68232 Upgrade RCStream backend to use socket.io 1.0 protocol
Resolved | Ottomata | T130651 EventStreams
Resolved | Ottomata | T148470 Kafka SSE Prototype
Resolved | Ottomata | T148779 Prepare eventstreams (with KafkaSSE) for deployment
Resolved | Ottomata | T143925 Productionize and deploy Public EventStreams
Declined | Ottomata | T145805 Support node cluster sticky-session in service-runner
Declined | Ottomata | T148043 Fix consumer.disconnect() node-rdkafka bug
Event Timeline
How do we get this into the REST API, and at what path?
/api/rest_v1/stream/* -> eventstreams.svc.$site.wmnet/v1/stream/*
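The proposed rewrite can be sketched as a small path-mapping function. This is a hypothetical illustration of the mapping above, not the actual nginx/VCL rule; the `rewrite` helper, the `PREFIX` constant, and the default `site` value are assumptions for the example.

```python
# Hypothetical sketch of the proposed rewrite:
#   /api/rest_v1/stream/<rest> -> eventstreams.svc.<site>.wmnet/v1/stream/<rest>
PREFIX = "/api/rest_v1/stream/"

def rewrite(path, site="eqiad"):
    """Map a public REST path to the internal EventStreams backend URL."""
    if not path.startswith(PREFIX):
        return None  # not a stream request; routed elsewhere
    rest = path[len(PREFIX):]
    return "eventstreams.svc.%s.wmnet/v1/stream/%s" % (site, rest)

print(rewrite("/api/rest_v1/stream/recentchange"))
# eventstreams.svc.eqiad.wmnet/v1/stream/recentchange
```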
This needs to be done in some nginx or varnish VCL config in puppet.
Hm, possibly the domain / path at which this will be hosted is controversial. To keep things centralized, let's discuss on the parent ticket.
Change 320690 had a related patch set uploaded (by Ottomata):
Deploy EventStreams on scb and configure LVS service in eqiad
Change 320781 had a related patch set uploaded (by Ottomata):
Remove codfw production targets, use scb1001 as canary
(Changed my mind again, let's discuss the domain vs path vs nginx vs varnish routing stuff here.)
Over in https://gerrit.wikimedia.org/r/#/c/320690/, I'm preparing LVS and service deployment for EventStreams. We need to decide how a request will actually make it to this service. We can't host this in varnish cache_text, since maintaining an open pipe there will hold up a precious varnish thread on a high volume cache cluster. Other options:
- custom nginx path: /api/rest_v1/stream/* with special routing rule in nginx, bypassing varnish altogether
- varnish misc domain: (event)?streams.wikimedia.org/v1/stream/* routed by cache_misc varnish, as socket.io RCStream is done now.
@BBlack prefers varnish misc domain, @GWicke and @mobrovac prefer custom nginx path (or some variant).
Discuss!
We're definitely not doing the custom nginx path-routing thing. It's just too much of an edge-case, and I don't want to have to support that down the road. We also have no existing way to secure that traffic from a remote DC.
Nginx only exists in our stack because varnish lacks TLS support and nginx is one of the only performant options that meets all of our TLS needs. It could (will?) be replaced by one of several different alternative solutions that are being planned down the road (e.g. future TLS support in Varnish itself, ATS, apache->varnish, haproxy->varnish, etc). The current nginx config knows nothing about routing; it only reverse-proxies into the local varnish instance to paper over its lack of a TLS stack, and it really needs to stay that simple for now. Even if we hacked it in, we wouldn't have IPSec covering that traffic from cache DCs (and we can't IPSec into LVS, either).
Another way to think of that is that we only offer a certain menu of standard ways to route traffic from the outside world into the applayer, and bypassing varnish completely isn't currently on that menu.
Going through varnish in pipe-mode is an option we've deployed on cache_misc for the existing stream service because it needs websocket support, and we'll likely do the same (sometime soon?) for phabricator's notification service. We can do that for the new eventstream service as well, and it's what I was expecting.
It's not the most efficient thing to do (piping through varnish "pointlessly" from a technical perspective), but it keeps things standardized and simple, which gives us the flexibility to keep improving architecture down the road. Future efficiency improvements in our traffic routing, to not send requests through pointless software layers or network hops in all sorts of cases besides this one, are planned for the Future, but that's implementation details and that work still has unmet dependencies.
The only real argument here is about whether we deploy piping on cache_text for this stream service (for a specific path within RestBase's /api/rest_v1/, using RB as yet another layer of proxy?), vs using the existing piping we have on cache_misc to eventstream.svc directly.
I don't feel comfortable putting support for such pipes in the high-traffic clusters (cache_text) at this time, though. It adds a lot of risk there. The rationale for wanting to put it in cache_text AFAIK is to proxy eventstream through RestBase instead of using it directly, and using the normal /api/rest_v1/ path namespace and documentation integration and so on. There's also a middle-ground where cache_misc could route eventstream traffic to restbase.svc (rather than eventstream.svc directly), but the public URLs would still be on the separate eventstream.wikimedia.org hostname. I don't think there's much to gain in going down that path, though, vs going direct from cache_misc to eventstream.svc?
From the API product perspective, it would be preferable to integrate event streams into the uniform REST API. The main benefit is in a uniform API layout and documentation, both of which make this API easier to discover and use. Separate service domains are always awkward to document in API portals, and clutter the overall API documentation at the cost of other APIs. This does not mean that the traffic would pass through RESTBase or Varnish in a production setup, but it could as a fall-back for third party users or development. From a user perspective, all that matters is that the documentation and URL schema is uniform.
From the cache infrastructure perspective, I understand and sympathize with the desire to maintain the option of moving away from Nginx, considering that the concrete request is about a currently relatively low volume service. We can make compromises on the product side to accommodate this.
I do wonder though if not having any kind of streaming or non-Varnish routing support will be sustainable longer term. For example, we are working on streaming content composition in ServiceWorkers on behalf of authenticated clients without SW support. By returning the first chunk of (static) content early, time to first byte is significantly reduced. These responses are not cacheable, and perform best when we can stream directly to the client. This service is targeted at /wiki/{Title} on mobile domains. Granted, this is in an early stage right now, but it might also not be the only thing that will require or strongly benefit from streaming & low cache layer overheads.
Change 320781 merged by Ottomata:
Remove codfw production targets, use scb1001 as canary
Change 321940 had a related patch set uploaded (by Ottomata):
Add eventstreams.svc.eqiad.wmnet
Change 320690 merged by Ottomata:
Deploy EventStreams on scb and configure LVS service in eqiad
Change 322721 had a related patch set uploaded (by Ottomata):
Add eventstreams to scb node conftool configuration
Change 322721 merged by Ottomata:
Add eventstreams to scb node conftool configuration
Change 322726 had a related patch set uploaded (by Ottomata):
Add eventstreams to list of lvs realserver ips for scb
Change 322726 merged by Ottomata:
Add eventstreams to list of lvs realserver ips for scb
Change 322732 had a related patch set uploaded (by Ottomata):
Allow lvs service monitoring to specify critical parameter for monitoring::service
Change 322732 merged by Ottomata:
Allow lvs service monitoring to specify critical parameter for monitoring::service
Change 322924 had a related patch set uploaded (by Ottomata):
Fix for evenstreams icinga http lvs alert
Change 322931 had a related patch set uploaded (by Ottomata):
Add eventstreams.svc.codfw.wmnet
Change 322935 had a related patch set uploaded (by Ottomata):
Configure eventstreams in codfw backed by analytics-eqiad Kafka
Change 322935 merged by Ottomata:
Configure eventstreams in codfw backed by analytics-eqiad Kafka
Change 322954 had a related patch set uploaded (by Ottomata):
Add eventstreams.wikimedia.org to cache misc
Change 327046 had a related patch set uploaded (by Ottomata):
Update README with /v2 docs
Change 327113 had a related patch set uploaded (by Ottomata):
Add rdkafka_config deployment var to eventstreams service module and role
Change 327114 had a related patch set uploaded (by Ottomata):
Add rdkafka_config deployment variable to config.yaml.j2 template
Change 327550 had a related patch set uploaded (by BBlack):
cache_misc: stream.wm.o subpathing for eventstreams
Change 328193 had a related patch set uploaded (by BBlack):
TLS: reduce scope of stream.wm.o redirect exception
Change 327114 merged by Ottomata:
Add rdkafka_config deployment variable to config.yaml.j2 template
Change 327113 merged by Ottomata:
Add rdkafka_config deployment var to eventstreams service module and role
Change 329239 had a related patch set uploaded (by Ottomata):
Increment request metrics for particular streams
The cache_misc changes for this are all implemented and live now. The config declaration is now:
```
'stream.wikimedia.org' => {
    'director' => 'eventstreams',
    'caching'  => 'pipe',
    'subpaths' => {
        '^/(socket\.io|rc(stream_status)?)(/|$)' => {
            'director' => 'rcstream',
            'caching'  => 'websockets',
        },
    },
},
```
The legacy HTTPS-enforcement exception also now only applies to the rcstream path regexes; HTTPS should be enforced for other subpaths (the new eventstream stuff).
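One way to sanity-check the subpath regex above: legacy RCStream (socket.io) paths should hit the `rcstream` director, while everything else on stream.wikimedia.org should fall through to `eventstreams`. A quick Python sketch (the `director_for` helper is for illustration only; the real routing happens in VCL):

```python
import re

# Same regex as the cache_misc 'subpaths' key quoted above.
RCSTREAM_RE = re.compile(r'^/(socket\.io|rc(stream_status)?)(/|$)')

def director_for(path):
    """Illustrative: which director a given request path would select."""
    return "rcstream" if RCSTREAM_RE.match(path) else "eventstreams"

assert director_for("/socket.io/") == "rcstream"
assert director_for("/rcstream_status") == "rcstream"
assert director_for("/rc/") == "rcstream"
assert director_for("/v2/stream/recentchange") == "eventstreams"
```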
Change 334393 had a related patch set uploaded (by Ottomata):
Configure recentchange stream endpoint in EventStreams
Change 334393 merged by Ottomata:
Configure recentchange stream endpoint in EventStreams
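EventStreams serves events over HTTP as Server-Sent Events (SSE), so once the recentchange endpoint is configured, a client mostly needs to split the text stream into `event:`/`data:` records. A minimal parsing sketch, assuming JSON-encoded `data:` payloads; the sample payload is illustrative, not actual stream output, and a real client should use a proper SSE library that also handles `id:` (for resuming) and reconnects:

```python
import json

def parse_sse(raw):
    """Minimal SSE parser: split a text stream into (event, data) pairs.
    A blank line terminates each event, per the SSE wire format."""
    events = []
    event, data_lines = None, []
    for line in raw.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "" and data_lines:
            events.append((event, json.loads("\n".join(data_lines))))
            event, data_lines = None, []
    return events

# Illustrative payload shaped roughly like a recentchange message:
sample = 'event: message\ndata: {"wiki": "enwiki", "type": "edit"}\n\n'
for ev, body in parse_sse(sample):
    print(ev, body["wiki"])
# message enwiki
```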
Change 322954 abandoned by Krinkle:
Add eventstreams.wikimedia.org to cache misc
Reason:
Obsolete by I45c960aa609f7. Thanks Brandon!