Productionize and deploy Public EventStreams
Closed, Resolved | Public | 8 Estimated Story Points

Description

This includes puppetizing the svc, https, routing and lvs in prod.

Details

Repo                                    Branch      Lines +/-
operations/puppet                       production  +14 -0
operations/puppet                       production  +17 -4
operations/puppet                       production  +12 -2
operations/puppet                       production  +4 -2
mediawiki/services/eventstreams         master      +4 -0
operations/puppet                       production  +22 -10
mediawiki/services/eventstreams/deploy  master      +3 -0
mediawiki/services/eventstreams         master      +4 -4
mediawiki/services/eventstreams         master      +2 -2
operations/puppet                       production  +29 -5
operations/dns                          master      +2 -0
operations/puppet                       production  +0 -1
operations/puppet                       production  +4 -0
operations/puppet                       production  +1 -0
operations/puppet                       production  +4 -4
operations/puppet                       production  +54 -8
operations/dns                          master      +3 -0
mediawiki/services/eventstreams/deploy  master      +1 -5

Event Timeline

Ottomata renamed this task from Productionize Public Event Stream Prototype to Productionize Public EventStreams. (Nov 9 2016, 8:55 PM)
Ottomata updated the task description.
Ottomata changed the point value for this task from 0 to 8.
Ottomata edited projects, added Analytics-Kanban; removed Analytics.

How to get this into REST API, and at what path?

/api/rest_v1/stream/* -> eventstreams.svc.$site.wmnet/v1/stream/*
This needs to be done in some nginx or varnish VCL config in puppet.
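
To make the proposed mapping concrete, here is a minimal Python sketch of the rewrite. It is only an illustration: the real rule would be written in nginx or varnish VCL config in puppet, and the helper name `rewrite_stream_url` and the constants are hypothetical.

```python
from typing import Optional

# Public prefix and backend prefix taken from the mapping proposed above.
PUBLIC_PREFIX = "/api/rest_v1/stream/"
BACKEND_PREFIX = "/v1/stream/"


def rewrite_stream_url(path: str, site: str) -> Optional[str]:
    """Map a public REST API stream path to the internal service URL.

    Hypothetical helper: returns None for paths outside the stream
    namespace, so that other REST API routes are left untouched.
    """
    if not path.startswith(PUBLIC_PREFIX):
        return None
    stream = path[len(PUBLIC_PREFIX):]
    # $site in the mapping stands for the datacenter, e.g. eqiad or codfw.
    return f"eventstreams.svc.{site}.wmnet{BACKEND_PREFIX}{stream}"
```

For example, `rewrite_stream_url("/api/rest_v1/stream/recentchange", "eqiad")` gives `eventstreams.svc.eqiad.wmnet/v1/stream/recentchange`.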

Ottomata renamed this task from Productionize Public EventStreams to Productionize and deploy Public EventStreams. (Nov 9 2016, 9:26 PM)
Ottomata updated the task description.

Hm, possibly the domain / path at which this will be hosted is controversial. To keep things centralized, let's discuss on the parent ticket.

Change 320690 had a related patch set uploaded (by Ottomata):
Deploy EventStreams on scb and configure LVS service in eqiad

https://gerrit.wikimedia.org/r/320690

Change 320781 had a related patch set uploaded (by Ottomata):
Remove codfw production targets, use scb1001 as canary

https://gerrit.wikimedia.org/r/320781

(Changed my mind again, let's discuss the domain vs path vs nginx vs varnish routing stuff here.)

Over in https://gerrit.wikimedia.org/r/#/c/320690/, I'm preparing LVS and service deployment for EventStreams. We need to decide how a request will actually make it to this service. We can't host this in varnish cache_text, since maintaining an open pipe there will hold up a precious varnish thread on a high volume cache cluster. Other options:

  • custom nginx path: /api/rest_v1/stream/* with special routing rule in nginx, bypassing varnish altogether
  • varnish misc domain: (event)?streams.wikimedia.org/v1/stream/* routed by cache_misc varnish, as socket.io RCStream is done now.

@BBlack prefers varnish misc domain, @GWicke and @mobrovac prefer custom nginx path (or some variant).

Discuss!

We're definitely not doing the custom nginx path-routing thing. It's just too much of an edge-case, and I don't want to have to support that down the road. We also have no existing way to secure that traffic from a remote DC.

Nginx only exists in our stack because varnish lacks TLS support and nginx is one of the only performant options that meets all of our TLS needs. It could (will?) be replaced by one of several different alternative solutions that are being planned down the road (e.g. future TLS support in Varnish itself, ATS, apache->varnish, haproxy->varnish, etc). The current nginx config knows nothing about routing; it only reverse-proxies into the local varnish instance to paper over its lack of a TLS stack, and it really needs to stay that simple for now. Even if we hacked it in, we wouldn't have IPSec covering that traffic from cache DCs (and we can't IPSec into LVS, either).

Another way to think of that is that we only offer a certain menu of standard ways to route traffic from the outside world into the applayer, and bypassing varnish completely isn't currently on that menu.

Going through varnish in pipe-mode is an option we've deployed on cache_misc for the existing stream service because it needs websocket support, and we'll likely do the same (sometime soon?) for phabricator's notification service. We can do that for the new eventstream service as well, and it's what I was expecting.

It's not the most efficient thing to do (piping through varnish "pointlessly" from a technical perspective), but it keeps things standardized and simple, which gives us the flexibility to keep improving the architecture down the road. Efficiency improvements in our traffic routing, so that requests don't pass through pointless software layers or network hops in all sorts of cases besides this one, are planned, but those are implementation details and that work still has unmet dependencies.

The only real argument here is about whether we deploy piping on cache_text for this stream service (for a specific path within RestBase's /api/rest_v1/, using RB as yet another layer of proxy?), vs using the existing piping we have on cache_misc to eventstreams.svc directly.

I don't feel comfortable putting support for such pipes in the high-traffic clusters (cache_text) at this time, though. It adds a lot of risk there. The rationale for wanting to put it in cache_text, AFAIK, is to proxy eventstreams through RestBase instead of using it directly, and to use the normal /api/rest_v1/ path namespace, documentation integration, and so on. There's also a middle ground where cache_misc could route eventstreams traffic to restbase.svc (rather than eventstreams.svc directly), but the public URLs would still be on the separate eventstreams.wikimedia.org hostname. I don't think there's much to gain in going down that path, though, vs going direct from cache_misc to eventstreams.svc?

From the API product perspective, it would be preferable to integrate event streams into the uniform REST API. The main benefit is a uniform API layout and documentation, both of which make this API easier to discover and use. Separate service domains are always awkward to document in API portals, and clutter the overall API documentation at the cost of other APIs. This does not mean that the traffic would pass through RESTBase or Varnish in a production setup, but it could as a fallback for third-party users or development. From a user perspective, all that matters is that the documentation and URL schema are uniform.

From the cache infrastructure perspective, I understand and sympathize with the desire to maintain the option of moving away from Nginx, considering that the concrete request is about a currently relatively low volume service. We can make compromises on the product side to accommodate this.

I do wonder though if not having any kind of streaming or non-Varnish routing support will be sustainable longer term. For example, we are working on streaming content composition in ServiceWorkers on behalf of authenticated clients without SW support. By returning the first chunk of (static) content early, time to first byte is significantly reduced. These responses are not cacheable, and perform best when we can stream directly to the client. This service is targeted at /wiki/{Title} on mobile domains. Granted, this is in an early stage right now, but it might also not be the only thing that will require or strongly benefit from streaming & low cache layer overheads.

Change 320781 merged by Ottomata:
Remove codfw production targets, use scb1001 as canary

https://gerrit.wikimedia.org/r/320781

Change 321940 had a related patch set uploaded (by Ottomata):
Add eventstreams.svc.eqiad.wmnet

https://gerrit.wikimedia.org/r/321940

Change 321940 merged by Alexandros Kosiaris:
Add eventstreams.svc.eqiad.wmnet

https://gerrit.wikimedia.org/r/321940

Change 320690 merged by Ottomata:
Deploy EventStreams on scb and configure LVS service in eqiad

https://gerrit.wikimedia.org/r/320690

Change 322721 had a related patch set uploaded (by Ottomata):
Add eventstreams to scb node conftool configuration

https://gerrit.wikimedia.org/r/322721

Change 322721 merged by Ottomata:
Add eventstreams to scb node conftool configuration

https://gerrit.wikimedia.org/r/322721

Change 322726 had a related patch set uploaded (by Ottomata):
Add eventstreams to list of lvs realserver ips for scb

https://gerrit.wikimedia.org/r/322726

Change 322726 merged by Ottomata:
Add eventstreams to list of lvs realserver ips for scb

https://gerrit.wikimedia.org/r/322726

Change 322732 had a related patch set uploaded (by Ottomata):
Allow lvs service monitoring to specify critical parameter for monitoring::service

https://gerrit.wikimedia.org/r/322732

Change 322732 merged by Ottomata:
Allow lvs service monitoring to specify critical parameter for monitoring::service

https://gerrit.wikimedia.org/r/322732

Change 322924 had a related patch set uploaded (by Ottomata):
Fix for evenstreams icinga http lvs alert

https://gerrit.wikimedia.org/r/322924

Change 322924 merged by Ottomata:
Fix for evenstreams icinga http lvs alert

https://gerrit.wikimedia.org/r/322924

Change 322931 had a related patch set uploaded (by Ottomata):
Add eventstreams.svc.codfw.wmnet

https://gerrit.wikimedia.org/r/322931

Change 322931 merged by Ottomata:
Add eventstreams.svc.codfw.wmnet

https://gerrit.wikimedia.org/r/322931

Change 322935 had a related patch set uploaded (by Ottomata):
Configure eventstreams in codfw backed by analytics-eqiad Kafka

https://gerrit.wikimedia.org/r/322935

Change 322935 merged by Ottomata:
Configure eventstreams in codfw backed by analytics-eqiad Kafka

https://gerrit.wikimedia.org/r/322935

Change 322954 had a related patch set uploaded (by Ottomata):
Add eventstreams.wikimedia.org to cache misc

https://gerrit.wikimedia.org/r/322954

Change 327044 had a related patch set uploaded (by Ottomata):
Move /v1 to /v2

https://gerrit.wikimedia.org/r/327044

Change 327046 had a related patch set uploaded (by Ottomata):
Update README with /v2 docs

https://gerrit.wikimedia.org/r/327046

Change 327046 merged by Ottomata:
Update README with /v2 docs

https://gerrit.wikimedia.org/r/327046

Change 327113 had a related patch set uploaded (by Ottomata):
Add rdkafka_config deployment var to eventstreams service module and role

https://gerrit.wikimedia.org/r/327113

Change 327114 had a related patch set uploaded (by Ottomata):
Add rdkafka_config deployment variable to config.yaml.j2 template

https://gerrit.wikimedia.org/r/327114

Change 327550 had a related patch set uploaded (by BBlack):
cache_misc: stream.wm.o subpathing for eventstreams

https://gerrit.wikimedia.org/r/327550

Change 328193 had a related patch set uploaded (by BBlack):
TLS: reduce scope of stream.wm.o redirect exception

https://gerrit.wikimedia.org/r/328193

Change 327114 merged by Ottomata:
Add rdkafka_config deployment variable to config.yaml.j2 template

https://gerrit.wikimedia.org/r/327114

Change 327113 merged by Ottomata:
Add rdkafka_config deployment var to eventstreams service module and role

https://gerrit.wikimedia.org/r/327113

Change 329239 had a related patch set uploaded (by Ottomata):
Increment request metrics for particular streams

https://gerrit.wikimedia.org/r/329239

Change 329239 merged by Ottomata:
Increment request metrics streams

https://gerrit.wikimedia.org/r/329239

Change 328193 merged by BBlack:
TLS: reduce scope of stream.wm.o redirect exception

https://gerrit.wikimedia.org/r/328193

Change 327550 merged by BBlack:
cache_misc: stream.wm.o subpathing for eventstreams

https://gerrit.wikimedia.org/r/327550

The cache_misc changes for this are all implemented and live now. The config declaration is now:

'stream.wikimedia.org'               => {
    'director' => 'eventstreams',
    'caching'  => 'pipe',
    'subpaths' => {
        '^/(socket\.io|rc(stream_status)?)(/|$)' => {
            'director' => 'rcstream',
            'caching'  => 'websockets',
        },
    },
},

The legacy HTTPS-enforcement exception also now only applies to the rcstream path regexes; HTTPS should be enforced for other subpaths (the new eventstream stuff).
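
As a rough illustration of how the declaration above splits traffic, the following Python sketch applies the same subpath regex to a request path. The function names and example paths are hypothetical, the path is assumed to be passed without its query string, and using the same regex for the HTTPS exception is an assumption based on the sentence above; the actual behaviour is implemented in the generated varnish VCL, not in Python.

```python
import re

# Regex copied from the 'subpaths' key of the cache_misc declaration.
RCSTREAM_SUBPATH = re.compile(r"^/(socket\.io|rc(stream_status)?)(/|$)")


def select_backend(path):
    """Return (director, caching) for a path on stream.wikimedia.org."""
    if RCSTREAM_SUBPATH.search(path):
        # Legacy RCStream endpoints keep websocket handling.
        return ("rcstream", "websockets")
    # Everything else falls through to the host-level default.
    return ("eventstreams", "pipe")


def https_enforced(path):
    """Assumption: only the rcstream subpaths keep the legacy HTTPS exception."""
    return RCSTREAM_SUBPATH.search(path) is None
```

Under these assumptions, `/socket.io/1/`, `/rc` and `/rcstream_status` go to rcstream, while a path such as `/v2/stream/recentchange` (or `/rcfoo`, which does not match because `rc` must be followed by `/` or the end of the path) goes to eventstreams in pipe mode with HTTPS enforced.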

YESSSSSSSSSSSSSSSSS awesome! Thank you!

Change 334393 had a related patch set uploaded (by Ottomata):
Configure recentchange stream endpoint in EventStreams

https://gerrit.wikimedia.org/r/334393

Change 334393 merged by Ottomata:
Configure recentchange stream endpoint in EventStreams

https://gerrit.wikimedia.org/r/334393

Change 322954 abandoned by Krinkle:
Add eventstreams.wikimedia.org to cache misc

Reason:
Obsolete by I45c960aa609f7. Thanks Brandon!

https://gerrit.wikimedia.org/r/322954