Set up eventgate-logging-external in production
Closed, Resolved | Public | 21 Estimated Story Points

Description

Helm Chart:

Helm chart TLS support:

LVS for eventgate-logging-external.svc

Discovery for eventgate-logging-external.discovery.wmnet

Public URL for stream.wikimedia.org/producer/logging:

Details

Project                       Branch      Lines +/-
operations/deployment-charts  master      +3 -3
operations/puppet             production  +6 -6
operations/dns                master      +3 -0
operations/puppet             production  +1 -1
operations/deployment-charts  master      +74 -9
operations/puppet             production  +3 -3
operations/puppet             production  +13 -0
operations/deployment-charts  master      +3 -3
operations/deployment-charts  master      +108 -0
operations/deployment-charts  master      +266 -109
operations/dns                master      +2 -0
operations/puppet             production  +5 -0
operations/puppet             production  +1 -1
operations/puppet             production  +7 -6
operations/puppet             production  +51 -14
operations/deployment-charts  master      +115 -106
operations/deployment-charts  master      +12 -0
operations/deployment-charts  master      +110 -101
operations/deployment-charts  master      +92 -3
operations/dns                master      +4 -0
operations/puppet             production  +14 -0
operations/deployment-charts  master      +184 -0
operations/deployment-charts  master      +147 -0

Event Timeline

Heard some preferences from SREs to use a separate domain. I'd like to use the same domain for all the external eventgate instances then. Let's bike shed a name.

We could re-use stream.wikimedia.org. Currently, this serves EventStreams at stream.wikimedia.org/v2/stream. stream.wikimedia.org/v1 is deprecated, but was originally RCStream. I can't quite think of a good place to put eventgate endpoints in stream.wikimedia.org. The EventGate API uses /v1/events. We could do something like /v2/logging/events and /v2/analytics/events, but would using /v2 in the public URI to route to an internal /v1/events URI be confusing? We could use /v1 publicly, but then it sounds like that is deprecated since we have /v2/stream.

So maybe not stream.wikimedia.org. What else? I heard from @Joe he didn't like 'beacon'. Can't say I do either.

Ideas:

  • event(s).wikimedia.org (seems confusable with a real life 'Event' e.g. Hackathon)
  • eventgate.wikimedia.org (don't really like putting the software name here)
  • intake.wikimedia.org
  • inlet.wikimedia.org
  • logging.wikimedia.org (too generic?)

I don't really love any of these. Anybody got any better ideas?

  • stream-intake.wikimedia.org

?

More brain bouncing with Jason. I think we like:

  • stream.wikimedia.org/produce/$instance/* -> eventgate-$instance-external/*

Then all API paths are just forwarded to the instance. Logging client would POST to https://stream.wikimedia.org/produce/logging/v1/events.
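A minimal sketch of the proposed convention, in Python purely for illustration. The function name `route_produce_path` and the `KNOWN_INSTANCES` set are assumptions made for this example; the real routing would live in the caching layer's configuration, not in application code.

```python
from typing import Optional, Tuple

# Assumed instance names for illustration; the thread mentions 'logging' and 'analytics'.
KNOWN_INSTANCES = {"logging", "analytics"}


def route_produce_path(path: str) -> Optional[Tuple[str, str]]:
    """Map /produce/<instance>/<rest> to (backend name, /<rest>).

    Returns None when the path does not follow the convention.
    """
    # A matching path splits into ['', 'produce', '<instance>', '<rest>'].
    parts = path.split("/", 3)
    if len(parts) < 3 or parts[0] != "" or parts[1] != "produce":
        return None
    instance = parts[2]
    if instance not in KNOWN_INSTANCES:
        return None
    # Everything after the instance name is forwarded unchanged.
    rest = "/" + parts[3] if len(parts) == 4 else "/"
    return f"eventgate-{instance}-external", rest
```

Under these assumptions, `route_produce_path("/produce/logging/v1/events")` yields the backend `eventgate-logging-external` with the forwarded path `/v1/events`, while a path such as `/v2/stream` is not matched at all.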

It might also be nice to one day use this convention for EventStreams too, e.g. https://stream.wikimedia.org/consume/v2/stream/revision-create (Maybe we'd need an EventStreams $instance in there too, TBD). Let's defer that idea for later though.

Moving forward with this unless there are objections.

I think this is great because it makes the API actually readable in a meaningful way that maps to our documentation and the concepts/abstractions used by MEP and Kafka.
As long as we continue to name our instances properly, we'll have really nice idiomatic endpoint URLs like

stream.wikimedia.org/produce/logging/v1/events
stream.wikimedia.org/produce/analytics/v1/events

Plus the fact that the URL all the way up to the instance name is purely routing, and everything after that is just the API that the instance uses... definitely a win for clarity.

Change 551247 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Public cache routing for eventgate-logging-external

https://gerrit.wikimedia.org/r/551247

Change 551253 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] TLS envoyproxy support for eventgate chart

https://gerrit.wikimedia.org/r/551253

Change 551263 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] Enable TLS envoyproxy for eventgate-logging-external instances

https://gerrit.wikimedia.org/r/551263

Ok, TLS everywhere, right? Got some new patches up. I'd like to start merging these next week, so I'll summarize them again with proper order of operations (mostly for my own sanity).

We need to update the newly deployed eventgate-logging-external instances to use the TLS envoyproxy. That first.

TLS helm changes:

  1. TLS support in eventgate chart: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/551253
  2. Enable TLS for eventgate-logging-external: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/551263

Deploy eventgate-logging-external with changes.

LVS:

  1. DNS: https://gerrit.wikimedia.org/r/c/operations/dns/+/550914
  2. Puppet: https://gerrit.wikimedia.org/r/c/operations/puppet/+/550922

Discovery:

  1. Puppet: https://gerrit.wikimedia.org/r/c/operations/puppet/+/550923
  2. DNS: https://gerrit.wikimedia.org/r/c/operations/dns/+/550915

Public URL path routing:

  1. https://gerrit.wikimedia.org/r/c/operations/puppet/+/551247

Hmmm...@fgiunchedi do your various Kafka logging producers use Kafka TLS? We haven't done that yet in any EventGate instances...perhaps we should eh?

Yes if you are producing to kafka-logging it'll be all TLS, we did this from the get go since we get traffic to kafka-logging from PoPs too.

More brain bouncing with Jason. I think we like:

  • stream.wikimedia.org/produce/$instance/* -> eventgate-$instance-external/*

Then all API paths are just forwarded to the instance. Logging client would POST to https://stream.wikimedia.org/produce/logging/v1/events.

It might also be nice to one day use this convention for EventStreams too, e.g. https://stream.wikimedia.org/consume/v2/stream/revision-create (Maybe we'd need an EventStreams $instance in there too, TBD). Let's defer that idea for later though.

Moving forward with this unless there are objections.

+1 on my side to route based on instance name

Change 551610 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] Kafka producer TLS support for eventgate charts

https://gerrit.wikimedia.org/r/551610

Another patch: deploying puppet CA cert to eventgate pods for kafka producer TLS:

@Joe @akosiaris @ema I'd like to move forward with these patches this week, hopefully sooner rather than later. Can you find some time to review? I'll add them all to ticket description for easier reference.

Ottomata updated the task description. (Show Details)

Change 550914 merged by Alexandros Kosiaris:
[operations/dns@master] Add LVS entries for eventgate-logging-external

https://gerrit.wikimedia.org/r/550914

@Joe @akosiaris @ema I'd like to move forward with these patches this week, hopefully sooner rather than later. Can you find some time to review? I'll add them all to ticket description for easier reference.

I've reviewed most of them; you should be at least partly unblocked.

Thank you! I like your suggestions on the kafka producer TLS one, will implement. Joe can help with the rest today I think.

Change 551253 merged by Ottomata:
[operations/deployment-charts@master] TLS envoyproxy support for eventgate chart

https://gerrit.wikimedia.org/r/551253

Change 552271 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate 0.0.13 - envoyproxy tls support

https://gerrit.wikimedia.org/r/552271

Change 552271 merged by Ottomata:
[operations/deployment-charts@master] eventgate 0.0.13 - envoyproxy tls support

https://gerrit.wikimedia.org/r/552271

Change 551263 merged by Ottomata:
[operations/deployment-charts@master] Enable TLS envoyproxy for eventgate-logging-external instances

https://gerrit.wikimedia.org/r/551263

Change 552318 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-0.0.15 - Fix services app selector for envoyproxy tls port

https://gerrit.wikimedia.org/r/552318

Change 552318 merged by Ottomata:
[operations/deployment-charts@master] eventgate-0.0.15 - Fix services app selector for envoyproxy tls port

https://gerrit.wikimedia.org/r/552318

Ok thanks for the help today @akosiaris and @Joe, HTTPS via envoyproxy is finally working! I will be off tomorrow but working on Monday. Could yall merge and deploy the LVS and discovery changes by the time I start working Monday morning?

Change 550922 merged by Giuseppe Lavagetto:
[operations/puppet@production] Add LVS for eventgate-logging-external using TLS port

https://gerrit.wikimedia.org/r/550922

Change 553352 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Fix type 'evengate' -> 'eventgate' in conftool-data eqiad

https://gerrit.wikimedia.org/r/553352

Change 553352 merged by Ottomata:
[operations/puppet@production] Fix type 'evengate' -> 'eventgate' in conftool-data eqiad

https://gerrit.wikimedia.org/r/553352

Change 553355 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use https:// for eventgate-logging-external ProxyFetch LVS check

https://gerrit.wikimedia.org/r/553355

Change 553355 merged by Giuseppe Lavagetto:
[operations/puppet@production] Use https:// for eventgate-logging-external ProxyFetch LVS check

https://gerrit.wikimedia.org/r/553355

Change 550923 merged by Giuseppe Lavagetto:
[operations/puppet@production] Add discovery for eventgate-logging-external

https://gerrit.wikimedia.org/r/550923

Change 550915 merged by Giuseppe Lavagetto:
[operations/dns@master] Add discovery entries for eventgate-logging-external

https://gerrit.wikimedia.org/r/550915

Change 551610 merged by Ottomata:
[operations/deployment-charts@master] Kafka producer TLS support for eventgate charts

https://gerrit.wikimedia.org/r/551610

Change 554144 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] Enable Kafka Producer TLS for eventgate-logging-external

https://gerrit.wikimedia.org/r/554144

Change 554144 merged by Ottomata:
[operations/deployment-charts@master] Enable Kafka Producer TLS for eventgate-logging-external

https://gerrit.wikimedia.org/r/554144

Change 554148 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] Use Kafka TLS port for eventgate-logging-external

https://gerrit.wikimedia.org/r/554148

Change 554148 merged by Ottomata:
[operations/deployment-charts@master] Use Kafka TLS port for eventgate-logging-external

https://gerrit.wikimedia.org/r/554148

@akosiaris I merged and applied https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/551610 in staging. My main app isn't coming up, and I suspect it is because it can't reach Kafka at the 9093 TLS port on the logstash hosts. I noticed we don't have the logstash IPv6 entries in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/551610/8/helmfile.d/admin/staging/calico/default-kubernetes-policy.yaml, but do we need/want them?

Mentioned in SAL (#wikimedia-operations) [2019-12-03T08:19:08Z] <akosiaris> apply calico rules for eventgate-logging-external. T236386

@akosiaris I merged and applied https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/551610 in staging. My main app isn't coming up, and I suspect it is because it can't reach Kafka at the 9093 TLS port on the logstash hosts.

Absolutely correct. I had not applied the calico rules. It's now done and the deployment on staging has progressed.

akosiaris@deploy1001:/srv/deployment-charts/helmfile.d/services/staging/eventgate-logging-external$ kubectl get pods -w
NAME                                          READY   STATUS             RESTARTS   AGE
eventgate-logging-external-6d67bbd95b-9qsb4   2/3     CrashLoopBackOff   223        11h
eventgate-logging-external-786db65597-kv5fx   3/3     Running            13         12h
tiller-deploy-5585496747-6xrm6                1/1     Running            0          12h
eventgate-logging-external-6d67bbd95b-9qsb4   2/3   Running   224   11h

note, the next lines are after I apply the change (I haven't interrupted kubectl get pods -w though)

eventgate-logging-external-6d67bbd95b-9qsb4   3/3   Running   224   11h
eventgate-logging-external-786db65597-kv5fx   3/3   Terminating   13    12h
eventgate-logging-external-786db65597-kv5fx   0/3   Terminating   13    12h
eventgate-logging-external-786db65597-kv5fx   0/3   Terminating   13    12h
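A blocked network policy like the one described above can be told apart from an application bug with a plain TCP connection test. The sketch below is a generic check, not something used in this task; the helper name `can_connect` is hypothetical, and the host you would pass is whichever Kafka broker the pod is configured to use (port 9093 is the TLS port mentioned above).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection resolves the name and tries each returned address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

This only checks TCP reachability; it does not perform a TLS handshake or speak the Kafka protocol, so a True result rules out the network policy but not certificate problems.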

I noticed we don't have the logstash IPv6 entries in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/551610/8/helmfile.d/admin/staging/calico/default-kubernetes-policy.yaml, but do we need/want them?

Yes, if the logstash hosts are to be addressed by DNS names, we do.

Change 554295 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] Add IPv6 calico rules for eventgate-logging-external -> kafka

https://gerrit.wikimedia.org/r/554295

Change 551247 merged by Ottomata:
[operations/puppet@production] Public cache routing for eventgate-logging-external

https://gerrit.wikimedia.org/r/551247

Change 554312 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Rename director to eventgate_logging_external

https://gerrit.wikimedia.org/r/554312

Change 554312 merged by Ottomata:
[operations/puppet@production] Rename director to eventgate_logging_external

https://gerrit.wikimedia.org/r/554312

Change 554318 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Route all /produce/logging/* to eventgate-logging-external

https://gerrit.wikimedia.org/r/554318

Sigh, it turns out SRE wants us to not rewrite paths. So if we use path based routing, the app needs to handle whatever comes in from the public request, i.e. EventGate would need to understand /produce/logging/v1/events. I'd prefer not to make EventGate change its API based on the installation of it, so we are back to using domains to route public -> app backend. Back to the domain name bikeshed!

How about:

  • events-logging.wikimedia.org/v1/events
  • events-analytics.wikimedia.org/v1/events

? Is events-logging too confusable with EventLogging? We could do logging-events.wm.org and analytics-events.wm.org?

  • events-<instance>.wikimedia.org/v1/events
  • <instance>-events.wikimedia.org/v1/events
  • intake-<instance>.wikimedia.org/v1/events
  • <instance>-intake.wikimedia.org/v1/events

Where <instance> is either 'logging' or 'analytics'.
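To make the domain-based alternative concrete, here is a small Python sketch of routing on the Host header alone, leaving the request path untouched so the app still sees /v1/events. It uses the `intake-<instance>` pattern only as one of the candidates listed above; the function name `backend_for_host` is made up for this example, and the real mapping would be expressed in the cache layer's configuration.

```python
import re
from typing import Optional

# One candidate naming scheme from the list above: intake-<instance>.wikimedia.org
_HOST_RE = re.compile(r"^intake-(logging|analytics)\.wikimedia\.org$")


def backend_for_host(host: str) -> Optional[str]:
    """Return the backend service name for a public Host header, or None if unknown."""
    match = _HOST_RE.match(host.lower())
    if match is None:
        return None
    # The path is not inspected or rewritten; only the hostname selects the backend.
    return f"eventgate-{match.group(1)}-external"
```

With this scheme no path rewriting is needed: a request to /v1/events on the public host reaches the backend with the same /v1/events path.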

Thoughts @jlinehan? ATM I kinda like intake-logging & intake-analytics.

  • logging-sink & analytics-sink (or sink-*) ?
  • events-<instance>.wikimedia.org/v1/events
  • <instance>-events.wikimedia.org/v1/events
  • intake-<instance>.wikimedia.org/v1/events
  • <instance>-intake.wikimedia.org/v1/events

Where <instance> is either 'logging' or 'analytics'.

ATM I kinda like intake-logging & intake-analytics.

I like those two too, fwiw. Having "events" in the URI twice like that is a little much.

Side question: Is /v1/events plural with the intention that eventually EventGate will support batches of events in the same request? Otherwise wouldn't it make more sense to have /v1/event?

Is /v1/events plural with the intention that eventually EventGate will support batches of events in the same request?

It does support that, just POST an array.
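As a rough illustration of "just POST an array", the Python sketch below builds (but does not send) a request whose body is a JSON array of events. The helper name `build_batch_request`, the example host, and the event fields are all placeholders; only the /v1/events path comes from this thread, and real events would need to follow whatever schema the instance expects.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_batch_request(url: str, events: List[Dict[str, Any]]) -> urllib.request.Request:
    """Build a POST request whose body is a JSON array containing all the events."""
    body = json.dumps(events).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder events; these field names are illustrative, not a real schema.
events = [
    {"message": "first example event"},
    {"message": "second example event"},
]

# Placeholder host; substitute the actual EventGate endpoint.
request = build_batch_request("https://eventgate.example.org/v1/events", events)
```

Passing `request` to `urllib.request.urlopen` would actually send the batch; that step is left out here so the sketch has no network side effects.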

@jlinehan thoughts? I'm considering moving forward with intake-{analytics,logging}.

Hey also, before I go through with this; is there any issue with CORS here? If we go with a separate (non wikimedia) domain, will client side JS have a problem submitting to this domain? I guess not if we use navigator.sendBeacon? Ping @Nuria, @Pchelolo

@jlinehan thoughts? I'm considering moving forward with intake-{analytics,logging}.

Yeah, the lesser of the weevils strikes me as intake-<instance>.wikimedia.org/v1/events also. It's a far cry from what we had before, and now reads pretty weird to me, but let's do our best. Whatever word we use in the public endpoints, I will be trying to use that word consistently in the documentation. So I think everybody is okay with the word intake being what the EventGate thing is, but just making sure. Changing the URL is easy, that's true, but changing the documentation and people's habits can be annoying, hence all the bikeshedding.

Changing the URL is easy

is not really that easy :/ Possible though.

Ooook.

Change 554295 merged by Alexandros Kosiaris:
[operations/deployment-charts@master] Add IPv6 calico rules for eventgate-logging-external -> kafka

https://gerrit.wikimedia.org/r/554295

Change 556411 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/dns@master] Add intake-{logging,analytics}.wikimedia.org

https://gerrit.wikimedia.org/r/556411

Change 554318 abandoned by Ottomata:
Route all /produce/logging/* to eventgate-logging-external

Reason:
Will be using domain names, not paths, to route to external eventgate services.

https://gerrit.wikimedia.org/r/554318

Change 556413 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Public routing from intake-logging.wikimedia.org

https://gerrit.wikimedia.org/r/556413

Change 556411 merged by Ottomata:
[operations/dns@master] Add intake-{logging,analytics}.wikimedia.org

https://gerrit.wikimedia.org/r/556411

Change 556413 merged by Ottomata:
[operations/puppet@production] Public routing from intake-logging.wikimedia.org

https://gerrit.wikimedia.org/r/556413

Change 557104 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] Bump eventgate-logging-external eventgate image to 2019-12-13-200604-production

https://gerrit.wikimedia.org/r/557104

Change 557104 merged by jenkins-bot:
[operations/deployment-charts@master] Bump eventgate-logging-external eventgate image to 2019-12-13-200604-production

https://gerrit.wikimedia.org/r/557104

Ottomata set the point value for this task to 21.
Ottomata moved this task from In Progress to Done on the Analytics-Kanban board.