Set up eventgate-logging-external in production
Open, Medium, Public

Description

Helm Chart:

Helm chart TLS support:

LVS for eventgate-logging-external.svc

Discovery for eventgate-logging-external.discovery.wmnet

Public URL for stream.wikimedia.org/produce/logging:

Details

Related Gerrit Patches:
operations/puppet : production : Route all /produce/logging/* to eventgate-logging-external
operations/puppet : production : Rename director to eventgate_logging_external
operations/puppet : production : Public cache routing for eventgate-logging-external
operations/deployment-charts : master : Add IPv6 calico rules for eventgate-logging-external -> kafka
operations/deployment-charts : master : Use Kafka TLS port for eventgate-logging-external
operations/deployment-charts : master : Enable Kafka Producer TLS for eventgate-logging-external
operations/deployment-charts : master : Kafka producer TLS support for eventgate charts
operations/dns : master : Add discovery entries for eventgate-logging-external
operations/puppet : production : Add discovery for eventgate-logging-external
operations/puppet : production : Use https:// for eventgate-logging-external ProxyFetch LVS check
operations/puppet : production : Fix type 'evengate' -> 'eventgate' in conftool-data eqiad
operations/puppet : production : Add LVS for eventgate-logging-external using TLS port
operations/deployment-charts : master : eventgate-0.0.15 - Fix services app selector for envoyproxy tls port
operations/deployment-charts : master : Enable TLS envoyproxy for eventgate-logging-external instances
operations/deployment-charts : master : eventgate 0.0.13 - envoyproxy tls support
operations/deployment-charts : master : TLS envoyproxy support for eventgate chart
operations/dns : master : Add LVS entries for eventgate-logging-external
operations/puppet : production : k8s: Add eventgate-logging-external stanzas
operations/deployment-charts : master : Namespaces for eventgate-logging-external
operations/deployment-charts : master : Add eventgate-logging-external instance

Event Timeline

ping @akosiaris for namespaces :D

Change 550840 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] Namespaces for eventgate-logging-external

https://gerrit.wikimedia.org/r/550840

Change 550845 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] k8s: Add eventgate-logging-external stanzas

https://gerrit.wikimedia.org/r/550845

Change 550840 merged by jenkins-bot:
[operations/deployment-charts@master] Namespaces for eventgate-logging-external

https://gerrit.wikimedia.org/r/550840

Change 550845 merged by Alexandros Kosiaris:
[operations/puppet@production] k8s: Add eventgate-logging-external stanzas

https://gerrit.wikimedia.org/r/550845

Namespaces and tokens have been created and populated. @Ottomata, you are clear for deployment. I am guessing after that we need LVS, discovery, public endpoint exposing.

akosiaris triaged this task as Medium priority. Thu, Nov 14, 7:11 PM
Ottomata added a subscriber: ema. Edited Thu, Nov 14, 7:40 PM

Yeah. Public endpoint! Which begs the question @fgiunchedi ...what should this endpoint be? Since this will be MediaWiki JS POSTing... do we just need routing from a cache frontend (or from MW?) domain to do the right thing? We'll have the same question for eventgate-analytics-external too.

I'll go ahead and set up the LVS and discovery stuff, but I think we will need routing rules in varnish / ATS to just pass the HTTP request off to the discovery URL, right? Ping @ema @CDanis ?

Change 550914 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/dns@master] Add LVS entries for eventgate-logging-external

https://gerrit.wikimedia.org/r/550914

Change 550915 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/dns@master] Add discovery entries for eventgate-logging-external

https://gerrit.wikimedia.org/r/550915

Change 550922 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add LVS for eventgate-logging-external

https://gerrit.wikimedia.org/r/550922

Change 550923 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add discovery for eventgate-logging-external

https://gerrit.wikimedia.org/r/550923

Yeah. Public endpoint! Which begs the question @fgiunchedi ...what should this endpoint be? Since this will be MediaWiki JS POSTing... do we just need routing from a cache frontend (or from MW?) domain to do the right thing? We'll have the same question for eventgate-analytics-external too.

My expectation was routing from caches without involving MW. If keeping the same domain but a different path is doable and there are no obvious downsides (e.g. will cookies be sent?), then I'd say we should go for it. If not, then a separate (third level) domain will work too, I think, modulo CORS.

Ottomata added a comment. Edited Fri, Nov 15, 2:44 PM

We have to answer this question for eventgate-analytics-external too. If we do a separate domain, perhaps the same for both of them? beacon.wikimedia.org/v1/{logging,analytics}?

Thoughts @jlinehan @Milimetric @Krinkle ?

Ottomata renamed this task from Create new eventgate-logging deployment in k8s with helmfile to Set up eventgate-logging-external in production. Fri, Nov 15, 4:20 PM
Ottomata added a subscriber: Joe.

Heard some preferences from SREs to use a separate domain. I'd like to use the same domain for all the external eventgate instances then. Let's bike shed a name.

We could re-use stream.wikimedia.org. Currently, this hosts EventStreams at stream.wikimedia.org/v2/stream. stream.wikimedia.org/v1 is deprecated, but was originally RCStream. I can't quite think of a good place to put eventgate endpoints in stream.wikimedia.org. The EventGate API uses /v1/events. We could do something like /v2/logging/events and /v2/analytics/events, but would using /v2 in the public URI to route to an internal /v1/events URI be confusing? We could use /v1 publicly, but then it sounds like that is deprecated since we have /v2/stream.

So maybe not stream.wikimedia.org. What else? I heard from @Joe he didn't like 'beacon'. Can't say I do either.

Ideas:

  • event(s).wikimedia.org (seems confusable with a real life 'Event', e.g. a Hackathon)
  • eventgate.wikimedia.org (don't really like putting the software name here)
  • intake.wikimedia.org
  • inlet.wikimedia.org
  • logging.wikimedia.org (too generic?)

I don't really love any of these. Anybody got any better ideas?

  • stream-intake.wikimedia.org

?

More brain bouncing with Jason. I think we like:

  • stream.wikimedia.org/produce/$instance/* -> eventgate-$instance-external/*

Then all API paths are just forwarded to the instance. Logging client would POST to https://stream.wikimedia.org/produce/logging/v1/events.

It might also be nice to one day use this convention for EventStreams too, e.g. https://stream.wikimedia.org/consume/v2/stream/revision-create (Maybe we'd need an EventStreams $instance in there too, TBD). Let's defer that idea for later though.

Moving forward with this unless there are objections.
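As a rough illustration of the convention above, the mapping from a public path to a backend instance could be sketched like this. The function name and return shape are hypothetical; the real routing would live in the cache layer (varnish / ATS), not in Python.

```python
from typing import Optional, Tuple

# Public path prefix from the proposal: /produce/$instance/*
PRODUCE_PREFIX = "/produce/"


def route(public_path: str) -> Optional[Tuple[str, str]]:
    """Map a public path to (backend service name, path forwarded to it).

    Returns None when the path does not follow /produce/$instance/*.
    """
    if not public_path.startswith(PRODUCE_PREFIX):
        return None
    # Split "logging/v1/events" into the instance name and the remainder.
    instance, sep, rest = public_path[len(PRODUCE_PREFIX):].partition("/")
    if not instance or not sep:
        return None
    # Everything after the instance name is passed through unchanged.
    return f"eventgate-{instance}-external", "/" + rest


print(route("/produce/logging/v1/events"))
print(route("/produce/analytics/v1/events"))
```

Note that this strips the /produce/$instance prefix before forwarding, so the backend only ever sees its own API paths such as /v1/events.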

I think this is great because it makes the API actually readable in a meaningful way that maps to our documentation and the concepts/abstractions used by MEP and Kafka.
As long as we continue to name our instances properly, we'll have really nice idiomatic endpoint URLs like

stream.wikimedia.org/produce/logging/v1/events
stream.wikimedia.org/produce/analytics/v1/events

Plus the fact that the URL all the way up to the instance name is purely routing, and everything after that is just the API that the instance uses... definitely a win for clarity.

Change 551247 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Public cache routing for eventgate-logging-external

https://gerrit.wikimedia.org/r/551247

Change 551253 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] TLS envoyproxy support for eventgate chart

https://gerrit.wikimedia.org/r/551253

Change 551263 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] Enable TLS envoyproxy for eventgate-logging-external instances

https://gerrit.wikimedia.org/r/551263

Ok, TLS everywhere, right? Got some new patches up. I'd like to start merging these next week, so I'll summarize them again with proper order of operations (mostly for my own sanity).

We need to update the newly deployed eventgate-logging-external instances to use the TLS envoyproxy. That first.

TLS helm changes:

  1. TLS support in eventgate chart: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/551253
  2. Enable TLS for eventgate-logging-external: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/551263

Deploy eventgate-logging-external with changes.

LVS:

  1. DNS: https://gerrit.wikimedia.org/r/c/operations/dns/+/550914
  2. Puppet: https://gerrit.wikimedia.org/r/c/operations/puppet/+/550922

Discovery:

  1. Puppet: https://gerrit.wikimedia.org/r/c/operations/puppet/+/550923
  2. DNS: https://gerrit.wikimedia.org/r/c/operations/dns/+/550915

Public URL path routing:

  1. https://gerrit.wikimedia.org/r/c/operations/puppet/+/551247

Hmmm...@fgiunchedi do your various Kafka logging producers use Kafka TLS? We haven't done that yet in any EventGate instances...perhaps we should eh?

Yes if you are producing to kafka-logging it'll be all TLS, we did this from the get go since we get traffic to kafka-logging from PoPs too.
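For reference, a TLS-enabled producer in librdkafka terms mostly comes down to three settings. The sketch below only builds the settings dictionary; the broker hostnames and the CA path are placeholders, and how the eventgate chart actually exposes these options is not shown in this thread.

```python
from typing import Dict, List


def kafka_tls_producer_config(brokers: List[str], ca_path: str) -> Dict[str, str]:
    """Return librdkafka-style settings for a producer that speaks TLS to the brokers."""
    return {
        # Brokers must be addressed on their TLS listener port.
        "bootstrap.servers": ",".join(brokers),
        # Encrypt the connection (no SASL authentication in this sketch).
        "security.protocol": "ssl",
        # CA certificate used to verify the brokers' certificates.
        "ssl.ca.location": ca_path,
    }


# Placeholder hostnames and CA path, for illustration only.
config = kafka_tls_producer_config(
    ["kafka-broker1.example.org:9093", "kafka-broker2.example.org:9093"],
    "/etc/ssl/certs/example-ca.pem",
)
print(config)
```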

More brain bouncing with Jason. I think we like:

  • stream.wikimedia.org/produce/$instance/* -> eventgate-$instance-external/*

Then all API paths are just forwarded to the instance. Logging client would POST to https://stream.wikimedia.org/produce/logging/v1/events.
It might also be nice to one day use this convention for EventStreams too, e.g. https://stream.wikimedia.org/consume/v2/stream/revision-create (Maybe we'd need an EventStreams $instance in there too, TBD). Let's defer that idea for later though.
Moving forward with this unless there are objections.

+1 on my side to route based on instance name

Change 551610 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] Kafka producer TLS support for eventgate charts

https://gerrit.wikimedia.org/r/551610

Another patch: deploying puppet CA cert to eventgate pods for kafka producer TLS:

Restricted Application added projects: Operations, Services. Mon, Nov 18, 7:02 PM

@Joe @akosiaris @ema I'd like to move forward with these patches this week, hopefully sooner rather than later. Can you find some time to review? I'll add them all to ticket description for easier reference.

Ottomata updated the task description. Mon, Nov 18, 7:14 PM
Ottomata added a subscriber: BBlack.
Ottomata updated the task description. Mon, Nov 18, 7:16 PM
Ottomata updated the task description.
Ottomata updated the task description. Mon, Nov 18, 7:20 PM
mpopov added a subscriber: mpopov. Tue, Nov 19, 9:15 PM

Change 550914 merged by Alexandros Kosiaris:
[operations/dns@master] Add LVS entries for eventgate-logging-external

https://gerrit.wikimedia.org/r/550914

Ottomata updated the task description. Wed, Nov 20, 3:08 PM

@Joe @akosiaris @ema I'd like to move forward with these patches this week, hopefully sooner rather than later. Can you find some time to review? I'll add them all to ticket description for easier reference.

I've reviewed most of them, at least partly; you should be unblocked.

Thank you! I like your suggestions on the kafka producer TLS one, will implement. Joe can help with the rest today I think.

Change 551253 merged by Ottomata:
[operations/deployment-charts@master] TLS envoyproxy support for eventgate chart

https://gerrit.wikimedia.org/r/551253

Change 552271 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate 0.0.13 - envoyproxy tls support

https://gerrit.wikimedia.org/r/552271

Change 552271 merged by Ottomata:
[operations/deployment-charts@master] eventgate 0.0.13 - envoyproxy tls support

https://gerrit.wikimedia.org/r/552271

Change 551263 merged by Ottomata:
[operations/deployment-charts@master] Enable TLS envoyproxy for eventgate-logging-external instances

https://gerrit.wikimedia.org/r/551263

Change 552318 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-0.0.15 - Fix services app selector for envoyproxy tls port

https://gerrit.wikimedia.org/r/552318

Change 552318 merged by Ottomata:
[operations/deployment-charts@master] eventgate-0.0.15 - Fix services app selector for envoyproxy tls port

https://gerrit.wikimedia.org/r/552318

Ottomata updated the task description. Thu, Nov 21, 6:45 PM

Ok, thanks for the help today @akosiaris and @Joe, HTTPS via envoyproxy is finally working! I will be off tomorrow but working on Monday. Could y'all merge and deploy the LVS and discovery changes by the time I start working Monday morning?

Change 550922 merged by Giuseppe Lavagetto:
[operations/puppet@production] Add LVS for eventgate-logging-external using TLS port

https://gerrit.wikimedia.org/r/550922

Change 553352 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Fix type 'evengate' -> 'eventgate' in conftool-data eqiad

https://gerrit.wikimedia.org/r/553352

Change 553352 merged by Ottomata:
[operations/puppet@production] Fix type 'evengate' -> 'eventgate' in conftool-data eqiad

https://gerrit.wikimedia.org/r/553352

Change 553355 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use https:// for eventgate-logging-external ProxyFetch LVS check

https://gerrit.wikimedia.org/r/553355

Change 553355 merged by Giuseppe Lavagetto:
[operations/puppet@production] Use https:// for eventgate-logging-external ProxyFetch LVS check

https://gerrit.wikimedia.org/r/553355

Ottomata updated the task description. Wed, Nov 27, 4:29 PM

Change 550923 merged by Giuseppe Lavagetto:
[operations/puppet@production] Add discovery for eventgate-logging-external

https://gerrit.wikimedia.org/r/550923

Change 550915 merged by Giuseppe Lavagetto:
[operations/dns@master] Add discovery entries for eventgate-logging-external

https://gerrit.wikimedia.org/r/550915

Ottomata updated the task description. Fri, Nov 29, 3:39 PM

Change 551610 merged by Ottomata:
[operations/deployment-charts@master] Kafka producer TLS support for eventgate charts

https://gerrit.wikimedia.org/r/551610

Change 554144 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] Enable Kafka Producer TLS for eventgate-logging-external

https://gerrit.wikimedia.org/r/554144

Change 554144 merged by Ottomata:
[operations/deployment-charts@master] Enable Kafka Producer TLS for eventgate-logging-external

https://gerrit.wikimedia.org/r/554144

Change 554148 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] Use Kafka TLS port for eventgate-logging-external

https://gerrit.wikimedia.org/r/554148

Change 554148 merged by Ottomata:
[operations/deployment-charts@master] Use Kafka TLS port for eventgate-logging-external

https://gerrit.wikimedia.org/r/554148

@akosiaris I merged and applied https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/551610 in staging. My main app isn't coming up, and I suspect it is because it can't reach Kafka at the 9093 TLS port on the logstash hosts. I noticed we don't have the logstash IPv6 entries in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/551610/8/helmfile.d/admin/staging/calico/default-kubernetes-policy.yaml, but do we need/want them?
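A quick way to test a suspicion like this from inside a pod is a plain TCP connect to the broker port. The helper below is a minimal sketch; the function name is made up here, and it only checks that the port is reachable, not that the TLS handshake or Kafka protocol works.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example call (placeholder hostname):
#   can_connect("kafka-logging-host.example.org", 9093)
```

If egress to the port is blocked by network policy, the connection attempt typically times out rather than being refused, so the call returns False only after the timeout elapses.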

Mentioned in SAL (#wikimedia-operations) [2019-12-03T08:19:08Z] <akosiaris> apply calico rules for eventgate-logging-external. T236386

@akosiaris I merged and applied https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/551610 in staging. My main app isn't coming up, and I suspect it is because it can't reach Kafka at the 9093 TLS port on the logstash hosts.

Absolutely correct. I had not applied the calico rules. It's now done and the deployment on staging has progressed.

akosiaris@deploy1001:/srv/deployment-charts/helmfile.d/services/staging/eventgate-logging-external$ kubectl get pods -w
NAME                                          READY   STATUS             RESTARTS   AGE
eventgate-logging-external-6d67bbd95b-9qsb4   2/3     CrashLoopBackOff   223        11h
eventgate-logging-external-786db65597-kv5fx   3/3     Running            13         12h
tiller-deploy-5585496747-6xrm6                1/1     Running            0          12h
eventgate-logging-external-6d67bbd95b-9qsb4   2/3   Running   224   11h

Note: the next lines are from after I applied the change (I hadn't interrupted kubectl get pods -w, though).

eventgate-logging-external-6d67bbd95b-9qsb4   3/3   Running   224   11h
eventgate-logging-external-786db65597-kv5fx   3/3   Terminating   13    12h
eventgate-logging-external-786db65597-kv5fx   0/3   Terminating   13    12h
eventgate-logging-external-786db65597-kv5fx   0/3   Terminating   13    12h

I noticed we don't have the logstash IPv6 entries in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/551610/8/helmfile.d/admin/staging/calico/default-kubernetes-policy.yaml, but do we need/want them?

Yes, if the logstash hosts are to be addressed by DNS names, we do.
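The reasoning, as far as I understand it: a DNS name can resolve to both an IPv4 and an IPv6 address, and the client may pick either, so an egress policy that lists only the IPv4 addresses can still block the connection. A small sketch of turning both address families into single-host CIDRs for such a policy follows; the addresses are documentation-range placeholders, not the real logstash hosts.

```python
import ipaddress
from typing import Iterable, List


def egress_cidrs(addresses: Iterable[str]) -> List[str]:
    """Turn host addresses into single-host CIDRs (/32 for IPv4, /128 for IPv6)."""
    cidrs = []
    for addr in addresses:
        ip = ipaddress.ip_address(addr)  # raises ValueError on malformed input
        cidrs.append(f"{ip}/{ip.max_prefixlen}")
    return cidrs


# Placeholder addresses from the documentation ranges.
print(egress_cidrs(["192.0.2.10", "2001:db8::10"]))
```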

Change 554295 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] Add IPv6 calico rules for eventgate-logging-external -> kafka

https://gerrit.wikimedia.org/r/554295

Ottomata updated the task description. Tue, Dec 3, 2:24 PM

Change 551247 merged by Ottomata:
[operations/puppet@production] Public cache routing for eventgate-logging-external

https://gerrit.wikimedia.org/r/551247

Change 554312 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Rename director to eventgate_logging_external

https://gerrit.wikimedia.org/r/554312

Change 554312 merged by Ottomata:
[operations/puppet@production] Rename director to eventgate_logging_external

https://gerrit.wikimedia.org/r/554312

Change 554318 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Route all /produce/logging/* to eventgate-logging-external

https://gerrit.wikimedia.org/r/554318

Sigh, it turns out SRE wants us to not rewrite paths. So if we use path based routing, the app needs to handle whatever comes in from the public request, i.e. EventGate would need to understand /produce/logging/v1/events. I'd prefer not to make EventGate change its API based on the installation of it, so we are back to using domains to route public -> app backend. Back to the domain name bikeshed!

How about:

  • events-logging.wikimedia.org/v1/events
  • events-analytics.wikimedia.org/v1/events

? Is events-logging too easily confused with EventLogging? We could do logging-events.wm.org and analytics-events.wm.org?

  • events-<instance>.wikimedia.org/v1/events
  • <instance>-events.wikimedia.org/v1/events
  • intake-<instance>.wikimedia.org/v1/events
  • <instance>-intake.wikimedia.org/v1/events

Where <instance> is either 'logging' or 'analytics'.

Thoughts @jlinehan? ATM I kinda like intake-logging & intake-analytics.
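To make the difference from path-based routing concrete, here is a minimal sketch of routing on the Host header alone, using the intake-<instance> naming as an example. The function and constant names are hypothetical; the point is that the request path reaches EventGate exactly as the client sent it.

```python
from typing import Optional, Tuple

HOST_PREFIX = "intake-"
DOMAIN_SUFFIX = ".wikimedia.org"
KNOWN_INSTANCES = ("logging", "analytics")


def route_by_host(host: str, path: str) -> Optional[Tuple[str, str]]:
    """Pick a backend from the Host header; the path is forwarded untouched."""
    name = host.lower()
    if not (name.startswith(HOST_PREFIX) and name.endswith(DOMAIN_SUFFIX)):
        return None
    # "intake-logging.wikimedia.org" -> "logging"
    instance = name[len(HOST_PREFIX):-len(DOMAIN_SUFFIX)]
    if instance not in KNOWN_INSTANCES:
        return None
    return f"eventgate-{instance}-external", path


print(route_by_host("intake-logging.wikimedia.org", "/v1/events"))
```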

  • logging-sink & analytics-sink (or sink-*) ?

mpopov added a comment. Thu, Dec 5, 7:50 PM

  • events-<instance>.wikimedia.org/v1/events
  • <instance>-events.wikimedia.org/v1/events
  • intake-<instance>.wikimedia.org/v1/events
  • <instance>-intake.wikimedia.org/v1/events

Where <instance> is either 'logging' or 'analytics'.
ATM I kinda like intake-logging & intake-analytics.

I like those two too, fwiw. Having "events" in the URI twice like that is a little much.

Side question: Is /v1/events plural with the intention that eventually EventGate will support batches of events in the same request? Otherwise wouldn't it make more sense to have /v1/event?

Is /v1/events plural with the intention that eventually EventGate will support batches of events in the same request?

It does support that, just POST an array.
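A batch request could be built roughly like this. The sketch only constructs the request and does not send it. The URL uses one of the candidate names from this thread and is not a settled endpoint, and the event fields (`$schema`, `meta.stream`, `message`) and their values are placeholders assumed for illustration.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_batch_request(url: str, events: List[Dict[str, Any]]) -> urllib.request.Request:
    """Build (but do not send) a POST whose JSON body is an array of events."""
    body = json.dumps(events).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Placeholder events; schema and stream names are made up for this example.
events = [
    {"$schema": "/example/log/1.0.0", "meta": {"stream": "example.log"}, "message": "first"},
    {"$schema": "/example/log/1.0.0", "meta": {"stream": "example.log"}, "message": "second"},
]

# Hypothetical URL, based on one of the names proposed above.
request = build_batch_request("https://intake-logging.wikimedia.org/v1/events", events)
print(request.get_method(), request.full_url, len(json.loads(request.data)))
```

A single event would be sent the same way with a JSON object instead of an array as the body.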