Make eventstreams-internal available to WMF staff without an ssh tunnel
Closed, ResolvedPublic

Description

eventstreams-internal is a non-public deployment of EventStreams that has access to all Event Platform streams. It can currently be accessed only by users with a WMF production ssh account, via ssh tunneling.

Exposing this at a public domain with proper auth would allow Data Platform users to explore stream documentation and schemas using OpenAPI docs (e.g. this), as well as view live stream data in their browsers.
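
EventStreams delivers events over Server-Sent Events (SSE), which is what makes viewing live data in a browser possible. As a rough illustration of that wire format, here is a minimal sketch of an SSE parser. The sample payload and the stream name `example.stream` are made up for the example, and how an authenticated client would obtain the lines is deliberately left out.

```python
import json
from typing import Dict, Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per Server-Sent Event from an iterable of text lines.

    Each event is a run of "field: value" lines terminated by a blank line.
    Repeated "data" lines are joined with a newline.
    """
    event: Dict[str, str] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if event:
                yield event
                event = {}
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "data" and "data" in event:
            event["data"] += "\n" + value
        else:
            event[field] = value
    # An event without a terminating blank line is intentionally dropped.


# Hypothetical sample input; real events carry a full JSON document in "data".
sample = [
    ": keep-alive\n",
    "event: message\n",
    'data: {"meta": {"stream": "example.stream"}}\n',
    "\n",
]
events = list(parse_sse(sample))
payload = json.loads(events[0]["data"])
print(payload["meta"]["stream"])  # prints: example.stream
```

In practice the lines would come from a streaming HTTP response body rather than a list.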


Original ticket description from Luca:

While deploying the new version of eventstreams, I noticed that the internal endpoint seems not to have been used in ages:

https://grafana.wikimedia.org/d/znIuUcsWz/eventstreams?orgId=1&refresh=1m&var-dc=codfw%20prometheus%2Fk8s&var-service=eventstreams-internal&from=now-6M&to=now

From the logs on logstash (both eqiad and codfw) I don't see anything relevant either. Maybe I am missing something, but I am wondering if we should undeploy it to simplify maintenance.

Let me know!

Event Timeline

Ottomata renamed this task from "Traffic for eventstreams-internal seems to be zero for the past months" to "Make eventstreams-internal available to WMF staff without an ssh tunnel". Oct 25 2024, 1:43 PM
Ottomata updated the task description.

@BTullis do you think it would be possible to add authentication and a public domain to this service?

Yes, I think that would be quite feasible. We could use the CAS/SSO implementation and authenticate to it using the OIDC protocol, as we already do with Superset, Airflow, DataHub and MPIC.
There would need to be an LDAP group to which the rights would be given, equivalent to the analytics-privatedata-users POSIX group, I suppose. We are already configuring a number of new LDAP groups with cross-checking in T375729, so I can't see a big problem with this part.
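
To make the proposal a bit more concrete, here is a minimal sketch of the two pieces involved: building a standard OIDC authorization-code request, and checking group membership in the returned claims. The endpoint URLs, the client ID, the group name `eventstreams-internal-users` and the `groups` claim name are all placeholders chosen for illustration, not the real IdP configuration.

```python
from typing import Any, Mapping
from urllib.parse import parse_qs, urlencode, urlsplit


def build_authorization_url(authorize_endpoint: str, client_id: str,
                            redirect_uri: str, state: str) -> str:
    """Build an OIDC authorization-code request URL."""
    query = urlencode({
        "response_type": "code",   # authorization code flow
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": "openid profile",
        "state": state,            # opaque value echoed back to the client
    })
    return f"{authorize_endpoint}?{query}"


def is_authorized(claims: Mapping[str, Any], required_group: str,
                  groups_claim: str = "groups") -> bool:
    """Return True if the token claims list the required group.

    The claim name is an assumption; an IdP may expose groups differently.
    """
    groups = claims.get(groups_claim, [])
    if isinstance(groups, str):
        groups = [groups]
    return required_group in groups


# All values below are hypothetical.
url = build_authorization_url(
    "https://idp.example.org/oidc/authorize",
    "eventstreams-internal",
    "https://stream.example.org/callback",
    "xyz123",
)
params = parse_qs(urlsplit(url).query)
allowed = is_authorized({"groups": ["eventstreams-internal-users"]},
                        "eventstreams-internal-users")
```

Whatever component ends up doing the authentication would perform the equivalent of `is_authorized` after the login completes.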

Then the next question would be where to do the authentication.
We could either:

...or

  • add an authenticating reverse proxy using envoy or some other kind of service.

At first glance, the envoy based solution looks pretty neat and tidy, given that we already have envoy installed in every pod.
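
For readers unfamiliar with the pattern, an authenticating reverse proxy boils down to one decision per request: forward it upstream if the caller has a valid session, otherwise redirect to the login page. A minimal sketch of that decision follows, independent of envoy or any other specific proxy. The cookie name, the session store and the `service` query parameter (borrowed from the CAS login convention) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Set
from urllib.parse import quote


@dataclass(frozen=True)
class Decision:
    status: int                     # status the proxy returns or passes on
    location: Optional[str] = None  # redirect target, if any
    forward: bool = False           # True when the request goes upstream


def decide(cookies: Mapping[str, str], valid_sessions: Set[str],
           login_url: str, requested_url: str) -> Decision:
    """Forward authenticated requests, redirect everything else to login."""
    session = cookies.get("session")  # hypothetical cookie name
    if session in valid_sessions:
        return Decision(status=200, forward=True)
    # Remember where the user wanted to go so the IdP can send them back.
    target = f"{login_url}?service={quote(requested_url, safe='')}"
    return Decision(status=302, location=target)


# Hypothetical URLs and session IDs.
sessions = {"abc"}
anonymous = decide({}, sessions, "https://idp.example.org/login",
                   "https://stream.example.org/v2/ui/")
logged_in = decide({"session": "abc"}, sessions,
                   "https://idp.example.org/login",
                   "https://stream.example.org/v2/ui/")
```

The backend service itself stays unaware of authentication, which is the main appeal of putting this logic in a proxy.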

Envoy Gateway is a different thing from Envoy Proxy -- the former is a management layer around the latter. We only run Envoy Proxy.

But we're using oauth2-proxy on k8s in the aux cluster to front trace.wikimedia.org and it's been working fine for that -- link to our config. Other teams also use it on bare metal for access to Thanos and a few other pieces of infra.

Nice! Thanks for that info.

Would it be valuable to move MPIC to using oauth2-proxy for consistency with these other systems?

Gehel triaged this task as Medium priority. Mar 7 2025, 8:42 AM
Gehel moved this task from Incoming to Misc on the Data-Platform-SRE board.
JMeybohm moved this task from Inbox to Radar (Awareness) on the ServiceOps new board.
JMeybohm subscribed.

Moving this to Radar on the ServiceOps new board, since we won't be driving an implementation here. Let us know if we can help with something.

Plan:

  1. Take @JAllemandou's diff for httpd_cas modularisation and finish it. To test it, apply it and compare the result with the working deployment.
  2. Adapt httpd_cas for the eventstreams chart and deploy the next version of the chart.

/usr/bin/helmfile \
  --file /home/atsuko/src/operations/deployment-charts/helmfile.d/dse-k8s-services/turnilo/helmfile.yaml \
  --chart /home/atsuko/src/operations/deployment-charts/charts/turnilo \
  --environment dse-k8s-eqiad diff --color --detailed-exitcode --context 5

atsuko changed the task status from Open to In Progress. May 8 2026, 1:41 PM

Change #1283791 had a related patch set uploaded (by Atsuko; author: Atsuko):

[operations/deployment-charts@master] Add auth_proxy.httpd_cas module

https://gerrit.wikimedia.org/r/1283791

Change #1285739 had a related patch set uploaded (by Atsuko; author: Atsuko):

[operations/deployment-charts@master] Migrated turnilo to auth_proxy.httpd_cas module

https://gerrit.wikimedia.org/r/1285739

Change #1283791 merged by jenkins-bot:

[operations/deployment-charts@master] Add auth_proxy.httpd_cas module

https://gerrit.wikimedia.org/r/1283791

Change #1285739 merged by jenkins-bot:

[operations/deployment-charts@master] Migrated turnilo to auth_proxy.httpd_cas module

https://gerrit.wikimedia.org/r/1285739

Going to push the updates for turnilo and start converting eventstreams-internal.

Plan:

  1. Update the eventstreams chart so that it has an external httpd_cas port exposed on 30443.
  2. Create all DNS records (public and private).
  3. Deploy the app, which sets up ingress.
  4. Check that https://<service>.discovery.wmnet:30443 works.
  5. Enable the ATS proxy and edge caching configuration.
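
The check in step 4 can be scripted. Below is a minimal sketch using only the Python standard library. It assumes that either a 2xx answer or a 3xx redirect (presumably to the login page, since httpd_cas sits in front) counts as "works". That criterion is an assumption, not something stated in this task, and the function names are made up for the example.

```python
import urllib.error
import urllib.request


def discovery_url(service: str, port: int = 30443) -> str:
    """Build the internal discovery URL used in step 4."""
    return f"https://{service}.discovery.wmnet:{port}"


def looks_healthy(status: int) -> bool:
    """2xx means served; 3xx is expected when the auth proxy redirects to login."""
    return 200 <= status < 400


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # do not follow; the 3xx then surfaces as HTTPError


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request without following redirects."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


url = discovery_url("eventstreams-internal")
# From a host that can reach the cluster and trusts its CA:
#   print(looks_healthy(fetch_status(url)))
```

The network call is left commented out because it only makes sense from inside the production network.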

Change #1288502 had a related patch set uploaded (by Atsuko; author: Atsuko):

[operations/puppet@production] deployment_server: add eventstreams-internal to dse-k8s-eqiad

https://gerrit.wikimedia.org/r/1288502

Change #1288832 had a related patch set uploaded (by Atsuko; author: Atsuko):

[operations/deployment-charts@master] admin_ng/dse-k8s: add eventstreams-internal to dse-k8s-eqiad

https://gerrit.wikimedia.org/r/1288832

Change #1288502 merged by Atsuko:

[operations/puppet@production] deployment_server: add eventstreams-internal to dse-k8s-eqiad

https://gerrit.wikimedia.org/r/1288502

Change #1288832 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng/dse-k8s: add eventstreams-internal to dse-k8s-eqiad

https://gerrit.wikimedia.org/r/1288832

Created the namespaces in dse-k8s-eqiad, which unblocks testing of the new chart/configuration.

Change #1289357 had a related patch set uploaded (by Atsuko; author: Atsuko):

[operations/deployment-charts@master] draft: upgrade eventstream chart to ingress

https://gerrit.wikimedia.org/r/1289357

Change #1289978 had a related patch set uploaded (by Atsuko; author: Atsuko):

[operations/deployment-charts@master] eventstreams: convert configs

https://gerrit.wikimedia.org/r/1289978

Change #1289979 had a related patch set uploaded (by Atsuko; author: Atsuko):

[operations/deployment-charts@master] eventstreams: copy eventstreams-internal to dse

https://gerrit.wikimedia.org/r/1289979

Change #1289978 merged by jenkins-bot:

[operations/deployment-charts@master] eventstreams: convert configs

https://gerrit.wikimedia.org/r/1289978

Got the service working on dse-k8s-eqiad. Still to do:

  1. Register the service with the IdP (it currently fails with "Application Not Authorized to Use CAS").
  2. Figure out the production certificate (for now it generates an internal one).
  3. Finish massaging the diffs and get them reviewed. So far I:
    • edited some of the vendor templates
    • haven't backported the debug functionality yet

Otherwise it is a matter of a few hours of work.
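
For context on the CAS error in item 1: CAS refuses a login when the requesting service URL does not match any registered service. A minimal sketch of that kind of allow-list check follows, assuming the registry holds regular-expression patterns. The patterns and hostnames are placeholders, not the real IdP configuration.

```python
import re
from typing import Iterable


def is_registered(service_url: str, allowed_patterns: Iterable[str]) -> bool:
    """Return True if the service URL fully matches any registered pattern."""
    return any(re.fullmatch(pattern, service_url)
               for pattern in allowed_patterns)


# Hypothetical registry entries.
registry = [
    r"https://turnilo\.example\.org(/.*)?",
    r"https://stream-internal\.example\.org(/.*)?",
]

registered = is_registered("https://stream-internal.example.org/v2/ui/", registry)
unknown = is_registered("https://other.example.org/", registry)
```

Adding the new hostname to the allow-list is what the idp patch that follows does.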

Change #1293663 had a related patch set uploaded (by Atsuko; author: Atsuko):

[operations/puppet@production] idp: adding stream-internal.w.o to allowed services

https://gerrit.wikimedia.org/r/1293663

Change #1293663 merged by Atsuko:

[operations/puppet@production] idp: adding stream-internal.w.o to allowed services

https://gerrit.wikimedia.org/r/1293663

Change #1294257 had a related patch set uploaded (by Atsuko; author: Atsuko):

[operations/deployment-charts@master] httpd-cas: config option to disable httpd-cas

https://gerrit.wikimedia.org/r/1294257

Change #1294257 merged by jenkins-bot:

[operations/deployment-charts@master] httpd-cas: config option to disable httpd-cas

https://gerrit.wikimedia.org/r/1294257

Change #1289357 merged by jenkins-bot:

[operations/deployment-charts@master] eventstreams: upgrade chart to ingress and idp

https://gerrit.wikimedia.org/r/1289357

Change #1289979 merged by jenkins-bot:

[operations/deployment-charts@master] eventstreams: copy eventstreams-internal to dse

https://gerrit.wikimedia.org/r/1289979

Change #1294326 had a related patch set uploaded (by Atsuko; author: Atsuko):

[operations/dns@master] Provision stream-internal.w.o

https://gerrit.wikimedia.org/r/1294326

Change #1294327 had a related patch set uploaded (by Atsuko; author: Atsuko):

[operations/puppet@production] trafficserver: enable stream-internal.w.o

https://gerrit.wikimedia.org/r/1294327

Change #1294326 merged by Atsuko:

[operations/dns@master] Provision stream-internal.w.o

https://gerrit.wikimedia.org/r/1294326

Change #1294327 merged by Atsuko:

[operations/puppet@production] trafficserver: enable stream-internal.w.o

https://gerrit.wikimedia.org/r/1294327

Cleanup:

  1. Check monitoring
  2. Remove eventstreams-internal from main k8s
  3. Remove extra configs from turnilo and eventstreams
  4. Downgrade eventstreams-internal.discovery.wmnet from lvs to ingress lb

Gehel changed the task status from In Progress to Open. Fri, May 29, 8:40 AM

Change #1295405 had a related patch set uploaded (by Atsuko; author: Atsuko):

[operations/deployment-charts@master] Cleanup old values for turnilo and eventstreams

https://gerrit.wikimedia.org/r/1295405

Change #1295406 had a related patch set uploaded (by Atsuko; author: Atsuko):

[operations/puppet@production] Cleanup eventstream-internal

https://gerrit.wikimedia.org/r/1295406

Change #1295409 had a related patch set uploaded (by Atsuko; author: Atsuko):

[operations/puppet@production] service: move eventstreams-internal to lvs_setup

https://gerrit.wikimedia.org/r/1295409

Change #1295410 had a related patch set uploaded (by Atsuko; author: Atsuko):

[operations/puppet@production] service: move eventstreams-internal to service_setup

https://gerrit.wikimedia.org/r/1295410

  1. Silence ID 7d0bff61-73e8-4e1f-b324-c107b5b54adc
  2. DNS removed
  3. lvs_setup, A:dnsbox OK
  4. lvs removed
  5. removed eventstreams-internal from staging, eqiad, codfw k8s
  6. merged https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1295405, applied admin-ng
  7. merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1295406, applied

Change #1295409 merged by Atsuko:

[operations/puppet@production] service: move eventstreams-internal to lvs_setup

https://gerrit.wikimedia.org/r/1295409

Change #1295410 merged by Atsuko:

[operations/puppet@production] service: move eventstreams-internal to service_setup

https://gerrit.wikimedia.org/r/1295410

Change #1295405 merged by jenkins-bot:

[operations/deployment-charts@master] Cleanup old values for turnilo and eventstreams

https://gerrit.wikimedia.org/r/1295405

Change #1295406 merged by Atsuko:

[operations/puppet@production] Cleanup eventstream-internal

https://gerrit.wikimedia.org/r/1295406

Thank you so so so so much everyone! This has already been useful to me many times!