Support for multiple SSO thanos-web backends
Closed, Resolved (Public)

Description

In T323913: Move thanos-sso away from CNAME discovery.wmnet we moved away from dns-controlled thanos-fe hosts for thanos-web (i.e. thanos.w.o) and to a conftool-controlled service.

The move made it easy to swap backends (e.g. for maintenance), though it created a nuisance: we currently need to (remember to) keep a single thanos-fe host pooled at a time per site.

This is because mod_auth_cas / SSO sessions are not shared between hosts, and backend selection is (for all intents and purposes) random. Therefore, if we have more than one backend pooled, clients (in this case ATS, by way of the frontend CDN) could land on a different host that doesn't have their SSO session stored.
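
To make the failure mode concrete, here is a minimal simulation sketch. It is not how mod_auth_cas is implemented: the `Backend` class, the cookie naming and `count_reauths` are all hypothetical, and the only assumptions carried over from the description are that each host keeps sessions locally and that the backend is picked at random per request.

```python
import itertools
import random


class Backend:
    """A host that only knows about sessions it created itself (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.sessions = set()
        self._counter = itertools.count()

    def handle(self, cookie):
        """Return (cookie, reauthenticated) for one request."""
        if cookie in self.sessions:
            return cookie, False
        # Unknown session: the client is sent through SSO again and gets a
        # new cookie that is valid on this host only.
        new_cookie = f"{self.name}-{next(self._counter)}"
        self.sessions.add(new_cookie)
        return new_cookie, True


def count_reauths(backends, n_requests, rng):
    """Count SSO round trips for one client when each request hits a random backend."""
    cookie = None
    reauths = 0
    for _ in range(n_requests):
        backend = rng.choice(backends)
        cookie, reauthenticated = backend.handle(cookie)
        reauths += reauthenticated
    return reauths


if __name__ == "__main__":
    rng = random.Random(0)
    print("one backend pooled:", count_reauths([Backend("a")], 100, rng))
    print("two backends pooled:", count_reauths([Backend("a"), Backend("b")], 100, rng))
```

With a single backend the client authenticates once; with two, every switch between hosts costs another trip through SSO.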

After a chat/brainstorm with @Muehlenhoff here's my current thinking:

  1. We need an authenticating proxy in front of thanos anyways since thanos web serving doesn't ship authentication/authorization natively
  2. We want said proxy to be compatible with our SSO _and_ have a mechanism to share sessions between different hosts

To this end, we have at least a couple of solutions:

  1. Keep mod_auth_cas and find a way to make it share its sessions. As of March 2023 the module supports filesystem storage only, so some form of shared filesystem between all backends would be needed.
  2. Find an authenticating/authorizing proxy (including an apache module) that can talk SAML or OIDC (or CAS really!) and supports sharing sessions among hosts natively (e.g. with memcached)

Event Timeline

Something I came across that works (on paper! I haven't tried it yet) is this: https://oauth2-proxy.github.io/oauth2-proxy i.e. an OIDC RP authenticating reverse proxy. Session data is stored in cookies by default, so any backend can serve any request and that would solve the problem.
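
As an illustration of why cookie-stored sessions remove the need for shared state, here is a minimal sketch of a signed session cookie. This is NOT oauth2-proxy's actual cookie format; `issue_cookie` and `read_cookie` are hypothetical helpers, and the only assumption is that every backend is configured with the same secret.

```python
import base64
import hashlib
import hmac
import json


def issue_cookie(secret: bytes, session: dict) -> str:
    """Serialize the session and append an HMAC-SHA256 signature over it."""
    payload = base64.urlsafe_b64encode(json.dumps(session, sort_keys=True).encode())
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return f"{payload.decode()}.{signature}"


def read_cookie(secret: bytes, cookie: str):
    """Return the session dict if the signature verifies, else None."""
    payload, _, signature = cookie.rpartition(".")
    expected = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))


if __name__ == "__main__":
    shared_secret = b"same-secret-on-every-backend"  # hypothetical value
    cookie = issue_cookie(shared_secret, {"user": "alice"})
    # Any host holding the shared secret can validate the cookie, no lookup needed.
    print(read_cookie(shared_secret, cookie))
```

The session travels with the client, so which host receives the request no longer matters.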

I'm curious if the recent maglev hashing 'mh' inclusion/migration in T263797 provides any improvement here. On paper using 'mh' scheduler should address session stickiness better than 'sh' did.

Has pooling multiple backends been tested since thanos-web was migrated to the 'mh' scheduler?

IIRC I tried and even mh didn't work because varnish -> ats backend selection is random, therefore the source IP that LVS sees on ats -> backend connection is random too.

This isn't true anymore for ~half of our sites as pointed out by @Vgutierrez. Specifically, once T288106: Experiment with single backend CDN nodes is fully rolled out, the same client IP (external) will result in the same backend IP (internal) making backend requests. With that in place, I believe we could get away with client-IP-based hashing and have SSO work to a decent extent.

The proper solution / nail in the coffin is of course to go stateless for sessions (cf. my comment at https://phabricator.wikimedia.org/T331512#8688330).
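
For reference, the idea behind source-IP hashing can be sketched as below. This is plain modulo hashing, not the Maglev algorithm that the 'mh' scheduler implements, and `pick_backend` plus the backend names are hypothetical. It only shows that a stable source IP yields a stable backend.

```python
import hashlib


def pick_backend(source_ip: str, backends: list[str]) -> str:
    """Deterministically map a source IP to one of the pooled backends."""
    digest = hashlib.sha256(source_ip.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    # Sort so the result does not depend on the order the pool is listed in.
    return sorted(backends)[index]


if __name__ == "__main__":
    pool = ["backend-a", "backend-b"]  # hypothetical names
    # The same source IP always lands on the same backend...
    print(pick_backend("192.0.2.10", pool), pick_backend("192.0.2.10", pool))
    # ...but if the IP seen by the load balancer varies per request (as with
    # random varnish -> ats selection), so may the chosen backend.
    print(pick_backend("192.0.2.11", pool))
```

Unlike Maglev hashing, this modulo scheme remaps most clients whenever the pool changes, which is one more reason stateless sessions are the more robust fix.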

I've been looking at the oauth2-proxy docs and my understanding is that we'd need the following:

  • Run oauth2-proxy as the OIDC client in front of thanos, along these lines:

./oauth2-proxy \
  --provider oidc \
  --provider-display-name "Wikimedia SSO" \
  --client-id thanos_test \
  --client-secret REDACTED \
  --redirect-url https://thanos.monitoring.wmflabs.org/oauth2/callback \
  --code-challenge-method plain \
  --oidc-issuer-url https://idp-test.wikimedia.org/oidc \
  --cookie-secure=true \
  --cookie-secret REDACTED \
  --cookie-domain monitoring.wmflabs.org \
  --email-domain wikimedia.org \
  --upstream http://localhost:16902

  • Configure the oidc client on the cas side
  • Configure apache on the thanos side to reverse proxy requests to oauth2-proxy, which then forwards to thanos itself

Change 972701 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: idp_test entry for thanos OIDC

https://gerrit.wikimedia.org/r/972701

Change 972701 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: idp_test entry for thanos OIDC

https://gerrit.wikimedia.org/r/972701

I did a test on the o11y Pontoon stack (no puppetization yet) and things seem to work as expected: I can log in on https://thanos.monitoring.wmflabs.org, which does the OIDC challenge/redirect to idp-test.w.o and issues me an _oauth2_proxy cookie with my session.

Debian package for oauth2-proxy now lives at https://gitlab.wikimedia.org/repos/sre/oauth2-proxy (debian-wikimedia branch)

Change 973739 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: add thanos_oidc to idp

https://gerrit.wikimedia.org/r/973739

Change 973740 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] oauth2_proxy: new module

https://gerrit.wikimedia.org/r/973740

Change 973741 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: add oidc support via oauth2-proxy

https://gerrit.wikimedia.org/r/973741

Change 973739 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: add thanos_oidc to idp

https://gerrit.wikimedia.org/r/973739

Change 973740 merged by Filippo Giunchedi:

[operations/puppet@production] oauth2_proxy: new module

https://gerrit.wikimedia.org/r/973740

Change 973741 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: add oidc support via oauth2-proxy

https://gerrit.wikimedia.org/r/973741

Change 974477 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: enable Thanos OIDC SSO

https://gerrit.wikimedia.org/r/974477

Change 974477 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: enable Thanos OIDC SSO

https://gerrit.wikimedia.org/r/974477

Change 974498 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: disable auth_cas when running in OIDC SSO mode

https://gerrit.wikimedia.org/r/974498

Change 974498 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: disable auth_cas when running in OIDC SSO mode

https://gerrit.wikimedia.org/r/974498

Change 975770 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] oauth2_proxy: add blackbox checks

https://gerrit.wikimedia.org/r/975770

Change 975770 merged by Filippo Giunchedi:

[operations/puppet@production] oauth2_proxy: add blackbox checks

https://gerrit.wikimedia.org/r/975770

Change 975812 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] profile: adjust oidc probe name

https://gerrit.wikimedia.org/r/975812

Change 975812 merged by Filippo Giunchedi:

[operations/puppet@production] profile: adjust oidc probe name

https://gerrit.wikimedia.org/r/975812

fgiunchedi claimed this task.

With the probes in place I'm calling this done!

Change 984146 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] oauth2_proxy: skip provider button

https://gerrit.wikimedia.org/r/984146

Change 984146 merged by Filippo Giunchedi:

[operations/puppet@production] oauth2_proxy: skip provider button

https://gerrit.wikimedia.org/r/984146

Change 984515 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] oauth2_proxy: update probe definition

https://gerrit.wikimedia.org/r/984515

Change 984515 merged by Filippo Giunchedi:

[operations/puppet@production] oauth2_proxy: update probe definition

https://gerrit.wikimedia.org/r/984515