Move thanos-sso away from CNAME discovery.wmnet
Closed, ResolvedPublic

Description

Today, while upgrading Thanos in T303154, I depooled thanos-query and it mostly worked as expected (Grafana kept its connections, to be investigated later), but I noticed that the thanos.w.o web interface did not switch. After a bit of digging I finally remembered T151009: Provide authenticated access to Thanos native web interface, which means web requests for thanos.w.o are backed by thanos-sso.discovery.wmnet, a CNAME to a single host (because SSO sessions are not shared among hosts, at least as of Jul 2020).

The scheme works fine, but it complicates failover and is inconsistent with how the rest of Thanos works. Therefore I think we should:

  • Investigate whether nowadays we can share sso sessions, cc @jbond @Muehlenhoff as they would know. If that's possible/supported then we can flip back to thanos-query.discovery.wmnet to proxy thanos.w.o
  • If the above isn't possible/desired, move thanos-sso.discovery.wmnet to another service IP with hashing based on client IP; this way we can confctl pool/depool as we do for thanos-swift and thanos-query
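
The second option relies on source hashing: the backend is derived from the client IP, so a given client keeps landing on the same host for as long as the pool is unchanged. A minimal Python sketch of the idea follows. The host names are hypothetical and the SHA-256 modulo mapping is only a stand-in; this is not the actual IPVS or PyBal implementation.

```python
import hashlib


def pick_backend(client_ip, pooled_hosts):
    """Map a client IP to one pooled backend, deterministically."""
    if not pooled_hosts:
        raise ValueError("no pooled backends")
    hosts = sorted(pooled_hosts)  # stable order regardless of input order
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    index = int.from_bytes(digest[:8], "big") % len(hosts)
    return hosts[index]


# Hypothetical host names, for illustration only.
POOL = ["thanos-fe-a.example", "thanos-fe-b.example"]

# The same client IP maps to the same backend while the pool is unchanged.
first = pick_backend("192.0.2.10", POOL)
second = pick_backend("192.0.2.10", POOL)

# Depooling that backend moves the client to a remaining host.
after_depool = pick_backend("192.0.2.10", [h for h in POOL if h != first])
```

Note that in this sketch a depool moves the client to a host that has no record of its session, so the user would need to authenticate again.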

Event Timeline

(Thinking out loud) Actually, going with shared-nothing (i.e. option #2) seems like the better option here, even though thanos.w.o isn't super critical.

Change 861396 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] Add thanos-web.svc and discovery

https://gerrit.wikimedia.org/r/861396

Change 861411 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] conftool: add thanos-web service

https://gerrit.wikimedia.org/r/861411

Change 861412 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: add thanos-web to catalog and frontend

https://gerrit.wikimedia.org/r/861412

fgiunchedi moved this task from Up next to Doing on the User-fgiunchedi board.

Change 861396 merged by Filippo Giunchedi:

[operations/dns@master] Add thanos-web.svc and discovery

https://gerrit.wikimedia.org/r/861396

Change 861827 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] Revert thanos-web discovery record

https://gerrit.wikimedia.org/r/861827

Change 861827 merged by Filippo Giunchedi:

[operations/dns@master] Revert thanos-web discovery record

https://gerrit.wikimedia.org/r/861827

Change 861411 merged by Filippo Giunchedi:

[operations/puppet@production] conftool: add thanos-web service

https://gerrit.wikimedia.org/r/861411

Change 861412 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: add thanos-web to catalog and frontend

https://gerrit.wikimedia.org/r/861412

Change 862258 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hiera: move thanos-web to lvs_setup

https://gerrit.wikimedia.org/r/862258

Change 862258 merged by Filippo Giunchedi:

[operations/puppet@production] hiera: move thanos-web to lvs_setup

https://gerrit.wikimedia.org/r/862258

Mentioned in SAL (#wikimedia-operations) [2022-11-30T15:35:12Z] <godog> roll-restart pybal on lvs[21]020 to pick up thanos-web service and then on lvs1019 lvs2009 - T323913

Investigate whether nowadays we can share sso sessions,

The CAS cookie used by mod_auth_cas is stored locally under /var/cache/apache2/mod_auth_cas/thanos.wikimedia.org. If this were written to some shared storage it might work, but we haven't tested this yet.
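
To illustrate why a host-local cache prevents session sharing, here is a toy model in Python. The class and host names are hypothetical and this is not how mod_auth_cas is implemented; it only shows that a session recorded on one host is unknown to another unless both consult the same store.

```python
class Frontend:
    """Toy frontend that validates session cookies against a store."""

    def __init__(self, name, store=None):
        self.name = name
        # Each host gets a private store unless one is shared explicitly.
        self.store = store if store is not None else set()

    def login(self, cookie):
        self.store.add(cookie)

    def is_authenticated(self, cookie):
        return cookie in self.store


# Host-local stores: the login on host-a is invisible to host-b.
host_a = Frontend("host-a")
host_b = Frontend("host-b")
host_a.login("cookie-123")
local_ok = host_a.is_authenticated("cookie-123")
other_ok = host_b.is_authenticated("cookie-123")

# Shared store (untested in practice, per the comment above).
shared = set()
host_c = Frontend("host-c", store=shared)
host_d = Frontend("host-d", store=shared)
host_c.login("cookie-456")
shared_ok = host_d.is_authenticated("cookie-456")
```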

Thank you, that makes sense! I ended up going with sticky requests via the sh scheduler; we'll see how that goes.

Change 862843 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hiera: set thanos-web service to production

https://gerrit.wikimedia.org/r/862843

Change 862843 merged by Filippo Giunchedi:

[operations/puppet@production] hiera: set thanos-web service to production

https://gerrit.wikimedia.org/r/862843

Change 862937 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hiera: replace thanos-sso with thanos-web

https://gerrit.wikimedia.org/r/862937

Change 862939 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] wmnet: remove thanos-sso

https://gerrit.wikimedia.org/r/862939

Change 862937 merged by Filippo Giunchedi:

[operations/puppet@production] hiera: replace thanos-sso with thanos-web

https://gerrit.wikimedia.org/r/862937

I was a little too hasty/optimistic in imagining the outcome: even for pass traffic from varnish-frontend to ats-backend the selection is still random, and thus even the sh scheduler won't work because the client IP it sees is random.

At any rate, we now have the thanos-web discovery service, which does simplify operations compared to merging DNS patches to move thanos.wikimedia.org to a different host. I've updated https://wikitech.wikimedia.org/wiki/Thanos#Pool_/_depool_a_site accordingly.
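
The effect can be sketched with a small Python simulation. It assumes hypothetical proxy addresses and backend names, and uses an integer modulo of the source address as a stand-in for the sh scheduler. It also assumes that the scheduler hashes the address of whichever intermediate proxy forwarded the request; since that proxy is picked at random, one user's requests can scatter across backends.

```python
import ipaddress
import random


def source_hash(source_ip, backends):
    """Stand-in for source hashing: source address modulo pool size."""
    return backends[int(ipaddress.ip_address(source_ip)) % len(backends)]


# Hypothetical names and addresses, for illustration only.
BACKENDS = ["backend-a", "backend-b"]
PROXIES = [f"10.0.0.{n}" for n in range(1, 17)]

rng = random.Random(0)

# A client that reaches the scheduler directly always hashes the same way.
direct = {source_hash("192.0.2.10", BACKENDS) for _ in range(50)}

# The same user behind randomly selected proxies: the scheduler only sees
# the proxy address, so the chosen backend varies between requests.
via_proxies = {source_hash(rng.choice(PROXIES), BACKENDS) for _ in range(50)}
```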

Change 862939 merged by Filippo Giunchedi:

[operations/dns@master] wmnet: remove thanos-sso

https://gerrit.wikimedia.org/r/862939

I'll call this resolved since we have moved away from the thanos-sso CNAME to thanos-web, managed via conftool, instead.

fgiunchedi claimed this task.

Change 864663 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: add note re: thanos-web and scheduler: sh and SSO

https://gerrit.wikimedia.org/r/864663

Change 864663 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: add note re: thanos-web and scheduler: sh and SSO

https://gerrit.wikimedia.org/r/864663