
toolforge: kubernetes can't revoke certificates
Closed, Resolved, Public

Description

Toolforge Kubernetes uses x509 certificates as its main authentication mechanism. As of today, however, it cannot revoke such certs or check a CRL (certificate revocation list), as documented in various places upstream and elsewhere on the internet.

We should keep this in mind when dealing with stuff like T363983: [toolforge] Investigate authentication.

Some options we could explore to mitigate the consequences of this limitation include:

  • issue x509 certs with a very short lifetime, like 1 day -- this will put some additional pressure on maintain-kubeusers
  • maybe we can implement a poor man's CRL-like mechanism via custom admission controllers or Kyverno -- whether this is feasible is unknown as of today

Details

Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
maintain-kubeusers: bump to 0.0.173-20250513143459-bcd0852b | repos/cloud/toolforge/toolforge-deploy!779 | group_203_bot_f4d95069bb2675e4ce1fff090c1c1620 | bump_maintain-kubeusers | main
kubecerts: stop renewing certs on a random basis | repos/cloud/toolforge/maintain-kubeusers!69 | aborrero | arturo-248-kubecerts-stop-rene | main
maintain-kubeusers: bump to 0.0.164-20240710151051-a518392e | repos/cloud/toolforge/toolforge-deploy!405 | ghost | bump_maintain-kubeusers | main
kubecerts: only randomly select certs for renew in case they are old | repos/cloud/toolforge/maintain-kubeusers!57 | aborrero | arturo-156-kubecerts-only-rand | main
maintain-kubeusers: bump to 0.0.163-20240710124028-b9245afd | repos/cloud/toolforge/toolforge-deploy!402 | ghost | bump_maintain-kubeusers | main
kubecerts: have certificates lifetime to be max 10 days, renew them often | repos/cloud/toolforge/maintain-kubeusers!55 | aborrero | arturo-306-kubecerts-have-cert | main

Related Objects

Status | Subtype | Assigned | Task
Resolved | | LucasWerkmeister |
Resolved | | matmarex |
Resolved | | Legoktm |
Resolved | | Legoktm |
In Progress | | dcaro |
Resolved | | dcaro |
In Progress | | komla |
Resolved | | dcaro |
Resolved | | dcaro |
Resolved | | dcaro |
Open | | dcaro |
Resolved | | dcaro |
Open | | dcaro |
Open | | dcaro |
Resolved | | dcaro |
Resolved | | Slst2020 |
Open | | None |
Resolved | | aborrero |

Event Timeline

dcaro triaged this task as High priority. (Jul 2 2024, 1:23 PM)

project_1317_bot_df3177307bed93c3f34e421e26c86e38 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/402

maintain-kubeusers: bump to 0.0.163-20240710124028-b9245afd

project_1317_bot_df3177307bed93c3f34e421e26c86e38 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/405

maintain-kubeusers: bump to 0.0.164-20240710151051-a518392e

We have taken the following measures:

  • reduced the lifetime of new certificates to 10 days
  • certs get renewed 2 days before they expire
  • on each maintain-kubeusers loop, pick a few random old certificates with a 1-year lifetime and renew them with a 10-day lifetime
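
For illustration only, the three measures above could be sketched roughly like this. This is not the actual maintain-kubeusers code: the function names, the `certs` mapping and `OLD_CERTS_PER_LOOP` are hypothetical, and "old" is approximated here as "still valid for longer than a new 10-day cert could be".

```python
import random
from datetime import datetime, timedelta, timezone

# Values taken from the measures listed above.
NEW_LIFETIME = timedelta(days=10)
RENEW_MARGIN = timedelta(days=2)
# Hypothetical value: the comment only says "a few" old certs per loop.
OLD_CERTS_PER_LOOP = 3


def needs_renewal(not_after, now):
    """Return True once the cert is within RENEW_MARGIN of expiring."""
    return now >= not_after - RENEW_MARGIN


def select_for_renewal(certs, now, rng=random):
    """Pick the certs to renew in one loop iteration.

    `certs` maps a user name to the notAfter datetime of its certificate.
    Returns (users whose cert is about to expire, randomly picked old certs).
    """
    due = [u for u, not_after in certs.items() if needs_renewal(not_after, now)]
    # A cert valid for longer than NEW_LIFETIME can only be an old long-lived one.
    old = [u for u, not_after in certs.items() if not_after - now > NEW_LIFETIME]
    picked = rng.sample(old, min(OLD_CERTS_PER_LOOP, len(old)))
    return sorted(due), sorted(picked)


# A cert issued on 2024-07-10 with the new lifetime becomes due 2 days before expiry.
issued = datetime(2024, 7, 10, tzinfo=timezone.utc)
not_after = issued + NEW_LIFETIME
print((not_after - RENEW_MARGIN).date())  # prints 2024-07-18
```

Old certs that already have fewer than 10 days left are not picked at random in this sketch; they are simply caught by the 2-day rule when their time comes.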

So far the system is stable and everything seems to be working just fine.

The first certificates with a 10-day lifetime were generated on 2024-07-10, so on 2024-07-18 we should check whether the renewal 2 days before expiration is working as expected.

Leaving the ticket open until then.

There are some risks with lower lifetime values, for example 2 or 5 days:

  • if there is a problem with maintain-kubeusers, or with the k8s certificate renewal process, or some other bug, a lot of certificates (if not all) will expire, meaning users won't be able to run their stuff on Toolforge.
  • lower lifetime values give us little margin to react and fix things. A failure could happen over a weekend, or during the end-of-year holiday break.

So, I would rather:

  • keep a slightly higher lifetime value -- will give us some additional margin to react to failures
  • keep some amount of randomization on certificate renewal -- so not all certificates expire on the same day, and not all users are affected by a potential failure, keeping the 'blast radius' a bit smaller
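
To put rough numbers on this trade-off, here is a back-of-the-envelope sketch. It assumes expiry dates are spread uniformly (which is what the randomization buys us) and that renewals stop completely during an outage; `fraction_expired` and the 4-day outage are hypothetical, chosen only for illustration.

```python
def fraction_expired(lifetime_days, renew_margin_days, outage_days):
    """Fraction of certificates that expire if renewals stop for outage_days.

    Assumes expiry dates are spread uniformly, so when the outage starts the
    remaining validity of a cert is uniform between renew_margin_days and
    lifetime_days.
    """
    spread = lifetime_days - renew_margin_days
    if spread <= 0:
        raise ValueError("lifetime must be longer than the renewal margin")
    expired = (outage_days - renew_margin_days) / spread
    # Clamp to [0, 1]: nothing expires within the margin, everything after a full lifetime.
    return min(1.0, max(0.0, expired))


# Renewal broken for 4 days (e.g. a long weekend), renewing 2 days before expiry.
for lifetime in (5, 10):
    print(lifetime, round(fraction_expired(lifetime, 2, 4), 2))
```

Under these assumptions a 4-day outage expires about two thirds of the certificates with a 5-day lifetime, but only a quarter with a 10-day lifetime. If all certificates had been issued on the same day instead, they would all expire together.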

The renewal 2 days before expiration is working as expected:

image.png (361×1 px, 119 KB)

group_203_bot_f4d95069bb2675e4ce1fff090c1c1620 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/779

maintain-kubeusers: bump to 0.0.173-20250513143459-bcd0852b