
toolforge: new maintain-kubeusers takes long time to loop over all the accounts to reconcile them
Closed, Resolved (Public)

Description

The new refactored maintain-kubeusers takes a long time to loop over all the accounts (about 3.5k) to reconcile them.

I believe this is because for each account, the state configmap is queried. However, this is mostly read, so maybe we can explore how to cache this.
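A minimal sketch of what such a cache could look like, assuming the state for one account fits in a flat string map. `StateCache` and the `fetch` callable are hypothetical names for illustration; `fetch` stands in for whatever call currently reads a single state configmap, and none of this is taken from the maintain-kubeusers code.

```python
from typing import Callable, Dict

# Hypothetical type: the state stored for one account, as a flat string map.
State = Dict[str, str]


class StateCache:
    """Read-through cache keyed by account name (illustrative only)."""

    def __init__(self, fetch: Callable[[str], State]) -> None:
        # `fetch` is an assumption: it represents the per-account API read.
        self._fetch = fetch
        self._cache: Dict[str, State] = {}

    def get(self, account: str) -> State:
        # Only call `fetch` the first time an account is seen.
        if account not in self._cache:
            self._cache[account] = self._fetch(account)
        return self._cache[account]

    def put(self, account: str, state: State) -> None:
        # Call after writing the configmap so the cache stays coherent.
        self._cache[account] = dict(state)

    def invalidate(self, account: str) -> None:
        # Drop an entry, for example when the account is deleted.
        self._cache.pop(account, None)
```

Since the daemon is presumably the only writer of its own state, updating the cache on every write (`put`) should be enough to keep it consistent; if something else can modify the configmaps, entries would also need a TTL or a periodic refresh.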

One consequence is that the daemon misses the liveness probe deadline and gets restarted often.

Approaches to explore:

  • introduce some caching for state configmaps
  • parallelization, i.e., check for multiple accounts in different async tasks
  • some combination of both
  • minimize amount of filesystem checks (NFS-induced latency)
  • move the liveness probe check response inside the reconciliation loop
  • drop sleep(1) in certificate generation logic -- can't be done, as cert generation will reliably fail if there is no such delay
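For the liveness probe item, a rough sketch of signalling liveness from inside the loop. It assumes the probe checks how recently a heartbeat file was modified. That is an assumption, and `reconcile_all`, `reconcile_one`, `heartbeat_file` and `min_interval` are hypothetical names, not the daemon's real API.

```python
import time
from pathlib import Path
from typing import Callable, Iterable, Optional


def reconcile_all(
    accounts: Iterable[str],
    reconcile_one: Callable[[str], None],
    heartbeat_file: Path,
    min_interval: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Reconcile every account, refreshing a heartbeat file along the way.

    Returns how many times the heartbeat file was refreshed.
    """
    beats = 0
    last_beat: Optional[float] = None
    for account in accounts:
        now = clock()
        # Refresh before the first account, then at most once per interval,
        # so a long loop keeps signalling liveness without touching the
        # file for every single account.
        if last_beat is None or now - last_beat >= min_interval:
            heartbeat_file.touch()
            last_beat = now
            beats += 1
        reconcile_one(account)
    return beats
```

With this shape the probe deadline only has to cover the slowest single account plus `min_interval`, rather than a full pass over all accounts.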

Event Timeline

aborrero changed the task status from Open to In Progress (Jun 4 2024, 9:08 AM).
aborrero triaged this task as High priority.
aborrero moved this task from Backlog to Doing on the User-aborrero board.

project_1317_bot_df3177307bed93c3f34e421e26c86e38 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/302

maintain-kubeusers: bump to 0.0.136-20240604092234-e8d7cdd4

project_1317_bot_df3177307bed93c3f34e421e26c86e38 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/303

maintain-kubeusers: bump to 0.0.137-20240604102906-d1b2d380

aborrero opened https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/30

kubeconfig: store some state about kubeconfig in the configmap to save NFS hits

aborrero merged https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/30

kubeconfig: store some state about kubeconfig in the configmap to save NFS hits

project_1317_bot_df3177307bed93c3f34e421e26c86e38 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/304

maintain-kubeusers: bump to 0.0.138-20240604120147-29f8d0f2

with the latest changes we are down to ~3 minutes per noop loop.

project_1317_bot_df3177307bed93c3f34e421e26c86e38 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/312

maintain-kubeusers: bump to 0.0.142-20240605144945-15057389

investigating error in the last patch in toolsbeta:

starting a run
account: 'bd808' resource: 'homedir' needs create
Traceback (most recent call last):
  File "/app/maintain_kubeusers_cli.py", line 6, in <module>
    runpy.run_module("maintain_kubeusers", run_name="__main__")
  File "<frozen runpy>", line 229, in run_module
  File "<frozen runpy>", line 88, in _run_code
  File "/app/maintain_kubeusers/__main__.py", line 6, in <module>
    main()
  File "/app/maintain_kubeusers/cli.py", line 166, in main
    do_run(
  File "<decorator-gen-1>", line 2, in do_run
  File "/opt/lib/python/site-packages/prometheus_client/context_managers.py", line 80, in wrapped
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/maintain_kubeusers/cli.py", line 91, in do_run
    admin.reconcile()
  File "/app/maintain_kubeusers/user.py", line 53, in reconcile
    self.resources.reconcile()
  File "/app/maintain_kubeusers/resources/resources.py", line 217, in reconcile
    self._reconcile_create(
  File "/app/maintain_kubeusers/resources/resources.py", line 157, in _reconcile_create
    resource_data = resource.do_create(resource_data)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/maintain_kubeusers/resources/homedir.py", line 71, in do_create
    self._copy_skel(mode=mode)
  File "/app/maintain_kubeusers/resources/homedir.py", line 50, in _copy_skel
    shutil.copy(orig_path, dest_path)
  File "/usr/lib/python3.11/shutil.py", line 419, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/usr/lib/python3.11/shutil.py", line 258, in copyfile
    with open(dst, 'wb') as fdst:
         ^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/home/bd808/.bashrc'
Stream closed EOF for maintain-kubeusers/maintain-kubeusers-54c487bd6c-4d4xc (maintain-kubeusers)
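The traceback is consistent with the destination directory (`/home/bd808`) not existing when `shutil.copy()` runs, since `shutil.copy()` does not create parent directories. The thread does not confirm the root cause, so the following is only a sketch of a defensive variant. `copy_skel_file` is a hypothetical helper, not the actual `_copy_skel` implementation, and it is not necessarily the fix that was deployed.

```python
import shutil
from pathlib import Path


def copy_skel_file(orig_path: Path, dest_path: Path) -> None:
    # Create the destination directory first: shutil.copy() does not do
    # this and raises FileNotFoundError when the parent is missing.
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(orig_path, dest_path)
```

Ownership and mode of the created directory are left out here; a real home directory would need both set explicitly.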

project_1317_bot_df3177307bed93c3f34e421e26c86e38 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/313

maintain-kubeusers: bump to 0.0.144-20240606095929-cf148997

aborrero opened https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/38

homedir: remove needs_create() filesystem check after all accounts have state

aborrero merged https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/38

homedir: remove needs_create() filesystem check after all accounts have state

project_1317_bot_df3177307bed93c3f34e421e26c86e38 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/314

maintain-kubeusers: bump to 0.0.145-20240606123146-6710dc2f

last noop loop took 1.51 mins (about 90 seconds).

If we have 3200 accounts, 3200 / 90 ≈ 35 accounts/s

While we can definitely improve performance, I think verifying that all accounts are reconciled at a rate of about 35 accounts/second is good enough for now.
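For reference, the throughput arithmetic as a quick sketch. The 3200 accounts and 90 seconds are the figures quoted above; the 180 second case reuses the earlier "~3 minutes per noop loop" figure and assumes the account count was about the same at that point. `accounts_per_second` is just an illustrative helper.

```python
def accounts_per_second(accounts: int, loop_seconds: float) -> float:
    # Average number of accounts checked per second over one noop loop.
    return accounts / loop_seconds


print(round(accounts_per_second(3200, 90), 1))   # 35.6
print(round(accounts_per_second(3200, 180), 1))  # 17.8
```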

NOTE: the loop for actually modifying resources takes longer, but we (I) don't see any problem with that ATM.