
Improve cirrus reindex orchestrator to limit its impact on k8s API response times
Open, Needs Triage, Public

Description

The cirrus-reindex-orchestrator is a tool that can reindex multiple wiki indices in parallel.
It is limited to 8 shards per cluster in parallel, which means that only a single reindex runs at a time for large wikis (commons), but up to 8 mwscript runs can be in flight at once for small wikis.
Unfortunately, launching multiple mwscript-k8s deployments concurrently has a visible impact on the k8s API response times:

[Screenshot: Capture d’écran du 2025-12-05 15-59-40.png]

We can see the response times degrading over the course of the run: big wikis get reindexed first, and as more small wikis are then processed concurrently, the pressure on the k8s resources increases.
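To make the effect of the shard limit concrete, here is a minimal sketch of how a fixed budget of 8 shards turns into anywhere from 1 to 8 concurrent mwscript runs. It is not the orchestrator's actual scheduling code: `plan_batches`, the greedy grouping, and the shard counts in the usage note are all hypothetical and only illustrate the arithmetic.

```python
from typing import Iterable

SHARD_BUDGET = 8  # per-cluster limit mentioned in the description


def plan_batches(
    wikis: Iterable[tuple[str, int]], budget: int = SHARD_BUDGET
) -> list[list[str]]:
    """Greedily group wikis so that each group's shard total fits the budget.

    Hypothetical helper: the real orchestrator may schedule differently.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for name, shards in wikis:
        # A wiki with more shards than the budget occupies the whole budget.
        cost = min(shards, budget)
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += cost
    if current:
        batches.append(current)
    return batches
```

With made-up shard counts, one large wiki followed by eight single-shard wikis yields a first group of one wiki and a second group of eight, i.e. eight concurrent helm deployments against the same namespace.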

We could investigate ways to make this process less impactful on the k8s APIs:

  • investigate using --local_dblist; is it possibly acceptable for small wikis?
  • complete refactor: use the MediaWiki API to return the mapping/index config and schedule the reindex from Python instead of the maintenance script
  • workaround: review the concurrency limits and make the process slower overall
  • possible small optimization: the cleanup of helm deployments is not batched; batching the cleanups might help a bit (if running helmfile destroy on multiple releases at once helps)
  • other ideas?
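As a rough illustration of the batched-cleanup idea, the sketch below groups release names and calls a cleanup function once per group instead of once per release. Everything here is an assumption: `cleanup_in_batches`, `chunked` and the `destroy` callable are hypothetical names, and whether a single helmfile destroy covering several releases actually issues fewer LIST calls still needs to be verified.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def chunked(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def cleanup_in_batches(
    releases: Iterable[str],
    destroy: Callable[[list[str]], None],
    batch_size: int = 10,
) -> int:
    """Call `destroy` once per batch of releases; return the number of calls.

    `destroy` is a placeholder for whatever removes the helm releases;
    it is injected so the sketch does not assume any particular CLI flags.
    """
    calls = 0
    for batch in chunked(releases, batch_size):
        destroy(batch)
        calls += 1
    return calls
```

For 25 finished releases and a batch size of 10, this results in 3 cleanup invocations instead of 25.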

AC:

  • running a full reindex does not cause the k8s API response times to increase

Event Timeline


I was trying to get to the root of the "High Kubernetes API latency (LIST secrets) on k8s@codfw" alert we've been seeing since Dec 1st, which ultimately led me to cirrus_reindexer.reindex_all. The problem is that mwscript-k8s deployments (and cleanups) issue helm commands that call LIST on /secrets for the mw-script namespace (because helm stores the state of releases in secret objects). That namespace holds a couple of hundred secrets, and since LIST calls can't be properly filtered server side, the apiserver has to return all objects for every call.

I don't think it's an immediate problem for other workloads, since LIST /secrets calls for other namespaces should still be fast, but it of course slows down the reindexing and fires an alert on the SRE end.