mwscript-k8s creates too many resources
Open, High, Public

Assigned To

None

Authored By

	JMeybohm
	Oct 9 2024, 11:14 AM

Description

We had an outage that is most likely related to mwscript creating a large number of k8s resources. From the last two days there are:

root@deploy1003:~# helm -n mw-script list |wc -l
257
root@deploy1003:~# kubectl -n mw-script get secret |wc -l
1883
root@deploy1003:~# kubectl -n mw-script get networkpolicies. |wc -l
1882
root@deploy1003:~# kubectl -n mw-script get jobs |wc -l
1877
root@deploy1003:~# kubectl -n mw-script get pods |wc -l
1875
root@deploy1003:~# kubectl -n mw-script get configmaps |wc -l
13170

We suspect that:

The high number of networkpolicies led to a calico outage
The high number of secret objects leads to helm-state-metrics being OOM killed (LIST /secrets calls piling up on the api servers)
Multiple cert-manager components being OOM killed (not sure why yet)

Most of the objects are potentially redundant and identical for each job; we should try to consolidate them in some way.

It was also noted that foreachwiki currently launches mwscript once for each wiki, which further increases the number of jobs in the mw-script namespace.

Details

Subject	Repo	Branch	Lines +/-
deployment_server: Add release to mwscript-k8s -ojson output	operations/puppet	production	+1 -0
deployment_server: Read Helm secrets in `mwscript-cleanup`	operations/puppet	production	+44 -38
[stub] mwscript-k8s: Add concurrency limiting via poolcounter	operations/puppet	production	+62 -1
mw-script: deduplicate resources	operations/deployment-charts	master	+5 -0
mediawiki: deduplicate network policies and configmaps	operations/deployment-charts	master	+15 -9
deployment_server: Add `helm list` pagination to mwscript-cleanup	operations/puppet	production	+26 -15
Bump cert-manager, cfssl-issuer and helm-state metrics resources	operations/deployment-charts	master	+23 -13
Align calico resource settings for codfw and eqiad	operations/deployment-charts	master	+6 -6
Bump cert-manager, cfssl-issuer and helm-state metrics resources	operations/deployment-charts	master	+45 -3


Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T376795 mwscript-k8s creates too many resources
		Resolved		JMeybohm	T376976 Remove memory limits from critical cluster components (calico)

Event Timeline

JMeybohm created this task. Oct 9 2024, 11:14 AM

Restricted Application added a subscriber: Aklapper. Oct 9 2024, 11:14 AM

JMeybohm triaged this task as High priority. Oct 9 2024, 11:14 AM

JMeybohm updated the task description. Oct 9 2024, 11:18 AM

Change #1078796 had a related patch set uploaded (by JMeybohm; author: Giuseppe Lavagetto):

[operations/puppet@production] [stub] mwscript-k8s: Add concurrency limiting via poolcounter

https://gerrit.wikimedia.org/r/1078796

gerritbot added a project: Patch-For-Review. Oct 9 2024, 12:02 PM

Change #1078927 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Bump cert-manager, cfssl-issuer and helm-state metrics resources

https://gerrit.wikimedia.org/r/1078927

Change #1078928 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Align calico resource settings for codfw and eqiad

https://gerrit.wikimedia.org/r/1078928

Change #1078927 merged by jenkins-bot:

[operations/deployment-charts@master] Bump cert-manager, cfssl-issuer and helm-state metrics resources

https://gerrit.wikimedia.org/r/1078927

Change #1078928 merged by jenkins-bot:

[operations/deployment-charts@master] Align calico resource settings for codfw and eqiad

https://gerrit.wikimedia.org/r/1078928

akosiaris added a subscriber: RLazarus. Oct 9 2024, 2:06 PM

Change #1078957 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Bump cert-manager, cfssl-issuer and helm-state metrics resources

https://gerrit.wikimedia.org/r/1078957

Change #1078957 merged by JMeybohm:

[operations/deployment-charts@master] Bump cert-manager, cfssl-issuer and helm-state metrics resources

https://gerrit.wikimedia.org/r/1078957

mwscript-cleanup already ran but did not clean up anything, probably because the releases are too fresh. I'm not sure if it's okay to lower EXPIRY_TIME in the script and re-run it, but we should probably clean up most of the releases before the next foreachwiki call.
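
As a rough illustration of the "too fresh" hypothesis: a cleanup pass that only removes releases older than an expiry threshold skips everything created in the last two days if the threshold is longer than that. This is a minimal Python sketch, not the mwscript-cleanup code. The function name, the (name, updated) input shape and the 7-day value are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold; the real EXPIRY_TIME in mwscript-cleanup may differ.
EXPIRY_TIME = timedelta(days=7)


def expired_releases(releases, now, expiry=EXPIRY_TIME):
    """Return names of releases last updated more than `expiry` before `now`.

    `releases` is an iterable of (name, updated) pairs, where `updated`
    is a timezone-aware datetime.
    """
    return [name for name, updated in releases if now - updated > expiry]


now = datetime(2024, 10, 9, 12, 0, tzinfo=timezone.utc)
releases = [
    ("example-old", now - timedelta(days=10)),
    ("example-fresh", now - timedelta(days=2)),
]
# Only the 10-day-old release is selected; the 2-day-old one is left alone.
print(expired_releases(releases, now))
```

Lowering the threshold (here, passing a smaller `expiry`) would select the fresh releases as well, at the cost of removing their logs and exit status sooner.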

Change #1078988 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] deployment_server: Add `helm list` pagination to mwscript-cleanup

https://gerrit.wikimedia.org/r/1078988

Change #1078989 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] do not submit: T376795 cleanups

https://gerrit.wikimedia.org/r/1078989

Change #1078988 merged by RLazarus:

[operations/puppet@production] deployment_server: Add `helm list` pagination to mwscript-cleanup

https://gerrit.wikimedia.org/r/1078988

Change #1079314 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] deployment_server: Tweak mwscript-cleanup `helm list` pagination

https://gerrit.wikimedia.org/r/1079314

CDanis subscribed. Oct 10 2024, 4:23 PM

Turns out the object counts are already in Prometheus. Here's a quick plot on a dashboard: https://grafana.wikimedia.org/goto/u3dyc3kHg?orgId=1

In T376795#10218645, @CDanis wrote:

Turns out the object counts are already in Prometheus. Here's a quick plot on a dashboard: https://grafana.wikimedia.org/goto/u3dyc3kHg?orgId=1

Yeah. We have them on the Kubernetes API dashboard as well. I was thinking about adding an alert that fires when there is a burst of new objects being added. Not sure yet if that's a good idea.
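
To make the idea of a "burst" alert concrete, here is a small Python sketch of the condition such an alert might evaluate. It is not a Prometheus rule. It compares each object-count sample with earlier samples inside a time window and reports a burst when the increase exceeds a threshold. The function name, sample format, window and threshold are all assumptions for illustration.

```python
def burst_detected(samples, window_seconds, max_increase):
    """Return True if the count grew by more than `max_increase`
    between any two samples at most `window_seconds` apart.

    `samples` is a list of (unix_timestamp, object_count) pairs,
    sorted by timestamp.
    """
    for i, (ts, count) in enumerate(samples):
        for earlier_ts, earlier_count in samples[:i]:
            within_window = ts - earlier_ts <= window_seconds
            if within_window and count - earlier_count > max_increase:
                return True
    return False


# Made-up samples: slow growth, then roughly 800 new objects in 10 minutes.
bursty = [(0, 100), (600, 120), (1200, 900), (1800, 1900)]
steady = [(0, 100), (600, 150), (1200, 200)]
print(burst_detected(bursty, window_seconds=900, max_increase=500))
print(burst_detected(steady, window_seconds=900, max_increase=500))
```

Whether a fixed threshold like this is useful depends on how bursty legitimate workloads are, which is the open question in the comment above.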

Change #1079314 abandoned by RLazarus:

[operations/puppet@production] deployment_server: Tweak mwscript-cleanup `helm list` pagination

Reason:

Scott correctly points out I've got this the wrong way around -- the more likely issue is I'm iterating over the list while destroying releases, so the offset is invalid. 🤦 Another, better fix to follow.

https://gerrit.wikimedia.org/r/1079314
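
The failure mode described in that abandon reason can be reproduced without Helm: paging through a list by offset while deleting the items already seen shifts the remaining items, so some pages are never visited. The Python simulation below is self-contained. `FakeHelm`, its methods and the page size are invented for illustration and do not reflect the real mwscript-cleanup code. The fix shown (collect every page first, then destroy) is only one possible approach and not necessarily the one that was eventually merged.

```python
class FakeHelm:
    """In-memory stand-in for a paginated release listing (hypothetical)."""

    def __init__(self, names):
        self.releases = list(names)

    def list_page(self, offset, limit):
        # Returns a copy, so callers can destroy releases while iterating it.
        return self.releases[offset:offset + limit]

    def destroy(self, name):
        self.releases.remove(name)


def cleanup_buggy(helm, page_size):
    """Destroy while paging: deletions shift later items under the offset."""
    offset = 0
    while True:
        page = helm.list_page(offset, page_size)
        if not page:
            break
        for name in page:
            helm.destroy(name)
        offset += page_size


def cleanup_fixed(helm, page_size):
    """List everything first, then destroy, so the offsets stay valid."""
    names, offset = [], 0
    while True:
        page = helm.list_page(offset, page_size)
        if not page:
            break
        names.extend(page)
        offset += page_size
    for name in names:
        helm.destroy(name)


buggy = FakeHelm(f"r{i}" for i in range(10))
cleanup_buggy(buggy, page_size=2)
print(buggy.releases)  # some releases survive

fixed = FakeHelm(f"r{i}" for i in range(10))
cleanup_fixed(fixed, page_size=2)
print(fixed.releases)  # nothing left
```

An alternative that avoids offsets entirely is to keep requesting the first page until it comes back empty, but that only terminates if every listed release is actually destroyed.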

Change #1079314 restored by RLazarus:

[operations/puppet@production] deployment_server: Tweak mwscript-cleanup `helm list` pagination

https://gerrit.wikimedia.org/r/1079314

akosiaris mentioned this in T376976: Remove memory limits from critical cluster components (calico). Oct 11 2024, 7:14 AM

Change #1079444 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mediawiki: deduplicate network policies and configmaps

https://gerrit.wikimedia.org/r/1079444

Change #1079445 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mw-script: deduplicate resources

https://gerrit.wikimedia.org/r/1079445

I've confirmed in the staging cluster that creating 1k network-policies (even if they don't apply to anything) bumps the avg calico-node memory usage from ~130 to ~390MB. CPU usage also increases quite significantly.

Adding 6k configmaps does not really do anything to calico, cert-manager, helm-state metrics. It might have an impact on the kubelet when the configmaps are actually used, though. I've not tested that but given the fact that this would spread across all kubelets I don't think it's an issue.

Adding 1k secrets (e.g. helm releases) though does bring down cert-manager and helm-state metrics as expected.
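
For a rough sense of what the staging measurement implies for the incident, the numbers above can be extrapolated linearly. This is back-of-the-envelope arithmetic in Python, assuming memory grows linearly with the number of network policies and that the ~130 MB baseline is comparable between clusters; neither assumption was verified in this task. The function name is hypothetical.

```python
def estimate_memory_mb(policies, baseline_mb=130.0, measured_mb=390.0,
                       measured_policies=1000):
    """Linearly extrapolate average calico-node memory (MB) from the
    staging measurement: ~130 MB baseline, ~390 MB with 1k policies."""
    per_policy_mb = (measured_mb - baseline_mb) / measured_policies
    return baseline_mb + per_policy_mb * policies


# 1882 lines of `kubectl get networkpolicies` output, minus one header line.
incident_policies = 1882 - 1
print(round(estimate_memory_mb(incident_policies)))
```

Under these assumptions each policy costs about 0.26 MB per calico-node, which puts the incident-time estimate at a bit over 600 MB.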

JMeybohm closed subtask T376976: Remove memory limits from critical cluster components (calico) as Resolved. Oct 15 2024, 2:43 PM

Change #1079314 merged by RLazarus:

[operations/puppet@production] deployment_server: Read Helm secrets in `mwscript-cleanup`

https://gerrit.wikimedia.org/r/1079314

akosiaris added a project: SRE-OnFire. Oct 18 2024, 2:39 PM

Change #1078989 abandoned by RLazarus:

[operations/puppet@production] do not submit: T376795 cleanups

Reason:

Incident is over, no longer needed

https://gerrit.wikimedia.org/r/1078989

jijiki moved this task from Incoming 🐫 to Doing 😎 on the serviceops board. Oct 23 2024, 10:56 AM

dcausse subscribed. Wed, Dec 4, 3:08 PM

The search platform team is working on migrating a set of tools from mwscript to mwscript-k8s (T378382). One of them (the one used to orchestrate a full reindex of the search indices) needs to invoke mwscript multiple times (with possibly 8 runs in parallel) and can trigger up to 2100 runs (number of wikis * 2) in about 3.5 days (best case so far). We might also run this tool for all 3 search clusters we operate, increasing the total number of invocations to 6300, with a maximum of 24 runs in parallel (within 1 to 2 weeks). Given the past incident, I believe this could lead to similar issues if we naively switch to mwscript-k8s. If so, what are our options?
Would it be an option for our tool to take care of cleaning up the k8s resources after each run? mwscript-k8s could output the release it constructs in its JSON output, so that the parent script invoking it could run helmfile destroy, like mwscript-cleanup does (or we could even adapt mwscript-cleanup to allow cleaning a single release).
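
For a sense of scale, the invocation counts in this comment can be combined with the per-job object ratios from the task description (roughly one secret, one network policy and seven configmaps per job). The Python sketch below is back-of-the-envelope arithmetic only. It assumes those ratios still hold, which may no longer be true once the deduplication patches are deployed, and that nothing is cleaned up while the reindex runs. The function name is hypothetical.

```python
# Counts from the task description. They are `wc -l` results, so each
# includes one header line.
INCIDENT = {
    "jobs": 1877,
    "secrets": 1883,
    "networkpolicies": 1882,
    "configmaps": 13170,
}


def projected_objects(invocations, incident=INCIDENT):
    """Scale the incident's per-job object ratios to `invocations` runs."""
    jobs = incident["jobs"] - 1  # drop the header line
    return {
        kind: round((count - 1) / jobs * invocations)
        for kind, count in incident.items()
        if kind != "jobs"
    }


runs_per_cluster = 2100   # number of wikis * 2, from the comment above
clusters = 3
total_runs = runs_per_cluster * clusters
print(total_runs)         # 6300
print(projected_objects(total_runs))
```

With these inputs the projection is on the order of 6,300 secrets and network policies each and about 44,000 configmaps, well above the counts observed during the incident.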

Yes, naively this would be too many invocations at present.

We could easily add the release name to the mwscript-k8s JSON output -- in the meantime, you can also get it from the job's labels via the kube API:

$ kubectl get job ${JOB_NAME?} -o jsonpath={metadata.labels.release}

My suggestion would be to use helmfile destroy rather than adapt mwscript-cleanup, because your criteria for "do we still need this?" will be different. Just make sure to think carefully about those criteria -- destroying the release means you won't have any debugging data left on Kubernetes: logs, timestamps, exit status, etc. Cleaning up early is a fine way to work around the resource-overload problem here, as long as you don't clean up anything you still wanted to look at.
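
A minimal Python sketch of that suggestion, under several assumptions: that the mwscript-k8s JSON output is a single object with a `release` key (the key name is a guess), and that the caller supplies its own functions to launch the script, wait for the job and destroy the release, since the exact commands are not given in this task. `run_and_cleanup`, `keep_on_failure` and the callables are all hypothetical names.

```python
import json


def release_from_output(json_text, key="release"):
    """Extract the release name from JSON output (key name is an assumption)."""
    return json.loads(json_text)[key]


def run_and_cleanup(run_script, wait_for_job, destroy_release,
                    keep_on_failure=True):
    """Run one script invocation and destroy its release afterwards.

    run_script()         -> JSON text describing the launched job
    wait_for_job(rel)    -> True if the job succeeded
    destroy_release(rel) -> removes the release's Kubernetes resources
    """
    release = release_from_output(run_script())
    succeeded = wait_for_job(release)
    # Keep failed releases so logs and exit status remain available.
    if succeeded or not keep_on_failure:
        destroy_release(release)
    return release, succeeded


# Usage with stand-in callables instead of real commands:
destroyed = []
result = run_and_cleanup(
    run_script=lambda: '{"release": "example-release"}',
    wait_for_job=lambda rel: True,
    destroy_release=destroyed.append,
)
print(result, destroyed)
```

Keeping failed releases by default reflects the caveat above: once a release is destroyed, its logs, timestamps and exit status are gone from Kubernetes, so anything still needed for debugging has to be captured first.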

Change #1101607 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] deployment_server: Add release to mwscript-k8s -ojson output

https://gerrit.wikimedia.org/r/1101607

Change #1101607 merged by RLazarus:

[operations/puppet@production] deployment_server: Add release to mwscript-k8s -ojson output

https://gerrit.wikimedia.org/r/1101607
