Paste P17115

kubestage1001 helm, tiller and IO debugging

Authored by Jelto on Aug 31 2021, 12:44 PM.
kubestage1001 has a lot of IO (~95% disk utilization) and network traffic spikes since 2021-08-27 15:00:
https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=kubestage1001&var-datasource=thanos&var-cluster=kubernetes&from=1630022400000&to=1630454399000
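To double-check the disk saturation directly on the host (independent of Grafana), something like the following could be run on kubestage1001; this assumes the sysstat package is available there and was not part of the original debugging session:

# extended per-device stats, 5 second interval, 3 samples; %util near 100 confirms the saturation
iostat -x 5 3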
Two pods in the ci namespace are in ImagePullBackOff, so I assume they are pulling the image quite often:
jelto@kubestagemaster1001:~$ kubectl get pod -n ci -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
mediawiki-bruce-565648dddd-bfdhk 0/2 ImagePullBackOff 1 3d21h 10.64.75.203 kubestage1001.eqiad.wmnet <none> <none>
mediawiki-bruce-f78c5cd48-cnwlk 0/2 ImagePullBackOff 1 3d21h 10.64.75.199 kubestage1001.eqiad.wmnet <none> <none>
tiller-6dcbb48666-v2vbk 1/1 Running 0 3d21h 10.64.75.210 kubestage1001.eqiad.wmnet <none> <none>
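To see how often the image pull is being retried, describing one of the affected pods should list the pull and back-off events with timestamps (pod name taken from the listing above; output not captured here):

# the Events section at the bottom shows "Pulling image ..." / "Back-off pulling image ..." entries
kubectl describe pod mediawiki-bruce-565648dddd-bfdhk -n ci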
The pods of the mediawiki-bruce deployment were created at the same time the IO and network traffic increased.
The pods also have failing readiness probes:
Readiness probe failed: HTTP probe failed with statuscode: 503
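The readiness probe failures and the image pull back-offs should also be visible in the namespace events, sorted by time (a suggestion, not run as part of these notes):

kubectl get events -n ci --sort-by=.lastTimestamp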
I assume the deployment is trying to do a rolling update of the pods, as one pod has a newer mediawiki-multiversion image specified:
jelto@kubestagemaster1001:~$ kubectl get pods -n ci -o=jsonpath='{range .items[*]}{"\n"}{.metadata.name}{":\t"}{range .spec.containers[*]}{.image}{" "}{end}{end}{"\n"}'
mediawiki-bruce-565648dddd-bfdhk: docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2021-08-04-134912-webserver docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2021-08-23-184619-publish
mediawiki-bruce-f78c5cd48-cnwlk: docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2021-08-04-134912-webserver docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2021-08-25-145508-publish
tiller-6dcbb48666-v2vbk: docker-registry.discovery.wmnet/tiller:2.16.7-3
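If that is the case, the rollout should show up as stuck. Something like the following could confirm it (deployment name inferred from the pod names above, not verified):

# rollout status will hang / time out if the new ReplicaSet never becomes ready
kubectl rollout status deployment/mediawiki-bruce -n ci --timeout=30s
kubectl rollout history deployment/mediawiki-bruce -n ci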
Suggestion: temporarily scale the mediawiki-bruce deployment down to 0 replicas and see if the load on kubestage1001 decreases and other services (like tiller) become reachable again.
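If we go with that, the scale-down would be something like the following; the original replica count is not confirmed here, so note it before scaling down:

kubectl scale deployment mediawiki-bruce -n ci --replicas=0
# revert later by scaling back to the original replica count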