Paste P17115

kubestage1001 helm, tiller and IO debugging

Authored by Jelto on Aug 31 2021, 12:44 PM.
kubestage1001 has a lot of IO (~95% disk utilization) and network traffic spikes since 2021-08-27 15:00:
https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=kubestage1001&var-datasource=thanos&var-cluster=kubernetes&from=1630022400000&to=1630454399000
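To double-check the disk saturation directly on the host (independent of Grafana), something like the following could be run on kubestage1001; this assumes the sysstat package is available there and was not part of the original debugging session:

# extended per-device stats, 5 second interval, 3 samples; %util near 100 confirms the saturation
iostat -x 5 3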
Two pods in the ci namespace are in ImagePullBackOff, so I assume they are pulling the image quite often:
jelto@kubestagemaster1001:~$ kubectl get pod -n ci -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
mediawiki-bruce-565648dddd-bfdhk 0/2 ImagePullBackOff 1 3d21h 10.64.75.203 kubestage1001.eqiad.wmnet <none> <none>
mediawiki-bruce-f78c5cd48-cnwlk 0/2 ImagePullBackOff 1 3d21h 10.64.75.199 kubestage1001.eqiad.wmnet <none> <none>
tiller-6dcbb48666-v2vbk 1/1 Running 0 3d21h 10.64.75.210 kubestage1001.eqiad.wmnet <none> <none>
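To see how often the image pull is being retried, describing one of the affected pods should list the pull and back-off events with timestamps (pod name taken from the listing above; output not captured here):

# the Events section at the bottom shows "Pulling image ..." / "Back-off pulling image ..." entries
kubectl describe pod mediawiki-bruce-565648dddd-bfdhk -n ci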
The pods of the mediawiki-bruce deployment were created at the same time the IO and network traffic increased.
The pods also have failing readiness probes:
Readiness probe failed: HTTP probe failed with statuscode: 503
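The readiness probe failures and the image pull back-offs should also be visible in the namespace events, sorted by time (a suggestion, not run as part of these notes):

kubectl get events -n ci --sort-by=.lastTimestamp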
I assume the deployment is trying to do a rolling update of the pods, as one pod has a newer mediawiki-multiversion image specified:
jelto@kubestagemaster1001:~$ kubectl get pods -n ci -o=jsonpath='{range .items[*]}{"\n"}{.metadata.name}{":\t"}{range .spec.containers[*]}{.image}{" "}{end}{end}{"\n"}'
mediawiki-bruce-565648dddd-bfdhk: docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2021-08-04-134912-webserver docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2021-08-23-184619-publish
mediawiki-bruce-f78c5cd48-cnwlk: docker-registry.discovery.wmnet/restricted/mediawiki-webserver:2021-08-04-134912-webserver docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2021-08-25-145508-publish
tiller-6dcbb48666-v2vbk: docker-registry.discovery.wmnet/tiller:2.16.7-3
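If that is the case, the rollout should show up as stuck. Something like the following could confirm it (deployment name inferred from the pod names above, not verified):

# rollout status will hang / time out if the new ReplicaSet never becomes ready
kubectl rollout status deployment/mediawiki-bruce -n ci --timeout=30s
kubectl rollout history deployment/mediawiki-bruce -n ci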
Suggestion: temporarily scale the mediawiki-bruce deployment down to 0 replicas and see if the load on kubestage1001 decreases and other services (like tiller) become reachable again.
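If we go with that, the scale-down would be something like the following; the original replica count is not confirmed here, so note it before scaling down:

kubectl scale deployment mediawiki-bruce -n ci --replicas=0
# revert later by scaling back to the original replica count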