
Some kubernetes webservices stuck in CrashLoopBackOff after cluster restart
Closed, Resolved · Public

Description

Following the rolling restart of the Cloud-VPS infrastructure on 2018-06-06, a large number of Kubernetes-powered webservices in Toolforge remained unavailable. Spot checking revealed that many pods (the unit of work for running a webservice Docker container on Kubernetes) were in the CrashLoopBackOff state, meaning the pod had started, died, and been restarted several times. See the initial list:

$ sudo kubectl get --all-namespaces pods --sort-by='.status.containerStatuses[0].restartCount' -o wide | grep CrashLoopBackOff | awk '{print $1}' | sort | uniq

commonsinterwiki
commons-mass-description-test
comprende
contribgraph
copypatrol
corenlp
costar
coverage
csfd
csp-report
data-design-demo
deep-learning-services
delinker
deskana
detox
devys
dewkin
dnbtools
durl-shortener
earwigbot
earwig-dev
embeddeddata
enwnbot
everythingisconnected
fab-proxy
faces
farhangestan
featured-article
filedupes
file-reuse
file-reuse-piwik
file-reuse-test
file-siblings
fist
five-million
flickr2commons
ft
geograph2commons
gerrit-reviewer-bot
giraffe
glam2commons
glamtools
globalprefs
gmt
grantmetrics
grantmetrics-test
grid-jobs
gsoc
gs
lziad
magnus-toolserver
magog
makeref
maplayers-demo
massmailer
massviews-test
mathqa
matthewrbowker
matthewrbowker-dev
media-reports
mediawiki-mirror
meetbot
merge2pdf
metamine
metricslibrary
missingtopics
monumental
most-readable-pages
most-wanted
multidesc
mwpackages
my-first-django-oauth-app
my-first-flask-oauth-tool
mzmcbride
nagf
newusers
ninthcircuit
niosh
nli-wiki
noclaims
not-in-the-other-language
nppdash
oabot-wd-game
oauth-hello-world
oauthtest
olympics
oojs-ui
opendatasets
openstack-browser-dev
order-user-by-reg
ores
orphantalk
outreachy-user-contribution-tool
outreachy-user-ranking-tool
pagecounts
pagepile
paste
paws-support
peachy-docs
phabricator-bug-status
phabulous
piagetbot
plagiabot
platypus-qa
position-holder-history
prism
proneval-gsoc17
ptable
pub
pywikibot
pywikibot-testwiki
pywikipedia
quarrybot-enwiki
query
quick-intersection
quickstatements
r96340-bot
rank
readmore
reasonator
recitation-bot
toolserver
toolserver-home-archive
tooltranslate
toolviews
tour
translatemplate
twltools
typoscan
uploadhelper-ir
url2commons
usage
userrank
usualsuspects
vendor
verification-pages
versions
video2commons
video2commons-test
watroles
wdmm
wdq2sparql
wdq-checker
wd-rank
widar
wikidata-exports
wikidata-game
wikidata-reconcile
wikidipendenza
wikifactmine-api
wikiinfo
wikilovesdownloads
wikipedia-readability
wikiradio
wikishootme
wikisoba
wikisource-penguin-classics
wikitext-deprecation
wiki-todo
wits
wlmuk
wm-commons-emoji-bot
ws2wd
wscontest
ws-google-ocr
www-portal-builder
xtools-autoedits
xtools-dev
xtools-ec
xtools-pages
yabbr
yellowbot
zhaofeng-test
zoomproof
zppixbot

Spot checking found that a large number of these looping pods were failing due to a missing mount of the /etc/wmcs-project file into the Docker container. The webservice-runner command checks this file to determine which project it is running in (tools vs tools-beta). When the file is not found, the webservice-runner script dies, which in turn kills the Docker container.
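
For a tool in this state, a quick way to confirm whether the running pod actually has the file mounted is to exec into it. This is a minimal sketch using the example pod shown later in this task; the expected output ("tools") is an assumption about the file's contents, and a pod missing the mount would instead fail with "No such file or directory":

$ kubectl get po                                               # find the pod name
$ kubectl exec superzerocool-109243355-jz0iu -- cat /etc/wmcs-project
tools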

Event Timeline

bd808 triaged this task as High priority. Jun 6 2018, 9:39 PM

Triaging as high initially. I have been trying to automate restarts for pods in the CrashLoopBackOff state with some success. The initial 175 went down to 59 after the first pass. A second pass of restarts is happening now.
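
For reference, the automated pass was roughly shaped like the following. This is a sketch only, not the script that was actually run; it assumes each Kubernetes namespace is named after its tool (as in the list above) and that the admin host permits sudo -i -u tools.<name>:

#!/usr/bin/env bash
# Sketch: restart the webservice of every tool that currently has a pod
# stuck in CrashLoopBackOff. The namespace-to-tool mapping and the sudo
# invocation are assumptions for illustration.
sudo kubectl get --all-namespaces pods -o wide |
  grep CrashLoopBackOff |
  awk '{print $1}' |
  sort -u |
  while read -r tool; do
    echo "Restarting webservice for ${tool}"
    sudo -i -u "tools.${tool}" webservice restart
  done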

After the second round of mass restarts, 48 tools are still in CrashLoopBackOff state.

Here is an example of one that seems to be due to the missing mount:

$ sudo become superzerocool
$ kubectl get po
NAME                            READY     STATUS             RESTARTS   AGE
superzerocool-109243355-jz0iu   0/1       CrashLoopBackOff   45         92d
$ kubectl logs po/superzerocool-109243355-jz0iu
Traceback (most recent call last):
  File "/usr/bin/webservice-runner", line 4, in <module>
    from toollabs.common import Tool
  File "/usr/lib/python2.7/dist-packages/toollabs/common/__init__.py", line 1, in <module>
    from toollabs.common.tool import Tool
  File "/usr/lib/python2.7/dist-packages/toollabs/common/tool.py", line 6, in <module>
    with open('/etc/wmcs-project', 'r') as _projectfile:
IOError: [Errno 2] No such file or directory: '/etc/wmcs-project'

Mentioned in SAL (#wikimedia-cloud) [2018-06-06T22:00:41Z] <bd808> Scripting a restart of webservice for tools that are still in CrashLoopBackOff state after 2nd attempt (T196589)

On the third pass I used this helper script to restart only those webservices whose logs show the missing mounted file:

/tmp/klogs
#!/usr/bin/env bash
# Run from the tool account; assumes the tool has a single pod stuck in
# CrashLoopBackOff. Restart the webservice only if that pod's logs
# mention the missing /etc/wmcs-project file.
kubectl logs "po/$(kubectl get po | grep CrashLoopBackOff | awk '{print $1}')" |
  grep wmcs-project &&
  webservice restart

After this ran, there are 19 pods left in CrashLoopBackOff, which seems like a much more reasonable number of broken webservices. The still-broken webservices are:

anagrimes
android-maven-repo
articlerequest-dev
best-image
bldrwnsch
commonshelper
enwnbot
ft
himo
loltools
lyan
not-in-the-other-language
r96340-bot
readmore
sparqlblocks
toolschecker
verification-pages
wikidata-exports
wikisource-penguin-classics
wm-commons-emoji-bot

Mentioned in SAL (#wikimedia-cloud) [2018-06-07T00:06:16Z] <bd808> Restarted webservice. Stuck in CrashLoopBackOff due to T196589

bd808 lowered the priority of this task from High to Medium. Jun 7 2018, 12:33 AM

After more automated and manual cleanup, there are currently 0 pods in CrashLoopBackOff state. Lowering priority of task now. I'll keep it open to see if we have more occurrences of the /etc/wmcs-project mount failure in the near future.

It should not. Prior to the reboots there was one case where I deleted the pod but not the deployment, and the deployment recreated the pod using the old deployment configuration but with the new Docker image, causing the failure. /etc/wmcs-project is specified as a mount option in the webservice command code that configures the deployment, afaict, so webservice restart should update the deployment configuration and fix them.

the deployment recreated the pod using the old deployment configuration but with the new Docker image, causing the failure.

This actually goes a long way toward explaining how we ended up with so many tools in this state. As the reboots rolled across the cluster, the Kubernetes reconciliation loop would have replaced pods that disappeared with new ones based on the existing deployment. The new pods would use the latest Docker image, which in turn would contain the version of webservice-runner that requires /etc/wmcs-project. Any deployment that predated the addition of that file mount would then cause the crashing problem.
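
One way to check which case a given tool is in (sketch only; the assumption that the Deployment is named after the tool follows from the pod names shown above):

$ kubectl get deployment superzerocool -o yaml | grep -B 2 -A 3 wmcs-project

No output would mean the deployment spec predates the mount and still needs to be torn down and recreated.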

What this does not clearly explain is why some tools required multiple webservice restart cycles in order to end up with a deployment that provided the needed mount. I guess I should look into the restart logic we use for the Kubernetes backend and see if it always or only conditionally tears down and recreates the entire deployment.

I guess I should look into the restart logic we use for the Kubernetes backend and see if it always or only conditionally tears down and recreates the entire deployment.

webservice restart calls the request_stop() and request_start() methods on the backend (Kubernetes in the case we are interested in here). The KubernetesBackend implementation of request_stop() tries to delete the Service, Deployment, ReplicaSet, and Pod, in that order, which should clean up everything in the namespace that webservice start initially created. To me this says that only a single webservice restart should have been necessary for each tool.
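
Put another way, the stop/start cycle is roughly equivalent to this hand-run kubectl sequence. It is a sketch under the assumption that the objects are named after the tool and carry a name=<tool> label; the real backend talks to the Kubernetes API directly rather than shelling out:

# run as the tool user, e.g. superzerocool
tool=superzerocool
kubectl delete service "$tool"        # request_stop(): Service first,
kubectl delete deployment "$tool"     # then the Deployment,
kubectl delete rs -l name="$tool"     # then any ReplicaSets,
kubectl delete pods -l name="$tool"   # then any leftover Pods
webservice start                      # request_start(): rebuilds the Deployment and
                                      # Service from current webservice code,
                                      # including the /etc/wmcs-project mount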

There have been a few more CrashLoopBackOff pods, but none of them were caused by the /etc/wmcs-project mount issue.

Vvjjkkii renamed this task from Some kubernetes webservices stuck in CrashLoopBackOff after cluster restart to iibaaaaaaa. Jul 1 2018, 1:05 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed bd808 as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
Jeff_G renamed this task from iibaaaaaaa to Some kubernetes webservices stuck in CrashLoopBackOff after cluster restart. Jul 2 2018, 1:16 AM
Jeff_G closed this task as Resolved.
Jeff_G assigned this task to bd808.
Jeff_G lowered the priority of this task from High to Medium.
Jeff_G updated the task description. (Show Details)
Jeff_G added a subscriber: Aklapper.
Jeff_G subscribed.