
Some kubernetes webservices stuck in CrashLoopBackOff after cluster restart
Closed, Resolved · Public

Description

Following the rolling restart of the Cloud-VPS infrastructure on 2018-06-06, a large number of Kubernetes-powered webservices in Toolforge remained unavailable. Spot checking revealed that many pods (the unit of work for running a webservice Docker container on Kubernetes) were in the CrashLoopBackOff state, meaning the pod had started, died, and been restarted several times. See the initial list:

$ sudo kubectl get --all-namespaces pods --sort-by='.status.containerStatuses[0].restartCount' -o wide | grep CrashLoopBackOff | awk '{print $1}' | sort | uniq

commonsinterwiki
commons-mass-description-test
comprende
contribgraph
copypatrol
corenlp
costar
coverage
csfd
csp-report
data-design-demo
deep-learning-services
delinker
deskana
detox
devys
dewkin
dnbtools
durl-shortener
earwigbot
earwig-dev
embeddeddata
enwnbot
everythingisconnected
fab-proxy
faces
farhangestan
featured-article
filedupes
file-reuse
file-reuse-piwik
file-reuse-test
file-siblings
fist
five-million
flickr2commons
ft
geograph2commons
gerrit-reviewer-bot
giraffe
glam2commons
glamtools
globalprefs
gmt
grantmetrics
grantmetrics-test
grid-jobs
gsoc
gs
lziad
magnus-toolserver
magog
makeref
maplayers-demo
massmailer
massviews-test
mathqa
matthewrbowker
matthewrbowker-dev
media-reports
mediawiki-mirror
meetbot
merge2pdf
metamine
metricslibrary
missingtopics
monumental
most-readable-pages
most-wanted
multidesc
mwpackages
my-first-django-oauth-app
my-first-flask-oauth-tool
mzmcbride
nagf
newusers
ninthcircuit
niosh
nli-wiki
noclaims
not-in-the-other-language
nppdash
oabot-wd-game
oauth-hello-world
oauthtest
olympics
oojs-ui
opendatasets
openstack-browser-dev
order-user-by-reg
ores
orphantalk
outreachy-user-contribution-tool
outreachy-user-ranking-tool
pagecounts
pagepile
paste
paws-support
peachy-docs
phabricator-bug-status
phabulous
piagetbot
plagiabot
platypus-qa
position-holder-history
prism
proneval-gsoc17
ptable
pub
pywikibot
pywikibot-testwiki
pywikipedia
quarrybot-enwiki
query
quick-intersection
quickstatements
r96340-bot
rank
readmore
reasonator
recitation-bot
toolserver
toolserver-home-archive
tooltranslate
toolviews
tour
translatemplate
twltools
typoscan
uploadhelper-ir
url2commons
usage
userrank
usualsuspects
vendor
verification-pages
versions
video2commons
video2commons-test
watroles
wdmm
wdq2sparql
wdq-checker
wd-rank
widar
wikidata-exports
wikidata-game
wikidata-reconcile
wikidipendenza
wikifactmine-api
wikiinfo
wikilovesdownloads
wikipedia-readability
wikiradio
wikishootme
wikisoba
wikisource-penguin-classics
wikitext-deprecation
wiki-todo
wits
wlmuk
wm-commons-emoji-bot
ws2wd
wscontest
ws-google-ocr
www-portal-builder
xtools-autoedits
xtools-dev
xtools-ec
xtools-pages
yabbr
yellowbot
zhaofeng-test
zoomproof
zppixbot

Spot checking found that a large number of these looping pods were failing due to a missing mount of the /etc/wmcs-project file into the Docker container. The webservice-runner command checks this file to determine which project it is running in (tools vs tools-beta). When the file is not found, the webservice-runner script dies, which in turn kills the Docker container.
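
For a tool in this state, a quick way to confirm whether the running pod actually has the file mounted is to exec into it. This is a minimal sketch using the example pod shown later in this task; the expected output ("tools") is an assumption about the file's contents, and a pod missing the mount would instead fail with "No such file or directory":

$ kubectl get po                                               # find the pod name
$ kubectl exec superzerocool-109243355-jz0iu -- cat /etc/wmcs-project
tools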

Event Timeline

bd808 triaged this task as High priority. Jun 6 2018, 9:39 PM

Triaging as high initially. I have been trying to automate restarts for pods in the CrashLoopBackOff state with some success. The initial 175 went down to 59 after the first pass. A second pass of restarts is happening now.
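
For reference, the automated pass was roughly shaped like the following. This is a sketch only, not the script that was actually run; it assumes each Kubernetes namespace is named after its tool (as in the list above) and that the admin host permits sudo -i -u tools.<name>:

#!/usr/bin/env bash
# Sketch: restart the webservice of every tool that currently has a pod
# stuck in CrashLoopBackOff. The namespace-to-tool mapping and the sudo
# invocation are assumptions for illustration.
sudo kubectl get --all-namespaces pods -o wide |
  grep CrashLoopBackOff |
  awk '{print $1}' |
  sort -u |
  while read -r tool; do
    echo "Restarting webservice for ${tool}"
    sudo -i -u "tools.${tool}" webservice restart
  done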

After the second round of mass restarts, 48 tools are still in CrashLoopBackOff state.

Here is an example of one that seems to be due to the missing mount:

$ sudo become superzerocool
$ kubectl get po
NAME                            READY     STATUS             RESTARTS   AGE
superzerocool-109243355-jz0iu   0/1       CrashLoopBackOff   45         92d
$ kubectl logs po/superzerocool-109243355-jz0iu
Traceback (most recent call last):
  File "/usr/bin/webservice-runner", line 4, in <module>
    from toollabs.common import Tool
  File "/usr/lib/python2.7/dist-packages/toollabs/common/__init__.py", line 1, in <module>
    from toollabs.common.tool import Tool
  File "/usr/lib/python2.7/dist-packages/toollabs/common/tool.py", line 6, in <module>
    with open('/etc/wmcs-project', 'r') as _projectfile:
IOError: [Errno 2] No such file or directory: '/etc/wmcs-project'

Mentioned in SAL (#wikimedia-cloud) [2018-06-06T22:00:41Z] <bd808> Scripting a restart of webservice for tools that are still in CrashLoopBackOff state after 2nd attempt (T196589)

On the third pass I used this helper script to restart only those webservices whose logs show the missing mounted file:

/tmp/klogs
#!/usr/bin/env bash
# Run from the tool account; assumes the tool has a single pod stuck in
# CrashLoopBackOff. Restart the webservice only if that pod's logs
# mention the missing /etc/wmcs-project file.
kubectl logs "po/$(kubectl get po | grep CrashLoopBackOff | awk '{print $1}')" |
  grep wmcs-project &&
  webservice restart

After this ran, there are 19 pods left in CrashLoopBackOff, which seems like a much more reasonable number of broken webservices. The still-broken webservices are:

anagrimes
android-maven-repo
articlerequest-dev
best-image
bldrwnsch
commonshelper
enwnbot
ft
himo
loltools
lyan
not-in-the-other-language
r96340-bot
readmore
sparqlblocks
toolschecker
verification-pages
wikidata-exports
wikisource-penguin-classics
wm-commons-emoji-bot

Mentioned in SAL (#wikimedia-cloud) [2018-06-07T00:06:16Z] <bd808> Restarted webservice. Stuck in CrashLoopBackOff due to T196589

bd808 lowered the priority of this task from High to Medium. Jun 7 2018, 12:33 AM

After more automated and manual cleanup, there are currently 0 pods in CrashLoopBackOff state. Lowering priority of task now. I'll keep it open to see if we have more occurrences of the /etc/wmcs-project mount failure in the near future.

It should not. Prior to the reboots there was one case where I deleted the pod but not the deployment, and the deployment recreated the pod using the old deployment configuration but with the new Docker image, causing the failure. /etc/wmcs-project is specified as a mount option in the webservice command code that configures the deployment, afaict, so webservice restart should update the deployment configuration and fix them.

the deployment recreated the pod using the old deployment configuration but with the new Docker image, causing the failure.

This actually goes a long way toward explaining how we ended up with so many tools in this state. As the reboots rolled across the cluster, the Kubernetes reconciliation loop would have replaced pods that disappeared with new ones based on the existing deployment. The new pods would use the latest Docker image, which in turn would contain the version of webservice-runner that requires /etc/wmcs-project. Any deployment that predated the addition of that file mount would then cause the crashing problem.
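
One way to check which case a given tool is in (sketch only; the assumption that the Deployment is named after the tool follows from the pod names shown above):

$ kubectl get deployment superzerocool -o yaml | grep -B 2 -A 3 wmcs-project

No output would mean the deployment spec predates the mount and still needs to be torn down and recreated.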

What this does not clearly explain is why some tools required multiple webservice restart cycles in order to end up with a deployment that provided the needed mount. I guess I should look into the restart logic we use for the Kubernetes backend and see if it always or only conditionally tears down and recreates the entire deployment.

I guess I should look into the restart logic we use for the Kubernetes backend and see if it always or only conditionally tears down and recreates the entire deployment.

webservice restart calls the request_stop() and request_start() methods on the backend (Kubernetes in the case we are interested in here). The KubernetesBackend implementation of request_stop() tries to delete the Service, Deployment, ReplicaSet, and Pod, in that order, which should clean up everything in the namespace that webservice start initially created. To me this says that only a single webservice restart should have been necessary for each tool.
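
Put another way, the stop/start cycle is roughly equivalent to this hand-run kubectl sequence. It is a sketch under the assumption that the objects are named after the tool and carry a name=<tool> label; the real backend talks to the Kubernetes API directly rather than shelling out:

# run as the tool user, e.g. superzerocool
tool=superzerocool
kubectl delete service "$tool"        # request_stop(): Service first,
kubectl delete deployment "$tool"     # then the Deployment,
kubectl delete rs -l name="$tool"     # then any ReplicaSets,
kubectl delete pods -l name="$tool"   # then any leftover Pods
webservice start                      # request_start(): rebuilds the Deployment and
                                      # Service from current webservice code,
                                      # including the /etc/wmcs-project mount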

There have been a few more CrashLoopBackOff pods, but none of them were caused by the /etc/wmcs-project mount issue.

Vvjjkkii renamed this task from Some kubernetes webservices stuck in CrashLoopBackOff after cluster restart to iibaaaaaaa. Jul 1 2018, 1:05 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed bd808 as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
Jeff_G renamed this task from iibaaaaaaa to Some kubernetes webservices stuck in CrashLoopBackOff after cluster restart. Jul 2 2018, 1:16 AM
Jeff_G closed this task as Resolved.
Jeff_G assigned this task to bd808.
Jeff_G lowered the priority of this task from High to Medium.
Jeff_G updated the task description. (Show Details)
Jeff_G added a subscriber: Aklapper.
Jeff_G subscribed.