Termbox in labs should be able to recover crashes
Closed, Resolved · Public · 3 Estimated Story Points

Description

Termbox in labs was down for 218 hours, 53 minutes and 43 seconds. As Wikidata's lovely incident manager this week, I brought it back after googling the error output of the systemd service. It turns out you need to restart the Docker daemon before restarting the termbox daemon; otherwise Docker's internal networking gets confused and fails with errors like:

ladsgroup@wikidata-misc:~$ sudo /usr/bin/docker run --restart=always --name=systemd_termbox_test -e STATSD_HOST=labmon1001.eqiad.wmnet -e LOGSTASH_HOST=deployment-logstash2.eqiad.wmflabs -e WIKIBASE_REPO=https://wikidata.beta.wmflabs.org/w -e WIKIBASE_REPO_HOSTNAME_ALIAS=wikidata.beta.wmflabs.org -e SSR_PORT=3030 -p=3030:3031 wmde/wikibase-termbox-production:latest
/usr/bin/docker: Error response from daemon: driver failed programming external connectivity on endpoint systemd_termbox_test (13ec853e0d2772b8c7ca414ff56b39a2ca698bc3b6c9688134c31a3550db5a40):  (iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 3030 -j DNAT --to-destination 172.17.0.2:3031 ! -i docker0: iptables: No chain/target/match by that name.
 (exit status 1)).

You need to add something like "sudo service docker restart" before the updater tries to restart the termbox daemon, either in the updater itself or in the Ansible settings.
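A minimal sketch of that ordering, not a tested fix. It assumes the termbox daemon runs as a service called termbox (this task does not give the name), and the RUNNER and TERMBOX_UNIT variables are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: restart the Docker daemon first, then the termbox service.
# TERMBOX_UNIT is an assumed name; adjust it to the real unit.
# Set RUNNER=echo to print the commands instead of running them.
RUNNER="${RUNNER:-sudo}"
TERMBOX_UNIT="${TERMBOX_UNIT:-termbox}"

restart_stack() {
    # On startup the Docker daemon normally recreates its DOCKER chain
    # in the iptables nat table, which the error above says is missing.
    $RUNNER service docker restart || return 1
    # Only then restart the service that publishes port 3030.
    $RUNNER service "$TERMBOX_UNIT" restart
}
```

Calling restart_stack with RUNNER=echo shows the two commands in order without touching the machine.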

Event Timeline

Surprised (after a full 3 seconds of thinking about it) this is not covered by this.

My systemd knowledge is very basic, but IIRC "Requires" basically tells init in which order to load services on system restart; it's not about service restarts. (That makes sense too: if a child process that depends on a parent process crashes, the parent process should not crash or restart; it should stay reliable for its children.)

I *think* the problem was that the Docker daemon had crashed, not that the termbox container had crashed, probably because the machine was rebooted at some point. I'm not sure we want the updater to restart the Docker service every time it runs.

The only suggestion I have is to add a Docker restart to the Ansible playbook, to make fixing this easier if it happens again. I suspect it happened during some maintenance of the VM or host machine.

Otherwise we could even just leave this and not attempt to fix on the basis that the machine should very rarely be hard reset.

Addshore moved this task from Incoming to Ready to estimate on the Wikidata-Campsite board.

We should pick this up to avoid future surprises, although I'm marking it as Low since we now have monitoring for this in our Mattermost.

Did a quick bit of googling in the campsite story time, perhaps this is what is needed?

PropagatesReloadTo=, ReloadPropagatedFrom=
A space-separated list of one or more units where reload requests on this unit will be propagated to, or reload requests on the other unit will be propagated to this unit, respectively. Issuing a reload request on a unit will automatically also enqueue a reload request on all units that the reload request shall be propagated to via these two settings.

From https://www.freedesktop.org/software/systemd/man/systemd.unit.html

Addshore set the point value for this task to 3. (Dec 10 2019, 1:31 PM)
Addshore moved this task from Ready to estimate to Ready to pick up on the Wikidata-Campsite board.

Mentioned in SAL (#wikimedia-cloud) [2019-12-18T19:47:16Z] <Amir1> hard reboot of wikidata-misc to test recovering from crashes T235069

PropagatesReloadTo= doesn't work; I tried it. Maybe something else would work. I will try later.

That's because reload and restart are different things in systemd: https://askubuntu.com/a/479374
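For context: a reload only asks a running service to re-read its configuration, while a restart stops and starts it, so PropagatesReloadTo= is never triggered by a restart. systemd's PartOf= setting does propagate stop and restart actions from the listed unit. The sketch below prints a drop-in that uses it. The unit name termbox.service and the drop-in path are assumptions, and nobody has tried this on wikidata-misc:

```shell
#!/bin/sh
# Hypothetical drop-in for an assumed termbox.service unit.
# PartOf= propagates a stop or restart of docker.service to this unit;
# After= orders startup so that Docker is started first.
print_termbox_dropin() {
    printf '%s\n' \
        '[Unit]' \
        'After=docker.service' \
        'PartOf=docker.service'
}

# Example use (the path is an assumption):
#   print_termbox_dropin | sudo tee /etc/systemd/system/termbox.service.d/docker.conf
#   sudo systemctl daemon-reload
```

Whether this helps after a hard reboot is an open question; it only covers the case where docker.service itself gets restarted.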

Why is nothing as simple as it should be :/

To reproduce the issue, I hard-restarted wikidata-misc, but "sudo service docker restart" doesn't fix the problem, while "sudo docker stop 157bb9c9303f" does. That's interesting.
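The manual step could be scripted roughly like this. It is only a sketch and not the contents of restarter.sh. The container name is taken from the docker run command in the description, and the DOCKER and CONTAINER_NAME variables are made up for illustration:

```shell
#!/bin/sh
# Hypothetical recovery step: stop leftover termbox containers by name
# instead of looking up the container ID by hand.
DOCKER="${DOCKER:-docker}"
CONTAINER_NAME="${CONTAINER_NAME:-systemd_termbox_test}"

stop_stale_containers() {
    # List IDs of all containers (running or not) whose name matches.
    ids="$($DOCKER ps -aq --filter "name=$CONTAINER_NAME")" || return 1
    for id in $ids; do
        # Stop each match, as was done manually with "docker stop <id>".
        $DOCKER stop "$id" || return 1
    done
}
```

Run it with sufficient privileges (for example as root) before restarting the termbox service.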

Change 559967 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[wikibase/termbox@master] Make restarter.sh more robust to docker crashes

https://gerrit.wikimedia.org/r/559967

Change 559967 merged by jenkins-bot:
[wikibase/termbox@master] Make restarter.sh more robust to docker crashes

https://gerrit.wikimedia.org/r/559967