Termbox in labs should be able to recover crashes
Closed, Resolved · Public · 3 Estimated Story Points

Description

Termbox in labs was down for 218 hours, 53 minutes and 43 seconds. As Wikidata's lovely incident manager this week, I brought it back by googling the error output of the systemd service. The answer was that you need to restart the docker daemon before restarting the termbox daemon; otherwise docker's internal networking gets confused and produces errors like:

ladsgroup@wikidata-misc:~$ sudo /usr/bin/docker run --restart=always --name=systemd_termbox_test -e STATSD_HOST=labmon1001.eqiad.wmnet -e LOGSTASH_HOST=deployment-logstash2.eqiad.wmflabs -e WIKIBASE_REPO=https://wikidata.beta.wmflabs.org/w -e WIKIBASE_REPO_HOSTNAME_ALIAS=wikidata.beta.wmflabs.org -e SSR_PORT=3030 -p=3030:3031 wmde/wikibase-termbox-production:latest
/usr/bin/docker: Error response from daemon: driver failed programming external connectivity on endpoint systemd_termbox_test (13ec853e0d2772b8c7ca414ff56b39a2ca698bc3b6c9688134c31a3550db5a40):  (iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 3030 -j DNAT --to-destination 172.17.0.2:3031 ! -i docker0: iptables: No chain/target/match by that name.
 (exit status 1)).

You need to add something like "sudo service docker restart" before the updater tries to restart the termbox daemon, either in the updater or in the ansible settings.
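A minimal sketch of that ordering, assuming the termbox container is managed by a systemd unit. The unit name termbox.service and the helper names run_cmd and restart_termbox are placeholders for illustration, not taken from the real updater or playbook. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch only: run_cmd and restart_termbox are illustrative names, and
# "termbox.service" is an assumed unit name, not taken from the real setup.

# Print the command when DRY_RUN=1 (the default here), execute it otherwise.
run_cmd() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Restart the Docker daemon first so it recreates its iptables chains,
# then restart the service that runs the termbox container.
restart_termbox() {
    run_cmd service docker restart &&
    run_cmd systemctl restart "${TERMBOX_UNIT:-termbox.service}"
}
```

Calling restart_termbox as root with DRY_RUN=0 would run both commands for real; the && means the termbox restart is skipped if restarting docker fails.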

Details

	Subject	Repo	Branch	Lines +/-
	Make restarter.sh more robust to docker crashes	wikibase/termbox	master	+5 -2


Related Objects

Mentioned In: rWBTB92570ff324d6: Make restarter.sh more robust to docker crashes
T235041: Fix termbox ssr on beta

Event Timeline

Ladsgroup created this task. Oct 9 2019, 1:35 PM

Restricted Application added a subscriber: Aklapper. Oct 9 2019, 1:35 PM

Surprised (after a full 3 seconds of thinking about it) this is not covered by this.

In T235069#5559609, @Pablo-WMDE wrote:

Surprised (after a full 3 seconds of thinking about it) this is not covered by this.

My systemd knowledge is very basic, but IIRC the "requires" basically tells init in which order it should load services on system restarts; it's not about service restarts. (That makes sense too: if a child process that depends on a parent process crashes, the parent process should not crash or restart or whatever. It should be reliable for its children.)

Ladsgroup mentioned this in T235041: Fix termbox ssr on beta. Oct 9 2019, 5:37 PM

I *think* that the problem was that the docker daemon had crashed, not that the termbox container had crashed. I think this was probably because the machine was rebooted at some point. I'm not sure we want the updater to restart the docker service every time.

The only suggestion I have is adding a docker restart to the ansible playbook to make fixing this easier if it happens again. I suspect it happened during some maintenance of the VM or host machine.

Otherwise we could even just leave this and not attempt to fix on the basis that the machine should very rarely be hard reset.
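One way to avoid restarting docker on every run would be to check first for the DOCKER chain that the iptables error in the description reports as missing, and restart only when it is gone. This is a rough sketch under that assumption; the function names are made up for illustration, it expects to run as root, and it defaults to a dry run that only reports what it would do.

```shell
#!/bin/sh
# Sketch only: function names are illustrative; assumes it runs as root.

# Succeeds if the nat table still has the DOCKER chain that the error
# message in the description reports as missing.
docker_chain_present() {
    iptables --wait -t nat -n -L DOCKER >/dev/null 2>&1
}

# Restart the Docker daemon only when the chain is gone.
# With DRY_RUN=1 (the default here) it only reports what it would do.
restart_docker_if_needed() {
    if docker_chain_present; then
        echo "DOCKER chain present, leaving docker alone"
    else
        echo "DOCKER chain missing, restarting docker"
        if [ "${DRY_RUN:-1}" != "1" ]; then
            service docker restart
        fi
    fi
}
```

A missing iptables binary is treated the same as a missing chain, so on a host without iptables the check always reports the chain as gone.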

We should pick this up to avoid future surprises, although I'm marking it as Low since we now have monitoring for this in our Mattermost.

Addshore moved this task from incoming to ready to go on the Wikidata board. Oct 30 2019, 1:53 PM

Did a quick bit of googling in the campsite story time, perhaps this is what is needed?

PropagatesReloadTo=, ReloadPropagatedFrom=
A space-separated list of one or more units where reload requests on this unit will be propagated to, or reload requests on the other unit will be propagated to this unit, respectively. Issuing a reload request on a unit will automatically also enqueue a reload request on all units that the reload request shall be propagated to via these two settings.

From https://www.freedesktop.org/software/systemd/man/systemd.unit.html

Addshore set the point value for this task to 3. Dec 10 2019, 1:31 PM

Addshore moved this task from Ready to estimate to Ready to pick up on the Wikidata-Campsite board.

Mentioned in SAL (#wikimedia-cloud) [2019-12-18T19:47:16Z] <Amir1> hard reboot of wikidata-misc to test recovering from crashes T235069

In T235069#5727949, @Addshore wrote:

Did a quick bit of googling in the campsite story time, perhaps this is what is needed?

PropagatesReloadTo=, ReloadPropagatedFrom=
A space-separated list of one or more units where reload requests on this unit will be propagated to, or reload requests on the other unit will be propagated to this unit, respectively. Issuing a reload request on a unit will automatically also enqueue a reload request on all units that the reload request shall be propagated to via these two settings.

From https://www.freedesktop.org/software/systemd/man/systemd.unit.html

This doesn't work, I tried it. Maybe something else would work. I will try later.

That's because reload and restart are different things in systemd: https://askubuntu.com/a/479374

Why is nothing as simple as it should be :/

In order to reproduce the issue, I hard restarted wikidata-misc, but "sudo service docker restart" doesn't fix the problem, while "sudo docker stop 157bb9c9303f" does. That's interesting.
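The observation above suggests stopping the leftover container before starting a fresh one. The following is a rough sketch of that idea only, not the contents of the merged restarter.sh patch. The container name is taken from the docker run command in the description, the function names are hypothetical, and by default it is a dry run that only lists what it would stop.

```shell
#!/bin/sh
# Sketch only, not the contents of the merged restarter.sh patch.
# CONTAINER_NAME comes from the docker run command in the description;
# stale_container_ids and stop_stale_containers are illustrative names.
CONTAINER_NAME=${CONTAINER_NAME:-systemd_termbox_test}

# List IDs of all containers (running or not) whose name matches.
stale_container_ids() {
    docker ps -aq --filter "name=$CONTAINER_NAME"
}

# Stop each leftover container so that a fresh one can be started.
# With DRY_RUN=1 (the default here) it only prints the IDs.
stop_stale_containers() {
    for id in $(stale_container_ids); do
        echo "stopping $id"
        if [ "${DRY_RUN:-1}" != "1" ]; then
            docker stop "$id"
        fi
    done
}
```

Note that the name filter in docker ps matches substrings, so a more specific name may be needed on a host that runs several similarly named containers.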

Change 559967 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[wikibase/termbox@master] Make restarter.sh more robust to docker crashes

https://gerrit.wikimedia.org/r/559967

gerritbot added a project: Patch-For-Review. Dec 20 2019, 10:45 PM

Ladsgroup claimed this task. Dec 20 2019, 10:46 PM

Ladsgroup edited projects, added Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)); removed Wikidata-Campsite.

Restricted Application added a project: User-Ladsgroup. Dec 20 2019, 10:46 PM

Ladsgroup moved this task from To Do (prioritised from top to bottom) to Peer Review on the Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)) board. Dec 20 2019, 10:46 PM

Change 559967 merged by jenkins-bot:
[wikibase/termbox@master] Make restarter.sh more robust to docker crashes

https://gerrit.wikimedia.org/r/559967

Ladsgroup mentioned this in rWBTB92570ff324d6: Make restarter.sh more robust to docker crashes. Dec 23 2019, 10:08 AM

Maintenance_bot removed a project: Patch-For-Review. Dec 23 2019, 10:10 AM

Addshore closed this task as Resolved. Dec 23 2019, 1:06 PM

Addshore moved this task from Peer Review to Test (Verification) on the Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)) board.
