Codesearch is down
Closed, Resolved, Public


_health reports "down" for every entry. The entries appear to be restarting periodically: every few seconds, a couple of them show "pre-start" and then go down again.

Event Timeline

Daimona triaged this task as Unbreak Now! priority. Feb 22 2020, 4:44 PM

UBN for visibility

I see this in the journal:

Feb 22 16:41:32 codesearch6 docker[11732]: /usr/bin/docker: Error response from daemon: driver failed programming external connectivity on endpoint hound-search (138bd3676b62adf0adf815df25eff3439ef453a24adef983b
Feb 22 16:41:32 codesearch6 docker[11732]:  (exit status 1)).
Feb 22 16:41:32 codesearch6 docker[11732]: time="2020-02-22T16:41:32Z" level=error msg="error waiting for container: context canceled"

Debugging atm

And this in the Docker daemon journal:

Feb 22 16:52:37 codesearch6 dockerd[624]: time="2020-02-22T16:52:37.347262611Z" level=warning msg="Failed to allocate and map port 6092-6092:  (iptables failed: iptables --wait -t nat -A DOCKER -p tcp -d 0/0 --dport 6092 -j DNAT --to-destination ! -i docker0: iptables: No chain/target/match by that name.\n (exit status 1))"
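If this recurs, the warning above can be picked out of the journal mechanically. Below is a minimal Python sketch, assuming the dockerd journal has already been exported as plain text (for example with `journalctl -u docker`). The function name `ports_with_missing_nat_chain` is hypothetical, and it only matches the exact message format quoted above. Reading "No chain/target/match by that name" as a missing nat `DOCKER` chain is an interpretation of the failed `iptables -A DOCKER` call, not a confirmed diagnosis.

```python
import re
from typing import Iterable, Set

# Matches the dockerd warning quoted above, e.g. "Failed to allocate and map port 6092-6092".
PORT_RE = re.compile(r"Failed to allocate and map port (\d+)-(\d+)")
# iptables error text seen when appending to the nat DOCKER chain failed.
MISSING_CHAIN = "No chain/target/match by that name"


def ports_with_missing_nat_chain(lines: Iterable[str]) -> Set[int]:
    """Return host ports whose port mapping failed with the iptables
    'No chain/target/match' error (presumably a missing DOCKER chain)."""
    ports: Set[int] = set()
    for line in lines:
        # Skip port-mapping failures that have some other cause.
        if MISSING_CHAIN not in line:
            continue
        match = PORT_RE.search(line)
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            # The message reports a range; expand it to individual ports.
            ports.update(range(start, end + 1))
    return ports
```

Fed the codesearch6 line above, this would report port 6092. An empty result on a broken host would point away from the iptables theory.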

Mentioned in SAL (#wikimedia-cloud) [2020-02-22T16:54:57Z] <Amir1> hard reboot of codesearch6 (T245920)

A hard reboot fixed it. Things are coming back online slowly, but skins is already up. If this is a recurring issue, it will go down again soon; if that happens, feel free to reopen the task and we will investigate in depth.

Thanks @Ladsgroup.

Looking around, this might just be a general Docker bug that we somehow triggered: