
ircecho doesn't reconnect on failure
Closed, Resolved (Public)

Description

In the last 3 days, icinga-wm was not in the operations channel (or any of the other channels it usually connects to), so none of the IRC notifications were delivered.

The errors in the host's log just show:

Jan  1 23:15:16 einsteinium ircecho[11960]: Error writing: Not connected.Dropping this message: "PROBLEM - Host logstash1008 is DOWN: PING CRITICAL - Pac
ket loss = 100%"
Jan  1 23:15:16 einsteinium ircecho[11960]: Error writing: Not connected.Dropping this message: "PROBLEM - Host chlorine is DOWN: PING CRITICAL - Packet loss = 100%"
Jan  1 23:15:16 einsteinium ircecho[11960]: Error writing: Not connected.Dropping this message: "PROBLEM - Host bohrium is DOWN: PING CRITICAL - Packet loss = 100%"
Jan  1 23:15:16 einsteinium ircecho[11960]: Error writing: Not connected.Dropping this message: "PROBLEM - Host planet1001 is DOWN: PING CRITICAL - Packet loss = 100%"
....

Given the kind of messages, my first guess is that the host had a brief connectivity issue and ircecho failed to reconnect afterwards. Our Puppetization was also unable to detect this and restart ircecho on the next run.

Event Timeline

Volans triaged this task as High priority. Jan 3 2018, 7:06 PM
Volans added a subscriber: cwdent.

Thanks to @cwdent for notifying us.

After some minor changes here and there, I attached gdb to the process; after forcefully closing the TCP connection to the IRC server, we get:

(gdb) bt
#0  0x00007fd89967b010 in sem_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x000000000057ba10 in PyThread_acquire_lock (waitflag=1, lock=0x16862b0) at ../Python/thread_pthread.h:324
#2  lock_PyThread_acquire_lock.lto_priv.2568 () at ../Modules/threadmodule.c:52
...

It looks like the main thread, the one that runs the IRC bot itself, is stuck waiting on a semaphore from some other thread.
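
For reference, the same hang can also be inspected from the Python side without gdb. This is a generic sketch (not part of ircecho), assuming some still-running thread, e.g. a watchdog, calls it:

import sys
import threading
import traceback

def dump_thread_stacks():
    # Print the Python-level stack of every thread, roughly what
    # gdb shows at the C level, without attaching a debugger.
    frames = sys._current_frames()
    for thread in threading.enumerate():
        frame = frames.get(thread.ident)
        if frame is not None:
            print('Thread %s (%s):' % (thread.ident, thread.name))
            traceback.print_stack(frame)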

A cursory look at the code shows that it does use mutexes [1] to make it thread safe,

https://github.com/jaraco/irc/blob/60f747a578157b03c952c050c765e2924d4fdc00/irc/client.py#L193

and these are used in various parts of the code, so the possibility that some deadlock is triggered does exist.
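
For illustration, this is the classic shape of such a deadlock: two threads taking two locks in opposite order (the worker functions are hypothetical, not the actual ircecho or irc library code path):

import threading
import time

lock_a = threading.Lock()
lock_b = threading.Lock()

def worker_1():
    with lock_a:
        time.sleep(0.1)   # widen the race window
        with lock_b:      # waits for worker_2 ...
            pass

def worker_2():
    with lock_b:
        time.sleep(0.1)
        with lock_a:      # ... which waits for worker_1: deadlock
            pass

t1 = threading.Thread(target=worker_1)
t2 = threading.Thread(target=worker_2)
t1.start(); t2.start()
t1.join(); t2.join()      # never returns: both threads are stuck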

Change 402081 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] ircecho: Remove redundant thread

https://gerrit.wikimedia.org/r/402081

Change 402101 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] ircecho: Force unbuffered stdin/stdout/stderr

https://gerrit.wikimedia.org/r/402101

Change 402343 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] ircecho: Normalize print statements

https://gerrit.wikimedia.org/r/402343

Change 402344 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] ircecho: set EchoNotifier threads as daemon

https://gerrit.wikimedia.org/r/402344

After some experimentation it looks like the main thread is just waiting for the other threads to terminate. Under normal conditions this can never happen, as they are effectively infinite loops. Setting them as daemon threads [1] allows the main thread to terminate successfully, and the entire Python application to exit as well. Then systemd will take over and restart it.

[1] https://docs.python.org/2/library/threading.html#threading.Thread.daemon
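
A minimal sketch of the mechanism (notifier_loop is a hypothetical stand-in for an EchoNotifier worker, not the actual patch):

import threading
import time

def notifier_loop():
    # Stand-in for an EchoNotifier worker: an effectively infinite loop.
    while True:
        time.sleep(1)

t = threading.Thread(target=notifier_loop)
t.daemon = True   # must be set before start(); without it, interpreter
t.start()         # shutdown blocks forever waiting for notifier_loop

# The main thread ends here: with daemon=True the process exits (and
# systemd can restart it); with daemon=False it hangs as seen in gdb.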

Change 402081 merged by Alexandros Kosiaris:
[operations/puppet@production] ircecho: Remove redundant thread

https://gerrit.wikimedia.org/r/402081

Change 402101 merged by Alexandros Kosiaris:
[operations/puppet@production] ircecho: Force unbuffered stdin/stdout/stderr

https://gerrit.wikimedia.org/r/402101

Change 402343 merged by Alexandros Kosiaris:
[operations/puppet@production] ircecho: Normalize print statements

https://gerrit.wikimedia.org/r/402343

Change 402344 merged by Alexandros Kosiaris:
[operations/puppet@production] ircecho: set EchoNotifier threads as daemon

https://gerrit.wikimedia.org/r/402344

akosiaris claimed this task.

This should finally be resolved with the above changes.

This comment was removed by Volans.
Volans added a subscriber: Dzahn.

This was a misunderstanding on my side; @Dzahn actually stopped it manually.