The ArcLamp service on webperf1002 was effectively down from 18 December 2018 until 6 February 2019.
I saw that the process was still running, without errors. After some digging (thanks @aaron, @Volans), we learned that it had stopped doing anything because its TCP connection to Redis had been dropped somehow.
I'll write more about this in a blog post, but the short version is: redis-py doesn't send regular pings on its pubsub connection. This means that the for loop we have in xenon-log will wait forever for a message that never comes, because the connection is dead.
- To make debugging easier next time, make sure gdb-python is installed on all webperf servers. Perhaps on all WMF production servers that have roles involving Python programs?
- Figure out how we can fix the way we use redis-pubsub so that we detect dropped connections in the future, and let the program exit in that case. That way, systemd will automatically restart the service and create a new connection.
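
One possible shape for the second item, as a rough sketch rather than the actual fix: poll with a timeout instead of blocking in the for loop, and give up once the channel has been silent for too long. This assumes the channel normally carries traffic at least every `max_idle` seconds, so prolonged silence can be treated as a dead connection. The names `consume`, `StaleConnectionError`, `max_idle`, `poll_timeout` and `clock` are made up for illustration; the only redis-py call relied on is `get_message(timeout=...)` on the pubsub object.

```python
import time


class StaleConnectionError(Exception):
    """Hypothetical error: no pubsub traffic for longer than max_idle."""


def consume(pubsub, handle, max_idle=60.0, poll_timeout=1.0, clock=time.monotonic):
    """Poll a pubsub-like object and raise if it stays silent too long.

    `pubsub` only needs a get_message(timeout=...) method that returns a
    dict, or None when nothing arrived within the timeout.
    Assumption: a healthy channel delivers something at least every
    `max_idle` seconds, so longer silence is treated as a dropped connection.
    """
    last_seen = clock()
    while True:
        message = pubsub.get_message(timeout=poll_timeout)
        if message is not None:
            # Any traffic, including subscribe confirmations, proves the
            # connection is alive.
            last_seen = clock()
            if message.get("type") == "message":
                handle(message["data"])
        elif clock() - last_seen > max_idle:
            # Do not try to recover here; let the caller exit so that the
            # service manager starts a fresh process and connection.
            raise StaleConnectionError(
                "no pubsub traffic for more than %s seconds" % max_idle
            )
```

With redis-py, `pubsub` would be the object returned by `client.pubsub()` after calling `subscribe(...)` on it. If the exception is left uncaught, the process exits with a non-zero status, which systemd can act on if the unit's restart policy covers failures (an assumption about the unit file, not something checked here). A sensible `max_idle` depends on how quiet the channel can legitimately get.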
The code lives in the Puppet repository (https://github.com/wikimedia/puppet/blob/15e1bd5797a72d041ddf8d2e981a8f0ff4b231b7/modules/arclamp/files/arclamp-log.py), not in the main GitHub repo.