
Create Icinga check for ArcLamp (xenon-log) service health
Open, Normal, Public

Description

The ArcLamp service on webperf1002 was effectively down from 18 December 2018 until 6 February 2019.

I saw that the process was still running, and without errors. After some digging (thanks @aaron, @Volans), we learned that it had stopped doing anything because the TCP connection to Redis had been dropped somehow.

I'll write more about this in a blog post, but the short version is: redis-py doesn't do regular pings as part of its pubsub connection. This means that the for loop we have in xenon-log will just wait forever for a message that will never come, because the connection's dead.

Action items

  • To make debugging easier next time, make sure gdb-python is installed on all webperf servers. Perhaps on all WMF production servers that have roles involving Python programs?
  • Figure out how to fix the way we use redis-pubsub so that we detect dropped connections in the future and let the program exit in that case. That way, systemd will automatically restart the service and create a new connection.
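
One possible shape for the second action item, as a rough sketch rather than the actual arclamp-log code: poll with a timeout instead of blocking forever, and give up once the subscription has been silent for too long. It assumes the channel is normally busy enough that prolonged silence means the connection is dead. The names `consume`, `StaleConnectionError`, `max_idle` and `poll_timeout` are made up for illustration. The `pubsub` argument only needs a `get_message(timeout=...)` method that returns a message dict, or None when nothing arrived in time, which is how redis-py's PubSub objects behave.

```python
import time


class StaleConnectionError(RuntimeError):
    """Raised when no pub/sub traffic was seen within the idle window."""


def consume(pubsub, handle, max_idle=60.0, poll_timeout=1.0, clock=time.monotonic):
    """Pass published messages to handle() until the subscription goes quiet.

    Hypothetical helper: it never returns normally. It raises
    StaleConnectionError once more than max_idle seconds pass without
    any message, so the caller can exit and be restarted.
    """
    last_seen = clock()
    while True:
        # Wait at most poll_timeout seconds instead of blocking indefinitely.
        message = pubsub.get_message(timeout=poll_timeout)
        now = clock()
        if message is None:
            if now - last_seen > max_idle:
                raise StaleConnectionError(
                    f"no pub/sub traffic for {now - last_seen:.0f}s"
                )
            continue
        # Any traffic, including subscribe confirmations, proves the link is alive.
        last_seen = now
        if message.get("type") == "message":
            handle(message)
```

In the daemon, the exception could simply be left uncaught (or turned into a non-zero exit), so that systemd restarts the service and a fresh connection is made. The right value for `max_idle` depends on how quiet the channel can legitimately get, which this sketch does not know.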

Note

The code is in puppet (https://github.com/wikimedia/puppet/blob/15e1bd5797a72d041ddf8d2e981a8f0ff4b231b7/modules/arclamp/files/arclamp-log.py) and not the main github repo.

Event Timeline

Krinkle created this task. Feb 10 2019, 9:35 PM
Restricted Application added a subscriber: Aklapper. Feb 10 2019, 9:35 PM

I recall something like this when testing around with the old python-memcached-relay daemon experiment (before we decided on mcrouter instead). That used non-blocking checks and conditional sleep (polling), but since there is only one server here, the message() timeout could work. Something like https://github.com/andymccurdy/redis-py/issues/631 with try/except for redis.TimeoutError and resubscribe logic should be doable.

aaron triaged this task as Normal priority. Jun 6 2019, 10:34 AM

Part of annual plan, probably Q2 or Q3.

Krinkle added a subscriber: Gilles.
From T217968

Idea:

  • Icinga check asserting that /srv/xenon/logs/hourly and /srv/xenon/logs/daily have files less than 5 minutes old. (relating to the xenon-log daemon)
  • Icinga check asserting that /srv/xenon/logs/daily and /srv/xenon/svgs/daily have files less than 20 minutes old. (relating to the xenon-generate-svgs 15min cron)
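
A freshness check along these lines could look roughly like the sketch below. It is only an illustration of the idea, not an existing WMF check: the function names are invented, and the only convention relied on is the usual Nagios/Icinga plugin exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN). It looks at the newest regular file directly inside one directory and compares its mtime against a maximum age in seconds.

```python
import os
import time

# Standard Nagios/Icinga plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def newest_mtime(directory):
    """Return the newest mtime among regular files in directory, or None."""
    mtimes = [e.stat().st_mtime for e in os.scandir(directory) if e.is_file()]
    return max(mtimes) if mtimes else None


def check_fresh(directory, max_age_seconds, now=None):
    """Return (exit_code, status_line) for a directory freshness check."""
    now = time.time() if now is None else now
    try:
        newest = newest_mtime(directory)
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN: cannot read {directory}: {exc}"
    if newest is None:
        return CRITICAL, f"CRITICAL: no files in {directory}"
    age = now - newest
    if age >= max_age_seconds:
        return CRITICAL, f"CRITICAL: newest file in {directory} is {age:.0f}s old"
    return OK, f"OK: newest file in {directory} is {age:.0f}s old"


def main(argv):
    """Hypothetical entry point: argv is [directory, max_age_seconds]."""
    directory, max_age = argv
    code, text = check_fresh(directory, float(max_age))
    print(text)
    return code
```

A wrapper script would call `sys.exit(main(sys.argv[1:]))`, for example once with `/srv/xenon/logs/hourly 300` for the xenon-log daemon and once with `/srv/xenon/svgs/daily 1200` for the SVG cron. One check per directory keeps each alert specific about which stage stalled.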
aaron updated the task description. Jun 18 2019, 10:09 AM

Mentioned in SAL (#wikimedia-operations) [2019-06-28T01:22:04Z] <Krinkle> Killing arclamp-log on webperf1002, no flame graphs for three days, presumably mwlog/redis connection dropped again. T215740

Dzahn added a subscriber: Dzahn. Jun 28 2019, 1:39 AM