
Create Icinga check for ArcLamp (xenon-log) service health
Open, Medium · Public

Description

The ArcLamp service on webperf1002 was effectively down from 18 December 2018 until 6 February 2019.

I saw that the process was still running, without errors. After some digging (thanks @aaron, @Volans), we learned that it had stopped doing anything because its TCP connection to Redis had been dropped somehow.

I'll write more about this in a blog post, but the short version is: redis-py doesn't do regular pings as part of its pubsub connection. This means that the for loop we have in xenon-log will just wait forever for a message that will never come, because the connection's dead.

Action items

  • To make debugging easier next time, make sure gdb-python is installed on all webperf servers. Perhaps on all WMF production servers that have roles involving Python programs?
  • Figure out how to fix the way we use redis-pubsub so that we detect dropped connections in the future and let the program exit in that case. That way, systemd will automatically restart the service and create a new connection.

Note

The code lives in the puppet repository (https://github.com/wikimedia/puppet/blob/15e1bd5797a72d041ddf8d2e981a8f0ff4b231b7/modules/arclamp/files/arclamp-log.py), not in the main GitHub repo.

Event Timeline

Krinkle created this task. · Feb 10 2019, 9:35 PM
Restricted Application added a subscriber: Aklapper. · Feb 10 2019, 9:35 PM

I recall something like this from experimenting with the old python-memcached-relay daemon (before we decided on mcrouter instead). That used non-blocking checks and a conditional sleep (polling), but since there is only one server here, the message() timeout could work. Something like https://github.com/andymccurdy/redis-py/issues/631 with try/except for redis.TimeoutError and resubscribe logic should be doable.
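The polling idea above could look roughly like the following. This is a minimal sketch, not the arclamp-log code: `consume`, `subscribe`, `handle`, `max_messages` and `clock` are hypothetical names. It assumes `subscribe()` returns an object that behaves like a redis-py `PubSub`, i.e. `get_message(timeout=...)` returns `None` when nothing arrives and `close()` drops the connection. With redis-py, `reconnect_on` would presumably be `(redis.ConnectionError, redis.TimeoutError)`.

```python
import time


def consume(subscribe, handle, poll_timeout=5.0, idle_limit=60.0,
            reconnect_on=(), max_messages=None, clock=time.monotonic):
    """Poll a pubsub-like object and resubscribe when it goes quiet.

    subscribe     -- hypothetical factory returning a fresh, subscribed
                     object with get_message(timeout=...) and close()
    handle        -- called with the payload of every real message
    reconnect_on  -- exception classes that should trigger a resubscribe
    max_messages  -- stop after this many messages (None = run forever);
                     only here so the sketch can be exercised in a test
    """
    pubsub = subscribe()
    last_seen = clock()
    handled = 0
    while max_messages is None or handled < max_messages:
        broken = False
        try:
            # Returns None if nothing arrived within poll_timeout seconds.
            message = pubsub.get_message(timeout=poll_timeout)
        except reconnect_on:
            message, broken = None, True

        if message is not None:
            last_seen = clock()
            # Skip subscribe/unsubscribe confirmations.
            if message.get("type") == "message":
                handle(message["data"])
                handled += 1
            continue

        # Nothing received: treat a long silence like a dead connection.
        if broken or clock() - last_seen >= idle_limit:
            pubsub.close()
            pubsub = subscribe()
            last_seen = clock()
    return handled
```

A silent channel and a dead connection look the same from the client side, so the sketch treats both alike. The cost is an occasional unnecessary resubscribe during genuinely quiet periods.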

aaron triaged this task as Medium priority. · Jun 6 2019, 10:34 AM

Part of annual plan, probably Q2 or Q3.

Krinkle added a subscriber: Gilles.
From T217968

Idea:

  • Icinga check asserting that /srv/xenon/logs/hourly and /srv/xenon/logs/daily have files less than 5 minutes old. (relating to the xenon-log daemon)
  • Icinga check asserting that /srv/xenon/logs/daily and /srv/xenon/svgs/daily have files less than 20 minutes old. (relating to the xenon-generate-svgs 15min cron)
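The freshness checks proposed above could be sketched as a small Nagios/Icinga-style plugin. This is an illustration only: `newest_file_age` and `check_fresh` are hypothetical names, and it assumes the files sit directly inside each directory (no recursion). The exit codes 0 (OK) and 2 (CRITICAL) follow the usual plugin convention.

```python
import os
import time

OK, CRITICAL = 0, 2  # conventional Nagios/Icinga plugin exit codes


def newest_file_age(directory, now=None):
    """Return the age in seconds of the most recently modified regular
    file directly inside `directory`, or None if it contains no files."""
    now = time.time() if now is None else now
    with os.scandir(directory) as entries:
        mtimes = [e.stat().st_mtime for e in entries if e.is_file()]
    return now - max(mtimes) if mtimes else None


def check_fresh(directories, max_age_seconds, now=None):
    """Return (exit_code, status_line) for a set of directories."""
    problems = []
    for directory in directories:
        try:
            age = newest_file_age(directory, now)
        except OSError as exc:
            problems.append("%s: cannot read (%s)" % (directory, exc))
            continue
        if age is None:
            problems.append("%s: no files" % directory)
        elif age > max_age_seconds:
            problems.append("%s: newest file is %d s old" % (directory, age))
    if problems:
        return CRITICAL, "CRITICAL: " + "; ".join(problems)
    return OK, "OK: newest files are at most %d s old" % max_age_seconds


# Possible wiring for the first check (paths taken from the idea above):
#   status, line = check_fresh(
#       ["/srv/xenon/logs/hourly", "/srv/xenon/logs/daily"], 5 * 60)
#   print(line)
#   sys.exit(status)   # would need "import sys"
```

The second check would use the same function with the `daily` log and SVG directories and a 20-minute threshold.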
aaron updated the task description. · Jun 18 2019, 10:09 AM

Mentioned in SAL (#wikimedia-operations) [2019-06-28T01:22:04Z] <Krinkle> Killing arclamp-log on webperf1002, no flame graphs for three days, presumably mwlog/redis connection dropped again. T215740

Dzahn added a subscriber: Dzahn. · Jun 28 2019, 1:39 AM
Gilles assigned this task to dpifke. · Jan 7 2020, 11:39 AM

Change 568732 had a related patch set uploaded (by Krinkle; owner: Ori.livneh):
[operations/puppet@production] arclamp-log: abort if no message received in 30 minutes

https://gerrit.wikimedia.org/r/568732

Change 568732 merged by Alexandros Kosiaris:
[operations/puppet@production] arclamp-log: abort if no message received in 30 minutes

https://gerrit.wikimedia.org/r/568732
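One way an "abort if idle" rule like this could be implemented is a SIGALRM watchdog that is re-armed on every message. The sketch below is an assumption for illustration and not necessarily what change 568732 does; `consume_or_abort`, `messages` and `handle` are hypothetical names, and it only works on Unix and in the main thread.

```python
import signal
import sys

IDLE_LIMIT_SECONDS = 30 * 60  # 30 minutes, matching the change title


def consume_or_abort(messages, handle, idle_limit=IDLE_LIMIT_SECONDS):
    """Handle every item from a blocking iterator, but exit the process
    if no item arrives for `idle_limit` seconds (whole seconds only)."""

    def abort(signum, frame):
        # Raises SystemExit in the main thread, even while it is blocked
        # waiting for the next message; exit status is 1.
        sys.exit("no message received in %d seconds, aborting" % idle_limit)

    signal.signal(signal.SIGALRM, abort)
    signal.alarm(idle_limit)  # arm the watchdog
    try:
        for message in messages:
            signal.alarm(idle_limit)  # re-arm: replaces the pending alarm
            handle(message)
    finally:
        signal.alarm(0)  # disarm on any exit path
```

Because the process exits with a non-zero status, a supervisor such as systemd can restart it and thereby open a fresh connection, which is the behaviour asked for in the action items.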

Change 608973 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[operations/puppet@production] [WIP] webperf: Enable prometheus-apache-exporter

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608973

fgiunchedi moved this task from Inbox to Radar on the observability board. · Mon, Jul 20, 1:28 PM