Page MenuHomePhabricator

wmde-analytics-minutely.service can get stuck (RuntimeMaxSec= has no effect), resulting in missed stats
Closed, ResolvedPublic

Description

It’s possible for wmde-analytics-minutely.service, a minutely-running service on stat1011, to become stuck and not be restarted for well over one minute. We try to prevent this with RuntimeMaxSec=55 in the unit, but as the documentation notes, this actually has no effect for Type=oneshot services. We should find some other solution to make sure the service does not run for longer than a minute (and a new attempt is started every minute).

Original task description below (this specific instance eventually resolved itself after ca. 20 minutes):


Title: Missing stats for number of rows in wb_changes table

Since 15:12 UTC today, the number of rows in the wb_changes table on Wikidata Alerts (link with timestamps) is missing, which caused an alert:

image.png (296×940 px, 22 KB)

This is normally updated by wmde-analytics-minutely.service on stat1011, which is seemingly stuck for 11+ minutes now:

lucaswerkmeister-wmde@stat1011:~$ systemctl status wmde-analytics-minutely.service 
● wmde-analytics-minutely.service - Minutely jobs for wmde analytics infrastructure
     Loaded: loaded (/lib/systemd/system/wmde-analytics-minutely.service; static)
     Active: activating (start) since Tue 2024-07-16 15:19:03 UTC; 11min ago
TriggeredBy: ● wmde-analytics-minutely.timer
       Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
   Main PID: 1580386 (minutely.sh)
      Tasks: 5 (limit: 153980)
     Memory: 8.7M
        CPU: 299ms
     CGroup: /system.slice/wmde-analytics-minutely.service
             ├─1580386 /bin/bash -x /srv/analytics-wmde/graphite/src/scripts/cron/minutely.sh /srv/analytics-wmde/graphite/src/scripts
             ├─1580390 /bin/bash -x /srv/analytics-wmde/graphite/src/scripts/cron/minutely.sh /srv/analytics-wmde/graphite/src/scripts
             ├─1580394 /usr/bin/php /srv/analytics-wmde/graphite/src/scripts/src/wikidata/recentChanges.php
             ├─1580729 sh -c echo "wikidata.rc.edits.summary.wbremoveclaims 12 `date -d "2024-07-16 15:18:03" +%s`" | nc -q0 graphite-in.eqiad.wmnet 2003
             └─1580731 nc -q0 graphite-in.eqiad.wmnet 2003

Warning: some journal files were not opened due to insufficient permissions.

(I have no idea why the RuntimeMaxSec= isn’t doing what we want it to do:)

lucaswerkmeister-wmde@stat1011:~$ systemctl cat wmde-analytics-minutely.service
# /lib/systemd/system/wmde-analytics-minutely.service
[Unit]
Description=Minutely jobs for wmde analytics infrastructure
Documentation=https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state

[Service]
Type=oneshot
User=analytics-wmde
ExecStart=/srv/analytics-wmde/graphite/src/scripts/cron/minutely.sh /srv/analytics-wmde/graphite/src/scripts
RuntimeMaxSec=55

A bit earlier, I was also unable to connect to the stats databases – I think the script might be stuck on the same issue:

lucaswerkmeister-wmde@stat1011:~$ analytics-mysql wikidatawiki
ERROR 2002 (HY000): Can't connect to MySQL server on 'dbstore1009.eqiad.wmnet' (115)

(SAL indicates dbstore1009 was under maintenance from T365997 at the time.) In the meantime, I can now connect to dbstore1009 again, but the script hasn’t recovered by itself:

lucaswerkmeister-wmde@stat1011:~$ analytics-mysql wikidatawiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 24668304
Server version: 10.6.16-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql:research@dbstore1009.eqiad.wmnet [wikidatawiki]> ^DBye

Details

Event Timeline

(I have no idea why the RuntimeMaxSec= isn’t doing what we want it to do:)

Ah, that would be why (source):

Note that this setting does not have any effect on Type=oneshot services, as they terminate immediately after activation completed.

In the meantime, it looks like the script finally died:

lucaswerkmeister-wmde@stat1011:~$ systemctl status wmde-analytics-minutely.service 
● wmde-analytics-minutely.service - Minutely jobs for wmde analytics infrastructure
     Loaded: loaded (/lib/systemd/system/wmde-analytics-minutely.service; static)
     Active: inactive (dead) since Tue 2024-07-16 15:35:08 UTC; 39s ago
TriggeredBy: ● wmde-analytics-minutely.timer
       Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
    Process: 1628106 ExecStart=/srv/analytics-wmde/graphite/src/scripts/cron/minutely.sh /srv/analytics-wmde/graphite/src/scripts (code=exited, status=0/SUCCESS)
   Main PID: 1628106 (code=exited, status=0/SUCCESS)
        CPU: 444ms

Let’s hope the next timer trigger restarts it and the stats resume.

Looking better now:

image.png (296×940 px, 26 KB)

So let’s rewrite this into a “fix the RuntimeMaxSec=” task.

Lucas_Werkmeister_WMDE renamed this task from Missing stats for number of rows in wb_changes table to wmde-analytics-minutely.service can get stuck (RuntimeMaxSec= has no effect), resulting in missed stats.Jul 16 2024, 3:40 PM
Lucas_Werkmeister_WMDE updated the task description. (Show Details)

According to Unix Stack Exchange, the correct directive for oneshot services is TimeoutStartSec=.

Change #1054603 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[operations/puppet@production] systemd::timer::job: Use TimeoutStartSec=

https://gerrit.wikimedia.org/r/1054603

fgiunchedi subscribed.

I'm removing grafana since turns out this issue is unrelated

Other services potentially affected by fixing the max runtime enforcement:

  • Production:
    • debian-weekly-rebuild, in docker::baseimages, part of profile::docker::builder. I don’t think I have permission to SSH into the host where this timer runs (build2001 AFAICT), but the timeout is set to 86400 seconds (1 day), which feels generous; I’d be surprised if it actually takes that long, so this is probably fine.
    • production-images-weekly-rebuild, also in profile::docker::builder. Also a one-day timeout, so IMHO also probably fine.
    • update-openstack-mirror, in profile::mirrors::openstack, part of role::mirrors. I don’t have permission to SSH into mirror1001 either, but the timeout is 72000 seconds (20 hours), and the command is an rsync, so I’d be quite surprised if it takes that long. Probably fine.
  • Wikimedia Cloud Services:
    • designate_floating_ip_ptr_records_updater, in `openstack::designate::dns_floating_ip_updater, part of role::wmcs::openstack::eqiad1::control and role::wmcs::openstack::codfw1dev::control. The timer runs every 15 minutes, so killing the service after 14m50s is intended to ensure that the service will be regularly (re)started – the same situation as with our wmde-analytics-* services. I don’t have access to the host where this runs either (cloudcontrol1005-7, cloudcontrol2004-6-dev?), so I can’t check the actual runtime.
    • cloud-vps-networktest, in openstack::monitor::networktests, also part of the wmcs/openstack control roles. Also runs every 15 minutes, but the timeout is just 10 minutes. Similar situation to designate_floating_ip_ptr_records_updater, I think.
    • opentofu-infra-diff, in profile::openstack::base::opentofu, also part of the wmcs/openstack control roles. Timeout is just 1800 seconds (half an hour), even though the timer only runs daily. The command is something I don’t know (/usr/local/bin/tofu plan -detailed-exitcode), so I can’t judge if this might be exceeding its runtime.

The cloud instances feel a bit more risky to me than the production ones, so maybe a cloud admin could SSH into one of the cloudcontrol* nodes and check systemctl status designate_floating_ip_ptr_records_updater cloud-vps-networktest opentofu-infra-diff (or the journal) to check that these services don’t usually run for longer than 14m50s, 10m, and 30m respectively?

designate_floating_ip_ptr_records_updater.service took 8m to run
opentofu-infra-diff.service took 3s to run

The other I couldn't find it defined anywhere.

Please go ahead with the change.

Just for search purposes – the error if TimeoutStartSec= is exceeded looks something like this:

Starting T370171.service...
T370171.service: start operation timed out. Terminating.
T370171.service: Main process exited, code=killed, status=15/TERM
T370171.service: Failed with result 'timeout'.
Failed to start T370171.service.

Change #1054603 merged by JHathaway:

[operations/puppet@production] systemd::timer::job: Use TimeoutStartSec=

https://gerrit.wikimedia.org/r/1054603

Lucas_Werkmeister_WMDE claimed this task.

I don’t think it’s really possible to validate / test this fix, given that we don’t normally expect any scripts to hit the max runtime… let’s just close this, and potentially come back to it if there are issues.