It’s possible for wmde-analytics-minutely.service, a service on stat1011 that is meant to run every minute, to become stuck and not be restarted for well over a minute. We try to prevent this with RuntimeMaxSec=55 in the unit, but as the systemd documentation notes, that setting has no effect on Type=oneshot services. We should find another way to make sure the service never runs for longer than a minute, so that a new attempt is started every minute.
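One candidate is TimeoutStartSec=, which the same systemd documentation suggests for limiting the activation of Type=oneshot services. The following is only a minimal sketch, not a tested fix: the drop-in path and file name are hypothetical, and the value of 55 seconds is simply carried over from the existing RuntimeMaxSec= setting.

```ini
# Hypothetical drop-in location; the real unit may be managed elsewhere (e.g. by Puppet).
# /etc/systemd/system/wmde-analytics-minutely.service.d/timeout.conf
[Service]
# For Type=oneshot, the ExecStart= command runs while the unit is still
# "activating", so the start timeout is what bounds its runtime.
TimeoutStartSec=55
```

When the timeout is hit, systemd stops the service and the unit ends up in the failed state, which might itself show up in systemd unit state monitoring. If the script does not exit on SIGTERM, TimeoutStopSec= may also need to be lowered, since it controls how long systemd waits before sending SIGKILL.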
Original task description below (this specific instance eventually resolved itself after ca. 20 minutes):
Title: Missing stats for number of rows in wb_changes table
Since 15:12 UTC today, the number of rows in the wb_changes table has been missing on Wikidata Alerts (link with timestamps), which caused an alert.
This number is normally updated by wmde-analytics-minutely.service on stat1011, which has seemingly been stuck for 11+ minutes now:
lucaswerkmeister-wmde@stat1011:~$ systemctl status wmde-analytics-minutely.service
● wmde-analytics-minutely.service - Minutely jobs for wmde analytics infrastructure
     Loaded: loaded (/lib/systemd/system/wmde-analytics-minutely.service; static)
     Active: activating (start) since Tue 2024-07-16 15:19:03 UTC; 11min ago
TriggeredBy: ● wmde-analytics-minutely.timer
       Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
   Main PID: 1580386 (minutely.sh)
      Tasks: 5 (limit: 153980)
     Memory: 8.7M
        CPU: 299ms
     CGroup: /system.slice/wmde-analytics-minutely.service
             ├─1580386 /bin/bash -x /srv/analytics-wmde/graphite/src/scripts/cron/minutely.sh /srv/analytics-wmde/graphite/src/scripts
             ├─1580390 /bin/bash -x /srv/analytics-wmde/graphite/src/scripts/cron/minutely.sh /srv/analytics-wmde/graphite/src/scripts
             ├─1580394 /usr/bin/php /srv/analytics-wmde/graphite/src/scripts/src/wikidata/recentChanges.php
             ├─1580729 sh -c echo "wikidata.rc.edits.summary.wbremoveclaims 12 `date -d "2024-07-16 15:18:03" +%s`" | nc -q0 graphite-in.eqiad.wmnet 2003
             └─1580731 nc -q0 graphite-in.eqiad.wmnet 2003
Warning: some journal files were not opened due to insufficient permissions.
(I have no idea why the RuntimeMaxSec= isn’t doing what we want it to do:)
lucaswerkmeister-wmde@stat1011:~$ systemctl cat wmde-analytics-minutely.service
# /lib/systemd/system/wmde-analytics-minutely.service
[Unit]
Description=Minutely jobs for wmde analytics infrastructure
Documentation=https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[Service]
Type=oneshot
User=analytics-wmde
ExecStart=/srv/analytics-wmde/graphite/src/scripts/cron/minutely.sh /srv/analytics-wmde/graphite/src/scripts
RuntimeMaxSec=55
A bit earlier, I was also unable to connect to the stats databases – I think the script might be stuck on the same issue:
lucaswerkmeister-wmde@stat1011:~$ analytics-mysql wikidatawiki
ERROR 2002 (HY000): Can't connect to MySQL server on 'dbstore1009.eqiad.wmnet' (115)
(SAL indicates dbstore1009 was under maintenance from T365997 at the time.) Since then, I have been able to connect to dbstore1009 again, but the script hasn’t recovered by itself:
lucaswerkmeister-wmde@stat1011:~$ analytics-mysql wikidatawiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 24668304
Server version: 10.6.16-MariaDB MariaDB Server
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql:research@dbstore1009.eqiad.wmnet [wikidatawiki]> ^DBye

