Make mw-cron jobs alert thresholds easily configurable
Open, Low, Public

Description

At the moment, mw-cron jobs alert on every failure. In some cases, e.g. in T416440 and T404280, these failures are expected and do not require human intervention. In both of those tasks, maintenance on the MariaDB servers caused them to be temporarily read-only, and the crons would have succeeded on their next run.

If we expect that regular maintenance may cause transient failures, we should only alert in the circumstances where a human needs to intervene in order to avert operational risk. It would likely be valuable to alert only when a scheduled cron has been unable to complete successfully over some duration, rather than alerting on the first failed execution.
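
To make the proposal concrete, here is a minimal sketch of a duration-based alert decision. It assumes the alerting side can obtain each job's last successful completion time (for example from the CronJob's status); the function name, parameters, and the 6-hour threshold are hypothetical and only for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def should_alert(
    last_success: Optional[datetime],
    first_scheduled: datetime,
    now: datetime,
    max_staleness: timedelta,
) -> bool:
    """Return True only when the job has gone too long without a success.

    A job that has never succeeded is measured from its first scheduled run.
    """
    reference = last_success if last_success is not None else first_scheduled
    return now - reference > max_staleness


now = datetime(2026, 2, 9, 12, 0, tzinfo=timezone.utc)
threshold = timedelta(hours=6)  # hypothetical value, not a recommendation
created = now - timedelta(days=30)

# A failed run one hour after the last success: within the threshold, no alert.
recent = should_alert(now - timedelta(hours=1), created, now, threshold)
# No successful run for eight hours: past the threshold, alert.
stale = should_alert(now - timedelta(hours=8), created, now, threshold)
print(recent, stale)
```

Because the decision depends on the time since the last success rather than on the presence of a failed job object, an alert of this shape would clear on its own once a later run succeeds.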

Some crons may be particularly sensitive, and users may need to be alerted for any failure. This should be configurable (perhaps in Puppet?).
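
One possible shape for that configuration, sketched in Python rather than Puppet: a default policy plus per-job overrides. The class, field names, job names, and threshold values below are all hypothetical and do not reflect any existing mw-cron setting.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Dict


@dataclass(frozen=True)
class AlertPolicy:
    # True for sensitive jobs where any single failure should alert.
    alert_on_first_failure: bool = False
    # Otherwise, alert only after this long without a successful run.
    max_staleness: timedelta = timedelta(hours=24)


DEFAULT_POLICY = AlertPolicy()

# Hypothetical job names, for illustration only.
OVERRIDES: Dict[str, AlertPolicy] = {
    "example-sensitive-job": AlertPolicy(alert_on_first_failure=True),
    "example-hourly-job": AlertPolicy(max_staleness=timedelta(hours=3)),
}


def policy_for(job_name: str) -> AlertPolicy:
    """Return the job's override if one exists, else the default policy."""
    return OVERRIDES.get(job_name, DEFAULT_POLICY)


print(policy_for("example-sensitive-job"))
print(policy_for("some-other-job"))
```

Keeping the default tolerant and listing only the exceptions means most jobs need no per-job configuration at all.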

In the past 90 days, there have been 16 instances of "MediaWiki periodic job ... failed" alerts. Of those 16 instances, several will be addressed by the work in T390972, but that work will not cover the type of transient failure seen in T416440 and T404280.

Event Timeline

Blake triaged this task as Low priority. Thu, Feb 5, 11:52 AM

In addition to update-special-pages, this same failure mode is seen for UpdateMenteeData:

blake@deploy2002:~$ kubectl logs growthexperiments-updatementeedata-s1-29510295-jzlxn --timestamps=true | less
...
2026-02-09T06:17:39.221304883Z enwiki Wikimedia\Rdbms\DBQueryError from line 1226 of /srv/mediawiki/php-1.46.0-wmf.14/includes/libs/Rdbms/Database/Database.php: Error 1290: The MariaDB server is running with the --read-only option so it cannot execute this statement
...

At the moment, it also seems necessary for someone to take manual action to delete the failed job in order to reset the alerting. I think we should try to avoid this requirement if possible in the replacement alerting.

Seeing a different kind of transient, retriable failure today, which apparently occurs when a database server is unpooled or repooled during the run of some maintenance scripts:

blake@deploy2002:~$ kubectl logs jobs/growthexperiments-updatementeedata-s1-29511735 --timestamps=true | less
...
2026-02-10T06:15:01.305147642Z extensions/GrowthExperiments/maintenance/updateMenteeData.php: Start run
2026-02-10T06:15:01.305198993Z extensions/GrowthExperiments/maintenance/updateMenteeData.php: Running on growthexperiments & s1
2026-02-10T06:15:01.305206463Z extensions/GrowthExperiments/maintenance/updateMenteeData.php: Running on growthexperiments & s1
2026-02-10T06:40:53.284308116Z enwiki InvalidArgumentException from line 80 of /srv/mediawiki/php-1.46.0-wmf.14/includes/libs/Rdbms/ServerInfo.php: No server with index '0'
2026-02-10T06:40:53.284338359Z enwiki #0 /srv/mediawiki/php-1.46.0-wmf.14/includes/libs/Rdbms/LoadBalancer/LoadBalancer.php(990): Wikimedia\Rdbms\ServerInfo->getServerInfoStrict(0)
2026-02-10T06:40:53.284342942Z enwiki #1 /srv/mediawiki/php-1.46.0-wmf.14/includes/libs/Rdbms/LoadBalancer/LoadBalancer.php(909): Wikimedia\Rdbms\LoadBalancer->reallyOpenConnection(0, Object(Wikimedia\Rdbms\DatabaseDomain), Array)
2026-02-10T06:40:53.284346449Z enwiki #2 /srv/mediawiki/php-1.46.0-wmf.14/includes/libs/Rdbms/LoadBalancer/LoadBalancer.php(798): Wikimedia\Rdbms\LoadBalancer->reuseOrOpenConnectionForNewRef(0, Object(Wikimedia\Rdbms\DatabaseDomain), 4)
2026-02-10T06:40:53.284349666Z enwiki #3 /srv/mediawiki/php-1.46.0-wmf.14/includes/libs/Rdbms/LoadBalancer/LoadBalancer.php(1783): Wikimedia\Rdbms\LoadBalancer->getServerConnection(0, '', 4)
2026-02-10T06:40:53.284355822Z enwiki #4 /srv/mediawiki/php-1.46.0-wmf.14/includes/libs/ObjectCache/WANObjectCache.php(1831): Wikimedia\Rdbms\LoadBalancer->Wikimedia\Rdbms\{closure}(false, 5, Array, NULL, Array)
2026-02-10T06:40:53.284381510Z enwiki #5 /srv/mediawiki/php-1.46.0-wmf.14/includes/libs/ObjectCache/WANObjectCache.php(1641): Wikimedia\ObjectCache\WANObjectCache->fetchOrRegenerate('global:rdbms-se...', 5, Object(Closure), Array, Array)
2026-02-10T06:40:53.284387955Z enwiki #6 /srv/mediawiki/php-1.46.0-wmf.14/includes/libs/Rdbms/LoadBalancer/LoadBalancer.php(1774): Wikimedia\ObjectCache\WANObjectCache->getWithSetCallback('global:rdbms-se...', 5, Object(Closure), Array)
2026-02-10T06:40:53.284391202Z enwiki #7 /srv/mediawiki/php-1.46.0-wmf.14/includes/libs/Rdbms/LoadBalancer/LoadBalancer.php(1760): Wikimedia\Rdbms\LoadBalancer->isPrimaryRunningReadOnly()
2026-02-10T06:40:53.284426304Z enwiki #8 /srv/mediawiki/php-1.46.0-wmf.14/includes/libs/Rdbms/ReadOnlyMode.php(55): Wikimedia\Rdbms\LoadBalancer->getReadOnlyReason()
...

This failure mode was discussed in T366857, and it looks like another case where it's acceptable to simply re-run the script in question.
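
If the replacement alerting wanted to recognise these two failure modes from job logs, a rough sketch could match on the error messages quoted in this task. The category names and the function are hypothetical, and matching on message text is an assumption about how such a check might work, not how any existing tooling behaves.

```python
import re
from typing import Optional

# Patterns taken from the log excerpts in this task; the labels are made up.
TRANSIENT_PATTERNS = {
    "primary_read_only": re.compile(
        r"Error 1290: The MariaDB server is running with the --read-only option"
    ),
    "server_index_missing": re.compile(r"No server with index '\d+'"),
}


def classify_transient(log_text: str) -> Optional[str]:
    """Return a label for a known retriable failure, or None if unrecognised."""
    for label, pattern in TRANSIENT_PATTERNS.items():
        if pattern.search(log_text):
            return label
    return None


read_only_line = (
    "Error 1290: The MariaDB server is running with the --read-only option "
    "so it cannot execute this statement"
)
missing_index_line = "includes/libs/Rdbms/ServerInfo.php: No server with index '0'"

print(classify_transient(read_only_line))
print(classify_transient(missing_index_line))
print(classify_transient("some unrelated fatal error"))
```

Anything that does not match a known pattern returns None, so unrecognised failures would still be treated as alert-worthy by default.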

Another transient failure mode we shouldn't be alerted about is pod eviction due to temporary conditions on a node.

blake@deploy2002:~$ kubectl describe pod testkitchen-updateconfigs-29511235-vtlsh | less
...
Annotations:      comment: Fetch instrument and experiment configs from Test Kitchen and updates the backing store if they have changed
                  pod.kubernetes.io/sidecars: mediawiki-main-tls-proxy,mediawiki-main-rsyslog
Status:           Failed
Reason:           Evicted
Message:          Pod was rejected: The node had condition: [DiskPressure].
...

This will happen sometimes, and is fine - we expect that the next time the job attempts to schedule, it'll schedule on a healthy node. T290444 contains another discussion of why this is not a condition which warrants an alert.
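
A small sketch of how an eviction like the one above could be told apart from a genuine job failure, assuming the pod is available as parsed JSON (for example from kubectl get pod -o json), where the Status and Reason shown by kubectl describe correspond to status.phase and status.reason. The function name and the sample dictionaries are hypothetical.

```python
from typing import Any, Mapping


def is_eviction(pod: Mapping[str, Any]) -> bool:
    """Return True if the pod failed because the node evicted or rejected it."""
    status = pod.get("status", {})
    return status.get("phase") == "Failed" and status.get("reason") == "Evicted"


# Sample modelled on the describe output in this task.
evicted_pod = {
    "status": {
        "phase": "Failed",
        "reason": "Evicted",
        "message": "Pod was rejected: The node had condition: [DiskPressure].",
    }
}
# A pod whose container exited non-zero has phase Failed but no Evicted reason.
failed_pod = {"status": {"phase": "Failed"}}

print(is_eviction(evicted_pod), is_eviction(failed_pod))
```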