Page MenuHomePhabricator

Maintenance scripts do not emit StatsLib metrics
Closed, ResolvedPublic

Description

Maintenance scripts in extensions do not invoke MediaWikiEntryPoint->restInPeace() unless they specifically invoke doPostOutputShutdown(). Should MaintenanceRunner->shutdown() also invoke StatsFactory->flush()?

Related Objects

StatusSubtypeAssignedTask
OpenNone
Resolved herron
OpenNone
ResolvedDreamy_Jazz
ResolvedDMburugu
OpenNone
OpenNone
StalledNone
ResolvedNone
ResolvedCyndymediawiksim
ResolvedSgs
ResolvedCyndymediawiksim
ResolvedSgs
ResolvedSgs
ResolvedCyndymediawiksim
ResolvedCyndymediawiksim
ResolvedPRODUCTION ERRORSgs
ResolvedMichael
ResolvedCyndymediawiksim
ResolvedReleasedancy
Resolvedcolewhite

Event Timeline

Reedy renamed this task from Mainenance scripts do not emit StatsLib metrics to Maintenance scripts do not emit StatsLib metrics.Nov 22 2024, 11:31 PM
Michael added a subscriber: Sgs.

This loss in observability is directly impacting our work on our own hypothesis for this quarter. Further, this is a hard blocker for us for migrating some of our existing metrics from statsd to statslib.

Please let us know if there is anything we can do to help moving this along.

Change #1098626 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] Move StatsFactory flush into emitBufferedStats

https://gerrit.wikimedia.org/r/1098626

Further, this is a hard blocker for us for migrating some of our existing metrics from statsd to statslib.

This is also the case for the Trust and Safety Product Team for T359357.

Change #1098626 merged by jenkins-bot:

[mediawiki/core@master] Stats: Move StatsFactory flush into emitBufferedStats

https://gerrit.wikimedia.org/r/1098626

Change #1100442 had a related patch set uploaded (by Dreamy Jazz; author: Cwhite):

[mediawiki/core@wmf/1.44.0-wmf.6] Stats: Move StatsFactory flush into emitBufferedStats

https://gerrit.wikimedia.org/r/1100442

Change #1100442 merged by jenkins-bot:

[mediawiki/core@wmf/1.44.0-wmf.6] Stats: Move StatsFactory flush into emitBufferedStats

https://gerrit.wikimedia.org/r/1100442

Mentioned in SAL (#wikimedia-operations) [2024-12-04T12:59:12Z] <dreamyjazz@deploy2002> Started scap sync-world: Backport for [[gerrit:1100442|Stats: Move StatsFactory flush into emitBufferedStats (T380609)]], [[gerrit:1100434|Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php (T355169)]], [[gerrit:1100430|Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php (T355169)]]

Mentioned in SAL (#wikimedia-operations) [2024-12-04T13:00:51Z] <dreamyjazz@deploy2002> Started scap sync-world: Backport for [[gerrit:1100442|Stats: Move StatsFactory flush into emitBufferedStats (T380609)]], [[gerrit:1100434|Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php (T355169)]], [[gerrit:1100430|Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php (T355169)]]

Mentioned in SAL (#wikimedia-operations) [2024-12-04T13:28:15Z] <dreamyjazz@deploy2002> Started scap sync-world: Backport for [[gerrit:1100442|Stats: Move StatsFactory flush into emitBufferedStats (T380609)]], [[gerrit:1100434|Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php (T355169)]], [[gerrit:1100430|Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php (T355169)]], [[gerrit:1100449|Revert "Stats: Move StatsFactory flush into emitBufferedSta

This patch failed to be deployed by scap in a way which I think will block the next train. Adding this task as a blocker for the next train.

Mentioned in SAL (#wikimedia-operations) [2024-12-04T13:33:49Z] <dreamyjazz@deploy2002> dreamyjazz: Backport for [[gerrit:1100442|Stats: Move StatsFactory flush into emitBufferedStats (T380609)]], [[gerrit:1100434|Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php (T355169)]], [[gerrit:1100430|Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php (T355169)]], [[gerrit:1100449|Revert "Stats: Move StatsFactory flush into emitBufferedStats"]] synced

Mentioned in SAL (#wikimedia-operations) [2024-12-04T13:42:53Z] <dreamyjazz@deploy2002> Finished scap sync-world: Backport for [[gerrit:1100442|Stats: Move StatsFactory flush into emitBufferedStats (T380609)]], [[gerrit:1100434|Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php (T355169)]], [[gerrit:1100430|Fix handling of 'last-checked' as 'never' in scanFilesInScanTable.php (T355169)]], [[gerrit:1100449|Revert "Stats: Move StatsFactory flush into emitBufferedSt

Krinkle subscribed.
From #wikimedia-operations:

<Dreamy_Jazz> mergeMessageFileList.php generated PHP notices/warnings: Warning: socket_sendto(): unable to write to socket [101]: Network is unreachable in /srv/mediawiki-staging/php-1.44.0-wmf.6/includes/debug/logger/monolog/LegacyHandler.php on line 234

This is because Scap runs a small handful of scripts in an enforced-offline environment when building images, on the assumption that these scripts work offline. It seems that one of two things has happened:

  • Possiblility 1: mergeMessageFileList.php has always worked offline, but recently some new stats were introduced that are either specific to classes run by this script or are part of the MediaWiki setup logic or Maintenance base class for all CLI scripts. This didn't cause a problem in production because it was introduced after Scap started enforcing offline mode but before this task fixed the unintentional disablement of StatsFactory in CLI mode.
  • Possiblility 2: The stats that mergeMessageFileList.php emits have existed for years (either specific to it, or shared by all MW/maintenance scripts), but were migrated from BufferedStatsd to StatsFactory shortly before Scap started enforcing offline mode. Thus it works today, but would not once we fix that bug.

Analysis:

  • This is a PHP Warning, not a PHP Fatal Error or Uncaught Exception. It only stops Scap because we are (rightfully) strict about this sort of thing and out of caution abort the deployment. Logically, however, it did not fail the script execution, and would not have exited with non-zero exit code. mergeMessageFileList.php completed successfully and the docker image would have likely been fine as well as-is.
  • The socket warning comes from Logstash handler, not Statsd handler. Yet, the commit that caused it enabled Statsd in Maintenance scripts. I think what we see here is a cascading error. First, enabling Statsd caused some unimportant stat counter to now get flushed correctly at the end of the script. This will have failed because it would not be able to establish a network connection. But, we're not seeing a PHP Warning from the Statsd flush. I suspect one of two things has happened:
    • The Stats flush doesn't emit a PHP Warning but logs its own INFO message directly to Logstash.
    • The Stsats flush silences a PHP Warning, but given that we report both silenced and un-silenced warnings to Logstash, MediaWIki's set_error_handler callback will try to report this to Logstash. Either way, we're trying to write to Logstash which then fails.

Both of these can probably be addressed by WMF_MAINTENANCE_OFFLINE.

Stats are fine to turn off since they're not essential to the application logic. The stable interface there is entirely void, and they are off by default in MediaWiki core and in CI as well.

Logstash I would not turn off entirely, but rather redirect to stderr so as to preserve visibility. Possibly limited to error messages. I'm fairly certainly there's a bunch of logging enabled by default already, so the fact that this hasn't been an issue before is presumably because there are no errors under normal conditions, so it only fails if something else fails first (as we saw today). Perhaps sending error to stderr or something like that would be good. Assuming this runs in docker-build with no networking, that should end up in the same place on-screen as where the above PHP Warning was logged to by php-cli.

Change #1100864 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/mediawiki-config@master] Disable stats collection when WMF_MAINTENANCE_OFFLINE is set

https://gerrit.wikimedia.org/r/1100864

Change #1100864 merged by jenkins-bot:

[operations/mediawiki-config@master] Disable stats collection when WMF_MAINTENANCE_OFFLINE is set

https://gerrit.wikimedia.org/r/1100864

Mentioned in SAL (#wikimedia-operations) [2024-12-10T22:04:22Z] <cjming@deploy2002> Started scap sync-world: Backport for [[gerrit:1100864|Disable stats collection when WMF_MAINTENANCE_OFFLINE is set (T380609)]]

Mentioned in SAL (#wikimedia-operations) [2024-12-10T22:08:04Z] <cjming@deploy2002> cwhite, cjming: Backport for [[gerrit:1100864|Disable stats collection when WMF_MAINTENANCE_OFFLINE is set (T380609)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-12-10T22:15:47Z] <cjming@deploy2002> Finished scap sync-world: Backport for [[gerrit:1100864|Disable stats collection when WMF_MAINTENANCE_OFFLINE is set (T380609)]] (duration: 11m 24s)

Thanks to @Dreamy_Jazz for rolling the patch to production! It looks like Prometheus metrics are now working, but the backwards-compatibility bit copyToStatsdAt() isn't working. I'm going to try tracking that down.

colewhite claimed this task.
colewhite triaged this task as High priority.

I've determined the backwards-compatibility is working. The issue was how Graphite/Statsite stores counters and the patch to transition changed the metric type to a Gauge.

Considering this resolved!