Page MenuHomePhabricator

BetaFeatures metrics aggregation and dashboard are failing
Open, Needs TriagePublic

Description

The BetaFeatures Grafana dashboard has been empty since February 2020: https://grafana.wikimedia.org/d/000000259/betafeatures?orgId=1&refresh=5m&from=now-1y&to=now

image.png (332×1 px, 90 KB)

Root cause is that the combined WMDE metrics take too long to execute, and trip the 1-hour timeout. Proposed solution is to optimize the query to finish within the time limit.

Event Timeline

I found one very likely explanation, b7b54659f089491 was deployed on Feb 19 and kills scripts after 1 hour. This means that we're probably running the BetaFeatures script for 1 hour every day, and killing it!

Potential steps forward:

  • Add code to log and report out when a script times out.
  • Raise the time limit.
  • Rewrite the BetaFeatures report to finish within the timeout.
  • Add code to log and report out when a script times out.

It logs error AFAIK. It's in stat1007, I can try to find it for you but I'm sure it logs them.

Thanks, it looks like we have many misbehaving scripts (extra linefeeds are my own):

2020-08-17 03:17:00 wikidata-datamodel-statements_per_entity Script Finished!
sh: 1: Syntax error: EOF in backquote substitution

2020-08-17 03:54:32 wikidata-entityUsage Script Finished!

PHP Fatal error:  Maximum execution time of 3600 seconds exceeded in /srv/analytics-wmde/graphite/src/scripts/lib/WikimediaDb.php on line 68

2020-08-18 03:01:48 wikidata-dumpScanProcessing File not found: /srv/analytics-wmde/graphite/data/20190729/metrics.json
2020-08-18 03:01:48 wikidata-dumpScanProcessing File not found: /srv/analytics-wmde/graphite/data/20190722/metrics.json
2020-08-18 03:01:48 wikidata-dumpScanProcessing File not found: /srv/analytics-wmde/graphite/data/20190717/metrics.json

PHP Warning:  Invalid argument supplied for foreach() in /srv/analytics-wmde/graphite/src/scripts/src/wikidata/sparql/ranks.php on line 37
2020-08-18 03:01:01 wikidata-sparql-ranks Script Finished!

2020-08-17 12:00:01 wikidata-apiLogScanner Script Started!
PHP Fatal error:  Uncaught RuntimeException: File does not exist: /srv/log/mw-log/archive/api/api.log-20200815.gz in /srv/analytics-wmde/graphite/src/scripts/src/wikidata/apiLogScanner.php:105
Stack trace:
#0 /srv/analytics-wmde/graphite/src/scripts/src/wikidata/apiLogScanner.php(52): WikidataApiLogScanner->dieIfFilesDoNotExist()
#1 /srv/analytics-wmde/graphite/src/scripts/src/wikidata/apiLogScanner.php(22): WikidataApiLogScanner->execute()
#2 {main}
  thrown in /srv/analytics-wmde/graphite/src/scripts/src/wikidata/apiLogScanner.php on line 105

It looks like the timeout happens at a random time, out of sync with the surrounding loglines. This makes sense because all jobs are started as background tasks from the shell, so they may be logging concurrently. The shell has no information about which job expired. It's possible that the first timing-out job kills any other long-running jobs. Indeed, I can see that the "betafeature-counts Script Started!" loglines never have a corresponding "Script Finished".

Matching these lines up over the last few days:

1grep 'Script' log/daily.03.log | cut -d " " -f 3-5 | sort | uniq
2
3advancedsearch-userprops Script Started!
4betafeature-counts Script Started!
5catwatch-userprops Script Started!
6echo-statusNotifications Script Started!
7revslider-userprops Script Started!
8rollbackconf-userprops Script Started!
9twocolconflict-userprops Script Started!
10wikibase-dockerStats Script Finished!
11wikibase-dockerStats Script Started!
12wikidata-datamodel-properties_by_datatype Script Finished!
13wikidata-datamodel-properties_by_datatype Script Started!
14wikidata-datamodel-sitelinks_per_item Script Finished!
15wikidata-datamodel-sitelinks_per_item Script Started!
16wikidata-datamodel-sitelinks_per_site Script Finished!
17wikidata-datamodel-sitelinks_per_site Script Started!
18wikidata-datamodel-statements_per_entity Script Finished!
19wikidata-datamodel-statements_per_entity Script Started!
20wikidata-dumpDownloads Script Finished!
21wikidata-dumpDownloads Script Started!
22wikidata-dumpScanProcessing Script Finished!
23wikidata-dumpScanProcessing Script Started!
24wikidata-entityUsage Script Finished!
25wikidata-entityUsage Script Started!
26wikidata-phabricatorTasks Script Finished!
27wikidata-phabricatorTasks Script Started!
28wikidata-showcaseItems Script Finished!
29wikidata-showcaseItems Script Started!
30wikidata-site_stats-active_users Script Finished!
31wikidata-site_stats-active_users Script Started!
32wikidata-site_stats-good_articles Script Finished!
33wikidata-site_stats-good_articles Script Started!
34wikidata-site_stats-lexemes Script Finished!
35wikidata-site_stats-lexemes Script Started!
36wikidata-site_stats-pagelinks_to_namespaces Script Finished!
37wikidata-site_stats-pagelinks_to_namespaces Script Started!
38wikidata-site_stats-pages_by_namespace Script Finished!
39wikidata-site_stats-pages_by_namespace Script Started!
40wikidata-site_stats-page_size Script Finished!
41wikidata-site_stats-page_size Script Started!
42wikidata-site_stats-recent_changes_by_namespace Script Finished!
43wikidata-site_stats-recent_changes_by_namespace Script Started!
44wikidata-site_stats-rolling_rc Script Finished!
45wikidata-site_stats-rolling_rc Script Started!
46wikidata-site_stats-total_edits Script Finished!
47wikidata-site_stats-total_edits Script Started!
48wikidata-site_stats-total_pages Script Finished!
49wikidata-site_stats-total_pages Script Started!
50wikidata-site_stats-user_groups Script Finished!
51wikidata-site_stats-user_groups Script Started!
52wikidata-site_stats-user_languages Script Finished!
53wikidata-site_stats-user_languages Script Started!
54wikidata-site_stats-users Script Finished!
55wikidata-site_stats-users Script Started!
56wikidata-social-facebook Script Finished!
57wikidata-social-facebook Script Started!
58wikidata-social-identica Script Finished!
59wikidata-social-identica Script Started!
60wikidata-social-irc Script Finished!
61wikidata-social-irc Script Started!
62wikidata-social-mail Script Finished!
63wikidata-social-mail Script Started!
64wikidata-social-newsletter Script Finished!
65wikidata-social-newsletter Script Started!
66wikidata-social-techmail Script Finished!
67wikidata-social-techmail Script Started!
68wikidata-social-twitter Script Finished!
69wikidata-social-twitter Script Started!
70wikidata-sparql-constraint-snaks Script Finished!
71wikidata-sparql-constraint-snaks Script Started!
72wikidata-sparql-instanceof Script Finished!
73wikidata-sparql-instanceof Script Started!
74wikidata-sparql-ranks Script Finished!
75wikidata-sparql-ranks Script Started!

All of these scripts are failing to complete, for unknown reasons:

advancedsearch-userprops
betafeature-counts
catwatch-userprops
echo-statusNotifications
revslider-userprops
rollbackconf-userprops
twocolconflict-userprops

I suggest that we get rid of the "daily.*.sh" scripts and instead call each script directly from cron. We could still share a common wrapper around each command, but listing them separately in cron gives built-in visibility about which scripts fail.

awight renamed this task from Why does betafeatures grafana data stop on 2020-02-20? to BetaFeatures metrics aggregation and dashboard are failing.Feb 22 2021, 2:33 PM
awight updated the task description. (Show Details)

I suggest that we get rid of the "daily.*.sh" scripts and instead call each script directly from cron. We could still share a common wrapper around each command, but listing them separately in cron gives built-in visibility about which scripts fail.

Sounds good as these scripts etc are indeed getting more complicated.
The reason for a single cron file in the first place was so that we could avoid touching puppet a bunch.
Many of these scripts could likely use data from other faster sources now (like hadoop)
It sould like we should also do an audit of these scripts and see if we can kill some of them (especially any long failing ones)