Page MenuHomePhabricator

mediawiki_job_wikidata-updateQueryServiceLag failing
Closed, ResolvedPublic3 Estimated Story Points

Description

This systemd timer runs every minute, but the last time it succeeded was Dec 7, 19:41 UTC. Since then it's been failing consistently; journalctl looks like:

Dec 08 16:52:00 mwmaint1002 systemd[1]: Started MediaWiki periodic job wikidata-updateQueryServiceLag.
Dec 08 16:52:01 mwmaint1002 mediawiki_job_wikidata-updateQueryServiceLag[131972]: Failed to get lag from prometheus
Dec 08 16:52:01 mwmaint1002 systemd[1]: mediawiki_job_wikidata-updateQueryServiceLag.service: Main process exited, code=exited, status=1/FAILURE
Dec 08 16:52:01 mwmaint1002 systemd[1]: mediawiki_job_wikidata-updateQueryServiceLag.service: Unit entered failed state.
Dec 08 16:52:01 mwmaint1002 systemd[1]: mediawiki_job_wikidata-updateQueryServiceLag.service: Failed with result 'exit-code'.

SAL shows @RKemper was reimaging WDQS hosts at the time it started failing, not sure if that's related or coincidence.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
jbond triaged this task as Medium priority.Dec 9 2020, 12:17 PM
[20:17:19] <mutante>	 ryankemper: I guess it makes sense that "job_wikidata-updateQueryServiceLag" could not run during current work
[20:19:46] <ryankemper>	 mutante: yeah, I'm not familiar with how the job works specifically but that would make sense
[20:20:42] <ryankemper>	 if it's supposed to alert at <50% availability then that might be a bit unexpected because only one node in each `dc x [internal, external]` is being re-imaged at a time
[20:21:12] <ryankemper>	 So for codfw for example there's one codfw wdqs-internal host that would be unable to report and one codfw external wdqs host
[20:21:48] <mutante>	 ryankemper: no, this case is not about availability, it's "one of the mediawiki maintenance 'crons' (that are now systemd timers) failed to run on the maintenance servers
[20:26:03] <mutante>	 ryankemper: what happens is: "maintenance job tries to update what the current lag is like.. tries to get lag data from prometheus and that fails.  now since it's a systemd timer and not a cron it means it's a failed service which then turns into an Icinga alert about "systemd state is bad on a mwmaint server". and nothing clears it.. but this job runs every minute.. so it's more alert than 
[20:26:09] <mutante>	 is appropriate
[20:26:35] <mutante>	 let me just clear that failed service and wait a minute
[20:27:14] <mutante>	 !log mwmaint1002 - systemctl reset-failed to clear icinga alert

There's some context in the description of https://phabricator.wikimedia.org/T269204 that mentions that the counter metric blazegraph_lastupdated is now blazegraph_lastupdated_total, so if the`mediawiki_job_wikidata-updateQueryServiceLag` job has to do with that metric then the re-image is likely the source of the problem

ExecStart=/usr/local/bin/mw-cli-wrapper /usr/local/bin/mwscript extensions/Wikidata.org/maintenance/updateQueryServiceLag.php --wiki wikidatawiki --cluster wdqs --prometheus prometheus.svc.eqiad.wmnet --prometheus prometheus.svc.codfw.wmnet

It is trying to get data from prometheus about lag.. and that fails.

I ran the command manually as the same user, www-data. The error is simply "Failed to get lag from prometheus".

@mwmaint1002:~# sudo -u www-data /usr/local/bin/mwscript extensions/Wikidata.org/maintenance/updateQueryServiceLag.php --wiki wikidatawiki --cluster wdqs --prometheus prometheus.svc.eqiad.wmnet --prometheus prometheus.svc.codfw.wmnet
Failed to get lag from prometheus

Job lives here: https://github.com/wikimedia/mediawiki-extensions-Wikidata.org/blob/60c5f96ebf424b792077bb7c6b533a68702e7aea/maintenance/updateQueryServiceLag.php#L70

I have a patch open here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/646888 that addresses a different manifestation of this same issue with blazegraph_lastupdated. Currently it looks like we'll want to switch blazegraph_lastupdated from a Counter to a Gauge, but I need to figure out if changing the metric type will break the behavior in updateQueryServiceLag.php or elsewhere.

CBogen set the point value for this task to 3.Dec 14 2020, 4:44 PM
dcausse subscribed.

Seems to be resolved now:

dcausse@mwmaint1002:~$ mwscript extensions/Wikidata.org/maintenance/updateQueryServiceLag.php --wiki wikidatawiki --cluster wdqs --prometheus prometheus.svc.eqiad.wmnet --prometheus prometheus.svc.codfw.wmnet
Done.

Yep, the alert has cleared. Thanks!