
WikimediaPrometheusQueryServiceLagProvider: PHP Warning: A non-numeric value encountered
Closed, Resolved | Public | PRODUCTION ERROR

Description

Error

MediaWiki version: 1.35.0-wmf.31

message
PHP Warning: A non-numeric value encountered

Impact

Unclear.

Notes

40+ of these since 1.35.0-wmf.31.

Details

Request ID
7498d5ef1f8de0a661f1c74f
Request URL
n/a
Stack Trace
exception.trace
#0 /srv/mediawiki/php-1.35.0-wmf.31/extensions/Wikidata.org/src/QueryServiceLag/WikimediaPrometheusQueryServiceLagProvider.php(110): MWExceptionHandler::handleError(integer, string, string, integer, array)
#1 /srv/mediawiki/php-1.35.0-wmf.31/extensions/Wikidata.org/src/QueryServiceLag/WikimediaPrometheusQueryServiceLagProvider.php(59): WikidataOrg\QueryServiceLag\WikimediaPrometheusQueryServiceLagProvider->getLags()
#2 /srv/mediawiki/php-1.35.0-wmf.31/extensions/Wikidata.org/maintenance/updateQueryServiceLag.php(84): WikidataOrg\QueryServiceLag\WikimediaPrometheusQueryServiceLagProvider->getLag()
#3 /srv/mediawiki/php-1.35.0-wmf.31/maintenance/doMaintenance.php(105): WikidataOrg\UpdateQueryServiceLag->execute()
#4 /srv/mediawiki/php-1.35.0-wmf.31/extensions/Wikidata.org/maintenance/updateQueryServiceLag.php(107): require_once(string)
#5 /srv/mediawiki/multiversion/MWScript.php(101): require_once(string)
#6 {main}

Event Timeline

brennen triaged this task as Unbreak Now! priority. May 6 2020, 10:03 PM

Just realizing this is due to /srv/mediawiki-staging/multiversion/MWScript.php extensions/Wikidata.org/maintenance/updateQueryServiceLag.php --wiki wikidatawiki --cluster wdqs --prometheus prometheus.svc.eqiad.wmnet --prometheus prometheus.svc.codfw.wmnet.

5e3b0d5f?

Presumably this is cron'd? Unclear whether this warrants rollback.

brennen lowered the priority of this task from Unbreak Now! to Needs Triage. May 6 2020, 10:34 PM

After discussion in #wikimedia-operations, unblocking train on the assumption that this is likely to be bad data from Prometheus.

16:25 <brennen> meanwhile:  does T252077 warrant a rollback?
16:26 <Reedy> https://github.com/wikimedia/puppet/blob/6b0dc71f153b6f052eb117c72ed365aaedc12a4d/modules/profile/manifests/mediawiki/maintenance/wikidata.pp#L73
16:26 <Reedy> (it is a cron, yeah)
16:27  * Reedy looks
16:27 <brennen> thx.
16:30 <Reedy> brennen: I'm presuming time() isn't broken in PHP...
16:30 <brennen> well one can hope.
16:30 <Reedy> :)
16:30 <Reedy> So I'm guessing it's bad data from the prometheus service
16:30 <Reedy> No recent changes to the code
16:31 <brennen> yeah, makes sense.  in that case i'll unblock.

I can confirm no rollback or blocking is needed

Looking at P11165, Prometheus seems to be returning a NaN value, which is not accounted for in the code.

      {
        "metric": {
          "__name__": "blazegraph_lastupdated",
          "cluster": "wdqs",
          "instance": "wdqs2001:9193",
          "job": "blazegraph",
          "site": "codfw"
        },
        "value": [
          1588807140.848,
          "NaN"
        ]
      },

Looks like that host is doing something else now.

I think the below is related.

21:07 ryankemper@cumin1001: START - Cookbook sre.wdqs.data-transfer
21:05 ryankemper@cumin1001: END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
21:04 ryankemper@cumin1001: START - Cookbook sre.wdqs.data-transfer

I guess the code needs to account for NaN values and treat hosts that report them as not being in the pool of servers to look at.
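A rough sketch of that idea, in Python for illustration only (the extension itself is PHP, and this is not the code from the eventual patch). It assumes blazegraph_lastupdated is a Unix timestamp in seconds and that lag is computed as "now minus last updated". The function name lags_from_prometheus and the host wdqs-example:9193 are hypothetical. Prometheus encodes sample values as strings, so "NaN" has to be parsed and rejected explicitly before doing arithmetic on it.

```python
import math
import time
from typing import Dict, List, Optional


def lags_from_prometheus(result: List[dict], now: Optional[float] = None) -> Dict[str, int]:
    """Map instance -> lag in seconds, skipping samples with no usable value.

    Hypothetical helper; assumes each sample value is a Unix timestamp in seconds.
    """
    if now is None:
        now = time.time()
    lags: Dict[str, int] = {}
    for sample in result:
        instance = sample.get("metric", {}).get("instance")
        value = sample.get("value", [])
        if instance is None or len(value) != 2:
            continue  # malformed sample, ignore it
        try:
            # Sample values arrive as strings, e.g. "1588807100" or "NaN".
            last_updated = float(value[1])
        except (TypeError, ValueError):
            continue  # not parseable as a number at all
        if not math.isfinite(last_updated):
            continue  # "NaN" or +/-Inf: treat the host as not in the pool
        lags[instance] = int(now - last_updated)
    return lags


# The NaN sample from the paste above, plus one made-up healthy host.
result = [
    {"metric": {"instance": "wdqs2001:9193"}, "value": [1588807140.848, "NaN"]},
    {"metric": {"instance": "wdqs-example:9193"}, "value": [1588807140.848, "1588807100"]},
]
print(lags_from_prometheus(result, now=1588807140.0))
# prints {'wdqs-example:9193': 40}
```

Skipping the host entirely, rather than reporting a lag of 0 or some sentinel, keeps a depooled server from influencing whatever aggregate the caller computes over the remaining hosts.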

Addshore triaged this task as Medium priority. May 11 2020, 9:21 AM

Change 597474 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikidata.org@master] WikimediaPrometheusQueryServiceLagProvider Nan values

https://gerrit.wikimedia.org/r/597474

Change 597474 merged by jenkins-bot:
[mediawiki/extensions/Wikidata.org@master] WikimediaPrometheusQueryServiceLagProvider Nan values

https://gerrit.wikimedia.org/r/597474

Verified as deployed, although we haven't had a real "NaN" situation since.