
WikimediaPrometheusQueryServiceLagProvider: PHP Warning: A non-numeric value encountered
Closed, Resolved · Public · PRODUCTION ERROR



MediaWiki version: 1.35.0-wmf.31

PHP Warning: A non-numeric value encountered




40+ of these since 1.35.0-wmf.31.


Request ID
Request URL
Stack Trace
#0 /srv/mediawiki/php-1.35.0-wmf.31/extensions/ MWExceptionHandler::handleError(integer, string, string, integer, array)
#1 /srv/mediawiki/php-1.35.0-wmf.31/extensions/ WikidataOrg\QueryServiceLag\WikimediaPrometheusQueryServiceLagProvider->getLags()
#2 /srv/mediawiki/php-1.35.0-wmf.31/extensions/ WikidataOrg\QueryServiceLag\WikimediaPrometheusQueryServiceLagProvider->getLag()
#3 /srv/mediawiki/php-1.35.0-wmf.31/maintenance/doMaintenance.php(105): WikidataOrg\UpdateQueryServiceLag->execute()
#4 /srv/mediawiki/php-1.35.0-wmf.31/extensions/ require_once(string)
#5 /srv/mediawiki/multiversion/MWScript.php(101): require_once(string)
#6 {main}

Event Timeline

brennen created this task. May 6 2020, 9:52 PM
Restricted Application added a subscriber: Aklapper. May 6 2020, 9:52 PM
brennen triaged this task as Unbreak Now! priority. May 6 2020, 10:03 PM
Restricted Application added a subscriber: Liuxinyu970226. May 6 2020, 10:03 PM
brennen added a subscriber: Addshore. Edited May 6 2020, 10:16 PM

Just realizing this is due to /srv/mediawiki-staging/multiversion/MWScript.php extensions/ --wiki wikidatawiki --cluster wdqs --prometheus prometheus.svc.eqiad.wmnet --prometheus prometheus.svc.codfw.wmnet.


Presumably this is cron'd? Unclear whether this warrants rollback.

brennen moved this task from Backlog to Logs/Train on the User-brennen board. May 6 2020, 10:17 PM
brennen lowered the priority of this task from Unbreak Now! to Needs Triage. May 6 2020, 10:34 PM

After discussion in #wikimedia-operations, unblocking train on the assumption that this is likely to be bad data from Prometheus.

16:25 <brennen> meanwhile:  does T252077 warrant a rollback?
16:26 <Reedy>
16:26 <Reedy> (it is a cron, yeah)
16:27  * Reedy looks
16:27 <brennen> thx.
16:30 <Reedy> brennen: I'm presuming time() isn't broken in PHP...
16:30 <brennen> well one can hope.
16:30 <Reedy> :)
16:30 <Reedy> So I'm guessing it's bad data from the prometheus service
16:30 <Reedy> No recent changes to the code
16:31 <brennen> yeah, makes sense.  in that case i'll unblock.

I can confirm no rollback or blocking is needed

Looking at P11165, Prometheus seems to be returning a NaN value, which is not accounted for in the code.

        "metric": {
          "__name__": "blazegraph_lastupdated",
          "cluster": "wdqs",
          "instance": "wdqs2001:9193",
          "job": "blazegraph",
          "site": "codfw"
        "value": [

Looks like that host is doing something else now.

I think the below is related.

21:07 ryankemper@cumin1001: START - Cookbook
21:05 ryankemper@cumin1001: END (FAIL) - Cookbook (exit_code=99)
21:04 ryankemper@cumin1001: START - Cookbook

I guess the code needs to account for NaN values and count them as not in the pool of servers to look at.
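
A minimal sketch of that idea, in Python rather than the extension's PHP and not taken from the actual patch. It assumes the result list has the shape of a Prometheus instant-query response, where each sample value arrives as a string (so a missing reading shows up as "NaN"), and that lag is computed as the current time minus the reported last-updated timestamp. The function name `lags_from_prometheus` and the sample hosts are hypothetical.

```python
import math


def lags_from_prometheus(result, now):
    """Return {instance: lag_seconds}, skipping servers with unusable samples.

    `result` is assumed to look like the "result" list of a Prometheus
    instant query: each entry has a "metric" dict and a "value" pair of
    [scrape_timestamp, sample_as_string].
    """
    lags = {}
    for series in result:
        instance = series["metric"]["instance"]
        _scrape_time, raw = series["value"]
        try:
            last_updated = float(raw)  # float("NaN") parses to nan
        except (TypeError, ValueError):
            continue  # not a number at all: leave this server out
        if not math.isfinite(last_updated):
            continue  # NaN or +/-Inf: treat the server as not in the pool
        lags[instance] = now - last_updated
    return lags


# Hypothetical data: one healthy server and one reporting NaN.
sample = [
    {"metric": {"instance": "host-a:9193"}, "value": [1000.0, "940"]},
    {"metric": {"instance": "host-b:9193"}, "value": [1000.0, "NaN"]},
]
lags = lags_from_prometheus(sample, now=1000.0)
```

With this approach the NaN server simply drops out of the lag calculation instead of reaching the arithmetic, which is where the non-numeric warning would otherwise come from.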

Addshore triaged this task as Medium priority. May 11 2020, 9:21 AM
Restricted Application added a project: User-Addshore. May 20 2020, 8:19 AM

Change 597474 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/] WikimediaPrometheusQueryServiceLagProvider Nan values

Change 597474 merged by jenkins-bot:
[mediawiki/extensions/] WikimediaPrometheusQueryServiceLagProvider Nan values

Addshore closed this task as Resolved. Jun 2 2020, 2:25 PM

Verified as deployed, although we haven't had a real "NaN" situation since.