
Prometheus API error when requesting Blazegraph lag for WDQS
Closed, Resolved · Public · PRODUCTION ERROR

Description

Error

MediaWiki version: 1.36.0-wmf.9

message
WikidataOrg\QueryServiceLag\WikimediaPrometheusQueryServiceLagProvider::getLags: unexpected result from Prometheus API http://prometheus.svc.eqiad.wmnet/ops/api/v1/query?query=blazegraph_lastupdated
cli_argv
/srv/mediawiki-staging/multiversion/MWScript.php extensions/Wikidata.org/maintenance/updateQueryServiceLag.php --wiki wikidatawiki --cluster wdqs --prometheus prometheus.svc.eqiad.wmnet --prometheus prometheus.svc.codfw.wmnet

Impact

Notes

  • It started last Friday (2020-09-18)
  • Also tagging Discovery-Search (Current work), as there seems to be some work happening on WDQS/Blazegraph in that context as well.
  • I can't see anything out of the ordinary on the WDQS Grafana board, except that there seems to have been a jump in the load of one of the servers on the evening of the 17th, but that might be unrelated.

Details

Request ID
fedac512e35ba78d262a3aa4
Request URL
n/a
Stack Trace
exception.trace
n/a

Event Timeline

Michael created this task. Sep 23 2020, 5:31 PM
Restricted Application added a subscriber: Aklapper. Sep 23 2020, 5:31 PM

I have no idea where to even begin to debug this. Maybe one of the subscribers has an idea. Triage by someone in the know would also be great.

addshore@deploy1001:~$ curl http://prometheus.svc.eqiad.wmnet/ops/api/v1/query?query=blazegraph_lastupdated
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"blazegraph_lastupdated","cluster":"wdqs","instance":"wdqs1004:9193","job":"blazegraph","site":"eqiad"},"value":[1600935337.656,"1600935288"]},{"metric":{"__name__":"blazegraph_lastupdated","cluster":"wdqs","instance":"wdqs1005:9193","job":"blazegraph","site":"eqiad"},"value":[1600935337.656,"1600935310"]},{"metric":{"__name__":"blazegraph_lastupdated","cluster":"wdqs","instance":"wdqs1006:9193","job":"blazegraph","site":"eqiad"},"value":[1600935337.656,"1600935295"]},{"metric":{"__name__":"blazegraph_lastupdated","cluster":"wdqs","instance":"wdqs1007:9193","job":"blazegraph","site":"eqiad"},"value":[1600935337.656,"1600935296"]},{"metric":{"__name__":"blazegraph_lastupdated","cluster":"wdqs-internal","instance":"wdqs1003:9193","job":"blazegraph","site":"eqiad"},"value":[1600935337.656,"1600935287"]},{"metric":{"__name__":"blazegraph_lastupdated","cluster":"wdqs-internal","instance":"wdqs1008:9193","job":"blazegraph","site":"eqiad"},"value":[1600935337.656,"1600935313"]},{"metric":{"__name__":"blazegraph_lastupdated","cluster":"wdqs-test","instance":"wdqs1009:9193","job":"blazegraph","site":"eqiad"},"value":[1600935337.656,"NaN"]},{"metric":{"__name__":"blazegraph_lastupdated","cluster":"wdqs-test","instance":"wdqs1009:9194","job":"blazegraph","site":"eqiad"},"value":[1600935337.656,"NaN"]},{"metric":{"__name__":"blazegraph_lastupdated","cluster":"wdqs-test","instance":"wdqs1010:9193","job":"blazegraph","site":"eqiad"},"value":[1600935337.656,"1600935298"]}]}}

prettified

{
	"status": "success",
	"data":
	{
		"resultType": "vector",
		"result": [
		{
			"metric":
			{
				"__name__": "blazegraph_lastupdated",
				"cluster": "wdqs",
				"instance": "wdqs1004:9193",
				"job": "blazegraph",
				"site": "eqiad"
			},
			"value": [1600935337.656, "1600935288"]
		},
		{
			"metric":
			{
				"__name__": "blazegraph_lastupdated",
				"cluster": "wdqs",
				"instance": "wdqs1005:9193",
				"job": "blazegraph",
				"site": "eqiad"
			},
			"value": [1600935337.656, "1600935310"]
		},
		{
			"metric":
			{
				"__name__": "blazegraph_lastupdated",
				"cluster": "wdqs",
				"instance": "wdqs1006:9193",
				"job": "blazegraph",
				"site": "eqiad"
			},
			"value": [1600935337.656, "1600935295"]
		},
		{
			"metric":
			{
				"__name__": "blazegraph_lastupdated",
				"cluster": "wdqs",
				"instance": "wdqs1007:9193",
				"job": "blazegraph",
				"site": "eqiad"
			},
			"value": [1600935337.656, "1600935296"]
		},
		{
			"metric":
			{
				"__name__": "blazegraph_lastupdated",
				"cluster": "wdqs-internal",
				"instance": "wdqs1003:9193",
				"job": "blazegraph",
				"site": "eqiad"
			},
			"value": [1600935337.656, "1600935287"]
		},
		{
			"metric":
			{
				"__name__": "blazegraph_lastupdated",
				"cluster": "wdqs-internal",
				"instance": "wdqs1008:9193",
				"job": "blazegraph",
				"site": "eqiad"
			},
			"value": [1600935337.656, "1600935313"]
		},
		{
			"metric":
			{
				"__name__": "blazegraph_lastupdated",
				"cluster": "wdqs-test",
				"instance": "wdqs1009:9193",
				"job": "blazegraph",
				"site": "eqiad"
			},
			"value": [1600935337.656, "NaN"]
		},
		{
			"metric":
			{
				"__name__": "blazegraph_lastupdated",
				"cluster": "wdqs-test",
				"instance": "wdqs1009:9194",
				"job": "blazegraph",
				"site": "eqiad"
			},
			"value": [1600935337.656, "NaN"]
		},
		{
			"metric":
			{
				"__name__": "blazegraph_lastupdated",
				"cluster": "wdqs-test",
				"instance": "wdqs1010:9193",
				"job": "blazegraph",
				"site": "eqiad"
			},
			"value": [1600935337.656, "1600935298"]
		}]
	}
}

I guess this is down to these NaN values.
But also, we don't want to be looking at "cluster": "wdqs-test".
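
As an aside, a Prometheus label selector could restrict the query to the clusters we actually care about, so that the wdqs-test NaN samples never reach the lag provider at all. A minimal sketch, assuming the same API endpoint and that only the wdqs and wdqs-internal clusters matter (the exact cluster list is an assumption, not taken from this task):

# Hypothetical variant of the query above: only wdqs and wdqs-internal,
# which excludes the wdqs-test series that return NaN.
curl -sG 'http://prometheus.svc.eqiad.wmnet/ops/api/v1/query' \
    --data-urlencode 'query=blazegraph_lastupdated{cluster=~"wdqs|wdqs-internal"}'

Whether to filter server-side like this or to skip the irrelevant clusters in the PHP provider is a design choice; it does not change what the remaining series report.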

Pablo-WMDE added a comment. Edited Sep 24 2020, 12:17 PM

Copying conversations from chat

@Michael wrote

This is probably fine because it lets us know that there is a problem that is not unexpected but needs fixing.
The problem seems to be that this includes wdqs-test servers for which the premise "should always provide correct results" does not hold?

@Addshore wrote

Yes, it looks like wdqs-test is the issue then, and we shouldn't be warning for the clusters that we are not actually going to care about.

So, the way I understand it, we consider entries with "cluster": "wdqs-test" to be mere logspam.
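
To make the offending series easy to spot from the command line, the raw result earlier in this task can be filtered with jq. This is only a diagnostic sketch (it assumes jq is available on the host), not part of the fix itself:

# List only the series whose value is "NaN", i.e. the entries treated as logspam,
# together with their cluster and instance labels.
curl -sG 'http://prometheus.svc.eqiad.wmnet/ops/api/v1/query' \
    --data-urlencode 'query=blazegraph_lastupdated' \
  | jq -r '.data.result[] | select(.value[1] == "NaN") | "\(.metric.cluster) \(.metric.instance)"'

Against the output captured above, this would list only the two wdqs-test series on wdqs1009.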

Change 630205 had a related patch set uploaded (by Michael Große; owner: Michael Große):
[mediawiki/extensions/Wikidata.org@master] Don't log an error on broken timestamp for irrelevant clusters

https://gerrit.wikimedia.org/r/630205

Change 630205 merged by jenkins-bot:
[mediawiki/extensions/Wikidata.org@master] Don't log an error on broken timestamp for irrelevant clusters

https://gerrit.wikimedia.org/r/630205

It is not entirely clear to me what the expected behaviour is. Looking at https://logstash.wikimedia.org/goto/cb4883b25c11556429eed17ec7e78dcd, there seem to be just a handful of {method}: Request to Prometheus API {apiUrl} failed with {error} errors over the last 24 hours. They all seem to be related to some network problems:

WikidataOrg\QueryServiceLag\WikimediaPrometheusQueryServiceLagProvider::getLags: Request to Prometheus API http://prometheus.svc.eqiad.wmnet/ops/api/v1/query?query=blazegraph_lastupdated failed with There was a problem during the HTTP request: 503 Service Unavailable

I suspect we are conveniently going to label these as a temporary hiccup? Is there anything else to resolve within this task? @Addshore

Addshore closed this task as Resolved. Oct 13 2020, 7:46 AM
Addshore claimed this task.
Restricted Application added a project: User-Addshore. Oct 13 2020, 7:46 AM
Addshore moved this task from incoming to in progress on the Wikidata board. Oct 13 2020, 7:47 AM