Update maxlag calculation maintenance script to reflect new prometheus queries
Closed, Resolved | Public | 5 Estimated Story Points

Assigned To

Authored By

	ItamarWMDE
	May 10 2023, 9:21 AM

Description

Problem:

In T331405 ([WD-ORG] Query service maxlag calculation should exclude datacenters that don't receive traffic and where the updater is turned off), it was concluded that using a new Prometheus query in the updateQueryServiceLag.php maintenance script (in the Wikidata.org extension) will help mitigate potential false positive alerts while switching datacenters. The new query, as crafted by @dcausse in T331405#8806909, is:

Prometheus query for maxlag calculation

max(time() - label_replace(blazegraph_lastupdated, "host", "$1", "instance", "^([^:]+):.*") and on(host) label_replace(rate(org_wikidata_query_rdf_blazegraph_filters_QueryEventSenderFilter_event_sender_filter_StartedQueries{}[5m]) > 1, "host", "$1", "instance", "^([^:]+):.*"))

To ensure this query is taken into account while running the maxlag updater maintenance script, the Prometheus URLs in the script should be updated.
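To make the intent of the query easier to follow, here is a rough Python model of what it computes. It is only a sketch: the function names, the second port number and the sample values are made up for illustration, and real PromQL vector matching has more edge cases than this.

```python
import re

# Prometheus anchors label_replace regexes at both ends, hence fullmatch().
HOST_RE = re.compile(r"^([^:]+):.*")


def host_of(instance):
    """Rough equivalent of label_replace(..., "host", "$1", "instance", "^([^:]+):.*")."""
    match = HOST_RE.fullmatch(instance)
    return match.group(1) if match else None


def max_lag(now, last_updated, started_query_rate, min_rate=1.0):
    """Model of max(time() - blazegraph_lastupdated and on(host) rate(...) > min_rate)."""
    # Hosts that currently serve queries (right-hand side of "and on(host)").
    serving = {host_of(i) for i, rate in started_query_rate.items() if rate > min_rate}
    # Lag of every host that survives the filter.
    lags = [now - ts for i, ts in last_updated.items() if host_of(i) in serving]
    return max(lags) if lags else None


# Hypothetical sample: wdqs2001 is far behind but receives no traffic, so it is ignored.
last_updated = {"wdqs1015:9193": 900.0, "wdqs2001:9193": 400.0}
started_query_rate = {"wdqs1015:9102": 12.5, "wdqs2001:9102": 0.0}
print(max_lag(1000.0, last_updated, started_query_rate))
```

The two metrics are keyed by "instance" values that may differ (for example in the port), which is why both sides are first mapped to a common "host" label before being joined.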

Acceptance Criteria:

The updateQueryServiceLag.php maintenance script is updated to use the new Prometheus query for maxlag calculation.

Details

	Subject	Repo	Branch	Lines +/-
	Use new Prometheus query in updateQueryServiceLag	mediawiki/extensions/Wikidata.org	master	+111 -763


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		None	T331405 [WD-ORG] Query service maxlag calculation should exclude datacenters that don't receive traffic and where the updater is turned off
		Resolved		hoo	T336352 Update maxlag calculation maintenance script to reflect new prometheus queries

Event Timeline

Prioritization meeting notes from parent task:

Task Review:

This may produce false positive alerts when switching over data centers, which adds overhead to an already stressful situation.
The scope of the task for the Wikidata team is to replace the URLs of the Prometheus query used to determine the maxlag value.

Prio notes:

Impact areas: Reliability
Does not affect end users / production
Affects monitoring efforts
Does not affect development efforts
Does not affect onboarding efforts
Affects additional stakeholders (WMF Search team)

ItamarWMDE renamed this task from Update maxlag calculation maintenance script to reflect new prometheus queries to [SW] Update maxlag calculation maintenance script to reflect new prometheus queries. May 10 2023, 9:22 AM

ItamarWMDE moved this task from Incoming to [DOT] Tech Backlog on the Wikidata Dev Team board.

ItamarWMDE moved this task from Incoming to [DOT] Prioritized on the wmde-wikidata-tech board.

ItamarWMDE renamed this task from [SW] Update maxlag calculation maintenance script to reflect new prometheus queries to Update maxlag calculation maintenance script to reflect new prometheus queries. May 10 2023, 11:48 AM

• karapayneWMDE moved this task from [DOT] Tech Backlog to Unified DOT Backlog on the Wikidata Dev Team board. May 10 2023, 1:47 PM

Gehel moved this task from Incoming to Watching / Waiting on the Wikidata-Query-Service board. May 15 2023, 12:15 PM

Lucas_Werkmeister_WMDE updated the task description. Jun 28 2023, 9:22 AM

• karapayneWMDE set the point value for this task to 5. Jun 28 2023, 9:25 AM

• karapayneWMDE moved this task from Unified DOT Backlog to Sprint-∞ on the Wikidata Dev Team board.

• karapayneWMDE edited projects, added Wikidata Dev Team (Sprint-∞); removed Wikidata Dev Team.

Michael updated the task description. Jun 28 2023, 9:26 AM

Notes from task breakdown:

At some point during development or review, someone should verify that the updated URL returns sensible results in production. (That requires production shell access, but the person doing this check doesn’t have to be the person who picks up this task.)
- The URL is currently 'http://' . $host . '/ops/api/v1/query?query=blazegraph_lastupdated' (in updateQueryServiceLag.php), where $host is prometheus.svc.eqiad.wmnet (from puppet wikidata.pp).
- We’re probably changing the URL ?query part and keeping the $host the same.
- So the command to check would be something like curl http://prometheus.svc.eqiad.wmnet/ops/api/v1/query?query=... in production.
- We can also check whether the old and new query URL return approximately the same results (we don’t expect any difference at the moment).
We should check whether the isInstancePooled() check in MostLaggedPooledServerProvider (using information from WikimediaLoadBalancerQueryServicePoolStatusProvider) is still necessary or not.
- The “only take pooled servers into account” check is now probably encoded in the Prometheus query itself.
- Probably re-read T331405 to make sure we understand the situation.
We probably won’t be able to verify that this does the right thing until the next data center switch, but we can at least verify that API requests with maxlag=-1 still return maxlag of type wikibase-queryservice most of the time (rather than type db as on other wikis), i.e. that we didn’t completely lose maxlag information.
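For the URL check described in these notes, a small sketch of how the old and new query URLs could be built with proper percent-encoding. The host and path are the ones quoted in the notes; the helper name query_url is made up here, and the script itself may well build its URL differently.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

PROMETHEUS_HOST = "prometheus.svc.eqiad.wmnet"  # value quoted in the notes (puppet wikidata.pp)
OLD_QUERY = "blazegraph_lastupdated"
NEW_QUERY = (
    'max(time() - label_replace(blazegraph_lastupdated, "host", "$1", "instance", "^([^:]+):.*") '
    'and on(host) label_replace(rate(org_wikidata_query_rdf_blazegraph_filters_QueryEventSenderFilter_event_sender_filter_StartedQueries{}[5m]) > 1, '
    '"host", "$1", "instance", "^([^:]+):.*"))'
)


def query_url(host, promql):
    """Build an instant-query URL; urlencode() escapes quotes, braces and the regex '+'."""
    return "http://" + host + "/ops/api/v1/query?" + urlencode({"query": promql})


old_url = query_url(PROMETHEUS_HOST, OLD_QUERY)
new_url = query_url(PROMETHEUS_HOST, NEW_QUERY)
print(old_url)
print(new_url)

# Decoding the query string again should give back the original PromQL.
decoded = parse_qs(urlsplit(new_url).query)["query"][0]
print(decoded == NEW_QUERY)
```

The encoding matters mostly because of the literal + in the regex: left unescaped in a query string, it would be decoded as a space and silently change the expression.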

hoo claimed this task. Jun 29 2023, 10:15 AM

hoo moved this task from Todo/Backlog to Doing on the Wikidata Dev Team (Sprint-∞) board.

One problem encountered while migrating the maintenance script to the new format is that we can no longer report which server is the most lagged.

Rewording the query fixed this, but this way we still need to "manually" compute the max lag:

time() - label_replace(blazegraph_lastupdated, "host", "$1", "instance", "^([^:]+):.*") and on(host) label_replace(rate(org_wikidata_query_rdf_blazegraph_filters_QueryEventSenderFilter_event_sender_filter_StartedQueries{}[5m]) > 1, "host", "$1", "instance", "^([^:]+):.*")
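With this form of the query the response is a vector with one sample per host, so the consumer has to pick the largest value itself. A minimal sketch of that "manual" step, assuming the usual Prometheus instant-query response shape; the function name and the sample hosts and values below are invented for illustration.

```python
def most_lagged(response):
    """Return (host, lag_seconds) of the most lagged host in a vector result, or None."""
    results = response["data"]["result"]
    if not results:
        return None
    # Sample values arrive as strings, so convert before comparing.
    worst = max(results, key=lambda sample: float(sample["value"][1]))
    return worst["metric"].get("host"), float(worst["value"][1])


# Hypothetical response with two hosts that currently serve queries.
example = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"host": "wdqs1011"}, "value": [1688505009.0, "12.5"]},
            {"metric": {"host": "wdqs1015"}, "value": [1688505009.0, "87.9"]},
        ],
    },
}
print(most_lagged(example))
```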

I found that I can re-write the query to use topk (with k=1) instead of max, which should do the very same as max, except that it also returns all original fields (please correct me if I'm wrong @dcausse):

topk(1, time() - label_replace(blazegraph_lastupdated, "host", "$1", "instance", "^([^:]+):.*") and on(host) label_replace(rate(org_wikidata_query_rdf_blazegraph_filters_QueryEventSenderFilter_event_sender_filter_StartedQueries{}[5m]) > 1, "host", "$1", "instance", "^([^:]+):.*"))

hoo@mwmaint1002:~$ curl '…topk…'; echo; echo; curl '…max…';
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"cluster":"wdqs","host":"wdqs1015","instance":"wdqs1015:9193","job":"blazegraph","site":"eqiad"},"value":[1688505009.894,"87.89400005340576"]}]}}

{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1688505009.908,"85.9079999923706"]}]}}

(I guess that some deviation is to be expected here, as those don't hit the same instance at the very same time)
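To spell out the difference between the two responses above, a short sketch that parses both bodies as pasted and compares them. The helper name first_sample is made up; the JSON is copied verbatim from the curl output.

```python
import json

TOPK_BODY = '{"status":"success","data":{"resultType":"vector","result":[{"metric":{"cluster":"wdqs","host":"wdqs1015","instance":"wdqs1015:9193","job":"blazegraph","site":"eqiad"},"value":[1688505009.894,"87.89400005340576"]}]}}'
MAX_BODY = '{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1688505009.908,"85.9079999923706"]}]}}'


def first_sample(body):
    """Return (labels, evaluation_timestamp, value) of the first sample in a vector result."""
    sample = json.loads(body)["data"]["result"][0]
    timestamp, value = sample["value"]
    return sample["metric"], float(timestamp), float(value)


topk_labels, topk_ts, topk_lag = first_sample(TOPK_BODY)
max_labels, max_ts, max_lag_value = first_sample(MAX_BODY)

# topk keeps the labels of the winning series, max aggregates them away.
print(topk_labels.get("host"), max_labels)
# The two requests were evaluated at slightly different times and report slightly different lag.
print(round(max_ts - topk_ts, 3), round(topk_lag - max_lag_value, 3))
```

In this pair the lag values differ by about two seconds, while the evaluation timestamps differ by only a few milliseconds.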

@hoo using topk sounds good to me!
I used max to graph the maxlag as a single time series in Grafana but hadn't thought about your use case.

Change 936052 had a related patch set uploaded (by Hoo man; author: Hoo man):

[mediawiki/extensions/Wikidata.org@master] Use new Prometheus query in updateQueryServiceLag

https://gerrit.wikimedia.org/r/936052

gerritbot added a project: Patch-For-Review. Jul 6 2023, 2:52 PM

hoo moved this task from Doing to Peer Review on the Wikidata Dev Team (Sprint-∞) board. Jul 6 2023, 2:54 PM

Change 936052 merged by jenkins-bot:

[mediawiki/extensions/Wikidata.org@master] Use new Prometheus query in updateQueryServiceLag

https://gerrit.wikimedia.org/r/936052

Maintenance_bot removed a project: Patch-For-Review. Jul 11 2023, 12:30 PM

hoo moved this task from Peer Review to Tech Verification on the Wikidata Dev Team (Sprint-∞) board. Jul 11 2023, 1:49 PM

Wow! This is such an improvement to the query lag calculation, and it greatly reduces the complexity here. Thank you @hoo, and nice find on the topk.

ItamarWMDE closed this task as Resolved. Jul 27 2023, 2:46 PM

ItamarWMDE moved this task from Tech Verification to Our work done on the Wikidata Dev Team (Sprint-∞) board. Aug 3 2023, 3:53 AM

dcausse mentioned this in T360993: WDQS lag propagation to wikidata not working as intended. Mar 26 2024, 10:51 AM

Maintenance_bot moved this task from [DOT] Prioritized to Ongoing on the wmde-wikidata-tech board. Mar 26 2024, 11:29 AM

Change #1014580 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/Wikidata.org@master] updateQueryServiceLag: add an option to tune the query rate

https://gerrit.wikimedia.org/r/1014580

gerritbot added a project: Patch-For-Review. Mar 26 2024, 6:21 PM

dcausse removed a project: Patch-For-Review. Mar 26 2024, 8:08 PM