
Update maxlag calculation maintenance script to reflect new prometheus queries
Closed, Resolved | Public | 5 Estimated Story Points

Description

Problem:

In T331405 ("[WD-ORG] Query service maxlag calculation should exclude datacenters that don't receive traffic and where the updater is turned off") it was concluded that using a new Prometheus query in the updateQueryServiceLag.php maintenance script (in the Wikidata.org extension) will help mitigate potential false positive alerts while switching datacenters. The new query, as crafted by @dcausse in T331405#8806909, is:

Prometheus query for maxlag calculation
max(time() - label_replace(blazegraph_lastupdated, "host", "$1", "instance", "^([^:]+):.*") and on(host) label_replace(rate(org_wikidata_query_rdf_blazegraph_filters_QueryEventSenderFilter_event_sender_filter_StartedQueries{}[5m]) > 1, "host", "$1", "instance", "^([^:]+):.*"))

To ensure this query is taken into account while running the maxlag updater maintenance script, the Prometheus URLs in the script should be updated.

Acceptance Criteria:

  • The updateQueryServiceLag.php maintenance script is updated to use the new Prometheus query for the maxlag calculation.

Event Timeline

Prioritization meeting notes from parent task:

Task Review:

  • The current maxlag calculation may produce false positive alerts when switching over data centers, which adds overhead to an already stressful situation.
  • The scope of the task for the Wikidata team is to replace the URLs of the Prometheus query used to determine the maxlag value.

Prio notes:

  • Impact areas: Reliability
  • Does not affect end users / production
  • Affects monitoring efforts
  • Does not affect development efforts
  • Does not affect onboarding efforts
  • Affects additional stakeholders (WMF Search team)
ItamarWMDE renamed this task from Update maxlag calculation maintenance script to reflect new prometheus queries to [SW] Update maxlag calculation maintenance script to reflect new prometheus queries. May 10 2023, 9:22 AM
ItamarWMDE moved this task from Incoming to [DOT] Tech Backlog on the Wikidata Dev Team board.
ItamarWMDE moved this task from Incoming to [DOT] Prioritized on the wmde-wikidata-tech board.
ItamarWMDE renamed this task from [SW] Update maxlag calculation maintenance script to reflect new prometheus queries to Update maxlag calculation maintenance script to reflect new prometheus queries. May 10 2023, 11:48 AM

Notes from task breakdown:

  • At some point during development or review, someone should verify that the updated URL returns sensible results in production. (That requires production shell access, but the person doing this check doesn’t have to be the person who picks up this task.)
    • The URL is currently 'http://' . $host . '/ops/api/v1/query?query=blazegraph_lastupdated' (in updateQueryServiceLag.php), where $host is prometheus.svc.eqiad.wmnet (from puppet wikidata.pp).
    • We’re probably changing the URL ?query part and keeping the $host the same.
    • So the command to check would be something like curl http://prometheus.svc.eqiad.wmnet/ops/api/v1/query?query=... in production.
    • We can also check whether the old and new query URL return approximately the same results (we don’t expect any difference at the moment).
  • We should check whether the isInstancePooled() check in MostLaggedPooledServerProvider (using information from WikimediaLoadBalancerQueryServicePoolStatusProvider) is still necessary or not.
    • The “only take pooled servers into account” check is now probably encoded in the prometheus query itself.
    • Probably re-read T331405 to make sure we understand the situation.
  • We probably won’t be able to verify that this does the right thing until the next data center switch, but we can at least verify that API requests with maxlag=-1 still return maxlag of type wikibase-queryservice most of the time (rather than type db as on other wikis), i.e. that we didn’t completely lose maxlag information.
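The production check from the notes above needs the PromQL expression percent-encoded before it goes into the ?query part. A minimal sketch of building that URL, assuming the host and path quoted in the notes stay as they are; build_query_url is a hypothetical helper name and not part of the extension:

```python
from urllib.parse import urlencode

# Host and API path as quoted in the breakdown notes.
HOST = "prometheus.svc.eqiad.wmnet"

# The query from the task description, split across literals for readability.
QUERY = (
    'max(time() - label_replace(blazegraph_lastupdated, "host", "$1", "instance", "^([^:]+):.*") '
    'and on(host) label_replace(rate('
    'org_wikidata_query_rdf_blazegraph_filters_QueryEventSenderFilter_event_sender_filter_StartedQueries{}[5m]'
    ') > 1, "host", "$1", "instance", "^([^:]+):.*"))'
)


def build_query_url(host: str, promql: str) -> str:
    """Return the instant-query URL with the PromQL expression percent-encoded."""
    return "http://" + host + "/ops/api/v1/query?" + urlencode({"query": promql})


url = build_query_url(HOST, QUERY)
print(url)  # paste into: curl '<url>'
```

The printed URL can be passed to curl in single quotes on a production host; running the script itself does not contact Prometheus.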

One problem encountered while migrating the maintenance script to the new query is that we can no longer report which server is the most lagged, because max() aggregates away all labels, including the host.

Dropping the outer max() from the query fixes this, but then we still need to compute the max lag "manually":

time() - label_replace(blazegraph_lastupdated, "host", "$1", "instance", "^([^:]+):.*") and on(host) label_replace(rate(org_wikidata_query_rdf_blazegraph_filters_QueryEventSenderFilter_event_sender_filter_StartedQueries{}[5m]) > 1, "host", "$1", "instance", "^([^:]+):.*")
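For illustration, the "manual" step could look roughly like the sketch below: take the per-host vector that the query without max() returns and pick the entry with the highest value. The function name most_lagged and the sample hosts and values are hypothetical; only the shape of each entry (a "metric" object plus a [timestamp, "value"] pair) follows the Prometheus instant-query response format.

```python
def most_lagged(result):
    """Pick (host, lag_seconds) for the entry with the highest lag.

    `result` is the list found under data.result in an instant-vector response.
    Returns None when the vector is empty.
    """
    if not result:
        return None
    # Prometheus encodes sample values as strings, so convert before comparing.
    entry = max(result, key=lambda e: float(e["value"][1]))
    return entry["metric"].get("host"), float(entry["value"][1])


# Hypothetical sample data, not taken from production.
sample = [
    {"metric": {"host": "wdqs-a", "instance": "wdqs-a:9193"}, "value": [1688505009.0, "12.5"]},
    {"metric": {"host": "wdqs-b", "instance": "wdqs-b:9193"}, "value": [1688505009.0, "87.9"]},
]
print(most_lagged(sample))
```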

I found that I can rewrite the query to use topk (with k=1) instead of max, which should return the same value as max, except that it also keeps all the original labels (please correct me if I'm wrong, @dcausse):

topk(1, time() - label_replace(blazegraph_lastupdated, "host", "$1", "instance", "^([^:]+):.*") and on(host) label_replace(rate(org_wikidata_query_rdf_blazegraph_filters_QueryEventSenderFilter_event_sender_filter_StartedQueries{}[5m]) > 1, "host", "$1", "instance", "^([^:]+):.*"))
hoo@mwmaint1002:~$ curl '…topk…'; echo; echo; curl '…max…';
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"cluster":"wdqs","host":"wdqs1015","instance":"wdqs1015:9193","job":"blazegraph","site":"eqiad"},"value":[1688505009.894,"87.89400005340576"]}]}}

{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1688505009.908,"85.9079999923706"]}]}}

(I guess that some deviation is to be expected here, as those don't hit the same instance at the very same time)
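As a small sketch of what the two responses above contain, the snippet below parses both JSON bodies exactly as pasted and prints the labels and the lag value from each. The helper name first_sample is hypothetical; this is not how the maintenance script itself reads the response.

```python
import json

# Response bodies copied verbatim from the curl output above.
TOPK_RESPONSE = '{"status":"success","data":{"resultType":"vector","result":[{"metric":{"cluster":"wdqs","host":"wdqs1015","instance":"wdqs1015:9193","job":"blazegraph","site":"eqiad"},"value":[1688505009.894,"87.89400005340576"]}]}}'
MAX_RESPONSE = '{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1688505009.908,"85.9079999923706"]}]}}'


def first_sample(raw):
    """Return (labels, lag_seconds) of the first vector entry, or None if there is none."""
    body = json.loads(raw)
    if body.get("status") != "success" or not body["data"]["result"]:
        return None
    entry = body["data"]["result"][0]
    return entry["metric"], float(entry["value"][1])


topk_labels, topk_lag = first_sample(TOPK_RESPONSE)
max_labels, max_lag = first_sample(MAX_RESPONSE)

print(topk_labels.get("host"), topk_lag)  # topk keeps the host label
print(max_labels, max_lag)                # max returns an empty label set
print(round(topk_lag - max_lag, 3))       # deviation between the two requests, in seconds
```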

@hoo using topk sounds good to me!
I used max to graph the maxlag as a single time series in Grafana but hadn't thought about your use case.

Change 936052 had a related patch set uploaded (by Hoo man; author: Hoo man):

[mediawiki/extensions/Wikidata.org@master] Use new Prometheus query in updateQueryServiceLag

https://gerrit.wikimedia.org/r/936052

Change 936052 merged by jenkins-bot:

[mediawiki/extensions/Wikidata.org@master] Use new Prometheus query in updateQueryServiceLag

https://gerrit.wikimedia.org/r/936052

Wow! This is such an improvement to the query lag calculation, and it greatly reduces the complexity here. Thank you @hoo, and nice find on the topk.

Change #1014580 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/Wikidata.org@master] updateQueryServiceLag: add an option to tune the query rate

https://gerrit.wikimedia.org/r/1014580