
WDQS lag propagation to wikidata not working as intended
Open, High, Public

Description

Propagating the lag of a wdqs host should only be done if this host is ''pooled'' (actually serving user traffic).
Determining the ''pooling'' status turned out to be quite challenging in our infra, so in T336352 we started using a metric based on the query rate, hoping that it would be a reasonable proxy for determining whether the server is serving users or not.

This has worked well so far, but a recent incident where a server was depooled after being stuck for some reason showed that this query-rate-based metric is too fragile.
We consider a server to be pooled if its query rate is above 1 qps:
rate(org_wikidata_query_rdf_blazegraph_filters_QueryEventSenderFilter_event_sender_filter_StartedQueries{}[10m]) > 1

Sadly this assumption did not hold on wdqs1013 when it was depooled: for some reason its query rate was still above 1 (just below 1.3). It is possible that this metric is polluted with monitoring queries that do not relate to serving user traffic. We should perhaps refine how we generate org_wikidata_query_rdf_blazegraph_filters_QueryEventSenderFilter_event_sender_filter_StartedQueries and make sure we only measure user queries.
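For reference, the same expression restricted to a single host can be used to confirm the residual rate mentioned above; the instance matcher below is only an assumption about how the target is labelled in Prometheus:

rate(org_wikidata_query_rdf_blazegraph_filters_QueryEventSenderFilter_event_sender_filter_StartedQueries{instance=~"wdqs1013.*"}[10m])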

AC:

  • wdqs lag propagation should no longer include false positives (i.e. counting the lag of a server that is actually depooled)

Event Timeline

Mitigation on wdqs1013:

  • blazegraph stopped
  • updater stopped with the /srv/wdqs/data_loaded flag removed
  • puppet disabled

It is possible that this metric is polluted with monitoring queries that do not relate to serving user traffic

I did a little checking around this. Prometheus blackbox checks are defined here. Their frequency is defined by the scrape_interval setting for the probes-custom_puppet-http.yaml job, which is currently 15 seconds. Since we require a unique check per team, that should be around 8 queries per minute per Prometheus host.

Note that there are separate Prometheus checks that use the jmx exporter instead of blackbox, each with a frequency of 60 seconds. Looking at prometheus1006, I see the following:

  • jmx_wdqs_updater
  • jmx_query_service_streaming_updater
  • jmx_wdqs_blazegraph

My next step is to figure out how many of the Prometheus hosts are actively polling each WDQS endpoint.
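One possible way to check this from the Prometheus side (just a sketch; it assumes the check names listed above correspond to the Prometheus job labels, which I have not verified) is to list the scrape targets each Prometheus host knows about, e.g.:

up{job="jmx_wdqs_blazegraph"}

Running this on each Prometheus host and looking at the returned instance labels would show which of them actively poll a given WDQS endpoint.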

Here are the UAs seen in one hour on a depooled server:

+------------------------------------------------------------------+-----+
|UA                                                                |count|
+------------------------------------------------------------------+-----+
|check_http/v2.3.3 (monitoring-plugins 2.3.3)                      |87   |
|Twisted PageGetter                                                |2146 |
|prometheus-public-sparql-ep-check                                 |1913 |
|wmf-prometheus/prometheus-blazegraph-exporter (root@wikimedia.org)|120  |
+------------------------------------------------------------------+-----+

That adds up to roughly 4266 requests per hour, i.e. about 1.2 qps of pure monitoring traffic, which is consistent with the rate observed on wdqs1013 while it was depooled.

Change #1014551 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: add x-monitoring-query

https://gerrit.wikimedia.org/r/1014551

Change #1014566 had a related patch set uploaded (by DCausse; author: DCausse):

[wikidata/query/rdf@master] Add support for x-monitoring-query header

https://gerrit.wikimedia.org/r/1014566

Change #1014580 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/Wikidata.org@master] updateQueryServiceLag: add an option to tune the query rate

https://gerrit.wikimedia.org/r/1014580

Change #1014584 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] updateQueryServiceLag: tune the min query rate on a pooled server

https://gerrit.wikimedia.org/r/1014584

The approach taken is:

  • from nginx, control a new header named 'x-monitoring-query', setting it to true if a list of criteria is met (currently based on user-agent strings, but this could be extended to source IPs as well)
  • from blazegraph, do not log queries that have the x-monitoring-query header set
  • adapt Wikidata.org to allow tuning the minimal query rate expected to be served by a pooled server (it was hardcoded to 1.0)
  • change the systemd timer that runs updateQueryServiceLag.php to set --pooled-server-min-query-rate to 0.5 (we will need to double check that this value is sane and works well for both codfw and eqiad servers); see the sketch after this list
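Concretely, once monitoring queries stop incrementing the StartedQueries counter, the pooled check used for lag propagation would effectively become something like the following (a sketch based on the expression quoted in the description, with the hardcoded 1.0 replaced by the new tunable threshold):

rate(org_wikidata_query_rdf_blazegraph_filters_QueryEventSenderFilter_event_sender_filter_StartedQueries{}[10m]) > 0.5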

Per sudo cumin A:prometheus 'w' from a cumin host, there are 8 active prometheus hosts.

We also have 3 load balancer pools for each wdqs host:

  • wdqs
  • wdqs-ssl
  • wdqs-heavy-queries

Each one of these represents a separate healthcheck as well.

Change #1014566 merged by jenkins-bot:

[wikidata/query/rdf@master] Add support for x-monitoring-query header

https://gerrit.wikimedia.org/r/1014566

Change #1014551 merged by Bking:

[operations/puppet@production] wdqs: add x-monitoring-query

https://gerrit.wikimedia.org/r/1014551

Mentioned in SAL (#wikimedia-operations) [2024-03-27T15:55:13Z] <inflatador> bking@cumin2002 running puppet against A:wdqs-main to apply nginx changes T360993

Mentioned in SAL (#wikimedia-operations) [2024-03-27T22:16:57Z] <ryankemper> T360993 [WDQS Deploy] Gearing up for deploy of wdqs 0.3.138. Pre-deploy tests passing on canary wdqs1003

Mentioned in SAL (#wikimedia-operations) [2024-03-27T22:17:55Z] <ryankemper> T360993 [WDQS Deploy] Tests passing following deploy of 0.3.138 on canary wdqs1003; proceeding to rest of fleet

Mentioned in SAL (#wikimedia-operations) [2024-03-27T22:30:19Z] <ryankemper> T360993 [WDQS Deploy] Restarted wdqs-updater across all hosts, 4 hosts at a time: sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'

Mentioned in SAL (#wikimedia-operations) [2024-03-27T22:30:23Z] <ryankemper> T360993 [WDQS Deploy] Restarted wdqs-categories across all test hosts simultaneously: sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'

Mentioned in SAL (#wikimedia-operations) [2024-03-27T22:30:28Z] <ryankemper> T360993 [WDQS Deploy] Restarting wdqs-categories across lvs-managed hosts, one node at a time: sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'

Mentioned in SAL (#wikimedia-operations) [2024-03-27T23:58:24Z] <ryankemper> T360993 [WDQS Deploy] Deploy complete. Successful test query placed on query.wikidata.org, there's no relevant criticals in Icinga, and Grafana looks good

Change #1014580 merged by jenkins-bot:

[mediawiki/extensions/Wikidata.org@master] updateQueryServiceLag: add an option to tune the query rate

https://gerrit.wikimedia.org/r/1014580

Mentioned in SAL (#wikimedia-operations) [2024-03-28T13:07:34Z] <dcausse> temporarily depooling wdqs2009 (test query rate when depooled T360993)

Mentioned in SAL (#wikimedia-operations) [2024-03-28T13:17:10Z] <dcausse> repooling wdqs2009 (test query rate when depooled T360993)

After depooling the node we can see the query rate actually going down to 0. The request rate is generally very low on codfw, so we might have to tune the threshold to around 0.2.


I could re-enable puppet on wdqs1013 and restart the updater to catch up on updates. But apparently this machine was repooled yesterday (as part of the wdqs scap deploy, I suppose) and thus started to serve stale data without triggering any maxlag. It was only when re-enabling puppet that I realized this node was still pooled; I depooled it immediately, but this caused maxlag for several minutes.
Scap repooling machines is something we might want to look into to avoid this kind of issue in the future.