
WDQS lag propagation to wikidata not working as intended
Open, High, Public

Description

Propagating the lag of a wdqs host should only be done if this host is ''pooled'' (actually serving user traffic).
Determining the ''pooling'' status turned out to be quite challenging in our infra, so in T336352 we started using a metric based on the query rate, hoping that it would be a reasonable proxy for determining whether the server is serving users or not.

This has worked well so far, but a recent incident where a server was depooled after being stuck for some reason showed that this query-rate-based metric is too fragile.
We consider a server to be pooled if its query rate is above 1 qps:
rate(org_wikidata_query_rdf_blazegraph_filters_QueryEventSenderFilter_event_sender_filter_StartedQueries{}[10m]) > 1

Sadly this assumption did not hold on wdqs1013 when it was depooled: for some reason its query rate was still above 1 (just below 1.3). It is possible that this metric is polluted with monitoring queries that do not relate to serving user traffic. We should perhaps refine how we generate org_wikidata_query_rdf_blazegraph_filters_QueryEventSenderFilter_event_sender_filter_StartedQueries and make sure we only measure user queries.
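For reference, the same expression restricted to a single host can be used to confirm the residual rate mentioned above; the instance matcher below is only an assumption about how the target is labelled in Prometheus:

rate(org_wikidata_query_rdf_blazegraph_filters_QueryEventSenderFilter_event_sender_filter_StartedQueries{instance=~"wdqs1013.*"}[10m])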

AC:

  • wdqs lag propagation should no longer include false positives (i.e. counting the lag of a server that is actually depooled)

Event Timeline

Mitigation on wdqs1013:

  • blazegraph stopped
  • updater stopped with the /srv/wdqs/data_loaded flag removed
  • puppet disabled

It is possible that this metric is polluted with monitoring queries that do not relate to serving user traffic

I did a little checking around this. Prometheus blackbox checks are defined here. Their frequency is defined by the scrape_interval setting for the probes-custom_puppet-http.yaml job, which is currently 15 seconds. Since we require a unique check per team, that should be around 8 queries per minute per Prometheus host.

Note that there are separate Prometheus checks that use the jmx exporter instead of blackbox, each with a frequency of 60 seconds. Looking at prometheus1006, I see the following:

  • jmx_wdqs_updater
  • jmx_query_service_streaming_updater
  • jmx_wdqs_blazegraph

My next step is to figure out how many of the Prometheus hosts are actively polling each WDQS endpoint.
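One possible way to check this from the Prometheus side (just a sketch; it assumes the check names listed above correspond to the Prometheus job labels, which I have not verified) is to list the scrape targets each Prometheus host knows about, e.g.:

up{job="jmx_wdqs_blazegraph"}

Running this on each Prometheus host and looking at the returned instance labels would show which of them actively poll a given WDQS endpoint.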

Here are the UAs seen in one hour on a depooled server:

+------------------------------------------------------------------+-----+
|UA                                                                |count|
+------------------------------------------------------------------+-----+
|check_http/v2.3.3 (monitoring-plugins 2.3.3)                      |87   |
|Twisted PageGetter                                                |2146 |
|prometheus-public-sparql-ep-check                                 |1913 |
|wmf-prometheus/prometheus-blazegraph-exporter (root@wikimedia.org)|120  |
+------------------------------------------------------------------+-----+

That adds up to roughly 4266 requests per hour, i.e. about 1.2 qps of pure monitoring traffic, which is consistent with the rate observed on wdqs1013 while it was depooled.

Change #1014551 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: add x-monitoring-query

https://gerrit.wikimedia.org/r/1014551

Change #1014566 had a related patch set uploaded (by DCausse; author: DCausse):

[wikidata/query/rdf@master] Add support for x-monitoring-query header

https://gerrit.wikimedia.org/r/1014566

Change #1014580 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/Wikidata.org@master] updateQueryServiceLag: add an option to tune the query rate

https://gerrit.wikimedia.org/r/1014580

Change #1014584 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] updateQueryServiceLag: tune the min query rate on a pooled server

https://gerrit.wikimedia.org/r/1014584

The approach taken is:

  • from nginx, control a new header named 'x-monitoring-query', setting it to true if a list of criteria is met (currently based on user-agent strings, but this could be extended to source IPs as well)
  • from blazegraph, do not log queries that have the x-monitoring-query header set
  • adapt Wikidata.org to allow tuning the minimal query rate expected to be served by a pooled server (it was hardcoded to 1.0)
  • change the systemd timer that runs updateQueryServiceLag.php to set --pooled-server-min-query-rate to 0.5 (we will need to double check that this value is sane and works well for both codfw and eqiad servers); see the sketch after this list
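Concretely, once monitoring queries stop incrementing the StartedQueries counter, the pooled check used for lag propagation would effectively become something like the following (a sketch based on the expression quoted in the description, with the hardcoded 1.0 replaced by the new tunable threshold):

rate(org_wikidata_query_rdf_blazegraph_filters_QueryEventSenderFilter_event_sender_filter_StartedQueries{}[10m]) > 0.5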

Per sudo cumin A:prometheus 'w' from a cumin host, there are 8 active prometheus hosts.

We also have 3 load balancer pools for each wdqs host:

  • wdqs
  • wdqs-ssl
  • wdqs-heavy-queries

Each one of these represents a separate healthcheck as well.

Change #1014566 merged by jenkins-bot:

[wikidata/query/rdf@master] Add support for x-monitoring-query header

https://gerrit.wikimedia.org/r/1014566

Change #1014551 merged by Bking:

[operations/puppet@production] wdqs: add x-monitoring-query

https://gerrit.wikimedia.org/r/1014551

Mentioned in SAL (#wikimedia-operations) [2024-03-27T15:55:13Z] <inflatador> bking@cumin2002 running puppet against A:wdqs-main to apply nginx changes T360993

Mentioned in SAL (#wikimedia-operations) [2024-03-27T22:16:57Z] <ryankemper> T360993 [WDQS Deploy] Gearing up for deploy of wdqs 0.3.138. Pre-deploy tests passing on canary wdqs1003

Mentioned in SAL (#wikimedia-operations) [2024-03-27T22:17:55Z] <ryankemper> T360993 [WDQS Deploy] Tests passing following deploy of 0.3.138 on canary wdqs1003; proceeding to rest of fleet

Mentioned in SAL (#wikimedia-operations) [2024-03-27T22:30:19Z] <ryankemper> T360993 [WDQS Deploy] Restarted wdqs-updater across all hosts, 4 hosts at a time: sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'

Mentioned in SAL (#wikimedia-operations) [2024-03-27T22:30:23Z] <ryankemper> T360993 [WDQS Deploy] Restarted wdqs-categories across all test hosts simultaneously: sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'

Mentioned in SAL (#wikimedia-operations) [2024-03-27T22:30:28Z] <ryankemper> T360993 [WDQS Deploy] Restarting wdqs-categories across lvs-managed hosts, one node at a time: sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'

Mentioned in SAL (#wikimedia-operations) [2024-03-27T23:58:24Z] <ryankemper> T360993 [WDQS Deploy] Deploy complete. Successful test query placed on query.wikidata.org, there's no relevant criticals in Icinga, and Grafana looks good

Change #1014580 merged by jenkins-bot:

[mediawiki/extensions/Wikidata.org@master] updateQueryServiceLag: add an option to tune the query rate

https://gerrit.wikimedia.org/r/1014580

Mentioned in SAL (#wikimedia-operations) [2024-03-28T13:07:34Z] <dcausse> temporarily depooling wdqs2009 (test query rate when depooled T360993)

Mentioned in SAL (#wikimedia-operations) [2024-03-28T13:17:10Z] <dcausse> repooling wdqs2009 (test query rate when depooled T360993)

After depooling the node we can see the query rate actually going down to 0. The request rate is generally very low on codfw, so we might have to tune the threshold to around 0.2.


I could re-enable puppet on wdqs1013 and restart the updater to catch up on updates. But apparently this machine was repooled yesterday (as part of the wdqs scap deploy, I suppose) and thus started to serve stale data without triggering any maxlag. It was only when re-enabling puppet that I realized this node was still pooled; I depooled it immediately, but this caused maxlag for several minutes.
Scap repooling machines is something we might want to look into to avoid this kind of issue in the future.