
Investigate rdf-streaming-updater consumer failures in eqiad
Closed, ResolvedPublic

Description

As of 2026-05-07 1700UTC the whole of eqiad is experiencing max lag above SLO.

Initially we diagnosed a pool of scrapers hitting the service and causing load. Rate limiting relieved the pressure
(the timeout rate decreased), but the update lag has been increasing monotonically across all of eqiad since the incident started.

Codfw is unaffected.

We are not moving to step 1.3 in https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbooks/ElevatedMaxLagWDQS

Placeholder task to investigate.

Details

Other Assignee
gmodena

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2026-05-08T09:44:58Z] <btullis> depooled wdqs-main in eqiad for T425758

gmodena renamed this task from Investigate rdf-streaming-updater failure in eqiad to Investigate rdf-streaming-updater consumer failures in eqiad. May 8 2026, 9:50 AM

Mentioned in SAL (#wikimedia-operations) [2026-05-08T10:51:10Z] <btullis> re-pooled wdqs-main in eqiad for T425758

gmodena updated Other Assignee, added: gmodena.

The issue is ongoing, but a bit more under control. There are two things at play:

  1. A high volume of scrapers is hitting WDQS.
  2. There is some bespoke request throttling logic in WDQS that kicks in. streaming-updater-consumer gets throttled too, and this results in the db not receiving updates.

In the short term, we need to manually apply limits and, if needed, forcefully restart Blazegraph to keep up. I have a simple patch that allows requests from the streaming-updater-consumer to go through, but I don't want to deploy it right before the weekend (and without having properly reasoned about second-order dependencies).
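
As an illustration of the idea only (this is not the actual patch, and not the WDQS throttling code), a token bucket throttle with an exemption for an internal client could be sketched as follows. The class names, the `client_id` key, and the `EXEMPT_CLIENTS` set are all hypothetical; how the real service identifies the streaming-updater-consumer is not stated in this task.

```python
class TokenBucket:
    """Token bucket: refills at `refill_per_sec` tokens/second, capped at `capacity`."""

    def __init__(self, capacity, refill_per_sec, now=0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity      # start with a full bucket
        self.updated_at = now

    def try_consume(self, now, cost=1.0):
        # Add the tokens earned since the last call, without exceeding capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Hypothetical identifier for the internal updater; an assumption for this sketch.
EXEMPT_CLIENTS = frozenset({"streaming-updater-consumer"})


class Throttler:
    """Keeps one bucket per client and maps each request to an HTTP status."""

    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.buckets = {}

    def status_for(self, client_id, now):
        # Exempt clients bypass the bucket entirely, so index updates are never rejected.
        if client_id in EXEMPT_CLIENTS:
            return 200
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_per_sec, now)
            self.buckets[client_id] = bucket
        return 200 if bucket.try_consume(now) else 429


throttler = Throttler(capacity=2, refill_per_sec=1.0)
print([throttler.status_for("scraper", now=0.0) for _ in range(3)])
print(throttler.status_for("streaming-updater-consumer", now=0.0))
```

One second-order concern with any exemption of this kind: if the exempt identity is derived from something a client can spoof (for example a request header), external actors could use it to bypass throttling, so the check would need to rely on a trusted signal.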

In the long term: REST gateway should be the right place to address throttling, not the current bespoke implementation.

Things seem a bit more under control again. We are collecting comments from all respondents involved and will publish an incident report this week.
At the time of writing the ratio of failed queries is <5%, lag has been absorbed, and all nodes have been re-pooled. Alerts have resolved.

{F80904835}

Screenshot From 2026-05-11 17-04-35.png (938×318 px, 72 KB)

What happened

Aggressive scrapers started hitting the public WDQS endpoint on 2026-05-07, reducing service availability and impacting the SLO (both the Uptime (availability) percentage and the Excessive lag percentage).

Over the whole period we identified two issues at play:

  1. Blazegraph was under load and started to time out for a large population of users.
  2. The streaming-updater-consumer service (responsible for real-time index updates) was throttled by the overloaded Blazegraph (T425770: WDQS token bucket throttling logic should not apply to the streaming-updater-consumer), so index updates were rejected (HTTP 429) and lag increased. This, in turn, triggered max lag protection in Wikibase, resulting in wikidata.org requests getting throttled.

Upon reviewing alerts on Friday (2026-05-08) we diagnosed that the whole of eqiad was lagging, and depooled the whole deployment to allow Wikidata changes (WDQS index updates) to propagate. As lag started to increase again, we rate limited the actors that were aggressively querying the service and causing timeouts.

Despite aggressive rate limiting applied on Friday (globally, at the edge), the outage persisted over the weekend. The initial rate limiting rules were extrapolated from a Turnilo data cube based on a sample (1 in 128) of all incoming webrequests (all Wikimedia projects). Deeper analysis of WDQS logs (offline on HDFS, and in real time on the nodes themselves) on Monday (2026-05-11) identified a scraper that had not been captured by the webrequest sample (Turnilo). Once a requestctl rule was applied to the scraper signatures, the rate of queries timing out went back to baseline. Nodes impacted by high lag were depooled, then repooled once their lag had been fully absorbed.
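
To illustrate why an aggregate over full logs can surface an actor that a 1-in-128 sample misses, here is a minimal sketch. The record schema (`signature`, `timed_out`) and the function name are hypothetical, and the sample is modelled as "every 128th record", which is an assumption made for the example rather than a description of how webrequest sampling actually works.

```python
from collections import Counter


def rank_actors_by_timeouts(records, min_requests=1):
    """Group records by signature; return (signature, requests, timeouts, timeout_ratio)
    tuples, sorted by timeout count (descending), then by signature."""
    total = Counter()
    timeouts = Counter()
    for rec in records:
        sig = rec["signature"]
        total[sig] += 1
        if rec["timed_out"]:
            timeouts[sig] += 1
    rows = [
        (sig, total[sig], timeouts[sig], timeouts[sig] / total[sig])
        for sig in total
        if total[sig] >= min_requests
    ]
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows


# Synthetic data: actor "B" sends a burst of 100 requests that all time out,
# surrounded by 256 requests from a higher-volume, mostly healthy actor "A".
records = (
    [{"signature": "A", "timed_out": True}]
    + [{"signature": "B", "timed_out": True}] * 100
    + [{"signature": "A", "timed_out": False}] * 255
)

full = rank_actors_by_timeouts(records)            # every record
sampled = rank_actors_by_timeouts(records[::128])  # every 128th record only

print("full:   ", full)
print("sampled:", sampled)
```

In this synthetic case the sampled view contains only actor "A", while the full view ranks "B" first by timeouts. Real traffic is messier, but the same effect applies to any actor whose volume is small relative to the sampling rate.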

Follow ups

  • We will update the runbooks with additional information on troubleshooting traffic directly from logs (in real-time).
  • We have a workaround for WDQS to not throttle streaming-updater-consumer requests. This will be deployed and tested in Wikidata Platform’s current sprint.
  • We started to investigate and document options to improve real-time traffic analysis for the WDQS telemetry.

We are collecting comments from all respondents involved and will publish an incident report this week.

More info at https://wikitech.wikimedia.org/wiki/Incidents/2026-05-13_wdqs