
WDQS high load/lag incident 2025-11-10
Closed, Resolved, Public

Description

See this graph for details about lag spikes. Whenever lag exceeds 10 minutes, users accessing the Wikidata Query Service via the endpoint may experience slow responses, errors, and/or timeouts. Bots editing Wikidata will be instructed to throttle, because the data from WDQS would be considered stale.

As I write this, we've had:

  • A smaller spike from ~1125-1145 UTC that only affected our CODFW datacenter (Dallas, TX).
  • A larger spike from ~1645-1900 UTC that affected both the CODFW and EQIAD (Herndon, VA) datacenters.
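
For anyone wanting to pull spike windows like these out of raw lag samples, here is a minimal sketch. It assumes the samples are already available as time-sorted (timestamp, lag) pairs; that shape and the names `Sample` and `lag_spikes` are hypothetical and not part of any WDQS tooling. Only the 10-minute threshold comes from the description above.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Tuple

# Threshold from the task description: lag over 10 minutes is user-visible.
LAG_THRESHOLD = timedelta(minutes=10)

# Hypothetical sample shape: (time of measurement, observed lag).
Sample = Tuple[datetime, timedelta]


def lag_spikes(
    samples: Iterable[Sample],
    threshold: timedelta = LAG_THRESHOLD,
) -> List[Tuple[datetime, datetime]]:
    """Return (start, end) windows during which lag stayed above the threshold.

    Samples are assumed to be sorted by time. A window starts at the first
    sample over the threshold and ends at the last consecutive sample over it.
    """
    spikes: List[Tuple[datetime, datetime]] = []
    start: Optional[datetime] = None
    last_over: Optional[datetime] = None
    for when, lag in samples:
        if lag > threshold:
            if start is None:
                start = when  # first sample of a new spike
            last_over = when
        elif start is not None:
            # Lag dropped back to or below the threshold: close the window.
            spikes.append((start, last_over))
            start = None
    if start is not None:
        # The series ended while lag was still above the threshold.
        spikes.append((start, last_over))
    return spikes
```

Running this separately on each datacenter's samples would show whether a spike hit one site or both.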

Event Timeline

End of shift update:

  • We didn't see any blatantly abusive traffic with our typical tools, so Ryan and I started searching through /var/log/wdqs/wdqs-blazegraph.log (which contains actual queries along with user-agents, HTTP response codes, etc.) in hopes of finding bad queries there. We started by banning a user-agent that:
  1. is blatantly generic
  2. only seems to send malformed queries
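
The second criterion could be checked roughly as follows. This is a sketch under stated assumptions, not what was actually run: it assumes the log has already been parsed into (user_agent, http_status) pairs (the format of wdqs-blazegraph.log is not shown here, so no parser is included) and that malformed queries show up as HTTP 400. The function name and the cutoffs are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def suspect_agents(
    records: Iterable[Tuple[str, int]],
    bad_status: int = 400,      # assumption: malformed queries return HTTP 400
    min_requests: int = 100,    # hypothetical cutoff to ignore low-volume agents
    min_ratio: float = 0.95,    # hypothetical cutoff for "only sends malformed queries"
) -> List[str]:
    """Return user-agents whose requests are almost all malformed, sorted by name."""
    totals: Counter = Counter()
    bad: Counter = Counter()
    for user_agent, status in records:
        totals[user_agent] += 1
        if status == bad_status:
            bad[user_agent] += 1
    return sorted(
        ua
        for ua, count in totals.items()
        if count >= min_requests and bad[ua] / count >= min_ratio
    )
```

Whether an agent string counts as "blatantly generic" is left to human review, since that judgment is hard to encode reliably.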

WDQS is not being attacked at the moment, so it's tough to say whether the ban makes a difference. However, I do think we will be looking more closely at malformed queries if the abusive traffic returns.

bking triaged this task as Medium priority.

Update: I've created some follow-up subtasks (see "related objects" above). If we fix our Logstash dashboard and proactively ban some abusive queries, we should be in better shape next time the load creeps up. Closing in favor of subtasks...