
Post-incident ticket for elevated 500 errors from WDQS
Closed, Resolved, Public

Description

I've created this incident report to follow up on today's WDQS 500 error incident.

Creating this ticket to:

  • Review the incident report
  • Identify and implement any changes that would prevent or lessen the risk of a similar incident

Event Timeline

bking renamed this task from Investigate elevated 500 errors from WDQS to Post-incident ticket for elevated 500 errors from WDQS. Feb 27 2025, 5:42 PM
bking updated the task description.

A few notes:

  • There is a suspicion (correlation != causation) that WikimediaCampaignEvents is involved in the issue on wdqs.discovery.wmnet
  • WikimediaCampaignEvents has a dependency on WDQS. Without more reading, it is not clear if that dependency is synchronous or not. Synchronous dependency would be against our recommendations.
  • WikimediaCampaignEvents having an impact on the public WDQS endpoint would imply that it isn't using the internal endpoint as it should. We have separate public and internal endpoints to ensure that traffic spikes generated on the external endpoint don't have an impact on MediaWiki.
  • It looks like WikimediaCampaignEvents does not protect itself from a misbehaving WDQS. I don't see a poolcounter or another kind of circuit breaker being implemented (I did not look very deep). WDQS is known to have unstable performance, so protecting against it is important to ensure the stability of MediaWiki as a whole. Note that this isn't the failure mode we're seeing in this incident, just a related observation.

I just found out about this incident, so apologies for not responding sooner. I would have liked to help as soon as the issue was identified, as that might have saved us all some time. Next time, tagging the task with WikimediaCampaignEvents would be sufficient.

  • WikimediaCampaignEvents has a dependency on WDQS. Without more reading, it is not clear if that dependency is synchronous or not. Synchronous dependency would be against our recommendations.

I'm sure we talked about this a couple times in the past, and we (Campaigns) reviewed the recommendations with Search Platform before our first deployment (I can look that up if needed). The dependency is asynchronous, as the WDQS query is run in a DeferredUpdate and does not block the response. There are a couple lines about this in WikiProjectIDLookup (which is also a relatively small class). I'm wondering where we could put this information to make it more visible, so that this question does not come up next time. Would the class documentation work? I'd be happy to improve the docs.

  • WikimediaCampaignEvents having an impact on the public WDQS endpoint would imply that it isn't using the internal endpoint as it should. We have separate public and internal endpoints to ensure that traffic spikes generated on the external endpoint don't have an impact on MediaWiki.

I'm not sure. All I know is that we went through our WDQS code with y'all last summer and everything seemed OK. It also has worked fine for the last few months AFAICT, and there have been no changes to the code.

  • It looks like WikimediaCampaignEvents does not protect itself from a misbehaving WDQS. I don't see a poolcounter or another kind of circuit breaker being implemented (I did not look very deep). WDQS is known to have unstable performance, so protecting against it is important to ensure the stability of MediaWiki as a whole. Note that this isn't the failure mode we're seeing in this incident, just a related observation.

Indeed, we do not have any PoolCounter-like protection in place. However, AIUI, PoolCounter is useful when there's a concrete risk of many threads concurrently computing the same thing. I don't think that's a huge risk for our use case, which can trigger a query only when someone visits the Special:AllEvents page. Definitely something to look into, just not high on the priority list I'd say (unless there's evidence to the contrary).
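
To make the circuit-breaker idea mentioned above concrete, here is a minimal sketch of the generic pattern. It is written in Python purely for illustration: it is not PoolCounter, not MediaWiki code, and not something the extension does today. The class name, parameters, and defaults are all hypothetical.

```python
import time


class CircuitBreaker:
    """Stop calling a flaky backend after repeated failures, then retry later.

    Hypothetical illustration of the generic pattern; all names are assumptions.
    """

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds while open
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None             # None means closed (calls allowed)

    def call(self, func, fallback=None):
        # While open and still cooling down, skip the backend entirely.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                # Open (or re-open after a failed trial) and restart the timer.
                self.opened_at = self.clock()
            return fallback
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap the backend request, for example `breaker.call(run_query, fallback=[])`, where `run_query` is a hypothetical function that raises when the backend misbehaves. Once the breaker is open, the caller gets the fallback immediately instead of adding load to a struggling service.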


Next, from reading the incident report.

This mediawiki extension CR was merged/deployed

I fail to see a link between r1123238 and the WDQS code, beyond "they both live in the same extension". They're separate features, triggered via entirely different use cases (special page vs action API), maintained by two different teams (Language and Product Localization is responsible for that feature). It's about as unrelated as you can get within the "same codebase" constraint. This alone is not sufficient evidence, of course: there might be a link between the two that I missed. But it already looks suspicious.

Problems with Mediawiki extension deploy noted in #wikimedia-operations:

Tracked in T387461, those were definitely caused by the deployment, but it's a PHP error within MediaWiki that has nothing to do with WDQS. Once again, there might be a link between the two that I'm not seeing, but I don't really believe that.

Next, I checked the SAL and cross-linked it with IRC logs and the spike seen in the WDQS dashboard. There seems to be some confusion in the incident doc. A more accurate timeline seems to be the following (all times in UTC):

  • 14:01:40: deployment window begins, @TheresNoTime deploys
  • 14:02:14: r1123356, an unrelated config patch, is approved via scap backport
  • 14:02:50: while the config patch is being merged, @KartikMistry asks for their patch (the WikimediaCampaignEvents one, or WMCE for short) to be merged to save time
  • 14:03:22: @TheresNoTime manually sets CR+2 on the WMCE patch
  • 14:04:35: The WMCE patch is merged. This does nothing though. Normally, the merge would only enqueue the patch for the next beta cluster code update. But because this patch was in a wmf branch (1.44.0-wmf.18) and beta uses master, nothing (should have) happened.
  • ~14:08:xx: WDQS errors begin to rise
  • 14:08:51: @TheresNoTime reports seeing errors in check_testservers_baremetal-1_of_1
  • 14:10:58: still getting those issues after 3 retries. Output pasted in P73774. The config patch deployment is put on hold.
  • ~14:11:xx: WDQS errors reach a first peak
  • 14:13:55: scap sync-world for the WMCE patch
  • 14:16:51: the WMCE patch is synced to mwdebug

In practice, this seems to indicate that the WDQS error rate began increasing before the WMCE patch was deployed anywhere. There seems to be a better match between the revert and the decrease in errors, but I don't think that matters at this point.

All in all, I don't see enough evidence that this had anything to do with WikimediaCampaignEvents. However, to have a final word on this, is it possible to check access logs for the internal endpoint? All requests via SparqlClient (like the WMCE ones) are sent using the MediaWiki/1.44.0-alpha SparqlClient User Agent, and our query is basically always the same, so it should be easy to find. The number of such queries during this time frame should give a definitive answer on whether the extension is involved.
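
As a rough illustration of the log check suggested above, the following Python sketch counts matching requests per minute. The log layout is an assumption (one request per line, starting with an ISO 8601 timestamp followed by a space, with the User Agent somewhere in the line); the real access log format may differ, and the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime

# User Agent string quoted in the comment above.
USER_AGENT = "MediaWiki/1.44.0-alpha SparqlClient"


def count_per_minute(lines, start, end, needle=USER_AGENT):
    """Count lines containing `needle`, grouped by minute, within [start, end)."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        # Assumption: the line begins with a timestamp like 2025-01-01T14:08:12
        stamp, _, _ = line.partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines that do not match the assumed layout
        if start <= when < end:
            counts[when.strftime("%H:%M")] += 1
    return counts
```

Called with `start` and `end` bracketing the window in the timeline above (errors start rising at roughly 14:08 UTC), a flat per-minute count would suggest the extension's queries did not spike, while a sharp jump would point the other way.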

It now seems clear that WikimediaCampaignEvents isn't involved in this incident. Thanks for having a look and my apologies for implying there might be an issue.

We don't have a great understanding of the root causes of this incident, but we'll blame the usual WDQS instabilities and move on for the moment.

It now seems clear that WikimediaCampaignEvents isn't involved in this incident. Thanks for having a look and my apologies for implying there might be an issue.

Thank you for letting us know! I think the hypothesis was reasonable given the lack of familiarity with the code. If something like this happens again, tagging the component (in this case, WikimediaCampaignEvents) will get my attention. I'm also on IRC most of the time, so feel free to ping me there as soon as there's a sign that something is going haywire.

I'm wondering where we could put this information to make it more visible, so that this question does not come up next time. Would the class documentation work? I'd be happy to improve the docs.

Filed T387893: Write documentation for WikimediaCampaignEvents integration with WDQS.

If something like this happens again, tagging the component (in this case, WikimediaCampaignEvents) will get my attention. I'm also on IRC most of the time, so feel free to ping me there as soon as there's a sign that something is going haywire.

I'll try to remember, but navigating how to contact each team / project tends not to fit in my small brain. I'm probably going to struggle again!

We don't really have an "official" way... Tagging the extension will work, and as for me personally, everywhere you can find me (phab, email, slack, IRC...) will work for me. I think we'll probably also look into watching the team board more closely (and not just the component tag).