I've created this incident report to follow up on today's WDQS 500 error incident.
Creating this ticket to:
- Review the incident report
- Identify and implement any changes that would prevent or lessen the risk of a similar incident
A few notes:
I just found out about this incident, so apologies for not responding sooner. I would have liked to help as soon as the issue was identified, as that might have saved us all some time. Next time, tagging the task with WikimediaCampaignEvents would be sufficient.
I'm sure we talked about this a couple times in the past, and we (Campaigns) reviewed the recommendations with Search Platform before our first deployment (I can look that up if needed). The dependency is asynchronous, as the WDQS query is run in a DeferredUpdate and does not block the response. There are a couple lines about this in WikiProjectIDLookup (which is also a relatively small class). I'm wondering where we could put this information to make it more visible, so that this question does not come up next time. Would the class documentation work? I'd be happy to improve the docs.
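To make the "does not block the response" point concrete, here is a language-neutral sketch in Python of the deferred pattern. It is only an analogy: `DeferredQueue`, `handle_request`, and `slow_lookup` are hypothetical names made up for illustration, not the MediaWiki DeferredUpdates API or the actual WikiProjectIDLookup code.

```python
class DeferredQueue:
    """Collects callbacks to run after the response has been produced."""

    def __init__(self):
        self._pending = []

    def add(self, callback):
        # Registering a callback does not execute it.
        self._pending.append(callback)

    def run_all(self):
        # Run every pending callback; a failure in one is collected
        # rather than raised, so it cannot affect the response.
        errors = []
        while self._pending:
            callback = self._pending.pop(0)
            try:
                callback()
            except Exception as exc:
                errors.append(exc)
        return errors


def handle_request(queue, slow_lookup):
    # The slow lookup (standing in for the WDQS query) is only queued here.
    queue.add(slow_lookup)
    # The response is built without waiting for the lookup.
    return "response body"
```

In this sketch the caller gets the response first and `run_all()` is invoked afterwards, so a slow or failing lookup delays nothing the user sees.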
- WikimediaCampaignEvents having an impact on the public WDQS endpoint would imply that it isn't using the internal endpoint as it should. We have separate public and internal endpoints to ensure that traffic spikes generated on the external endpoint don't have an impact on MediaWiki.
I'm not sure. All I know is that we went through our WDQS code with y'all last summer and everything seemed OK. It also has worked fine for the last few months AFAICT, and there have been no changes to the code.
- It looks like WikimediaCampaignEvents does not protect itself from a misbehaving WDQS. I don't see a poolcounter or another kind of circuit breaker being implemented (I did not look very deep). WDQS is known to have unstable performance, so protecting against it is important to ensure the stability of MediaWiki as a whole. Note that this isn't the failure mode we're seeing in this incident, just a related observation.
Indeed, we do not have any PoolCounter-like protection in place. However, AIUI, PoolCounter is useful when there's a concrete risk of many threads concurrently computing the same thing. I don't think that's a huge risk for our use case, which can trigger a query only when someone visits the Special:AllEvents page. Definitely something to look into, just not high on the priority list I'd say (unless there's evidence of the opposite).
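For reference, a generic circuit breaker (the other kind of protection mentioned above) could look roughly like the Python sketch below. This is not PoolCounter and not something the extension implements today; the class name, thresholds, and `fallback` parameter are all assumptions chosen for illustration.

```python
import time


class CircuitBreaker:
    """Stops calling a flaky backend after repeated failures (illustrative only)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after:
            # Cool-down elapsed: close the breaker and start counting again.
            # (Simplification: a stricter design would allow one trial call.)
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def call(self, func, fallback=None):
        # While open, skip the backend entirely and return the fallback.
        if self.is_open():
            return fallback
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback
        # Any success resets the failure count.
        self.failures = 0
        return result
```

The idea is that once the backend has failed a few times in a row, callers stop sending it traffic for a while and serve a fallback (for example, a cached or empty result) instead.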
Next, from reading the incident report.
This mediawiki extension CR was merged/deployed
I fail to see a link between r1123238 and the WDQS code, beyond "they both live in the same extension". They're separate features, triggered via entirely different use cases (special page vs action API), maintained by two different teams (Language and Product Localization is responsible for that feature). It's about as unrelated as you can get within the "same codebase" constraint. This alone is not sufficient evidence, of course: there might be a link between the two that I missed. But it already looks suspicious.
Problems with Mediawiki extension deploy noted in #wikimedia-operations:
Tracked in T387461, those were definitely caused by the deployment, but it's a PHP error within MediaWiki that has nothing to do with WDQS. Once again, there might be a link between the two that I'm not seeing, but I don't really believe that.
Next, I checked the SAL and cross-linked it with IRC logs and the spike seen in the WDQS dashboard. There seems to be some confusion in the incident doc. A more accurate timeline seems to be the following (all times in UTC):
In practice, this seems to indicate that the WDQS error rate began increasing before the WMCE patch was deployed anywhere. There seems to be a better match between the revert and the decrease in errors, but I don't think that matters at this point.
All in all, I don't see enough evidence that this had anything to do with WikimediaCampaignEvents. However, to have a final word on this, is it possible to check access logs for the internal endpoint? All requests via SparqlClient (like the WMCE ones) are sent using the MediaWiki/1.44.0-alpha SparqlClient User Agent, and our query is basically always the same, so it should be easy to find. The number of such queries during this time frame should give a definitive answer on whether the extension is involved.
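If it helps, the check I have in mind is roughly the Python sketch below: count requests per minute whose User Agent contains the SparqlClient string, within the incident window. I don't know the real access log format, so the tab-separated layout (ISO timestamp, then User Agent) and the function name are assumptions; the parsing would need to be adapted to whatever the internal endpoint actually logs.

```python
from collections import Counter
from datetime import datetime

# User Agent mentioned above for requests sent via SparqlClient.
USER_AGENT = "MediaWiki/1.44.0-alpha SparqlClient"


def count_per_minute(lines, start, end, user_agent=USER_AGENT):
    """Count matching requests per minute for start <= timestamp < end.

    Assumes each line is "<ISO timestamp>\t<User Agent>\t..." (hypothetical).
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # not enough fields, skip the line
        try:
            ts = datetime.fromisoformat(fields[0])
        except ValueError:
            continue  # unparseable timestamp, skip the line
        if user_agent in fields[1] and start <= ts < end:
            # Truncate to the minute so counts line up with dashboard graphs.
            counts[ts.replace(second=0, microsecond=0)] += 1
    return counts
```

A flat, low per-minute count across the window would suggest the extension is not involved; a spike lining up with the error rate would suggest the opposite.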
It now seems clear that WikimediaCampaignEvents isn't involved in this incident. Thanks for having a look and my apologies for implying there might be an issue.
We don't have a great understanding of the root causes of this incident, but we'll blame the usual WDQS instabilities and move on for the moment.
Thank you for letting us know! I think the hypothesis was reasonable given the lack of familiarity with the code. If something like this happens again, tagging the component (in this case, WikimediaCampaignEvents) will get my attention. I'm also on IRC most of the time, so feel free to ping me there as soon as there's a sign that something is going haywire.
Filed T387893: Write documentation for WikimediaCampaignEvents integration with WDQS.
I'll try to remember, but navigating how to contact each team / project tends not to fit in my small brain. I'm probably going to struggle again!
We don't really have an "official" way... Tagging the extension will work, and as for me personally, everywhere you can find me (phab, email, slack, IRC...) will work for me. I think we'll probably also look into watching the team board more closely (and not just the component tag).