
WikibaseQualityConstraints gets Too Many Requests errors from query service
Closed, ResolvedPublic

Description

Since around 2018-07-16 09:00:00Z (with an earlier spike around 2018-07-15 16:00:00Z) WikibaseQualityConstraints has been experiencing higher query runtimes and HTTP 429 errors (Too Many Requests) from the query service. Both the number of “type fallback” queries and the volume of regex requests show spikes, but there appears to be no increase in overall constraint check requests other than the usual day-night cycle in activity, so I’m not sure what could be causing this.

@Smalyshev should we even be getting HTTP 429 errors from the internal endpoint?

I’m not sure what to do about this. If it was just “type fallback” queries, it might be a good idea to increase the $wgWBQualityConstraintsTypeCheckMaxEntities setting in production (currently at 10, which really is just a wild guess by me from olden days), but with the corresponding increase in regex queries as well, I have no idea what could even cause this.
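For context, the way I understand the type-check fallback (a minimal Python sketch of the idea only, not the actual PHP implementation; `get_superclasses` and `run_sparql` are hypothetical stand-ins):

```python
def has_type(entity, target, get_superclasses, max_entities=10, run_sparql=None):
    """Traverse the subclass-of hierarchy breadth-first in-process.
    Once more than max_entities have been visited (cf. the
    $wgWBQualityConstraintsTypeCheckMaxEntities setting, 10 in
    production), give up and fall back to a single SPARQL query -
    one of the "type fallback" queries mentioned above."""
    seen = set()
    queue = [entity]
    while queue:
        current = queue.pop(0)
        if current == target:
            return True
        if current in seen:
            continue
        seen.add(current)
        if len(seen) > max_entities:
            # budget exhausted: hand the whole check to the query service
            return run_sparql(entity, target)
        queue.extend(get_superclasses(current))
    return False
```

Raising the limit would keep more of these checks in PHP and send fewer fallback queries to WDQS, at the cost of more entity lookups per check.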

Event Timeline

I strongly believe that these started after GZWDer started his cebwiki importing.

I see no evidence for that. GZWDer's edits started at 17:47 on the 14th (link)

However, the spike did not stop when the import stopped.

should we even be getting HTTP 429 errors from the internal endpoint?

Yes :) The internal endpoint is still designed to serve multiple client scenarios, even if reduced to internal ones. That means there should be balancing between them, so that a runaway client can’t take it down. The exact settings could be adjusted, though. What are the timings that WikibaseQualityConstraints needs? Which queries, how often, etc.?

I’m not sure how to answer that question, because I don’t really understand how WBQC uses WDQS. According to Grafana, the mean runtime for all four checkers which potentially use SPARQL (type, value type, format, unique) is somewhere between 15 seconds and one minute, which is just absurd since we supposedly run every SPARQL query with maxQueryTimeMillis=5000 (5000 ms = 5 s). Something’s not right there, and I don’t know what it is.
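For reference, this is roughly how I’d expect the timeout to be attached to a query (a sketch only; the parameter name is the one mentioned above, and `build_query_url` is a hypothetical helper, not the real WBQC code). Note that `maxQueryTimeMillis` only bounds the server-side execution time, not queueing, network latency, or the fact that one checker may issue several queries, which might partly explain runtimes above 5 s:

```python
from urllib.parse import urlencode

def build_query_url(endpoint, sparql, max_query_time_millis=5000):
    """Build a WDQS GET request URL that carries the per-query
    timeout (maxQueryTimeMillis=5000, i.e. 5 s)."""
    params = urlencode({
        "query": sparql,
        "maxQueryTimeMillis": max_query_time_millis,
        "format": "json",
    })
    return f"{endpoint}?{params}"
```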

In T199787#4430692, @Bugreporter wrote:

However the spike does not stop when the import stopped.

If you can't stop your bot's work, then I don't see how this issue can ever be fixed. Or you could consider doing it without pywikibot, using AutoWikiBrowser instead?

Maybe this should have either UBN or High priority.

Welp, since yesterday evening (first plateau 2018-07-18 20:00:00Z – ca. midnight, second [even higher] plateau since 2018-07-19 06:50:00Z) someone is also making [lots of wbcheckconstraints requests](https://grafana.wikimedia.org/dashboard/db/wikidata-quality?orgId=1&from=1531911557660&to=1531997957660&refresh=10s&panelId=6&fullscreen), all of them cache misses, and the HTTP statuses we get from WDQS are not just HTTP 429 Too Many Requests but also HTTP 403 Forbidden. @Smalyshev do you know what could cause WDQS to return 403 instead of 429? I assume it’s due to the high request volume, but I don’t see that error code at all in the nginx config.

what could cause WDQS to return 403 instead of 429?

If you ignore 429 (i.e. try to access the service again within the timeout period specified in Retry-After inside 429) you eventually get temporarily banned, which results in 403.
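In client terms, the behaviour that avoids the ban looks something like this (a hedged sketch; `do_request` is a hypothetical injected transport, not the actual WBQC HTTP layer):

```python
import time

def sparql_get(query, do_request, max_attempts=3):
    """Issue a query via do_request(query) -> (status, headers, body).
    On HTTP 429, wait out the Retry-After period before retrying;
    retrying sooner is what eventually escalates to a temporary ban,
    which WDQS reports as HTTP 403."""
    for attempt in range(max_attempts):
        status, headers, body = do_request(query)
        if status == 429:
            # Honour Retry-After (seconds); default to 1 s if absent.
            time.sleep(int(headers.get("Retry-After", 1)))
            continue
        if status == 403:
            raise RuntimeError("temporarily banned by WDQS")
        return body
    raise RuntimeError("still throttled after retries")
```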

We need to have some definition of the needs of WikibaseQualityConstraints towards WDQS. If it's just some client spamming wbcheckconstraints endpoint, it'd be fine to just pass 429/403 along (I don't think WikibaseQualityConstraints does it now, but probably it should), but if it's a part of Wiki lifecycle, we should see what is the cause and maybe amend the throttle/ban configurations.

Smalyshev triaged this task as Medium priority. Jul 20 2018, 12:05 AM

Well, according to Grafana, everything (constraint check upper runtime, number of SPARQL queries, SPARQL HTTP errors, etc.) is back to normal levels now…

I don’t think we should directly pass 429/403 errors from the query service to the client: there’s no reason to skip all other constraint checks just because there are problems with the query service. We might want to throttle constraint checks directly as well, but that’s a different issue.

But if we get HTTP 429 from WDQS, perhaps we should temporarily disable SPARQL for all requests (act as if it had not been configured), to avoid a hard ban?
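What I have in mind is essentially a small circuit breaker (a sketch of the idea only, not any existing WBQC code; names are made up):

```python
import time

class SparqlCircuitBreaker:
    """After a 429 from the query service, act as if SPARQL were not
    configured at all for `cooldown` seconds, instead of hammering
    the endpoint and risking the 403 temporary ban."""

    def __init__(self, cooldown=60.0, clock=time.monotonic):
        self.cooldown = cooldown
        self.clock = clock          # injectable for testing
        self.disabled_until = 0.0

    def sparql_available(self):
        """Checkers consult this before running any SPARQL query."""
        return self.clock() >= self.disabled_until

    def record_throttling(self):
        """Call on HTTP 429; disables SPARQL until the cooldown passes."""
        self.disabled_until = self.clock() + self.cooldown
```

While the breaker is open, the SPARQL-based checkers would report "not checked" rather than failing the whole wbcheckconstraints request.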

And I still don’t understand how this error started – there was someone spamming wbcheckconstraints (apparently), but the first problems started earlier than that, so throttling wbcheckconstraints might not help to prevent whatever first caused the issue to occur.

In T199787#4437749, @Bugreporter wrote:

that's fixed, and I don't think they are related

This is why the 403 errors occur.

I think we can close this task as having resolved itself. Whatever the issue was at the time isn’t really relevant anymore; meanwhile, WBQC respects 429 errors from WDQS now: T204469: WikibaseQualityConstraints should respect query service 429 header response.

it might be a good idea to increase the $wgWBQualityConstraintsTypeCheckMaxEntities setting in production (currently at 10, which really is just a wild guess by me from olden days)

And this was increased in T209504: Perform more constraint type checks in PHP before falling back to SPARQL, by the way.