
AQS alerts due to big queries issued to Druid for the edit API
Closed, Resolved · Public · 8 Story Points

Description

Today some alerts fired for AQS, corresponding to the following peaks of calls:

https://grafana.wikimedia.org/d/000000538/druid?panelId=42&fullscreen&orgId=1&from=1554126237025&to=1554217110944

I was able to capture some URIs from aqs1007 via httpry; some examples:

2019-04-02 10:51:21	10.64.0.112	10.2.2.12	>	GET	aqs.svc.eqiad.wmnet:7232	/analytics.wikimedia.org/v1/edits/per-page/fr.wikipedia.org/Camp_de_r%C3%A9fugi%C3%A9s/all-editor-types/daily/20070126/20190402	HTTP/1.1	-	-
2019-04-02 10:51:22	10.64.48.134	10.2.2.12	>	GET	aqs.svc.eqiad.wmnet:7232	/analytics.wikimedia.org/v1/edits/per-page/fr.wikipedia.org/Crist%C3%B3bal_Cruz/all-editor-types/daily/20081216/20190402	HTTP/1.1	-	-
2019-04-02 10:51:22	10.64.0.223	10.2.2.12	>	GET	aqs.svc.eqiad.wmnet:7232	/analytics.wikimedia.org/v1/edits/per-page/fr.wikipedia.org/Mikola_Statkevich/all-editor-types/daily/20181021/20190402	HTTP/1.1	-	-
2019-04-02 10:51:23	10.64.32.178	10.2.2.12	>	GET	aqs.svc.eqiad.wmnet:7232	/analytics.wikimedia.org/v1/edits/per-page/fr.wikipedia.org/Masakatsu_Suzoki/all-editor-types/daily/20140425/20190402	HTTP/1.1	-	-
2019-04-02 10:51:26	10.64.48.110	10.2.2.12	>	GET	aqs.svc.eqiad.wmnet:7232	/analytics.wikimedia.org/v1/edits/per-page/fr.wikipedia.org/Satavahana/all-editor-types/daily/20110627/20190402	HTTP/1.1	-	-
2019-04-02 10:51:26	10.64.0.31	10.2.2.12	>	GET	aqs.svc.eqiad.wmnet:7232	/analytics.wikimedia.org/v1/edits/per-page/fr.wikipedia.org/Duke_Lemur_Center/all-editor-types/daily/20121017/20190402	HTTP/1.1	-	-
2019-04-02 10:51:27	10.64.32.178	10.2.2.12	>	GET	aqs.svc.eqiad.wmnet:7232	/analytics.wikimedia.org/v1/edits/per-page/fr.wikipedia.org/Sam_Perkins/all-editor-types/daily/20081018/20190402	HTTP/1.1	-	-
2019-04-02 10:51:29	10.64.0.112	10.2.2.12	>	GET	aqs.svc.eqiad.wmnet:7232	/analytics.wikimedia.org/v1/edits/per-page/fr.wikipedia.org/Dewoitine_D.333/all-editor-types/daily/20120816/20190402	HTTP/1.1	-	-

The queries were made directly to AQS with large time windows, and they bypassed the cache since each request targeted a different page.

We should find the Druid limit and build throttling in AQS to prevent this issue from happening again. This was likely a bot hitting us, but it shouldn't matter: our API is public and it should only allow traffic that it can sustain.
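
For illustration only, a minimal per-client token-bucket sketch of the kind of throttling meant here (this is not AQS's or RESTBase's actual mechanism; the class, the 5 req/s rate and the burst size are assumptions, with the rate echoing the per-method limit settled on later in this task):

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-client token bucket: allow up to `rate` requests/second with bursts
    of up to `burst` requests. Illustrative sketch only."""

    def __init__(self, rate=5.0, burst=10):
        self.rate = rate                                 # tokens refilled per second
        self.burst = burst                               # maximum bucket size
        self.tokens = defaultdict(lambda: float(burst))  # current tokens per client
        self.last = defaultdict(time.monotonic)          # last refill time per client

    def allow(self, client_ip: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last[client_ip]
        self.last[client_ip] = now
        # Refill proportionally to the elapsed time, capped at the burst size.
        self.tokens[client_ip] = min(self.burst, self.tokens[client_ip] + elapsed * self.rate)
        if self.tokens[client_ip] >= 1.0:
            self.tokens[client_ip] -= 1.0
            return True       # serve the request
        return False          # reject with 429 Too Many Requests

bucket = TokenBucket(rate=5.0, burst=10)
if not bucket.allow("10.64.0.112"):
    print("429 Too Many Requests")
```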

Event Timeline

elukey triaged this task as Normal priority. Apr 2 2019, 5:55 PM
elukey created this task.
Restricted Application added a subscriber: Aklapper. Apr 2 2019, 5:55 PM
Nuria added a comment (edited). Apr 2 2019, 11:39 PM

Well yesterday (2019-04-02) between 10 and 11 am we had about 120,000 reqs for fr.wikipedia edit data; last month over the same interval there were just a couple (literally a couple). The UA of these requests is a python/blah script, and they come from two IP addresses.

Nuria added a comment. Apr 2 2019, 11:50 PM

For the whole day there are about 300,000 requests across three IPs.

Nuria added a comment. Apr 2 2019, 11:59 PM

All these requests got a 200 response; we need to check the throttling settings.

Would 429s get through to webrequest? I know at some point RESTBase was sending the 429s, in which case yes, but I forget whether the throttling got moved to Varnish, in which case it might happen earlier in the request lifecycle.

elukey added a comment (edited). Apr 3 2019, 6:32 AM

Would 429s get through to webrequest? I know at some point RESTBase was sending the 429s, in which case yes, but I forget whether the throttling got moved to Varnish, in which case it might happen earlier in the request lifecycle.

I think it would; IIRC we have it set in RESTBase itself:

https://github.com/wikimedia/restbase/blob/master/v1/metrics.yaml#L607-L613 (and others)

Even if it was returned by Varnish we'd have a trace in its shared memory log that would be picked up by Varnishkafka, so we'd find a trace in webrequest in any case, AFAIK. I think that our throttling limit for the edit API is too high: 25 req/s is not sustainable. Nuria said that ~120k requests were made in an hour, which is about ~33 req/s on average, but I am pretty sure the bot made fewer than 25 req/s for a long time before ramping up (and causing the alerts). My opinion is that we should either spend time improving the speed of Druid, or clearly define an acceptable amount of traffic that the edit API can sustain without falling over and throttle accordingly (or even both! But we probably don't have time :)
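
As a back-of-the-envelope check of those numbers (plain arithmetic, nothing measured beyond what is already quoted in this task):

```python
requests_per_hour = 120_000                 # ~120k requests in one hour (from above)
avg_rps = requests_per_hour / 3600          # ~33.3 req/s on average
limit_per_hour = 25 * 3600                  # = 90,000 req/h if 25 req/s were sustained
print(f"{avg_rps:.1f} req/s average vs. {limit_per_hour} req/h at the 25 req/s limit")
```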

I think the problem experienced yesterday could have been prevented by T189623.
Data backup:

Reducing the number of segments queryable at once should help a lot.
Also, from another angle, the issue occurred here because the variability of queryable pages was big (per-page queries). Maybe we should restrict the maximum time span of requests for very narrow requests (per-page, and per-user if we get there), as broader-span requests get cached eventually.
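
A minimal sketch of the kind of check being proposed, assuming start/end arrive in the API's YYYYMMDD form (the real change ended up being the AQS patch linked further down, not this code; the function name and the 366-day cap are assumptions):

```python
from datetime import datetime

MAX_DAYS = 366  # assumed cap: roughly one year for per-page requests

def timespan_ok(start: str, end: str, max_days: int = MAX_DAYS) -> bool:
    """Return True if a YYYYMMDD start/end pair stays within the allowed span."""
    fmt = "%Y%m%d"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return 0 <= delta.days <= max_days

# Example taken from the URIs logged in the task description:
print(timespan_ok("20070126", "20190402"))  # False: ~12 years, would be rejected
print(timespan_ok("20180402", "20190402"))  # True: exactly one year
```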

Nuria added a comment (edited). Apr 3 2019, 4:55 PM

@elukey, I think that is the global throttling, which is OK at that level; we probably just want to restrict per-method throttling on the edit methods that take per-article arguments.

Shall I send a PR updating the rate limiting in RESTBase for edits-per-page requests to 10?
https://github.com/wikimedia/restbase/blob/master/v1/metrics.yaml#L1551

Nuria added a comment. Apr 3 2019, 6:33 PM

@JAllemandou let me verify that throttling at the specified levels is actually happening. Will do this later today.

Nuria added a comment (edited). Apr 3 2019, 10:34 PM

So, over the course of the day in which we got 320,000 requests, only 11 got throttled.
Correcting my numbers: in the course of two hours (hours 10 and 11) we got 63,797 requests from one IP and 58,842 from another. So neither IP is over the per-second throttling limit, and thus the low number of 429s makes sense. Now, as @elukey mentioned, we cannot sustain 25 req/s on the editing endpoint. Until we limit the page size (per @JAllemandou's comment above) we will be reducing the limit to 5 req/s for that method, as from the data it does not look like we can sustain 10 (the numbers I calculated are for a 2-hour period, so the per-second estimates might be a bit rough).
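
The per-IP figures are consistent with that conclusion (again, just arithmetic on the counts quoted above):

```python
window_seconds = 2 * 3600                       # hours 10 and 11
for label, count in [("first IP", 63_797), ("second IP", 58_842)]:
    print(f"{label}: {count / window_seconds:.1f} req/s on average")
# ~8.9 and ~8.2 req/s respectively, well under the 25 req/s per-client limit,
# which is why only a handful of requests got 429s.
```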

Nuria claimed this task. Apr 3 2019, 11:13 PM
Nuria added a project: Analytics-Kanban.
Nuria set the point value for this task to 3.
Nuria moved this task from Next Up to In Code Review on the Analytics-Kanban board.

Interesting :)
When you look at the dashboard @elukey built, this graph is under the historical metrics section, making it clear that those requests are for historical nodes (by-segment requests to be aggregated later). It could be good to add the explicit node type in the graph title.

fdans moved this task from Incoming to Ops Week on the Analytics board. Apr 4 2019, 5:18 PM
Nuria added a comment. Apr 4 2019, 7:16 PM

Closed pull request and opened another one with hopefully the right changes: https://github.com/wikimedia/restbase/pull/1109

Change 502198 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/aqs@master] Update mediawiki-history per-page and per-editor

https://gerrit.wikimedia.org/r/502198

I did a quick analysis of request patterns: 94% of edits-per-page requests made on April 4th spanned more than 1 year, mostly with daily granularity. I have made the following patch to restrict the timespan of per-page requests to 1 year: https://gerrit.wikimedia.org/r/c/analytics/aqs/+/502198
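
For reference, a classification like the one described above can be reproduced from the request URIs themselves. Here is a sketch against the URI shape shown in the task description (the regex and function name are illustrative, not the analysis actually used):

```python
import re
from datetime import datetime

# /edits/per-page/<project>/<page>/<editor-types>/<granularity>/<start>/<end>
EDITS_PER_PAGE = re.compile(
    r"/edits/per-page/[^/]+/[^/]+/[^/]+/[^/]+/(?P<start>\d{8})/(?P<end>\d{8})"
)

def spans_more_than_a_year(uri: str) -> bool:
    m = EDITS_PER_PAGE.search(uri)
    if not m:
        return False
    fmt = "%Y%m%d"
    days = (datetime.strptime(m["end"], fmt) - datetime.strptime(m["start"], fmt)).days
    return days > 366

uri = ("/analytics.wikimedia.org/v1/edits/per-page/fr.wikipedia.org/Satavahana/"
       "all-editor-types/daily/20110627/20190402")
print(spans_more_than_a_year(uri))  # True: this request spans almost 8 years
```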

Mentioned in SAL (#wikimedia-operations) [2019-04-08T18:06:25Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@9cf5364]: Lower AQS rate limits and fix recommendation-api spec - T219910 T220221

Mentioned in SAL (#wikimedia-operations) [2019-04-08T18:27:40Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@9cf5364]: Lower AQS rate limits and fix recommendation-api spec - T219910 T220221 (duration: 21m 14s)

Change 502198 merged by Milimetric:
[analytics/aqs@master] Update mediawiki-history per-page and per-editor

https://gerrit.wikimedia.org/r/502198

Nuria moved this task from Ready to Deploy to Done on the Analytics-Kanban board.
Nuria moved this task from Done to Ready to Deploy on the Analytics-Kanban board.

Change 502814 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::public::worker: set stricter timeouts

https://gerrit.wikimedia.org/r/502814

Change 502814 merged by Elukey:
[operations/puppet@production] role::druid::public::worker: set stricter timeouts

https://gerrit.wikimedia.org/r/502814

Change 502951 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::public::worker: raise max direct mem for historical

https://gerrit.wikimedia.org/r/502951

Change 502951 merged by Elukey:
[operations/puppet@production] role::druid::public::worker: raise max direct mem for historical

https://gerrit.wikimedia.org/r/502951

Change 502958 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::druid::public::worker: reduce heap size for coord/overlord

https://gerrit.wikimedia.org/r/502958

Change 502958 merged by Elukey:
[operations/puppet@production] role::druid::public::worker: reduce heap size for coord/overlord

https://gerrit.wikimedia.org/r/502958

elukey added a comment (edited). Apr 11 2019, 10:23 AM

After the above changes it seems that the other problem (brokers getting stuck when deleting segments) does not show up anymore. This is what we changed (a rough sketch of the corresponding settings follows the list):

  • More HTTP threads on the historical daemons (from 20 -> 60), since each broker holds a 20-thread HTTP client connection pool to each historical (so 3 brokers == 60 HTTP connections to each historical).
  • More processing threads for the historical daemons (10 -> 36), since the historicals need to do more work than the brokers, especially under load.
  • Stricter timeouts for the brokers and historicals, to avoid waiting minutes before considering a request aborted (which increases the chance of queueing, requests piling up, failing health checks, etc.).
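
A rough sketch of how those thread-count changes map onto standard Druid runtime properties; this is an assumption for orientation, not a copy of the production puppet config (the gerrit changes linked above are authoritative):

```python
# Assumed mapping of the historical-daemon changes above to stock Druid property names.
# The values come from the list above; the actual puppet config may use different keys.
historical_runtime_properties = {
    "druid.server.http.numThreads": 60,  # was 20: 3 brokers x 20 pooled connections each
    "druid.processing.numThreads": 36,   # was 10: historicals do most of the query work
}
# The broker/historical timeout tightening is not reproduced here; the exact
# properties and values live in the puppet changes linked above.
```
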
elukey changed the point value for this task from 3 to 8.
Nuria closed this task as Resolved. Apr 19 2019, 1:24 PM