Page MenuHomePhabricator

wikifeeds usage increased by 3x on 2020-06-24
Closed, ResolvedPublic

Description

Creation of this task was triggered by an icinga alert:

/{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200)

429 is the HTTP status code for rate limiting.

Looking more into it and more specifically

https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?orgId=1&from=1592927887704&to=1593073020947

it's obvious that usage of the service has increased 3x. This isn't a specific endpoint either. Judging from[1] and the image pasted below

2020-06-25_11-23.png (1×2 px, 396 KB)
it's 5 distinct endpoints that are receiving anything between 5 and 10 times more traffic.

This extra traffic is violating the services stated SLOs[2], both for Traffic as well the Latency signals (Errors are violating the SLO, but way way less so) so I am guessing it's not expected.

[1] https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?panelId=12&fullscreen&orgId=1&from=1592927887704&to=1593073020947
[2] https://wikitech.wikimedia.org/wiki/Wikifeeds#Service_level_indicators/objectives_(SLIs/SLOs)

Event Timeline

akosiaris added subscribers: RLazarus, JMeybohm, Joe.

Turnillo points out that these requests come disproportionately with this User Agent

WikipediaApp/6.6.0.1742 (iOS 13.5.1; Phone)

which seems to be the latest version of the Wikipedia iOS app.[1]

Hits are particularly from the US and DE country codes with 2 of the respective ISPs bearing most of the traffic. However no IP can be singled out

This is looking more and more like valid traffic to me. Did we have some new feature rolled out that suddenly increased inbound traffic? Version history[2] seems to not suggest so, at least timewise. May 26, 2020 seems to be the date the app was rolled out, so 1 month ago.

[1] https://apps.apple.com/us/app/wikipedia/id324715238

Mentioned in SAL (#wikimedia-operations) [2020-06-25T09:28:44Z] <akosiaris> schedule downtime for eqiad wikifeeds as it's flapping too much without yet knowing why. T256358

codfw wikifeeds is by the way exhibiting similar behavior. https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?panelId=12&fullscreen&orgId=1&from=now-24h&to=now&var-dc=codfw%20prometheus%2Fk8s&var-service=wikifeeds

It's less pronounced there. About 2x to 3x on the same endpoints, and 2x on average as it seems

The start of the increases coincides with a deploy of restbase

16:46	<ppchelko@deploy1001>	Finished deploy [restbase/deploy@5f08f32]: Release PCS endpoints updates, feeds timed out, redo (duration: 05m 11s)	[production]
16:41	<ppchelko@deploy1001>	Started deploy [restbase/deploy@5f08f32]: Release PCS endpoints updates, feeds timed out, redo	[production]
16:40	<ppchelko@deploy1001>	Finished deploy [restbase/deploy@5f08f32]: Release PCS endpoints updates, take 2 (duration: 14m 11s)

https://sal.toolforge.org/log/i9o053IBLkHzneNNEhCF and friends

akosiaris added a subscriber: Pchelolo.

@Pchelolo I think that yesterday's deploy of restbase has caused the issue described in this task. Restbase also alerts with

/en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path => Missing keys: ['mostread']

We should rollback and figure out what happened and how we can address it.

Thank you for digging into this. I'll have a look asap

This is caused by https://github.com/wikimedia/restbase/pull/1256 that supports language variants for feeds. The problem with the PR is that it assigns vary: accept-language regardless whether the language has variants or not - thus less edge caching, thus more traffic. Fixing.

https://github.com/wikimedia/restbase/pull/1268 should begin emitting Vary: accept-language header only for the feeds that indeed can be converted to language variants. After deployment the edge cache should get populated and we should see the issue disappear.

Mentioned in SAL (#wikimedia-operations) [2020-06-25T15:19:08Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358

Mentioned in SAL (#wikimedia-operations) [2020-06-25T15:20:45Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358 (duration: 01m 37s)

Mentioned in SAL (#wikimedia-operations) [2020-06-25T15:23:06Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, take 2

Mentioned in SAL (#wikimedia-operations) [2020-06-25T15:29:44Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, take 2 (duration: 06m 38s)

Mentioned in SAL (#wikimedia-operations) [2020-06-25T15:30:17Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, more groups

Mentioned in SAL (#wikimedia-operations) [2020-06-25T15:33:41Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, more groups (duration: 03m 24s)

Mentioned in SAL (#wikimedia-operations) [2020-06-25T15:33:46Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, more groups

akosiaris lowered the priority of this task from High to Low.Jun 25 2020, 3:34 PM

It seems like the restbase deploy of 821e96b fixed the issue. I 'll lower priority but leave this open for a few hours and if everything is fine, close as Resolved

@Pchelolo thanks for the quick solution.

Mentioned in SAL (#wikimedia-operations) [2020-06-25T15:37:24Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, more groups (duration: 03m 38s)

Mentioned in SAL (#wikimedia-operations) [2020-06-25T15:37:47Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, more groups

Mentioned in SAL (#wikimedia-operations) [2020-06-25T15:42:56Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@821e96b]: Only emit vary: accept-language for feeds when it matters T256358, more groups (duration: 05m 09s)

Everything seems fine. I am gonna resolve this. @Pchelolo thanks again!