
flapping monitoring for recommendation_api on scb
Open, MediumPublic

Description

The Icinga monitoring for recommendation_api on the scb machines flaps a lot. The pattern is always:

18:53 < icinga-wm> PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target) is CRITICAL: 
                   Test normal source and target returned the unexpected status 429 (expecting: 200)
18:54 < icinga-wm> RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy

So it gets a 429 for a while and then goes back to 200. It looks like the service is rate-limiting the Icinga check.

Event Timeline

Dzahn created this task. Oct 17 2017, 10:58 PM
Restricted Application added a subscriber: Aklapper. Oct 17 2017, 10:58 PM
Dzahn updated the task description. Oct 17 2017, 10:59 PM
Dzahn renamed this task from "flapping monitoring for cxserver on scb" to "flapping monitoring for recommendation_api on scb". Oct 17 2017, 11:02 PM
Dzahn updated the task description.

Mentioned in SAL (#wikimedia-operations) [2017-10-17T23:06:51Z] <mutante> disabling Icinga notifications for service recommendation_api on scb hosts - please remember to re-enable once ticket is resolved (T178445) (https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=recommendation_api%20endpoints)

mobrovac added subscribers: Gehel, Smalyshev, mobrovac.

The 429s are coming from WDQS. @Gehel, @Smalyshev, would it be possible to split WDQS' rate limiting for internal and external requests?

Restricted Application added projects: Wikidata, Discovery. Oct 18 2017, 7:35 AM
Smalyshev added a comment (edited). Oct 18 2017, 7:39 AM

@mobrovac what's the rate the requests are currently sent at? IIRC the limits we have are pretty generous, but depends on the use case of course.

Gehel added a comment. Oct 18 2017, 7:40 AM

@mobrovac it is possible if we can identify internal traffic. The throttling we apply is bucketed by user agent / IP, so I suspect that all the recommendation API traffic ends up in the same bucket.

We could add an exception for specific user agent. Or have Recommendation API propagate some way to bucket by end user. This needs a little bit of thinking.
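To make the bucketing idea concrete, here is a minimal sketch of a sliding-window throttle keyed on (user agent, IP), with an exemption list for a known internal user agent. All names and limits here (`EXEMPT_USER_AGENTS`, `LIMIT`, `WINDOW`) are illustrative assumptions, not WDQS's actual implementation:

```python
import time
from collections import defaultdict
from typing import Optional

# Illustrative values only; WDQS's real limits are not stated in this task.
EXEMPT_USER_AGENTS = {"recommendation-api/internal"}  # assumed internal UA
LIMIT = 10     # max requests per bucket per window (assumed)
WINDOW = 60.0  # sliding window in seconds (assumed)

_buckets = defaultdict(list)  # (user_agent, ip) -> list of request timestamps

def allow_request(user_agent: str, ip: str, now: Optional[float] = None) -> bool:
    """Return True if this request fits within its (UA, IP) bucket's limit."""
    if user_agent in EXEMPT_USER_AGENTS:
        return True  # internal traffic bypasses throttling entirely
    now = time.monotonic() if now is None else now
    key = (user_agent, ip)
    # Keep only timestamps still inside the sliding window.
    _buckets[key] = [t for t in _buckets[key] if now - t < WINDOW]
    if len(_buckets[key]) >= LIMIT:
        return False  # caller should answer 429 Too Many Requests
    _buckets[key].append(now)
    return True
```

The downside Gehel points at is visible here: every request carrying the same UA and source IP lands in one bucket, so a service proxying many end users exhausts its bucket quickly unless it is exempted or the key includes something per end user.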

> @mobrovac what's the rate the requests are currently sent at? IIRC the limits we have are pretty generous, but depends on the use case of course.

This is the part that I don't understand: the rate at the public end point is 0, which means the Recommendation API service contacts WDQS only 4 times every 60 seconds, all of them health-check requests from the automatic monitoring script. That doesn't seem like something that should trigger rate limiting.

> We could add an exception for specific user agent. Or have Recommendation API propagate some way to bucket by end user. This needs a little bit of thinking.

The latter would be ideal. We could forward the client's IP via the x-client-ip header.
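As a sketch of the forwarding idea: the Recommendation API would copy the end user's IP into an `X-Client-IP` header on its outbound WDQS request, so the throttle could bucket per end user rather than per service. The user-agent string below is an assumption; the SPARQL endpoint URL is the public WDQS one:

```python
from urllib.parse import urlencode
from urllib.request import Request

WDQS_URL = "https://query.wikidata.org/sparql"

def build_wdqs_request(sparql: str, client_ip: str) -> Request:
    """Build a WDQS query request that forwards the original client's IP."""
    url = WDQS_URL + "?" + urlencode({"query": sparql, "format": "json"})
    return Request(url, headers={
        # Assumed UA string; identifies the calling service to WDQS.
        "User-Agent": "recommendation-api/0.1 (internal)",
        # The end user's IP, forwarded so throttling can bucket per user.
        "X-Client-IP": client_ip,
    })
```

This only helps if WDQS is changed to trust and key on that header for requests arriving from internal hosts; an external client could otherwise spoof it.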

Gehel added a comment. Oct 18 2017, 8:16 AM

Looking at the logs in Logstash, it seems we throttled the Recommendation API only when WDQS was being overwhelmed by another user. Throttling is based on overall request time, and the service was most probably already not responding correctly, or not within a reasonable time.

My understanding at this point is that the flapping and alerting were reasonable. WDQS was returning 429 (throttling), but without throttling it would probably have timed out.

Forwarding x-client-ip might still make sense. Or sending a user agent combining the current UA and the end user's UA.

The number of requests from the Recommendation API service actually makes sense. On each service-checker run, the service sends 3 requests to WDQS. Accounting for the number of hosts and the frequency of the checks, we get to 0.5 requests per second. However, because the checker script does not receive the expected status, it retries multiple times, which brings us to the observed rate of approximately 1.5 req/s. @Joe, @Volans, should some back-off policy perhaps be implemented in the checker script, especially when it receives 429s?
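As a back-of-the-envelope check, the quoted rates are consistent under one plausible set of figures. The 3 requests per run and the 0.5 and 1.5 req/s rates come from the comment above; the host count, check interval, and retry multiplier below are assumptions chosen to reproduce them, since the task does not state them:

```python
# Stated in the task: 3 WDQS requests per checker run, ~0.5 req/s baseline,
# ~1.5 req/s observed. Assumed for illustration: 10 hosts, 60 s interval,
# 3x effective request count due to retries on unexpected status.
requests_per_run = 3
hosts = 10          # assumed
interval_s = 60     # assumed

baseline = requests_per_run * hosts / interval_s
print(baseline)     # 0.5 req/s, the quoted steady-state rate

retry_multiplier = 3  # assumed: each failed check is retried ~3x
print(baseline * retry_multiplier)  # 1.5 req/s, the observed rate
```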

I'm not sure why we have a retry in the first place. An exponential back-off would be good, or honoring the Retry-After HTTP header that WDQS sends with 429 errors.
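The suggested behavior can be sketched as follows. This is not the actual service-checker code; `do_request` is a hypothetical stand-in for its HTTP call, assumed to return a `(status_code, retry_after_seconds_or_None)` pair:

```python
import random
import time
from typing import Callable, Optional, Tuple

def check_with_backoff(do_request: Callable[[], Tuple[int, Optional[float]]],
                       max_attempts: int = 4,
                       base_delay: float = 1.0,
                       sleep=time.sleep) -> int:
    """Run a health check, retrying 429s with Retry-After or exponential back-off."""
    for attempt in range(max_attempts):
        status, retry_after = do_request()
        if status != 429:
            return status  # success, or a non-throttling error: report it as-is
        if attempt == max_attempts - 1:
            break  # out of attempts; report the 429
        if retry_after is not None:
            delay = retry_after  # the server told us when to come back
        else:
            # Exponential back-off with a little jitter: ~1s, ~2s, ~4s, ...
            delay = base_delay * (2 ** attempt) * (1 + random.random() * 0.1)
        sleep(delay)
    return 429
```

Honoring Retry-After when present avoids the failure mode in this task, where immediate retries tripled the request rate and kept the checker inside the throttling window.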

Ottomata triaged this task as Medium priority. Jan 16 2018, 8:13 PM
Addshore moved this task from incoming to monitoring on the Wikidata board. Sep 18 2018, 2:38 PM
brennen added a subscriber: brennen. Dec 4 2019, 8:52 PM
jcrespo added a subscriber: jcrespo.

This is flapping very frequently, but with a 500, not a 429 (on scb1002 alone, for example, twice per hour). Should I close this and open a new one, or can this be handled here?

fgiunchedi moved this task from Inbox to Radar on the observability board. Mon, Jul 20, 1:16 PM