recommendation api's test on scb nodes are flapping
Closed, Resolved (Public)

Description

The recommendation api on scb eqiad nodes seems to be failing intermittently, causing alerts to SRE.

The problem seems to be with API calls like the following, which intermittently return HTTP 200 (expected) or 404/503 (not expected). To reproduce:

curl http://scb1001.eqiad.wmnet:9632/uz.wikipedia.org/v1/article/creation/morelike/Palov -i
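
To get a feel for how often the endpoint flaps, the single curl call above can be repeated and the status codes tallied. The following is a minimal Python sketch that assumes only the URL above; the helper names `http_status` and `tally_statuses` and the `fetch` parameter are hypothetical and not part of the service.

```python
import collections
import urllib.error
import urllib.request

URL = "http://scb1001.eqiad.wmnet:9632/uz.wikipedia.org/v1/article/creation/morelike/Palov"


def http_status(url, timeout=5.0):
    """Return the HTTP status code of a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx answers; the status code is what we want.
        return error.code


def tally_statuses(url, attempts, fetch=http_status):
    """Call fetch(url) `attempts` times and count each status code seen."""
    counts = collections.Counter()
    for _ in range(attempts):
        counts[fetch(url)] += 1
    return counts


# Example (needs network access to the scb hosts):
#   print(dict(tally_statuses(URL, attempts=20)))
```

A mix of 200, 404 and 503 in the tally for the same URL would match the flapping described here.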

We thought that the problem was internal to WDQS, but after a chat with the Search team we didn't find any outstanding problem. This seems to be related to the application itself, so I'd ask the Research team for a quick look, if possible, to help us debug :)

Event Timeline

elukey renamed this task from reccommendation api's test on scb nodes are flapping to recommendation api's test on scb nodes are flapping. Mar 16 2020, 9:45 AM
elukey created this task.
elukey updated the task description.

Mentioned in SAL (#wikimedia-operations) [2020-03-16T10:36:04Z] <elukey> roll restart of recommendation service on scb* as attempt to fix the flapping alerts - T247732

@elukey thanks for flagging this.

@bmansurov can you look into this and let me know what the best course of action is?

leila triaged this task as High priority. Mar 16 2020, 7:20 PM

@elukey how can I access http://scb1001.eqiad.wmnet:9632? Should I be on some host to ping that URL? Also, where can I see the logs? Thanks!

Any host that can reach scb is fine; from within the analytics VLAN we need to use the HTTP proxy. For the logs, access to the scb nodes is needed :(

The API was also flapping heavily yesterday. In my opinion, a chat between Research and SRE needs to happen to figure out how best to manage this service. Pulling in @akosiaris as point of contact to decide how to proceed :)

@bmansurov let me know if my help is needed. Otherwise, I assume you're on it.

Thanks both. @leila I'm on it. I need to access logstash.wikimedia.org to see the logs. According to the documentation:

wikitech LDAP username and password and membership in one of the following LDAP groups: nda, ops, wmf

Do you know who can add me to the nda group? Can you help me gain access? I'm using my Gerrit username and password, but cannot log in.

@bmansurov please fill out the task description for T250335 and add LDAP-Access-Requests as a tag.

@elukey from the logs I see that 404 and 503 come in pairs. In the recommendation API we ping the MediaWiki API, which sometimes returns 503. We then return a 404 here. So the error really has to do with the MW API. Below I'm pasting a sample response from the MW API (prettified for readability):

<!DOCTYPE html>
<html lang="en" dir="ltr">
<meta charset="utf-8">
<title>Wikimedia Error
</title>
<style>* { margin: 0; padding: 0; }body { background: #fff; font:
15px/1.6 sans-serif; color: #333; }.content { margin: 7% auto 0;
padding: 2em 1em 1em; max-width: 640px; }img { float: left; margin: 0
2em 2em 0; }a img { border: 0; }h1 { margin-top: 1em; font-size:
1.2em; }p { margin: 0.7em 0 1em 0; }a { color: #0645AD;
text-decoration: none; }a:hover { text-decoration: underline; }
</style>
<div class="content" role="main">
<a href="https://www.wikimedia.org">
<img src="https://www.wikimedia.org/static/images/wmf.png"
srcset="https://www.wikimedia.org/static/images/wmf-2x.png 2x"
alt=Wikimedia width=135 height=135>
</a>
<h1>Service Temporarily Unavailable
</h1>
<p>Our servers are currently under maintenance or experiencing a
technical problem. Please
<a href="" title="Reload this page"
onclick="location.reload(false); return false">try again
</a> in a few&nbsp;minutes.
</p>
</div>
</html>

I'm open to suggestions on how to work this out. Ideally this should be fixed in MediaWiki.

@bmansurov thanks for following up! What I'd start doing is to log 50x errors from the MW api in the service logs if possible, so people can easily get what is happening when the recommendation api starts flapping. To be more resilient to temporary glitches of the MW API we could add a simple and quick retry/backoff, but it may amplify traffic to the MW API when it is already suffering, so it could be a double-edged sword.

One thing that I'd like to clarify: in our tests, when the recommendation api flaps, calls for the same URL (like the one in the description) sometimes lead to a 404 and sometimes to a 503, whereas from your description it seems that a 404 is the response to a 503 received from the MW API.
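
The retry/backoff idea mentioned in this comment can be sketched as follows. This is only an illustration under stated assumptions: `call_with_backoff`, `UpstreamUnavailable` and the `request` callable are hypothetical names, not code from the recommendation API. The small cap on attempts is what bounds the extra traffic sent to the MW API while it is struggling.

```python
import random
import time


class UpstreamUnavailable(Exception):
    """Raised when the upstream API keeps answering with a 5xx status."""


def call_with_backoff(request, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call request() until it returns a non-5xx status or attempts run out.

    request is a zero-argument callable returning (status, body).
    max_attempts must be at least 1.
    """
    for attempt in range(max_attempts):
        status, body = request()
        if status < 500:
            return status, body
        if attempt < max_attempts - 1:
            # Exponential backoff with full jitter: wait a random time
            # between 0 and base_delay * 2**attempt seconds.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise UpstreamUnavailable(
        f"upstream returned {status} after {max_attempts} attempts"
    )
```

With `max_attempts=3`, a failing upstream receives at most three requests per incoming call, so the amplification is bounded but not zero.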

@bmansurov thanks for following up! What I'd start doing is to log 50x errors from the MW api in the service logs if possible, so people can easily get what is happening when the recommendation api starts flapping.

This is already happening. See /srv/log/recommendation_api/main.log in scb1001.eqiad.wmnet, for example.

One thing that I'd like to clarify: in our tests, when the recommendation api flaps, calls for the same URL (like the one in the description) sometimes lead to a 404 and sometimes to a 503, whereas from your description it seems that a 404 is the response to a 503 received from the MW API.

In addition to 404 after 503, I also see that 503 happens by itself, with the same error message as in my previous comment, when pinging /www.wikidata.org/v1/description/translation/from/tr/to/en or /commons.wikimedia.org/v1/caption/translation/from/ru/to/en, for example.

And you're right, 503 for the existing article happens alone as well, it seems. However, the error message is still

Service Temporarily Unavailable
Our servers are currently under maintenance or experiencing a technical problem.

This is weird because we don't raise a 503 in our code, so it may be propagating from somewhere else.

Yes, sorry, what I meant is whether there is an explanation of the 404 in the logs, since a 404 is not something that catches the eye when debugging why a service flaps, for example. Anyway, let's look a little bit more into the 503s!

Change 590755 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/services/recommendation-api@master] Morelike API: return 503 when MW API fails

https://gerrit.wikimedia.org/r/590755

Change 590755 merged by jenkins-bot:
[mediawiki/services/recommendation-api@master] Morelike API: return 503 when MW API fails

https://gerrit.wikimedia.org/r/590755

@elukey With the above patch, some 503 errors will be logged correctly with an informative message. I'll deploy the patch as soon as possible.
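
The intent of the patch title ("return 503 when MW API fails") can be illustrated with a small status-mapping sketch. It is not the code from change 590755: the function name `morelike_status` is hypothetical, and treating an empty result list as a 404 is an assumption made for the example.

```python
def morelike_status(upstream_status, results):
    """Pick the status the recommendation API should answer with.

    upstream_status: HTTP status received from the MW API.
    results: list of articles parsed from the MW API response.
    """
    if upstream_status >= 500:
        # The MW API failed: report the service as unavailable
        # instead of claiming that nothing was found.
        return 503
    if not results:
        # Assumption: a successful upstream call with no results is a 404.
        return 404
    return 200
```

Checking the upstream status first is the design point: an upstream failure and a genuine "nothing found" no longer share the same 404.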

I also figured out why the other 503s were happening. Basically, we make more than one type of MW API request (e.g. getWikidataId, getArticleNames, etc.), and some of those request errors aren't explicitly caught and don't return a custom error message. We just let the API utility function log and return the error. I think we should keep letting errors propagate because the error messages are already informative. What do you think?

@bmansurov the idea seems good! About the propagation of the error, I would say that it is better to wrap the 503 returned by the API in something ad-hoc for the recommendation API, since it might be confusing for a user to query a service and get back the response that the MW api returns when in error. Maybe something like "One or more MW API calls failed with 503, check the logs for more info" or similar. Does it sound good?
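
A rough sketch of the wrapping suggested in this comment: keep the raw MW API response in the service logs and hand the caller a short message owned by the recommendation API. The names `wrap_upstream_error` and `logger` and the response shape are assumptions for illustration, not the service's actual code.

```python
import logging

logger = logging.getLogger("recommendation_api")


def wrap_upstream_error(upstream_status, upstream_body):
    """Log the raw upstream error and return a service-level error response."""
    # Keep the upstream detail (truncated to 200 characters) in the logs.
    logger.error("MW API returned %s: %.200s", upstream_status, upstream_body)
    # Give the caller a short message instead of the MW API error page.
    detail = (
        f"One or more MW API calls failed with {upstream_status}, "
        "check the logs for more info"
    )
    return 503, {"status": 503, "detail": detail}
```

The caller then sees a 503 with a one-line explanation, while the "Wikimedia Error" HTML page stays in the logs.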

Change 592490 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/services/recommendation-api@master] Recommendation API: return user friendly error messages

https://gerrit.wikimedia.org/r/592490

Change 592490 merged by jenkins-bot:
[mediawiki/services/recommendation-api@master] Recommendation API: return user friendly error messages

https://gerrit.wikimedia.org/r/592490

bmansurov claimed this task.
bmansurov removed a project: Patch-For-Review.
bmansurov moved this task from Staged to Services on the Research board.

All changes have been deployed. Feel free to re-open when you see the issue again.

@bmansurov one thing that I'd consider is changing the health check for the recommendation API service so that it possibly does not fire a request to mediawiki but just a dummy one. We currently see the service flapping every time the mw api is under distress, but that is just noise; I don't think it is particularly valuable to test a real page. What do you think?
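
One way to picture the "dummy" health check proposed here is a shallow check that never calls the MW API, next to an optional deeper check that reports upstream trouble in the response body while still answering 200. Both functions and their response shapes are hypothetical sketches, not the service's actual health check.

```python
def shallow_health():
    """Report that the process is up without calling the MW API."""
    return 200, {"status": "ok"}


def deep_health(check_upstream):
    """Report upstream trouble in the body while still answering 200.

    check_upstream is a zero-argument callable returning True when the
    MW API answered correctly.
    """
    upstream_ok = check_upstream()
    return 200, {"status": "ok", "mw_api": "ok" if upstream_ok else "degraded"}
```

With either variant, an alert that only looks at the status code would stop firing when the MW API alone is under distress.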

Change 593916 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/services/recommendation-api@master] Health check: disable error logging to prevent false alrams

https://gerrit.wikimedia.org/r/593916

@elukey Please take a look at the above patch. Rather than stubbing requests to MW API (and thus making a lot of code changes) I decided not to log health check error messages when the MW API is under stress. If the patch looks good to you, let me know and I'll deploy it.

The main issue (IIUC the code) is still that a failed health check response would return a 50x error, thus ending up in icinga alerts.

OK, I see. Then I'll not merge that patch. I'll handle the response code directly, possibly returning a 404 for failed health check requests.

Change 593916 merged by jenkins-bot:
[mediawiki/services/recommendation-api@master] Health check: return 404 when MW API fails

https://gerrit.wikimedia.org/r/593916

@elukey I've deployed the fix. Let me know if you still see the issue.

Legoktm added a subscriber: Legoktm.

This is no longer an issue because SCB is long gone, and there are no flapping alerts for this service that I've seen recently.