https://recommend.wmflabs.org is down
Closed, Resolved (Public)

Description

https://recommend.wmflabs.org is returning 'Internal Server Error' at the moment. ContentTranslation currently depends on it to fetch article suggestions.

Details

Related Gerrit Patches:
[research/recommendation-api@master] Temporarily disable related_articles

Event Timeline

KartikMistry triaged this task as High priority. Mar 19 2018, 6:20 AM

Can ContentTranslation handle this being down, or will this cause issues/errors?

Users won't get any suggestions.
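
The "no suggestions" behaviour described above can be pictured with a minimal Python sketch. This is not the ContentTranslation code (the extension is not written in Python); `fetch_suggestions` and its `opener` parameter are hypothetical names used only to illustrate a client that degrades to an empty list when the suggestion service returns an error.

```python
import json
import urllib.error
import urllib.request


def fetch_suggestions(url, opener=urllib.request.urlopen, timeout=5):
    """Return the parsed JSON response, or [] if the service is unavailable.

    `opener` is a hypothetical injection point so the function can be
    exercised without a network connection.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, ValueError, OSError):
        # HTTP 500, network failure, or a malformed body:
        # degrade to "no suggestions" instead of raising.
        return []
```

With this shape, an 'Internal Server Error' from the service reaches the user only as a missing suggestion list, not as an error in the calling application.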

Joe added a subscriber: Joe. Mar 19 2018, 10:31 AM

Just to understand: we have a production service dependent on something that lives in labs in order to function properly?

Joe added a comment. Mar 19 2018, 10:36 AM

Also, why is your service not using the production recommendation API service?

Where is the configuration pointing to that labs url?

Joe added a comment. Mar 19 2018, 10:41 AM

The solution to this ticket, as far as I understand, is using the production url http://recommendation-api.discovery.wmnet:9632 instead of the labs service.

Joe added a comment. Mar 19 2018, 11:06 AM

Turns out the content-translation extension is not compatible with the version of the service that runs in production.

So I would say what needs to be done is to update the content-translation extension.

Joe added a comment. Mar 19 2018, 11:08 AM

for reference, production expects requests like

/{domain}/v1/translation/articles/{source}{/seed}

(found from the swagger spec at http://recommendation-api.discovery.wmnet:9632/?spec)

while the extension makes requests like

/types/translation/v1/articles?source=en&target=fa&seed=Aliweb&search=related_articles&application=CX

which return an error on the production instance.
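
To make the mismatch concrete, here is a small Python sketch that builds both request shapes. The function names are hypothetical, and the thread does not say what `{domain}` denotes or how `target` maps onto the production path, so the `fa.wikipedia.org` value in the usage note is only an assumed placeholder.

```python
from urllib.parse import quote, urlencode


def legacy_path(source, target, seed):
    """Path in the shape the extension currently sends (copied from above)."""
    query = urlencode({
        "source": source,
        "target": target,
        "seed": seed,
        "search": "related_articles",
        "application": "CX",
    })
    return "/types/translation/v1/articles?" + query


def production_path(domain, source, seed=None):
    """Expand /{domain}/v1/translation/articles/{source}{/seed}.

    The meaning of `domain` is an assumption; only the template itself
    comes from the swagger spec quoted above.
    """
    path = "/{}/v1/translation/articles/{}".format(
        quote(domain, safe=""), quote(source, safe="")
    )
    if seed:
        # {/seed} is optional: append it only when a seed is given.
        path += "/" + quote(seed, safe="")
    return path
```

For example, `legacy_path("en", "fa", "Aliweb")` reproduces the query-string form shown above, while `production_path("fa.wikipedia.org", "en", "Aliweb")` yields a path-segment form that the old route does not match.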

Mentioned in SAL (#wikimedia-cloud) [2018-03-19T11:37:24Z] <arturo> reboot tool.recommendation-api.eqiad.wmflabs for T190014

A reboot didn't solve the issue.

I can see this in the logs:

Mar 19 03:59:10 tool nslcd[1546]: [b9af84] <passwd="bmansurov"> (re)loading /etc/nsswitch.conf
[...]
Mar 19 04:01:26 tool systemd[1]: Reloading.
Mar 19 04:01:26 tool systemd[1]: [/lib/systemd/system/lxc.service:15] Unknown lvalue 'Delegate' in section 'Service'
Mar 19 04:01:26 tool systemd[1]: Stopping Recommendation tool service...
[...]
Mar 19 04:01:27 tool systemd[1]: Started Recommendation tool service.
Mar 19 04:01:27 tool uwsgi[14268]: [uWSGI] getting INI configuration from /etc/recommendation/uwsgi.ini
Mar 19 04:01:27 tool uwsgi[14268]: *** Starting uWSGI 2.0.14 (64bit) on [Mon Mar 19 04:01:27 2018] ***
[...]
Mar 19 04:01:27 tool uwsgi[14268]: *** Operational MODE: preforking ***
Mar 19 04:01:27 tool systemd[1]: Starting A high performance web server and a reverse proxy server...
Mar 19 04:01:27 tool systemd[1]: Failed to read PID from file /run/nginx.pid: Invalid argument
Mar 19 04:01:27 tool systemd[1]: Started A high performance web server and a reverse proxy server.
Mar 19 04:01:28 tool uwsgi[14268]: 2018-03-19 04:01:28,064 recommendation.api.types.related_articles.candidate_finder initialize():97 INFO -- starting to load embedding
Mar 19 04:01:28 tool uwsgi[14268]: Traceback (most recent call last):
Mar 19 04:01:28 tool uwsgi[14268]: File "/usr/local/lib/python3.4/dist-packages/recommendation/api/types/related_articles/candidate_finder.py", line 112, in load_raw_embedding
Mar 19 04:01:28 tool uwsgi[14268]: f = open(path, 'r', encoding='utf-8')
Mar 19 04:01:28 tool uwsgi[14268]: FileNotFoundError: [Errno 2] No such file or directory: '/etc/recommendation/mini_embedding'
Mar 19 04:01:28 tool uwsgi[14268]: During handling of the above exception, another exception occurred:
Mar 19 04:01:28 tool uwsgi[14268]: Traceback (most recent call last):
Mar 19 04:01:28 tool uwsgi[14268]: File "/usr/lib/python3/dist-packages/pkg_resources.py", line 231, in get_provider
Mar 19 04:01:28 tool uwsgi[14268]: module = sys.modules[moduleOrReq]
Mar 19 04:01:28 tool uwsgi[14268]: KeyError: ''
Mar 19 04:01:28 tool uwsgi[14268]: During handling of the above exception, another exception occurred:
Mar 19 04:01:28 tool uwsgi[14268]: Traceback (most recent call last):
Mar 19 04:01:28 tool uwsgi[14268]: File "/etc/recommendation/recommendation.wsgi", line 26, in <module>
Mar 19 04:01:28 tool uwsgi[14268]: candidate_finder.initialize_embedding()
Mar 19 04:01:28 tool uwsgi[14268]: File "/usr/local/lib/python3.4/dist-packages/recommendation/api/types/related_articles/candidate_finder.py", line 79, in initialize_embedding
Mar 19 04:01:28 tool uwsgi[14268]: _embedding.initialize(embedding_path, embedding_package, embedding_name, optimize, optimized_embedding_path)
Mar 19 04:01:28 tool uwsgi[14268]: File "/usr/local/lib/python3.4/dist-packages/recommendation/api/types/related_articles/candidate_finder.py", line 102, in initialize
Mar 19 04:01:28 tool uwsgi[14268]: self.load_raw_embedding(path, package, name)
Mar 19 04:01:28 tool uwsgi[14268]: File "/usr/local/lib/python3.4/dist-packages/recommendation/api/types/related_articles/candidate_finder.py", line 114, in load_raw_embedding
Mar 19 04:01:28 tool uwsgi[14268]: f = open(resource_filename(package, name), 'r', encoding='utf-8')
Mar 19 04:01:28 tool uwsgi[14268]: File "/usr/lib/python3/dist-packages/pkg_resources.py", line 953, in resource_filename
Mar 19 04:01:28 tool uwsgi[14268]: return get_provider(package_or_requirement).get_resource_filename(
Mar 19 04:01:28 tool uwsgi[14268]: File "/usr/lib/python3/dist-packages/pkg_resources.py", line 233, in get_provider
Mar 19 04:01:28 tool uwsgi[14268]: __import__(moduleOrReq)
Mar 19 04:01:28 tool uwsgi[14268]: ValueError: Empty module name
Mar 19 04:01:28 tool uwsgi[14268]: unable to load app 0 (mountpoint='') (callable not found or import error)
Mar 19 04:01:28 tool uwsgi[14268]: *** no app loaded. going in full dynamic mode ***
Mar 19 04:01:28 tool uwsgi[14268]: *** uWSGI is running in multiple interpreter mode ***

This seems like a bad deployment. Ping @bmansurov
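
Reading the traceback: opening `/etc/recommendation/mini_embedding` fails, the loader then falls back to a packaged resource, and because the package name is empty the fallback dies with `ValueError: Empty module name`, which hides the real cause. The sketch below is not the code in `candidate_finder.py`; `open_embedding` is a hypothetical helper, and it uses `importlib.resources` (Python 3.9+) in place of the `pkg_resources.resource_filename` call seen in the log. It only illustrates how a loader could report the missing file directly.

```python
import importlib.resources
import os


def open_embedding(path, package, name):
    """Open the embedding from `path`, else from a packaged resource.

    Raises FileNotFoundError with an explicit message when neither
    source is usable, instead of failing inside the import machinery.
    """
    if path and os.path.exists(path):
        return open(path, "r", encoding="utf-8")
    if not package or not name:
        # No fallback configured: surface the real problem (missing file).
        raise FileNotFoundError(
            "embedding not found at {!r} and no packaged fallback "
            "is configured".format(path)
        )
    resource = importlib.resources.files(package).joinpath(name)
    return resource.open("r", encoding="utf-8")
```

With a check like this, the log would have named the missing file as the failure rather than ending in an import error.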

bmansurov added a comment. Edited Mar 19 2018, 12:19 PM

I was trying to update the server with a new patch from T189931. I'll look into the issue.

Edit: Looks like one of the required files has gone missing. The service expects a file at /var/lib/recommendation/embedding.npz. The labs setup script deleted it, but restoring it doesn't seem to be automated. @leila do you know where I can get that file from? Nathaniel must have gotten it from a researcher, but I'm not sure.

leila raised the priority of this task from High to Unbreak Now!. Mar 19 2018, 5:12 PM
Restricted Application added a subscriber: Liuxinyu970226. Mar 19 2018, 5:12 PM

@aborrero you might want to change your password, since you posted it in the logs above :)

The service is back up again. I've disabled the problematic related_articles part until we figure out how to fix it properly.

Thanks @bmansurov! Can you look at https://phabricator.wikimedia.org/T190034? Language-Team wasn't aware of its deployment to production or of the API changes. We need to update the CX code to match it. Also, who is the contact person to coordinate further on this? (CC: @leila)

@KartikMistry, I've left a comment on that task (let me know if you need any other info there). I think you can contact me or someone from the Services team about this service.

Change 420638 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[research/recommendation-api@master] Temporarily disable related_articles

https://gerrit.wikimedia.org/r/420638

bd808 edited projects, added VPS-Projects; removed Cloud-Services. Mar 20 2018, 4:44 PM
leila added a comment. Mar 20 2018, 7:36 PM

@KartikMistry the recommendation API is almost Productionized. If you check T148129, you'll see under Related Objects that all tasks except 2 are resolved. The two are: finding a product lead and a tech lead for the API. That's why we don't call it fully Productionized. As ContentTranslation is going out of beta, we are very close to the point where we have to make decisions about those 2 open tasks. Let's continue that discussion outside of this task. :)

leila assigned this task to bmansurov. Mar 26 2018, 6:50 PM
leila lowered the priority of this task from Unbreak Now! to High.

@bmansurov shall we leave this task open?

bmansurov closed this task as Resolved. Mar 27 2018, 1:30 AM

I'm closing the task as I haven't heard back from the interested parties about the service being down since I posted a fix. Feel free to re-open if the issue persists.

Change 420638 abandoned by Bmansurov:
Temporarily disable related_articles

Reason:
I've disabled the service on https://recommend.wmflabs.org/ for now.

https://gerrit.wikimedia.org/r/420638