/v1/translate/{from}/{to}{/provider} endpoint fails while deploying cxserver
Closed, Resolved · Public

Description

Since T173031 is not the issue or root cause, creating this task to track cxserver deployment failure.

scap-log shows output like:

{"name": "target.scb2001.codfw.wmnet.checks", "created": 1502386956.317136, "args": [], "msecs": 317.1360492706299, "filename": "checks.py", "levelno": 30, "msg": "Check 'endpoints' failed: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) is CRITICAL: Could not fetch url http://10.192.32.132:8080/v1/translate/en/es/Apertium: Generic connection error: HTTPConnectionPool(host=u'10.192.32.132', port=8080): Max retries exceeded with url: /v1/translate/en/es/Apertium (Caused by ProtocolError('Connection aborted.', BadStatusLine(\"''\",)))\n", "host": "scb2001.codfw.wmnet", "lineno": 88, "exc_text": null, "funcName": "handle_failure", "relativeCreated": 9354.056119918823}

Details

Related Gerrit Patches:
mediawiki/services/cxserver/deploy : master · Update mwapi_req template configuration
mediawiki/services/cxserver : master · Add default mwapi_req template configuration to config.yaml

Event Timeline

Restricted Application added a project: ContentTranslation. · Aug 10 2017, 6:42 PM
Restricted Application added a subscriber: Aklapper.

We have two endpoints: mt and translate. Neither is currently working. Does the mt endpoint have the same error? When does this error happen? During deployment?
What is restbase calling right now? Why is it returning HTTP 200 code when error happens? My best guess currently is that restbase is not sending the body as application/x-www-form-urlencoded but instead as multipart/form-data or something else. I was not able to figure out what the HyperSwitch library uses internally and how it handles post requests.

Where is get_from_cx defined? It's used in https://github.com/wikimedia/restbase/blob/master/v1/transform-global.yaml#L60 Also, why is https://github.com/wikimedia/restbase/blob/master/v1/transform-lang.yaml#L54 using operationId: doMT instead? Shouldn't those two be the same?

Additional notes after chatting with other team members:

Apparently the translate endpoint has another problem: CXServer is not able to call Wikipedia APIs. I think it is currently masked by the problem I described above, since the mt endpoint does not make such calls. Given that the mt endpoint is working in CXServer, my suspicions point to RESTBase, as described above.

We have two endpoints: mt and translate. Neither is currently working. Does the mt endpoint have the same error? When does this error happen? During deployment?

The mt endpoint is fine during deployment. The error on the translate endpoint happens during deployment.

Why is it returning HTTP 200 code when error happens?

The endpoint tests only expect HTTP 200 for any output from the mt or translate endpoint, because we can't predict the exact output of machine translation (it may vary and break the tests).

Where is get_from_cx defined? It's used in https://github.com/wikimedia/restbase/blob/master/v1/transform-global.yaml#L60

https://github.com/wikimedia/restbase/blob/master/v1/content_segments.yaml#L69

For the rest, @mobrovac @santhosh 's input is needed.

Mentioned in SAL (#wikimedia-operations) [2017-08-21T14:53:46Z] <mobrovac@tin> Started deploy [cxserver/deploy@1065ffe]: Deploy 1065ffe2 to canary scb2001 for debugging - T173038

Mentioned in SAL (#wikimedia-operations) [2017-08-21T14:54:04Z] <mobrovac@tin> Finished deploy [cxserver/deploy@1065ffe]: Deploy 1065ffe2 to canary scb2001 for debugging - T173038 (duration: 00m 18s)

Mentioned in SAL (#wikimedia-operations) [2017-08-21T14:55:05Z] <mobrovac> cxserver depool scb2001 to debug failed checks - T173038

The actual error is:

FATAL: cxserver/198 on scb2001: First argument must be a string or Buffer (err.levelPath=fatal/service-runner/unhandled)
    TypeError: First argument must be a string or Buffer
        at ClientRequest.OutgoingMessage.write (_http_outgoing.js:458:11)
        at Request.write (/srv/deployment/cxserver/deploy-cache/revs/1065ffe22766219fab8d5f8509ced5c84eb5654d/node_modules/request/request.js:1514:27)
        at end (/srv/deployment/cxserver/deploy-cache/revs/1065ffe22766219fab8d5f8509ced5c84eb5654d/node_modules/request/request.js:552:18)
        at Immediate.<anonymous> (/srv/deployment/cxserver/deploy-cache/revs/1065ffe22766219fab8d5f8509ced5c84eb5654d/node_modules/request/request.js:581:7)
        at runCallback (timers.js:672:20)
        at tryOnImmediate (timers.js:645:5)
        at processImmediate [as _immediateCallback] (timers.js:617:5)

This happens when CXServer tries to contact the MW API after coming across the MWLink present in the spec example. The above error is likely produced because the request to the API is sent before anything is written into it. The code handling all of that was introduced in dde5ba3b0c1efd7f1247952d12ff66d9af8508a4, so this is where you will have to look. A couple of observations:

  • You shouldn't mix native Promises and bluebird ones
  • Why do you need a queue for executing MW API requests? I'm quite positive that it can be done without the async queue handling
  • Overall, my impression is that the code in lib/mw is overly complicated for what you are trying to achieve. Bluebird has stable APIs that would help you there (.each, .all, etc), you should take advantage of it!

Removing the RESTBase tag as this has nothing to do with RESTBase.

Mentioned in SAL (#wikimedia-operations) [2017-08-21T17:22:18Z] <mobrovac@tin> Started deploy [cxserver/deploy@f43ef96]: Bring back cxserver on scb2001 to a stable state - T173038

Mentioned in SAL (#wikimedia-operations) [2017-08-21T17:22:32Z] <mobrovac@tin> Finished deploy [cxserver/deploy@f43ef96]: Bring back cxserver on scb2001 to a stable state - T173038 (duration: 00m 14s)

FATAL: cxserver/198 on scb2001: First argument must be a string or Buffer (err.levelPath=fatal/service-runner/unhandled)

Thanks. I was able to reproduce the issue. It is caused by a configuration problem.

The mwapi_req configuration needs either the query or the body option, depending on whether the API request is GET or POST. The default configuration inherited from https://github.com/wikimedia/service-template-node/blob/master/config.dev.yaml#L65 does not take the HTTP method into account and has only body, so I have added query there as well. See: https://github.com/wikimedia/mediawiki-services-cxserver/blob/master/lib/api-util.js#L57

But our deployment configuration is not updated to take care of this: https://github.com/wikimedia/mediawiki-services-cxserver-deploy/blob/master/scap/templates/config.yaml.j2#L40

The node module wikimedia/preq throws the following error if query is not present in the template and body is always used:

[2017-08-22T10:49:16.455Z] FATAL: cxserver/25628 on thottingal: First argument must be a string or Buffer (err.levelPath=fatal/service-runner/unhandled)
    TypeError: First argument must be a string or Buffer
        at ClientRequest.OutgoingMessage.write (_http_outgoing.js:457:11)
        at Request.write (/home/santhosh/work/wiki/cxserver/node_modules/request/request.js:1514:27)
        at end (/home/santhosh/work/wiki/cxserver/node_modules/request/request.js:552:18)
        at Immediate.<anonymous> (/home/santhosh/work/wiki/cxserver/node_modules/request/request.js:581:7)
        at runCallback (timers.js:672:20)
        at tryOnImmediate (timers.js:645:5)
        at processImmediate [as _immediateCallback] (timers.js:617:5)

@KartikMistry If you can just update the https://github.com/wikimedia/mediawiki-services-cxserver-deploy/blob/master/scap/templates/config.yaml.j2#L40 as follows, this will be fixed:

body: "{{request.body}}"
query: "{{ default(request.query, {}) }}"
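
The query-versus-body point above can be illustrated with a minimal sketch. It assumes two hypothetical helpers, `buildMwApiRequest` and `assertWritableBody`, which are not the actual cxserver or preq code: parameters go into `query` for GET and into a form-encoded string `body` for POST, and a guard rejects any body that is neither a string nor a Buffer before it reaches the HTTP layer. The URL and parameters are illustrative only.

```javascript
'use strict';

// Hypothetical helper, not the actual cxserver or preq code.
// GET parameters travel in the query string; POST parameters travel in the body.
function buildMwApiRequest( method, uri, params ) {
	const request = { method: method.toLowerCase(), uri, headers: {} };
	if ( request.method === 'get' ) {
		request.query = params;
	} else {
		// Form-encode so the body is already a string when it is written out.
		request.body = new URLSearchParams( params ).toString();
		request.headers[ 'content-type' ] = 'application/x-www-form-urlencoded';
	}
	return request;
}

// Hypothetical guard: fail early, with a clearer message, when a body
// is neither a string nor a Buffer.
function assertWritableBody( request ) {
	const body = request.body;
	if ( body !== undefined && typeof body !== 'string' && !Buffer.isBuffer( body ) ) {
		throw new TypeError( `${ request.method } ${ request.uri }: body must be a string or Buffer` );
	}
	return request;
}

// Illustrative parameters only.
const params = { action: 'query', titles: 'Earth|Moon', format: 'json' };
const apiUri = 'https://en.wikipedia.org/w/api.php';
const getRequest = assertWritableBody( buildMwApiRequest( 'get', apiUri, params ) );
const postRequest = assertWritableBody( buildMwApiRequest( 'post', apiUri, params ) );
```

With this shape, a GET request never carries a body at all, and a plain object passed as a body is reported at the call site rather than deep inside the request module.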

Change 373051 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/services/cxserver@master] Add default mwapi_req template configuration to config.yaml

https://gerrit.wikimedia.org/r/373051

Change 373061 had a related patch set uploaded (by KartikMistry; owner: KartikMistry):
[mediawiki/services/cxserver/deploy@master] Update mwapi_req template configuration

https://gerrit.wikimedia.org/r/373061

Change 373051 merged by jenkins-bot:
[mediawiki/services/cxserver@master] Add default mwapi_req template configuration to config.yaml

https://gerrit.wikimedia.org/r/373051

Change 373061 merged by jenkins-bot:
[mediawiki/services/cxserver/deploy@master] Update mwapi_req template configuration

https://gerrit.wikimedia.org/r/373061

I looked over the patches, and I don't think they will solve this problem. One thing to note is that in the code you sometimes use the GET method, but the template enforces POST, which is the correct way: the MW API should be used with the POST method only.

Also, you haven't addressed the other concerns in my previous comment, most notably the one pertaining to the request queue - making your own queue is error-prone and ultimately unnecessary. The same thing can be achieved with bluebird's .all() method with limiting concurrency.
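
For reference, concurrency-limited execution without a hand-rolled queue could look roughly like the sketch below. It uses native Promises only, to stay dependency-free; in bluebird the analogous call is `Promise.map( items, fn, { concurrency: n } )`. The names `mapWithConcurrency` and `fakeLookup` are hypothetical, and `fakeLookup` merely stands in for an MW API call.

```javascript
'use strict';

// Dependency-free sketch of a concurrency-limited map over async work.
// `mapWithConcurrency` is a hypothetical name, not a bluebird or cxserver API.
async function mapWithConcurrency( items, limit, worker ) {
	const results = new Array( items.length );
	let next = 0;

	// Each runner claims the next unprocessed index until none are left.
	async function runner() {
		while ( next < items.length ) {
			const index = next++;
			results[ index ] = await worker( items[ index ], index );
		}
	}

	const runners = [];
	for ( let i = 0; i < Math.min( limit, items.length ); i++ ) {
		runners.push( runner() );
	}
	await Promise.all( runners );
	return results;
}

// Stand-in for an MW API call; tracks how many lookups are in flight at once.
let inFlight = 0;
let maxInFlight = 0;
function fakeLookup( title ) {
	inFlight++;
	maxInFlight = Math.max( maxInFlight, inFlight );
	return new Promise( ( resolve ) => setTimeout( () => {
		inFlight--;
		resolve( title.toUpperCase() );
	}, 5 ) );
}

const lookupsDone = mapWithConcurrency( [ 'earth', 'moon', 'mars', 'venus' ], 2, fakeLookup );
```

Note that this still issues one call per item; it limits how many run at once but does not merge items into a single request.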

I looked over the patches, and I don't think they will solve this problem. One thing to note is that in the code you sometimes use the GET method, but the template enforces POST

I don't think this is true. We end up calling preq.get directly in those cases, and it seems to override request.method in my testing.

return preq[ method ]( request ).then( ( response ) => {

which is the correct way: the MW API should be used with the POST method only.

Could you elaborate on this? Why only POST?

Also, you haven't addressed the other concerns in my previous comment, most notably the one pertaining to the request queue - making your own queue is error-prone and ultimately unnecessary. The same thing can be achieved with bluebird's .all() method with limiting concurrency.

I did not design this system, but I believe the main purpose was to batch up multiple titles into one/few requests of same type. Asynchronous because there will be lots of data that is needed and doing it serially would slow it down.

Yes, the batching is required, since we process multiple titles in one go (all titles in an article section). We need to save API requests to the same domain, and the MW API accepts multiple titles per request, so we queue them. This is not a new practice: it is VisualEditor's ApiResponseCache code ported to Node.js. The concurrency option limits the number of API requests, but using bluebird's Promise.map would be inefficient, with one API call per title. Our requirement here is to collect as many titles as possible and pass them to the MW API as pipe-separated values. That is not exactly the concurrency-limiting feature that bluebird provides.
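
The batching idea described above can be sketched as follows. This is not the actual cxserver lib/mw code or the VisualEditor ApiResponseCache: `TitleBatcher` and `fakeFetchBatch` are hypothetical names, the default batch size of 50 is an assumption, and the fake fetch function only records the pipe-separated `titles` value that a real request would carry.

```javascript
'use strict';

// Hypothetical batching queue, not the actual cxserver lib/mw code.
// Callers ask for one title each; the queue sends one request per batch.
class TitleBatcher {
	constructor( fetchBatch, maxBatchSize = 50 ) {
		this.fetchBatch = fetchBatch; // ( titles ) => Promise of { title: info }
		this.maxBatchSize = maxBatchSize; // assumed per-request title limit
		this.pending = new Map(); // title -> { promise, resolve, reject }
		this.scheduled = false;
	}

	get( title ) {
		if ( !this.pending.has( title ) ) {
			const entry = {};
			entry.promise = new Promise( ( resolve, reject ) => {
				entry.resolve = resolve;
				entry.reject = reject;
			} );
			this.pending.set( title, entry );
		}
		if ( !this.scheduled ) {
			this.scheduled = true;
			// Flush after the current tick so synchronous callers share a batch.
			setImmediate( () => this.flush() );
		}
		return this.pending.get( title ).promise;
	}

	flush() {
		this.scheduled = false;
		const entries = Array.from( this.pending.entries() );
		this.pending.clear();
		for ( let i = 0; i < entries.length; i += this.maxBatchSize ) {
			const batch = entries.slice( i, i + this.maxBatchSize );
			this.fetchBatch( batch.map( ( [ title ] ) => title ) ).then(
				( result ) => batch.forEach( ( [ title, entry ] ) => entry.resolve( result[ title ] ) ),
				( error ) => batch.forEach( ( [ , entry ] ) => entry.reject( error ) )
			);
		}
	}
}

// Stand-in for one MW API request; records the piped `titles` value it would send.
const sentTitles = [];
function fakeFetchBatch( titles ) {
	sentTitles.push( titles.join( '|' ) );
	return Promise.resolve(
		Object.fromEntries( titles.map( ( t ) => [ t, { title: t, exists: true } ] ) )
	);
}

const batcher = new TitleBatcher( fakeFetchBatch );
const batched = Promise.all( [ 'Earth', 'Moon', 'Earth', 'Mars' ].map( ( t ) => batcher.get( t ) ) );
```

Here four lookups, one of them a duplicate, collapse into a single request for three titles, which is the saving that a per-item concurrency limit alone does not provide.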

I have updated cxserver in Production and this seems resolved. https://tools.wmflabs.org/sal/log/AV55rN2EF4fsM4DBdBW5

KartikMistry moved this task from QA to Done on the Language-2017-July-Sept board. · Sep 18 2017, 5:25 AM
Arrbee closed this task as Resolved. · Oct 3 2017, 7:23 AM
Arrbee claimed this task.