Page MenuHomePhabricator

Parsoid return broken status code '5 unknown'; leads to typeerror in RESTBase
Closed, DuplicatePublic

Description

An access to https://en.wikipedia.org/api/rest_v1/page/html/WWL_World_Trios_Championship/678143660

results in

TypeError: Cannot read property 'body' of undefined
    at /srv/deployment/restbase/deploy/restbase/sys/parsoid.js:322:37
    at tryCatcher (/srv/deployment/restbase/deploy/node_modules/bluebird/js/main/util.js:24:31)
    at Promise._settlePromiseFromHandler (/srv/deployment/restbase/deploy/node_modules/bluebird/js/main/promise.js:582:31)
    at Promise._settlePromiseAt (/srv/deployment/restbase/deploy/node_modules/bluebird/js/main/promise.js:727:18)
    at Promise._settlePromises (/srv/deployment/restbase/deploy/node_modules/bluebird/js/main/promise.js:845:14)
    at Async._drainQueue (/srv/deployment/restbase/deploy/node_modules/bluebird/js/main/async.js:79:16)
    at Async._drainQueues (/srv/deployment/restbase/deploy/node_modules/bluebird/js/main/async.js:89:10)
    at Async.drainQueues (/srv/deployment/restbase/deploy/node_modules/bluebird/js/main/async.js:14:14)
    at process._tickCallback (node.js:419:13)

Parsoid is returning a 503 for this title (http://parsoid-lb.eqiad.wikimedia.org/en.wikipedia.org/v1/page/html/WWL_World_Trios_Championship/678143660), and for some reason this does not cause an exception in RESTBase, which would prevent it from dereferencing the body.

Event Timeline

GWicke raised the priority of this task from to Medium.
GWicke updated the task description. (Show Details)
GWicke added a project: RESTBase.
GWicke subscribed.

It turns out that this is caused by Parsoid returning a garbled response:

gwicke@tin:~$ ab -v2 http://parsoid.svc.eqiad.wmnet:8000/en.wikipedia.org/v3/page/pagebundle/WWL_World_Trios_Championship/678143660
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking parsoid.svc.eqiad.wmnet (be patient)...INFO: POST header == 
---
GET /en.wikipedia.org/v3/page/pagebundle/WWL_World_Trios_Championship/678143660 HTTP/1.0
Host: parsoid.svc.eqiad.wmnet:8000
User-Agent: ApacheBench/2.3
Accept: */*


---
LOG: header received:
HTTP/1.1 5 unknown
X-Powered-By: Express
Access-Control-Allow-Origin: *
content-revision-id: 678143660
Content-Type: text/plain; charset=utf-8
Content-Length: 1353
ETag: W/"549-pl//g00OKIxv+W/Q72bpdg"
Vary: Accept-Encoding
Date: Wed, 16 Dec 2015 01:57:57 GMT
Connection: close

INVALID_CHARACTER_ERR (5): the string contains invalid characters
INVALID_CHARACTER_ERR: INVALID_CHARACTER_ERR (5): the string contains invalid characters
    at Object.exports.InvalidCharacterError (/srv/deployment/parsoid/deploy/node_modules/domino/lib/utils.js:19:52)
    at HTMLTableDataCellElement.setAttributeNS (/srv/deployment/parsoid/deploy/node_modules/domino/lib/Element.js:473:40)
    at DOMTreeBuilder.createElement (/srv/deployment/parsoid/deploy/node_modules/html5/lib/dom/DOMTreeBuilder.js:70:12)
    at DOMTreeBuilder.TreeBuilder.insertElement (/srv/deployment/parsoid/deploy/node_modules/html5/lib/TreeBuilder.js:2747:21)
    at Object.TreeBuilder.modes.inRow.startTagTableCell (/srv/deployment/parsoid/deploy/node_modules/html5/lib/TreeBuilder.js:2332:8)
    at Object.TreeBuilder.modes.base.processStartTag (/srv/deployment/parsoid/deploy/node_modules/html5/lib/TreeBuilder.js:146:40)
    at Object.TreeBuilder.modes.inCell.startTagTableOther (/srv/deployment/parsoid/deploy/node_modules/html5/lib/TreeBuilder.js:1516:23)
    at Object.TreeBuilder.modes.base.processStartTag (/srv/deployment/parsoid/deploy/node_modules/html5/lib/TreeBuilder.js:146:40)
    at DOMTreeBuilder.TreeBuilder.processToken (/srv/deployment/parsoid/deploy/node_modules/html5/lib/TreeBuilder.js:2664:17)
    at TreeBuilder.EventEmitter.emit (events.js:95:17)
WARNING: Response code not 2xx (5 u)
..done

Note the status line of HTTP/1.1 5 unknown.

This is then turned into a 503 by Varnish.

A patch to preq to treat broken status codes as an error is available in https://github.com/wikimedia/preq/pull/10.

GWicke renamed this task from 404 from Parsoid leads to typeerror in RESTBase to Parsoid return broken status code '5 unknown'; leads to typeerror in RESTBase.Dec 16 2015, 3:51 AM
GWicke added a project: Parsoid.
GWicke set Security to None.
GWicke updated the task description. (Show Details)

Minimal test <table data-sort-a+b="123">haha</table> ... working on a fix on the Parsoid end.

ssastry lowered the priority of this task from Medium to Low.Dec 16 2015, 11:52 PM
ssastry subscribed.

Unless quick, this fix can happen in the background with a low priority.

@ssastry: A quick improvement might be to make sure that a valid status code is set in any case. In doubt, return 500.

@ssastry: A quick improvement might be to make sure that a valid status code is set in any case. In doubt, return 500.

It's actually returning HTTP code 5 because we use the code from the error by default and domino sets codes on its errors,
https://github.com/fgnass/domino/blob/master/lib/DOMException.js#L122

Should probably catch rethrow that without the code. Patch forthcoming.

Change 259626 had a related patch set uploaded (by Arlolra):
Don't use code set on domino errors as http status code

https://gerrit.wikimedia.org/r/259626

Change 259626 merged by jenkins-bot:
Use httpStatus instead of code as the property on errors

https://gerrit.wikimedia.org/r/259626

Change 260535 had a related patch set uploaded (by Arlolra):
Use domino internal methods for setting attributes while parsing

https://gerrit.wikimedia.org/r/260535

Change 260703 had a related patch set uploaded (by Arlolra):
Match permitted attributes to php's getAttribsRegex

https://gerrit.wikimedia.org/r/260703

Change 260535 merged by jenkins-bot:
Note that DOM tree building uses restrictive checks

https://gerrit.wikimedia.org/r/260535