Can't retrieve HTML from REST API
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • When trying to query https://nl.wiktionary.org/api/rest_v1/page/html/%3B or https://ne.wikipedia.org/api/rest_v1/page/html/ढाँचा:;

What happens?:
Returns 400 response status code.

What should have happened instead?:
Successful 200 response status code with the HTML of the page.

Other information (browser name/version, screenshots, etc.):
Note that these pages actually exist: https://nl.wiktionary.org/wiki/; and https://ne.wikipedia.org/wiki/ढाँचा:;

Event Timeline

This may potentially be a RESTBase issue. Looking at the core REST API which also fetches Parsoid HTML, https://nl.wiktionary.org/w/rest.php/v1/page/%3B/html renders properly.

Interestingly on en.wiktionary.org there's a different error:

{"type":"https://mediawiki.org/wiki/HyperSwitch/errors/internal_error","method":"get","detail":"TypeError: Cannot read property 'toLowerCase' of undefined","uri":"/en.wiktionary.org/v1/page/html/"}

Judging by the uri field, it seems the semicolon is being dropped entirely.

Protsack.stephan renamed this task from Cant retrieve HTML from Rest API to Cant retrieve HTML from REST API. Apr 11 2023, 2:49 PM
Protsack.stephan renamed this task from Cant retrieve HTML from REST API to Can't retrieve HTML from REST API.

This seems to be happening in RB and is related to page language of the given page. See the stack trace below:

TypeError: Cannot read property 'toLowerCase' of undefined
    at mwUtil.getSiteInfo.then (.../restbase/lib/mwUtil.js:95:66)
    at tryCatcher (.../node_modules/bluebird/js/release/util.js:16:23)
    at Promise._settlePromiseFromHandler (.../node_modules/bluebird/js/release/promise.js:547:31)
    at Promise._settlePromise (.../node_modules/bluebird/js/release/promise.js:604:18)
    at Promise._settlePromise0 (.../node_modules/bluebird/js/release/promise.js:649:10)
    at Promise._settlePromises (.../node_modules/bluebird/js/release/promise.js:729:18)
    at _drainQueueStep (.../node_modules/bluebird/js/release/async.js:93:12)
    at _drainQueue (.../node_modules/bluebird/js/release/async.js:86:9)
    at Async._drainQueues (.../node_modules/bluebird/js/release/async.js:102:5)
    at Immediate.Async.drainQueues [as _onImmediate] (.../node_modules/bluebird/js/release/async.js:15:14)
    at runCallback (timers.js:705:18)
    at tryOnImmediate (timers.js:676:5)
    at processImmediate (timers.js:658:5)

The affected line is: #95 in mwUtil.js. Somehow, the page language for ; is undefined. I'm still trying to find out why.
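To make the failure mode concrete, here is a minimal sketch of how calling toLowerCase() on a missing page language produces this TypeError, and how a guard could turn it into an explicit "no language" result. The function names and the pagelanguage field are assumptions for illustration; this is not the code at mwUtil.js:95.

```javascript
// Hypothetical stand-in for a lookup that assumes the page language is set.
// Throws a TypeError when pageInfo.pagelanguage is undefined.
function unsafePageLanguage(pageInfo) {
    return pageInfo.pagelanguage.toLowerCase();
}

// Guarded variant: returns null when no usable language is present, so the
// caller can answer with a clear client error instead of an internal error.
function safePageLanguage(pageInfo) {
    const lang = pageInfo ? pageInfo.pagelanguage : undefined;
    if (typeof lang !== 'string' || lang === '') {
        return null;
    }
    return lang.toLowerCase();
}

try {
    unsafePageLanguage({}); // no pagelanguage, as when no title was resolved
} catch (err) {
    console.log(err instanceof TypeError); // true
}
console.log(safePageLanguage({}));                     // null
console.log(safePageLanguage({ pagelanguage: 'NL' })); // "nl"
```

A guard like this only hides the symptom; it would not explain why the title arrives empty in the first place.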

NOTE: I can see requests on logstash from en.wikipedia.org, but I can't find anything on logstash for nl.wiktionary.org.

Looking at the content-location response header, it looks like the page ; is entirely ignored.

access-control-allow-headers: accept,content-type,content-length,cache-control,accept-language,api-user-agent,if-match,if-modified-since,if-none-match,dnt,accept-encoding 
 access-control-allow-methods: GET,HEAD 
 access-control-allow-origin: * 
 access-control-expose-headers: etag 
 age: 1 
 cache-control: no-cache 
 content-length: 126 
 content-location: https://nl.wiktionary.org/api/rest_v1/page/html/ 
 content-security-policy: default-src 'none'; frame-ancestors 'none' 
 content-type: application/problem+json 
 date: Wed,12 Apr 2023 17:17:52 GMT 
 nel: { "report_to": "wm_nel","max_age": 604800,"failure_fraction": 0.05,"success_fraction": 0.0} 
 referrer-policy: origin-when-cross-origin 
 report-to: { "group": "wm_nel","max_age": 604800,"endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] } 
 server: restbase1031 
 server-timing: cache;desc="pass",host;desc="cp6009" 
 strict-transport-security: max-age=106384710; includeSubDomains; preload 
 vary: Accept-Encoding 
 x-cache: cp6010 miss,cp6009 pass 
 x-cache-status: pass 
 x-client-ip: 129.0.102.4 
 x-content-security-policy: default-src 'none'; frame-ancestors 'none' 
 x-content-type-options: nosniff 
 x-frame-options: SAMEORIGIN 
 x-webkit-csp: default-src 'none'; frame-ancestors 'none' 
 x-xss-protection: 1; mode=block

That explains the undefined page language: according to the error, no page title is supplied in the URL at all.
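A small model of what the content-location header suggests: if some layer in front of RESTBase treats ";" as the start of path parameters and drops it, the title segment ends up empty. The thread does not establish which component does this or how, so stripPathParams and titleFromPath below are hypothetical helpers written only to illustrate the effect.

```javascript
// Hypothetical model of an intermediary that treats ";" as a path-parameter
// delimiter and drops it, plus anything after it, from every path segment.
function stripPathParams(path) {
    return path
        .split('/')
        .map((segment) => segment.split(';')[0])
        .join('/');
}

// Extract the (decoded) title that follows ".../page/html/".
function titleFromPath(path) {
    const prefix = '/page/html/';
    const idx = path.indexOf(prefix);
    if (idx === -1) {
        return undefined;
    }
    return decodeURIComponent(path.slice(idx + prefix.length));
}

const forwarded = stripPathParams('/api/rest_v1/page/html/;');
console.log(forwarded);                // "/api/rest_v1/page/html/"
console.log(titleFromPath(forwarded)); // "" (no title left)

// With a namespace prefix, only the namespace and colon survive.
console.log(titleFromPath(stripPathParams('/api/rest_v1/page/html/ढाँचा:;'))); // "ढाँचा:"
```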

Added another example to the ticket description https://ne.wikipedia.org/api/rest_v1/page/html/ढाँचा:; returns 400 as well.

This also returns a 400 because the newFromText() title validation check splits ढाँचा:; into namespace ढाँचा and page name ;, so we're still dealing with the ; page. I think the issue is actually with validating ; as a page name in RB. With or without a namespace, ; as a page name will cause _checkEmptyTitle() to fail and throw a 400.

Relevant code in node_modules/mediawiki-title/lib/index.js

// Initial colon indicates main namespace rather than specified default
// but should not create invalid {ns,title} pairs such as {0,Project:Foo}
if (title !== '' && title[0] === ':') {
     title = title.substr(1).replace(/^_+/, '');
     defaultNs = 0;
}

_checkEmptyTitle(title);
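As a rough illustration of why an empty-title check matters here, the sketch below applies such a check before and after a naive namespace split. It is a simplified, hypothetical model, not the mediawiki-title implementation: parseTitle treats any text before the first colon as a namespace, whereas the real library consults the wiki's site info.

```javascript
// Simplified stand-in for an empty-title check.
function checkEmptyTitle(title) {
    if (title === '') {
        throw new Error('Empty title');
    }
}

// Naive split on the first colon; assumes every prefix is a namespace.
function parseTitle(text) {
    checkEmptyTitle(text);
    const idx = text.indexOf(':');
    const namespace = idx === -1 ? '' : text.slice(0, idx);
    const page = idx === -1 ? text : text.slice(idx + 1);
    checkEmptyTitle(page);
    return { namespace, page };
}

console.log(parseTitle(';'));       // { namespace: '', page: ';' }
console.log(parseTitle('ढाँचा:;')); // { namespace: 'ढाँचा', page: ';' }

// If the ";" never reaches the parser, the page name is empty and the check throws.
for (const text of ['', 'ढाँचा:']) {
    try {
        parseTitle(text);
    } catch (err) {
        console.log(JSON.stringify(text), '->', err.message); // "Empty title"
    }
}
```

In this model ";" on its own passes the check; the failure only appears when the title arrives without it.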

Similarly, https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/html/Template:; also fails with a 400 but I've not been able to reproduce this error locally.

Locally, I can fetch the HTML of this page correctly but on production, I can't.

{F36954443}

Interestingly, this one fails with "too many redirects":

  • https://el.wikipedia.org/wiki/Πρότυπο:Αναξιόπιστη_πηγή;
  • https://el.wikipedia.org/api/rest_v1/page/html/Πρότυπο:Αναξιόπιστη_πηγή;

Change 912312 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] Varnish/ATS semicolon workaround for Restbase

https://gerrit.wikimedia.org/r/912312

We had a brief meeting on this, and I think the actual problem, and the immediate workaround, are much simpler than we imagined. We're going to apply the same workaround we did for MediaWiki traffic in T238285 ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/882663/ ) to the Restbase traffic for now. Patch incoming shortly!

Change 912312 merged by BBlack:

[operations/puppet@production] Varnish/ATS semicolon workaround for Restbase

https://gerrit.wikimedia.org/r/912312

The patch has been rolled out everywhere for a little while at this point, so it should now be possible to confirm success.
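For anyone wanting to re-check the affected URLs, a minimal sketch follows. It assumes Node.js 18 or later (for the global fetch); buildRestHtmlUrl, checkPage, and the User-Agent string are hypothetical names made up for this example.

```javascript
// Build the REST HTML URL for a title, percent-encoding the title segment
// (";" becomes "%3B", matching the URL in the task description).
function buildRestHtmlUrl(host, title) {
    return `https://${host}/api/rest_v1/page/html/${encodeURIComponent(title)}`;
}

// Fetch the page and report the HTTP status. Performs a network request.
async function checkPage(host, title) {
    const url = buildRestHtmlUrl(host, title);
    const res = await fetch(url, {
        headers: { 'User-Agent': 'semicolon-check/0.1 (example script)' },
    });
    return { url, status: res.status };
}

console.log(buildRestHtmlUrl('nl.wiktionary.org', ';'));
// "https://nl.wiktionary.org/api/rest_v1/page/html/%3B"

// Usage (not run automatically because it hits the network):
// Promise.all([
//     checkPage('nl.wiktionary.org', ';'),
//     checkPage('ne.wikipedia.org', 'ढाँचा:;'),
// ]).then((results) => console.log(results));
```

A 200 status for both pages would match the expected behaviour stated in the task description.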

@Protsack.stephan looking good from your side for URLs that have been having problems? Everything else also looking okay?

Thanks @dr0ptp4kt. Looks good from my side; I don't see any new errors showing up in the DLQ.

dr0ptp4kt claimed this task.

Okay, closing for now. Please re-open in case an obvious duplicate or side effect emerges from the fix.