REST `/page/html/` API endpoint does not follow redirects as documented & demoed
Open, Needs Triage | Public | BUG REPORT

Description

We use the rest_v1 API (now that the superior JSON 'mobile' endpoint has been removed) to provide popups of WP articles, calling /page/html/ for the HTML content to render. This works except for popups of links to redirects, which fail to show the expected content and instead show a stub or title at most. Investigating, we find that the /page/html/ endpoint is returning just the HTML content of the *redirect page*. For example, the 'Errol Flynn' article links to 'Vancouver, British Columbia', which redirects to just 'Vancouver'. This would be fine except that the API call for that yields this (trimmed for clarity):

$ curl --location 'https://en.wikipedia.org/api/rest_v1/page/html/Vancouver%2C_British_Columbia'
<!DOCTYPE html>
...
<b>This page is a <a rel="mw:WikiLink" href="./Wikipedia:Redirect" title="Wikipedia:Redirect">redirect</a>. <small>The following <a rel="mw:WikiLink" href="./Wikipedia:Categorizing_redirects" title="Wikipedia:Categorizing redirects">categories</a> are used to track and monitor this redirect:</small></b>
<div class="rcat rcat-R_from_Canadian_settlement_name">
<ul><li><b><a rel="mw:WikiLink" href="./Category:Redirects_from_Canadian_settlement_names" title="Category:Redirects from Canadian settlement names">From a Canadian settlement name</a></b>: This is a redirect from an article title related to a Canadian settlement. This title has been redirected in accordance with the article naming conventions for <a rel="mw:WikiLink" href="./Canada" title="Canada">Canada</a>-related articles in the <a rel="mw:WikiLink" href="./Wikipedia:CANSTYLE" title="Wikipedia:CANSTYLE" class="mw-redirect">Manual of Style</a>.</li></ul>
</div><link rel="mw:PageProp/Category" href="./Category:Redirects_from_Canadian_settlement_names"/>
<i><small>When appropriate, <a rel="mw:WikiLink" href="./Wikipedia:Protection_policy" title="Wikipedia:Protection policy">protection levels</a> are automatically sensed, described and categorized.</small></i></div></td></tr></tbody></table></section></body></html>

This is surprising because one would expect it to follow the redirect, and the /summary/ endpoint does:

$ curl -L 'https://en.wikipedia.org/api/rest_v1/page/summary/Vancouver%2C_British_Columbia'
{"type":"standard","title":"Vancouver","displaytitle":"<span class=\"mw-page-title-main\">Vancouver</span>","namespace":{"id":0,"text":""},"wikibase_item":"Q24639","titles":{"canonical":"Vancouver","normalized":"Vancouver","display":"<span class=\"mw-page-title-main\">Vancouver</span>"},"pageid":32706,"thumbnail":{"source":"https://upload.wikimedia.org/wikipedia/commons/thumb/5/57/Concord_Pacific_Master_Plan_Area.jpg/320px-Concord_Pacific_Master_Plan_Area.jpg","width":320,"height":180},"originalimage":{"source":"https://upload.wikimedia.org/wikipedia/commons/5/57/Concord_Pacific_Master_Plan_Area.jpg","width":4356,"height":2450},"lang":"en","dir":"ltr","revision":"1164544333","tid":"1ea1dc40-2699-11ee-9e66-3fbfbf713148","timestamp":"2023-07-09T18:26:31Z","description":"City in British Columbia, Canada","description_source":"local","content_urls":{"desktop":{"page":"https://en.wikipedia.org/wiki/Vancouver","revisions":"https://en.wikipedia.org/wiki/Vancouver?action=history","edit":"https://en.wikipedia.org/wiki/Vancouver?action=edit","talk":"https://en.wikipedia.org/wiki/Talk:Vancouver"},"mobile":{"page":"https://en.m.wikipedia.org/wiki/Vancouver","revisions":"https://en.m.wikipedia.org/wiki/Special:History/Vancouver","edit":"https://en.m.wikipedia.org/wiki/Vancouver?action=edit","talk":"https://en.m.wikipedia.org/wiki/Talk:Vancouver"}},"extract":"Vancouver is a major city in western Canada, located in the Lower Mainland region of British Columbia. As the most populous city in the province, the 2021 Canadian census recorded 662,248 people in the city, up from 631,486 in 2016. The Greater Vancouver area had a population of 2.6 million in 2021, making it the third-largest metropolitan area in Canada. Greater Vancouver, along with the Fraser Valley, comprises the Lower Mainland with a regional population of over 3 million. 
Vancouver has the highest population density in Canada, with over 5,700 people per square kilometre, and fourth highest in North America.","extract_html":"<p><b>Vancouver</b> is a major city in western Canada, located in the Lower Mainland region of British Columbia. As the most populous city in the province, the 2021 Canadian census recorded 662,248 people in the city, up from 631,486 in 2016. The Greater Vancouver area had a population of 2.6<span class=\"nowrap\"> </span>million in 2021, making it the third-largest metropolitan area in Canada. Greater Vancouver, along with the Fraser Valley, comprises the Lower Mainland with a regional population of over 3<span class=\"nowrap\"> </span>million. Vancouver has the highest population density in Canada, with over 5,700 people per square kilometre, and fourth highest in North America.</p>"}

OK, so perhaps /page/html/ just has a bad default setting which we will have to override? Going to the docs, we are sent to https://en.wikipedia.org/api/rest_v1/#/Page%20content/get_page_html__title_

This page strongly implies that /page/html/ *should* be following the redirects:

Requests for redirect pages return HTTP 302 with a redirect target in Location header and content in the body. To get a 200 response instead, supply false to the redirect parameter.
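
For reference, here is a minimal sketch of how the documented `redirect` parameter would be passed, assuming it is an ordinary query-string parameter as the sentence quoted above suggests. `pageHtmlUrl` and `REST_BASE` are hypothetical names for illustration, not part of any WMF library.

```javascript
// Hypothetical helper: build a /page/html/ URL for a page title.
// Assumption: `redirect` is a plain query parameter, per the quoted docs.
const REST_BASE = "https://en.wikipedia.org/api/rest_v1";

function pageHtmlUrl(title, { redirect } = {}) {
  // Titles use underscores for spaces; everything else is percent-encoded.
  const encoded = encodeURIComponent(title.replace(/ /g, "_"));
  const url = `${REST_BASE}/page/html/${encoded}`;
  // Only append the parameter when the caller explicitly sets it.
  return redirect === undefined ? url : `${url}?redirect=${redirect}`;
}

console.log(pageHtmlUrl("Vancouver, British Columbia"));
console.log(pageHtmlUrl("Vancouver, British Columbia", { redirect: false }));
```

The first call reproduces the URL used in the curl commands in this report; the second is the form that, per the documentation, should return the redirect page itself with a 200.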

So it has redirect=true as the default, and if for some reason we need the redirect page itself, we can set redirect=false. Except... then why doesn't it do that? Let's double check with the executable example. We put in Vancouver,_British_Columbia as the target page, and hit 'Execute'.

It generates the example command

curl -X 'GET' \
  'https://en.wikipedia.org/api/rest_v1/page/html/Vancouver%2C_British_Columbia' \
  -H 'accept: text/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/2.1.0"'

which then executes and yields... the expected HTML content of 'Vancouver' (and not 'Vancouver, British Columbia' the redirect page)

...
<p id="mwIA"><b id="mwIQ">Vancouver</b> (<span class="rt-commentedText nowrap" about="#mwt65" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"IPAc-en","href":"./Template:IPAc-en"},"params":{"audio":{"wt":"EN-Vancouver.ogg"},"1":{"wt":"v"},"2":{"wt":"æ"},"3":{"wt":"n"},"4":{"wt":"ˈ"},"5":{"wt":"k"},"6":{"wt":"uː"},"7":{"wt":"v"},"8":{"wt":"ər"}},"i":0}}]}' id="mwIg"><span class="IPA nopopups noexcerpt" lang="en-fonipa"><a rel="mw:WikiLink" href="./Help:IPA/English" title="Help:IPA/English">/<span style="border-bottom:1px dotted"><span title="'v' in 'vie'">v</span><span title="/æ/: 'a' in 'bad'">æ</span><span title="'n' in 'nigh'">n</span><span title="/ˈ/: primary stress follows">ˈ</span><span title="'k' in 'kind'">k</span><span title="/uː/: 'oo' in 'goose'">uː</span><span title="'v' in 'vie'">v</span><span title="/ər/: 'er' in 'letter'">ər</span></span>/</a></span><span typeof="mw:Entity"> </span><span class="nowrap" style="font-size:85%">(<span class="unicode haudio"><span class="fn"><span style="white-space:nowrap;margin-right:.25em;"><span typeof="mw:File" data-mw='{"caption":"About this sound"}'><a href="./File:EN-Vancouver.ogg" title="About this sound"><img alt="" resource="./File:Loudspeaker.svg" src="//upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Loudspeaker.svg/11px-Loudspeaker.svg.png" decoding="async" data-file-width="20" data-file-height="20" data-file-type="drawing" height="11" width="11" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Loudspeaker.svg/17px-Loudspeaker.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Loudspeaker.svg/22px-Loudspeaker.svg.png 2x" class="mw-file-element"/></a></span></span><a rel="mw:MediaLink" href="//upload.wikimedia.org/wikipedia/commons/4/44/EN-Vancouver.ogg" resource="./Media:EN-Vancouver.ogg" title="EN-Vancouver.ogg">listen</a></span><link rel="mw:PageProp/Category" href="./Category:Articles_with_hAudio_microformats"/></span>)</span></span><link 
rel="mw:PageProp/Category" href="./Category:Pages_including_recorded_pronunciations" about="#mwt65" id="mwIw"/> <a rel="mw:WikiLink" href="./Help:Pronunciation_respelling_key" title="Help:Pronunciation respelling key" about="#mwt66" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"respell","href":"./Template:Respell"},"params":{"1":{"wt":"van"},"2":{"wt":"KOO"},"3":{"wt":"vər"}},"i":0}}]}' id="mwJA"><i title="English pronunciation respelling">van-<span style="font-size:90%">KOO</span>-vər</i></a>) is a major city in <a rel="mw:WikiLink" href="./Western_Canada" title="Western Canada" id="mwJQ">western Canada</a>, located in the <a rel="mw:WikiLink" href="./Lower_Mainland" title="Lower Mainland" id="mwJg">Lower Mainland</a> region of <a rel="mw:WikiLink" href="./British_Columbia" title="British Columbia" id="mwJw">British Columbia</a>. As the <a rel="mw:WikiLink" href="./List_of_cities_in_British_Columbia" title="List of cities in British Columbia" id="mwKA">most populous city</a> in the province, the <a rel="mw:WikiLink" href="./2021_Canadian_census" title="2021 Canadian census" id="mwKQ">2021 Canadian census</a> recorded 662,248 people in the city, up from 631,486 in 2016. The <a rel="mw:WikiLink" href="./Greater_Vancouver" title="Greater Vancouver" id="mwKg">Greater Vancouver area</a> had a population of 2.6<span class="nowrap" about="#mwt67" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"Nbsp","href":"./Template:Nbsp"},"params":{},"i":0}}]}' id="mwKw"><span typeof="mw:Entity"> </span></span>million in 2021, making it the <a rel="mw:WikiLink" href="./List_of_census_metropolitan_areas_and_agglomerations_in_Canada#List" title="List of census metropolitan areas and agglomerations in Canada" id="mwLA">third-largest metropolitan area in Canada</a>. 
Greater Vancouver, along with the <a rel="mw:WikiLink" href="./Fraser_Valley" title="Fraser Valley" id="mwLQ">Fraser Valley</a>, comprises the Lower Mainland with a regional population of over 3<span class="nowrap" about="#mwt68" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"nbsp","href":"./Template:Nbsp"},"params":{},"i":0}}]}' id="mwLg"><span typeof="mw:Entity"> </span></span>million. Vancouver has the highest population density in Canada, with over 5,700 people per square kilometre,<sup about="#mwt69" class="mw-ref reference" id="cite_ref-:0_6-0" rel="dc:references" typeof="mw:Extension/ref" data-mw='{"name":"ref","attrs":{"name":":0"}}'><a href="./Vancouver#cite_note-:0-6" style="counter-reset: mw-Ref 6;" id="mwLw"><span class="mw-reflink-text" id="mwMA">[6]</span></a></sup> and fourth highest in North America (after <a rel="mw:WikiLink" href="./New_York_City" title="New York City" id="mwMQ">New York City</a>, <a rel="mw:WikiLink" href="./San_Francisco" title="San Francisco" id="mwMg">San Francisco</a>, and <a rel="mw:WikiLink" href="./Mexico_City" title="Mexico City" id="mwMw">Mexico City</a>).</p>
...

That is what we want, but aren't getting. But wait, how does that curl command differ *at all*? We are already doing a 'GET' by default, so the '-X GET' should be redundant, and none of the arguments in '-H' seem to pertain to redirects: there is no reason that I should have to specify that I accept text/html when curl defaults to accepting everything, and this 'profile' argument is not mentioned at all in the /page/html/ documentation. But let's try it locally anyway...

This is what I get when I run the exact same command, which is generated by the documentation and works as desired on the WMF server, on my local machine:

$ curl -X 'GET' \
>   'https://en.wikipedia.org/api/rest_v1/page/html/Vancouver%2C_British_Columbia' \
>   -H 'accept: text/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/2.1.0"'
<!DOCTYPE html>
...<b>This page is a <a rel="mw:WikiLink" href="./Wikipedia:Redirect" title="Wikipedia:Redirect">redirect</a>. <small>The following <a rel="mw:WikiLink" href="./Wikipedia:Categorizing_redirects" title="Wikipedia:Categorizing redirects">categories</a> are used to track and monitor this redirect:</small></b>
<div class="rcat rcat-R_from_Canadian_settlement_name">
<ul><li><b><a rel="mw:WikiLink" href="./Category:Redirects_from_Canadian_settlement_names" title="Category:Redirects from Canadian settlement names">From a Canadian settlement name</a></b>: This is a redirect from an article title related to a Canadian settlement. This title has been redirected in accordance with the article naming conventions for <a rel="mw:WikiLink" href="./Canada" title="Canada">Canada</a>-related articles in the <a rel="mw:WikiLink" href="./Wikipedia:CANSTYLE" title="Wikipedia:CANSTYLE" class="mw-redirect">Manual of Style</a>.</li></ul>
</div><link rel="mw:PageProp/Category" href="./Category:Redirects_from_Canadian_settlement_names"/>
<i><small>When appropriate, <a rel="mw:WikiLink" href="./Wikipedia:Protection_policy" title="Wikipedia:Protection policy">protection levels</a> are automatically sensed, described and categorized.</small></i></div></td></tr></tbody></table></section></body></html>

That is, it's just as broken as before. Am I going crazy? I'm running the exact same thing within seconds against a public API, how can it not be returning the same thing? The only difference should be what IP we're at, and that should definitely have nothing to do with following redirects!

Deleting the various arguments does nothing. After some tinkering, what I eventually find by sheer accident is that I have to add --location *and* keep the accept header (despite curl sending accept: */* by default...?) *and* keep the profile argument:

$ curl --silent -L 'https://en.wikipedia.org/api/rest_v1/page/html/Vancouver%2C_British_Columbia'   -H 'accept: */*; profile="https://www.mediawiki.org/wiki/Specs/HTML/2.1.0"'
...
n><link rel="mw:PageProp/Category" href="./Category:Use_mdy_dates_from_June_2023" about="#mwt12" id="mwDQ"/></p>
<div class="shortdescription nomobile noexcerpt noprint searchaux" style="display:none" about="#mwt22" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"Infobox settlement\n","href":"./Template:Infobox_settlement"},"params":{"name":{"wt":"Vancouver"},"official_name":{"wt":"City of Vancouver"},"settlement_type":{"wt":"City"},"image_skyline":{"wt":"{{multiple image\n| total_width              = 280\n| border                   = infobox\n| caption_align            = center\n| perrow                   = 1/2/2/2\n| image1                   = Concord Pacific Master Plan Area.jpg\n| caption1                 = [[Downtown Vancouver|Downtown Vancouver skyline]]\n| image2                   = Vancouver (BC, Canada), Canada Place -- 2022 -- 1847.jpg\n| caption2                 = [[Canada Place]]\n| image3                   = Stanley Park, Vancouver (7889964786).jpg\n| caption3                 = [[Stanley Park]]\n| image4                   = Science world (focusedcapture) 2 - Flickr.jpg\n| caption4                 = [[Science World (Vancouver)|Science World]]\n| image5                   = Vancouver Art Gallery (46588958915).jpg\n| caption5                 = [[Vancouver Art Gallery]]\n| image6                   = Gastown Steam Clock by Kiyokun.JPG\n| caption6                 = [[Gastown]]\n| image7                   = Vancouver (BC, Canada), Harbour Centre -- 2022 -- 1843.jpg\n| caption7                 = [[Harbour Centre]]\n}}"},"image_flag":{"wt":"Flag of Vancouver.svg"},"image_shield":{"wt":"Coat of arms of Vancouver.svg"},"image_blank_emblem":{"wt":"Vancouverlogo.svg"},"blank_emblem_type":{"wt":"Logo"},"motto":{"wt":"\"By sea land and air we prosper\""},"image_map":{"wt":"{{Maplink|frame=yes|plain=y |frame-width=300 |frame-height=200 |frame-align=center|zoom=4|type=point|title=Vancouver|marker=city|type2=shape |stroke-width2=2 |stroke-color2=#808080}}"},"map_caption":{"wt":"Interactive map of 
Vancouver"},"image_map1":{"wt":"GVRDVancouver.svg"},"map_caption1":{"wt":"Location of Vancouver in Metro Vancouver"},"coordinates":{"wt":"{{coord|49|15|40|N|123|06|50|W|region:CA-BC|notes=&lt;ref name=\"JBRIK\">{{Cite cgndb|JBRIK|Vancouver}}&lt;/ref>|display=inline}}"},"subdivision_type":{"wt":"Country"},"subdivision_name":{"wt":"Canada"},"subdivision_type1":{"wt":"Province"},"subdivision_type2":{"wt":"[[Regional district]]"},"subdivision_nam
...
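
The combination that worked in curl can be sketched for the Fetch API as follows. `pageHtmlRequestOptions` and `HTML_PROFILE` are hypothetical names; the profile URL is copied from the command the documentation page generated. Whether the server honors these headers consistently is exactly what this report questions, so treat this as a sketch rather than a fix.

```javascript
// Profile URL copied from the curl command generated by the docs page.
const HTML_PROFILE = "https://www.mediawiki.org/wiki/Specs/HTML/2.1.0";

// Hypothetical helper: fetch() options mirroring `curl -L -H 'accept: ...'`.
function pageHtmlRequestOptions(mediaType = "text/html") {
  return {
    method: "GET",
    // Equivalent of curl's -L/--location: follow the 302 automatically.
    // ("follow" is already fetch's default; it is spelled out for clarity.)
    redirect: "follow",
    headers: {
      accept: `${mediaType}; charset=utf-8; profile="${HTML_PROFILE}"`,
    },
  };
}

console.log(pageHtmlRequestOptions().headers.accept);
```

Usage would look like `fetch(url, pageHtmlRequestOptions())`; passing `"*/*"` as the media type gives a header close to the one in the working curl command above.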

OK, so since XMLHttpRequest already follows redirects, we just need to add the accept line, and... now it works?

Well, as long as you aren't trying it in devtools, apparently. My co-developer Said Achmiz reports that adding it works in curl & our popups JS, but frighteningly, it does *not* work in browser devtools (reverting to the non-redirecting behavior), apparently because of other, irrelevant, HTTP headers being added like cache-control headers (?!). This is very alarming - what is going on with this endpoint? (This is far from the first problem we've had with it, which is why we are upset the JSON endpoint is gone. For example, why do <title> elements *half* escape their contents, so they look like <title>&lt;i>foo bar&lt;/i></title>?)

So it looks like:

  1. the executable documentation lies to the developer reading it. The curl commands generated are *not* the ones actually being run, or perhaps you are using a weird curl or you are changing default settings.

    In reality, all these curl commands are implicitly being run with -L/--location and that's omitted from the claimed command, which then breaks it. That's bad and you should not do that. (I don't care if this is supposedly documented somewhere in the endless sprawling WMF documentation, down in the basement in a disused lavatory with a sign 'beware of the leopard' where that is vaguely hinted at in a way which makes sense only if you already know what it means because you ran into it the hard way years ago.) Put all the necessary options into the executable demos!
  2. the documentation omits any mention of the need for this profile, and the part about redirects is erroneous: it doesn't follow redirects, regardless of the redirect arguments, and just does whatever this profile says.
  3. something is very cursed about this API endpoint. Please exorcise it.

Event Timeline

When I run curl --location 'https://en.wikipedia.org/api/rest_v1/page/html/Vancouver%2C_British_Columbia' I correctly get the expected HTML from [[Vancouver]], i.e. it emits a 302 to Vancouver which curl follows. I can't reproduce the devtools thing either in Firefox.

Can you capture and paste the headers of a bad request? That'll tell us what backend servers you're hitting, whether it's cached, etc.
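
One way to capture those headers from a browser console or Node 18+, as a sketch. `headersToObject` and `captureHeaders` are hypothetical helpers; this thread does not say which header names identify the backend or cache status, so the sketch simply dumps everything it can see.

```javascript
// Flatten any iterable of [name, value] pairs (a Fetch API Headers object,
// a Map, or an array of pairs) into a plain object with lowercase names.
function headersToObject(headers) {
  const out = {};
  for (const [name, value] of headers) {
    out[name.toLowerCase()] = value;
  }
  return out;
}

// Fetch a URL and report status, final URL, and visible response headers.
// Note: for cross-origin requests, browsers expose only CORS-safelisted
// response headers unless the server opts in via
// Access-Control-Expose-Headers, so `curl -v` may show more.
async function captureHeaders(url, options = {}) {
  const response = await fetch(url, options);
  return {
    status: response.status,
    redirected: response.redirected,
    finalUrl: response.url,
    headers: headersToObject(response.headers),
  };
}
```

Calling `captureHeaders(url).then(console.log)` once for a good response and once for a bad one would give two objects that can be pasted here and compared.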

For example, why do <title> elements *half* escape their contents, so they look like <title>&lt;i>foo bar&lt;/i></title>?

This is T324431: Parsoid: displaytitle HTML now appearing in <title> element rather than page title.

  3. something is very cursed about this API endpoint. Please exorcise it.

Comments like this are unnecessary and unproductive.

Additional examples include https://en.wikipedia.org/wiki/Entrepreneur / https://en.wikipedia.org/api/rest_v1/page/html/Entrepreneur and https://en.wikipedia.org/wiki/Chief_Executive_Officer

Part of the problem is that it is difficult to replicate consistently: how can we debug the headers or the page or any of the variables when the same command returns different results, both good and bad, at different times for different people? Achmiz brought it up today while debugging popup failures of WP articles where only the redirect page text is returned, and noted that curl --location https://en.wikipedia.org/api/rest_v1/page/html/Entrepreneur would show the error (which explains why the popup broke); I ran that command within a few minutes of him, and I instead got the final page's expected text all about entrepreneurship. And this has happened before as well: I would get the bad redirect page, but then he, running the same command, would get the good final page.
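
To make the flip-flopping easier to log, a response body can be classified as "redirect page" or "final article". The sketch below assumes that Parsoid marks redirect pages with a `<link rel="mw:PageProp/redirect" href="./Target"/>` element; that marker does not appear in the trimmed output quoted in this report, so it is an assumption to verify. `classifyPageHtml` is a hypothetical helper.

```javascript
// Assumption: Parsoid HTML for a redirect page contains
//   <link rel="mw:PageProp/redirect" href="./Target"/>
// This is not visible in the trimmed output quoted in this report.
function classifyPageHtml(html) {
  // Find the redirect <link> tag regardless of attribute order.
  const tag = html.match(/<link\b[^>]*\brel="mw:PageProp\/redirect"[^>]*>/i);
  if (!tag) return { kind: "article", target: null };
  // Extract the target title from href="./Title", dropping any #fragment.
  const href = tag[0].match(/\bhref="\.\/([^"#]*)/i);
  return {
    kind: "redirect-stub",
    target: href ? decodeURIComponent(href[1]) : null,
  };
}

console.log(classifyPageHtml('<link rel="mw:PageProp/redirect" href="./Vancouver"/>'));
```

Logging the `kind` next to a timestamp for each request would show when, and for whom, the endpoint switches between the two behaviors.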

In the browser JS console, requests that seem like they should work fail:

  let url = "https://en.wikipedia.org/api/rest_v1/page/html/Entrepreneur";
  let method = "GET";
  let req = new XMLHttpRequest();
  req.addEventListener("load", (event) => { console.log(event.target); });
  req.open(method, url);
  req.send();

returns the redirect page.

  let url = "https://en.wikipedia.org/api/rest_v1/page/html/Entrepreneur";
  let method = "GET";
  let req = new XMLHttpRequest();
  req.addEventListener("load", (event) => { console.log(event.target); });
  req.open(method, url);
  req.setRequestHeader("accept", 'text/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/2.1.0"');
  req.send();

*used* to work, but now returns the redirect page as well.

As does opening https://en.wikipedia.org/api/rest_v1/page/html/Entrepreneur in a browser tab normally: the redirect page. However, for Achmiz, it showed the final page the first time or so, and then showed the redirect thereafter (using the same browser/URL). I hard-refreshed a few dozen times in FF & Chromium but didn't get the final page.

(Our current workaround is to avoid redirects by a brute-force approach of periodically parsing all WP links, checking for redirects, and updating a manual database of URL rewrites to point to the final URL.)
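
A rough sketch of what such a rewrite lookup could look like. The table contents and the names `rewrites` and `resolveTitle` are illustrative assumptions, not our actual implementation; the single entry is the redirect from the example in the description.

```javascript
// Hypothetical rewrite table: redirect title -> final title.
// In the workaround described above, this would be regenerated periodically.
const rewrites = new Map([
  ["Vancouver,_British_Columbia", "Vancouver"],
]);

// Resolve a title through the table, following chains of rewrites but
// guarding against cycles (A -> B -> A) with a visited set.
function resolveTitle(title, table = rewrites) {
  const seen = new Set();
  let current = title;
  while (table.has(current) && !seen.has(current)) {
    seen.add(current);
    current = table.get(current);
  }
  return current;
}

console.log(resolveTitle("Vancouver,_British_Columbia"));
```

Titles absent from the table pass through unchanged, so the lookup can be applied to every link before the /page/html/ request is made.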