
Reader reads a page online
Closed, Resolved | Public | 2 Estimated Story Points

Description

"As a Reader, I want to get a page online, so that I can read it with my browser or HTML widget and it will load fast."

Downloading a large document encoded in JSON, then loading the HTML from the JSON into a browser or native HTML widget, is much less efficient than letting the browser or widget download the HTML itself. So, if the user is "online", we want to have two endpoints: one for the JSON representation of the page without HTML, and one for HTML only.

Note: TWO endpoints follow.

GET /page/{title}/bare

Returns the page as JSON. Slashes in the title must be escaped.

Payload: empty

Notable request headers: none

Status:
200 – this is the page
404 – page does not exist (never created or deleted)

Notable response headers: none

Body: JSON

  • id: numeric id of the page
  • key: prefixed DB key of the page, like "Talk:Main_Page"
  • title: title for display, like "Talk:Main Page"
  • latest: latest revision of the page, object with these properties
    • id: revision ID
    • timestamp: revision timestamp
  • license: Object for the preferred license of the page, including these properties:
    • spdx: SPDX code
    • url: URL for the license
    • title: title of the license
  • other_licenses: array of objects with {spdx, url, title} for other licenses the page is available under
  • content_model: content model for the main slot of the page
  • html_url: URL for the HTML stream for the page
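
As an illustration of the response shape listed above, here is a minimal client-side sketch. It assumes that "escaped for slashes" means percent-encoding "/" as %2F, and the names bare_url, parse_bare, BarePage and License are hypothetical helpers, not part of any MediaWiki client library. The base URL is supplied by the caller, since the task does not specify one.

```python
from dataclasses import dataclass
from typing import Any, Dict, List
from urllib.parse import quote


@dataclass
class License:
    spdx: str
    url: str
    title: str


@dataclass
class BarePage:
    id: int
    key: str
    title: str
    latest_id: int
    latest_timestamp: str  # kept as a string; the task does not fix a format
    license: License
    other_licenses: List[License]
    content_model: str
    html_url: str


def bare_url(base: str, title: str) -> str:
    """Build a /page/{title}/bare URL (hypothetical helper).

    Assumption: slashes in the title are percent-encoded. safe=":" keeps
    the namespace colon readable while "/" becomes %2F.
    """
    return f"{base}/page/{quote(title, safe=':')}/bare"


def parse_bare(payload: Dict[str, Any]) -> BarePage:
    """Map the JSON properties listed in the task onto a typed object."""
    return BarePage(
        id=payload["id"],
        key=payload["key"],
        title=payload["title"],
        latest_id=payload["latest"]["id"],
        latest_timestamp=payload["latest"]["timestamp"],
        license=License(**payload["license"]),
        other_licenses=[License(**entry) for entry in payload["other_licenses"]],
        content_model=payload["content_model"],
        html_url=payload["html_url"],
    )
```

A client would typically read html_url from the parsed object and hand it to the browser or HTML widget, rather than downloading the HTML itself.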

GET /page/{title}/html

Returns the page HTML. Slashes in the title must be escaped.

Payload: empty

Notable request headers: none

Status:
200 – this is the page
400 – the content model for the page isn't compatible with HTML output
404 – page does not exist (never created or deleted)

Notable response headers: none

Body: HTML
Reversible HTML for the page as generated by Parsoid. No skin or navigation.
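
The three status codes above could be handled on the client roughly as follows. This is only a sketch: the exception names and the interpret_html_response function are hypothetical, and how the status and body are obtained (HTTP library, native widget callback) is left to the caller.

```python
class PageNotFound(Exception):
    """404: the page was never created or has been deleted."""


class UnsupportedContentModel(Exception):
    """400: the page's content model cannot be rendered as HTML."""


def interpret_html_response(status: int, body: str) -> str:
    """Return the HTML on 200, otherwise raise an error matching the status.

    Only the codes documented for GET /page/{title}/html are mapped;
    anything else is reported as unexpected.
    """
    if status == 200:
        return body
    if status == 400:
        raise UnsupportedContentModel("content model is not compatible with HTML output")
    if status == 404:
        raise PageNotFound("page does not exist")
    raise RuntimeError(f"unexpected status {status}")
```

Separating the 400 case from the 404 case lets a client fall back to the JSON endpoint and inspect content_model when the page exists but cannot be rendered as HTML.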

Event Timeline

@tstarling pointed out how important this optimization is, especially for large pages. (AIUI @Krinkle has been a strong advocate of this.)

Browsers and HTML widgets are really good at loading the HTML for a page quickly and making the document viewable very early. Using this endpoint takes advantage of that functionality for quick loading. It's somewhat more complicated to use than the version that has the HTML included (T234375).

eprodromou triaged this task as Medium priority. Oct 1 2019, 9:20 PM
eprodromou updated the task description.

It's worth pointing out that these endpoints are kind of broken for anything but wikitext content types. Happy to just throw an error at this point and deal with other content types later.

eprodromou updated the task description. Oct 11 2019, 1:55 AM
eprodromou updated the task description. Oct 11 2019, 2:11 AM
eprodromou updated the task description. Oct 28 2019, 8:34 PM
eprodromou updated the task description. Nov 10 2019, 7:17 PM

I added the content_model to the JSON-only endpoint, and added an error code if the page can't be rendered as HTML because of its content model.

eprodromou updated the task description. Nov 12 2019, 8:06 PM
BPirkle added a subscriber: BPirkle. Dec 4 2019, 4:06 PM

Reversible HTML for the page as generated by Parsoid. No skin or navigation.

Parsoid is not (yet) a part of core and therefore core code cannot use Parsoid HTML (please correct me if I'm wrong about that). Does this endpoint, at least for now, need to be implemented in an extension that depends on the Parsoid extension? Alternatively, is there any precedent for core code that behaves differently if an extension is absent (in this case, returning an error)?

Reversible HTML for the page as generated by Parsoid. No skin or navigation.

Parsoid is not (yet) a part of core and therefore core code cannot use Parsoid HTML (please correct me if I'm wrong about that).

The premise is correct but the conclusion is unwarranted.

Does this endpoint, at least for now, need to be implemented in an extension that depends on the Parsoid extension?

I think that's a good discussion to have with the parsing team. I'd suggest either:

  • taking this opportunity to add Parsoid/PHP as a library to MediaWiki (it's coming anyway; we don't have to integrate it as the default parser for everything, just for this and other REST API endpoints)
  • calling the Parsoid/PHP microservice to get data parsed

Alternatively, is there any precedent for core code that behaves differently if an extension is absent (in this case, returning an error)?

I think in this case BY FAR the better thing to do would be to return non-Parsoid HTML.

Two more thoughts:

  • endpoints that are in the larger "Unified API" but not in the "Core API" won't have these concerns - they'll be able to rely on things in our production environment outside of core. That doesn't help with this current endpoint, but is something we may be able to use to our advantage in the future
  • another option, depending on expected timing, would be to initially implement this under "coredev" using Parsoid HTML, and only move it to "v1" after Parsoid/PHP is a library. I don't have any objection to an experimental endpoint referencing things outside core.
eprodromou updated the task description. Dec 4 2019, 6:10 PM
WDoranWMF set the point value for this task to 2. Jan 7 2020, 7:02 PM

Change 565408 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/core@master] REST: /page/{title}/html endpoing backed by RESTBase.

https://gerrit.wikimedia.org/r/565408

I haven't really thought about it before, but how're we gonna handle redirect pages?
This is going to be a valid question for all the HTML endpoints.

Several options are available:

  1. Return 200 with the redirect page html https://en.wikipedia.org/w/index.php?title=Carnivorous&redirect=no
  2. Return a 302 with Location header for a redirect target URI
    • With redirect page HTML
    • Without redirect page HTML
  3. Return 200 with redirect target info and HTML (resolve redirects internally unconditionally) - this is a no-go if we ever want to have frontend caching with purging.

The current implementation opts for option 1, because it is the simplest and has the fewest consequences.

RESTBase handles redirects in a much more feature-rich way. By default it returns a normal 302 with a Location header, which is what you want while browsing pages - you don't want your client to implement HTML parsing to parse out the redirect target.

However, for VE to edit the redirect page itself, you need to access its source, so RESTBase implements a redirect=false query parameter that allows you to fetch the redirect page HTML directly.

MediaWiki does pretty much the same thing for page viewing with a 'redirect=no' parameter.

I propose that we implement something similar here eventually?
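
The behaviour proposed above (a 302 by default, with an opt-out for fetching the redirect page itself) could be modelled as a small decision function. This is a sketch under assumptions, not the merged implementation, which returns 200 with the redirect page HTML (option 1). The function name redirect_response and the follow flag are hypothetical; the flag stands in for whatever query parameter the endpoint would eventually accept.

```python
from typing import Dict, Optional, Tuple


def redirect_response(
    redirect_target_url: Optional[str],
    follow: bool = True,
) -> Tuple[int, Dict[str, str]]:
    """Pick a status code and headers for a page HTML request.

    redirect_target_url is None when the page is not a redirect.
    follow=False models an opt-out such as RESTBase's redirect=false
    or MediaWiki's redirect=no.
    """
    if redirect_target_url is not None and follow:
        # Browsing case: send the client to the target, no HTML parsing needed.
        return 302, {"Location": redirect_target_url}
    # Ordinary page, or the caller explicitly asked for the redirect page itself.
    return 200, {}
```

With this shape, an editor such as VE would pass follow=False to get the redirect page's own HTML, while ordinary readers would get the default 302.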

Change 565408 merged by jenkins-bot:
[mediawiki/core@master] REST: /page/{title}/{bare,html,with_html} endpoints backed by RESTBase.

https://gerrit.wikimedia.org/r/565408

Parsoid/PHP hits composer.json in MediaWiki this week

daniel added a subscriber: daniel. Mar 2 2020, 11:48 AM

This is tracked as "waiting for review", but I see no open patches. Is this done? What is missing?

Yeah, forgot to move it.

eprodromou closed this task as Resolved. Mar 11 2020, 6:06 PM

I've confirmed this works. Thanks @Pchelolo