
Support body-only retrieval in normal HTML revision end points (restbase, body)
Closed, Declined (Public)

Description

We currently support body-only retrieval in the wt2html transform end points, but not in the regular HTML revision end points. We could consider adding this, possibly via a GET URL parameter. With query parameters in the mix, we'll probably want to make sure that we never allow Varnish caching for these end points.

Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description.
GWicke subscribed.
GWicke triaged this task as Medium priority. Apr 6 2015, 6:09 PM
GWicke added projects: RESTBase-API, RESTBase.
GWicke set Security to None.
GWicke edited subscribers, added: DarTar; removed: Aklapper.

One issue to consider:

document.documentElement.innerHTML = '<meta>foo'
"<meta>foo"
document.documentElement.innerHTML
"<head><meta></head><body>foo</body>"

This is a quirk in the HTML5 parsing spec, which causes <meta> elements to move to the <head> if they are not wrapped in a <body>. We should probably return body.outerHTML instead of .innerHTML. This is also discussed in T96492.

@DarTar, would that still be useful to you?

GWicke renamed this task from Support body-only retrieval in normal HTML revision end points to Support body-only retrieval in normal HTML revision end points (restbase, body). Apr 28 2015, 1:41 AM

Documentation: FWIW, API:FAQ# (How do I) get the content of a page (HTML)? recommends index.php?action=render, mentions using RESTBase instead on Wikimedia wikis, and notes the latter's different output. If and when we implement body-only retrieval, someone should update that answer.

I see three main options:

  1. Offer a bodyOnly flag, and clearly warn about the parsing issue in the documentation, with a recommendation to prefix the result with <body> before parsing it as a top-level document with an HTML parser.
  2. Document a cheap but reliable way to extract the body, without HTML parsing. (Regexp: /<body[^>]*>([\s\S]*)<\/body>/, reliable as our HTML is serialized from DOM)
  3. Piggy-back on T94890: RFC: API for retrieval and saving of top-level HTML elements / sections by element ID, with a special ID for the entire body. Normally by-ID retrieval will have outerHTML semantics, so we'd include the body element.

Here 2) and 3) could be combined.
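
For illustration, option 2 could look roughly like the sketch below. It uses the regexp quoted above; the function name extractBodyInnerHTML and the sample document are made up for this example and are not part of RESTBase. It assumes DOM-serialized input with a single <body>...</body> pair, and a literal ">" inside a body attribute value would defeat the [^>]* part.

```javascript
// Sketch of option 2: pull the body contents out of a full HTML document
// without running an HTML parser. Assumes DOM-serialized input that
// contains exactly one <body>...</body> pair.
const BODY_RE = /<body[^>]*>([\s\S]*)<\/body>/;

// Hypothetical helper: returns the body's inner HTML, or null if the
// input has no <body> element.
function extractBodyInnerHTML(html) {
  const match = BODY_RE.exec(html);
  return match === null ? null : match[1];
}

// Made-up sample document, for demonstration only.
const doc = '<!DOCTYPE html><html><head><title>T</title></head>' +
  '<body dir="rtl"><p>Foo!</p></body></html>';

console.log(extractBodyInnerHTML(doc)); // prints: <p>Foo!</p>
```

Note that this returns innerHTML semantics, so the attributes on the <body> tag itself are dropped.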

@Nikerabbit, @santhosh: Any input on this? How are you parsing the content?

As per https://gerrit.wikimedia.org/r/#/c/207039/4/pageloader/PageLoader.js we are using option #2. That works for us, but it would be better if this were done on the RESTBase API side.

Hm, RESTBase supports the body_only option these days, but I think it's worth revisiting the body.outerHTML serialization. There is article-level directionality information included in the <body> tag which is lost when it is stripped. (And note that articles can have a directionality independent of the wiki itself, so it's something you really need to check for every article.)

@cscott, the body_only flag is aimed at use cases where HTML is directly embedded in a larger page. For other use cases, users can retrieve the full HTML, and get all the information from the head and body.

Yes, but in that case you still need the directionality. That is, the correct embedding of:

<body dir="rtl">Foo!</body>

is:

<div dir="rtl">Foo!</div>

*not*

Foo!
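
A client working from the full HTML could do that rewrapping itself. The following is only a rough sketch under stated assumptions: bodyToDiv is a hypothetical helper, not an existing API, and it assumes the dir attribute is double-quoted, as DOM serialization produces.

```javascript
// Sketch: turn <body dir="...">...</body> into <div dir="...">...</div>
// so the directionality survives embedding in a larger page.
// Assumes DOM-serialized input with double-quoted attribute values.
function bodyToDiv(html) {
  const match = /<body([^>]*)>([\s\S]*)<\/body>/.exec(html);
  if (match === null) {
    return null; // no <body> element found
  }
  // Look for a dir attribute among the body's attributes.
  const dir = /\sdir="(ltr|rtl|auto)"/.exec(match[1]);
  const dirAttr = dir === null ? '' : ` dir="${dir[1]}"`;
  return `<div${dirAttr}>${match[2]}</div>`;
}

console.log(bodyToDiv('<body dir="rtl">Foo!</body>'));
// prints: <div dir="rtl">Foo!</div>
```

Other body attributes are dropped in this sketch; only dir is carried over.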

@cscott, that's well supported by retrieving the regular HTML.

@GWicke I agree, I'm just saying that body_only as it exists currently is what they call a "Candy Machine Interface" -- it makes it too easy to do the wrong thing: (a) break parsing of <meta> elements if the result is not properly wrapped, and (b) ignore necessary directionality information.

Using body.outerHTML instead of body.innerHTML would avoid both of these traps. (Not that I'm eager to change the API Yet Again.)

@cscott, the issue is that it would make the API useless for the 'just concatenate these strings' use case. I was always on the fence about supporting the body_only flag; we could consider dropping it altogether if existing users are okay with that.

I am leaning towards declining this as "YAGNI". Extracting the body is not that hard to do for clients that really need it & know what they are doing. There are even very efficient streaming solutions for this: https://github.com/wikimedia/web-html-stream

In the last two years, there have also been no requests for body-only retrieval in the REST API. I am assuming that this is very rarely needed, or something clients are already comfortable dealing with themselves.

Any objections?

GWicke lowered the priority of this task from Medium to Low. Jul 11 2017, 10:39 PM
GWicke moved this task from Backlog to watching on the Services board.
GWicke edited projects, added Services (watching); removed Services.
mobrovac edited projects, added Services (done); removed Services (watching).
mobrovac subscribed.

Let's decline this. The use case seems really minor, with a high penalty (code, Varnish, etc.). We can always circle back to this if there is a genuine need for it.