We currently support body-only retrieval in the wt2html transform end points, but not in the regular HTML revision end points. We could consider adding this, possibly with a GET URL parameter. With query parameters in the mix, we'll probably want to make sure that we never allow Varnish caching for these end points.
|Resolved||• GWicke||T93291 Disable Parsoid cache updates for wikis that are migrated to RESTBase|
|Resolved||santhosh||T92359 Use Wikimedia REST API for accessing page data in Content Translation|
|Declined||None||T95199 Support body-only retrieval in normal HTML revision end points (restbase, body)|
One issue to consider:
    document.documentElement.innerHTML = '<meta>foo'
    "<meta>foo"
    document.documentElement.innerHTML
    "<head><meta></head><body>foo</body>"
This is a quirk in the HTML5 parsing spec, which causes <meta> elements to move to the <head> if the markup is not wrapped in a <body>. We should probably return body.outerHTML instead of body.innerHTML. This is also discussed in T96492.
@DarTar, would that still be useful to you?
Documentation: FWIW, the API:FAQ entry "(How do I) get the content of a page (HTML)?" recommends index.php?action=render, mentions using RESTBase instead on Wikimedia wikis, and notes the latter's different output. If and when we implement body-only retrieval, someone should update that answer.
I see three main options:
1. Offer a bodyOnly flag, and clearly warn about the parsing issue in the documentation, with a recommendation of prefixing <body> before parsing as a top-level document with an HTML parser.
2. Document a cheap but reliable way to extract the body, without HTML parsing. (Regexp: /<body[^>]*>([\s\S]*)<\/body>/, reliable as our HTML is serialized from DOM)
3. Piggy-back on T94890: RFC: API for retrieval and saving of top-level HTML elements / sections by element ID, with a special ID for the entire body. Normally by-ID retrieval will have outerHTML semantics, so we'd include the body element.
Here 2) and 3) could be combined.
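For illustration, the regexp-based extraction could look roughly like the sketch below. It is a Python translation of the JavaScript pattern quoted above; the function name `extract_body`, the `outer` flag, and the sample document are made up for this example and are not part of any API. It assumes the input is DOM-serialized HTML with a single <body> element and no ">" inside the body tag's attribute values.

```python
import re

# Same pattern as in the discussion, translated from the JavaScript literal.
# Group 1 captures everything between <body ...> and the last </body>.
BODY_RE = re.compile(r"<body[^>]*>([\s\S]*)</body>")


def extract_body(html: str, outer: bool = False) -> str:
    """Return the body's inner HTML, or the whole <body> element if outer=True."""
    match = BODY_RE.search(html)
    if match is None:
        raise ValueError("no <body> element found")
    # group(0) is the full match including the <body> tags (outerHTML-like);
    # group(1) is only the content between them (innerHTML-like).
    return match.group(0) if outer else match.group(1)


# Hypothetical sample document, not real API output.
SAMPLE = (
    "<!DOCTYPE html><html><head><title>T</title></head>"
    '<body dir="rtl" lang="he"><meta>foo</body></html>'
)

inner = extract_body(SAMPLE)
outer = extract_body(SAMPLE, outer=True)
```

With `outer=True` the result keeps the <body> wrapper, so feeding it back into an HTML parser should not trigger the <meta> relocation described earlier, and the attributes on the body tag are preserved.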
Hm, RESTBase supports the body_only option these days, but I think it's worth revisiting the body.outerHTML serialization. There is article-level directionality information included in the <body> tag which is lost when it is stripped. (And note that articles can have a directionality independent of the wiki itself, so it's something you really need to check for every article.)
@GWicke I agree, I'm just saying that body_only as it exists currently is what they call a "Candy Machine Interface" -- it makes it too easy to do the wrong thing: (a) break parsing of <meta> elements if the result is not properly wrapped, and (b) ignore necessary directionality information.
Using body.outerHTML instead of body.innerHTML would avoid both of these traps. (Not that I'm eager to change the API Yet Again.)
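As a rough sketch of why the <body> tag matters to a client: given full HTML or body.outerHTML, the directionality can be read straight off the tag. The example below uses Python's standard html.parser; the names `BodyAttrs` and `body_direction` are hypothetical, and the "ltr" fallback is an assumption made for the example (a real client would more likely fall back to the wiki's default direction).

```python
from html.parser import HTMLParser


class BodyAttrs(HTMLParser):
    """Collect the attributes of the first <body> start tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "body" and self.attrs is None:
            self.attrs = dict(attrs)


def body_direction(html: str, default: str = "ltr") -> str:
    """Return the dir attribute of <body>, or `default` if it is absent."""
    parser = BodyAttrs()
    parser.feed(html)
    parser.close()
    return (parser.attrs or {}).get("dir") or default


# With the <body> wrapper the direction is recoverable ...
with_body = body_direction('<body dir="rtl" lang="he"><p>x</p></body>')
# ... with a stripped, body-only fragment it silently falls back.
without_body = body_direction("<p>x</p>")
```

The second call shows the trap: once the wrapper is stripped, nothing in the fragment tells the client that the article was right-to-left.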
I am leaning towards declining this as "YAGNI". Extracting the body is not that hard to do for clients that really need it & know what they are doing. There are even very efficient streaming solutions for this: https://github.com/wikimedia/web-html-stream
In the last two years, there have also been no requests for body-only retrieval in the REST API. I am assuming that this is very rarely needed, or something clients are already comfortable dealing with themselves.