
Extract HTML Compatibility Layer from MCS Mobile Sections API
Closed, ResolvedPublic

Description

Follow up from the offsite:

Both the new and old page content APIs output JSON directly. While this is the end goal, it has a few downsides:

  1. It does not separate the concern of finding content within the HTML from the concern of formatting it into JSON.
  2. Because changes are not output as HTML, any cleanup performed by the API cannot easily be upstreamed to Parsoid if it turns out to be general purpose.

In order to address this, a new API will be written that performs all the cleaning and formatting of the MCS API, but outputs the result as HTML.

This includes, but is not limited to:

  • Marking up sections, including the lead section (see also T114072).
  • Marking up other content components that need special treatment or removal in mobile browsers or apps (e.g. infoboxes, navboxes, references, pronunciation help).

In the longer term, this should result in an improvement of the Parsoid markup spec, with much better support for easily and efficiently accomplishing common selection and reformatting tasks on our content.

Related Objects

Event Timeline


@bearND can you inventory what gets added, what gets moved, and what gets marked up?

Following up on a meeting last week between myself and @bearND:

This API will only mark up the HTML; it will not modify the structure or content of the HTML except for adding tags/classes (for example, this API won't move the lead paragraph to the beginning of the page).

Such work will be performed in T162185: Extract JSON API from MCS Page Content API

I guess it should have html in the name plus something else, probably before that. There's already Parsoid at /page/html.
So, /page/*-html. Not sure if * should be mobile or reading or something else.

@GWicke @mobrovac @Pchelolo @bearND:

Thoughts on what to name this API?

At a high level, I guess we are trying to mark up page components, and generally improve support in Parsoid's HTML spec for reading use cases. Perhaps we could call the product something like "componentized HTML". Assuming we will fold this into a revision of the Parsoid HTML spec itself in the medium term, the naming might only matter until that is done. To allow for later replacement with plain Parsoid HTML, the "html-components" (or the like) API endpoint would remain unstable.

It might be useful to write up a rough list of items that should be marked up, and involve @ssastry and the Parsing-Team--ARCHIVED early in the design process. Depending on the implementation complexity and interaction with editing & visual layout, it might be possible to do some of those things in Parsoid right away, and generally design the markup with later integration in mind.

I agree with @GWicke. With the long-term move to Parsoid HTML for views (T55784), it makes sense to think about the eventual desired outcome now, as we are taking the first steps towards it.

On the whole I think involving the Parsoid team is good, but I want to point out the tension/balance here:

  1. Creating markup for the reading team that specifically solves problems for those platforms
  2. Running all changes by the parsoid team and having a discussion about where and how to implement each of them in a general way before implementing them

I think we should be trying to be somewhere in the middle here.

While one of the purposes of moving this code to html rather than json is to make it possible to absorb it into parsoid if it does turn out to be a general need, I don't know if we expect all or even most markup to be absorbed into parsoid. (Maybe after the inventory we will have a better idea)

On the other hand, I don't want to lose sight of the fact that while we are trying to build something flexible that can be used by 3rd-party clients, the main purpose of this is to create APIs and markup that are shareable by the reading platforms and that we can iterate on independently.

I'm not saying this to deter the discussion; I think it's good to sync. I just want us to be cognizant of both overgeneralizing this work and spending too much time bikeshedding over implementation details.

Having said all of that:

@bearND we already talked about inventorying all of the markup we need to make. Can you do that first and add to this ticket?

Following that, I'll set up a meeting with @ssastry and whoever else we think would be best to invite from the parsoid team to discuss what we plan on marking up and go from there.

@GWicke @mobrovac does this sound reasonable?

Also, I hope this is coming through that I don't think this is a bad idea at all, I'm just trying to stay on top of scope creep and administrative overhead from the beginning on this.

@Fjalapeno, agreed. Let's see what you need & if it's straightforward from the Parsoid perspective. I suspect that most needs will actually be quite straightforward stuff like adding some attributes. You can always just go ahead with *something* for harder or more specialized cases. I'm basically advocating for making a very time-boxed effort at coordinating, focusing especially on the easy cases.

I propose naming it mobile-html since I anticipate that independent of whatever updates we make to the Parsoid HTML, there will be some mobile-specific transformations that are not suited for other clients.

Initially (as in now), this might have many transformations, some of which might migrate upstream into Parsoid output, which would lead to their removal from this API code.

I am happy to chat about whatever you guys are doing right now to massage Parsoid output for your needs and see which of these should / could be in Parsoid, and also discuss timelines. I agree with your collective assessment to time-box this co-ordination and not block on our team, especially since we are a pretty lean team right now.

We briefly discussed this during the parsing / services sync meeting. For the common case of identifying content elements by the templates used, we agreed that it is important to let the community control this classification for each project. One option would be to use templatedata annotations; another option would be a more specialized on-wiki data structure. From a performance perspective, it would be a lot better if a service (MCS or Parsoid) could fetch & cache a blob identifying *all* special content element templates per wiki, rather than needing to check for each template individually.
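A rough sketch of what such a per-wiki fetch-and-cache step could look like. The on-wiki page title, the JSON shape, and the helper name here are assumptions for illustration only, not an agreed format:

```
'use strict';

// Sketch only: fetch a per-wiki blob mapping template names to content-block
// types and cache it, instead of checking each template individually.
const preq = require('preq');

const cache = new Map(); // domain -> { expires, data }
const TTL_MS = 10 * 60 * 1000;

function getTemplateClassification(domain) {
    const cached = cache.get(domain);
    if (cached && cached.expires > Date.now()) {
        return Promise.resolve(cached.data);
    }
    // Hypothetical on-wiki page holding the community-controlled classification
    const uri = `https://${domain}/w/index.php?action=raw&title=MediaWiki:Content-block-templates.json`;
    return preq.get({ uri }).then((res) => {
        const data = typeof res.body === 'string' ? JSON.parse(res.body) : res.body;
        // e.g. { "Infobox person": "infobox", "Navbox": "navbox" }
        cache.set(domain, { expires: Date.now() + TTL_MS, data });
        return data;
    });
}

module.exports = { getTemplateClassification };
```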

For the common case of identifying content elements by the templates used, we agreed that it is important to let the community control this classification for each project.

Would you elaborate on what this means? What classification? Maybe an example would help.

The main observation is that there is no way to automatically infer what is an infobox, navbox, etc. What we are effectively looking at is assigning tags/types to content blocks on a page, and that is meta information that is not available in wikitext. So, editors have to give us that information. These comments, T105845#1650013 and T105845#1650089, have a little bit of discussion.

Here are some of the details:

Open questions/assumptions:

  • Should I switch to mobileview for main pages in the HTML endpoint or defer to the JSON endpoint? I'm leaning towards deferring to the JSON endpoint and just using Parsoid HTML for the mobile-html endpoint. Same for specific User and File namespace handling (=additions of metadata).
  • Does the new version of the endpoint need to shorten page internal links?
  • I assume that the mobile-html endpoint will preserve the Parsoid head element as is (but may add to it if needed)
  • It would be good to mark reference sections so they can be easily removed later.

No special markup needed (just preserve Parsoid values):

  • Revision: Currently getting that from etag (first part before /).

Could also get it from the html element's about attribute: about="http://en.wikipedia.org/wiki/Special:Redirect/revision/773487730"
Which leads to the question: is it OK to use the same attributes in the <html> tag as Parsoid emits?
<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/" about="http://en.wikipedia.org/wiki/Special:Redirect/revision/773487730">
Since these are namespace prefix definitions and the about attribute seems generic, I assume that is the case. (A small revision-extraction sketch follows this list.)

  • lastmodified: 'head > meta[property="dc:modified"]'
  • ... there are more but since they are noops for this endpoint I don't want to pull all of them out at this point.
  • Lead vs section 0: Instead of calling it the lead section, should we just call it section 0?
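For reference, a minimal sketch of how the revision could be recovered from either source. The function names are illustrative, and the ETag shape assumed here is the usual "<rev>/<tid>" form:

```
'use strict';

const domino = require('domino');

// From the ETag, e.g. W/"773487730/abc-def": take the first part before the slash.
function revisionFromEtag(etag) {
    return etag.replace(/^W\//, '').replace(/"/g, '').split('/')[0];
}

// From the <html> element's about attribute,
// e.g. http://en.wikipedia.org/wiki/Special:Redirect/revision/773487730
function revisionFromHtml(html) {
    const doc = domino.createDocument(html);
    const about = doc.documentElement.getAttribute('about') || '';
    const match = about.match(/\/revision\/(\d+)$/);
    return match && match[1];
}

module.exports = { revisionFromEtag, revisionFromHtml };
```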

Markup needed for:

(update getSectionsText accordingly)

  • Big maybe: _addParsoidSpecificMarkup (see Deprecate section)

Act on lead section only:

  • transforms.extractLeadIntroduction(lead) [=lead paragraph] (intro); a markup-only sketch follows this list

Act on any section:

  • transforms.extractInfobox
  • parseProperty.parsePronunciation [1](foo-pronunciation + foo-pronunciation-title)
  • transforms.extractPageIssues
  • transforms.extractHatnotes
  • parseProperty.parseGeo (foo-geo + foo-geo-title)
  • parseProperty.parseSpokenWikipedia
  • transforms.markReferenceSections

1: We assume that the pronunciations in the first sections pertain to the title. foo: still need to come up with a class prefix for our stuff. Should we use mw?
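To make the "mark up, don't move" idea concrete for the extractLeadIntroduction item above, here is a minimal sketch of a markup-only variant. It assumes Parsoid-style <section data-mw-section-id="0"> wrappers, and the class name (including the still-undecided prefix) is a placeholder:

```
'use strict';

// Sketch only: tag the lead paragraph with a class instead of extracting or
// moving it. Selector and class name are placeholders, not a final spec.
const domino = require('domino');

function markLeadIntroduction(doc) {
    const lead = doc.querySelector('section[data-mw-section-id="0"]');
    if (!lead) { return; }
    // first non-empty <p> that is a direct child of the lead section
    const p = Array.from(lead.children)
        .find(el => el.tagName === 'P' && el.textContent.trim().length > 0);
    if (p) {
        p.classList.add('foo-lead-paragraph');
    }
}

// Usage:
const doc = domino.createDocument(
    '<section data-mw-section-id="0"><p>Earth is a planet.</p></section>');
markLeadIntroduction(doc);
```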

Deprecate (app should change accordingly)? A rough reconstruction of the helpers referenced here follows this list.

  • _rewriteUrlAttribute(doc, 'a', 'href'); rewrites href values via .replace(/^\.\//, '/wiki/');
  • Set class="external" on all <a rel="mw:ExtLink"> so we get the 'external link' indicators in the UI: _addClassTo(doc, 'a[rel~=mw:ExtLink]', 'external');
  • Set class="image" on all anchors around images so the app knows to view these images in the gallery. See also https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Images. _addClassTo(doc, 'figure > a, span[typeof^=mw:Image] > a', 'image');
  • hide IPAs (instead of removing IPA elements let CSS handle that)
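For readers without the mobileapps source handy, a rough reconstruction of the two helpers referenced above. These are assumed shapes for illustration, not the actual implementation:

```
'use strict';

// Assumed shapes of the helpers mentioned in the list above.
// Works on any DOM document (e.g. one built with domino).
function _addClassTo(doc, selector, className) {
    Array.from(doc.querySelectorAll(selector))
        .forEach(el => el.classList.add(className));
}

function _rewriteUrlAttribute(doc, selector, attribute) {
    Array.from(doc.querySelectorAll(selector)).forEach((el) => {
        const value = el.getAttribute(attribute);
        if (value) {
            // Parsoid emits relative links like './Foo'; rewrite to '/wiki/Foo'.
            el.setAttribute(attribute, value.replace(/^\.\//, '/wiki/'));
        }
    });
}

module.exports = { _addClassTo, _rewriteUrlAttribute };
```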

Move calls to JSON endpoint

  • getSectionsText()
  • hideRedLinks [not effective right now, needs a patch; see T119266 for details]
  • Remove reference link backs. Example: <a href="#cite_ref-1"><span class="mw-linkback-text">↑ </span></a> (since it doesn't make sense in the native reference display)
  • transforms.stripUnneededMarkup() (including table.navbox)
    • remove elements with CSS selectors (see the helper sketch after this list):
'span.Z3988', // Remove <span class="Z3988"></span>
'span:empty', // Remove empty <span></span>
'link',
'table.navbox', // Remove Navboxes
'div.infobox',
'div.magnify',
'.hide-when-compact',
'.geo-nondefault',
'.geo-multi-punct',
'span#coordinates'
  • Remove attributes from elements:
_rmAttributes(doc, 'a:not([rel=nofollow],[rel~=mw:ExtLink])', ['rel']);
_rmAttributes(doc, 'a', ['about', 'data-mw', 'id', 'typeof']);
_rmAttributes(doc, 'code', ['about', 'data-mw', 'id', 'typeof']);
_rmAttributes(doc, 'div', ['about', 'data-mw', 'id', 'typeof']);
_rmAttributes(doc, 'ol', ['about', 'data-mw', 'id', 'typeof']);
_rmAttributes(doc, 'p', ['about', 'data-mw', 'id', 'typeof']);
_rmAttributes(doc, 'ul', ['about', 'data-mw', 'id', 'typeof']);
_rmAttributes(doc, 'table', ['about', 'data-mw', 'id', 'typeof']);
_rmAttributes(doc, 'blockquote', ['about', 'data-mw', 'id', 'typeof']);
_rmAttributes(doc, 'cite', ['about', 'data-mw', 'typeof']);
_rmAttributes(doc, 'abbr', ['title']);
_rmAttributes(doc, 'figure', ['id', 'typeof']);
_rmAttributes(doc, 'b', ['id']);
_rmAttributes(doc, 'q', ['id']);
_rmAttributes(doc, 'td', ['id']);
_rmAttributes(doc, 'figcaption', ['id']);
_rmAttributes(doc, 'figcaption a[class~=image]', ['class']); // T123527
_rmAttributes(doc, 'i', ['about', 'data-mw', 'id', 'typeof']);
_rmAttributes(doc, 'li', ['about']);
_rmAttributes(doc, 'img', ['about', 'alt', 'data-file-height', 'data-file-width', 'id',
    'resource']);
_rmAttributes(doc, 'span', ['about', 'data-file-type', 'data-mw', 'id', 'itemscope',
    'itemtype', 'lang', 'rel', 'title', 'typeof']);
  • Simplify Bracket Spans: replace <span>[</span>1<span>]</span> with [1]
  • Remove comment elements
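For context, here is roughly what the helpers used throughout this list look like. The attribute-removal helper mirrors the _rmAttributes calls above; the name and shape of the element-removal helper are made up for this sketch:

```
'use strict';

// Assumed shapes of the removal helpers used in the list above.
// Works on any DOM document (e.g. one built with domino).
function _rmElements(doc, selectors) {
    Array.from(doc.querySelectorAll(selectors.join(',')))
        .forEach(el => el.parentNode.removeChild(el));
}

function _rmAttributes(doc, selector, attributes) {
    Array.from(doc.querySelectorAll(selector)).forEach((el) => {
        attributes.forEach(attr => el.removeAttribute(attr));
    });
}

module.exports = { _rmElements, _rmAttributes };
```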

remove elements with CSS selectors

Most of these matches look fairly brittle. Wherever possible, we should move such heuristics into the HTML transform, and mark up content elements explicitly, so that the JSON / per-device layer has a stable interface to work against. This will also allow us to gradually replace these heuristics with more robust template identification, perhaps using templatedata.
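As one illustration of that direction (the data attribute name, the selector map, and the function name are placeholders, not an agreed spec): instead of every client matching 'table.navbox' itself, the HTML layer would annotate such elements once with a stable marker that the JSON / per-device layer keys off.

```
'use strict';

// Illustration only: annotate content blocks in the HTML layer so downstream
// layers can rely on a stable marker instead of fragile selectors.
const COMPONENT_SELECTORS = {
    navbox: 'table.navbox',
    infobox: 'table.infobox, div.infobox',
    hatnote: 'div.hatnote'
};

function markComponents(doc) {
    Object.keys(COMPONENT_SELECTORS).forEach((type) => {
        Array.from(doc.querySelectorAll(COMPONENT_SELECTORS[type]))
            .forEach(el => el.setAttribute('data-mw-component', type));
    });
}

module.exports = { markComponents };
```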

Remove attributes from elements:

It looks like most of these changes are aimed at reducing the size of the HTML, at the cost of disabling editing or ad-hoc matching abilities. It would be interesting to a) quantify how much this saves after compression, and b) see if a future revision of the Parsoid HTML spec could reduce overheads to the point where we can avoid losing the functionality.

Thoughts on what to name this API?

Ideas: read-html, view-html

@GWicke these look like good suggestions!

Most of these matches look fairly brittle. Wherever possible, we should move such heuristics into the HTML transform, and mark up content elements explicitly, so that the JSON / per-device layer has a stable interface to work against. This will also allow us to gradually replace these heuristics with more robust template identification, perhaps using templatedata.

I like the rationale of moving everything that can "break" into the HTML layer, making the JSON layers more robust and easier to replace.

It looks like most of these changes are aimed at reducing the size of the HTML, at the cost of disabling editing or ad-hoc matching abilities. It would be interesting to a) quantify how much this saves after compression, and b) see if a future version of Parsoid HTML could reduce overheads to the point where we can avoid losing the functionality.

We should quantify what we are removing before removing it. There's no need to introduce deviations from regular Parsoid markup that may be unneeded.

@bearND how does this sound to you?

Both points sound good to me. The attribute removals tend to provide smaller reductions compared to the removal of elements and reference sections. I'm fine with iterating on that some more once I have the basic structure. If you have good suggestions for a set of representative test pages or a tool to measure the size reduction of the compressed data, I'm all ears.
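In case it helps with the measurement question, a rough sketch using only Node's zlib plus preq. The list of test titles is arbitrary, and since the stripped endpoint isn't deployed yet, the second request here just uses mobile-sections for scale; swap in the new endpoint's path once it exists:

```
'use strict';

// Sketch: compare gzipped response sizes for a few representative titles.
const zlib = require('zlib');
const preq = require('preq');

const titles = ['Barack_Obama', 'Earth', 'Cat'];
const base = 'https://en.wikipedia.org/api/rest_v1/page';

const gzippedBytes = body => zlib.gzipSync(Buffer.from(body)).length;

titles.forEach((title) => {
    Promise.all([
        preq.get({ uri: `${base}/html/${title}` }),           // Parsoid HTML
        preq.get({ uri: `${base}/mobile-sections/${title}` })  // placeholder comparison endpoint
    ]).then(([parsoid, mobile]) => {
        console.log(`${title}: Parsoid HTML ${gzippedBytes(parsoid.body)} B gzipped, ` +
            `mobile-sections ${gzippedBytes(JSON.stringify(mobile.body))} B gzipped`);
    });
});
```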

The HTML endpoint should be easy to diff against Parsoid. Both visually and with diff tools.
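A tiny sketch of one way to support that with ordinary diff tools. The normalization rules and file names here are just an example:

```
'use strict';

// Sketch: write normalized copies of the Parsoid output and the new endpoint's
// output so that `diff -u` (or a visual diff tool) can compare them line by line.
const fs = require('fs');

function normalize(html) {
    return html
        .replace(/>\s*</g, '>\n<')   // one tag per line
        .replace(/[ \t]+$/gm, '');   // strip trailing whitespace
}

function writeForDiff(parsoidHtml, readHtml) {
    fs.writeFileSync('parsoid.html', normalize(parsoidHtml));
    fs.writeFileSync('read.html', normalize(readHtml));
    // then: diff -u parsoid.html read.html
}

module.exports = { writeForDiff };
```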

I added this to my previous comment since this is a new requirement that has already been implemented by the page/formatted endpoint:

  • It would be good to mark reference sections so they can be easily removed later.

For completeness' sake:

We had a meeting and created a few subtasks for this ticket.

This ticket should be completed after T164032, which documents the markup before it is implemented.

Change 358646 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/mobileapps@master] Preserve geo information

https://gerrit.wikimedia.org/r/358646

Change 352963 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/mobileapps@master] Add read-html route

https://gerrit.wikimedia.org/r/352963

Change 358646 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Preserve geo information

https://gerrit.wikimedia.org/r/358646

Fjalapeno renamed this task from Extract HTML API from MCS Page Content API to Extract HTML Compatibility Layer from MCS Mobile Sections API. Jul 13 2017, 2:41 PM

Change 352963 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Add read-html route

https://gerrit.wikimedia.org/r/352963

I currently have two routes: read-base-html and read-html.

  • read-base-html is the one closest to Parsoid markup, with some minor changes, mostly the sectioning artifacts. (Since Parsoid is going to implement sectioning as well, there is a decent chance we could get rid of this endpoint eventually and use Parsoid directly.)
  • read-html is basically what the previous endpoint provides plus some size optimizations: stripReferenceListContent + stripUnneededMarkup (=unneeded for reading).

Let me know if you come up with any better names.
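To make the stacking relationship between the two routes concrete, a rough sketch using plain express (the real service uses its own router wrapper, and the fetch/transform functions below are stand-ins for the ones discussed in this task, stubbed out so the sketch is self-contained):

```
'use strict';

const express = require('express');
const app = express();

// Stand-ins for the real fetch and transforms (not actual mobileapps functions).
const getParsoidHtml = title =>
    Promise.resolve(`<html><body><section data-mw-section-id="0">${title}</section></body></html>`);
const addReadingMarkup = html => Promise.resolve(html);           // sectioning artifacts, component markup
const stripReferenceListContent = html => Promise.resolve(html);  // size optimization
const stripUnneededMarkup = html => Promise.resolve(html);        // size optimization

app.get('/page/read-base-html/:title', (req, res) => {
    getParsoidHtml(req.params.title)
        .then(addReadingMarkup)
        .then(html => res.type('html').send(html));
});

app.get('/page/read-html/:title', (req, res) => {
    getParsoidHtml(req.params.title)
        .then(addReadingMarkup)
        .then(stripReferenceListContent)
        .then(stripUnneededMarkup)
        .then(html => res.type('html').send(html));
});

// app.listen(8888);
```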

Change 370295 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/mobileapps@master] Add read-base-html route

https://gerrit.wikimedia.org/r/370295

Since you asked for bikeshedding... How about:

  • read-html: Base HTML, with extra reading markup.
  • read-html-min: Minified base HTML, with destructive removals.
    • alternative: read-html-stripped

Pros:

  • More consistent / suggestive about stacking relationship.
  • Calls out size difference ("min"), similar to JS minification.

Cons:

  • Longer name for minified HTML.

Change 370295 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Add read-base-html route

https://gerrit.wikimedia.org/r/370295

@bearND @Fjalapeno

I noticed that https://en.m.wikipedia.org/api/rest_v1/page/mobile-sections/Barack_Obama/ is returning giant tables which are never shown on mobile. On the Obama article the "This article is part of a series about..." table is one such example.

Is that issue encompassed by this ticket or should I file a separate one?

@Mhurd a separate issue with details would be better.

Mentioned in SAL (#wikimedia-operations) [2017-09-06T22:06:32Z] <bsitzmann@tin> Started deploy [mobileapps/deploy@507a479]: Update mobileapps to 2cb6281 (T168848 T169277 T169274 T162179 T164033 T167921 T174698 T168848 T174808)

Mentioned in SAL (#wikimedia-operations) [2017-09-06T22:11:25Z] <bsitzmann@tin> Finished deploy [mobileapps/deploy@507a479]: Update mobileapps to 2cb6281 (T168848 T169277 T169274 T162179 T164033 T167921 T174698 T168848 T174808) (duration: 04m 53s)