Extract HTML Compatibility Layer from MCS Mobile Sections API
Closed, ResolvedPublic
Actions

Description

Follow up from the offsite:

Both the new and old page content APIs output JSON directly. While this is the end goal it has a few downsides:

It does not separate the concerns of finding content within the HTML and formatting it into JSON
Because changes are not output as HTML any cleanup performed by the API is not easily upstreamed to Parsoid if it is found to be general purpose.

In order to address this, a new API will be written the performs all the cleaning and formatting of the MCS API, but outputs it as HTML.

This includes, but is not limited to:

Marking up sections, including the lead section (see also T114072).
Marking up other content components that need special treatment or removal in mobile browsers or apps (ex: infobox, navboxes, references, pronunciation help).

In the longer term, this should result in an an improvement of the Parsoid markup spec with much better support for easily and efficiently accomplishing common selection and reformatting tasks on our content.

Details

Subject	Repo	Branch	Lines +/-
Add read-base-html route	mediawiki/services/mobileapps	master	+57 -8
Add read-html route	mediawiki/services/mobileapps	master	+268 -107
Preserve geo information	mediawiki/services/mobileapps	master	+23 -11

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Resolved	None	T169242 Develop Page Content Service for Reading Clients
Resolved	• Jhernandez	T177424 Develop Compatibility Layer of PCS
Resolved	• bearND	T162179 Extract HTML Compatibility Layer from MCS Mobile Sections API
Declined	None	T78734 Support for mobile's section use cases in Parsoid HTML
Resolved	ssastry	T114072 <section> tags for MediaWiki sections
Resolved	• bearND	T178705 Prepare MCS for Parsoid section tags
Resolved	• bearND	T178706 Improve section parsing in mobile-sections endpoint
Resolved	• bearND	T178707 Create script to compare old and new sectioning code
Resolved	• bearND	T178708 Update Wiktionary definition parsing to account for new (possibly nested) section tags
Resolved	• bearND	T164032 Document HTML Markup Spec
Resolved	• bearND	T164033 Test size of "Reading stripped" HTML vs non-stripped HTML
Resolved	• bearND	T168875 Remove reference sections at end of article from page content HTML
Declined	None	T175742 Consider using data-mw-* attribute for section ids in parsoidSectionsUsingDivs
Declined	None	T172550 Detect when pages have generated content, and figure out how to inject this content into the Read HTML

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

• Fjalapeno mentioned this in T162185: Extract JSON API from MCS Page Content API.Apr 4 2017, 7:00 PM

• GWicke updated the task description. (Show Details)Apr 4 2017, 7:04 PM

• Pchelolo moved this task from watching to designing on the Services board.Apr 4 2017, 7:27 PM

• Pchelolo edited projects, added Services (designing); removed Services (watching).

@bearND can you inventory what gets added, what gets moved, and what gets marked up?

• bearND moved this task from Incoming to Kanban on the Mobile-Content-Service board.Apr 6 2017, 2:56 AM

• bearND edited projects, added Mobile-Content-Service (Kanban); removed Mobile-Content-Service.

• Fjalapeno added a comment.Apr 7 2017, 4:13 PM

This comment was removed by • Fjalapeno.

Following up a meeting last week with myself and @bearND:

This API will only markup the HTML, it will not modify the structure or content of the HTML except for the adding tags/classes (For example: This API won't move the lead paragraph to the beginning of the page).

Such work will be performed in T162185: Extract JSON API from MCS Page Content API

@GWicke @mobrovac @Pchelolo @bearND:

Thoughts on what to name this API?

I guess it should have html in the name plus something else, probably before that. There's already Parsoid at /page/html.
So, /page/*-html. Not sure if * should be mobile or reading or something else.

In T162179#3169058, @Fjalapeno wrote:

@GWicke @mobrovac @Pchelolo @bearND:

Thoughts on what to name this API?

At a high level, I guess we are trying to mark up page components, and generally improve support in Parsoid's HTML spec for reading use cases. Perhaps we could call the product something like "componentized HTML". Assuming we will fold this into a revision of the Parsoid HTML spec itself in the medium term, the naming might only matter until that is done. To allow for later replacement with plain Parsoid HTML, the "html-components" (or the like) API end point would remain unstable.

It might be useful to write up a rough list of items that should be marked up, and involve @ssastry and the Parsing-Team--ARCHIVED early in the design process. Depending on the implementation complexity and interaction with editing & visual layout, it might be possible to do some of those things in Parsoid right away, and generally design the markup with later integration in mind.

I agree with @GWicke . With the long-term move to Parsoid HMTL for views (T55784), it makes sense to think about the eventual sought outcome now as we are starting to make the first steps towards it.

Ok, that makes sense to me.

On the whole I think involving the Parsoid team is good, but I want to point out the tension/balance here:

Creating markup for the reading team that specifically solves problems for those platforms
Running all changes by the parsoid team and having a discussion about where and how to implement each of them in a general way before implementing them

I think we should be trying to be somewhere in the middle here.

While one of the purposes of moving this code to html rather than json is to make it possible to absorb it into parsoid if it does turn out to be a general need, I don't know if we expect all or even most markup to be absorbed into parsoid. (Maybe after the inventory we will have a better idea)

On the other hand I don't want to lose sight of the fact that while we are trying to build something flexible and can be used by 3rd party clients, the main purpose of this is to create apis and markup that is shareable by the reading platforms and that we can iterate on independently.

I'm not saying this to deter the discussion - I think it's good to sync, I just want us to be cognizant of both overgeneralizing this work and spending too much time bikeshedding over implementation details.

Having said all of that:

@bearND we already talked about inventorying all of the markup we need to make. Can you do that first and add to this ticket?

Following that, I'll set up a meeting with @ssastry and whoever else we think would be best to invite from the parsoid team to discus what we plan on marking up and go from there.

@GWicke @mobrovac does this sound reasonable?

Also, I hope this is coming through that I don't think this is a bad idea at all, I'm just trying to stay on top of scope creep and administrative overhead from the beginning on this.

@Fjalapeno, agreed. Lets see what you need & if it's straightforward from the Parsoid perspective. I suspect that most needs will actually be quite straightforward stuff like adding some attributes. You can always just go ahead with *something* for harder or more specialized cases. I'm basically advocating for making a very time-boxed effort at coordinating, focusing especially on the easy cases.

@GWicke 👍

@bearND are you good here?

I propose naming it mobile-html since I anticipate that independent of whatever updates we make to the Parsoid HTML, there will be some mobile-specific transformations that are not suited for other clients.

Iniitially (as in now), this might have many transformations some of which might migrate upstream into Parsoid output which might lead them to being removed in this api code.

I am happy to chat about whatever you guys are doing right now to massage Parsoid output for your needs and see which of these should / could be in Parsoid, and also discuss timelines. I agree with your collective assessment to time-box this co-ordination and not block on our team, especially since we are a pretty lean team right now.

• Fjalapeno mentioned this in T146944: Mobile-sections replacement.Apr 11 2017, 3:59 PM

• Mholloway subscribed.Apr 11 2017, 4:00 PM

We briefly discussed this during the parsing / services sync meeting. For the common case of identifying content element by templates used, we agreed that it is important to let the community control this classification for each project. One option would be to use templatedata annotations; another option would be a more specialized on-wiki data structure. From a performance perspective, it would be a lot better if a service (MCS or Parsoid) could fetch & cache a blob identifying *all* special content element templates per wiki, rather than needing to check for each template individually.

In T162179#3175515, @GWicke wrote:

For the common case of identifying content element by templates used, we agreed that it is important to let the community control this classification for each project.

Would elaborate what this means? What classification? Maybe an example would help.

The main observation is that there is no way to automatically infer what is an infobox, navbox, etc. What we are effectively looking at is assigning tags/types to content blocks on a page, and that is meta information that is not available in wikitext. So, editors have to give us that information. These comments: T105845#1650013 and T105845#1650089 has a little bit of discussion.

Here's some of the details

Open questions/assumptions:

Should I switch to mobileview for main pages in the HTML endpoint or defer to JSON endpoint? I'm leaning towards deferring to the JSON endpoint and just using Parsoid HTML for the mobile-html endpoint. Same for specific User and File namespace handling (=additions of metadata).
Does the new version of the endpoint need to shorten page internal links?
I assume that the mobile-html endpoint will preserve the Parsoid head element as is (but may add to it if needed)
It would be good to mark reference sections so they can be easily removed later.

No special markup needed (just preserve Parsoid values):

Revision: Currently getting that from etag (first part before /).

Could also get from html element about="http://en.wikipedia.org/wiki/Special:Redirect/revision/773487730"
Which leads to the question: Is it ok to use the same attribute in the <html> tag as Parsoid emits?
<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/" about="http://en.wikipedia.org/wiki/Special:Redirect/revision/773487730">
Since these are namespace prefix definitions + the about attribute seems generic I assume that is the case.

lastmodified: 'head > meta[property="dc:modified"]'
... there are more but since they are noops for this endpoint I don't want to pull all of them out at this point.
Lead vs section 0: Instead of calling it lead section should we just call it section 0?

Markup needed for:

Mark sections: addSectionDivs() should be replaced with using https://github.com/wikimedia/parsoid-dom-utils

(update getSectionsText accordingly)

Big maybe: _addParsoidSpecificMarkup (see Deprecate section)

Act on lead section only:

transforms.extractLeadIntroduction(lead) [=lead paragraph] (intro)

Act on any section:

transforms.extractInfobox
parseProperty.parsePronunciation [1](foo-pronunciation + foo-pronunciation-title)
transforms.extractPageIssues
transforms.extractHatnotes
parseProperty.parseGeo (foo-geo + foo-geo-title)
parseProperty.parseSpokenWikipedia
transforms.markReferenceSections

1: We assume that the pronunciations in the first sections pertain to the title. foo: still need to come up with a class prefix for our stuff. Should we use mw?

Deprecate (app should change accordingly)?

_rewriteUrlAttribute(doc, 'a', 'href'); .replace(/^\.\, '/wiki/');
Set <a class=\'external\" on all <a rel=\"mw:ExtLink\" so we get the 'external link' indicators in the UI. _addClassTo(doc, 'a[rel~=mw:ExtLink]', 'external');
Set class="image" on all anchors around images so the app knows to view these images in the gallery. See also https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Images _addClassTo(doc, 'figure > a, span[typeof^=mw:Image] > a', 'image');
hide IPAs (instead of removing IPA elements let CSS handle that)

Move calls to JSON endpoint

getSectionsText()
hideRedLinks [not effective right now, needs a patch; see T119266 for details)
Remove reference link backs. Example: <a href=\"#cite_ref-1\">↑ </a> (since it doesn't make sense in the native reference display)
transforms.stripUnneededMarkup() (including table.navbox)
- remove elements with CSS selectors:

'span.Z3988', // Remove <span class=\"Z3988\"></span>
'span:empty', // Remove empty <span></span>
'link',
'table.navbox', // Remove Navboxes
'div.infobox',
'div.magnify',
'.hide-when-compact',
'.geo-nondefault',
'.geo-multi-punct',
'span#coordinates'

Remove attributes from elements:

_rmAttributes(doc, 'a:not([rel=nofollow],[rel~=mw:ExtLink])', ['rel']);
_rmAttributes(doc, 'a', ['about', 'data-mw', 'id', 'typeof']);
_rmAttributes(doc, 'code', ['about', 'data-mw', 'id', 'typeof']);
_rmAttributes(doc, 'div', ['about', 'data-mw', 'id', 'typeof']);
_rmAttributes(doc, 'ol', ['about', 'data-mw', 'id', 'typeof']);
_rmAttributes(doc, 'p', ['about', 'data-mw', 'id', 'typeof']);
_rmAttributes(doc, 'ul', ['about', 'data-mw', 'id', 'typeof']);
_rmAttributes(doc, 'table', ['about', 'data-mw', 'id', 'typeof']);
_rmAttributes(doc, 'blockquote', ['about', 'data-mw', 'id', 'typeof']);
_rmAttributes(doc, 'cite', ['about', 'data-mw', 'typeof']);
_rmAttributes(doc, 'abbr', ['title']);
_rmAttributes(doc, 'figure', ['id', 'typeof']);
_rmAttributes(doc, 'b', ['id']);
_rmAttributes(doc, 'q', ['id']);
_rmAttributes(doc, 'td', ['id']);
_rmAttributes(doc, 'figcaption', ['id']);
_rmAttributes(doc, 'figcaption a[class~=image]', ['class']); // T123527
_rmAttributes(doc, 'i', ['about', 'data-mw', 'id', 'typeof']);
_rmAttributes(doc, 'li', ['about']);
_rmAttributes(doc, 'img', ['about', 'alt', 'data-file-height', 'data-file-width', 'id',
    'resource']);
_rmAttributes(doc, 'span', ['about', 'data-file-type', 'data-mw', 'id', 'itemscope',
    'itemtype', 'lang', 'rel', 'title', 'typeof']);

Simplify Bracket Spans: replace [1] with [1]
Remove comment elements

@GWicke @mobrovac @Pchelolo @bearND:

Thoughts on what to name this API?

• NHarateh_WMF added a project: Product-Infrastructure-Team-Backlog-Deprecated (Kanban).Apr 25 2017, 3:45 PM

remove elements with CSS selectors

Most of these matches look fairly brittle. Wherever possible, we should move such heuristics into the HTML transform, and mark up content elements explicitly, so that the JSON / per-device layer has a stable interface to work against. This will also allow us to gradually replace these heuristics with more robust template identification, perhaps using templatedata.

Remove attributes from elements:

It looks like most of these changes are aimed at reducing the size of the HTML, at the cost of disabling editing or ad-hoc matching abilities. It would be interesting to a) quantify how much this saves after compression, and b) see if a future revision of the Parsoid HTML spec could reduce overheads to the point where we can avoid losing the functionality.

Thoughts on what to name this API?

Ideas: read-html, view-html

• GWicke added a subtask: T78734: Support for mobile's section use cases in Parsoid HTML.Apr 25 2017, 6:38 PM

@GWicke these look like good suggestions!

Most of these matches look fairly brittle. Wherever possible, we should move such heuristics into the HTML transform, and mark up content elements explicitly, so that the JSON / per-device layer has a stable interface to work against. This will also allow us to gradually replace these heuristics with more robust template identification, perhaps using templatedata.

I like the rationale about moving everything that can "break" into the HTML layer - making the JSON layers more robust. And making them easier to replace.

It looks like most of these changes are aimed at reducing the size of the HTML, at the cost of disabling editing or ad-hoc matching abilities. It would be interesting to a) quantify how much this saves after compression, and b) see if a future version of Parsoid HTML could reduce overheads to the point where we can avoid losing the functionality.

We should quantify what we are removing before removing it. No need introducing changes from regular parsoid markup that may be unneeded.

@bearND how does this sound to you?

Both points sound good to me. The attribute removals tend to provide smaller reductions compared to the removal of elements and reference sections. I'm fine iterating on that some more once I have the basic structure. If you have good suggestions for a set of representative test pages or tool on how to measure the size reduction of the compressed data I'm all ears.

The HTML endpoint should be easy to diff against Parsoid. Both visually and with diff tools.

I added this to my previous comment since this is a new requirement that has already been implemented by the page/formatted endpoint:

It would be good to mark reference sections so they can be easily removed later.

• bearND claimed this task.Apr 26 2017, 8:40 PM

• bearND moved this task from To Do to Doing on the Mobile-Content-Service (Kanban) board.

• Fjalapeno created subtask T164032: Document HTML Markup Spec.Apr 27 2017, 7:52 PM

• Fjalapeno created subtask T164033: Test size of "Reading stripped" HTML vs non-stripped HTML.Apr 27 2017, 8:30 PM

• Fjalapeno moved this task from Doing to To Do on the Mobile-Content-Service (Kanban) board.Apr 28 2017, 8:02 PM

• Fjalapeno edited projects, added Mobile-Content-Service; removed Mobile-Content-Service (Kanban).Apr 28 2017, 8:02 PM

• Fjalapeno mentioned this in T113094: [EPIC] The Page Summary API needs to provide useful content for the majority of articles.May 1 2017, 2:39 PM

For completions sake:

We had meeting and created a few subtasks for this ticket.

This ticket should be completed after T164032 which documents the markup before implementing it

waldyrious subscribed.May 1 2017, 4:57 PM

• Fjalapeno mentioned this in T164236: Include size of article in Summary.May 2 2017, 10:27 PM

• Fjalapeno moved this task from To Do to Doing on the Product-Infrastructure-Team-Backlog-Deprecated (Kanban) board.May 10 2017, 4:11 PM

Change 358646 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/mobileapps@master] Preserve geo information

https://gerrit.wikimedia.org/r/358646

gerritbot added a project: Patch-For-Review.Jun 13 2017, 5:54 PM

Change 352963 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/mobileapps@master] Add read-html route

https://gerrit.wikimedia.org/r/352963

• bearND created subtask T167921: Support Lazy loading of page content not needed for first paint .Jun 14 2017, 8:24 PM

Change 358646 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Preserve geo information

https://gerrit.wikimedia.org/r/358646

Jdlrobson closed subtask T78734: Support for mobile's section use cases in Parsoid HTML as Declined.Jun 16 2017, 7:18 PM

• bearND mentioned this in T114072: <section> tags for MediaWiki sections.Jun 20 2017, 5:12 PM

• bearND moved this task from Doing to Code Review on the Product-Infrastructure-Team-Backlog-Deprecated (Kanban) board.Jun 20 2017, 5:32 PM

• Fjalapeno added a parent task: T169242: Develop Page Content Service for Reading Clients.Jun 29 2017, 3:38 PM

• Fjalapeno added a project: Page Content Service.Jul 11 2017, 4:18 PM

• GWicke mentioned this in T114402: Implement something similar to the RESTBase 'section' API to provide wikitext structure information.Jul 11 2017, 8:16 PM

• GWicke added a subtask: T168875: Remove reference sections at end of article from page content HTML.Jul 12 2017, 3:13 PM

• Fjalapeno renamed this task from Extract HTML API from MCS Page Content API to Extract HTML Compatibility Layer from MCS Mobile Sections API.Jul 13 2017, 2:41 PM

• Fjalapeno added a parent task: T162185: Extract JSON API from MCS Page Content API.Jul 14 2017, 4:20 PM

• GWicke edited projects, added Services (watching); removed Services (designing).Aug 8 2017, 9:36 PM

Change 352963 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Add read-html route

https://gerrit.wikimedia.org/r/352963

I currently have two routes: read-base-html and read-html.

read-base-html is the one closest to Parsoid markup, with some minor changes, mostly the sectioning artifacts. (Since Parsoid is going to implement sectioning as well there is a decent chance we could get rid of this endpoint eventually, and use Parsoid directly.)

read-html: Is basically what the previous endpoint provides plus some size optimizations: stripReferenceListContent + stripUnneededMarkup (=unneeded for reading)

Let me know if you come up with any better names.

Change 370295 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/mobileapps@master] Add read-base-html route

https://gerrit.wikimedia.org/r/370295

Since you asked for bikeshedding.. How about

read-html: Base HTML, with extra reading markup.
read-html-min: Minified base HTML, with destructive removals.
- alternative: read-html-stripped

Pros:

More consistent / suggestive about stacking relationship.
Calls out size difference ("min"), similar to JS minification.

Cons:

Longer name for minified HTML.

Change 370295 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Add read-base-html route

https://gerrit.wikimedia.org/r/370295

@bearND @Fjalapeno

I noticed that https://en.m.wikipedia.org/api/rest_v1/page/mobile-sections/Barack_Obama/ is returning giant tables which are never shown on mobile. On the Obama article the "This article is part of a series about..." table is one such example.

Is that issue encompassed by this ticket or should I file a separate one?

@Mhurd a separate issue with details would be better.

@bearND cool will file separate

@bearND new ticket: T175092