
Use Parsoid for new mobile-html-section routes
Closed, Resolved · Public

Description

Then we don't need to change the mobileview API (T106143). We would also get better-structured HTML, so we can be more deliberate in our DOM transformations.
The first problem here is to split the big HTML content into sections.
Another problem is whether we can get all the metadata we currently get from mobileview. It would be good not to have to call mobileview as well.

See https://en.wikipedia.org/api/rest_v1/page/mobile-html-sections-lead/Cat and https://en.wikipedia.org/api/rest_v1/page/mobile-html-sections-lead/Main_page as examples.

Some examples of the metadata we get from mobileview:

  • displaytitle
  • (Wikidata) description
  • page id
  • revision
  • lastmodified
  • protection
  • languagecount
  • information about the lead image
  • page sections
  • pageimages

booleans:

  • editable
  • isMainPage
  • (isDisambiguation -- would be nice but doesn't even seem to work with MW API)

Details

Related Gerrit Patches:
mediawiki/services/parsoid : master · Add optional API feature to emit <section> tags
mediawiki/services/mobileapps : master · Use Parsoid for new mobile-html-section routes

Event Timeline

bearND created this task. · Aug 12 2015, 2:25 AM
bearND raised the priority of this task from to Needs Triage.
bearND updated the task description.
bearND added a subscriber: bearND.
Restricted Application added a subscriber: Aklapper. · Aug 12 2015, 2:25 AM
bearND set Security to None.
bearND updated the task description. · Aug 25 2015, 7:26 PM
GWicke updated the task description. · Aug 25 2015, 7:46 PM
GWicke updated the task description.
GWicke added a comment. · Edited · Aug 25 2015, 8:33 PM

I think it's worth separating

a) HTML and directly related metadata like sections, and
b) other data like Wikidata descriptions.

You can switch a) to use Parsoid information, while continuing to retrieve b) from action=mobileview. This would give you the benefits of a stable DOM spec, and should also reduce the overall request latency by making the action API request slightly cheaper, and loading the Parsoid HTML from storage in parallel.
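The parallel loading described above could look roughly like the following. This is only a sketch: `load_page` and the two fetcher callables are hypothetical names, and the stand-in fetchers return canned values so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict


def load_page(title: str,
              fetch_parsoid_html: Callable[[str], str],
              fetch_mobileview_meta: Callable[[str], Dict[str, Any]]) -> Dict[str, Any]:
    """Fetch Parsoid HTML and mobileview metadata concurrently, then merge."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        html_future = pool.submit(fetch_parsoid_html, title)
        meta_future = pool.submit(fetch_mobileview_meta, title)
        # result() blocks until done and re-raises any worker exception.
        return {"html": html_future.result(), "meta": meta_future.result()}


# Hypothetical stand-ins for the real HTTP requests.
def fake_parsoid(title: str) -> str:
    return "<html><body><p>%s</p></body></html>" % title


def fake_mobileview(title: str) -> Dict[str, Any]:
    return {"description": "placeholder", "protection": {}}


page = load_page("Cat", fake_parsoid, fake_mobileview)
print(page["meta"])
```

Because both requests are in flight at the same time, the total latency is roughly that of the slower request rather than the sum of the two.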

I think the following bits of information are already available in the Parsoid DOM:

  • displaytitle
  • revision
  • lastmodified
  • page sections (but we would need to write some code to match section semantics)
  • list of images
  • isDisambiguation
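As a rough illustration of reading such metadata out of the document head, here is a minimal sketch using only the Python standard library. The `HeadMetaParser` class is hypothetical, and the property names in the sample (`dc:modified`, `example:revision`) are placeholders for illustration; the actual names would need to be checked against the Parsoid DOM spec.

```python
from html.parser import HTMLParser


class HeadMetaParser(HTMLParser):
    """Collect <meta property=... content=...> pairs and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "property" in attrs:
            self.meta[attrs["property"]] = attrs.get("content", "")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Made-up sample; the property names are assumptions, not the real spec.
SAMPLE = (
    '<html><head>'
    '<meta property="dc:modified" content="2015-08-25T19:46:00.000Z"/>'
    '<meta property="example:revision" content="123"/>'
    '<title>Cat</title></head><body></body></html>'
)

parser = HeadMetaParser()
parser.feed(SAMPLE)
parser.close()
print(parser.title, parser.meta)
```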
GWicke added a project: RESTBase-API.
GWicke edited subscribers, added: mobrovac, ssastry, cscott, Arlolra; removed: Aklapper.
bearND triaged this task as Normal priority. · Sep 1 2015, 9:55 PM
bearND moved this task from Incoming to Backlog on the Mobile-Content-Service board.
bearND updated the task description. · Oct 5 2015, 11:30 PM

@cscott and @GWicke Any hints on the best way to split the Parsoid output into sections? Should I just look for the <h2>, <h3>, <h4>, <h5>, <h6> tags?
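For reference, the heading-based approach asked about here could be sketched as follows, using only the Python standard library. The names `HeadingLocator` and `split_sections` are hypothetical, and the sketch cuts at every `<h2>` to `<h6>` regardless of nesting, so a heading inside a table or blockquote would also start a new section, which may not match MediaWiki's section semantics.

```python
from html.parser import HTMLParser

HEADINGS = {"h2", "h3", "h4", "h5", "h6"}


class HeadingLocator(HTMLParser):
    """Record the source offset, level, and text of every <h2>..<h6>."""

    def __init__(self, source):
        super().__init__()
        # Start index of each line, to turn getpos() into an absolute offset.
        self._line_starts = [0]
        for i, ch in enumerate(source):
            if ch == "\n":
                self._line_starts.append(i + 1)
        self.headings = []   # dicts with "offset", "level", "title"
        self._open = None    # heading currently being read, if any
        self.feed(source)
        self.close()

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            line, col = self.getpos()
            self._open = {"offset": self._line_starts[line - 1] + col,
                          "level": int(tag[1]), "title": ""}

    def handle_data(self, data):
        if self._open is not None:
            self._open["title"] += data

    def handle_endtag(self, tag):
        if tag in HEADINGS and self._open is not None:
            self.headings.append(self._open)
            self._open = None


def split_sections(body_html):
    """Split body HTML into a lead section plus one section per heading."""
    heads = HeadingLocator(body_html).headings
    cuts = [h["offset"] for h in heads] + [len(body_html)]
    sections = [{"id": 0, "text": body_html[:cuts[0]]}]
    for i, h in enumerate(heads):
        sections.append({"id": i + 1, "level": h["level"], "title": h["title"],
                         "text": body_html[cuts[i]:cuts[i + 1]]})
    return sections


sample = "<p>Lead</p>\n<h2>History</h2><p>a</p><h3>Early</h3><p>b</p>"
sections = split_sections(sample)
for s in sections:
    print(s["id"], s.get("title"), len(s["text"]))
```

Each non-lead section keeps its heading element at the start of its text; whether the heading belongs inside or outside the section text is a design choice this sketch does not settle.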

bearND claimed this task. · Oct 5 2015, 11:52 PM
bearND moved this task from Backlog to Doing on the Mobile-Content-Service board.

@bearND, I'm not sure about the Parsoid team's timeline for this, but this should become easier with T114072: <section> tags for MediaWiki sections. I hope that they can prioritize this, as it will help several use cases including yours.

cscott added a comment. · Oct 6 2015, 4:15 PM

It's possible we could implement T114072 as an optional feature in Parsoid that you could request with a special API option. It's trivial to do the necessary DOM scoping in Parsoid. That might be the best way to unblock @bearND, as well as to gain some experience with the feature before we turn it on in PHP land.
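To make the scoping idea concrete, here is one way the nesting of sections could be derived from heading levels. This is an illustrative sketch only, not Parsoid's implementation; `section_parents` is a hypothetical helper that, for each heading, returns the index of the section that would enclose it (or `None` for a top-level section).

```python
def section_parents(levels):
    """Map each heading level to the index of its enclosing section."""
    parents = []
    stack = []  # indexes of currently open sections, outermost first
    for i, level in enumerate(levels):
        # A heading closes every open section of the same or a deeper level.
        while stack and levels[stack[-1]] >= level:
            stack.pop()
        parents.append(stack[-1] if stack else None)
        stack.append(i)
    return parents


# Heading levels in document order, e.g. h2, h3, h3, h2, h4.
parents = section_parents([2, 3, 3, 2, 4])
print(parents)
```

A `<section>` wrapper for heading `i` would then open at that heading and close just before the next heading whose level is the same or shallower.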

@cscott, it would be a lot nicer if this were on by default, as we'd avoid doubling our storage and parse job requirements. It'll require proper testing to establish the impact, but maybe there is a way to introduce it gradually, with safe cases (only top-level sections?) being handled first?

Change 243954 had a related patch set uploaded (by Cscott):
Add optional API feature to emit <section> tags

https://gerrit.wikimedia.org/r/243954

One other aspect to consider here is that the Mobile Content Service will need to request the Parsoid DOM from RESTBase. However, the service itself has no public DNS records, which means that virtually all requests will reach it through RESTBase. Hence, I think it'd be worth adding POST methods to the routes, which would allow RESTBase to send the Parsoid DOM directly to the Mobile Content Service and so eliminate potential request loops between the service and RESTBase. The handler logic can be shared between GET /mobile-html-sections and POST /mobile-html-sections, with the difference that the GET handler would still need to request the content from RESTBase, while the POST handler would jump right into manipulation. The same can be applied to all of the other routes using the Parsoid DOM as well.
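The shared-handler idea could be sketched like this. All names (`transform`, `handle_get`, `handle_post`, `fetch_from_restbase`) are hypothetical, and `transform` is only a placeholder for the real DOM manipulation.

```python
from typing import Callable, Dict


def transform(parsoid_html: str) -> Dict[str, object]:
    """Shared logic; stands in for the real DOM manipulation."""
    return {"length": len(parsoid_html), "html": parsoid_html.strip()}


def handle_get(title: str,
               fetch_from_restbase: Callable[[str], str]) -> Dict[str, object]:
    """GET: the service first has to fetch the Parsoid HTML itself."""
    return transform(fetch_from_restbase(title))


def handle_post(request_body: str) -> Dict[str, object]:
    """POST: the caller already supplies the Parsoid HTML in the body."""
    return transform(request_body)


html = "<p>x</p>"
via_get = handle_get("Cat", lambda title: html)  # canned fetcher, no network
via_post = handle_post(html)
print(via_get == via_post)
```

Both entry points end in the same `transform` call, so the only difference is where the Parsoid HTML comes from.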

Change 246100 had a related patch set uploaded (by BearND):
Use Parsoid for new mobile-html-section routes

https://gerrit.wikimedia.org/r/246100

bearND moved this task from Doing to Code Review on the Mobile-App-Android-Sprint-69-Thulium board.

Change 246100 merged by Mobrovac:
Use Parsoid for new mobile-html-section routes

https://gerrit.wikimedia.org/r/246100

bearND closed this task as Resolved. · Nov 6 2015, 4:57 AM
bearND moved this task from Code Review to To Deploy on the Mobile-Content-Service board.