
Use Parsoid for new mobile-html-section routes
Closed, Resolved (Public)

Description

Then we don't need to change the mobileview API (T106143). We would also get better-structured HTML, so we can be more deliberate in our DOM transformations.
The first problem here is to split the big HTML content into sections.
Another problem is to see whether we can get all the metadata we currently get from mobileview. It would be good not to have to call mobileview as well.

See https://en.wikipedia.org/api/rest_v1/page/mobile-html-sections-lead/Cat and https://en.wikipedia.org/api/rest_v1/page/mobile-html-sections-lead/Main_page for examples.

Some examples of the metadata we get from mobileview:

  • displaytitle
  • (Wikidata) description
  • page id
  • revision
  • lastmodified
  • protection
  • languagecount
  • information about the lead image
  • page sections
  • pageimages

booleans:

  • editable
  • isMainPage
  • (isDisambiguation -- would be nice but doesn't even seem to work with MW API)

Event Timeline

bearND raised the priority of this task to Needs Triage.
bearND updated the task description.
bearND added a subscriber: bearND.

I think it's worth separating

a) HTML and directly related metadata like sections, and
b) other data like Wikidata descriptions.

You can switch a) to use Parsoid information, while continuing to retrieve b) from action=mobileview. This would give you the benefits of a stable DOM spec, and should also reduce the overall request latency, both by making the action API request slightly cheaper and by loading the Parsoid HTML from storage in parallel.
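
A minimal sketch of the parallel loading described above, assuming two caller-supplied fetch functions. The names `load_page`, `fetch_parsoid_html`, and `fetch_mobileview_meta` are hypothetical and not part of any existing service; the real requests would go to RESTBase storage and to action=mobileview respectively.

```python
from concurrent.futures import ThreadPoolExecutor


def load_page(title, fetch_parsoid_html, fetch_mobileview_meta):
    """Fetch a) the Parsoid HTML and b) the remaining metadata concurrently.

    Both fetchers are hypothetical callables taking a page title; they are
    injected so the sketch stays independent of any real HTTP client.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        html_future = pool.submit(fetch_parsoid_html, title)
        meta_future = pool.submit(fetch_mobileview_meta, title)
        # .result() blocks until each request finishes and re-raises errors.
        return {"html": html_future.result(), "meta": meta_future.result()}
```

With this shape, the overall latency is roughly that of the slower of the two requests rather than their sum.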

I think the following bits of information are already available in the Parsoid DOM:

  • displaytitle
  • revision
  • lastmodified
  • page sections (but, would need to write some code to match section semantics)
  • list of images
  • isDisambiguation

bearND triaged this task as Medium priority. Sep 1 2015, 9:55 PM
bearND moved this task from Incoming to Backlog on the Mobile-Content-Service board.

@cscott and @GWicke Any hints on the best way to split the Parsoid output into sections? Should I just look for the <h2>, <h3>, <h4>, <h5>, <h6> tags?
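
For illustration, a rough sketch of the heading-based split asked about above, using only Python's standard `html.parser`. It is not how the Mobile Content Service does it. It assumes the input is the inner HTML of `<body>`, splits only on headings that are not nested inside another element (so a heading inside a table cell does not start a section), and drops the heading's own attributes. The names `SectionSplitter`, `split_sections`, and the `level`/`title`/`html` keys are made up for this example.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h2", "h3", "h4", "h5", "h6"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SectionSplitter(HTMLParser):
    """Split body-level HTML into a lead section plus one per heading."""

    def __init__(self):
        # Keep character references raw so they can be re-emitted unchanged.
        super().__init__(convert_charrefs=False)
        self.sections = [{"level": 1, "title": "", "html": ""}]  # lead section
        self.depth = 0           # number of currently open non-void elements
        self.in_heading = False  # True while inside a top-level heading

    def _emit(self, text):
        key = "title" if self.in_heading else "html"
        self.sections[-1][key] += text

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS and self.depth == 0:
            # A top-level heading opens a new section.
            self.sections.append({"level": int(tag[1]), "title": "", "html": ""})
            self.in_heading = True
            self.depth += 1
            return
        self._emit(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the depth.
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        self.depth = max(0, self.depth - 1)
        if self.in_heading and self.depth == 0 and tag in HEADING_TAGS:
            self.in_heading = False  # heading closed; body content follows
            return
        self._emit(f"</{tag}>")

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")


def split_sections(body_html):
    parser = SectionSplitter()
    parser.feed(body_html)
    parser.close()
    return parser.sections
```

Calling `split_sections(body_html)` returns a list whose first entry is the lead section; matching the exact section semantics of mobileview would need more work than this.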

@bearND, I'm not sure about the parsoid team's timeline for this, but this should become easier with T114072: <section> tags for MediaWiki sections. I hope that they can prioritize this, as it will help several use cases including yours.

It's possible we could implement T114072 as an optional feature in Parsoid that you could request with a special API option. It's trivial to do the necessary DOM scoping in Parsoid. That might be the best way to unblock @bearND, as well as to gain some experience with the feature before we turn it on in PHP land.

@cscott, it would be a lot nicer if this was on by default, as we'd avoid doubling our storage and parse job requirements. It'll require proper testing to establish the impact, but maybe there is a way to introduce it gradually, with safe cases (only top-level sections?) being handled first?

Change 243954 had a related patch set uploaded (by Cscott):
Add optional API feature to emit <section> tags

https://gerrit.wikimedia.org/r/243954

One other aspect to consider here is that the Mobile Content Service will need to request the Parsoid DOM from RESTBase. Alas, the service itself has no public DNS records, which means that virtually all requests will arrive through RESTBase. Hence, I think it'd be worth adding POST methods to the routes, which would allow RESTBase to send the Parsoid DOM directly to the Mobile Content Service, thereby eliminating potential request loops between the service and RESTBase. The handler logic can be shared between GET /mobile-html-sections and POST /mobile-html-sections, with the difference that the GET handler would still need to request the content from RESTBase, while the POST one would jump right into manipulation. The same can be applied to all of the other routes using the Parsoid DOM as well.
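
The shared-handler idea above could look roughly like the following sketch. All names here (`build_sections_response`, `handle_get`, `handle_post`, `fetch_from_restbase`) are hypothetical, and the transformation itself is only a stand-in; the point is that GET and POST differ only in where the Parsoid HTML comes from.

```python
def build_sections_response(parsoid_html):
    """Shared logic: a stand-in for the real DOM manipulation (hypothetical)."""
    return {"html": parsoid_html.strip()}


def handle_get(title, fetch_from_restbase):
    """GET: the service still has to request the Parsoid HTML itself.

    fetch_from_restbase is an assumed callable taking a page title.
    """
    return build_sections_response(fetch_from_restbase(title))


def handle_post(parsoid_html):
    """POST: the caller already supplied the Parsoid HTML in the body."""
    return build_sections_response(parsoid_html)
```

Because both handlers call the same function, the POST route avoids the extra round trip without duplicating any transformation code.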

Change 246100 had a related patch set uploaded (by BearND):
Use Parsoid for new mobile-html-section routes

https://gerrit.wikimedia.org/r/246100

Change 246100 merged by Mobrovac:
Use Parsoid for new mobile-html-section routes

https://gerrit.wikimedia.org/r/246100