
Evaluate difficulty of porting PCS summary logic to PHP
Open, Low, Public

Description

One of the options for T213505: RfC: OpenGraph descriptions in wiki pages is to port the existing description parsing logic in the Page Content Service summary endpoint to PHP and do it in MediaWiki during parsing or page rendering; this task is about estimating the difficulty of that.

Event Timeline

Tgr created this task. · Jan 17 2019, 2:38 AM
Tgr added a comment. · Jan 17 2019, 3:36 AM

The process of creating the summary looks like this (a rough PHP sketch of a few of these steps follows the list):

  1. get Parsoid HTML
  2. split via <section>, get lead section (parsoidSections.createDocumentFromLeadSection)
  3. remove dewiki IPA markup (stripGermanIPA.js)
  4. remove elements matching certain CSS selectors (rmElements.js, conf)
  5. remove some useless spans, e.g. replace <span>[</span>1<span>]</span> with [1] (rmBracketSpans.js)
  6. remove comment nodes (rmComments.js)
  7. remove some common HTML attributes (rmAttributes.js, conf)
  8. remove the mw* ids generated by Parsoid (rmMwIdAttributes.js)
  9. fetch the contents of the first non-empty <p> block + any trailing content until the next <p> (extractLeadIntroduction.js)
  10. turn links into spans, drop spans which do not do any styling (summarize.js#37-38)
  11. remove elements matching certain CSS selectors (summarize.js#39-44)
  12. remove Parsoid-specific HTML attributes (summarize.js#45)
  13. if the lead does not appear to contain math formulas, remove parenthesized segments that are nested inside other parentheses, that have a space inside and a space/nbsp right before them, or that are (or have become) empty (summarize.js#46-67 and 17-27)
  14. collapse whitespace (summarize.js#69-73, summarize.js#78-79)
  15. call sanitize-html, which will discard non-text tags like <script>, then remove a number of non-whitelisted tags while keeping their contents, and strip non-whitelisted attributes/styles from the allowed tags (sanitizeSummary.js)
  16. remove space/nbsp before punctuation in some cases (summarize.js#81-88)
  17. do a standard DOM HTML-to-text transformation on the results (summarize.js#93)
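
To make the porting effort concrete, here is a minimal PHP sketch of steps 4/11, 9, 14 and 17, using only the core DOM extension. The selector list, the CSS-to-XPath translations, and the $leadSectionHtml input variable are assumptions for illustration, not the actual PCS configuration:

```php
<?php
$doc = new DOMDocument();
// Silence libxml warnings about HTML5 elements it does not recognize;
// the XML prolog trick forces UTF-8 interpretation of the input.
@$doc->loadHTML( '<?xml encoding="utf-8"?>' . $leadSectionHtml );
$xpath = new DOMXPath( $doc );

// Steps 4/11: remove elements matching certain selectors (expressed as
// XPath, since the core DOM extension has no CSS selector support).
$unwanted = [
    '//table',
    '//*[contains(concat(" ", normalize-space(@class), " "), " noexcerpt ")]',
];
foreach ( $unwanted as $expr ) {
    // Snapshot the node list first; removing nodes from a live list
    // mid-loop can skip siblings.
    foreach ( iterator_to_array( $xpath->query( $expr ) ) as $node ) {
        $node->parentNode->removeChild( $node );
    }
}

// Step 9 (simplified): take the first non-empty <p>. The real logic also
// keeps trailing siblings up to the next <p>.
$lead = null;
foreach ( $xpath->query( '//p' ) as $p ) {
    if ( trim( $p->textContent ) !== '' ) {
        $lead = $p;
        break;
    }
}

// Steps 14 and 17: DOM-to-text via textContent, then collapse whitespace.
$text = $lead ? $lead->textContent : '';
$text = trim( preg_replace( '/\s+/u', ' ', $text ) );
```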

This is pretty complex, but most of it exists to produce a safe and readable HTML extract. og:description is text-only (and probably whitespace-insensitive), so only steps 1-4, 9, 11, 13, 17, and a small part of 15 apply. Almost all of that is simple, more configuration than code: remove elements matching certain CSS selectors, drill down to a part of the document that could be expressed with an XPath, and do the DOM spec's HTML-to-text conversion. The exceptions are steps 3 and 13, but step 3 becomes very simple when you don't care about HTML: just replace IPA blocks with some special Unicode character, then, after the rest of the transforms are done, drop the replacement plus the preceding/following brackets (a sketch of that trick follows below). That leaves step 13 as the only complicated logic that would have to be duplicated, and that's just a series of regular expressions.

So altogether it's a few dozen lines of logic and another few dozen lines of configuration, assuming there's support for manipulating an HTML document via CSS selectors and standard DOM commands. The code would have to be maintained in two places (since the generation of HTML descriptions would remain in PCS, and some of these steps are used by other endpoints as well), but even so it does not seem like a huge burden.
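
A minimal sketch of that IPA trick, assuming the IPA spans can be found by class; both the class test and the sentinel character are made up, not what PCS actually matches on:

```php
<?php
const IPA_SENTINEL = "\u{F8FF}"; // a Private Use Area codepoint

// Before the other transforms: swap each IPA span for the sentinel.
function replaceIpaSpans( DOMXPath $xpath ): void {
    $ipaSpans = $xpath->query(
        '//span[contains(concat(" ", normalize-space(@class), " "), " IPA ")]'
    );
    foreach ( iterator_to_array( $ipaSpans ) as $node ) {
        $node->parentNode->replaceChild(
            $node->ownerDocument->createTextNode( IPA_SENTINEL ),
            $node
        );
    }
}

// After the other transforms: drop the sentinel along with a directly
// preceding/following bracket.
function stripIpaSentinel( string $text ): string {
    return preg_replace( '/[(\[]?\x{F8FF}[)\]]?/u', '', $text );
}
```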

Tgr added a comment.Jan 17 2019, 10:53 PM

A precondition for doing this is that the code needs to be fast enough to run during a parse or render (if it has to be done asynchronously, we can just use PCS as it is now). Current PCS latency is 200ms p50, 250ms p75, 1s p95, 50s p99. That tail is obviously unworkable, but since we'd omit most of the processing that's currently done, processing time might become significantly shorter. Also, much of the remaining processing could probably be optimized; for example, the element removal currently does a separate DOM pass for every element (see the sketch below for one way to merge those passes).
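
For instance, the removal expressions could be unioned into a single XPath query so the document is walked once instead of once per selector; a rough sketch, with made-up expressions:

```php
<?php
// $xpath is a DOMXPath over the parsed lead section, as in the earlier sketch.
$unwanted = [ '//table', '//style', '//sup[@class="reference"]' ];
// XPath's | operator evaluates all expressions in one query.
$combined = implode( ' | ', $unwanted );
foreach ( iterator_to_array( $xpath->query( $combined ) ) as $node ) {
    $node->parentNode->removeChild( $node );
}
```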

Another question is whether this needs to use Parsoid HTML (in which case it would be blocked on the porting of Parsoid to MediaWiki). At a glance, not really: there are a few Parsoid-specific selectors like [data-mw*="target":{"wt":"IPA"], but those can be worked around by adding better markup in the templates (see the sketch below). Still, there might be differences between the HTML structure generated by MediaWiki and by Parsoid that would break the current logic; that's hard to tell without actually implementing and testing it.
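
To illustrate the difference, here are the two flavours of the IPA selector side by side; the portable class name is hypothetical and would first have to be emitted by the templates:

```php
<?php
// Only works on Parsoid HTML: matches against the serialized template
// metadata in the data-mw attribute, which core parser HTML lacks.
$parsoidOnly = '//*[contains(@data-mw, \'"wt":"IPA"\')]';

// Would work on both HTML flavours if the IPA templates emitted a stable
// class (hypothetical name).
$portable = '//*[contains(concat(" ", normalize-space(@class), " "), " ipa-pron ")]';
```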

bearND added a subscriber: bearND. · Jan 19 2019, 12:22 AM

PCS has a good amount of unit tests for this; it might be worth porting those as well (a sketch of what that could look like is below). There is also a growing list of page titles that are good candidates for testing summaries.
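
A rough sketch of what a ported fixture-style test could look like with PHPUnit; the summarize() function and the expected outputs are hypothetical placeholders, not the real PCS fixtures:

```php
<?php
use PHPUnit\Framework\TestCase;

class SummarizeTest extends TestCase {

    /** @dataProvider provideSummaries */
    public function testSummarize( string $inputHtml, string $expectedText ): void {
        // summarize() stands in for the ported pipeline's entry point.
        $this->assertSame( $expectedText, summarize( $inputHtml ) );
    }

    public static function provideSummaries(): array {
        return [
            'drops nested parenthetical' => [
                '<p>Paris (French: (pari)) is the capital of France.</p>',
                'Paris is the capital of France.',
            ],
            'collapses whitespace' => [
                "<p>Foo   bar.</p>",
                'Foo bar.',
            ],
        ];
    }
}
```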

Deprioritizing; I think it makes sense to run the RfC first and see which options are acceptable in the first place.

Jhernandez triaged this task as Low priority. · Feb 6 2019, 4:37 PM