Evaluate difficulty of porting PCS summary logic to PHP
Closed, Resolved · Public

Description

One of the options for T213505: RfC: OpenGraph descriptions in wiki pages is to port the existing description parsing logic in the Page Content Service summary endpoint to PHP and do it in MediaWiki during parsing or page rendering; this task is about estimating the difficulty of that.

Event Timeline

The process of creating the summary looks like this (a rough PHP sketch of a few of these steps follows the list):

  1. get Parsoid HTML
  2. split via <section>, get lead section (parsoidSections.createDocumentFromLeadSection)
  3. remove dewiki IPA markup (stripGermanIPA.js)
  4. remove elements matching certain CSS selectors (rmElements.js, conf)
  5. remove some useless spans, e.g. replace <span>[</span>1<span>]</span> with [1] (rmBracketSpans.js)
  6. remove comment nodes (rmComments.js)
  7. remove some common HTML attributes (rmAttributes.js, conf)
  8. remove the mw* ids generated by Parsoid (rmMwIdAttributes.js)
  9. fetch the contents of the first non-empty <p> block + any trailing content until the next <p> (extractLeadIntroduction.js)
  10. turn links into spans, drop spans which do not do any styling (summarize.js#37-38)
  11. remove elements matching certain CSS selectors (summarize.js#39-44)
  12. remove Parsoid-specific HTML attributes (summarize.js#45)
  13. if the lead does not seem to contain math formulas, remove parenthesized segments that are nested inside other parentheses, that contain a space and are directly preceded by a space/nbsp, or that are (or have become) empty (summarize.js#46-67 and 17-27)
  14. collapse whitespace (summarize.js#69-73, summarize.js#78-79)
  15. call sanitize-html, which will discard non-text tags like <script>, and then remove a bunch of non-whitelisted tags while keeping their contents, and remove a bunch of non-whitelisted attributes/styles from allowed tags (sanitizeSummary.js)
  16. remove space/nbsp before punctuation in some cases (summarize.js#81-88)
  17. do a standard DOM HTML-to-text transformation on the results (summarize.js#93)
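To give a sense of how small the text-only subset is, here is a rough sketch of steps 4, 9, and 17 in plain PHP using the DOM extension. The `$leadSectionHtml` input and the removal queries are illustrative placeholders, not the actual PCS configuration; the real config uses CSS selectors, which would need a translation step to XPath (e.g. via symfony/css-selector), since DOMXPath only speaks XPath.

```
<?php
// Rough sketch of steps 4, 9 and 17 on a lead-section HTML string.
// $leadSectionHtml and the removal queries are illustrative placeholders.

$doc = new DOMDocument();
// Encoding handling is glossed over here; suppress warnings about
// HTML5 elements the libxml parser doesn't know.
@$doc->loadHTML($leadSectionHtml);
$xpath = new DOMXPath($doc);

// Step 4: remove elements matching the configured selectors.
$removalQueries = [
    '//table',                              // e.g. infoboxes
    '//*[contains(@class, "noexcerpt")]',   // CSS: .noexcerpt
];
foreach ($removalQueries as $query) {
    // Snapshot the node list first; removing nodes from a live
    // list while iterating it is unsafe.
    foreach (iterator_to_array($xpath->query($query)) as $node) {
        $node->parentNode->removeChild($node);
    }
}

// Step 9: the first non-empty <p>, plus trailing siblings up to the next <p>.
$intro = [];
foreach ($xpath->query('//p') as $p) {
    if (trim($p->textContent) === '') {
        continue;
    }
    $intro[] = $p;
    for ($n = $p->nextSibling; $n !== null && $n->nodeName !== 'p'; $n = $n->nextSibling) {
        $intro[] = $n;
    }
    break;
}

// Step 17: the standard DOM HTML-to-text transformation is just textContent.
$text = implode('', array_map(fn($n) => $n->textContent, $intro));
```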

This is pretty complex, but most of it is done to produce a safe and readable HTML extract. og:description is text-only (and probably whitespace-insensitive), so only steps 1-4, 9, 11, 13, 17, and a small part of 15 apply. Almost all of that is simple, more configuration than code (remove elements matching certain CSS selectors, drill down to a part of the document that could be expressed with an XPath, do the DOM spec's HTML-to-text). The exceptions are steps 3 and 13, but step 3 becomes very simple when you don't care about HTML (just replace IPA blocks with some special Unicode character, then, after the rest of the transforms are over, drop the replacement plus the preceding/following brackets), so step 13 is the only complicated logic that would have to be duplicated, and that's just a series of regular expressions.

Altogether that's a few dozen lines of logic and another few dozen lines of configuration, assuming there's support for manipulating an HTML document via CSS selectors and standard DOM commands. The code would have to be maintained in two places (since the generation of HTML descriptions would remain in PCS, and some of these steps are used by other endpoints as well), but even so it does not seem like a huge burden.
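Continuing the sketch above (reusing its `$doc`, `$xpath` and `$text`), the simplified step 3 could look something like this. The placeholder choice (U+E000, a private-use codepoint) and the IPA query are assumptions for illustration; the real stripGermanIPA.js likely matches differently, e.g. on Parsoid's data-mw attributes.

```
// Hypothetical sketch of the simplified step 3: swap IPA blocks for a
// placeholder early, strip it (plus surrounding brackets) at the end.
// U+E000 is an arbitrary private-use placeholder; the query is illustrative.
const IPA_PLACEHOLDER = "\u{E000}";

// Early pass, before the other transforms: replace each IPA block with
// the placeholder character.
foreach (iterator_to_array($xpath->query('//span[contains(@class, "IPA")]')) as $node) {
    $node->parentNode->replaceChild($doc->createTextNode(IPA_PLACEHOLDER), $node);
}

// Late pass, on the extracted plain text: drop the placeholder together
// with any preceding/following brackets and surrounding whitespace.
$text = preg_replace('/\s*[\[(]?\x{E000}[\])]?\s*/u', ' ', $text);
```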

A precondition for doing this is that the code needs to be fast enough to run during a parse or render (if it has to be done asynchronously, we can just use PCS as it is now). Current PCS latency is 200ms p50, 250ms p75, 1s p95, 50s p99. That's obviously unworkable, but given that we'd omit most of the processing that's currently done, processing time might become significantly shorter. Also, much of the remaining processing could probably be optimized (e.g. the element removal does a separate DOM pass for every element).
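As a concrete example of that optimization, and again assuming the port uses DOMXPath as sketched above, the separate removal passes could be collapsed into a single traversal by unioning the queries (the query strings are illustrative):

```
// One way to avoid a separate DOM pass per removal rule: union all the
// queries into a single XPath expression, so one traversal collects
// every node to drop.
$combined = implode(' | ', [
    '//table',
    '//*[contains(@class, "noexcerpt")]',
    '//comment()',   // step 6 (comment removal) can ride along in the same pass
]);
foreach (iterator_to_array($xpath->query($combined)) as $node) {
    $node->parentNode->removeChild($node);
}
```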

Another question is whether this needs to use Parsoid HTML (in which case it would be blocked on the porting of Parsoid to MediaWiki). At a glance, not really (although there are a few Parsoid-specific selectors like [data-mw*="target":{"wt":"IPA"], those can be worked around by adding better markup in the templates), but there might be differences between the HTML structure generated by MediaWiki and Parsoid that would break the current logic; that's hard to tell without doing and testing it.

PCS has a good amount of unit tests for this; it might be worth porting those as well. There is also a growing list of page titles that are good candidates for testing summaries.

Deprioritizing; I think it makes sense to run the RfC first and see which options are acceptable in the first place.

Aklapper removed Tgr as the assignee of this task. Jul 2 2021, 5:24 AM

Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to the assignee on May 26 and Jun 17, and T270544). Please assign this task to yourself again if you still realistically plan to work on this task - it would be very welcome!

(See https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator.)

Note also that doing proper language converter redirects (T240068 is one example) and handling flagged revisions (T209936) become a lot easier if this code is moved to core (in PHP). It would also avoid a round trip, since both the Parsoid and legacy HTML are cached in core.

(There's another LanguageConverter-related bug somewhere, which I can't find right now, where a request to the summary endpoint for title [[A]] from mobile doesn't return any results because the title actually exists in the DB as [[B]], where B is the language-converted form of A.)

T277059: [Bug] Blue links are broken on the Serbian (Latin) Burek article on iOS and Android

Recent discussions with other teams interested in the summary endpoint have progressed towards extracting the logic into a JS library instead, so that other teams can build their own interfaces on top of the summary extraction logic.

I'll add this to the icebox to revisit in the future, since we might want to close this as invalid depending on the results of the aforementioned effort.

@Jgiannelos and I have evaluated the difficulty of porting PCS summary logic to PHP and included our findings in this doc.