Page MenuHomePhabricator

HtmlFormatter library: Code stewardship review
Open, Needs TriagePublic

Description

A succinct problem statement to give context for why the review was initiated.

The HtmlFormatter library is bundled inside mediawiki core and supports functionality in MobileFrontend that allows the mobile site to make some mobile-critical transformations to the HTML code - namely the removal of content that is not mobile friendly (removal of DOM nodes), lazy loading images and moving the infobox below the first lead paragraph (DOM traversal/manipulation).

It also supports TextExtracts which also has a code steward review at T256505

During a recent bug report to the library, it was clear that nobody is maintaining the library. The library is pretty stable but it would benefit from some clear ownership.

Recent chats with @Esanders suggest there will be wider adoption of this library to support the talk page project.

Entry in Developers/Maintainers with:

Unmaintained

Number, severity, and age of known and confirmed security issues

No known

Was it a cause of production outages or incidents? List them.

No known but there are several timeout issues likely due to the poor performance of the library on complex HTML.

Does it have sufficient hardware resources for now and the near future (to take into account expected usage growth)?

Unclear

Is it a frequent cause of monitoring alerts that need action, and are they addressed timely and appropriately?

Given its a library, we should monitor issues on logstash and address them as needed.

When it was first deployed to Wikimedia production

2013

Usage statistics based on audience(s) served

It is used in every mobile page view.

Changes committed in last 1, 3, 6, and 12 months

https://gerrit.wikimedia.org/r/q/project:HtmlFormatter

Reliance on outdated platforms (e.g. operating systems)

Unknown

Number of developers who committed code in the last 1, 3, 6, and 12 months

5 in the last 1 years

https://gerrit.wikimedia.org/r/q/project:HtmlFormatter

Number and age of open patches

0

Number and age of open bugs

4 https://phabricator.wikimedia.org/maniphest/?project=PHID-PROJ-toa3gtsfxym5gezfknrx&statuses=open()&group=none&order=newest#R
(all 1 month old)

Number of known dependencies?

MobileFrontend, TextExtracts

Is there a replacement/alternative for the feature? Is there a plan for a replacement?

Yes. Switching to Remex per @cscott
T255586

Submitter's recommendation (what do you propose be done?)

This feels like a good fit for the parsing team, especially if we switch to Remex. I suspect their knowledge and expertise would provide a better supported alternative for all teams.

Event Timeline

Jdlrobson updated the task description. (Show Details)
Jdlrobson added a subscriber: Esanders.

I would like to burn HtmlFormatter in fire and replace what it does with a proper spec-compliant PHP DOM library.

I guess as an interim goal that would be "gradually replace all uses of HtmlFormatter with use of proper DOM code" and/or, to the extent the library survives, it would be to hold code such as Parsoid's DOMCompat class, which exists to propose DOM-spec-compliant implementations of code missing from PHP.

I mentioned this plan briefly in T255586: Replace HTMLFormatter by Remex.

HtmlFormatter uses libxml which is HTML 4 only, causing lots of small bugs (T251434#6334001 is a recent example). Fixing that is T217360: Replace libxml/xpath in HtmlFormatter with Remex/zest. If the functionality of HtmlFormatter is generic enough, reducing overhead by killing the repo and turning it into a Remex or Zest utility class would certainly be nice. (It would have to rely on both, though, and IIRC they are currently independent of each other.)

(T255586: Replace HTMLFormatter by Remex and T217360: Replace libxml/xpath in HtmlFormatter with Remex/zest seem like duplicates.)

It's not just that libxml is HTML4-only -- it's not even a good/bug-free implementation of an HTML4 DOM. See https://bugs.php.net/bug.php?id=78221 for just the most-recent example, where the behavior of a DOM method was silently changed in a bug-fix patch release of PHP (!). (This required a825953abdf30908b40b150e98910ba5f50160f0 in Parsoid.) We were collecting examples in T215000: Fill gaps in PHP DOM's functionality but there were just so many that we lost count.

Recent chats with @Esanders suggest there will be wider adoption of this library to support the talk page project.

We're not currently using it. We tried it, but later replaced it with our own parse/serialize code (using Parsoid) in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/641197.

I would also suggest re-incorporating this file into core. It has no uses outside of MediaWiki extensions, and keeping it in a separate repo just makes updates more complicated.