PHP 8.2 deprecates use of 'HTML-ENTITIES' with mb_convert_encoding
Closed, Resolved (Public)

Description

- Mbstring:
  . Use of QPrint, Base64, Uuencode, and HTML-ENTITIES 'text encodings' is
    deprecated for all Mbstring functions. Unlike all the other text
    encodings supported by Mbstring, these do not encode a sequence of
    Unicode codepoints, but rather a sequence of raw bytes. It is not
    clear what the correct return values for most Mbstring functions should
    be when one of these non-encodings is specified. Further, PHP has
    separate, built-in implementations of all of them; for example, UUencoded
    data can be handled using convert_uuencode/convert_uudecode.

From https://github.com/php/php-src/commit/9308974f8cc6c1046f228be5320fe067913ba987
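As a rough illustration of the "separate, built-in implementations" mentioned in the note above, the following sketch pairs each deprecated Mbstring pseudo-encoding with a built-in function that could stand in for it. The sample string and variable names are made up for this example, and the mask used for mb_encode_numericentity is one common choice rather than the only valid one.

```php
<?php
// Illustrative sample input (UTF-8); not taken from any affected code.
$raw = "Grüße";

// Instead of mb_convert_encoding( $raw, 'BASE64', ... )
$b64 = base64_encode( $raw );

// Instead of the 'QPrint' pseudo-encoding
$qp = quoted_printable_encode( $raw );

// Instead of the 'Uuencode' pseudo-encoding
$uu = convert_uuencode( $raw );

// Instead of mb_convert_encoding( $str, 'UTF-8', 'HTML-ENTITIES' ) (decoding)
$decoded = html_entity_decode( 'Gr&uuml;&szlig;e', ENT_QUOTES | ENT_HTML5, 'UTF-8' );

// Instead of mb_convert_encoding( $raw, 'HTML-ENTITIES', 'UTF-8' ) (encoding):
// turn every code point from U+0080 upwards into a decimal numeric entity.
$entities = mb_encode_numericentity( $raw, [ 0x80, 0x10FFFF, 0, 0x1FFFFF ], 'UTF-8' );
```

Note that mb_encode_numericentity produces numeric entities (such as &#252;) rather than named ones (such as &uuml;), so the output is not byte-identical to what 'HTML-ENTITIES' produced, although an HTML parser should treat both the same way.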

Codesearch (best-effort): https://codesearch.wmcloud.org/search/?q=mb_%5Cw%2B%5Cs*%5C(%5B%5E%22%27%5D*%5B%22%27%5D(BASE64%7CHTML-ENTITIES%7CHTML%7CQUOTED-PRINTABLE%7CQPRINT%7CUUENCODE)%5B%22%27%5D&i=fosho&files=%5C.php&excludeFiles=&repos=

Event Timeline

For the Parsoid project it looks like it's the Zest library which uses HTML-ENTITIES. I can take a look and release a PHP 8.2-compatible version.

ssastry triaged this task as Medium priority. Dec 1 2022, 3:13 PM

This is a common workaround for the fact that the default encoding for a DOM document is Latin-1. We have UTF-8 encoded HTML, escape the non-ASCII characters to entities so that the input becomes valid Latin-1, and then feed it to the DOMDocument.

The workaround is needed in these cases most likely because we are not feeding our DOMDocument proper HTML5 documents. It is also possible that, in older libxml versions, loadHTML/libxml was unable to read the meta declaration from the HTML document.

The proper way to fix this issue is

  • feeding it an ACTUAL UTF-8 HTML document (instead of HTML fragments)
  • prepending '<?xml encoding="UTF-8">' to force the parser to UTF-8
  • prepending <meta charset="utf-8"/> or <meta http-equiv="content-type" content="text/html; charset=utf-8"> to the HTML (these last two might be what did not work in the past in libxml?)
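To make the options above concrete, here is a minimal sketch of wrapping a fragment in a full document that declares its charset before handing it to DOMDocument. The function name loadHtmlFragmentAsUtf8 is hypothetical and this is not the code from any of the patches on this task. It uses the http-equiv form of the meta declaration on the assumption that it is the most widely understood by libxml versions.

```php
<?php
/**
 * Hypothetical helper: parse a UTF-8 HTML fragment without first
 * converting it with the deprecated 'HTML-ENTITIES' pseudo-encoding.
 */
function loadHtmlFragmentAsUtf8( string $fragment ): DOMDocument {
	// Wrap the fragment in a whole document whose head declares UTF-8,
	// so the parser does not fall back to Latin-1.
	$html = '<!DOCTYPE html><html><head>'
		. '<meta http-equiv="content-type" content="text/html; charset=utf-8">'
		. '</head><body>' . $fragment . '</body></html>';

	$doc = new DOMDocument();
	// Collect parser warnings (for example about unknown HTML5 tags)
	// internally instead of emitting PHP warnings.
	$previous = libxml_use_internal_errors( true );
	$doc->loadHTML( $html );
	libxml_clear_errors();
	libxml_use_internal_errors( $previous );

	return $doc;
}

// Usage: non-ASCII text should survive the round trip unchanged.
$doc = loadHtmlFragmentAsUtf8( '<p>Zürich 日本語</p>' );
$text = $doc->getElementsByTagName( 'p' )->item( 0 )->textContent;
```

Callers that previously serialized the whole document will now get the wrapper markup as well, so they may need to serialize only the children of the body element instead.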

Change 879625 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[HtmlFormatter@master] Parse html as whole document to avoid encoding issues

https://gerrit.wikimedia.org/r/879625

Change 879632 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/libs/Zest@master] tests: Use utf-8 html document for testing

https://gerrit.wikimedia.org/r/879632

Change 879639 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/extensions/CommonsMetadata@master] Parse html as whole document to avoid encoding issues

https://gerrit.wikimedia.org/r/879639

Change 879641 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/extensions/RandomImage@master] Parse html as whole document to avoid encoding issues

https://gerrit.wikimedia.org/r/879641

Change 879642 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/skins/Refreshed@master] Parse html as whole document to avoid encoding issues

https://gerrit.wikimedia.org/r/879642

Change 879642 merged by jenkins-bot:

[mediawiki/skins/Refreshed@master] Parse html as whole document to avoid encoding issues

https://gerrit.wikimedia.org/r/879642

Change 879625 merged by jenkins-bot:

[HtmlFormatter@master] Parse html as whole document to avoid encoding issues

https://gerrit.wikimedia.org/r/879625

Change 879639 merged by jenkins-bot:

[mediawiki/extensions/CommonsMetadata@master] Parse html as whole document to avoid encoding issues

https://gerrit.wikimedia.org/r/879639

Change 879641 merged by jenkins-bot:

[mediawiki/extensions/RandomImage@master] Parse html as whole document to avoid encoding issues

https://gerrit.wikimedia.org/r/879641

We need a new release of wikimedia/html-formatter on Packagist so that we can update MediaWiki core's composer.json from 3.0.1 to whatever the new version will be. Looking at the README and the mediawiki.org docs (https://www.mediawiki.org/wiki/HtmlFormatter), I don't see notes about how to do this or who usually does it. Does anyone know?

Have created T330528 as request for a new release

Change 879632 merged by jenkins-bot:

[mediawiki/libs/Zest@master] tests: Use utf-8 html document for testing

https://gerrit.wikimedia.org/r/879632

Change 892414 had a related patch set uploaded (by Jack Phoenix; author: Umherirrender):

[mediawiki/skins/Refreshed@REL1_35] Parse html as whole document to avoid encoding issues

https://gerrit.wikimedia.org/r/892414

Change 892414 merged by jenkins-bot:

[mediawiki/skins/Refreshed@REL1_35] Parse html as whole document to avoid encoding issues

https://gerrit.wikimedia.org/r/892414

The bug is still present in 1.39.4. It appears only on smaller cell phones; I see it with the Pixel 5 simulator in Chrome developer tools.

Change 951493 had a related patch set uploaded (by Reedy; author: Reedy):

[HtmlFormatter@master] HtmlFormatter: Add #[\ReturnTypeWillChange] to ease migration

https://gerrit.wikimedia.org/r/951493

The bug is still present in 1.39.4. It appears only on smaller cell phones; I see it with the Pixel 5 simulator in Chrome developer tools.

FWIW, no one ever stated it was fixed ;)

Change 951493 merged by jenkins-bot:

[HtmlFormatter@master] HtmlFormatter: Add #[\ReturnTypeWillChange] to ease migration

https://gerrit.wikimedia.org/r/951493

Change 951494 had a related patch set uploaded (by Reedy; author: Reedy):

[HtmlFormatter@master] HtmlFormatter: Add #[\ReturnTypeWillChange] to correct function to ease migration

https://gerrit.wikimedia.org/r/951494

Change 951494 merged by jenkins-bot:

[HtmlFormatter@master] HtmlFormatter: Add #[\ReturnTypeWillChange] to more functions to ease migration

https://gerrit.wikimedia.org/r/951494

Guess I need to still do some backports... yay

Change 958422 had a related patch set uploaded (by Reedy; author: Umherirrender):

[mediawiki/extensions/RandomImage@REL1_39] Parse html as whole document to avoid encoding issues

https://gerrit.wikimedia.org/r/958422

Change 958423 had a related patch set uploaded (by Reedy; author: Umherirrender):

[mediawiki/extensions/CommonsMetadata@REL1_39] Parse html as whole document to avoid encoding issues

https://gerrit.wikimedia.org/r/958423

Change 958423 merged by jenkins-bot:

[mediawiki/extensions/CommonsMetadata@REL1_39] Parse html as whole document to avoid encoding issues

https://gerrit.wikimedia.org/r/958423

Change 958422 merged by jenkins-bot:

[mediawiki/extensions/RandomImage@REL1_39] Parse html as whole document to avoid encoding issues

https://gerrit.wikimedia.org/r/958422

Guess I need to still do some backports... yay

Or not. I've done HtmlFormatter into REL1_39/REL1_40... I think the two above are about all that was missing...

Change 958944 had a related patch set uploaded (by Reedy; author: Reedy):

[mediawiki/extensions/ContentTranslation@master] Parse html as whole document to avoid encoding issues

https://gerrit.wikimedia.org/r/958944

Change 958949 had a related patch set uploaded (by Reedy; author: Reedy):

[mediawiki/extensions/EImage@master] Parse html as whole document to avoid encoding issues

https://gerrit.wikimedia.org/r/958949

Change 958949 merged by jenkins-bot:

[mediawiki/extensions/EImage@master] Parse html as whole document to avoid encoding issues

https://gerrit.wikimedia.org/r/958949

Change 958944 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] Parse html as whole document to avoid encoding issues

https://gerrit.wikimedia.org/r/958944

Pginer-WMF claimed this task.
Pginer-WMF moved this task from Needs Triage to Upstream/Other teams on the ContentTranslation board.