Page MenuHomePhabricator

IPTCTest::testIPTCParseForcedUTFButInvalid failure on PHP with buggy glibc (iconv //IGNORE broken)
Closed, ResolvedPublic

Description

iconv-test.c

Our test IPTCTest::testIPTCParseForcedUTFButInvalid verifies that when feeding image metadata marked as UTF-8 but with non-UTF-8 bytes, the bad bytes will be dropped and the sane UTF-8 kept.

This was the behavior of iconv() in php < 5.4 as can be tested with
var_dump( iconv("UTF-8", "UTF-8//IGNORE", "\xC3\xC3\xC3\xB8") );

The behavior of iconv(3) (with IGNORE) is to provide the good bytes *and* report the error. That can be tested with the attached program.

The fact that when not using IGNORE, the were returned was reported as a bug in https://bugs.php.net/52211 and fixed in e3fdf3 by always returning an empty string.

So our parsing of IPTC data is now different (wrong?) on PHP 5.4

We can:

  • Set the empty string as the correct output (remove/change the test)
  • Verify UTF-8 correctness ourselves (using UtfNormal::cleanUp() seems the appropiate one, we could then remove utf-8 replacement char if a slient skip is really desired).
  • Request php iconv() behavior to change back / add a new flag.

Version: 1.20.x
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=73178
https://bugzilla.wikimedia.org/show_bug.cgi?id=67908
https://sourceware.org/bugzilla/show_bug.cgi?id=13541
https://bugs.php.net/bug.php?id=48147

Attached:

Details

Reference
bz37665

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 12:28 AM
bzimport set Reference to bz37665.
bzimport added a subscriber: Unknown Object (MLST).
  • Bug 67908 has been marked as a duplicate of this bug. ***

It looks to me like the real problem is described in https://bugs.php.net/bug.php?id=48147 and the upstream-upstream bug at https://sourceware.org/bugzilla/show_bug.cgi?id=13541. Apparently glibc's iconv implementation deviates from the documented API of libiconv. Unfortunately the fix that was suggested to PHP to work around the glibc bug has not been implemented.

Change 172101 had a related patch set uploaded by BryanDavis:
Avoid glibc iconv bug by using mb_convert_encoding

https://gerrit.wikimedia.org/r/172101

  • Bug 73178 has been marked as a duplicate of this bug. ***
bd808 changed the task status from Open to Stalled.Dec 20 2014, 11:13 PM
bd808 added a subscriber: PleaseStand.

My patch that attempts to work around the iconv bug is stalled due to potential issues with $wgLegacyEncoding and $fallback8bitEncoding as described by @PleaseStand in https://gerrit.wikimedia.org/r/#/c/172101/3/languages/Language.php,unified. I'm not a language guru so I either need someone to tell me what to do to work around these issues or take over the patch.

Unresolved issues raised by @PleaseStand in gerrit:

On the Wikimedia cluster, $wgLegacyEncoding (used in Revision::decompressRevisionText()) is set to 'windows-1252' for a handful of wikis, and false for the rest. The underlying libmbfl library does support that encoding, though there are some bugs in error handling (e.g. "\x81" becomes U+FFFE?)

However, some of the files in languages/messages specify a "$fallback8bitEncoding" (used in WebRequest when filtering input). Some of those encodings are supported, possibly under a slightly different name (e.g. "iso-8859-2" instead of "iso8859-2"). Others (e.g. windows-1255) are not.

MessagesAr.php:$fallback8bitEncoding = 'windows-1256';
MessagesBg.php:$fallback8bitEncoding = 'windows-1251';
MessagesBs.php:$fallback8bitEncoding = "iso-8859-2";
MessagesCkb.php:$fallback8bitEncoding = 'windows-1256';
MessagesCrh_cyrl.php:$fallback8bitEncoding = 'windows-1251';
MessagesCrh_latn.php:$fallback8bitEncoding = 'windows-1254';
MessagesCs.php:$fallback8bitEncoding = 'cp1250';
MessagesEl.php:$fallback8bitEncoding = 'iso-8859-7';
MessagesEn.php:$fallback8bitEncoding = 'windows-1252';
MessagesFa.php:$fallback8bitEncoding = 'windows-1256';
MessagesHe.php:$fallback8bitEncoding = 'windows-1255';
MessagesHr.php:$fallback8bitEncoding = 'iso-8859-2';
MessagesHu.php:$fallback8bitEncoding = "iso8859-2";
MessagesHy.php:$fallback8bitEncoding = 'UTF-8';
MessagesKaa.php:$fallback8bitEncoding = 'windows-1254';
MessagesKk_arab.php:$fallback8bitEncoding = 'windows-1256';
MessagesKk_cyrl.php:$fallback8bitEncoding = 'windows-1251';
MessagesKk_latn.php:$fallback8bitEncoding = 'windows-1254';
MessagesLbe.php:$fallback8bitEncoding = 'windows-1251';
MessagesLt.php:$fallback8bitEncoding = 'windows-1257';
MessagesMzn.php:$fallback8bitEncoding = 'windows-1256';
MessagesOs.php:$fallback8bitEncoding = 'windows-1251';
MessagesPl.php:$fallback8bitEncoding = 'iso-8859-2';
MessagesPnb.php:$fallback8bitEncoding = 'windows-1256';
MessagesRo.php:$fallback8bitEncoding = 'iso8859-2';
MessagesRu.php:$fallback8bitEncoding = 'windows-1251';
MessagesSd.php:$fallback8bitEncoding = 'windows-1256';
MessagesSl.php:$fallback8bitEncoding = "iso-8859-2";
MessagesTt_latn.php:$fallback8bitEncoding = "windows-1254";
MessagesTyv.php:$fallback8bitEncoding = "windows-1251";
MessagesUdm.php:$fallback8bitEncoding = 'windows-1251';
MessagesUk.php:$fallback8bitEncoding = 'windows-1251';
MessagesUr.php:$fallback8bitEncoding = 'windows-1256';
MessagesUz.php:$fallback8bitEncoding = 'windows-1252';
MessagesXal.php:$fallback8bitEncoding = "windows-1251";
MessagesZh_hans.php:$fallback8bitEncoding = 'windows-936';
MessagesZh_hant.php:$fallback8bitEncoding = 'windows-950';
MessagesZh_hk.php:$fallback8bitEncoding = 'Big5-HKSCS';

The test passed with 3.3.1+dfsg1-1+wm3.1 but failed with 3.6.1+dfsg1-1+wm2

HHVM has an ini setting that works around this problem: hhvm.hack.lang.iconv_ignore_correct=true

The name of the setting makes it seem like it is for Hack mode only but in reality it applies in normal PHP mode as well. This setting has been changed for MediaWiki-Vagrant and WMF's production HHVM configs.

The fix @Smalyshev provided to PHP 5.5+ and the discovery of the HHVM hhvm.hack.lang.iconv_ignore_correct=true setting make this bug something that can be worked around in the PHP interpreter itself either by upgrading or via configuration.

I'm inclined to close this as resolved. Any objections?

JanZerebecki claimed this task.

But what about PHP 5.4 ? I'd propose to assume that malformed UTF-8 results in the empty output.

Noting this is a problem with the 5.5.9 in 14.04 still, see T124574