IPTCTest::testIPTCParseForcedUTFButInvalid failure on PHP with buggy glibc (iconv //IGNORE broken)
Closed, Resolved (Public)

Authored By

	Platonides
	Jun 17 2012, 3:35 PM

Description

iconv-test.c

Our test IPTCTest::testIPTCParseForcedUTFButInvalid verifies that when feeding image metadata marked as UTF-8 but with non-UTF-8 bytes, the bad bytes will be dropped and the sane UTF-8 kept.

This was the behavior of iconv() in PHP < 5.4, as can be tested with:
var_dump( iconv("UTF-8", "UTF-8//IGNORE", "\xC3\xC3\xC3\xB8") );

The behavior of iconv(3) (with IGNORE) is to provide the good bytes *and* report the error. That can be tested with the attached program.
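The attached program is not reproduced in this task. As a rough stand-in, a minimal C sketch of such a check might look like the following. The helper name convert_utf8_ignore and its return convention are assumptions made for illustration, and what iconv() returns with //IGNORE depends on the libc in use.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not the attached iconvt.c): convert UTF-8 to
 * UTF-8//IGNORE and report both the bytes produced and iconv's verdict.
 * Returns 0 if iconv() reported success, the errno value if iconv()
 * returned (size_t)-1, or -1 if iconv_open() itself failed. */
static int convert_utf8_ignore(const char *in, size_t inlen,
                               char *out, size_t outcap, size_t *outlen)
{
    assert(in != NULL && out != NULL && outlen != NULL);
    *outlen = 0;

    iconv_t cd = iconv_open("UTF-8//IGNORE", "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = (char *)in;   /* iconv() takes a non-const char ** */
    char *outp = out;
    size_t inleft = inlen, outleft = outcap;

    errno = 0;
    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    int err = (rc == (size_t)-1) ? errno : 0;

    /* Whatever was written to `out` is reported even if iconv() failed,
     * so the caller can see the output and the error side by side. */
    *outlen = outcap - outleft;
    iconv_close(cd);
    return err;
}
```

Calling it on the four bytes \xC3\xC3\xC3\xB8 and printing both the return value and the output buffer shows whether a given libc hands back the good bytes, reports EILSEQ, or both.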

The fact that, when not using IGNORE, the good bytes were still returned was reported as a bug in https://bugs.php.net/52211 and fixed in e3fdf3 by always returning an empty string.

So our parsing of IPTC data is now different (wrong?) on PHP 5.4.

We can:

Set the empty string as the correct output (remove/change the test)
Verify UTF-8 correctness ourselves (UtfNormal::cleanUp() seems the appropriate tool; we could then remove the UTF-8 replacement character if a silent skip is really desired).
Request that PHP's iconv() behavior be changed back, or that a new flag be added.
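For the second option, the idea of a silent skip can be sketched independently of iconv. The C code below is only an illustration of dropping ill-formed bytes. It is not UtfNormal::cleanUp(), and the names utf8_seq_len and utf8_drop_invalid are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the well-formed UTF-8 sequence starting at s, or 0 if the
 * bytes at s do not start one (overlongs, surrogates, values above
 * U+10FFFF and truncated sequences are all rejected). */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    if (n == 0)
        return 0;
    unsigned char b = s[0];
    if (b < 0x80)
        return 1;

    size_t len;
    unsigned char lo = 0x80, hi = 0xBF;   /* allowed range for byte 2 */
    if (b >= 0xC2 && b <= 0xDF) {
        len = 2;
    } else if (b >= 0xE0 && b <= 0xEF) {
        len = 3;
        if (b == 0xE0) lo = 0xA0;         /* no overlong 3-byte forms */
        if (b == 0xED) hi = 0x9F;         /* no UTF-16 surrogates */
    } else if (b >= 0xF0 && b <= 0xF4) {
        len = 4;
        if (b == 0xF0) lo = 0x90;         /* no overlong 4-byte forms */
        if (b == 0xF4) hi = 0x8F;         /* nothing above U+10FFFF */
    } else {
        return 0;                         /* stray continuation or bad lead */
    }
    if (n < len || s[1] < lo || s[1] > hi)
        return 0;
    for (size_t i = 2; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)
            return 0;
    return len;
}

/* Copy only well-formed sequences from in to out, silently skipping one
 * byte at a time otherwise. out must have room for inlen bytes.
 * Returns the number of bytes written. */
static size_t utf8_drop_invalid(const char *in, size_t inlen, char *out)
{
    assert(in != NULL && out != NULL);
    const unsigned char *s = (const unsigned char *)in;
    size_t i = 0, o = 0;
    while (i < inlen) {
        size_t len = utf8_seq_len(s + i, inlen - i);
        if (len == 0) {
            i++;                          /* drop the bad byte and resync */
            continue;
        }
        memcpy(out + o, in + i, len);
        o += len;
        i += len;
    }
    return o;
}
```

For the input from the PHP one-liner above, \xC3\xC3\xC3\xB8, this keeps only \xC3\xB8, which is the result the test expects.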

Version: 1.20.x
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=73178
https://bugzilla.wikimedia.org/show_bug.cgi?id=67908
https://sourceware.org/bugzilla/show_bug.cgi?id=13541
https://bugs.php.net/bug.php?id=48147

Attached:

iconvt.c (754 B)

Details

Reference: bz37665

Related Objects

Status	Assigned	Task
Resolved	Krinkle	T75176 Make PHPUnit tests pass on Travis CI
Resolved	Krinkle	T75175 Make PHPUnit tests pass with hhvm/MySQL on Travis CI
Resolved	JanZerebecki	T39665 IPTCTest::testIPTCParseForcedUTFButInvalid failure on PHP with buggy glibc (iconv //IGNORE broken)

Event Timeline

• bzimport raised the priority of this task to Medium. Nov 22 2014, 12:28 AM

• bzimport added a project: MediaWiki-Core-Tests.

• bzimport set Reference to bz37665.

• bzimport added a subscriber: Unknown Object (MLST).

Platonides created this task. Jun 17 2012, 3:35 PM

Bug 67908 has been marked as a duplicate of this bug.

It looks to me like the real problem is described in https://bugs.php.net/bug.php?id=48147 and the upstream-upstream bug at https://sourceware.org/bugzilla/show_bug.cgi?id=13541. Apparently glibc's iconv implementation deviates from the documented API of libiconv. Unfortunately the fix that was suggested to PHP to work around the glibc bug has not been implemented.

Change 172101 had a related patch set uploaded by BryanDavis:
Avoid glibc iconv bug by using mb_convert_encoding

https://gerrit.wikimedia.org/r/172101

Bug 73178 has been marked as a duplicate of this bug.

duplicatebug unsubscribed. Dec 13 2014, 11:51 AM

My patch that attempts to work around the iconv bug is stalled due to potential issues with $wgLegacyEncoding and $fallback8bitEncoding as described by @PleaseStand in https://gerrit.wikimedia.org/r/#/c/172101/3/languages/Language.php,unified. I'm not a language guru so I either need someone to tell me what to do to work around these issues or take over the patch.

Krinkle unsubscribed. Dec 21 2014, 4:38 AM

Unresolved issues raised by @PleaseStand in gerrit:

On the Wikimedia cluster, $wgLegacyEncoding (used in Revision::decompressRevisionText()) is set to 'windows-1252' for a handful of wikis, and false for the rest. The underlying libmbfl library does support that encoding, though there are some bugs in error handling (e.g. "\x81" becomes U+FFFE?)

However, some of the files in languages/messages specify a "$fallback8bitEncoding" (used in WebRequest when filtering input). Some of those encodings are supported, possibly under a slightly different name (e.g. "iso-8859-2" instead of "iso8859-2"). Others (e.g. windows-1255) are not.

MessagesAr.php:$fallback8bitEncoding = 'windows-1256';
MessagesBg.php:$fallback8bitEncoding = 'windows-1251';
MessagesBs.php:$fallback8bitEncoding = "iso-8859-2";
MessagesCkb.php:$fallback8bitEncoding = 'windows-1256';
MessagesCrh_cyrl.php:$fallback8bitEncoding = 'windows-1251';
MessagesCrh_latn.php:$fallback8bitEncoding = 'windows-1254';
MessagesCs.php:$fallback8bitEncoding = 'cp1250';
MessagesEl.php:$fallback8bitEncoding = 'iso-8859-7';
MessagesEn.php:$fallback8bitEncoding = 'windows-1252';
MessagesFa.php:$fallback8bitEncoding = 'windows-1256';
MessagesHe.php:$fallback8bitEncoding = 'windows-1255';
MessagesHr.php:$fallback8bitEncoding = 'iso-8859-2';
MessagesHu.php:$fallback8bitEncoding = "iso8859-2";
MessagesHy.php:$fallback8bitEncoding = 'UTF-8';
MessagesKaa.php:$fallback8bitEncoding = 'windows-1254';
MessagesKk_arab.php:$fallback8bitEncoding = 'windows-1256';
MessagesKk_cyrl.php:$fallback8bitEncoding = 'windows-1251';
MessagesKk_latn.php:$fallback8bitEncoding = 'windows-1254';
MessagesLbe.php:$fallback8bitEncoding = 'windows-1251';
MessagesLt.php:$fallback8bitEncoding = 'windows-1257';
MessagesMzn.php:$fallback8bitEncoding = 'windows-1256';
MessagesOs.php:$fallback8bitEncoding = 'windows-1251';
MessagesPl.php:$fallback8bitEncoding = 'iso-8859-2';
MessagesPnb.php:$fallback8bitEncoding = 'windows-1256';
MessagesRo.php:$fallback8bitEncoding = 'iso8859-2';
MessagesRu.php:$fallback8bitEncoding = 'windows-1251';
MessagesSd.php:$fallback8bitEncoding = 'windows-1256';
MessagesSl.php:$fallback8bitEncoding = "iso-8859-2";
MessagesTt_latn.php:$fallback8bitEncoding = "windows-1254";
MessagesTyv.php:$fallback8bitEncoding = "windows-1251";
MessagesUdm.php:$fallback8bitEncoding = 'windows-1251';
MessagesUk.php:$fallback8bitEncoding = 'windows-1251';
MessagesUr.php:$fallback8bitEncoding = 'windows-1256';
MessagesUz.php:$fallback8bitEncoding = 'windows-1252';
MessagesXal.php:$fallback8bitEncoding = "windows-1251";
MessagesZh_hans.php:$fallback8bitEncoding = 'windows-936';
MessagesZh_hant.php:$fallback8bitEncoding = 'windows-950';
MessagesZh_hk.php:$fallback8bitEncoding = 'Big5-HKSCS';
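
One way to see which of these spellings a given conversion backend accepts is simply to try opening a converter for each. The sketch below does that for iconv(3) in C. It says nothing about mbstring/libmbfl, whose name table differs, and the helper names iconv_knows and report_fallback_encodings are hypothetical.

```c
#include <assert.h>
#include <iconv.h>
#include <stdio.h>

/* Returns 1 if the local iconv can convert from `name` to UTF-8, else 0. */
static int iconv_knows(const char *name)
{
    assert(name != NULL);
    iconv_t cd = iconv_open("UTF-8", name);
    if (cd == (iconv_t)-1)
        return 0;
    iconv_close(cd);
    return 1;
}

/* Print iconv support for a few spellings taken from the list above. */
static void report_fallback_encodings(void)
{
    static const char *const names[] = {
        "windows-1252", "windows-1255", "iso-8859-2",
        "iso8859-2", "cp1250", "Big5-HKSCS",
    };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
        printf("%-14s %s\n", names[i],
               iconv_knows(names[i]) ? "supported" : "not supported");
}
```

The results depend on the iconv implementation and the conversion modules installed, so the output is best treated as a per-host check rather than a fixed table.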

I posted a call for help to wikitech-l.

hashar unsubscribed. May 6 2015, 9:16 AM

Smalyshev subscribed. May 8 2015, 7:22 AM

Nikerabbit subscribed. May 8 2015, 8:47 PM

I've committed a fix for https://bugs.php.net/bug.php?id=48147 (see https://github.com/php/php-src/commit/473ec539a1c3d242c8b171dd6a5a98fa17e05c13). It's only 5.5+ though.

T98882 shows this error popping up in the latest internal WMF builds of HHVM which are based on the 3.6.1 upstream version.

The test passed with 3.3.1+dfsg1-1+wm3.1 but failed with 3.6.1+dfsg1-1+wm2

MarkAHershberger unsubscribed. May 12 2015, 4:16 PM

HHVM has an ini setting that works around this problem: hhvm.hack.lang.iconv_ignore_correct=true

The name of the setting makes it seem like it is for Hack mode only but in reality it applies in normal PHP mode as well. This setting has been changed for MediaWiki-Vagrant and WMF's production HHVM configs.

The fix @Smalyshev provided to PHP 5.5+ and the discovery of the HHVM hhvm.hack.lang.iconv_ignore_correct=true setting make this bug something that can be worked around in the PHP interpreter itself either by upgrading or via configuration.

I'm inclined to close this as resolved. Any objections?

JanZerebecki closed this task as Resolved. Jul 10 2015, 11:28 PM

JanZerebecki claimed this task.

Liuxinyu970226 unsubscribed. Jul 11 2015, 1:00 AM

Bugreporter mentioned this in T108560: Gerritbot doesn't work while abandoning gerrit changes which mention bugzilla ids. Aug 10 2015, 7:09 AM

PleaseStand merged a task: T116705: IPTCTest::testIPTCParseForcedUTFButInvalid failure in iptcparse(). Oct 27 2015, 4:10 AM

PleaseStand added subscribers: saper, Aklapper.

But what about PHP 5.4? I'd propose assuming that malformed UTF-8 results in empty output.

Legoktm mentioned this in T124574: IPTCTest::testIPTCParseForcedUTFButInvalid fails on trusty/PHP5.5. Jan 24 2016, 4:59 PM

Noting that this is still a problem with the PHP 5.5.9 shipped in Ubuntu 14.04; see T124574.

bd808 mentioned this in T125477: refreshCdbJsonFiles in scap fails on mira due to missing dba_open function in hhvm. Feb 2 2016, 11:20 PM