Problem with non-ASCII characters and long requests
Closed, ResolvedPublic

Description

frwiki says Version = 1.20wmf2 on May 14
A few days ago, it was working correctly.

API seems to have trouble with page titles with non-ASCII characters when the size of the API request is very long.

For example, a query request for templates (prop=templates) returns incorrect results for page titles with non-ASCII characters, when the list of pages (argument "titles=") contains several dozen pages : the API answer returns incorrect page names when there are non-ASCII characters (é instead of é, ...) and incorrect information (it doesn't use the correct page title).

But if the same request is made with a smaller list of pages, the API answer becomes correct again.

Request with incorrect answer

http://fr.wikipedia.org/w/api.php?prop=templates&action=query&titles=Acteur%7CAlbert%20Robbins%7CAnglais%7CAnn%20Donahue%7CAnthony%20E.%20Zuiker%7CCarol%20Mendelsohn%7CCatherine%20Willows%7CDavid%20Hodges%7CDavid%20Phillips%7CGil%20Grissom%7CGreg%20Sanders%7CHodges%7CInternet%20Movie%20Database%7CJim%20Brass%7CLady%20Heather%7CLes%20Experts%20(s%C3%A9rie%20t%C3%A9l%C3%A9vis%C3%A9e)%7CLes%20Experts%20:%20Manhattan%7CLes%20Experts%20:%20Miami%7CListe%20des%20personnages%20des%20Experts%7CListe%20des%20%C3%A9pisodes%20des%20Experts%7CMod%C3%A8le%20discussion:Palette%20Les%20Experts%7CNick%20Stokes%7CPersonnage%20de%20fiction%7CPersonnage%20fictif%7CPersonnage%20de%20fiction%7CPersonnages%20r%C3%A9currents%20dans%20Les%20Experts%7CRaymond%20Langston%7CRiley%20Adams%7CSaison%201%20des%20Experts%7CSaison%2010%20des%20Experts%7CSaison%2011%20des%20Experts%7CSaison%2012%20des%20Experts%7CSaison%202%20des%20Experts%7CSaison%203%20des%20Experts%7CSaison%204%20des%20Experts%7CSaison%205%20des%20Experts%7CSaison%206%20des%20Experts%7CSaison%207%20des%20Experts%7CSaison%208%20des%20Experts%7CSaison%209%20des%20Experts%7CSara%20Sidle%7CSofia%20Curtis%7CS%C3%A9rie%20t%C3%A9l%C3%A9vis%C3%A9e%7CWallace%20Langham%7CWarrick%20Brown%7CWendy%20Simms%7C%C3%89tats-Unis

Answer :
<?xml version="1.0"?>
<api>

<query>
  <pages>
    <page ns="0" title="Les Experts (série télévisée)" missing="" />
    <page ns="0" title="Liste des épisodes des Experts" missing="" />
    <page ns="0" title="Modèle discussion:Palette Les Experts" missing="" />
    <page ns="0" title="Personnages récurrents dans Les Experts" missing="" />
    <page ns="0" title="Série télévisée" missing="" />
    <page ns="0" title="Wallace Langham" missing="" />
    <page ns="0" title="États-Unis" missing="" />
    <page pageid="43579" ns="0" title="Acteur" />

Request with correct answer

http://fr.wikipedia.org/w/api.php?prop=templates&action=query&titles=Acteur%7CAlbert%20Robbins%7CAnglais%7CAnn%20Donahue%7CAnthony%20E.%20Zuiker%7CCarol%20Mendelsohn%7CCatherine%20Willows%7CDavid%20Hodges%7CDavid%20Phillips%7CGil%20Grissom%7CGreg%20Sanders%7CHodges%7CInternet%20Movie%20Database%7CJim%20Brass%7CLady%20Heather%7CLes%20Experts%20(s%C3%A9rie%20t%C3%A9l%C3%A9vis%C3%A9e)%7CLes%20Experts%20:%20Manhattan%7CLes%20Experts%20:%20Miami%7CListe%20des%20personnages%20des%20Experts%7CListe%20des%20%C3%A9pisodes%20des%20Experts%7CMod%C3%A8le%20discussion:Palette%20Les%20Experts%7CPersonnages%20r%C3%A9currents%20dans%20Les%20Experts%7CS%C3%A9rie%20t%C3%A9l%C3%A9vis%C3%A9e%7C%C3%89tats-Unis

Answer :
<api>

<query>
  <pages>
    <page ns="0" title="Modèle discussion:Palette Les Experts" missing="" />
    <page pageid="43579" ns="0" title="Acteur" />
    <page pageid="4307505" ns="0" title="Albert Robbins" />
    <page pageid="5539" ns="0" title="Anglais">

Version: 1.20.x
Severity: critical
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=45669

bzimport added a subscriber: wikibugs-l.
bzimport set Reference to bz36839.
NicoV created this task.Via LegacyMay 14 2012, 5:50 PM
Krenair added a comment.Via ConduitMay 14 2012, 6:00 PM

Confirmed, upped the severity of this bug.

According to OP, this problem seems to have started in the past few days - maybe it was introduced in 1.20wmf2?

Umherirrender added a comment.Via ConduitMay 14 2012, 6:38 PM

Are POST Request also affected, where titles= is part of the POST data?

Anomie added a comment.Via ConduitMay 14 2012, 7:20 PM

It appears that "very long" here corresponds to "512 or more bytes in the parameter". It does not seem to have anything to do with the number of titles, the total length of all parameters, or the number of characters in the URL-encoded representation.

NicoV added a comment.Via ConduitMay 14 2012, 7:50 PM

The problem seems to be more general than the simple example I provided first.

For example, on frwiki, if I try to retrieve the list of templates in Hacker, it orks :
http://fr.wikipedia.org/w/api.php?action=query&prop=templates&format=xml&titles=Hacker

<?xml version="1.0"?>
<api>

<query>
  <pages>
    <page pageid="1830125" ns="0" title="Hacker">
      <templates>
        <tl ns="10" title="Modèle:Autres projets" />
        <tl ns="10" title="Modèle:Homonymie" />
      </templates>
    </page>
  </pages>
</query>

</api>

If I had a tltemplates argument with only "Modèle:Homonymie", it works also :
http://fr.wikipedia.org/w/api.php?action=query&prop=templates&format=xml&tltemplates=Mod%C3%A8le%3AHomonymie&titles=Hacker

<?xml version="1.0"?>
<api>

<query>
  <pages>
    <page pageid="1830125" ns="0" title="Hacker">
      <templates>
        <tl ns="10" title="Modèle:Homonymie" />
      </templates>
    </page>
  </pages>
</query>

</api>

But, when I put all disambiguation templates in the tltemplates argument, the result is wrong (Modèle:Homonymie is not returned as one of the templates in Hacker) :
/w/api.php?action=query&prop=templates&format=xml&tltemplates=Mod%C3%A8le%3AArrondissements%20homonymes%7CMod%C3%A8le%3ABandeau%20standard%20pour%20page%20d'homonymie%7CMod%C3%A8le%3ABatailles%20homonymes%7CMod%C3%A8le%3ACantons%20homonymes%7CMod%C3%A8le%3ACommunes%20fran%C3%A7aises%20homonymes%7CMod%C3%A8le%3AFilms%20homonymes%7CMod%C3%A8le%3AGouvernements%20homonymes%7CMod%C3%A8le%3AGuerres%20homonymes%7CMod%C3%A8le%3AHomonymie%7CMod%C3%A8le%3AHomonymie%20bateau%7CMod%C3%A8le%3AHomonymie%20d'%C3%A9tablissements%20scolaires%20ou%20universitaires%7CMod%C3%A8le%3AHomonymie%20d'%C3%AEles%7CMod%C3%A8le%3AHomonymie%20de%20clubs%20sportifs%7CMod%C3%A8le%3AHomonymie%20de%20comt%C3%A9s%7CMod%C3%A8le%3AHomonymie%20de%20monument%7CMod%C3%A8le%3AHomonymie%20de%20nom%20romain%7CMod%C3%A8le%3AHomonymie%20de%20parti%20politique%7CMod%C3%A8le%3AHomonymie%20de%20route%7CMod%C3%A8le%3AHomonymie%20dynastique%7CMod%C3%A8le%3AHomonymie%20vid%C3%A9oludique%7CMod%C3%A8le%3AHomonymie%20%C3%A9difice%20religieux%7CMod%C3%A8le%3AInternationalisation%7CMod%C3%A8le%3AIsom%C3%A9rie%7CMod%C3%A8le%3AParonymie%7CMod%C3%A8le%3APatronyme%7CMod%C3%A8le%3APatronyme%20basque%7CMod%C3%A8le%3APatronyme%20italien%7CMod%C3%A8le%3APatronymie%7CMod%C3%A8le%3APersonnes%20homonymes%7CMod%C3%A8le%3ASaints%20homonymes%7CMod%C3%A8le%3ATitres%20homonymes%7CMod%C3%A8le%3AToponymie%7CMod%C3%A8le%3AUnit%C3%A9s%20homonymes%7CMod%C3%A8le%3AVilles%20homonymes%7CMod%C3%A8le%3A%C3%89difices%20religieux%20homonymes&titles=Hacker

<?xml version="1.0"?>
<api>

<query>
  <pages>
    <page pageid="1830125" ns="0" title="Hacker" />
  </pages>
</query>

</api>

NicoV added a comment.Via ConduitMay 14 2012, 7:51 PM

Link to last example was:
http://fr.wikipedia.org/w/api.php?action=query&prop=templates&format=xml&tltemplates=Mod%C3%A8le%3AArrondissements%20homonymes%7CMod%C3%A8le%3ABandeau%20standard%20pour%20page%20d'homonymie%7CMod%C3%A8le%3ABatailles%20homonymes%7CMod%C3%A8le%3ACantons%20homonymes%7CMod%C3%A8le%3ACommunes%20fran%C3%A7aises%20homonymes%7CMod%C3%A8le%3AFilms%20homonymes%7CMod%C3%A8le%3AGouvernements%20homonymes%7CMod%C3%A8le%3AGuerres%20homonymes%7CMod%C3%A8le%3AHomonymie%7CMod%C3%A8le%3AHomonymie%20bateau%7CMod%C3%A8le%3AHomonymie%20d'%C3%A9tablissements%20scolaires%20ou%20universitaires%7CMod%C3%A8le%3AHomonymie%20d'%C3%AEles%7CMod%C3%A8le%3AHomonymie%20de%20clubs%20sportifs%7CMod%C3%A8le%3AHomonymie%20de%20comt%C3%A9s%7CMod%C3%A8le%3AHomonymie%20de%20monument%7CMod%C3%A8le%3AHomonymie%20de%20nom%20romain%7CMod%C3%A8le%3AHomonymie%20de%20parti%20politique%7CMod%C3%A8le%3AHomonymie%20de%20route%7CMod%C3%A8le%3AHomonymie%20dynastique%7CMod%C3%A8le%3AHomonymie%20vid%C3%A9oludique%7CMod%C3%A8le%3AHomonymie%20%C3%A9difice%20religieux%7CMod%C3%A8le%3AInternationalisation%7CMod%C3%A8le%3AIsom%C3%A9rie%7CMod%C3%A8le%3AParonymie%7CMod%C3%A8le%3APatronyme%7CMod%C3%A8le%3APatronyme%20basque%7CMod%C3%A8le%3APatronyme%20italien%7CMod%C3%A8le%3APatronymie%7CMod%C3%A8le%3APersonnes%20homonymes%7CMod%C3%A8le%3ASaints%20homonymes%7CMod%C3%A8le%3ATitres%20homonymes%7CMod%C3%A8le%3AToponymie%7CMod%C3%A8le%3AUnit%C3%A9s%20homonymes%7CMod%C3%A8le%3AVilles%20homonymes%7CMod%C3%A8le%3A%C3%89difices%20religieux%20homonymes&titles=Hacker

Anomie added a comment.Via ConduitMay 14 2012, 8:05 PM

Probably the same underlying problem: it's searching for "Modèle:Homonymie" instead of "Modèle:Homonymie". Trim the value of tltemplates down to 511 bytes or less (before URL-encoding) and it'll probably work. Or try it with a POST, as Umherirrender suggested.

NicoV added a comment.Via ConduitMay 14 2012, 8:21 PM

I just tried with POST instead of GET : it works.
So the problem seems to appear only with GET requests.

MarkAHershberger added a comment.Via ConduitMay 15 2012, 12:57 PM

Bumping to highest as this appears to be a regression with the recent rollout. Notifying Robla and Sumana.

Krenair added a comment.Via ConduitMay 15 2012, 2:51 PM
  • Bug 36799 has been marked as a duplicate of this bug. ***
Anomie added a comment.Via ConduitMay 15 2012, 4:40 PM

I think I might have figured this out.

In a post on enwiki from May 11,[1] we are told that Roan changed the "PCRE recursion limit" from the default 100k to 1k. I assume this is referring to PHP's "pcre.recursion_limit" setting,[2] which indeed has a default of 100000.

One thing the recursion limit affects how often regexes with subexpressions like "(x)+" can match. It seems that each match by "+" there uses up 2 of the recursion limit; with a value of 1024, it can match at most 511 times. If it would match 512 times, preg_match will return false instead. You can test this easily enough if you have a recent-enough command-line PHP:

php -r 'ini_set("pcre.recursion_limit", 1024); var_dump(preg_match("/(x)+/", str_repeat("x", 511)));'
php -r 'ini_set("pcre.recursion_limit", 1024); var_dump(preg_match("/(x)+/", str_repeat("x", 512)));'

The first will succeed, while the second will fail. But if you bump the 1024 to 1026, the second will start working.

So what seems to be going on is this: The API uses the methods in WebRequest to get the parameters from the client, all of which seem to come down to getGPCVal. For any parameter that exists in $_GET (even if overridden by $_POST), getGPCVal passes the value through Language::checkTitleEncoding to make sure it's valid UTF-8. And due to the low recursion limit, the regex in Language::checkTitleEncoding that tries to check whether the value is valid UTF-8 will now think it is ''not'' valid if the value is more than 511 characters long, so it will treat it as the fallback 8-bit encoding (windows-1252 for most languages), which gives the familiar "è" mojibake.

If I'm right, the fix for this bug would be to revert Roan's change to the "pcre.recursion_limit" setting (and fix whatever PageTriage's problem is in some other way), or at least turn it up to something more reasonable than 1024. I'd expect this is causing problems in other areas of the code, too.

[1]: https://en.wikipedia.org/w/index.php?title=Wikipedia_talk:Database_reports&diff=491927371&oldid=491919743
[2]: http://us.php.net/manual/en/pcre.configuration.php#ini.pcre.recursion-limit

bzimport added a comment.Via ConduitMay 15 2012, 6:33 PM

sumanah wrote:

Keywording i18n.

Umherirrender added a comment.Via ConduitMay 18 2012, 6:44 PM

(In reply to comment #10)

I'd expect this is causing problems in other areas of the code, too.

And it causing problems for many clients, which using titles= with 50 urlencoded titles. Please use a temp fix soon (reset the limit) and look into the real problem after that.

Thanks.

Lupo added a comment.Via ConduitMay 22 2012, 7:37 AM

The regexp recursion limit aside, is using a regexp to check for UTF-8 appropriate? Why not use mb_check_encoding() if available? Other operations in Language.php do make use of the mb_* functions...

http://php.net/manual/en/function.mb-check-encoding.php

Umherirrender added a comment.Via ConduitMay 25 2012, 2:26 PM
  • Bug 37021 has been marked as a duplicate of this bug. ***
MarkAHershberger added a comment.Via ConduitMay 26 2012, 5:47 PM

(In reply to comment #10)

If I'm right, the fix for this bug would be to revert Roan's change to the
"pcre.recursion_limit" setting (and fix whatever PageTriage's problem is in
some other way), or at least turn it up to something more reasonable than 1024.
I'd expect this is causing problems in other areas of the code, too.

Sam, have you had a chance to look at this yet?

Reedy added a comment.Via ConduitMay 27 2012, 12:26 PM

(In reply to comment #15)

(In reply to comment #10)
> If I'm right, the fix for this bug would be to revert Roan's change to the
> "pcre.recursion_limit" setting (and fix whatever PageTriage's problem is in
> some other way), or at least turn it up to something more reasonable than 1024.
> I'd expect this is causing problems in other areas of the code, too.

Sam, have you had a chance to look at this yet?

Nope.

We can increase the pcre.recursion_limit again, the value Roan set was overly conservative.

I'm not sure what the availability of various Ops staff is going to be to push this through

(In reply to comment #13)

The regexp recursion limit aside, is using a regexp to check for UTF-8
appropriate? Why not use mb_check_encoding() if available? Other operations in
Language.php do make use of the mb_* functions...

http://php.net/manual/en/function.mb-check-encoding.php

I'm guessing legacy reasons, and the code was never updated.

Please feel free to submit a patch to BZ or a commit to Gerrit. Or is just changing:

		$isutf8 = preg_match( '/^([\x00-\x7f]|[\xc0-\xdf][\x80-\xbf]|' .
				'[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})+$/', $s );
		if ( $isutf8 ) {
			return $s;
		}

to

		if ( mb_check_encoding( $s, 'UTF-8' )  ) {
			return $s;
		}

enough?

Language::checkTitleEncoding has no unit tests written for it either.

Reedy added a comment.Via ConduitMay 31 2012, 10:50 AM

For anyone not following along... Lupo created https://gerrit.wikimedia.org/r/#/c/9126/ and added some unit tests

Reedy added a comment.Via ConduitMay 31 2012, 12:49 PM

Code is now live on Wikimedia wikis. Both the regex fix, and the addition of mb_check_encoding() are there. Should be quite a bit quicker. Shouldn't need the recursion limit increasing again either

Anomie added a comment.Via ConduitMay 31 2012, 2:59 PM

(In reply to comment #18)

Shouldn't need the recursion limit increasing again either

Are you sure that this low recursion limit isn't breaking things elsewhere in the code too?

Reedy added a comment.Via ConduitMay 31 2012, 3:58 PM

(In reply to comment #19)

(In reply to comment #18)
> Shouldn't need the recursion limit increasing again either

Are you sure that this low recursion limit isn't breaking things elsewhere in
the code too?

Nope. But leaving it low will bring them up, and maybe we can fix those aswell? ;)

gerritbot added a comment.Via ConduitApr 9 2013, 3:59 PM

Related URL: https://gerrit.wikimedia.org/r/58306 (Gerrit Change I27203c767d1d3f2f0999b1b1d8a06e8cf68c19ed)

Add Comment

Column Prototype
This is a very early prototype of a persistent column. It is not expected to work yet, and leaving it open will activate other new features which will break things. Press "\" (backslash) on your keyboard to close it now.