Python issue 10254
Closed, Resolved (Public)

Description

Pywikibot ran into several Unicode issues in 2010–11, experienced primarily with interwiki.py.

This task was created to capture some of that history, as it is a key factor in some versions being listed as unsupported in the support matrix at https://www.mediawiki.org/wiki/Manual:Pywikibot/Version_table

Python issue 10254 was initially raised as http://sourceforge.net/p/pywikipediabot/bugs/1246 , affecting Python 2.6.6, Python 2.7.0 and Python 3.0. The fix was merged into Python 2.6.7 and Python 2.7.2, but was not backported to 2.7.1; on the Python 3 side, it landed long before 3.3, the lowest Python 3 version Pywikibot supports. Therefore this bug affects Pywikibot support on Python 2.6.6, Python 2.7.0 and Python 2.7.1.

http://sourceforge.net/p/pywikipediabot/bugs/1382/ was a related issue, but as its solution was simply "upgrade to 2.7.2", there is little information about the cause or the fix. It is listed as the reason that Python 2.5 was de-supported.

Event Timeline

jayvdb raised the priority of this task from to Needs Triage.
jayvdb updated the task description.
jayvdb subscribed.
jayvdb claimed this task.

Just speculating, but http://bugs.python.org/issue8024 (Unicode 5.2) may have been the problem with Python 2.5 and early Python 2.6 releases.
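
One way to test that speculation is to check which Unicode Character Database a given interpreter ships; Python 2.6 bundled UCD 5.1.0, while 2.7 moved to 5.2.0:

>>> import unicodedata
>>> unicodedata.unidata_version   # e.g. on Python 2.7
'5.2.0'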

Change 218884 had a related patch set uploaded (by John Vandenberg):
Decommission support for Python 2.7.0 and 2.7.1

https://gerrit.wikimedia.org/r/218884

This bug does affect Python 2.6.6, but was fixed in the first patch that went into 2.6.7 :/
To make things worse, Red Hat Enterprise Linux has not fixed this bug in its Python 2.6.6 (which is currently at its 52nd revision).

Change 225900 had a related patch set uploaded (by XZise):
[IMPROV] Remove exception because of unicode bug

https://gerrit.wikimedia.org/r/225900

Python issue 10254 occurs within Link.__init__, where it calls t = unicodedata.normalize('NFC', t); that call dates back to 2006 (ed5e7395), when MediaWiki was at version 1.5.
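
For context, a minimal illustration of what that normalize call is for (the example string here is mine, not from the task): composing decomposed characters so that equivalent titles compare equal.

>>> import unicodedata
>>> t = u'Ame\u0301lie'                    # 'e' followed by a combining acute
>>> unicodedata.normalize('NFC', t) == u'Am\xe9lie'   # precomposed 'é'
True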

As RHEL's Python 2.6.6 does not include the fix for unicodedata.normalize, yet our RHEL users are not complaining, this bug is evidently not affecting normal use. This is most likely because:

  1. they are not using languages which trigger this bug, and/or
  2. they are using MediaWiki versions whose API provides titles that do not need to be normalised.

We can feature-detect one or both of those to prevent this bug from occurring; a sketch of the second check follows.
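
A hypothetical sketch of that version check, assuming (as suggested below, but not confirmed) that MediaWiki 1.14+ already emits NFC-normalised titles; the helper name and the threshold are mine:

from pywikibot.tools import MediaWikiVersion

def api_provides_nfc_titles(site):
    """Guess whether this wiki's API already NFC-normalises titles."""
    # Assumed threshold; see the discussion of MediaWiki 1.14+ below.
    return MediaWikiVersion(site.version()) >= MediaWikiVersion('1.14')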

It could be that MediaWiki versions 1.14+ do not need unicodedata.normalize at all, in which case we could simply remove this line.

http://bugs.python.org/issue10254 refers to three example strings that cause the problem:

  1. u'Li\u030dt-s\u1e73\u0301' = Li̍t-sṳ́
  2. u'\u092e\u093e\u0930\u094d\u0915 \u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917' = मार्क ज़ुकेरबर्ग
  3. u'\u0915\u093f\u0930\u094d\u0917\u093f\u091c\u093c\u0938\u094d\u0924\u093e\u0928' = किर्गिज़स्तान (api langlinks from en.wp)

Python issue 10254 is entirely about strings which unicodedata.normalize should return unmodified: no normalisation is necessary, yet it returns an incorrectly normalised string, or crashes on 2.7.1!
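
A minimal reproduction sketch using the three strings above: on a fixed interpreter all assertions pass, while on 2.6.6/2.7.0 the result is silently modified, and on 2.7.1 the call can crash outright.

import unicodedata

for s in (u'Li\u030dt-s\u1e73\u0301',
          u'\u092e\u093e\u0930\u094d\u0915 \u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917',
          u'\u0915\u093f\u0930\u094d\u0917\u093f\u091c\u093c\u0938\u094d\u0924\u093e\u0928'):
    # these strings are already NFC; normalize() must return them unchanged
    assert unicodedata.normalize('NFC', s) == s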

To verify that the API is currently emitting NFC-normalised langlink titles, I put the following into User:John_Vandenberg/test:

[[fa:&#x2126;]]

[[fi:&#937;]]

https://en.wikipedia.org/w/api.php?action=query&prop=langlinks&titles=User:John_Vandenberg/test returns

{
    "query": {
        "pages": {
            "40071800": {
                "pageid": 40071800,
                "ns": 2,
                "title": "User:John Vandenberg/test",
                "langlinks": [
                    {
                        "lang": "fa",
                        "*": "\u03a9"
                    },
                    {
                        "lang": "fi",
                        "*": "\u03a9"
                    }
                ]
            }
        }
    }
}

(I also tested other formats, such as dbg, and all return actual Unicode.)
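
The same check can be scripted; a rough sketch (Python 2, matching the affected environments) that fetches the langlinks and asserts each returned title is already NFC:

import json
import unicodedata
import urllib2

url = ('https://en.wikipedia.org/w/api.php?action=query&prop=langlinks'
       '&titles=User:John_Vandenberg/test&format=json')
data = json.load(urllib2.urlopen(url))
for page in data['query']['pages'].values():
    for link in page.get('langlinks', []):
        # the API should already have NFC-normalised this title
        assert unicodedata.normalize('NFC', link['*']) == link['*']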

The only way to get unnormalised HTML character references out is to use export, like https://en.wikipedia.org/w/api.php?action=query&prop=langlinks&titles=User:John_Vandenberg/test&export=1 :

{
    "batchcomplete": "",
    "query": {
        "normalized": [
            {
                "from": "User:John_Vandenberg/test",
                "to": "User:John Vandenberg/test"
            }
        ],
        "pages": {
            "40071800": {
                "pageid": 40071800,
                "ns": 2,
                "title": "User:John Vandenberg/test",
                "langlinks": [
                    {
                        "lang": "fa",
                        "*": "\u03a9"
                    },
                    {
                        "lang": "fi",
                        "*": "\u03a9"
                    }
                ]
            }
        },
        "export": {
            "*": "<mediawiki xmlns=\"http://www.mediawiki.org/xml/export-0.10/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd\" version=\"0.10\" xml:lang=\"en\">\n  <siteinfo>\n    <sitename>Wikipedia</sitename>\n    <dbname>enwiki</dbname>\n    <base>https://en.wikipedia.org/wiki/Main_Page</base>\n    <generator>MediaWiki 1.26wmf14</generator>\n    <case>first-letter</case>\n    <namespaces>\n      <namespace key=\"-2\" case=\"first-letter\">Media</namespace>\n      <namespace key=\"-1\" case=\"first-letter\">Special</namespace>\n      <namespace key=\"0\" case=\"first-letter\" />\n      <namespace key=\"1\" case=\"first-letter\">Talk</namespace>\n      <namespace key=\"2\" case=\"first-letter\">User</namespace>\n      <namespace key=\"3\" case=\"first-letter\">User talk</namespace>\n      <namespace key=\"4\" case=\"first-letter\">Wikipedia</namespace>\n      <namespace key=\"5\" case=\"first-letter\">Wikipedia talk</namespace>\n      <namespace key=\"6\" case=\"first-letter\">File</namespace>\n      <namespace key=\"7\" case=\"first-letter\">File talk</namespace>\n      <namespace key=\"8\" case=\"first-letter\">MediaWiki</namespace>\n      <namespace key=\"9\" case=\"first-letter\">MediaWiki talk</namespace>\n      <namespace key=\"10\" case=\"first-letter\">Template</namespace>\n      <namespace key=\"11\" case=\"first-letter\">Template talk</namespace>\n      <namespace key=\"12\" case=\"first-letter\">Help</namespace>\n      <namespace key=\"13\" case=\"first-letter\">Help talk</namespace>\n      <namespace key=\"14\" case=\"first-letter\">Category</namespace>\n      <namespace key=\"15\" case=\"first-letter\">Category talk</namespace>\n      <namespace key=\"100\" case=\"first-letter\">Portal</namespace>\n      <namespace key=\"101\" case=\"first-letter\">Portal talk</namespace>\n      <namespace key=\"108\" case=\"first-letter\">Book</namespace>\n      <namespace key=\"109\" case=\"first-letter\">Book talk</namespace>\n      <namespace key=\"118\" case=\"first-letter\">Draft</namespace>\n      <namespace key=\"119\" case=\"first-letter\">Draft talk</namespace>\n      <namespace key=\"446\" case=\"first-letter\">Education Program</namespace>\n      <namespace key=\"447\" case=\"first-letter\">Education Program talk</namespace>\n      <namespace key=\"710\" case=\"first-letter\">TimedText</namespace>\n      <namespace key=\"711\" case=\"first-letter\">TimedText talk</namespace>\n      <namespace key=\"828\" case=\"first-letter\">Module</namespace>\n      <namespace key=\"829\" case=\"first-letter\">Module talk</namespace>\n      <namespace key=\"2600\" case=\"first-letter\">Topic</namespace>\n    </namespaces>\n  </siteinfo>\n  <page>\n    <title>User:John Vandenberg/test</title>\n    <ns>2</ns>\n    <id>40071800</id>\n    <revision>\n      <id>672324473</id>\n      <parentid>671175129</parentid>\n      <timestamp>2015-07-20T20:42:08Z</timestamp>\n      <contributor>\n        <username>John Vandenberg</username>\n        <id>101140</id>\n      </contributor>\n      <comment>[[WP:AES|\u2190]]Replaced content with '[[fa:&amp;#x2126;]]  [[fi:&amp;#937;]]'</comment>\n      <model>wikitext</model>\n      <format>text/x-wiki</format>\n      <text xml:space=\"preserve\" bytes=\"30\">[[fa:&amp;#x2126;]]\n\n[[fi:&amp;#937;]]</text>\n      <sha1>qmuv4xpjkjen1lug58vlwifs20xuk81</sha1>\n    </revision>\n  </page>\n</mediawiki>\n"
        }
    }
}

The MediaWiki code in question is APIResult::cleanUpUTF8(), which calls Language::normalize() since ad19c032, but has called UtfNormal::cleanUp() directly or indirectly since 5559e3f2 (MW 1.15).

However, f6307768a shows it wasn't working for some revisions (we need to determine whether any official releases contained this bug).

Even if the current MediaWiki implementation is returning normalized strings, don't we still need to normalize any other title not returned by the API?

There will certainly be problems if a Link has an unnormalised Unicode title.
The main problems arise because the API performs Unicode normalisation on all input parameters and output results, but does not report that such normalisation occurred: T29849: API: add normalized info also for unicode normalization of titles.

To work around that, Pywikibot would need to ask the API about each Link to determine whether the Link has the correctly normalised title, i.e. whether its title matches the API's title for the same page. I see that for Malayalam and Arabic, Language::normalize performs other conversions, which are probably not covered by unicodedata.normalize. We already have another case of server-side title normalisation breaking our Link object, as it doesn't and couldn't detect the correct title: T101597: Page.exists(): Cannot auto detect whether a page title in different variant exists.
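
A hypothetical sketch of that round trip (the helper name is mine; site is a pywikibot APISite): a Link's title would only be trusted if the server reports it back unchanged.

from pywikibot.data import api

def server_title(site, title):
    """Ask the API what it calls this title (hypothetical helper)."""
    data = api.Request(site=site, action='query', titles=title).submit()
    page = list(data['query']['pages'].values())[0]
    return page['title']

# the title is safe only if it round-trips unchanged:
# server_title(site, title) == title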

If we are not going to call unicodedata.normalize, and not going to implement an (inefficient and difficult) "normalise every title using the API" scheme, the alternative is to add a simple precaution that prevents using Link with titles which would trigger issue 10254.

It seems this is possible with a broad check for whether the title contains any combining character, like so:

>>> import sys, unicodedata
>>> if sys.version_info[:3] == (2, 6, 6) and any(unicodedata.combining(c) for c in title):
...     raise UnicodeError('%r contains combining characters, which are '
...                        'not supported on Python 2.6.6' % title)

@Multichill, would this be acceptable to you?

Reopening, as there are several viable approaches to work around this problem reliably, and we need one or more of them until Python 2.6.6 is no longer supported (e.g. T103063: Drop py2.6 support).

Another solution is to use the unicodedata2 backport if it is installed. I've requested that they distribute wheels to make installation easier.
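
A minimal sketch of that fallback: prefer the backport when present, otherwise use the stdlib module.

try:
    import unicodedata2 as unicodedata  # backport with a newer UCD and the fix
except ImportError:
    import unicodedata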

Change 218884 merged by jenkins-bot:
Python issue #10254

https://gerrit.wikimedia.org/r/218884

Change 243349 had a related patch set uploaded (by John Vandenberg):
Desupport Python 2.6 for Pywikibot 2.0 release branch

https://gerrit.wikimedia.org/r/243349

Change 432763 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [cleanup] remove unicodedata2 dependency

https://gerrit.wikimedia.org/r/432763

Change 432763 merged by jenkins-bot:
[pywikibot/core@master] [cleanup] remove unicodedata2 dependency

https://gerrit.wikimedia.org/r/432763

Change 225900 abandoned by Xqt:
[IMPROV] Remove exception because of unicode bug

Reason:
Old Python is no longer supported and Python 2 will be dropped soon

https://gerrit.wikimedia.org/r/225900