
Don't crash when MediaWiki returns a page title different from the query because of normalization (Arabic and Malayalam normalization in particular)
Closed, ResolvedPublic

Description

While I was testing https://gerrit.wikimedia.org/r/#/c/201446 I got the following error:

WARNING: loadpageinfo: Query on [[ar:قالب:ﻲﺘﻴﻣﺓ]] returned data on 'قالب:يتيمة'
Traceback (most recent call last):
  File "pwb.py", line 213, in <module>
    run_python_file(filename, argv, argvu, file_package)
  File "pwb.py", line 82, in run_python_file
    main_mod.__dict__)
  File "./scripts/lonelypages.py", line 280, in <module>
    main()
  File "./scripts/lonelypages.py", line 276, in main
    bot = LonelyPagesBot(generator, **options)
  File "./scripts/lonelypages.py", line 148, in __init__
    self._exception = orphan_template.generate(self.site)
  File "./scripts/lonelypages.py", line 99, in generate
    if not pywikibot.Page(site, self._name, template_ns.id).exists():
  File "/home/xzise/Programms/pywikibot/core/pywikibot/page.py", line 604, in exists
    return self.site.page_exists(self)
  File "/home/xzise/Programms/pywikibot/core/pywikibot/site.py", line 2415, in page_exists
    return page._pageid > 0
AttributeError: 'Page' object has no attribute '_pageid'

It seems odd that it returns the data for a different page.

Event Timeline

XZise raised the priority of this task from to Needs Triage.
XZise updated the task description. (Show Details)
XZise added a project: Pywikibot.
XZise subscribed.
Restricted Application added subscribers: Aklapper, Unknown Object (MLST). · View Herald Transcript · Apr 2 2015, 12:08 PM
XZise set Security to None.

Looks like a spelling variant resolved by the query result. The same bug occurs if there are HTML entities inside the page title, e.g.:

>>> import pwb, pywikibot as py
>>> s = py.Site()
>>> p = py.Page(s, u'Eaton&nbsp;Corporation')
WARNING: loadpageinfo: Query on [[de:Eaton Corporation]] returned data on 'Eaton Corporation'

Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
    p.exists()
  File "pywikibot\page.py", line 671, in exists
    x = self.site.page_exists(self)
  File "pywikibot\site.py", line 2901, in page_exists
    return page._pageid > 0
AttributeError: 'Page' object has no attribute '_pageid'
Xqt triaged this task as High priority. · Nov 17 2015, 3:01 PM

I'm pretty sure that the HTML entities problem is much easier to solve, as we can easily do the same transform on the client side.
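
For illustration, a rough sketch of that kind of client-side transform (the helper name is made up): decode the HTML entity and fold the resulting non-breaking space into an ordinary space, matching the title shown in the warning above.

import html

def decode_title_entities(title):
    # Decode HTML entities such as &nbsp;, then replace the resulting
    # non-breaking space (U+00A0) with an ordinary space.
    return html.unescape(title).replace('\u00a0', ' ')

print(decode_title_entities('Eaton&nbsp;Corporation'))  # -> 'Eaton Corporation'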

The Arabic case may be Unicode normalisation, in which case we can try to do the same transform on the client side. The API doesn't give any hints, and converttitles doesn't work, so this isn't T101597. That may even mean this is a case where the MediaWiki-Action-API should be, but is not, informing the client about a changed title.

Need help from someone who understands Arabic, or a MediaWiki person who understands the transform which is being done on this title.

I can confirm that this is a Unicode normalization.
For example, ARABIC LETTER YEH FINAL FORM (U+FEF2) is being converted to ARABIC LETTER YEH (U+064A).
You can confirm this in Python as follows:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', 'قالب:ﻲﺘﻴﻣﺓ') == unicodedata.normalize('NFKD', 'قالب:يتيمة')
True

P.S. Even though Persian and Arabic have a very similar writing system, such normalization does not exist on Persian Wikipedia.

It seems that Arabic normalization is controlled by $wgFixArabicUnicode in MediaWiki. T11413 and rSVN60599 provide more details. There is also $wgFixMalayalamUnicode for Malayalam.

The actual normalization data are found in serialized/normalize-ar.ser. Are the data exposed in the MediaWiki API?

I don't believe that is exposed via the API.
Is serialized/normalize-ar.ser generated from another source? Maybe a package with the same data already exists on PyPI?

The source data is extracted from http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt.

I think it is reasonable to ask the API to expose whether $wgFixArabicUnicode is true or not (that is, to normalize or not).

The normalization procedure can be replicated in Python using the source data. MediaWiki selectively applies normalization, as shown in the PHP code of ./maintenance/language/generateNormalizerDataAr.php:

if ( ( $code >= 0xFB50 && $code <= 0xFDFF ) # Arabic presentation forms A
   || ( $code >= 0xFE70 && $code <= 0xFEFF ) # Arabic presentation forms B

As for an existing package to do the same - I'm not sure. We'd probably want something like a language-specific version of unicodedata.
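
For illustration, a rough sketch of that selective normalization in Python (the helper name is made up; this is an approximation, not MediaWiki's actual code): compatibility-normalize only the code points falling in the two Arabic presentation-forms blocks checked above.

import unicodedata

def normalize_arabic_presentation_forms(title):
    result = []
    for char in title:
        code = ord(char)
        if 0xFB50 <= code <= 0xFDFF or 0xFE70 <= code <= 0xFEFF:
            # Presentation forms decompose to their base letters under NFKC.
            result.append(unicodedata.normalize('NFKC', char))
        else:
            result.append(char)
    return ''.join(result)

print(normalize_arabic_presentation_forms('قالب:ﻲﺘﻴﻣﺓ'))  # -> 'قالب:يتيمة'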

whym renamed this task from "Data returned for another page" to "Don't crash when MediaWiki returns a page title different from the query because of normalization (Arabic and Malayalam normalization in particular)". · Jul 2 2016, 12:00 AM

Great!

Do we need to support older versions of MediaWiki?

How frequently do these normalisations/crashes occur in these languages?

Are there many non-Wikimedia wikis in these languages?

What values should be used on older wikis before this change?

I don't know the answers to the questions, but here is what I suggest.

A simple solution would be to treat all older wikis as "normalization off" and keep crashing for them when unexpected normalization happens. We at least know that this will not make things worse for them than it is now.

This is simple because we can always accept what the API gives to us. For older wikis, Pywikibot will probably have to fail in a manner similar to how it crashes now, but that's at least not making things worse.

As a next step, we could implement a Pywikibot option that forces the assumption of normalization on older wikis regardless of what the API reports. This might have to be a per-wiki configuration.

Why shouldn't we just print the warning? Instead of continuing with the next pageitem inside _update_page(), we could just update the page to get the pageid and the new title, couldn't we?
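
For illustration, a rough sketch of that idea (the helper name is made up and this is not the actual patch; pageitem is assumed to be the API result entry handled by _update_page()):

import pywikibot

def adopt_normalized_title(page, pageitem):
    # Instead of raising on a title mismatch, warn and adopt the
    # normalized title and pageid returned by the API.
    if pageitem['title'] != page.title():
        pywikibot.warning('Query on {} returned data on {!r}'
                          .format(page, pageitem['title']))
        page._link = pywikibot.Link(pageitem['title'], page.site)
    page._pageid = pageitem.get('pageid', 0)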

Xqt's comment above was presumably about https://gerrit.wikimedia.org/r/#/c/293957/ (sorry for not linking it here earlier).

I have replied there, as it was more relevant to the code I submitted there than to the crash issue discussed here.

Xqt claimed this task.

Solved already:

>>> import pwb, pywikibot as py
>>> s = py.Site()
>>> p = py.Page(s, u'Eaton&nbsp;Corporation')
>>> p
Page('Eaton Corporation')
>>>

The current behavior is:

>>> s = pywikibot.Site('ar')
>>> p = pywikibot.Page(s, 'قالب:ﻲﺘﻴﻣﺓ')
>>> p
Page('قالب:ﻲﺘﻴﻣﺓ')
>>> p.exists()
WARNING: API warning (query): The value passed for "titles" contains invalid or non-normalized data. Textual data should be valid, NFC-normalized Unicode without C0 control characters other than HT (\t), LF (\n), and CR (\r).
Traceback (most recent call last):
  File "<pyshell#42>", line 1, in <module>
    p.exists()
  File "C:\pwb\GIT\core\pywikibot\page\__init__.py", line 718, in exists
    return self.pageid > 0
  File "C:\pwb\GIT\core\pywikibot\page\__init__.py", line 265, in pageid
    self.site.loadpageinfo(self)
  File "C:\pwb\GIT\core\pywikibot\site\_apisite.py", line 1110, in loadpageinfo
    self._update_page(page, query)
  File "C:\pwb\GIT\core\pywikibot\site\_apisite.py", line 1087, in _update_page
    raise InconsistentTitleError(page, pageitem['title'])
pywikibot.exceptions.InconsistentTitleError: Query on [[ar:قالب:ﻲﺘﻴﻣﺓ]] returned data on 'قالب:يتيمة'
>>>

Can this task be closed? Or is something left to do here?

Yes, let's close it. It might have been slightly better if we had made the behavior configurable, but after 4+ years, I guess the benefit of doing so now is little to none. (Most people would not be using the older versions of MediaWiki any more.)