
Don't crash when MediaWiki returns a page title different from the query because of normalization (Arabic and Malayalam normalization in particular)
Open, High, Public

Description

While I was testing https://gerrit.wikimedia.org/r/#/c/201446 I got the following error:

WARNING: loadpageinfo: Query on [[ar:قالب:ﻲﺘﻴﻣﺓ]] returned data on 'قالب:يتيمة'
Traceback (most recent call last):
  File "pwb.py", line 213, in <module>
    run_python_file(filename, argv, argvu, file_package)
  File "pwb.py", line 82, in run_python_file
    main_mod.__dict__)
  File "./scripts/lonelypages.py", line 280, in <module>
    main()
  File "./scripts/lonelypages.py", line 276, in main
    bot = LonelyPagesBot(generator, **options)
  File "./scripts/lonelypages.py", line 148, in __init__
    self._exception = orphan_template.generate(self.site)
  File "./scripts/lonelypages.py", line 99, in generate
    if not pywikibot.Page(site, self._name, template_ns.id).exists():
  File "/home/xzise/Programms/pywikibot/core/pywikibot/page.py", line 604, in exists
    return self.site.page_exists(self)
  File "/home/xzise/Programms/pywikibot/core/pywikibot/site.py", line 2415, in page_exists
    return page._pageid > 0
AttributeError: 'Page' object has no attribute '_pageid'

It seems odd that it returns the data for a different page.

Event Timeline

XZise created this task. · Apr 2 2015, 12:08 PM
XZise raised the priority of this task to Needs Triage.
XZise updated the task description.
XZise added a project: Pywikibot.
XZise added a subscriber: XZise.
XZise updated the task description. · Apr 2 2015, 12:10 PM
XZise set Security to None.
Xqt added a subscriber: Xqt. · Nov 17 2015, 3:01 PM

Looks like a spelling variant resolved by the query result. The same bug occurs if there are HTML entities inside the page title, e.g.:

>>> import pwb, pywikibot as py
>>> s = py.Site()
>>> p = py.Page(s, u'Eaton&nbsp;Corporation')
WARNING: loadpageinfo: Query on [[de:Eaton Corporation]] returned data on 'Eaton Corporation'

Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
    p.exists()
  File "pywikibot\page.py", line 671, in exists
    x = self.site.page_exists(self)
  File "pywikibot\site.py", line 2901, in page_exists
    return page._pageid > 0
AttributeError: 'Page' object has no attribute '_pageid'
Xqt triaged this task as High priority. · Nov 17 2015, 3:01 PM

I'm pretty sure that the HTML entity problem is much easier to solve, as we can do the same transform on the client side easily.
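
A minimal sketch of such a client-side transform, assuming it amounts to decoding HTML entities and turning the resulting non-breaking space back into a plain space (which may not cover everything MediaWiki does to titles):

import html

def decode_entities_in_title(title):
    """Decode HTML entities the way MediaWiki appears to, on the client side."""
    decoded = html.unescape(title)         # 'Eaton&nbsp;Corporation' -> 'Eaton\xa0Corporation'
    return decoded.replace('\u00a0', ' ')  # MediaWiki titles use plain spaces

print(decode_entities_in_title(u'Eaton&nbsp;Corporation'))  # Eaton Corporation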

The Arabic case may be Unicode normalisation, in which case we can try to do the same transform on the client side. The API doesn't give any hints, and converttitles doesn't work, so this isn't T101597. That may even mean this is a case where the MediaWiki API should be, but is not, informing the client about a changed title.

Need help from someone who understands Arabic, or a MediaWiki person who understands the transform which is being done on this title.

Dalba added a comment. (Edited) · Dec 15 2015, 12:34 PM

I can confirm that this is a Unicode normalization.
For example, ARABIC LETTER YEH FINAL FORM is converted to ARABIC LETTER YEH.
You can confirm this in Python as follows:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', 'قالب:ﻲﺘﻴﻣﺓ') == unicodedata.normalize('NFKD', 'قالب:يتيمة')
True

P.S. Even though Persian and Arabic have a very similar writing system, such normalization does not exist on Persian Wikipedia.

Maybe it's related to T87645

> Maybe it's related to T87645

No.

whym added a subscriber: whym. (Edited) · May 28 2016, 7:56 AM

It seems that Arabic normalization is controlled by $wgFixArabicUnicode in MediaWiki. T11413 and rSVN60599 provide more details. There is also $wgFixMalayalamUnicode for Malayalam.

The actual normalization data are found in serialized/normalize-ar.ser. Are the data exposed in the MediaWiki API?

I don't believe that is exposed via the API.
Is serialized/normalize-ar.ser generated from another source? Maybe a package already exists on PyPI with the same data?

whym added a comment. · Jun 12 2016, 10:16 AM

The source data is extracted from http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt.

I think it is reasonable to ask the API to expose whether $wgFixArabicUnicode is true or not (that is, to normalize or not).

The normalization procedure can be replicated in Python using the source data. MediaWiki selectively applies normalization, as shown in the PHP code of ./maintenance/language/generateNormalizerDataAr.php:

if ( ( $code >= 0xFB50 && $code <= 0xFDFF ) # Arabic presentation forms A
   || ( $code >= 0xFE70 && $code <= 0xFEFF ) # Arabic presentation forms B

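A rough Python equivalent of that selective behaviour, as a sketch only: it assumes the rule is simply "apply compatibility normalization to characters in those two presentation-form blocks", while the real table in serialized/normalize-ar.ser may differ in detail.

import unicodedata

def normalize_arabic_presentation_forms(title):
    """Decompose only Arabic presentation forms A/B, leaving other characters alone."""
    out = []
    for ch in title:
        code = ord(ch)
        if 0xFB50 <= code <= 0xFDFF or 0xFE70 <= code <= 0xFEFF:
            out.append(unicodedata.normalize('NFKC', ch))
        else:
            out.append(ch)
    return ''.join(out)

# Should now agree with the title MediaWiki reported back:
print(normalize_arabic_presentation_forms('قالب:ﻲﺘﻴﻣﺓ'))
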
As for an existing package to do the same - I'm not sure. We'd probably want something like a language-specific version of unicodedata.

whym renamed this task from "Data returned for another page" to "Don't crash when MediaWiki returns a page title different from the query because of normalization (Arabic and Malayalam normalization in particular)". · Jul 2 2016, 12:00 AM
whym added a comment. · Jul 2 2016, 12:02 AM

The variables $wgFixArabicUnicode, $wgFixMalayalamUnicode and $wgAllUnicodeFixes are now exposed.
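
A quick way to check whether a given wiki exposes these settings is to query meta=siteinfo; note that the property names used below ('fixarabicunicode', 'fixmalayalamunicode') are assumptions for illustration, not confirmed here:

import requests

resp = requests.get('https://ar.wikipedia.org/w/api.php', params={
    'action': 'query',
    'meta': 'siteinfo',
    'siprop': 'general',
    'format': 'json',
}).json()

general = resp['query']['general']
# .get() returns None when the wiki or API version does not expose the key.
print(general.get('fixarabicunicode'), general.get('fixmalayalamunicode'))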

jayvdb added a comment. · Jul 2 2016, 8:16 AM

Great!

Do we need to support older versions of MediaWiki?

How frequently do these normalisations/crashes occur in these languages?

Are there many non-Wikimedia wikis in these languages?

What values should be used on older wikis before this change?

whym added a comment. · Jul 4 2016, 7:35 AM

I don't know the answers to the questions, but here is what I suggest.

A simple solution would be to treat all older wikis as "normalization off" and keep crashing for them when unexpected normalization happens. We know at least that this will not make things worse for them than now.

This is simple because we can always accept what the API gives us. For older wikis, Pywikibot will probably keep failing much as it does now, but at least that is not a regression.

As a next step, we could implement a Pywikibot option that enforces assuming normalization for older wikis regardless of the API values. This might have to be a per-wiki configuration.
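
A sketch of how that fallback could look; the names (ASSUME_NORMALIZATION, the siteinfo keys) are hypothetical and only illustrate the suggested policy:

# Per-wiki override for older wikis that are known to normalize titles.
ASSUME_NORMALIZATION = {'ar': True, 'ml': True}

def title_normalization_expected(lang, siteinfo_general):
    """Decide whether unexpected title normalization should be tolerated.

    siteinfo_general is the dict returned by meta=siteinfo (siprop=general);
    the key names are assumptions, as noted above.
    """
    for key in ('fixarabicunicode', 'fixmalayalamunicode'):
        if key in siteinfo_general:
            return bool(siteinfo_general[key])
    # Older wikis do not expose the settings: assume "off" unless overridden.
    return ASSUME_NORMALIZATION.get(lang, False)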

Xqt added a comment. · Jul 31 2016, 11:37 AM

Why shouldn't we just print the warning and, instead of continuing with the next page item, update the page inside _update_page() to get the pageid and the new title?
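
A simplified sketch of that idea, using internal attribute names from the tracebacks above (_pageid) plus an assumed helper; it is not the actual _update_page() code:

import pywikibot

def adopt_normalized_page_data(page, pagedict):
    """Warn about the changed title and fill in the returned data anyway."""
    api_title = pagedict['title']
    if api_title != page.title():
        pywikibot.warning('Query on %s returned data on %r'
                          % (page, api_title))
        # Adopt the normalized title instead of leaving the page half-initialized.
        page._link = pywikibot.Link(api_title, page.site)
    # 'pageid' is 0 or absent for missing pages, so exists() keeps working.
    page._pageid = pagedict.get('pageid', 0)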

whym added a comment. · Aug 12 2016, 12:57 PM

Xqt's comment above was presumably about https://gerrit.wikimedia.org/r/#/c/293957/ (sorry for not linking it here earlier).

I have replied there, as it was more relevant to the code I submitted there than to the crash issue discussed here.

cscott added a subscriber: cscott. · Oct 19 2018, 7:37 PM