
Don't crash when MediaWiki returns a page title different from the query because of normalization (Arabic and Malayalam normalization in particular)
Closed, ResolvedPublic

Description

While I was testing https://gerrit.wikimedia.org/r/#/c/201446 I got the following error:

WARNING: loadpageinfo: Query on [[ar:قالب:ﻲﺘﻴﻣﺓ]] returned data on 'قالب:يتيمة'
Traceback (most recent call last):
  File "pwb.py", line 213, in <module>
    run_python_file(filename, argv, argvu, file_package)
  File "pwb.py", line 82, in run_python_file
    main_mod.__dict__)
  File "./scripts/lonelypages.py", line 280, in <module>
    main()
  File "./scripts/lonelypages.py", line 276, in main
    bot = LonelyPagesBot(generator, **options)
  File "./scripts/lonelypages.py", line 148, in __init__
    self._exception = orphan_template.generate(self.site)
  File "./scripts/lonelypages.py", line 99, in generate
    if not pywikibot.Page(site, self._name, template_ns.id).exists():
  File "/home/xzise/Programms/pywikibot/core/pywikibot/page.py", line 604, in exists
    return self.site.page_exists(self)
  File "/home/xzise/Programms/pywikibot/core/pywikibot/site.py", line 2415, in page_exists
    return page._pageid > 0
AttributeError: 'Page' object has no attribute '_pageid'

It seems odd that it returns the data for a different page.

Event Timeline

XZise raised the priority of this task from to Needs Triage.
XZise updated the task description. (Show Details)
XZise added a project: Pywikibot.
XZise subscribed.
Restricted Application added subscribers: Aklapper, Unknown Object (MLST). · View Herald Transcript · Apr 2 2015, 12:08 PM
XZise set Security to None.

Looks like a spelling variant resolved by the query result. The same bug occurs if there are HTML entities inside the page title, e.g.:

>>> import pwb, pywikibot as py
>>> s = py.Site()
>>> p = py.Page(s, u'Eaton&nbsp;Corporation')
WARNING: loadpageinfo: Query on [[de:Eaton Corporation]] returned data on 'Eaton Corporation'

Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
    p.exists()
  File "pywikibot\page.py", line 671, in exists
    x = self.site.page_exists(self)
  File "pywikibot\site.py", line 2901, in page_exists
    return page._pageid > 0
AttributeError: 'Page' object has no attribute '_pageid'
Xqt triaged this task as High priority. · Nov 17 2015, 3:01 PM

I'm pretty sure that the HTML entities problem is much easier to solve, as we can easily do the same transform on the client side.
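
For illustration, a rough sketch of that kind of client-side transform (the helper name is made up): decode the HTML entity and fold the resulting non-breaking space into an ordinary space, matching the title shown in the warning above.

import html

def decode_title_entities(title):
    # Decode HTML entities such as &nbsp;, then replace the resulting
    # non-breaking space (U+00A0) with an ordinary space.
    return html.unescape(title).replace('\u00a0', ' ')

print(decode_title_entities('Eaton&nbsp;Corporation'))  # -> 'Eaton Corporation'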

The Arabic case may be Unicode normalisation, in which case we can try to do the same transform on the client side. The API doesn't give any hints, and converttitles doesn't work, so this isn't T101597. That may even mean this is a case where the MediaWiki-Action-API should be, but is not, informing the client about a changed title.

Need help from someone who understands Arabic, or a MediaWiki person who understands the transform which is being done on this title.

I can confirm that this is a Unicode normalization.
For example, ARABIC LETTER YEH FINAL FORM (U+FEF2) is being converted to ARABIC LETTER YEH (U+064A).
You can confirm this in Python as follows:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', 'قالب:ﻲﺘﻴﻣﺓ') == unicodedata.normalize('NFKD', 'قالب:يتيمة')
True

P.S. Even though Persian and Arabic have a very similar writing system, such normalization does not exist on Persian Wikipedia.

It seems that Arabic normalization is controlled by $wgFixArabicUnicode in MediaWiki. T11413 and rSVN60599 provide more details. There is also $wgFixMalayalamUnicode for Malayalam.

The actual normalization data are found in serialized/normalize-ar.ser. Are the data exposed in the MediaWiki API?

I don't believe that is exposed via the API.
Is serialized/normalize-ar.ser generated from another source? Maybe a package with the same data already exists on PyPI?

The source data is extracted from http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt.

I think it is reasonable to ask the API to expose whether $wgFixArabicUnicode is true or not (that is, to normalize or not).

The normalization procedure can be replicated in Python using the source data. MediaWiki selectively applies normalization, as shown in the PHP code of ./maintenance/language/generateNormalizerDataAr.php:

if ( ( $code >= 0xFB50 && $code <= 0xFDFF ) # Arabic presentation forms A
   || ( $code >= 0xFE70 && $code <= 0xFEFF ) # Arabic presentation forms B

As for an existing package to do the same - I'm not sure. We'd probably want something like a language-specific version of unicodedata.
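
For illustration, a rough sketch of that selective normalization in Python (the helper name is made up; this is an approximation, not MediaWiki's actual code): compatibility-normalize only the code points falling in the two Arabic presentation-forms blocks checked above.

import unicodedata

def normalize_arabic_presentation_forms(title):
    result = []
    for char in title:
        code = ord(char)
        if 0xFB50 <= code <= 0xFDFF or 0xFE70 <= code <= 0xFEFF:
            # Presentation forms decompose to their base letters under NFKC.
            result.append(unicodedata.normalize('NFKC', char))
        else:
            result.append(char)
    return ''.join(result)

print(normalize_arabic_presentation_forms('قالب:ﻲﺘﻴﻣﺓ'))  # -> 'قالب:يتيمة'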

whym renamed this task from "Data returned for another page" to "Don't crash when MediaWiki returns a page title different from the query because of normalization (Arabic and Malayalam normalization in particular)". · Jul 2 2016, 12:00 AM

Great!

Do we need to support older versions of MediaWiki?

How frequently do these normalisations/crashes occur in these languages?

Are there many non-Wikimedia wikis in these languages?

What values should be used on older wikis before this change?

I don't know the answers to the questions, but here is what I suggest.

A simple solution would be to treat all older wikis as "normalization off" and keep crashing for them when unexpected normalization happens. We at least know that this will not make things worse for them than it is now.

This is simple because we can always accept what the API gives to us. For older wikis, Pywikibot will probably have to fail in a manner similar to how it crashes now, but that's at least not making things worse.

As a next step, we could implement a Pywikibot option that forces the assumption of normalization on older wikis regardless of what the API reports. This might have to be a per-wiki configuration.

Why shouldn't we just print the warning? Instead of continuing with the next pageitem inside _update_page(), we could just update the page to get the pageid and the new title, couldn't we?
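
For illustration, a rough sketch of that idea (the helper name is made up and this is not the actual patch; pageitem is assumed to be the API result entry handled by _update_page()):

import pywikibot

def adopt_normalized_title(page, pageitem):
    # Instead of raising on a title mismatch, warn and adopt the
    # normalized title and pageid returned by the API.
    if pageitem['title'] != page.title():
        pywikibot.warning('Query on {} returned data on {!r}'
                          .format(page, pageitem['title']))
        page._link = pywikibot.Link(pageitem['title'], page.site)
    page._pageid = pageitem.get('pageid', 0)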

Xqt's comment above was presumably about https://gerrit.wikimedia.org/r/#/c/293957/ (sorry for not linking it here earlier).

I have replied there, as it was more relevant to the code I submitted there than to the crash issue discussed here.

Xqt claimed this task.

Solved already:

>>> import pwb, pywikibot as py
>>> s = py.Site()
>>> p = py.Page(s, u'Eaton&nbsp;Corporation')
>>> p
Page('Eaton Corporation')
>>>

The current behavior is:

>>> s = pywikibot.Site('ar')
>>> p = pywikibot.Page(s, 'قالب:ﻲﺘﻴﻣﺓ')
>>> p
Page('قالب:ﻲﺘﻴﻣﺓ')
>>> p.exists()
WARNING: API warning (query): The value passed for "titles" contains invalid or non-normalized data. Textual data should be valid, NFC-normalized Unicode without C0 control characters other than HT (\t), LF (\n), and CR (\r).
Traceback (most recent call last):
  File "<pyshell#42>", line 1, in <module>
    p.exists()
  File "C:\pwb\GIT\core\pywikibot\page\__init__.py", line 718, in exists
    return self.pageid > 0
  File "C:\pwb\GIT\core\pywikibot\page\__init__.py", line 265, in pageid
    self.site.loadpageinfo(self)
  File "C:\pwb\GIT\core\pywikibot\site\_apisite.py", line 1110, in loadpageinfo
    self._update_page(page, query)
  File "C:\pwb\GIT\core\pywikibot\site\_apisite.py", line 1087, in _update_page
    raise InconsistentTitleError(page, pageitem['title'])
pywikibot.exceptions.InconsistentTitleError: Query on [[ar:قالب:ﻲﺘﻴﻣﺓ]] returned data on 'قالب:يتيمة'
>>>

Can this task be closed? Or is something left to do here?

Yes, let's close it. It might have been slightly better if we had made the behavior configurable, but after 4+ years, I guess the benefit of doing so now is little to none. (Most people would not be using the older versions of MediaWiki any more.)