UnicodeDecodeError (py2) or RuntimeError (py3) on .getRedirectTarget() with page « Ꜫ »
Closed, Resolved (Public)

Description

tools.framabot@tools-bastion-02:~$ python -V
Python 2.7.6
tools.framabot@tools-bastion-02:~$ python /shared/pywikibot/core/pwb.py redirect -lang:fr double -namespace:0 -namespace:2 -always
Retrieving double redirect special page...
Retrieving 25 pages from wikipedia:fr.


>>> Les Luniens <<<
   Links to: [[E = mc² (recueil)#Les Luniens]].
Skipping: Redirect target [[E = mc² (recueil)#Les Luniens]] is not a redirect.


>>> Madrid-Barajas <<<
   Links to: [[Aéroport Adolfo Suárez Madrid-Barajas]].
Skipping: Redirect target [[Aéroport Adolfo Suárez Madrid-Barajas]] is not a redirect.


>>> Madrid Barajas <<<
   Links to: [[Aéroport Adolfo Suárez Madrid-Barajas]].
Skipping: Redirect target [[Aéroport Adolfo Suárez Madrid-Barajas]] is not a redirect.


>>> Musée du Second Empire (Compiègne) <<<
   Links to: [[Palais de Compiègne#Musée du Second Empire et Musée de l'Impératrice]].
Skipping: Redirect target [[Palais de Compiègne#Musée du Second Empire et Musée de l'Impératrice]] is not a redirect.


>>> Musée national du château de Compiègne <<<
   Links to: [[Palais de Compiègne#Musée du Second Empire et Musée de l'Impératrice]].
Skipping: Redirect target [[Palais de Compiègne#Musée du Second Empire et Musée de l'Impératrice]] is not a redirect.


>>> Ꜫ <<<

5 pages read
0 pages written
Execution time: 1 seconds
Read operation time: 0 seconds
Script terminated by exception:

ERROR: UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 0: ordinal not in range(128)
Traceback (most recent call last):
  File "/shared/pywikibot/core/pwb.py", line 253, in <module>
    if not main():
  File "/shared/pywikibot/core/pwb.py", line 246, in main
    run_python_file(filename, [filename] + args, argvu, file_package)
  File "/shared/pywikibot/core/pwb.py", line 115, in run_python_file
    main_mod.__dict__)
  File "/shared/pywikibot/core/scripts/redirect.py", line 806, in <module>
    main()
  File "/shared/pywikibot/core/scripts/redirect.py", line 802, in main
    bot.run()
  File "/mnt/nfs/labstore-secondary-tools-project/pywikibot/public_html/core/pywikibot/bot.py", line 1505, in run
    self.treat(page)
  File "/mnt/nfs/labstore-secondary-tools-project/pywikibot/public_html/core/pywikibot/bot.py", line 1733, in treat
    self.treat_page()
  File "/shared/pywikibot/core/scripts/redirect.py", line 716, in treat_page
    self.action_treat(self.current_page)
  File "/shared/pywikibot/core/scripts/redirect.py", line 586, in fix_1_double_redirect
    targetPage = newRedir.getRedirectTarget()
  File "/mnt/nfs/labstore-secondary-tools-project/pywikibot/public_html/core/pywikibot/page.py", line 1666, in getRedirectTarget
    return self.site.getredirtarget(self)
  File "/mnt/nfs/labstore-secondary-tools-project/pywikibot/public_html/core/pywikibot/site.py", line 3186, in getredirtarget
    % title.encode(self.encoding()))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 0: ordinal not in range(128)
<type 'exceptions.UnicodeDecodeError'>
CRITICAL: Closing network session.
tools.framabot@tools-bastion-02:~$ python3 -V
Python 3.4.3
tools.framabot@tools-bastion-02:~$ python3 /shared/pywikibot/core/pwb.py redirect -lang:fr double -namespace:0 -namespace:2 -always
Retrieving double redirect special page...
Retrieving 25 pages from wikipedia:fr.


>>> Les Luniens <<<
   Links to: [[E = mc² (recueil)#Les Luniens]].
Skipping: Redirect target [[E = mc² (recueil)#Les Luniens]] is not a redirect.


>>> Madrid-Barajas <<<
   Links to: [[Aéroport Adolfo Suárez Madrid-Barajas]].
Skipping: Redirect target [[Aéroport Adolfo Suárez Madrid-Barajas]] is not a redirect.


>>> Madrid Barajas <<<
   Links to: [[Aéroport Adolfo Suárez Madrid-Barajas]].
Skipping: Redirect target [[Aéroport Adolfo Suárez Madrid-Barajas]] is not a redirect.


>>> Musée du Second Empire (Compiègne) <<<
   Links to: [[Palais de Compiègne#Musée du Second Empire et Musée de l'Impératrice]].
Skipping: Redirect target [[Palais de Compiègne#Musée du Second Empire et Musée de l'Impératrice]] is not a redirect.


>>> Musée national du château de Compiègne <<<
   Links to: [[Palais de Compiègne#Musée du Second Empire et Musée de l'Impératrice]].
Skipping: Redirect target [[Palais de Compiègne#Musée du Second Empire et Musée de l'Impératrice]] is not a redirect.


>>> Ꜫ <<<

5 pages read
0 pages written
Execution time: 1 seconds
Read operation time: 0 seconds
Script terminated by exception:

ERROR: RuntimeError: getredirtarget: No 'redirects' found for page b'\xea\x9c\xaa'.
Traceback (most recent call last):
  File "/shared/pywikibot/core/pwb.py", line 253, in <module>
    if not main():
  File "/shared/pywikibot/core/pwb.py", line 246, in main
    run_python_file(filename, [filename] + args, argvu, file_package)
  File "/shared/pywikibot/core/pwb.py", line 115, in run_python_file
    main_mod.__dict__)
  File "/shared/pywikibot/core/scripts/redirect.py", line 806, in <module>
    main()
  File "/shared/pywikibot/core/scripts/redirect.py", line 802, in main
    bot.run()
  File "/mnt/nfs/labstore-secondary-tools-project/pywikibot/public_html/core/pywikibot/bot.py", line 1505, in run
    self.treat(page)
  File "/mnt/nfs/labstore-secondary-tools-project/pywikibot/public_html/core/pywikibot/bot.py", line 1733, in treat
    self.treat_page()
  File "/shared/pywikibot/core/scripts/redirect.py", line 716, in treat_page
    self.action_treat(self.current_page)
  File "/shared/pywikibot/core/scripts/redirect.py", line 586, in fix_1_double_redirect
    targetPage = newRedir.getRedirectTarget()
  File "/mnt/nfs/labstore-secondary-tools-project/pywikibot/public_html/core/pywikibot/page.py", line 1666, in getRedirectTarget
    return self.site.getredirtarget(self)
  File "/mnt/nfs/labstore-secondary-tools-project/pywikibot/public_html/core/pywikibot/site.py", line 3186, in getredirtarget
    % title.encode(self.encoding()))
RuntimeError: getredirtarget: No 'redirects' found for page b'\xea\x9c\xaa'.
<class 'RuntimeError'>
CRITICAL: Closing network session.
tools.framabot@tools-bastion-02:~$
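
A note on why the two runs fail differently: in both tracebacks the last frame interpolates title.encode(self.encoding()) into the error message with %. On Python 3 that simply prints the bytes repr, which is what the RuntimeError shows. On Python 2, presumably with a unicode format string, the encoded bytes would be implicitly decoded as ASCII, which matches the UnicodeDecodeError on byte 0xea. The sketch below reproduces only the Python 3 side. The UTF-8 encoding and the exact format string are assumptions reconstructed from the error text, not copied from site.py.

```python
# The bytes in the Python 3 error message decode to U+A72A, the capitalized
# form of U+A72B. That is the title pywikibot ended up asking about.
title = b'\xea\x9c\xaa'.decode('utf-8')
print(hex(ord(title)))

# Assumed format string, reconstructed from the error output above.
template = "getredirtarget: No 'redirects' found for page %s."

# Python 3: interpolating bytes with %s inserts their repr.
message = template % title.encode('utf-8')
print(message)

# Formatting the text itself keeps the title readable on either version.
fixed = "getredirtarget: No 'redirects' found for page {}.".format(title)
print(fixed)
```

Either way the message formatting is a side issue. The underlying failure is that no redirect data was found for the capitalized title.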

Event Timeline

Apparently pywikibot is normalizing the redirect title into its target internally:

>>> '\ua72b' == Page(Site('fr', 'wikipedia'), '\ua72b').title()
False
Dalba triaged this task as High priority. Jul 25 2018, 6:54 PM

It's happening within the first_upper function.

def first_upper(string):
    """
    Return a string with the first character capitalized.

    Empty strings are supported. The original string is not changed.

    Warning: Python 2 and 3 capitalize "ß" differently. MediaWiki does
    not capitalize ß at the beginning. See T179115.
    """
    first = string[:1]
    if first != 'ß':
        first = first.upper()
    return first + string[1:]
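
To make the effect concrete, here is a minimal standalone check that needs neither pywikibot nor a network connection. It copies the function body quoted above and applies it to the title from this task. It assumes a Python 3 interpreter whose Unicode database knows U+A72B.

```python
def first_upper(string):
    """Copy of the function quoted above, without the docstring."""
    first = string[:1]
    if first != 'ß':
        first = first.upper()
    return first + string[1:]


title = '\ua72b'                  # LATIN SMALL LETTER TRESILLO
normalized = first_upper(title)   # Python maps it to U+A72A

print(hex(ord(title)), '->', hex(ord(normalized)))
print(title == normalized)
```

The two code points differ, so a page object built from the lowercase title ends up pointing at the capitalized one.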

Testing on English Wikipedia, I found a total of 802 characters (!) that str.upper on my Python 3.7.0 handles differently from MediaWiki. The following dict maps those characters to their actual (MediaWiki-side) uppercase forms:

{'ß': 'ß', 'ʼn': 'ʼn', 'ƀ': 'ƀ', 'ƚ': 'ƚ', 'Dž': 'Dž', 'dž': 'Dž', 'Lj': 'Lj', 'lj': 'Lj', 'Nj': 'Nj', 'nj': 'Nj', 'ǰ': 'ǰ', 'Dz': 'Dz', 'dz': 'Dz', 'ȼ': 'ȼ', 'ȿ': 'ȿ', 'ɀ': 'ɀ', 'ɂ': 'ɂ', 'ɇ': 'ɇ', 'ɉ': 'ɉ', 'ɋ': 'ɋ', 'ɍ': 'ɍ', 'ɏ': 'ɏ', 'ɐ': 'ɐ', 'ɑ': 'ɑ', 'ɒ': 'ɒ', 'ɜ': 'ɜ', 'ɡ': 'ɡ', 'ɥ': 'ɥ', 'ɦ': 'ɦ', 'ɪ': 'ɪ', 'ɫ': 'ɫ', 'ɬ': 'ɬ', 'ɱ': 'ɱ', 'ɽ': 'ɽ', 'ʇ': 'ʇ', 'ʉ': 'ʉ', 'ʌ': 'ʌ', 'ʝ': 'ʝ', 'ʞ': 'ʞ', 'ͅ': 'ͅ', 'ͱ': 'ͱ', 'ͳ': 'ͳ', 'ͷ': 'ͷ', 'ͻ': 'ͻ', 'ͼ': 'ͼ', 'ͽ': 'ͽ', 'ΐ': 'ΐ', 'ΰ': 'ΰ', 'ϗ': 'ϗ', 'ϲ': 'Σ', 'ϳ': 'ϳ', 'ϸ': 'ϸ', 'ϻ': 'ϻ', 'ӏ': 'ӏ', 'ӷ': 'ӷ', 'ӻ': 'ӻ', 'ӽ': 'ӽ', 'ӿ': 'ӿ', 'ԑ': 'ԑ', 'ԓ': 'ԓ', 'ԕ': 'ԕ', 'ԗ': 'ԗ', 'ԙ': 'ԙ', 'ԛ': 'ԛ', 'ԝ': 'ԝ', 'ԟ': 'ԟ', 'ԡ': 'ԡ', 'ԣ': 'ԣ', 'ԥ': 'ԥ', 'ԧ': 'ԧ', 'ԩ': 'ԩ', 'ԫ': 'ԫ', 'ԭ': 'ԭ', 'ԯ': 'ԯ', 'և': 'և', 'ა': 'ა', 'ბ': 'ბ', 'გ': 'გ', 'დ': 'დ', 'ე': 'ე', 'ვ': 'ვ', 'ზ': 'ზ', 'თ': 'თ', 'ი': 'ი', 'კ': 'კ', 'ლ': 'ლ', 'მ': 'მ', 'ნ': 'ნ', 'ო': 'ო', 'პ': 'პ', 'ჟ': 'ჟ', 'რ': 'რ', 'ს': 'ს', 'ტ': 'ტ', 'უ': 'უ', 'ფ': 'ფ', 'ქ': 'ქ', 'ღ': 'ღ', 'ყ': 'ყ', 'შ': 'შ', 'ჩ': 'ჩ', 'ც': 'ც', 'ძ': 'ძ', 'წ': 'წ', 'ჭ': 'ჭ', 'ხ': 'ხ', 'ჯ': 'ჯ', 'ჰ': 'ჰ', 'ჱ': 'ჱ', 'ჲ': 'ჲ', 'ჳ': 'ჳ', 'ჴ': 'ჴ', 'ჵ': 'ჵ', 'ჶ': 'ჶ', 'ჷ': 'ჷ', 'ჸ': 'ჸ', 'ჹ': 'ჹ', 'ჺ': 'ჺ', 'ჽ': 'ჽ', 'ჾ': 'ჾ', 'ჿ': 'ჿ', 'ᏸ': 'ᏸ', 'ᏹ': 'ᏹ', 'ᏺ': 'ᏺ', 'ᏻ': 'ᏻ', 'ᏼ': 'ᏼ', 'ᏽ': 'ᏽ', 'ᲀ': 'ᲀ', 'ᲁ': 'ᲁ', 'ᲂ': 'ᲂ', 'ᲃ': 'ᲃ', 'ᲄ': 'ᲄ', 'ᲅ': 'ᲅ', 'ᲆ': 'ᲆ', 'ᲇ': 'ᲇ', 'ᲈ': 'ᲈ', 'ᵹ': 'ᵹ', 'ᵽ': 'ᵽ', 'ẖ': 'ẖ', 'ẗ': 'ẗ', 'ẘ': 'ẘ', 'ẙ': 'ẙ', 'ẚ': 'ẚ', 'ỻ': 'ỻ', 'ỽ': 'ỽ', 'ỿ': 'ỿ', 'ὐ': 'ὐ', 'ὒ': 'ὒ', 'ὔ': 'ὔ', 'ὖ': 'ὖ', 'ά': 'Ά', 'έ': 'Έ', 'ή': 'Ή', 'ί': 'Ί', 'ό': 'Ό', 'ύ': 'Ύ', 'ώ': 'Ώ', 'ᾀ': 'ᾈ', 'ᾁ': 'ᾉ', 'ᾂ': 'ᾊ', 'ᾃ': 'ᾋ', 'ᾄ': 'ᾌ', 'ᾅ': 'ᾍ', 'ᾆ': 'ᾎ', 'ᾇ': 'ᾏ', 'ᾈ': 'ᾈ', 'ᾉ': 'ᾉ', 'ᾊ': 'ᾊ', 'ᾋ': 'ᾋ', 'ᾌ': 'ᾌ', 'ᾍ': 'ᾍ', 'ᾎ': 'ᾎ', 'ᾏ': 'ᾏ', 'ᾐ': 'ᾘ', 'ᾑ': 'ᾙ', 'ᾒ': 'ᾚ', 'ᾓ': 'ᾛ', 'ᾔ': 'ᾜ', 'ᾕ': 'ᾝ', 'ᾖ': 'ᾞ', 'ᾗ': 'ᾟ', 'ᾘ': 'ᾘ', 'ᾙ': 'ᾙ', 'ᾚ': 'ᾚ', 'ᾛ': 'ᾛ', 'ᾜ': 'ᾜ', 'ᾝ': 'ᾝ', 'ᾞ': 'ᾞ', 'ᾟ': 'ᾟ', 'ᾠ': 'ᾨ', 'ᾡ': 'ᾩ', 'ᾢ': 'ᾪ', 'ᾣ': 'ᾫ', 'ᾤ': 'ᾬ', 'ᾥ': 'ᾭ', 'ᾦ': 'ᾮ', 'ᾧ': 'ᾯ', 'ᾨ': 'ᾨ', 
'ᾩ': 'ᾩ', 'ᾪ': 'ᾪ', 'ᾫ': 'ᾫ', 'ᾬ': 'ᾬ', 'ᾭ': 'ᾭ', 'ᾮ': 'ᾮ', 'ᾯ': 'ᾯ', 'ᾲ': 'ᾲ', 'ᾳ': 'ᾼ', 'ᾴ': 'ᾴ', 'ᾶ': 'ᾶ', 'ᾷ': 'ᾷ', 'ᾼ': 'ᾼ', 'ῂ': 'ῂ', 'ῃ': 'ῌ', 'ῄ': 'ῄ', 'ῆ': 'ῆ', 'ῇ': 'ῇ', 'ῌ': 'ῌ', 'ῒ': 'ῒ', 'ΐ': 'ΐ', 'ῖ': 'ῖ', 'ῗ': 'ῗ', 'ῢ': 'ῢ', 'ΰ': 'ΰ', 'ῤ': 'ῤ', 'ῦ': 'ῦ', 'ῧ': 'ῧ', 'ῲ': 'ῲ', 'ῳ': 'ῼ', 'ῴ': 'ῴ', 'ῶ': 'ῶ', 'ῷ': 'ῷ', 'ῼ': 'ῼ', 'ⅎ': 'ⅎ', 'ⅰ': 'ⅰ', 'ⅱ': 'ⅱ', 'ⅲ': 'ⅲ', 'ⅳ': 'ⅳ', 'ⅴ': 'ⅴ', 'ⅵ': 'ⅵ', 'ⅶ': 'ⅶ', 'ⅷ': 'ⅷ', 'ⅸ': 'ⅸ', 'ⅹ': 'ⅹ', 'ⅺ': 'ⅺ', 'ⅻ': 'ⅻ', 'ⅼ': 'ⅼ', 'ⅽ': 'ⅽ', 'ⅾ': 'ⅾ', 'ⅿ': 'ⅿ', 'ↄ': 'ↄ', 'ⓐ': 'ⓐ', 'ⓑ': 'ⓑ', 'ⓒ': 'ⓒ', 'ⓓ': 'ⓓ', 'ⓔ': 'ⓔ', 'ⓕ': 'ⓕ', 'ⓖ': 'ⓖ', 'ⓗ': 'ⓗ', 'ⓘ': 'ⓘ', 'ⓙ': 'ⓙ', 'ⓚ': 'ⓚ', 'ⓛ': 'ⓛ', 'ⓜ': 'ⓜ', 'ⓝ': 'ⓝ', 'ⓞ': 'ⓞ', 'ⓟ': 'ⓟ', 'ⓠ': 'ⓠ', 'ⓡ': 'ⓡ', 'ⓢ': 'ⓢ', 'ⓣ': 'ⓣ', 'ⓤ': 'ⓤ', 'ⓥ': 'ⓥ', 'ⓦ': 'ⓦ', 'ⓧ': 'ⓧ', 'ⓨ': 'ⓨ', 'ⓩ': 'ⓩ', 'ⰰ': 'ⰰ', 'ⰱ': 'ⰱ', 'ⰲ': 'ⰲ', 'ⰳ': 'ⰳ', 'ⰴ': 'ⰴ', 'ⰵ': 'ⰵ', 'ⰶ': 'ⰶ', 'ⰷ': 'ⰷ', 'ⰸ': 'ⰸ', 'ⰹ': 'ⰹ', 'ⰺ': 'ⰺ', 'ⰻ': 'ⰻ', 'ⰼ': 'ⰼ', 'ⰽ': 'ⰽ', 'ⰾ': 'ⰾ', 'ⰿ': 'ⰿ', 'ⱀ': 'ⱀ', 'ⱁ': 'ⱁ', 'ⱂ': 'ⱂ', 'ⱃ': 'ⱃ', 'ⱄ': 'ⱄ', 'ⱅ': 'ⱅ', 'ⱆ': 'ⱆ', 'ⱇ': 'ⱇ', 'ⱈ': 'ⱈ', 'ⱉ': 'ⱉ', 'ⱊ': 'ⱊ', 'ⱋ': 'ⱋ', 'ⱌ': 'ⱌ', 'ⱍ': 'ⱍ', 'ⱎ': 'ⱎ', 'ⱏ': 'ⱏ', 'ⱐ': 'ⱐ', 'ⱑ': 'ⱑ', 'ⱒ': 'ⱒ', 'ⱓ': 'ⱓ', 'ⱔ': 'ⱔ', 'ⱕ': 'ⱕ', 'ⱖ': 'ⱖ', 'ⱗ': 'ⱗ', 'ⱘ': 'ⱘ', 'ⱙ': 'ⱙ', 'ⱚ': 'ⱚ', 'ⱛ': 'ⱛ', 'ⱜ': 'ⱜ', 'ⱝ': 'ⱝ', 'ⱞ': 'ⱞ', 'ⱡ': 'ⱡ', 'ⱥ': 'ⱥ', 'ⱦ': 'ⱦ', 'ⱨ': 'ⱨ', 'ⱪ': 'ⱪ', 'ⱬ': 'ⱬ', 'ⱳ': 'ⱳ', 'ⱶ': 'ⱶ', 'ⲁ': 'ⲁ', 'ⲃ': 'ⲃ', 'ⲅ': 'ⲅ', 'ⲇ': 'ⲇ', 'ⲉ': 'ⲉ', 'ⲋ': 'ⲋ', 'ⲍ': 'ⲍ', 'ⲏ': 'ⲏ', 'ⲑ': 'ⲑ', 'ⲓ': 'ⲓ', 'ⲕ': 'ⲕ', 'ⲗ': 'ⲗ', 'ⲙ': 'ⲙ', 'ⲛ': 'ⲛ', 'ⲝ': 'ⲝ', 'ⲟ': 'ⲟ', 'ⲡ': 'ⲡ', 'ⲣ': 'ⲣ', 'ⲥ': 'ⲥ', 'ⲧ': 'ⲧ', 'ⲩ': 'ⲩ', 'ⲫ': 'ⲫ', 'ⲭ': 'ⲭ', 'ⲯ': 'ⲯ', 'ⲱ': 'ⲱ', 'ⲳ': 'ⲳ', 'ⲵ': 'ⲵ', 'ⲷ': 'ⲷ', 'ⲹ': 'ⲹ', 'ⲻ': 'ⲻ', 'ⲽ': 'ⲽ', 'ⲿ': 'ⲿ', 'ⳁ': 'ⳁ', 'ⳃ': 'ⳃ', 'ⳅ': 'ⳅ', 'ⳇ': 'ⳇ', 'ⳉ': 'ⳉ', 'ⳋ': 'ⳋ', 'ⳍ': 'ⳍ', 'ⳏ': 'ⳏ', 'ⳑ': 'ⳑ', 'ⳓ': 'ⳓ', 'ⳕ': 'ⳕ', 'ⳗ': 'ⳗ', 'ⳙ': 'ⳙ', 'ⳛ': 'ⳛ', 'ⳝ': 'ⳝ', 'ⳟ': 'ⳟ', 'ⳡ': 'ⳡ', 'ⳣ': 'ⳣ', 'ⳬ': 'ⳬ', 'ⳮ': 'ⳮ', 'ⳳ': 'ⳳ', 'ⴀ': 'ⴀ', 'ⴁ': 'ⴁ', 'ⴂ': 'ⴂ', 'ⴃ': 'ⴃ', 'ⴄ': 'ⴄ', 'ⴅ': 'ⴅ', 'ⴆ': 'ⴆ', 'ⴇ': 'ⴇ', 'ⴈ': 'ⴈ', 'ⴉ': 'ⴉ', 'ⴊ': 'ⴊ', 'ⴋ': 'ⴋ', 'ⴌ': 'ⴌ', 'ⴍ': 'ⴍ', 
'ⴎ': 'ⴎ', 'ⴏ': 'ⴏ', 'ⴐ': 'ⴐ', 'ⴑ': 'ⴑ', 'ⴒ': 'ⴒ', 'ⴓ': 'ⴓ', 'ⴔ': 'ⴔ', 'ⴕ': 'ⴕ', 'ⴖ': 'ⴖ', 'ⴗ': 'ⴗ', 'ⴘ': 'ⴘ', 'ⴙ': 'ⴙ', 'ⴚ': 'ⴚ', 'ⴛ': 'ⴛ', 'ⴜ': 'ⴜ', 'ⴝ': 'ⴝ', 'ⴞ': 'ⴞ', 'ⴟ': 'ⴟ', 'ⴠ': 'ⴠ', 'ⴡ': 'ⴡ', 'ⴢ': 'ⴢ', 'ⴣ': 'ⴣ', 'ⴤ': 'ⴤ', 'ⴥ': 'ⴥ', 'ⴧ': 'ⴧ', 'ⴭ': 'ⴭ', 'ꙁ': 'ꙁ', 'ꙃ': 'ꙃ', 'ꙅ': 'ꙅ', 'ꙇ': 'ꙇ', 'ꙉ': 'ꙉ', 'ꙋ': 'ꙋ', 'ꙍ': 'ꙍ', 'ꙏ': 'ꙏ', 'ꙑ': 'ꙑ', 'ꙓ': 'ꙓ', 'ꙕ': 'ꙕ', 'ꙗ': 'ꙗ', 'ꙙ': 'ꙙ', 'ꙛ': 'ꙛ', 'ꙝ': 'ꙝ', 'ꙟ': 'ꙟ', 'ꙡ': 'ꙡ', 'ꙣ': 'ꙣ', 'ꙥ': 'ꙥ', 'ꙧ': 'ꙧ', 'ꙩ': 'ꙩ', 'ꙫ': 'ꙫ', 'ꙭ': 'ꙭ', 'ꚁ': 'ꚁ', 'ꚃ': 'ꚃ', 'ꚅ': 'ꚅ', 'ꚇ': 'ꚇ', 'ꚉ': 'ꚉ', 'ꚋ': 'ꚋ', 'ꚍ': 'ꚍ', 'ꚏ': 'ꚏ', 'ꚑ': 'ꚑ', 'ꚓ': 'ꚓ', 'ꚕ': 'ꚕ', 'ꚗ': 'ꚗ', 'ꚙ': 'ꚙ', 'ꚛ': 'ꚛ', 'ꜣ': 'ꜣ', 'ꜥ': 'ꜥ', 'ꜧ': 'ꜧ', 'ꜩ': 'ꜩ', 'ꜫ': 'ꜫ', 'ꜭ': 'ꜭ', 'ꜯ': 'ꜯ', 'ꜳ': 'ꜳ', 'ꜵ': 'ꜵ', 'ꜷ': 'ꜷ', 'ꜹ': 'ꜹ', 'ꜻ': 'ꜻ', 'ꜽ': 'ꜽ', 'ꜿ': 'ꜿ', 'ꝁ': 'ꝁ', 'ꝃ': 'ꝃ', 'ꝅ': 'ꝅ', 'ꝇ': 'ꝇ', 'ꝉ': 'ꝉ', 'ꝋ': 'ꝋ', 'ꝍ': 'ꝍ', 'ꝏ': 'ꝏ', 'ꝑ': 'ꝑ', 'ꝓ': 'ꝓ', 'ꝕ': 'ꝕ', 'ꝗ': 'ꝗ', 'ꝙ': 'ꝙ', 'ꝛ': 'ꝛ', 'ꝝ': 'ꝝ', 'ꝟ': 'ꝟ', 'ꝡ': 'ꝡ', 'ꝣ': 'ꝣ', 'ꝥ': 'ꝥ', 'ꝧ': 'ꝧ', 'ꝩ': 'ꝩ', 'ꝫ': 'ꝫ', 'ꝭ': 'ꝭ', 'ꝯ': 'ꝯ', 'ꝺ': 'ꝺ', 'ꝼ': 'ꝼ', 'ꝿ': 'ꝿ', 'ꞁ': 'ꞁ', 'ꞃ': 'ꞃ', 'ꞅ': 'ꞅ', 'ꞇ': 'ꞇ', 'ꞌ': 'ꞌ', 'ꞑ': 'ꞑ', 'ꞓ': 'ꞓ', 'ꞗ': 'ꞗ', 'ꞙ': 'ꞙ', 'ꞛ': 'ꞛ', 'ꞝ': 'ꞝ', 'ꞟ': 'ꞟ', 'ꞡ': 'ꞡ', 'ꞣ': 'ꞣ', 'ꞥ': 'ꞥ', 'ꞧ': 'ꞧ', 'ꞩ': 'ꞩ', 'ꞵ': 'ꞵ', 'ꞷ': 'ꞷ', 'ꞹ': 'ꞹ', 'ꭓ': 'ꭓ', 'ꭰ': 'ꭰ', 'ꭱ': 'ꭱ', 'ꭲ': 'ꭲ', 'ꭳ': 'ꭳ', 'ꭴ': 'ꭴ', 'ꭵ': 'ꭵ', 'ꭶ': 'ꭶ', 'ꭷ': 'ꭷ', 'ꭸ': 'ꭸ', 'ꭹ': 'ꭹ', 'ꭺ': 'ꭺ', 'ꭻ': 'ꭻ', 'ꭼ': 'ꭼ', 'ꭽ': 'ꭽ', 'ꭾ': 'ꭾ', 'ꭿ': 'ꭿ', 'ꮀ': 'ꮀ', 'ꮁ': 'ꮁ', 'ꮂ': 'ꮂ', 'ꮃ': 'ꮃ', 'ꮄ': 'ꮄ', 'ꮅ': 'ꮅ', 'ꮆ': 'ꮆ', 'ꮇ': 'ꮇ', 'ꮈ': 'ꮈ', 'ꮉ': 'ꮉ', 'ꮊ': 'ꮊ', 'ꮋ': 'ꮋ', 'ꮌ': 'ꮌ', 'ꮍ': 'ꮍ', 'ꮎ': 'ꮎ', 'ꮏ': 'ꮏ', 'ꮐ': 'ꮐ', 'ꮑ': 'ꮑ', 'ꮒ': 'ꮒ', 'ꮓ': 'ꮓ', 'ꮔ': 'ꮔ', 'ꮕ': 'ꮕ', 'ꮖ': 'ꮖ', 'ꮗ': 'ꮗ', 'ꮘ': 'ꮘ', 'ꮙ': 'ꮙ', 'ꮚ': 'ꮚ', 'ꮛ': 'ꮛ', 'ꮜ': 'ꮜ', 'ꮝ': 'ꮝ', 'ꮞ': 'ꮞ', 'ꮟ': 'ꮟ', 'ꮠ': 'ꮠ', 'ꮡ': 'ꮡ', 'ꮢ': 'ꮢ', 'ꮣ': 'ꮣ', 'ꮤ': 'ꮤ', 'ꮥ': 'ꮥ', 'ꮦ': 'ꮦ', 'ꮧ': 'ꮧ', 'ꮨ': 'ꮨ', 'ꮩ': 'ꮩ', 'ꮪ': 'ꮪ', 'ꮫ': 'ꮫ', 'ꮬ': 'ꮬ', 'ꮭ': 'ꮭ', 'ꮮ': 'ꮮ', 'ꮯ': 'ꮯ', 'ꮰ': 'ꮰ', 'ꮱ': 'ꮱ', 'ꮲ': 'ꮲ', 'ꮳ': 'ꮳ', 'ꮴ': 'ꮴ', 'ꮵ': 'ꮵ', 'ꮶ': 'ꮶ', 'ꮷ': 'ꮷ', 'ꮸ': 'ꮸ', 'ꮹ': 'ꮹ', 'ꮺ': 'ꮺ', 
'ꮻ': 'ꮻ', 'ꮼ': 'ꮼ', 'ꮽ': 'ꮽ', 'ꮾ': 'ꮾ', 'ꮿ': 'ꮿ', 'ff': 'ff', 'fi': 'fi', 'fl': 'fl', 'ffi': 'ffi', 'ffl': 'ffl', 'ſt': 'ſt', 'st': 'st', 'ﬓ': 'ﬓ', 'ﬔ': 'ﬔ', 'ﬕ': 'ﬕ', 'ﬖ': 'ﬖ', 'ﬗ': 'ﬗ', '𐑎': '𐑎', '𐑏': '𐑏', '𐓘': '𐓘', '𐓙': '𐓙', '𐓚': '𐓚', '𐓛': '𐓛', '𐓜': '𐓜', '𐓝': '𐓝', '𐓞': '𐓞', '𐓟': '𐓟', '𐓠': '𐓠', '𐓡': '𐓡', '𐓢': '𐓢', '𐓣': '𐓣', '𐓤': '𐓤', '𐓥': '𐓥', '𐓦': '𐓦', '𐓧': '𐓧', '𐓨': '𐓨', '𐓩': '𐓩', '𐓪': '𐓪', '𐓫': '𐓫', '𐓬': '𐓬', '𐓭': '𐓭', '𐓮': '𐓮', '𐓯': '𐓯', '𐓰': '𐓰', '𐓱': '𐓱', '𐓲': '𐓲', '𐓳': '𐓳', '𐓴': '𐓴', '𐓵': '𐓵', '𐓶': '𐓶', '𐓷': '𐓷', '𐓸': '𐓸', '𐓹': '𐓹', '𐓺': '𐓺', '𐓻': '𐓻', '𐳀': '𐳀', '𐳁': '𐳁', '𐳂': '𐳂', '𐳃': '𐳃', '𐳄': '𐳄', '𐳅': '𐳅', '𐳆': '𐳆', '𐳇': '𐳇', '𐳈': '𐳈', '𐳉': '𐳉', '𐳊': '𐳊', '𐳋': '𐳋', '𐳌': '𐳌', '𐳍': '𐳍', '𐳎': '𐳎', '𐳏': '𐳏', '𐳐': '𐳐', '𐳑': '𐳑', '𐳒': '𐳒', '𐳓': '𐳓', '𐳔': '𐳔', '𐳕': '𐳕', '𐳖': '𐳖', '𐳗': '𐳗', '𐳘': '𐳘', '𐳙': '𐳙', '𐳚': '𐳚', '𐳛': '𐳛', '𐳜': '𐳜', '𐳝': '𐳝', '𐳞': '𐳞', '𐳟': '𐳟', '𐳠': '𐳠', '𐳡': '𐳡', '𐳢': '𐳢', '𐳣': '𐳣', '𐳤': '𐳤', '𐳥': '𐳥', '𐳦': '𐳦', '𐳧': '𐳧', '𐳨': '𐳨', '𐳩': '𐳩', '𐳪': '𐳪', '𐳫': '𐳫', '𐳬': '𐳬', '𐳭': '𐳭', '𐳮': '𐳮', '𐳯': '𐳯', '𐳰': '𐳰', '𐳱': '𐳱', '𐳲': '𐳲', '𑣀': '𑣀', '𑣁': '𑣁', '𑣂': '𑣂', '𑣃': '𑣃', '𑣄': '𑣄', '𑣅': '𑣅', '𑣆': '𑣆', '𑣇': '𑣇', '𑣈': '𑣈', '𑣉': '𑣉', '𑣊': '𑣊', '𑣋': '𑣋', '𑣌': '𑣌', '𑣍': '𑣍', '𑣎': '𑣎', '𑣏': '𑣏', '𑣐': '𑣐', '𑣑': '𑣑', '𑣒': '𑣒', '𑣓': '𑣓', '𑣔': '𑣔', '𑣕': '𑣕', '𑣖': '𑣖', '𑣗': '𑣗', '𑣘': '𑣘', '𑣙': '𑣙', '𑣚': '𑣚', '𑣛': '𑣛', '𑣜': '𑣜', '𑣝': '𑣝', '𑣞': '𑣞', '𑣟': '𑣟', '𖹠': '𖹠', '𖹡': '𖹡', '𖹢': '𖹢', '𖹣': '𖹣', '𖹤': '𖹤', '𖹥': '𖹥', '𖹦': '𖹦', '𖹧': '𖹧', '𖹨': '𖹨', '𖹩': '𖹩', '𖹪': '𖹪', '𖹫': '𖹫', '𖹬': '𖹬', '𖹭': '𖹭', '𖹮': '𖹮', '𖹯': '𖹯', '𖹰': '𖹰', '𖹱': '𖹱', '𖹲': '𖹲', '𖹳': '𖹳', '𖹴': '𖹴', '𖹵': '𖹵', '𖹶': '𖹶', '𖹷': '𖹷', '𖹸': '𖹸', '𖹹': '𖹹', '𖹺': '𖹺', '𖹻': '𖹻', '𖹼': '𖹼', '𖹽': '𖹽', '𖹾': '𖹾', '𖹿': '𖹿', '𞤢': '𞤢', '𞤣': '𞤣', '𞤤': '𞤤', '𞤥': '𞤥', '𞤦': '𞤦', '𞤧': '𞤧', '𞤨': '𞤨', '𞤩': '𞤩', '𞤪': '𞤪', '𞤫': '𞤫', '𞤬': '𞤬', '𞤭': '𞤭', '𞤮': '𞤮', '𞤯': '𞤯', '𞤰': '𞤰', '𞤱': '𞤱', '𞤲': '𞤲', '𞤳': '𞤳', '𞤴': '𞤴', '𞤵': '𞤵', '𞤶': '𞤶', '𞤷': '𞤷', '𞤸': '𞤸', '𞤹': '𞤹', '𞤺': '𞤺', '𞤻': '𞤻', '𞤼': '𞤼', '𞤽': '𞤽', 
'𞤾': '𞤾', '𞤿': '𞤿', '𞥀': '𞥀', '𞥁': '𞥁', '𞥂': '𞥂', '𞥃': '𞥃'}

Unfortunately, the dict may differ from site to site and even between namespaces of a site. More importantly, it also depends on the Python version. For example, when I tested on 3.6.3, the dict had 723 entries. (I think this is because of unicodedata.unidata_version, which is 11.0.0 on 3.7 and 9.0.0 on 3.6.)

On Python 2.7.6 (unidata_version == 5.2.0) the dict has 337 entries, and it is still a subset of the above dict.
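
The version dependence can be checked locally without querying any wiki. The sketch below is not the P7450 script. It only reports the interpreter's Unicode database version and collects the characters that str.upper() changes at all, so its count is not the 802/723/337 figure above, which also requires a comparison against MediaWiki. The helper name is made up for illustration.

```python
import sys
import unicodedata


def chars_changed_by_upper():
    """Return every single character c for which c.upper() != c.

    Hypothetical helper for illustration; it scans all code points.
    """
    changed = []
    for code_point in range(sys.maxunicode + 1):
        char = chr(code_point)
        if char.upper() != char:
            changed.append(char)
    return changed


changed = chars_changed_by_upper()
print('Unicode database version:', unicodedata.unidata_version)
print('characters changed by str.upper():', len(changed))
```

Running this under different Python versions should show the count moving together with unidata_version, which would be consistent with the explanation above.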

Here is the script that I wrote to obtain the above dictionary and to run some tests on different Wikimedia sites: P7450

It's quite possible that I've missed some details, but I believe adding the above dict to first_upper would significantly reduce the probability of errors. We can fix other issues as they get reported in the future.

I'll soon send a patch for review.
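
As a rough sketch of the idea (not the patch itself), the exceptions could be consulted before falling back to str.upper(). The names _FIRST_UPPER_EXCEPTIONS and first_upper_with_exceptions are made up for illustration, and only two entries from the dict above are included here.

```python
# Hypothetical excerpt of the exceptions mapping; the full dict is listed above.
_FIRST_UPPER_EXCEPTIONS = {
    '\u00df': '\u00df',  # ß stays as it is
    '\ua72b': '\ua72b',  # ꜫ stays as it is
}


def first_upper_with_exceptions(string):
    """Capitalize the first character unless it is a known exception."""
    first = string[:1]
    # Use the recorded MediaWiki form if there is one, else Python's upper().
    first = _FIRST_UPPER_EXCEPTIONS.get(first, first.upper())
    return first + string[1:]


print(first_upper_with_exceptions('madrid-Barajas'))
print(first_upper_with_exceptions('\ua72b') == '\ua72b')
```

A lookup table keeps the common path unchanged and only overrides the characters where Python and MediaWiki are known to disagree.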

Change 452187 had a related patch set uploaded (by Dalba; owner: dalba):
[pywikibot/core@master] pywikibot.tools: Add a set of characters as exceptions for first_upper

https://gerrit.wikimedia.org/r/452187

Change 452187 merged by jenkins-bot:
[pywikibot/core@master] pywikibot.tools: Add exceptions for first_upper

https://gerrit.wikimedia.org/r/452187

Got a similar problem here:

C:\pwb\GIT\core>py -3 pwb.py redirect -lang:tg do -simulate
Retrieving double redirect special page...
Retrieving 24 pages from wikipedia:tg.

>>> Лоиҳа:Тоҷикистон <<<

1 pages read
0 pages written
Execution time: 0 seconds
Read operation time: 0 seconds
Script terminated by exception:

ERROR: RuntimeError: getredirtarget: No 'redirects' found for page Лоиҳа:Тоҷикистон.
Traceback (most recent call last):
  File "pwb.py", line 253, in <module>
    if not main():
  File "pwb.py", line 246, in main
    run_python_file(filename, [filename] + args, argvu, file_package)
  File "pwb.py", line 115, in run_python_file
    main_mod.__dict__)
  File ".\scripts\redirect.py", line 806, in <module>
    main()
  File ".\scripts\redirect.py", line 802, in main
    bot.run()
  File "C:\pwb\GIT\core\pywikibot\bot.py", line 1505, in run
    self.treat(page)
  File "C:\pwb\GIT\core\pywikibot\bot.py", line 1737, in treat
    self.treat_page()
  File ".\scripts\redirect.py", line 715, in treat_page
    self.action_treat(self.current_page)
  File ".\scripts\redirect.py", line 585, in fix_1_double_redirect
    targetPage = newRedir.getRedirectTarget()
  File "C:\pwb\GIT\core\pywikibot\page.py", line 1668, in getRedirectTarget
    return self.site.getredirtarget(self)
  File "C:\pwb\GIT\core\pywikibot\site.py", line 3208, in getredirtarget
    .format(title))
RuntimeError: getredirtarget: No 'redirects' found for page Лоиҳа:Тоҷикистон.
<class 'RuntimeError'>
CRITICAL: Closing network session.

C:\pwb\GIT\core>

Change 457494 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [bugfix] Handle getRedirectTarget() exception smoothly

https://gerrit.wikimedia.org/r/457494

That sounds like a different issue (not a pywikibot normalization). https://tg.wikipedia.org/wiki/%D0%92%D0%B8%D0%B6%D0%B0:DoubleRedirects has the following entry:

Лоиҳа:Тоҷикистон (edit) →‎ Википедиа:Лоиҳа:Тоҷикистон →‎ Лоиҳа:Тоҷикистон

Naturally, pywikibot is looking for the redirect target of Лоиҳа:Тоҷикистон (which is supposed to be Википедиа:Лоиҳа:Тоҷикистон), but Лоиҳа:Тоҷикистон is not a redirect in the first place: https://tg.wikipedia.org/wiki/%D0%9B%D0%BE%D0%B8%D2%B3%D0%B0:%D0%A2%D0%BE%D2%B7%D0%B8%D0%BA%D0%B8%D1%81%D1%82%D0%BE%D0%BD

That sounds like a different issue (not a pywikibot normalization).

Yes. That issue is T130911.

Change 457494 abandoned by Xqt:
[bugfix] Handle getRedirectTarget() exception smoothly

Reason:
see T130911
https://gerrit.wikimedia.org/r/#/c/pywikibot/core/+/279589/

https://gerrit.wikimedia.org/r/457494

Change 950801 had a related patch set uploaded (by Xqt; author: Xqt):

[pywikibot/core@master] [IMPR] Add unidata.py script to mainenance scripts

https://gerrit.wikimedia.org/r/950801

Change 950801 merged by jenkins-bot:

[pywikibot/core@master] [IMPR] Add unidata.py script to mainenance scripts

https://gerrit.wikimedia.org/r/950801