
cosmetic_changes.py: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd9 in position 6: invalid continuation byte
Closed, Resolved | Public | BUG REPORT

Description

Command line:

python pwb.py cosmetic_changes -page:"نقاش:السلفية/أرشيف 1" -lang:ar

Output:

Retrieving 1 pages from wikipedia:ar.

>>> نقاش:السلفية/أرشيف 1 <<<

1 read operation
Execution time: 1 seconds
Read operation time: 1.0 seconds
Script terminated by exception:

ERROR: 'utf-8' codec can't decode byte 0xd9 in position 6: invalid continuation byte (UnicodeDecodeError)
Traceback (most recent call last):
  File "C:\Users\Mohammed\Downloads\core\pwb.py", line 39, in <module>
    sys.exit(main())
             ^^^^^^
  File "C:\Users\Mohammed\Downloads\core\pwb.py", line 35, in main
    runpy.run_path(str(path), run_name='__main__')
  File "<frozen runpy>", line 291, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\Mohammed\Downloads\core\pywikibot\scripts\wrapper.py", line 513, in <module>
    main()
  File "C:\Users\Mohammed\Downloads\core\pywikibot\scripts\wrapper.py", line 497, in main
    if not execute():
           ^^^^^^^^^
  File "C:\Users\Mohammed\Downloads\core\pywikibot\scripts\wrapper.py", line 484, in execute
    run_python_file(filename, script_args, module)
  File "C:\Users\Mohammed\Downloads\core\pywikibot\scripts\wrapper.py", line 147, in run_python_file
    exec(compile(source, filename, 'exec', dont_inherit=True),
  File "C:\Users\Mohammed\Downloads\core\scripts\cosmetic_changes.py", line 131, in <module>
    main()
  File "C:\Users\Mohammed\Downloads\core\scripts\cosmetic_changes.py", line 127, in main
    bot.run()
  File "C:\Users\Mohammed\Downloads\core\pywikibot\bot.py", line 1671, in run
    self.treat(page)
  File "C:\Users\Mohammed\Downloads\core\pywikibot\bot.py", line 1924, in treat
    self.treat_page()
  File "C:\Users\Mohammed\Downloads\core\scripts\cosmetic_changes.py", line 84, in treat_page
    new_text = cc_toolkit.change(old_text)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Mohammed\Downloads\core\pywikibot\cosmetic_changes.py", line 302, in change
    new_text = self._change(text)
               ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Mohammed\Downloads\core\pywikibot\cosmetic_changes.py", line 296, in _change
    text = self.safe_execute(method, text)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Mohammed\Downloads\core\pywikibot\cosmetic_changes.py", line 283, in safe_execute
    result = method(text)
             ^^^^^^^^^^^^
  File "C:\Users\Mohammed\Downloads\core\pywikibot\cosmetic_changes.py", line 645, in cleanUpLinks
    text = textlib.replaceExcept(text, linkR, handleOneLink,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Mohammed\Downloads\core\pywikibot\textlib.py", line 452, in replaceExcept
    replacement = new(match)
                  ^^^^^^^^^^
  File "C:\Users\Mohammed\Downloads\core\pywikibot\cosmetic_changes.py", line 527, in handleOneLink
    is_interwiki = self.site.isInterwikiLink(titleWithSection)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Mohammed\Downloads\core\pywikibot\site\_basesite.py", line 336, in isInterwikiLink
    linkfam, linkcode = pywikibot.Link(text, self).parse_site()
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Mohammed\Downloads\core\pywikibot\page\_links.py", line 300, in __init__
    self._text = pywikibot.tools.chars.url2string(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Mohammed\Downloads\core\pywikibot\tools\chars.py", line 136, in url2string
    raise first_exception
  File "C:\Users\Mohammed\Downloads\core\pywikibot\tools\chars.py", line 128, in url2string
    result = t.decode(enc)
             ^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd9 in position 6: invalid continuation byte
CRITICAL: Exiting due to uncaught exception UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd9 in position 6: invalid continuation byte

What should have happened instead?:

When it encounters such an error, the bot should skip the page and continue with the remaining pages instead of crashing, which forces me to restart the bot run.
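
The requested behaviour can be sketched roughly as follows. This is only an illustration of "skip and continue", not Pywikibot code: `run_with_skips` and `fake_process` are hypothetical names, and the byte strings stand in for pages.

```python
def run_with_skips(items, process):
    """Apply process() to every item; record items that fail to decode.

    Hypothetical helper, not part of Pywikibot.
    """
    done, skipped = [], []
    for item in items:
        try:
            process(item)
        except UnicodeDecodeError as exc:
            # Remember the failure and move on instead of aborting the run.
            skipped.append((item, str(exc)))
            continue
        done.append(item)
    return done, skipped


def fake_process(raw):
    """Stand-in for per-page work: fails on invalid UTF-8 input."""
    raw.decode('utf-8')


pages = [b'ok', b'\xd9%8', b'fine']  # the middle entry is not valid UTF-8
done, skipped = run_with_skips(pages, fake_process)
print('processed:', done)
print('skipped:', [item for item, _ in skipped])
```

With a loop like this, one bad page would end up in a "skipped" report and the remaining pages would still be handled.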

Note:

This task is similar to T304288 (reflinks.py: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 18: invalid continuation byte), which was fixed in rPWBC4298a6cd362f82fdb3af2794cafa9aea13ad7859, but it concerns the script cosmetic_changes.py instead of reflinks.py.

Software version:

Pywikibot: [https] r-pywikibot-core (6ef2645, g17994, 2023/07/20, 13:19:10, master)
Release version: 8.3.0.dev0
setuptools version: 68.0.0
mwparserfromhell version: 0.6.4
wikitextparser version: n/a
requests version: 2.31.0
    certificate test: ok
Python: 3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)]

Event Timeline

@Meno25: could the first part be 'ﻅ٣ﻋ▒ﻋ·ﻋ٪8ﻅﺏ'? I get it with IBM864 encoding.

It seems this bad title link causes the exception:

https://ar.wikipedia.org/wiki/%D8%B3%D9%84%D9%81%D9%8%D8%A9#.D8.A7.D9.84.D8.AD.D9.83.D9.85_.D8.A8.D8.A7.D9.84.D8.B4.D8.B1.D9.8A.D8.B9.D8.A9_.D8.A7.D9.84.D8.A5.D8.B3.D9.84.D8.A7.D9.85.D9.8A.D8.A9

Unfortunately, the link is cropped by the fixSyntaxSave() method first, before cleanUpLinks fails.

Good catch. Yes, this link caused the problem. It contained

%D9%8

which should have been

%D9%8A

(Encoding of the Arabic character "ي")
I fixed it in this edit and after this the bot worked as expected.
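
For reference, the error can be reproduced outside the bot with the standard library alone. This is a minimal sketch, assuming the title part of the link above is unquoted to bytes and then decoded as UTF-8; `try_decode` is a hypothetical helper, not Pywikibot's `url2string`.

```python
from urllib.parse import unquote_to_bytes

# Title part of the broken link (note the truncated escape "%8").
BAD_TITLE = '%D8%B3%D9%84%D9%81%D9%8%D8%A9'
# Same title with the escape repaired to "%8A".
GOOD_TITLE = '%D8%B3%D9%84%D9%81%D9%8A%D8%A9'


def try_decode(quoted):
    """Return (text, error) for a percent-encoded string decoded as UTF-8."""
    # Invalid escapes such as "%8" are left as literal bytes.
    raw = unquote_to_bytes(quoted)
    try:
        return raw.decode('utf-8'), None
    except UnicodeDecodeError as exc:
        return None, exc


bad_text, bad_error = try_decode(BAD_TITLE)
good_text, good_error = try_decode(GOOD_TITLE)

print(bad_error)   # the decode error for the broken title
print(good_text)   # the decoded title once the escape is repaired
```

Because "%8" is not a complete escape, it stays as the two literal bytes "%" and "8". The lead byte 0xd9 at position 6 is then followed by "%" instead of a continuation byte, which matches the error message in the traceback.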

Change 942603 had a related patch set uploaded (by Xqt; author: Xqt):

[pywikibot/core@master] [IMPR] Convert URL-encoded characters also for links outside main namespace

https://gerrit.wikimedia.org/r/942603

Change 942603 merged by jenkins-bot:

[pywikibot/core@master] [IMPR] Convert URL-encoded characters also for links outside main namespace

https://gerrit.wikimedia.org/r/942603

Xqt claimed this task.

Change 942753 had a related patch set uploaded (by Xqt; author: Xqt):

[pywikibot/core@master] [bugfix] fix CosmeticChangesToolkit.cleanUpLinks

https://gerrit.wikimedia.org/r/942753

Change 942753 merged by Xqt:

[pywikibot/core@master] [bugfix] fix CosmeticChangesToolkit.cleanUpLinks

https://gerrit.wikimedia.org/r/942753