reflinks.py: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 18: invalid continuation byte
Closed, Resolved | Public | Bug Report

Description

List of steps to reproduce (step by step, including full links if applicable):

  • venv/bin/python3 $HOME/pywikibot-core/pwb.py reflinks -start:Мороз -v -debug

I wasn't able to identify the exact page where the script crashes, as the terminal shows no such information even with the -debug option. Maybe @Xqt can suggest how to localise the issue if needed.

What happens?:

Retrieving 50 pages from wikipedia:ru.
Dropped throttle(s).

4906 pages read
0 pages written
0 pages skipped
Execution time: 279 seconds
Read operation time: 0.1 seconds
Script terminated by exception:

ERROR: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 18: invalid continuation byte
Traceback (most recent call last):
  File "/data/project/rubin16/pywikibot-core/pwb.py", line 496, in <module>
    main()
  File "/data/project/rubin16/pywikibot-core/pwb.py", line 480, in main
    if not execute():
  File "/data/project/rubin16/pywikibot-core/pwb.py", line 463, in execute
    run_python_file(filename, script_args, module)
  File "/data/project/rubin16/pywikibot-core/pwb.py", line 144, in run_python_file
    main_mod.__dict__)
  File "/data/project/rubin16/pywikibot-core/scripts/reflinks.py", line 803, in <module>
    main()
  File "/data/project/rubin16/pywikibot-core/scripts/reflinks.py", line 799, in main
    bot.run()
  File "/mnt/nfs/labstore-secondary-tools-project/rubin16/pywikibot-core/pywikibot/bot.py", line 1571, in run
    self.treat(page)
  File "/data/project/rubin16/pywikibot-core/scripts/reflinks.py", line 663, in treat
    tag = meta_content.group().decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 18: invalid continuation byte
Dropped throttle(s).
Closing network session.
CRITICAL: Exiting due to uncaught exception <class 'UnicodeDecodeError'>
Network session closed.

What should have happened instead?:
The script should not crash.

Event Timeline

Xqt triaged this task as Medium priority.

The related page is [[ru:Мосты Воронежа]]

Could you please explain, for future cases, how you were able to find it?

I just added a print statement for debugging purposes:

old code

tag = meta_content.group().decode()

new code

try:
    tag = meta_content.group().decode()
except UnicodeDecodeError:
    print(page)
    raise
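
For illustration only, here is a minimal sketch of one way to keep such a decode from crashing. It is not the patch from Change 772400, and reflinks.py may solve this differently. The helper name decode_meta_tag, the windows-1251 fallback, and the sample bytes are all assumptions made up for this example. The error arises because 0xc2 starts a two-byte sequence in UTF-8 and must be followed by a continuation byte, while in a single-byte Cyrillic encoding such as windows-1251 it is an ordinary letter.

```python
def decode_meta_tag(raw: bytes, codecs=("utf-8", "windows-1251")) -> str:
    """Decode raw bytes, trying each codec in turn (hypothetical helper)."""
    for codec in codecs:
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            continue
    # Last resort: never raise; undecodable bytes become U+FFFD.
    return raw.decode("utf-8", errors="replace")


# Made-up sample: a meta tag whose Cyrillic content is windows-1251 encoded.
sample = '<meta charset="windows-1251" content="Воронеж">'.encode("windows-1251")

try:
    sample.decode()  # strict UTF-8, as in the failing line of the traceback
except UnicodeDecodeError as exc:
    print("strict decode failed:", exc.reason)

print(decode_meta_tag(sample))
```

Trying a second codec recovers readable text when the guess is right, and the errors="replace" fallback guarantees a string even when no listed codec fits.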

Change 772400 had a related patch set uploaded (by Xqt; author: Xqt):

[pywikibot/core@master] [bugfix] Solve UnicodeDecodeError in reflinks.py

https://gerrit.wikimedia.org/r/772400

Change 772400 merged by jenkins-bot:

[pywikibot/core@master] [bugfix] Solve UnicodeDecodeError in reflinks.py

https://gerrit.wikimedia.org/r/772400