
UnicodeDecodeError in url2unicode()
Open, High, Public

Description

Historia de Cerdeña -> corresponding page is Història de Sardenya
Traceback (most recent call last):
  File "C:\pwb\core\pwb.py", line 135, in <module>
    run_python_file(fn, argv, argvu)
  File "C:\pwb\core\pwb.py", line 67, in run_python_file
    exec(compile(source, filename, "exec"), main_mod.__dict__)
  File "C:\pwb\core\scripts\featured.py", line 683, in <module>
    main()
  File "C:\pwb\core\scripts\featured.py", line 676, in main
    bot.run()
  File "C:\pwb\core\scripts\featured.py", line 285, in run
    self.run_good()
  File "C:\pwb\core\scripts\featured.py", line 320, in run_good
    self.treat(code, task)
  File "C:\pwb\core\scripts\featured.py", line 372, in treat
    self.featuredWithInterwiki(fromsite, process)
  File "C:\pwb\core\scripts\featured.py", line 612, in featuredWithInterwiki
    atrans.put(text, comment)
  File "C:\pwb\core\pywikibot\page.py", line 906, in put
    async=async, callback=callback, **kwargs)
  File "C:\pwb\core\pywikibot\page.py", line 827, in save
    **kwargs)
  File "C:\pwb\core\pywikibot\page.py", line 834, in _save
    comment = self._cosmetic_changes_hook(comment) or comment
  File "C:\pwb\core\pywikibot\page.py", line 884, in _cosmetic_changes_hook
    self.text = ccToolkit.change(old)
  File "C:\pwb\core\scripts\cosmetic_changes.py", line 174, in change
    text = self.cleanUpLinks(text)
  File "C:\pwb\core\scripts\cosmetic_changes.py", line 510, in cleanUpLinks
    'startspace'])
  File "C:\pwb\core\pywikibot\textlib.py", line 208, in replaceExcept
    replacement = new(match)
  File "C:\pwb\core\scripts\cosmetic_changes.py", line 396, in handleOneLink
    if not self.site.isInterwikiLink(titleWithSection):
  File "C:\pwb\core\pywikibot\site.py", line 381, in isInterwikiLink
    linkfam, linkcode = pywikibot.Link(text, self).parse_site()
  File "C:\pwb\core\pywikibot\page.py", line 3102, in __init__
    t = url2unicode(t, site=self._source)
  File "C:\pwb\core\pywikibot\page.py", line 3546, in url2unicode
    raise firstException
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 12: invalid start byte

Unfortunately, the title that caused the error is not printed, so the bug cannot be specified more precisely.


Version: core-(2.0)
Severity: normal

Details

Reference
bz58574

Event Timeline

bzimport raised the priority of this task from to Medium. · Nov 22 2014, 2:28 AM
bzimport set Reference to bz58574.
bzimport added a subscriber: Unknown Object (????).
Xqt created this task. · Dec 17 2013, 12:00 PM
jayvdb updated the task description. · Jul 5 2015, 10:52 PM
jayvdb set Security to None.
XZise added a subscriber: XZise. · Jul 22 2015, 6:13 PM

I'm not sure what we can do here. From the traceback alone I can't tell anything. I think we need to raise an exception with a custom message that also contains the original text.
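
One way to get the original text into the error, as suggested above, is to catch the decode failure and re-raise it with the title appended to the reason. The following is only a sketch in Python 3: decode_percent_encoded is a hypothetical helper, not pywikibot's url2unicode(), and the sample input in the usage note is invented because the actual failing title is unknown.

```python
from urllib.parse import unquote_to_bytes


def decode_percent_encoded(title, encodings=('utf-8',)):
    """Decode a percent-encoded title, naming the title on failure.

    Hypothetical helper for illustration; not part of pywikibot.
    """
    if not encodings:
        raise ValueError('at least one encoding is required')

    raw = unquote_to_bytes(title)
    first_error = None
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError as err:
            # Remember only the first failure, like the original code
            # that ends in "raise firstException".
            if first_error is None:
                first_error = err

    # Re-raise the first failure with the offending title in the reason.
    raise UnicodeDecodeError(
        first_error.encoding,
        first_error.object,
        first_error.start,
        first_error.end,
        '{} (while decoding title {!r})'.format(first_error.reason, title),
    )
```

Calling decode_percent_encoded('Foo%80bar') would then raise a UnicodeDecodeError whose message contains the title 'Foo%80bar' in addition to the byte and position, while a valid input such as 'Cerde%C3%B1a' decodes normally.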

Restricted Application added a subscriber: Aklapper. · Jul 22 2015, 6:13 PM
Xqt changed the task status from Open to Stalled. · Apr 25 2017, 12:36 PM

Is this still valid for the master branch?

Xqt raised the priority of this task from Medium to High. · May 28 2017, 11:53 AM
Xqt changed the task status from Stalled to Open. · Jun 28 2017, 8:56 AM
Restricted Application added a subscriber: TerraCodes. · Oct 29 2017, 10:38 PM
Dvorapa added a comment. · Edited · Jun 3 2018, 11:02 AM

On Python 2 I get:

$ python2 pwb.py shell
>>> pywikibot.Link('Historia de Cerdeña')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/home/pavel/pywikibot-test/pywikibot/page.py", line 5476, in __init__
    if u"|" in self._text:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 17: ordinal not in range(128)
>>> pywikibot.Link('Història de Sardenya')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/home/pavel/pywikibot-test/pywikibot/page.py", line 5476, in __init__
    if u"|" in self._text:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

If I rearrange Link.__init__ a little bit, I can reproduce exactly the same error.

This comment was removed by Dvorapa.
Dalba added a subscriber: Dalba. · Edited · Jun 3 2018, 11:22 AM

@Dvorapa, you should pass Unicode objects to Link, not byte strings. Try u'Historia de Cerdeña' instead of 'Historia de Cerdeña', or add from __future__ import unicode_literals to your script.
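
The byte-string problem can be illustrated in Python 3 terms, where the implicit ASCII decoding that Python 2 performs when a byte string is mixed with a unicode object has to be written out explicitly. This is a sketch rather than pywikibot code, and the variable names are made up for illustration.

```python
# UTF-8 bytes, which is what a Python 2 'Historia de Cerdeña' literal
# holds when the source or terminal encoding is UTF-8 (an assumption).
title_bytes = 'Historia de Cerdeña'.encode('utf-8')

# Python 2 decodes byte strings as ASCII when they are mixed with unicode
# objects (as in `u"|" in self._text`); here that step is done explicitly.
try:
    title_bytes.decode('ascii')
    failure = None
except UnicodeDecodeError as err:
    # Record the codec, the failing position and the offending byte.
    failure = (err.encoding, err.start, title_bytes[err.start])

# Decoding once with the correct codec yields text that is safe to pass on.
title_text = title_bytes.decode('utf-8')
has_pipe = '|' in title_text
```

The recorded failure is the 'ascii' codec stopping at byte 0xc3 in position 17, the same byte and position as in the first Python 2 traceback above; after an explicit UTF-8 decode the membership test runs without error.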

Dvorapa added a comment. · Edited · Jun 3 2018, 12:09 PM

I see. So this must have been solved by 1e54a7d6886d and 5795ed5b816b in 2015.

Or a non-string link title is passed to Link() by one of the calls shown in the traceback (T66958).