harvest_template.py fails on non-breaking space in wikilink
Closed, Resolved (Public)

Description

pwb.py harvest_template -template:Infobox_-_panovnice -lang:cs (...) otec P22

The problem occurs when a wikilink contains &nbsp; and the linked page does not exist:
| otec = Baron [[Jan Šembera z&nbsp;Boskovic a Černé Hory]]

>>> Anna Marie Šemberová z Boskovic a Černé Hory <<<
Adding P26 --> [[wikidata:Q506533]]
Adding P53 --> [[wikidata:Q698530]]
WARNING: API warning (wbcreateclaim) of unknown format: {u'messages': [{u'html': {u'*': u'Va\u0161e \xfaprava byla za\u010dlen\u011bna do nejnov\u011bj\u0161\xed verze.'}, u'name': u'wikibase-conflict-patched', u'parameters': []}]}
WARNING: loadpageinfo: Query on [[cs:Jan Šembera z Boskovic a Černé Hory]] returned data on 'Jan Šembera z Boskovic a Černé Hory'
ERROR: 'Page' object has no attribute '_pageid'
Traceback (most recent call last):
  File "C:\pwb\pywikibot\bot.py", line 1922, in run
    self.treat(page, item)
  File ".\scripts\harvest_template3.py", line 183, in treat
    linked_item = self._template_link_target(item, link_text)
  File ".\scripts\harvest_template3.py", line 105, in _template_link_target
    if not linked_page.exists():
  File "C:\pwb\pywikibot\page.py", line 707, in exists
    return self.site.page_exists(self)
  File "C:\pwb\pywikibot\site.py", line 2956, in page_exists
    return page._pageid > 0
AttributeError: 'Page' object has no attribute '_pageid'

Event Timeline

Change 323184 had a related patch set uploaded (by Matěj Suchánek):
Remove non-breaking spaces when tidying up a link

https://gerrit.wikimedia.org/r/323184

I tried to reproduce the issue above. From the provided logs, it looks like I found the particular page that was edited: Anna Marie Šemberová z Boskovic a Černé Hory. After the error occurred, @JAnD corrected it in this edit.

Since March 2016 there has been one change in the source code (see: 1), and we are getting something else right now:

>>> import pywikibot
>>> cswiki = pywikibot.Site('cs', 'wikipedia')
>>> correct_page = pywikibot.Page(cswiki, u'Jan Šembera z Boskovic a Černé Hory')
>>> correct_page.exists()
False
>>> correct_page._pageid
0
>>> nbsp_page = pywikibot.Page(cswiki, u'Jan Šembera z&nbsp;Boskovic a Černé Hory')
>>> nbsp_page.exists()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pywikibot/page.py", line 756, in exists
    return self.site.page_exists(self)
  File "pywikibot/site.py", line 2988, in page_exists
    return page.pageid > 0
  File "pywikibot/page.py", line 255, in pageid
    self.site.loadpageinfo(self)
  File "pywikibot/site.py", line 2914, in loadpageinfo
    self._update_page(page, query)
  File "pywikibot/site.py", line 2900, in _update_page
    raise InconsistentTitleReceived(page, pageitem['title'])
pywikibot.exceptions.InconsistentTitleReceived: Query on [[wikipedia:cs:Jan Šembera z Boskovic a Černé Hory]] returned data on 'Jan Šembera z Boskovic a Černé Hory'
>>> nbsp_page._pageid
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Page' object has no attribute '_pageid'

BTW @JAnD what's scripts\harvest_template3.py? I wasn't able to find it in our repository?

> and we are getting something else right now

Basically the same: when pywikibot receives the answer from the wiki's API, it compares the returned title with the current title via Site.sametitle(). The non-breaking space makes the method return False, which is the main problem.
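To illustrate the mismatch without a live wiki, here is a minimal sketch. The helper `naive_same_title` is hypothetical and only mimics the general idea of normalizing titles before comparing them; it is not the actual `Site.sametitle()` implementation. It also assumes, as the log above suggests, that the API returns the title with an ordinary space.

```python
import html

NBSP = "\u00a0"  # the character that "&nbsp;" decodes to


def naive_same_title(title1: str, title2: str) -> bool:
    """Hypothetical stand-in for a title comparison.

    Treats underscores as spaces and ignores first-letter case,
    but does NOT treat a non-breaking space as a regular space.
    """
    def normalize(title: str) -> str:
        title = title.replace("_", " ").strip()
        return title[:1].upper() + title[1:]

    return normalize(title1) == normalize(title2)


# Title as it appears in the template parameter, with the entity decoded.
requested = html.unescape("Jan Šembera z&nbsp;Boskovic a Černé Hory")
# Title as assumed to come back from the API (ordinary space).
returned = "Jan Šembera z Boskovic a Černé Hory"

print(NBSP in requested)                                          # nbsp survived decoding
print(naive_same_title(requested, returned))                      # titles do not match
print(naive_same_title(requested.replace(NBSP, " "), returned))   # match once nbsp is replaced
```

The two strings look identical when printed, which is why the InconsistentTitleReceived message above appears to show the same title twice.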

> BTW @JAnD what's scripts\harvest_template3.py? I wasn't able to find it in our repository?

Looks like a fork of scripts\harvest_template.py living on a wiki page or so.

matej_suchanek renamed this task from harvest_template.py fails on &nbsp; on wikilink to harvest_template.py fails on non-breaking space in wikilink. May 9 2017, 7:42 AM

Change 323184 merged by jenkins-bot:
[pywikibot/core@master] Remove non-breaking spaces when tidying up a link

https://gerrit.wikimedia.org/r/323184
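
For readers who want to see the idea behind the change title, the following is a rough sketch of stripping non-breaking spaces from a link title before building a page object. The function `tidy_link_title` is a hypothetical name used for illustration only; this is not the code from the merged patch, and the real change may handle more cases.

```python
import html
import re


def tidy_link_title(title: str) -> str:
    """Hypothetical helper: normalize whitespace in a wikilink title.

    Decodes HTML entities such as "&nbsp;", replaces non-breaking
    spaces and underscores with ordinary spaces, and collapses runs
    of spaces into one.
    """
    title = html.unescape(title)  # "&nbsp;" becomes U+00A0
    title = title.replace("\u00a0", " ").replace("_", " ")
    return re.sub(r" +", " ", title).strip()


raw = "Jan Šembera z&nbsp;Boskovic a Černé Hory"
print(tidy_link_title(raw))
```

Note that `html.unescape` decodes every entity in the string, not only `&nbsp;`; a stricter version could replace just that one entity.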