@APerson wrote: I'm getting a strange InvalidTitle error while iterating through each of the articles in the English Wikipedia's "Unprintworthy redirects" category using the articles() function.
In particular, if you run this code:
import pywikibot
site = pywikibot.Site("en", "wikipedia"); site.login()
cat = pywikibot.Category(site, "Category:Unprintworthy redirects")
for each_article in cat.articles(namespaces=(0)):
    print(each_article.title(withNamespace=True), each_article.pageid)

Then it'll run for a while, printing out a bunch of titles and page IDs, and then crash:

Traceback (most recent call last):
  File "/data/project/apersonbot/test-redir-bann.py", line 5, in <module>
    print(each_article.title(withNamespace=True), each_article.pageid)
  File "/shared/pywikipedia/core/pywikibot/tools/__init__.py", line 1446, in wrapper
    return obj(*__args, **__kw)
  File "/shared/pywikipedia/core/pywikibot/page.py", line 322, in title
    title = self._link.canonical_title()
  File "/shared/pywikipedia/core/pywikibot/page.py", line 5737, in canonical_title
    if self.namespace != Namespace.MAIN:
  File "/shared/pywikipedia/core/pywikibot/page.py", line 5698, in namespace
    self.parse()
  File "/shared/pywikipedia/core/pywikibot/page.py", line 5669, in parse
    raise pywikibot.InvalidTitle("The link does not contain a page "
pywikibot.exceptions.InvalidTitle: The link does not contain a page title
CRITICAL: Closing network session.

Any ideas? I don't think this is expected behavior, but I could be wrong.
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
[bugfix] Do not strip all whitespaces from title | pywikibot/core | master | +3 -3
Event Timeline
Comment Actions
Changing the loop to the one below shows that the first problematic page ID is 28644448, whose title is the single character \x85.

>>> for each_article in cat.articles(namespaces=(0)):
...     try:
...         print(each_article.title(withNamespace=True), each_article.pageid)
...     except pywikibot.exceptions.InvalidTitle:
...         print(each_article.pageid)
...         raise
...

str.strip() removes this character, resulting in an empty string, so the exception is raised. (page.py#L5666-L5670)
Since \x85 is a valid MediaWiki page title, pywikibot should also accept it as valid.
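A minimal standalone sketch of the behaviour described above, using only the Python standard library and no pywikibot calls. The variable names are illustrative and not taken from page.py; the sketch only shows why a bare strip() leaves nothing behind for this title.

```python
import unicodedata

# U+0085 is NEXT LINE (NEL), a C1 control character.
title = "\x85"

# Python's str methods classify this code point as whitespace,
# so strip() with no arguments removes it entirely.
is_space = title.isspace()
stripped = title.strip()

# The Unicode general category is "Cc" (control), not a space separator.
category = unicodedata.category(title)

print(is_space, repr(stripped), category)
```

Any check of the form "raise if the stripped title is empty" would therefore reject this page, even though the title itself is not empty.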
Comment Actions
Please try https://gerrit.wikimedia.org/r/#/c/pywikibot/core/+/395154/; I think I also fixed this error there.
Comment Actions
u'\x85' is a control character and it doesn't look valid. You can link to the redirect neither from the redirect target nor from a special page. The only reachable view is the edit page [1]. I am wondering why MW accepts this as a page title; very strange!
[1] https://en.wikipedia.org/w/index.php?title=%C2%85&action=edit
Comment Actions
I still get the exception. (Python 3.6.3)
>>> import pywikibot
>>> pywikibot.Page(pywikibot.Site('en', 'wikipedia'), '\x85')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "F:\Code\Wikimedia Gerrit\pywikibot\pywikibot\page.py", line 395, in __repr__
    title = repr(self.title())
  File "F:\Code\Wikimedia Gerrit\pywikibot\pywikibot\tools\__init__.py", line 1446, in wrapper
    return obj(*__args, **__kw)
  File "F:\Code\Wikimedia Gerrit\pywikibot\pywikibot\page.py", line 325, in title
    title = self._link.canonical_title()
  File "F:\Code\Wikimedia Gerrit\pywikibot\pywikibot\page.py", line 5731, in canonical_title
    if self.namespace != Namespace.MAIN:
  File "F:\Code\Wikimedia Gerrit\pywikibot\pywikibot\page.py", line 5692, in namespace
    self.parse()
  File "F:\Code\Wikimedia Gerrit\pywikibot\pywikibot\page.py", line 5634, in parse
    .format(self._text))
pywikibot.exceptions.InvalidTitle: does not contain a page title.
I linked to the title here. You can view the redirect page here.
Comment Actions
The problem is that chr(133) counts as whitespace according to unicodedata, so str.strip() removes it.

>>> '\x85'.isspace()
True
>>> '\x85'.strip()
''
How does MW handle this?
Probably we can use string.whitespace:

>>> import string
>>> '\x85'.strip(string.whitespace)
'\x85'
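The suggestion above could be sketched roughly as follows. Both strip_title and has_page_title are hypothetical helpers written for illustration; they are not pywikibot functions, and the merged patch may have solved this differently. The sketch assumes that only ASCII whitespace should be trimmed from a title.

```python
import string


def strip_title(text):
    """Hypothetical helper: trim only ASCII whitespace from a title.

    string.whitespace is ' \\t\\n\\r\\x0b\\x0c', so Unicode-only
    whitespace such as '\\x85' is left untouched.
    """
    return text.strip(string.whitespace)


def has_page_title(text):
    """Hypothetical check: is anything left after trimming?"""
    return strip_title(text) != ""


print(repr(strip_title("  Foo bar \n")))
print(has_page_title("\x85"))
print(has_page_title(" \t "))
```

One trade-off of this approach is that other non-ASCII whitespace, for example a no-break space, is no longer trimmed either, so whether that matches MediaWiki's own title normalization would still need to be checked.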
Comment Actions
Change 640929 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [bugfix] Do not strip all whitespaces from title
Comment Actions
Change 640929 merged by jenkins-bot:
[pywikibot/core@master] [bugfix] Do not strip all whitespaces from title