
InvalidTitle raised for title that is a newline
Closed, Resolved (Public)

Description

@APerson wrote:

I'm getting a strange InvalidTitle error while iterating through each of the articles in the English Wikipedia's "Unprintworthy redirects" category using the articles() function.

In particular, if you run this code:

import pywikibot
site = pywikibot.Site("en", "wikipedia"); site.login()
cat = pywikibot.Category(site, "Category:Unprintworthy redirects")
for each_article in cat.articles(namespaces=(0)):
    print(each_article.title(withNamespace=True), each_article.pageid)

Then it'll run for a while, printing out a bunch of titles and page IDs, and then crash:

Traceback (most recent call last):
  File "/data/project/apersonbot/test-redir-bann.py", line 5, in <module>
    print(each_article.title(withNamespace=True), each_article.pageid)
  File "/shared/pywikipedia/core/pywikibot/tools/__init__.py", line 1446, in wrapper
    return obj(*__args, **__kw)
  File "/shared/pywikipedia/core/pywikibot/page.py", line 322, in title
    title = self._link.canonical_title()
  File "/shared/pywikipedia/core/pywikibot/page.py", line 5737, in canonical_title
    if self.namespace != Namespace.MAIN:
  File "/shared/pywikipedia/core/pywikibot/page.py", line 5698, in namespace
    self.parse()
  File "/shared/pywikipedia/core/pywikibot/page.py", line 5669, in parse
    raise pywikibot.InvalidTitle("The link does not contain a page "
pywikibot.exceptions.InvalidTitle: The link does not contain a page title
CRITICAL: Closing network session.

Any ideas? I don't think this is expected behavior, but I could be wrong.

Event Timeline

Changing the loop to the version below shows that the first problematic page ID is 28644448; that page's title is the single character \x85.

>>> for each_article in cat.articles(namespaces=(0)):
...     try:
...         print(each_article.title(withNamespace=True), each_article.pageid)
...     except pywikibot.exceptions.InvalidTitle:
...         print(each_article.pageid)
...         raise
...

str.strip() removes this character, resulting in an empty string, so the exception is raised. (page.py#L5666-L5670)

Since \x85 is a valid MediaWiki page title, pywikibot should also accept it as valid.

Scratch that

>>> u'\x85'.strip()
u''
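The failure mode described above can be reproduced without pywikibot or a network connection. The following is only a minimal sketch: `parse_title` and the local `InvalidTitle` class are hypothetical stand-ins, not the actual code in page.py, and they assume the check amounts to "strip the text, then reject an empty result".

```python
class InvalidTitle(Exception):
    """Hypothetical stand-in for pywikibot.exceptions.InvalidTitle."""


def parse_title(text):
    """Hypothetical, simplified title check: strip, then reject empty titles."""
    # str.strip() without an argument removes every character that
    # str.isspace() considers whitespace, which includes '\x85'.
    title = text.strip()
    if not title:
        raise InvalidTitle("The link does not contain a page title")
    return title


# An ordinary title survives the stripping step.
print(repr(parse_title("  Main Page ")))

# A title consisting only of '\x85' is stripped to '' and rejected.
try:
    parse_title("\x85")
except InvalidTitle as exc:
    print("rejected:", exc)
```

Under these assumptions, any title made up solely of characters that Python treats as whitespace would trigger the same exception.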

u'\x85' is a control character and doesn't look like a valid title. You can link to the redirect neither from the redirect target nor from a special page. The only reachable view is the edit page [1]. I wonder why MediaWiki accepts this as a page title; very strange!

[1] https://en.wikipedia.org/w/index.php?title=%C2%85&action=edit

Please try https://gerrit.wikimedia.org/r/#/c/pywikibot/core/+/395154/; I think I also fixed this error there.

I still get the exception. (Python 3.6.3)

>>> import pywikibot
>>> pywikibot.Page(pywikibot.Site('en', 'wikipedia'), '\x85')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "F:\Code\Wikimedia Gerrit\pywikibot\pywikibot\page.py", line 395, in __repr__
    title = repr(self.title())
  File "F:\Code\Wikimedia Gerrit\pywikibot\pywikibot\tools\__init__.py", line 1446, in wrapper
    return obj(*__args, **__kw)
  File "F:\Code\Wikimedia Gerrit\pywikibot\pywikibot\page.py", line 325, in title
    title = self._link.canonical_title()
  File "F:\Code\Wikimedia Gerrit\pywikibot\pywikibot\page.py", line 5731, in canonical_title
    if self.namespace != Namespace.MAIN:
  File "F:\Code\Wikimedia Gerrit\pywikibot\pywikibot\page.py", line 5692, in namespace
    self.parse()
  File "F:\Code\Wikimedia Gerrit\pywikibot\pywikibot\page.py", line 5634, in parse
    .format(self._text))
pywikibot.exceptions.InvalidTitle:  does not contain a page title.

You can link to the redirect neither from the redirect target nor from a special page. The only reachable view is the edit page [1]. I wonder why MediaWiki accepts this as a page title; very strange!

I linked to the title here. You can view the redirect page here.

Dalba triaged this task as Medium priority. Jun 22 2018, 2:57 AM
Vvjjkkii renamed this task from InvalidTitle raised for title that is a newline to 9oaaaaaaaa. Jul 1 2018, 1:02 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from 9oaaaaaaaa to InvalidTitle raised for title that is a newline. Jul 2 2018, 1:40 PM
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.

The problem is that chr(133) (U+0085, the NEL "next line" control character) counts as whitespace in Python according to the Unicode character database, so str.strip() removes it.

>>> '\x85'.isspace()
True
>>> '\x85'.strip()
''
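To see which other characters are affected in the same way, the short script below lists every code point that `str.isspace()` accepts but that is not one of the six ASCII characters in `string.whitespace`. It is only an illustration; the names `unicode_ws` and `extra_ws` are made up for this sketch, and the exact list depends on the Unicode version bundled with the Python interpreter.

```python
import string
import sys

# Every code point that str.isspace() treats as whitespace...
unicode_ws = [chr(cp) for cp in range(sys.maxunicode + 1) if chr(cp).isspace()]

# ...minus the ASCII characters listed in string.whitespace.
extra_ws = [ch for ch in unicode_ws if ch not in string.whitespace]

# Print the code points that a bare str.strip() removes in addition
# to the ASCII whitespace characters.
for ch in extra_ws:
    print("U+{:04X}".format(ord(ch)))
```

U+0085 is one of the printed code points, so any title consisting only of characters from this list is stripped to an empty string by a bare `str.strip()`.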

How does MW handle this?

Probably we can use string.whitespace:

>>> import string
>>> '\x85'.strip(string.whitespace)
'\x85'
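The idea of stripping only `string.whitespace` can be sketched as follows. `strip_title` is a hypothetical helper written for this illustration; it is not taken from pywikibot and is not necessarily what the eventual fix does.

```python
import string


def strip_title(text):
    """Hypothetical helper: trim only the ASCII whitespace in string.whitespace."""
    return text.strip(string.whitespace)


# ASCII whitespace around a normal title is still removed.
print(repr(strip_title("  Main Page\n")))

# '\x85' is not in string.whitespace, so it is kept as the title.
print(repr(strip_title("\x85")))

# A title made only of ASCII whitespace still becomes empty.
print(repr(strip_title(" \t\n")))
```

One consequence of this choice is that other non-ASCII whitespace, such as a leading no-break space ('\xa0'), is no longer trimmed either, so the approach assumes such characters are handled elsewhere or are acceptable in titles.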

Change 640929 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [bugfix] Do not strip all whitespaces from title

https://gerrit.wikimedia.org/r/640929

Change 640929 merged by jenkins-bot:
[pywikibot/core@master] [bugfix] Do not strip all whitespaces from title

https://gerrit.wikimedia.org/r/640929