
'-titleregex' does not handle namespaces other than 0
Closed, Resolved (Public)

Description

When -titleregex is used through GeneratorFactory together with a namespace other than 0, no pages are returned. Only pages in namespace 0 are ever returned.

In GeneratorFactory, -titleregex calls RegexFilterPageGenerator (which is really RegexFilter.titlefilter) with site.allpages() as its argument, and allpages() defaults to namespace 0. When getCombinedGenerator() is called, the generator (already restricted to namespace 0) is not considered a pywikibot.data.api.QueryGenerator, so it is filtered again with NamespaceFilterPageGenerator for the requested namespaces. The result is either empty or contains only namespace 0 pages.
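The filtering order described above can be reproduced without pywikibot. The following is a minimal sketch under stated assumptions: Page, WIKI, allpages, title_filter, namespace_filter, buggy_titleregex and fixed_titleregex are hypothetical stand-ins written for illustration, not pywikibot's actual classes or functions, and the "fixed" variant is only one possible remedy, not necessarily what the eventual patch does.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a page object; this is not pywikibot's Page class.
Page = namedtuple("Page", ["title", "namespace"])

# A tiny fake wiki with pages in namespace 0 and namespace 4.
WIKI = [
    Page("Foo", 0),
    Page("Foobar", 0),
    Page("Project:Foo", 4),
    Page("Project:Bar", 4),
]


def allpages(namespace=0):
    """Yield pages from a single namespace; defaults to 0, as described above."""
    return (p for p in WIKI if p.namespace == namespace)


def title_filter(pages, pattern):
    """Keep only pages whose title matches the regex."""
    regex = re.compile(pattern)
    return (p for p in pages if regex.search(p.title))


def namespace_filter(pages, namespaces):
    """Keep only pages that are in one of the requested namespaces."""
    return (p for p in pages if p.namespace in namespaces)


def buggy_titleregex(pattern, namespaces):
    # The source is built with the default namespace (0) before the
    # requested namespaces are known, then filtered by namespace afterwards.
    gen = title_filter(allpages(), pattern)
    return list(namespace_filter(gen, namespaces))


def fixed_titleregex(pattern, namespaces):
    # Assumed remedy: pass the requested namespaces to the source itself.
    pages = (p for ns in namespaces for p in allpages(namespace=ns))
    return list(title_filter(pages, pattern))
```

With this stand-in, buggy_titleregex("Foo", [4]) comes back empty because every page reaching the namespace filter is already in namespace 0, while fixed_titleregex("Foo", [4]) yields the namespace 4 match.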

Event Timeline

Daviskr created this task. Dec 28 2014, 2:36 AM
Daviskr raised the priority of this task from to Medium.
Daviskr updated the task description.
Daviskr added a subscriber: Daviskr.
Daviskr set Security to None.
Daviskr added a comment. Edited Dec 28 2014, 4:39 AM

Another side effect of the current setup is that it fetches all pages (that match the regex) before it applies limit*. This causes extreme slowdown as seen in this build.

*: As Mpaa said below, this only applies to namespaces other than zero.

Mpaa added a subscriber: Mpaa. Dec 28 2014, 5:46 PM

> Another side effect of the current setup is that it fetches all pages (that match the regex) before it applies limit. This causes extreme slowdown as seen in this build.

One clarification: this happens only when a namespace other than 0 is specified, as is done in the test. For the reason explained above, no pages are yielded at all, so the test ends only once allpages() has fetched every page in the specified ns.
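Why the limit does not help in that case can be shown with plain generators. This is a rough sketch, assuming the limit behaves like itertools.islice applied after the namespace filter; counting_source, namespace_filter and run are hypothetical names used only for illustration, not pywikibot code.

```python
from itertools import islice


def counting_source(total, counter):
    """Yield (title, namespace) pairs from namespace 0, counting each fetch."""
    for i in range(total):
        counter["fetched"] += 1
        yield ("Page %d" % i, 0)


def namespace_filter(pages, namespaces):
    """Keep only pages whose namespace is one of the requested ones."""
    return (p for p in pages if p[1] in namespaces)


def run(namespaces, limit, total=1000):
    """Return (number of pages yielded, number of pages fetched from the source)."""
    counter = {"fetched": 0}
    gen = namespace_filter(counting_source(total, counter), namespaces)
    # The limit sits downstream of the filter, so it only counts pages
    # that survive the filter.
    result = list(islice(gen, limit))
    return len(result), counter["fetched"]
```

In this sketch, run([0], 5) stops after fetching 5 pages, whereas run([4], 5) yields nothing and still walks through all 1000 pages of the source, because the limit is never reached.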

Mpaa added a comment. Dec 28 2014, 6:23 PM
This comment was removed by Mpaa.

Change 181993 had a related patch set uploaded (by Mpaa):
Pagegenerators.py: ns handling for titleregex option

https://gerrit.wikimedia.org/r/181993

Patch-For-Review

jayvdb added a subscriber: jayvdb. Dec 30 2014, 12:03 PM

After T57226 is merged, we can re-purpose this task to track the underlying problem; I suspect we want to wait until argparse has landed before fixing the real problem, as it should be much simpler then.

Mpaa closed this task as Resolved. Sep 27 2015, 2:01 PM
Mpaa added a comment. Oct 10 2015, 9:15 AM

See also T114015, where it is suggested to change how titleregex works.